Mystery code

Hello everyone, this is Jay Chou

Today I will show you something interesting!

Not only interesting, but also educational.

The topic starts with two lines (one, to be precise) of magic code:

#include <stdio.h>
int main[] = { 232,-1065134080,26643,12517440,4278206464,12802064,(int)printf };

This is a piece of C++ code. Can you guess what it prints after compiling and running?

You may ask: this thing doesn't even have a main function, so can it really compile?

It really can!

Let's compile it with Visual Studio on Windows and with g++ on Linux, then run both to see what happens:

Windows:

[Screenshot: program output on Windows]

Linux:

[Screenshot: program output on Linux]

Not only does it compile, it also runs normally: it prints MZ on Windows and ELF on Linux.

Readers familiar with the PE file format will know that MZ is the signature at the start of a PE file, and ELF is likewise the signature at the start of an executable file on Linux.

In other words: when the line of code above runs, it prints the string at the head of the executable file!

Disassemble the truth

Seeing this, you may have two questions:

  • Why can it be compiled without the main function?
  • Why is such a string of information output?

Regarding the first question, I believe you have already guessed it. There is no main function in the code, but there is a main array! Could that be related?

Yes, that's right. Whether it is a function or a variable, it ends up as a symbol, and the toolchain does not check whether the symbol named main comes from a function or an array when it resolves the runtime's reference to main. So we fooled it with a main array.

In other words: the main array is treated as the main function, and the data in the array is executed as the instructions of the function body.

To answer the second question, you have to look at the strange numbers in the main array. What kind of code are they?

Convert the values in the main array to hexadecimal, laying each int out as the 4 bytes it occupies:

[Image: the values of the main array as hexadecimal bytes, 4 bytes per int]
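As a rough sketch of this conversion, the C snippet below writes each constant of the array out as four bytes, lowest byte first, which is the order x86 uses in memory. The names main_words, words_to_bytes and bytes_to_hex are invented for this illustration, and the seventh element (the address of printf) is left out because it is only known at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The six constant elements of the main array as 32-bit values.
   The printf address (seventh element) is omitted in this sketch. */
static const uint32_t main_words[6] = {
    232u, (uint32_t)-1065134080, 26643u, 12517440u, 4278206464u, 12802064u
};

/* Store each word as 4 bytes in little-endian order (lowest byte first).
   'out' must have room for 4 * n bytes. Returns the number of bytes written. */
static size_t words_to_bytes(const uint32_t *words, size_t n, uint8_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        for (int shift = 0; shift < 32; shift += 8) {
            out[k++] = (uint8_t)(words[i] >> shift);
        }
    }
    return k;
}

/* Format bytes as space-separated upper-case hex, e.g. "E8 00 00".
   'out' must have room for 3 * n + 1 characters. */
static void bytes_to_hex(const uint8_t *bytes, size_t n, char *out)
{
    assert(out != NULL);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        sprintf(out + 3 * i, "%02X ", bytes[i]);
    }
    if (n > 0) {
        out[3 * n - 1] = '\0';  /* drop the trailing space */
    }
}
```

Calling words_to_bytes(main_words, 6, buf) and passing the 24 resulting bytes to bytes_to_hex yields a string that starts with E8 00 00 00 00 58 83 C0 13, which is the byte stream a disassembler would be given.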

Going a step further, run a disassembler over these bytes to see which instructions they encode:

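Offset  Bytes           Instruction
0       E8 00 00 00 00  call $+5
5       58              pop eax
6       83 C0 13        add eax, 13h
9       68 00 00 40 00  push 400000h
14      BF 00 00 40 00  mov edi, 400000h
19      FF 10           call dword ptr [eax]
21      58              pop eax
22      C3              retn
23      00              (padding)
24      (4 bytes)       address of printf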

Next, let's analyze these instructions one by one.

call $+5

This is a very important instruction. Remember: when a call instruction executes, the address of the next instruction is pushed onto the top of the thread's stack, to be used when the function returns. So which instruction is next? It is the pop eax below, so when this call executes, the address of that pop eax is pushed onto the top of the stack.

Furthermore, the call target is $+5, that is, the address of this call instruction plus 5 bytes. The call instruction itself is 5 bytes long, so the target is also the address of the following pop eax: the "function" being called starts right at that pop eax.

So why go to the trouble of executing call $+5? The goal is to find the memory address of the code that is currently running. There is no way to read the instruction pointer register EIP directly, so a call is used to push the address of this code onto the stack, from where it can be popped. That tells the code where in memory it has been placed and is executing.

This is a common trick hackers use when writing shellcode.

pop eax

Note that when execution reaches this point, the top of the thread's stack holds the address of this very instruction, thanks to the call instruction above.

pop eax then takes the address off the top of the stack and puts it in the eax register. eax now holds the memory address of the current instruction.

add eax, 13h

Why go to such lengths to get this address? Look at this instruction: it adds 13h, which is 19 in decimal, to eax. Looking back at the hexadecimal byte table of the main array, pop eax sits at byte offset 5, and 5 + 19 = 24 = 6 × 4, which is exactly the position of the last element of the main array, where the address of the printf function is stored.

So, up to this point, the purpose of the first three instructions is to locate the address of the printf function.
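The offset arithmetic can be checked with a few lines of C. This is only a sketch: the constant and function names are made up for illustration, and it assumes, as the article does, that an int occupies 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

enum {
    CALL_LEN  = 5,     /* length of "call $+5": opcode E8 plus a 4-byte displacement */
    ADD_IMM   = 0x13,  /* the constant in "add eax, 13h" */
    ELEM_SIZE = 4      /* assumed size of one int element in bytes */
};

/* Byte offset, relative to the start of the array, that eax refers to
   after "pop eax" (offset CALL_LEN) followed by "add eax, 13h". */
static size_t eax_offset_after_add(void)
{
    return CALL_LEN + ADD_IMM;
}

/* Index of the array element that starts at the given byte offset. */
static size_t element_index(size_t byte_offset)
{
    assert(byte_offset % ELEM_SIZE == 0);  /* must land on an element boundary */
    return byte_offset / ELEM_SIZE;
}
```

Here eax_offset_after_add() evaluates to 24 and element_index(24) to 6, the index of the seventh and last element of the array.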

push 400000h

With the address of printf in hand, the call can begin. This instruction passes printf its argument: 0x00400000, the address of the string to print.

mov edi, 400000h

This also passes the argument to printf. The previous instruction passes it on the stack and this one passes it in a register, so that the code is compatible with the function calling conventions of both the Windows platform and the Linux x64 platform.

The string address passed is 0x00400000 because it happens to be the default base address for loading executable files on the two platforms.

Windows:

[Screenshot: memory at 0x00400000 on Windows]

Linux:

(gdb) x /16c 0x00400000
0x400000: 127 '\177' 69 'E' 76 'L' 70 'F' 2 '\002' 1 '\001' 1 '\001' 0 '\000'
0x400008: 0 '\000' 0 '\000' 0 '\000' 0 '\000' 0 '\000' 0 '\000' 0 '\000' 0 '\000'
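The gdb dump above also explains why so little is printed: the output stops at the first zero byte. A minimal C sketch of that, using the eight bytes from the dump (the helper names are invented for this example; the header bytes contain no '%', so printf emits them unchanged even though they are used as its format string):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The first eight bytes at 0x400000, copied from the gdb dump. */
static const unsigned char elf_head[8] = { 127, 'E', 'L', 'F', 2, 1, 1, 0 };

/* Number of bytes a C string function would emit starting at p:
   everything up to, but not including, the first NUL within max bytes. */
static size_t printed_length(const unsigned char *p, size_t max)
{
    const unsigned char *nul = memchr(p, 0, max);
    assert(nul != NULL);  /* this sketch expects a terminator inside the buffer */
    return (size_t)(nul - p);
}

/* Does the buffer start with the ELF signature 0x7F 'E' 'L' 'F'? */
static int is_elf(const unsigned char *p, size_t n)
{
    return n >= 4 && memcmp(p, "\177ELF", 4) == 0;
}

/* Does the buffer start with the "MZ" signature of a PE file? */
static int is_mz(const unsigned char *p, size_t n)
{
    return n >= 2 && p[0] == 'M' && p[1] == 'Z';
}
```

For elf_head, printed_length returns 7: the 0x7F byte, the letters ELF, and three more non-printable bytes before the terminating zero, so only "ELF" is visible on the terminal.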

call dword ptr [eax]

Remember that eax holds the address of the last element of the main array, and that element holds the address of the printf function.

So this instruction calls printf indirectly through that pointer, which produces the printed output.

pop eax

After the function call returns, the stack needs to be rebalanced: the argument that was pushed onto the stack before the call has to be popped off here.

retn

Note that this retn instruction corresponds to the call instruction. Call is used to call a function and push the return address onto the stack, while the retn instruction pops the data at the top of the stack as the return address and jumps back to execute it.

Remember, this code is now running in the context of having been called by the first call instruction. Normally, shouldn't retn return to the instruction right after that call? Wouldn't going back to pop eax again make a mess? But the return address that the call pushed has already been popped off the stack (by the pop eax in the second line), so what is on top of the stack when retn executes?

It is the return address of whoever called the main function, which was sitting on top of the stack when the thread first entered main. So this retn does not return to just after the first call; it returns one level up, to the place where main was called.

As for who calls the main function, that is not the focus of this article. It is the startup code of the respective C/C++ runtime library (CRT) on Linux and Windows.
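To make the stack bookkeeping concrete, here is a toy model in C that replays only the stack effects of the eight instructions. It is a sketch under simplifying assumptions: one slot per stack value, printf treated as a black box that simply returns, and both addresses passed in are hypothetical values chosen by the caller, not real ones.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SLOTS 8

typedef struct {
    uint32_t slot[STACK_SLOTS];
    int depth;  /* number of values currently on the stack */
} Stack;

static void push(Stack *s, uint32_t v)
{
    assert(s->depth < STACK_SLOTS);
    s->slot[s->depth++] = v;
}

static uint32_t pop(Stack *s)
{
    assert(s->depth > 0);
    return s->slot[--s->depth];
}

/* base:       hypothetical address of the main array
   crt_return: hypothetical return address into the runtime's startup code
   Returns the address that the final retn jumps to. */
static uint32_t run_main_array(uint32_t base, uint32_t crt_return)
{
    Stack s = { {0}, 0 };
    uint32_t eax;

    push(&s, crt_return);   /* pushed when the runtime called main */

    push(&s, base + 5);     /* call $+5: pushes the address of pop eax */
    eax = pop(&s);          /* pop eax: eax = address of this instruction */
    eax += 0x13;            /* add eax, 13h: eax points at the last element */
    assert(eax == base + 24);
    push(&s, 0x400000);     /* push 400000h: the argument for printf */
                            /* mov edi, 400000h does not touch the stack */
    push(&s, base + 21);    /* call dword ptr [eax]: pushes its return address */
    (void)pop(&s);          /* printf returns, popping that return address */
    (void)pop(&s);          /* pop eax: discards the pushed argument */
    return pop(&s);         /* retn: pops whatever is on top and jumps there */
}
```

With any pair of inputs, for example run_main_array(0x601040, 0x12345678), the function returns the second argument: the only value left on the stack for retn is the return address into main's caller.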

At this point, you should be able to understand how this program runs and why there is such output information.

A few notes

  1. First, to compile it at all on Linux you need to use g++ instead of gcc: C requires the initializer of the global variable main to be a constant, and the address of the printf function in the last element is determined dynamically. You also have to add the -fpermissive compilation option.
  2. Randomized module loading needs to be turned off. To resist attacks, modern operating systems randomize the load base address of executable files so that it cannot be guessed. This code only runs correctly if the executable file is loaded at base address 0x00400000, so randomization has to be turned off through the compiler.
  3. Finally, as the analysis above shows, the program actually executes the data in the main array as code. Under the security protections of modern operating systems this is refused by default, because the memory pages holding data have only read and write permissions, not execute permission. This security mechanism is called DEP/NX. For the program to run, it has to be turned off. For g++, just add the -z execstack compilation option.

Summary

In fact, the idea behind this code is not originally mine. There is an international contest called the International Obfuscated C Code Contest (IOCCC), whose hallmark is writing the flashiest code to achieve the greatest effect, and one of its award-winning entries used this very trick.

Later, an expert in China also created his own version; see the link:

https://blog.csdn.net/masefee/article/details/6606813

However, that version only works on the Windows platform. Building on it, I adapted it into the current version, which supports both Windows and Linux.

The code itself has no practical value. Its value lies in using it to study the underlying principles behind code and programs: how the CPU calls functions, passes parameters, jumps, and manipulates the stack. That is what this article is really about.

I'll leave you with a question: can the following line of code run normally, and what does it do when it runs?

int main[] = {0xC3};

代码熬夜敲
李志宽: top-100 creator, penetration testing expert, a reserved guy with his own rock band