Newer members of the CTF teams I play with often ask me how to get started with reverse engineering. Answering this semester after semester has gotten repetitive, so I compiled the following post as a general guideline for getting started with reversing. While writing, however, I found that a lot of background knowledge is needed, so I am going to split this into a series of posts.

Reverse engineering is a difficult skill, and I cannot give a completely thorough overview in a blog post. I hope that my content will help at least a few people start off on the right foot with reversing.

Architecture 101

Programs are written in high-level, human-readable programming languages, which first have to be compiled into a language that the CPU can interpret, known as machine code. As programs are shipped in this machine code, we need to learn how this language works in order to reverse engineer them. To do that, we read the machine code in a somewhat more human-friendly family of languages known as assembly.

I’ll be using the Intel x86 family of architectures throughout this tutorial as it’s the most popular CPU architecture for regular computers.

Instructions

In general, when you see an instruction, it consists of a short mnemonic describing what the instruction does, followed by its operands. Before I introduce instructions, it is helpful to know that there are two popular syntax formats for x86: Intel and AT&T. I will be using Intel for the rest of the tutorial as I find it easier to work with, but AT&T is presented briefly for completeness’ sake.

=======================================================
| Syntax | Format                                     |
|--------|--------------------------------------------|
| Intel  | <Instruction> <destination>, <source>      |
| AT&T   | <Instruction> <source>,      <destination> |
=======================================================

For example, an instruction that sets a register to the constant value 0x1337 would look like the following:

mov  eax, 0x1337   ; intel
movl $0x1337, %eax ; AT&T

You can easily tell when you are looking at AT&T syntax, since in AT&T registers are prefixed with a percent sign (%) and constants with a dollar sign ($).

Intel uses a CISC architecture, so there are a LOT of instructions, which can be overwhelming to wrap your head around at first. Pretty much all of them fit into one of the following classes, listed here with examples.

  • Data manipulation (mov, lea)
  • Arithmetic and logic (add, sub, xor)
  • Control flow and stack (jmp, je, push, pop, call, leave, ret)

You don’t have to commit all of the instructions to memory. I would make sure you know at least the ones listed above, and look up the weird instructions as you go. Over time you will gain familiarity with the architecture. I recommend this page as it gives detailed explanations per instruction. If you want to go deeper, check out the Intel instruction manual. I have rarely found situations where I wasn’t able to figure out what an instruction does after a quick web search though.

Registers

Registers are small, fast storage locations on the CPU. In 32-bit mode there are four general-purpose registers (EAX, EBX, ECX, EDX) and four “pointer” registers (ESP, EBP, ESI, EDI). With a few special exceptions, you can use all of these registers interchangeably, which is why they are called general-purpose registers.

The 32-bit x86 architecture was designed as an extension of the 16-bit 8086 architecture, and that includes the general-purpose registers. For backwards compatibility, Intel made each register an extension of its smaller counterpart. That is, you can access either the 16-bit part of a register or all 32 bits together. The 32-bit versions of the registers are identified by the prefix “E” for “extended”. That is, EAX is the 32-bit extended version of the 16-bit register AX.

Similarly, when x86 was extended to 64 bits with AMD’s x86_64, also known as x64, the “E” prefix was replaced with an “R”. That is, RAX is the 64-bit version of EAX. AMD’s x64 also introduced a few more general-purpose registers, R8 through R15. In case you are wondering, R0 through R7 are aliases for the general-purpose registers previously defined.

Lastly, each byte of a 16-bit register can be accessed individually: AH is the high byte and AL is the low byte of AX. The following table summarizes how the registers relate to each other, using the AX register as an example.

|===============================|
| Byte  | 7| 6| 5| 4| 3| 2| 1| 0|
|-------+-----------------------|
| 64-bit|         RAX           |
|-------+-----------------------|
| 32-bit|  |  |  |  |    EAX    |
|-------+--+--+--+--+-----------|
| 16-bit|  |  |  |  |  |  | AX  |
|-------+--+--+--+--+--+--+-----|
| 8-bit |  |  |  |  |  |  |AH|AL|
|===============================|
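If it helps to see the table as code, here is a minimal C sketch that extracts each view from a 64-bit value with masks and shifts. The helper names (view_eax, view_ax, view_ah, view_al) are my own and purely illustrative. It only models reading the sub-registers, not what the CPU does when you write to them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: each one returns the named "view" of a 64-bit RAX value. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); } /* bytes 3..0 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)(rax & 0xFFFFu); }     /* bytes 1..0 */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)((rax >> 8) & 0xFFu); } /* byte 1 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)(rax & 0xFFu); }        /* byte 0 */

/* Usage example: all four views of one 64-bit value. */
void register_views_demo(void) {
    uint64_t rax = UINT64_C(0x1122334455667788);
    assert(view_eax(rax) == 0x55667788u);
    assert(view_ax(rax)  == 0x7788u);
    assert(view_ah(rax)  == 0x77u);
    assert(view_al(rax)  == 0x88u);
}
```

Each narrower register is just the low end of the wider one, which is why the masks keep shrinking while the shift only appears for AH.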

There are also segment registers which are used by the operating system to model memory segments. That is not going to be touched upon in this post but I may describe them in later posts.

The EFLAGS register is used to hold the results of operations, mostly for the use of upcoming conditional operations. Each bit of the EFLAGS register holds a relevant piece of data. As there are way too many to cover, I would instead refer back to the previously linked page. In practice, I would look up the flags needed for an instruction’s conditions as I work.

Lastly, the most important register is the “instruction pointer” RIP (EIP in 32-bit code). The instruction pointer always contains the address of the next instruction to be executed. If you can control this register, you control the execution of the program.
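To make the idea of flags a bit more concrete, here is a rough C sketch of how four commonly used flags (zero, sign, carry and overflow) could be derived for a 32-bit add. The struct and function names are my own, and this is a model of the behavior, not how the hardware is implemented. The real add instruction also updates other flags that I leave out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the sum plus four of the flags that add updates. */
typedef struct {
    uint32_t result;
    bool zf;  /* zero flag: the result is 0 */
    bool sf;  /* sign flag: the top bit of the result is set */
    bool cf;  /* carry flag: the unsigned addition wrapped around */
    bool of;  /* overflow flag: the signed addition overflowed */
} AddOutcome;

AddOutcome add32_with_flags(uint32_t a, uint32_t b) {
    AddOutcome out;
    out.result = a + b;                   /* unsigned math wraps modulo 2^32 */
    out.zf = (out.result == 0);
    out.sf = (out.result >> 31) != 0;
    out.cf = (out.result < a);            /* a wrapped sum is smaller than its inputs */
    /* signed overflow: both inputs have the same sign and the result's sign differs */
    out.of = (((a ^ out.result) & (b ^ out.result)) >> 31) != 0;
    return out;
}

/* Usage example: two additions that set different flags. */
void flags_demo(void) {
    AddOutcome wrap = add32_with_flags(0xFFFFFFFFu, 1u);
    assert(wrap.result == 0 && wrap.zf && wrap.cf && !wrap.sf && !wrap.of);

    AddOutcome overflow = add32_with_flags(0x7FFFFFFFu, 1u);
    assert(overflow.result == 0x80000000u && overflow.sf && overflow.of);
    assert(!overflow.zf && !overflow.cf);
}
```

A conditional jump such as je then only has to look at one of these bits (the zero flag, in that case) to decide whether to jump.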

Memory

As we have a limited number of registers, we usually need to access memory. This section will describe how memory is accessed in the x86 family of processors.

Intel is a little-endian architecture. As such, multi-byte values are stored in memory with the least significant byte first, at the lowest address. This matters to us when we look at memory contents. For example, if we write the 32-bit value 0x41424344 (the ASCII codes for “ABCD”), a byte-by-byte dump of memory will show 44 43 42 41, that is, “DCBA”.

In Intel syntax, memory addresses are represented by bracket notation, similar to how C represents the index of an array.

mov eax, [0x1337] ; move the value at address 0x1337 into register eax
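You can check the byte order yourself with a few lines of C. The sketch below copies a 32-bit value into a byte array and looks at the individual bytes. The function names are my own, and 0x41424344 is simply the ASCII codes for 'A', 'B', 'C', 'D' packed into one integer. The check is written so that it also runs on a big-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the byte stored at position `index` (0 = lowest address) of a 32-bit value. */
uint8_t byte_in_memory(uint32_t value, size_t index) {
    unsigned char bytes[sizeof value];
    assert(index < sizeof value);
    memcpy(bytes, &value, sizeof value);  /* copy the in-memory representation */
    return bytes[index];
}

/* On a little-endian machine the least significant byte sits at the lowest address. */
int is_little_endian(void) {
    return byte_in_memory(1u, 0) == 1;
}

/* Usage example: an integer is reordered in memory, a string is not. */
void endianness_demo(void) {
    const char text[] = "ABCD";
    assert(text[0] == 'A');               /* strings are stored byte by byte, in order */
    if (is_little_endian()) {
        assert(byte_in_memory(0x41424344u, 0) == 0x44);  /* 'D' comes first */
        assert(byte_in_memory(0x41424344u, 3) == 0x41);  /* 'A' comes last */
    }
}
```

Note that only multi-byte values such as integers and pointers are affected. A string is a sequence of single bytes, so it shows up in memory in its normal order.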

When we write C programs we use many variables to store our results. These variables are stored at locations throughout memory. However, memory accesses are relatively slow, so the compiler generates code that moves the values from the memory locations that store our variables into registers when we perform operations on them.

To demonstrate that, consider the following small C test program.

int main() {
    int a, b, c;
    a = 1;
    b = 2;
    c = a + b;
    return 0;
}

Compile that program and open it in gdb. Running disassemble main prints the assembly of the main function (gdb defaults to AT&T syntax; run set disassembly-flavor intel first to get the Intel syntax shown here). To read the following snippet, know that the first column is the address of the instruction, the second column is the offset from the start of the function for ease of readability, and the last column is the assembly instruction.

   0x000000000000063a <+0>:	push   rbp
   0x000000000000063b <+1>:	mov    rbp,rsp
   0x000000000000063e <+4>:	mov    DWORD PTR [rbp-0xc],0x1 ; a = 1
   0x0000000000000645 <+11>:	mov    DWORD PTR [rbp-0x8],0x2 ; b = 2
   0x000000000000064c <+18>:	mov    edx,DWORD PTR [rbp-0xc] ; Put a in a register
   0x000000000000064f <+21>:	mov    eax,DWORD PTR [rbp-0x8] ; Put b in a register
   0x0000000000000652 <+24>:	add    eax,edx                 ; Add both and save result to eax
   0x0000000000000654 <+26>:	mov    DWORD PTR [rbp-0x4],eax ; Save result to c
   0x0000000000000657 <+29>:	mov    eax,0x0                 ; Set return result
   0x000000000000065c <+34>:	pop    rbp
   0x000000000000065d <+35>:	ret

You can see that the third and fourth instructions initialize our variables by placing values at memory addresses using the mov instruction. After setting those memory addresses, the program loads the values into the registers edx and eax. The add instruction then adds the value in edx to eax, saving the result in eax.

Lastly, we need to store the result of the addition back into memory, so the program saves the value in the eax register to a memory location.

You might wonder why the memory locations are offsets from the rbp register. To understand that, we first need to understand how a program is organized in memory.


Code Layout

When a program is loaded, it gets its own virtual address space with the following layout.

=========== <--- 0xFFFF
|  STACK  |
|.........|
|    |    |
|    |    |
|    v    |
|         |
|    ^    |
|    |    |
|    |    |
|.........|
|  HEAP   |
|---------|
|  BSS    |
|---------|
|  DATA   |
|---------|
|  TEXT   |
=========== <--- 0x0000

Each of these segments has its purpose. The following are short descriptions of the segments.

TEXT

The text segment is a read-only, executable section which holds the program’s actual instructions. This segment is of fixed size. When you disassemble a program, your disassembler reads and parses this segment to provide the useful information.

DATA and BSS

The .data and .bss sections hold global and static variables. The .data segment holds the variables that have a predefined initial value. Meanwhile, the .bss segment holds the uninitialized ones.

As a piece of trivia, these segments were separated as a small optimization for storing programs. Since the variables in the .data segment have initial values, all of those values have to be stored in the program file. Meanwhile, the variables in the .bss segment have no initial value, so the stored program file only needs to hold a single number, the total size of the segment.
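As a small illustration, here is a C sketch with one global of each kind. The variable names are made up, and the exact placement is up to the compiler and linker, so treat the section names in the comments as what typically happens rather than a guarantee. What C does guarantee is that a global without an initializer starts out as zero.

```c
#include <assert.h>
#include <stddef.h>

int  initialized_counter = 1337;  /* has an initial value: typically placed in .data */
char scratch_buffer[4096];        /* no initializer: typically placed in .bss */

/* Counts the bytes in buf that are not zero. */
size_t count_nonzero_bytes(const char *buf, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            count++;
        }
    }
    return count;
}

/* Usage example: the initialized global keeps its value, the other starts zeroed. */
void globals_demo(void) {
    assert(initialized_counter == 1337);
    assert(count_nonzero_bytes(scratch_buffer, sizeof scratch_buffer) == 0);
}
```

If you are on a Linux toolchain, running the binutils size tool on the compiled binary should show the 4096 bytes under bss, even though the file on disk does not grow by that amount.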

Heap and Stack

The heap holds the memory that is dynamically allocated at run time through library functions such as malloc and free. This segment grows towards higher addresses.

The stack is interesting. Unlike the other segments, it grows downwards from higher memory addresses towards the heap. The top of the stack is pointed at by the rsp register. A program’s stack is split into subsections called stack frames, where the rbp register points to the bottom of the current frame. This is the same rbp that we saw in the previous disassembly.


Function prologue and epilogue

Software is organized into functions, so we need to organize memory in a way that allows each function to have its own copy of its local variables. Programs do this by running a prologue when a function is entered, which creates a new stack frame. When the function exits, the epilogue is run to “release” the stack frame and return the stack to the same state as before the call. Also, since a function’s instructions are located in a different part of the .text segment, the program needs to remember where to resume in the caller once the function finishes.

For the prologue:

  1. Save the old frame pointer (the base of the caller’s frame).
  2. Set the new bottom of frame pointer to point to the location of the old top of stack.
  3. (optionally) grow the stack to save space for local variables.

For an example, we have the following prologue creating a stack frame that is 16 bytes large.

push   rbp
mov    rbp,rsp
sub    rsp, 0x10

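The epilogue does the reverse. Compilers often emit the first two instructions as the single instruction leave, which does the same thing.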

mov rsp, rbp
pop rbp
ret

The ret instruction returns control flow to the calling function. When ret is executed, the value on the top of the stack is popped into rip. Thus, ret is effectively shorthand for pop rip.

On the other hand, you enter a function using the call instruction. call automatically saves the address of the next instruction on the top of the stack: by the time it executes, rip already points at the instruction after the call, so it pushes the value of rip onto the stack and then sets rip to its argument.
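To tie the prologue, epilogue, call and ret together, here is a toy model in C. Everything in it (the ToyCpu struct, the toy_ function names, the addresses 0x1005 and 0x2000) is invented for illustration. The stack is an array of 64-bit slots and rsp/rbp are array indices rather than byte addresses, so this is a sketch of the bookkeeping and not an emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_SLOTS 16

/* Toy model: registers are plain fields, the stack is a small array. */
typedef struct {
    uint64_t rip;                 /* address of the next instruction */
    size_t   rsp;                 /* index of the top of the stack */
    size_t   rbp;                 /* index of the base of the current frame */
    uint64_t stack[STACK_SLOTS];
} ToyCpu;

void toy_init(ToyCpu *cpu, uint64_t entry) {
    cpu->rip = entry;
    cpu->rsp = STACK_SLOTS;       /* empty stack: rsp starts just past the last slot */
    cpu->rbp = STACK_SLOTS;
}

void toy_push(ToyCpu *cpu, uint64_t value) {
    assert(cpu->rsp > 0);
    cpu->stack[--cpu->rsp] = value;   /* the stack grows toward lower indices */
}

uint64_t toy_pop(ToyCpu *cpu) {
    assert(cpu->rsp < STACK_SLOTS);
    return cpu->stack[cpu->rsp++];
}

/* call target: rip already points at the instruction after the call. */
void toy_call(ToyCpu *cpu, uint64_t target) {
    toy_push(cpu, cpu->rip);
    cpu->rip = target;
}

/* ret: effectively pop rip. */
void toy_ret(ToyCpu *cpu) {
    cpu->rip = toy_pop(cpu);
}

/* push rbp; mov rbp, rsp; sub rsp, <local_slots> */
void toy_prologue(ToyCpu *cpu, size_t local_slots) {
    toy_push(cpu, (uint64_t)cpu->rbp);
    cpu->rbp = cpu->rsp;
    assert(cpu->rsp >= local_slots);
    cpu->rsp -= local_slots;
}

/* mov rsp, rbp; pop rbp */
void toy_epilogue(ToyCpu *cpu) {
    cpu->rsp = cpu->rbp;
    cpu->rbp = (size_t)toy_pop(cpu);
}

/* Usage example: one full call and return leaves the stack as it was. */
void call_ret_demo(void) {
    ToyCpu cpu;
    toy_init(&cpu, 0x1005);       /* pretend the call instruction ends at 0x1005 */
    toy_call(&cpu, 0x2000);
    assert(cpu.rip == 0x2000 && cpu.stack[cpu.rsp] == 0x1005);
    toy_prologue(&cpu, 2);        /* reserve two slots for locals */
    toy_epilogue(&cpu);
    toy_ret(&cpu);
    assert(cpu.rip == 0x1005 && cpu.rsp == STACK_SLOTS && cpu.rbp == STACK_SLOTS);
}
```

The thing to notice is that the return address is just a value sitting on the stack between the call and the ret, which is exactly why overwriting it gives you control of rip.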


Variables inside of a call stack

To demonstrate, here is a diagram of a typical stack frame. Note that this diagram assumes cdecl as the calling convention on a 32-bit machine, so the frame pointer is ebp, the 32-bit counterpart of rbp, and every slot is 4 bytes wide. On newer 64-bit machines, the first few arguments are passed in registers instead.

=====================================
| Location    | Contents            |
|-------------+---------------------|
| EBP + 0x10  | Argument 3          |
| EBP + 0xc   | Argument 2          |
| EBP + 0x8   | Argument 1          |
| EBP + 0x4   | Saved return address|
| EBP         | Saved frame pointer |
| EBP - 0x4   | Local variable 1    |
| EBP - 0x8   | Local variable 2    |
| EBP - 0xc   | Local variable 3    |
=====================================

As you can see in the above diagram, you can reference both local variables and arguments by their offsets from the frame pointer. Use this information while reversing to keep track of what data is being passed into functions.
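The offsets in the diagram follow a simple pattern, which the following C sketch spells out. It assumes the same simplified 32-bit cdecl frame as the diagram, where every argument and local variable is exactly 4 bytes. The function names are my own. Real compilers are free to order and pad local variables differently, so use this only as a first guess when labeling offsets.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { SLOT_SIZE = 4 };  /* assumption: 32-bit cdecl, every slot is 4 bytes */

/* Offset of the n-th argument (1-based); skips the saved frame pointer and return address. */
int argument_offset(int n) {
    assert(n >= 1);
    return 2 * SLOT_SIZE + (n - 1) * SLOT_SIZE;
}

/* Offset of the n-th local variable (1-based), below the frame pointer. */
int local_offset(int n) {
    assert(n >= 1);
    return -n * SLOT_SIZE;
}

/* Writes a label for a frame-pointer offset, following the diagram. */
void describe_offset(int offset, char *out, size_t out_len) {
    if (offset % SLOT_SIZE != 0) {
        snprintf(out, out_len, "Unknown");
    } else if (offset == 0) {
        snprintf(out, out_len, "Saved frame pointer");
    } else if (offset == SLOT_SIZE) {
        snprintf(out, out_len, "Saved return address");
    } else if (offset > 0) {
        snprintf(out, out_len, "Argument %d", (offset - 2 * SLOT_SIZE) / SLOT_SIZE + 1);
    } else {
        snprintf(out, out_len, "Local variable %d", -offset / SLOT_SIZE);
    }
}

/* Usage example: label two offsets taken from the diagram. */
void frame_offsets_demo(void) {
    char label[32];
    describe_offset(0xc, label, sizeof label);
    assert(strcmp(label, "Argument 2") == 0);
    describe_offset(-0x8, label, sizeof label);
    assert(strcmp(label, "Local variable 2") == 0);
}
```

When I see an access at a positive offset from the frame pointer I think “argument”, and at a negative offset I think “local variable”, and then refine from there.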

By now you should have the basic knowledge needed to start reversing. Check my next post to get started!