Practical Binary Analysis

Chapter 1

Compiling a source file takes several stages before it becomes an executable. The stages are as follows:

  • Preprocessing: All #include and #define directives are expanded and the macro definitions are substituted at every occurrence, so that we end up with a single, self-contained C source file.

  • Compilation: In this stage the preprocessed source files are translated into assembly language files.

  • Assembly: The assembly files are now converted into machine code, stored in object files. Everything before this stage was human readable but not machine readable. The object files produced in this stage are relocatable, i.e. they are not tied to fixed addresses, and the majority of their references are not yet resolved.

  • Linking: This is where all the relocatable object files are linked together and their references and cross-references are resolved. However, even after this stage some references may remain unresolved. This kind of linking results in a dynamically linked executable, where the remaining references (to shared libraries) are resolved at run time.

GCC commands for the different compilation stages

# Preprocessing Stage
# Produces a preprocessed file
$ gcc -E -P source.c -o output.pre

# Compilation Stage
# Produces a compiled file with assembly instructions
$ gcc -S source.c -o output.asm

# Assembly Stage
# Produces a relocatable object file
$ gcc -c source.c -o output.o

# Linking Stage (also runs all of the stages above)
# Produces the final executable
$ gcc source.c -o output.out
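
Each command above starts again from source.c. As a rough sketch, the stages can also be chained so that every stage consumes the output of the previous one. The helper names (stage_commands, run_stages) and the intermediate file names (output.i, output.s, and so on) are assumptions made for illustration; only the gcc flags come from the commands above.

```python
import shutil
import subprocess


def stage_commands(source, stem="output"):
    """Return one gcc command per stage; each consumes the previous output.

    The file names derived from `stem` are illustrative assumptions.
    """
    pre, asm, obj, exe = f"{stem}.i", f"{stem}.s", f"{stem}.o", f"{stem}.out"
    return [
        ["gcc", "-E", "-P", source, "-o", pre],  # preprocessing only
        ["gcc", "-S", pre, "-o", asm],           # compile to assembly
        ["gcc", "-c", asm, "-o", obj],           # assemble to a relocatable object
        ["gcc", obj, "-o", exe],                 # link into an executable
    ]


def run_stages(source, stem="output"):
    """Run the four stages in order; raises if gcc is missing or a stage fails."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc was not found on PATH")
    for cmd in stage_commands(source, stem):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_stages("source.c") should leave all four intermediate files on disk, which makes it easy to inspect what each stage produced.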

To view all the symbols of an ELF file, we can use readelf.

$ readelf --syms $elf_file
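
To give a feel for what readelf is decoding here, the following is a minimal sketch of parsing symbol table entries. It assumes a 64-bit little-endian ELF file and assumes that the raw bytes of the .symtab and .strtab sections have already been extracted; the function name decode_symbols is hypothetical.

```python
import struct

# Elf64_Sym: st_name, st_info, st_other, st_shndx, st_value, st_size (24 bytes)
SYM = struct.Struct("<IBBHQQ")
TYPES = {0: "NOTYPE", 1: "OBJECT", 2: "FUNC", 3: "SECTION", 4: "FILE"}
BINDS = {0: "LOCAL", 1: "GLOBAL", 2: "WEAK"}


def decode_symbols(symtab, strtab):
    """Decode raw .symtab bytes, looking names up in raw .strtab bytes."""
    symbols = []
    for off in range(0, len(symtab) - SYM.size + 1, SYM.size):
        name_off, info, _other, shndx, value, size = SYM.unpack_from(symtab, off)
        end = strtab.index(b"\0", name_off)  # names are NUL-terminated
        symbols.append({
            "name": strtab[name_off:end].decode("ascii", "replace"),
            "type": TYPES.get(info & 0xF, str(info & 0xF)),  # low 4 bits
            "bind": BINDS.get(info >> 4, str(info >> 4)),    # high 4 bits
            "shndx": shndx,
            "value": value,
            "size": size,
        })
    return symbols
```

The dictionary keys roughly correspond to the Value, Size, Type, Bind, Ndx, and Name columns that readelf prints.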

In Linux binaries, debugging information is usually generated in the DWARF format and embedded in the binary itself. On Windows, on the other hand, the Program Database (PDB) format is used, and this information is provided in a separate file rather than inside the PE file.

To strip a binary of its symbols and debugging information, we can use the strip utility.

$ strip --strip-all $elf_file
$ file $elf_file
elf_file: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically
linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32,
BuildID[sha1]=HAS, stripped

# Using objdump to dump out data from one section.
$ objdump -sj $section_name $elf_file
# EXAMPLE: objdump -sj .rodata a.out

# To dump the complete disassembly of the file
$ objdump -M intel -d $elf_file

The readelf utility can also show which relocations the linker will have to perform for a relocatable object file. This information can be viewed using the following command.

$ readelf --relocs compilation_example.o

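The information is in tabular form. The value in the offset column gives the location in the object file where the resolved value is to be placed. Note that the offset points at the exact bytes to be patched, which is typically the operand of an instruction rather than its opcode, so it usually lies slightly past the start of the instruction.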
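
As an illustration of what "placing the resolved value at the offset" means, here is a minimal sketch that patches one PC-relative relocation by hand. It assumes the R_X86_64_PC32 relocation type, whose value is computed as S + A - P in the x86-64 System V ABI; the notes above do not say which relocation type is involved, so that choice, the function name apply_pc32, and the example addresses are all assumptions.

```python
import struct


def apply_pc32(section, offset, symbol_addr, addend, section_addr):
    """Patch a 32-bit PC-relative field (R_X86_64_PC32: S + A - P).

    section      -- bytearray holding the section's bytes
    offset       -- relocation offset within the section (points at the operand)
    symbol_addr  -- S, the final address of the referenced symbol
    addend       -- A, the addend stored in the relocation entry
    section_addr -- final address of the section, so P = section_addr + offset
    """
    place = section_addr + offset
    value = (symbol_addr + addend - place) & 0xFFFFFFFF  # wrap to 32 bits
    struct.pack_into("<I", section, offset, value)
    return value


# A call instruction: opcode e8 followed by a 4-byte operand still set to zero.
code = bytearray(b"\xe8\x00\x00\x00\x00")
# The relocation offset is 1, i.e. it skips the opcode and targets the operand.
apply_pc32(code, offset=1, symbol_addr=0x2000, addend=-4, section_addr=0x1000)
```

After the call, the operand holds the distance from the end of the instruction to the symbol, and the opcode byte is left untouched.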

When the binary is stripped of all of its symbols, the text section of the binary is nothing but one big blob of machine code. Automatic function detection therefore has to get the function boundaries right; otherwise the recovered functions will be wrong and won't make much sense.
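
One simple heuristic for finding functions in such a blob is to search for a common function prologue. The sketch below looks for the x86-64 byte sequence of push rbp followed by mov rbp, rsp. This is only an assumption about how the code was compiled: functions built without a frame pointer are missed, and data that happens to contain these bytes is reported by mistake. The function name guess_function_starts is hypothetical.

```python
PROLOGUE = bytes.fromhex("554889e5")  # push rbp; mov rbp, rsp (x86-64)


def guess_function_starts(text, base=0):
    """Return addresses in `text` where the frame-pointer prologue appears."""
    starts = []
    pos = text.find(PROLOGUE)
    while pos != -1:
        starts.append(base + pos)
        pos = text.find(PROLOGUE, pos + 1)
    return starts
```

Here text is the raw content of the .text section and base is the address at which that section is loaded.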

The disassembly of a standalone object file is very different from that of a linked executable. The linker adds many sections to the binary for dynamic resolution, program setup, parameters, and other important things that the standalone object file does not contain.

Loading a binary into a process and running it is an intriguing process. The OS first creates a process, assigns it a virtual address space, and then loads an interpreter into that address space. Control is then transferred to the interpreter, which is ld-linux.so on Linux and is implemented as part of ntdll.dll on Windows.

In the case of ELF binaries, there is a section called .interp that contains the path of the interpreter to be used to load the binary. The interpreter loads the binary into its own virtual address space, i.e. the same one in which the interpreter itself is loaded. It then parses the binary to find out which dynamic libraries the binary uses. The interpreter maps these into the virtual address space and then performs any necessary last-minute relocation in the binary’s code sections to fill in the correct addresses for references to the dynamic libraries. After relocation is complete, the interpreter looks up the entry point of the binary and transfers control to it, beginning normal execution of the binary.
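
The interpreter path and the entry point mentioned above can both be read straight from the file. The following is a minimal sketch that assumes a 64-bit little-endian ELF file and looks for the PT_INTERP program header, which is the segment that covers the .interp section. The function name find_interpreter is hypothetical.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr, 56 bytes
PT_INTERP = 3


def find_interpreter(data):
    """Return (interpreter path or None, entry point) for a 64-bit LE ELF image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    fields = EHDR.unpack_from(data, 0)
    entry, phoff, phentsize, phnum = fields[4], fields[5], fields[9], fields[10]
    interp = None
    for i in range(phnum):
        p_type, _flags, p_offset, _va, _pa, p_filesz, _memsz, _align = \
            PHDR.unpack_from(data, phoff + i * phentsize)
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            interp = raw.rstrip(b"\0").decode("ascii", "replace")
            break
    return interp, entry
```

For example, find_interpreter(open("a.out", "rb").read()) returns the path stored in the binary together with the address where execution starts. A statically linked binary has no PT_INTERP entry, in which case the first element is None.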
