A detailed analysis of the Mach-O file format, with emphasis on dynamic linking

Overview

A process is what results when an executable file is loaded into memory. The file must be in a format the operating system understands, so that the OS can parse it, resolve its dependencies (such as libraries), initialize the runtime environment, and begin execution.

Mach-O (Mach Object file format) is the executable file format on macOS, whereas Linux and most Unix systems natively use ELF (Executable and Linkable Format) and Windows uses PE32/PE32+. macOS supports three executable formats: interpreter scripts, universal binaries, and Mach-O, as shown below:

Executable format | Magic | Use
Script | #! | Mainly used for shell scripts, but also for other interpreters such as Perl and AWK. The string after the #! marker names the command to execute, and the kernel hands the script file to that command
Universal binary | 0xcafebabe (0xbebafeca when read byte-swapped) | Contains binaries for multiple architectures; supported only on macOS
Mach-O | 0xfeedface (32-bit), 0xfeedfacf (64-bit) | macOS's native binary format

A universal binary, also known as a “fat binary”, is a container that packages Mach-O binaries for several architectures into one file. It exists mainly for historical reasons, to support both the PowerPC (PPC) and Intel architectures.
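The magic numbers in the table can be checked with a few lines of C. The following is a minimal sketch, not a replacement for the file command: classify_magic is a hypothetical helper name, and the two byte-swapped Mach-O values are included on the assumption that the file was written on a little-endian machine such as x86_64 or arm64.

```c
#include <stddef.h>
#include <stdint.h>

/* Classify the first bytes of a file by magic number.
 * classify_magic is a hypothetical helper, not a system API. */
const char *classify_magic(const unsigned char *buf, size_t len)
{
    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return "script";
    if (len < 4)
        return "unknown";

    /* Assemble the four bytes as a big-endian number so the result
     * does not depend on the byte order of the host. */
    uint32_t be = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                  ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    switch (be) {
    case 0xcafebabe: return "universal binary";
    case 0xfeedface: return "Mach-O 32-bit (big-endian)";
    case 0xfeedfacf: return "Mach-O 64-bit (big-endian)";
    case 0xcefaedfe: return "Mach-O 32-bit (little-endian)";
    case 0xcffaedfe: return "Mach-O 64-bit (little-endian)";
    default:         return "unknown";
    }
}
```

Passing the first four bytes of a file read with fread is enough. Note that Java class files also begin with 0xcafebabe, so a real tool has to look further into the file to tell the two apart.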

Common file types such as executables, dynamic libraries, and the dynamic linker are all in Mach-O format. You can check the format of a specific file with the file command, as shown in the figure below:

Mach-O file format

A Mach-O file, shown in the following figure, consists of four parts:

  • Header

    Describes the file's CPU type, file type, load commands, and similar information.

  • Load commands

    Describe how the data in the file is organized; different load commands represent different kinds of data.

  • Data

    Stores code and data, including instructions, string constants, classes, and methods, in multiple segments. Each segment contains zero or more sections.

  • Loader info

    The end of the file holds link information such as the symbol table, string table, and code signature, which the dynamic linker uses to link the executable and its dependencies.

Why are segments divided into sections?

Segmentation serves several purposes. Different segments can be mapped to different virtual memory regions, which simplifies read/write permission management. Keeping instructions and data apart lets the CPU's separate instruction and data caches exploit program locality, which improves the cache hit rate. Instructions and data can be shared, which improves memory utilization. Sections, unlike segments, do not have to be aligned to the page size, which reduces wasted space.

Header

The Mach-O header is defined by the following structures (one for 32-bit and one for 64-bit architectures):

//32bit
struct mach_header {
    uint32_t    magic;        /* mach magic number identifier */
    cpu_type_t    cputype;    /* cpu specifier */
    cpu_subtype_t    cpusubtype;    /* machine specifier */
    uint32_t    filetype;    /* type of file */
    uint32_t    ncmds;        /* number of load commands */
    uint32_t    sizeofcmds;    /* the size of all the load commands */
    uint32_t    flags;        /* flags */
};
//64bit
struct mach_header_64 {
    uint32_t    magic;        /* mach magic number identifier */
    cpu_type_t    cputype;    /* cpu specifier */
    cpu_subtype_t    cpusubtype;    /* machine specifier */
    uint32_t    filetype;    /* type of file */
    uint32_t    ncmds;        /* number of load commands */
    uint32_t    sizeofcmds;    /* the size of all the load commands */
    uint32_t    flags;        /* flags */
    uint32_t    reserved;    /* reserved */
};

The 32-bit and 64-bit headers differ only in that the 64-bit header has an extra reserved field. The fields are as follows:

  • magic: the magic number, used to determine whether the file is 32-bit or 64-bit

  • cputype: the CPU type, such as ARM or X86_64

  • cpusubtype: the specific CPU model within that type, such as ARMV7 or ARM64E

  • filetype: the file type, such as executable, library, dynamic linker, or debug symbol file. MH_EXECUTE denotes an executable. The values are defined as follows:

    /* Constants for the filetype field of the mach_header */
    #define    MH_OBJECT    0x1        /* relocatable object file */
    #define    MH_EXECUTE    0x2        /* demand paged executable file */
    #define    MH_FVMLIB    0x3        /* fixed VM shared library file */
    #define    MH_CORE        0x4        /* core file */
    #define    MH_PRELOAD    0x5        /* preloaded executable file */
    #define    MH_DYLIB    0x6        /* dynamically bound shared library */
    #define    MH_DYLINKER    0x7        /* dynamic link editor */
    #define    MH_BUNDLE    0x8        /* dynamically bound bundle file */
    #define    MH_DYLIB_STUB    0x9        /* shared library stub for static */
    #define    MH_DSYM        0xa        /* companion file with only debug */
    #define    MH_KEXT_BUNDLE    0xb        /* x86_64 kexts */
  • ncmds: the number of load commands

  • sizeofcmds: the total size in bytes of all load commands in the file

  • reserved: reserved field (64-bit only)

  • flags: flag bits, including the following:

    #define    MH_NOUNDEFS    0x1        // No undefined symbols remain, so there are no unresolved link dependencies
    #define    MH_DYLDLINK    0x4        // The file is input for the dynamic linker (dyld) and cannot be statically link-edited again
    #define    MH_TWOLEVEL    0x80       // The image uses two-level namespace bindings
    #define    MH_PIE         0x200000   // The OS loads the main executable at a random address; only used in MH_EXECUTE files

In addition to MachOView, you can use the otool command to view Mach-O file information. To inspect the header, run otool -h XXX.
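To make the header layout concrete, here is a minimal sketch that decodes the fields of a mach_header_64 from a raw byte buffer, roughly what otool -h prints. It assumes a little-endian 64-bit file; struct header_info, parse_header64, and filetype_name are hypothetical names, and the constants are copied by hand so that the example also compiles on systems without <mach-o/loader.h>.

```c
#include <stddef.h>
#include <stdint.h>

#define MH_MAGIC_64 0xfeedfacfu
#define MH_EXECUTE  0x2u
#define MH_DYLIB    0x6u
#define MH_DYLINKER 0x7u
#define MH_PIE      0x200000u

/* Hypothetical holder for the decoded fields of mach_header_64. */
struct header_info {
    uint32_t magic;
    int32_t  cputype;
    int32_t  cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
};

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a little-endian mach_header_64 (32 bytes) from buf.
 * Returns 0 on success, -1 if buf is too short or the magic differs. */
int parse_header64(const unsigned char *buf, size_t len,
                   struct header_info *out)
{
    if (len < 32 || read_le32(buf) != MH_MAGIC_64)
        return -1;
    out->magic      = read_le32(buf);
    out->cputype    = (int32_t)read_le32(buf + 4);
    out->cpusubtype = (int32_t)read_le32(buf + 8);
    out->filetype   = read_le32(buf + 12);
    out->ncmds      = read_le32(buf + 16);
    out->sizeofcmds = read_le32(buf + 20);
    out->flags      = read_le32(buf + 24);
    /* Bytes 28..31 hold the reserved field and are ignored here. */
    return 0;
}

/* Name a few of the filetype values listed above. */
const char *filetype_name(uint32_t filetype)
{
    switch (filetype) {
    case MH_EXECUTE:  return "MH_EXECUTE";
    case MH_DYLIB:    return "MH_DYLIB";
    case MH_DYLINKER: return "MH_DYLINKER";
    default:          return "other";
    }
}
```

A caller would read the first 32 bytes of the file, call parse_header64, and then test individual flags, for example (info.flags & MH_PIE) to see whether the executable is position independent.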

Load commands

The load commands follow the header immediately (as shown below). They tell the loader how to process the binary's data. Some commands are handled by the kernel and others by the dynamic linker.

  • LC_SEGMENT/LC_SEGMENT_64: maps a segment (32-bit/64-bit) into the process address space and carries the loading information for every section in that segment.

    The __PAGEZERO segment has no access permissions and starts at address 0; it is used to catch null pointers. __DATA/__DATA_CONST are the read/write data segments. The __LINKEDIT segment holds link information: the symbol table, indirect symbol table, rebase opcodes, bind opcodes, export information, function starts, data-in-code table, code signature, string table, and so on. Its load command contains no sections; LC_SYMTAB is needed to locate the symbol table and string table inside it.

    In this example, the file offset in the __LINKEDIT load command is 0x4000 (decimal 16384), which is exactly where the Dynamic Loader Info starts. The file size is 0x5840 (decimal 22592) = 0x9840 (0x9830 + 0x10) - 0x4000, which covers everything from the Dynamic Loader Info to the end of the file.

  • LC_DYLD_INFO_ONLY: locates the dynamic linking information (the offsets of the rebase, weak binding, lazy binding, and export information)

  • LC_SYMTAB: gives the location of the symbol table

  • LC_DYSYMTAB: gives the location of the dynamic symbol table

  • LC_LOAD_DYLINKER: names the dynamic linker to load

  • LC_UUID: the unique identifier of the file; it is also recorded in crash logs so that the dSYM file can be matched against the binary that crashed

  • LC_VERSION_MIN_MACOSX/LC_VERSION_MIN_IPHONEOS: specifies the minimum OS version the binary requires

  • LC_SOURCE_VERSION: the version of the source code used to build the binary

  • LC_MAIN: specifies the entry point and the stack size of the program's main thread

  • LC_ENCRYPTION_INFO_64: describes the encryption information

  • LC_LOAD_DYLIB: names a dynamic library the binary depends on

  • LC_FUNCTION_STARTS: locates a table of function start addresses, which lets debuggers and other tools determine whether an address falls inside a function

  • LC_DATA_IN_CODE: locates a table of non-instruction data embedded in the code segment

  • LC_CODE_SIGNATURE: locates the application's code signature
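Every load command starts with the same two 32-bit fields, cmd and cmdsize, so a reader can step through the list without understanding each command. The sketch below counts the commands of one type; count_load_commands is a hypothetical helper, the buffer is assumed to begin right after the header, and little-endian byte order is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define LC_SYMTAB     0x2u
#define LC_SEGMENT_64 0x19u

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk ncmds load commands stored in cmds[0..len) and count those whose
 * cmd field equals wanted. Returns -1 if the list is malformed. */
int count_load_commands(const unsigned char *cmds, size_t len,
                        uint32_t ncmds, uint32_t wanted)
{
    size_t off = 0;
    int count = 0;

    for (uint32_t i = 0; i < ncmds; i++) {
        if (len - off < 8)              /* no room for cmd + cmdsize */
            return -1;
        uint32_t cmd     = read_le32(cmds + off);
        uint32_t cmdsize = read_le32(cmds + off + 4);
        if (cmdsize < 8 || cmdsize > len - off)
            return -1;                  /* command would overrun the buffer */
        if (cmd == wanted)
            count++;
        off += cmdsize;                 /* the next command follows directly */
    }
    return count;
}
```

In a real file, len would be the header's sizeofcmds and ncmds its ncmds field; the same loop is the skeleton for dispatching on cmd to parse individual commands.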

The segment load command has the following data structure (64-bit shown; the 32-bit version differs little):

struct segment_command_64 { /* for 64-bit architectures */
    uint32_t    cmd;        /* LC_SEGMENT_64 */
    uint32_t    cmdsize;    /* includes sizeof section_64 structs */
    char        segname[16];    /* segment name */
    uint64_t    vmaddr;        /* memory address of this segment */
    uint64_t    vmsize;        /* memory size of this segment */
    uint64_t    fileoff;    /* file offset of this segment */
    uint64_t    filesize;    /* amount to map from the file */
    vm_prot_t    maxprot;    /* maximum VM protection */
    vm_prot_t    initprot;    /* initial VM protection */
    uint32_t    nsects;        /* number of sections in segment */
    uint32_t    flags;        /* flags */
};
  • cmd: the type of the load command; here LC_SEGMENT_64, meaning that a 64-bit segment of the file is mapped into the address space of the process
  • cmdsize: the size of the load command
  • segname: the name of the segment
  • vmaddr: the virtual memory address of the segment
  • vmsize: the virtual memory size of the segment
  • fileoff: the offset of the segment in the file
  • filesize: the size of the segment in the file
  • nsects: the number of sections in the segment

Besides MachOView, you can use otool -l XXX to view the load commands, as shown below:

The address and size of each segment can be viewed with size -l -m XXX, as shown in the figure below:
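The segment fields can be read straight out of the load command bytes. The following sketch decodes a segment_command_64 and computes where the segment ends in the file, which is the arithmetic behind the __LINKEDIT example (0x4000 + 0x5840 = 0x9840). struct segment_info, parse_segment64, and segment_file_end are hypothetical names, and little-endian byte order is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LC_SEGMENT_64 0x19u

/* Hypothetical holder for the fields of interest. */
struct segment_info {
    char     segname[17];   /* 16 bytes from the file plus a terminator */
    uint64_t vmaddr, vmsize, fileoff, filesize;
    uint32_t nsects;
};

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t read_le64(const unsigned char *p)
{
    return (uint64_t)read_le32(p) | ((uint64_t)read_le32(p + 4) << 32);
}

/* Decode a segment_command_64 (72 bytes, section headers excluded).
 * Returns 0 on success, -1 if buf is too short or is another command. */
int parse_segment64(const unsigned char *buf, size_t len,
                    struct segment_info *out)
{
    if (len < 72 || read_le32(buf) != LC_SEGMENT_64)
        return -1;
    memcpy(out->segname, buf + 8, 16);   /* segname may fill all 16 bytes */
    out->segname[16] = '\0';
    out->vmaddr   = read_le64(buf + 24);
    out->vmsize   = read_le64(buf + 32);
    out->fileoff  = read_le64(buf + 40);
    out->filesize = read_le64(buf + 48);
    out->nsects   = read_le32(buf + 64);
    return 0;
}

/* File offset one past the last byte of the segment. */
uint64_t segment_file_end(const struct segment_info *seg)
{
    return seg->fileoff + seg->filesize;
}
```

For __LINKEDIT, which is normally the last segment, segment_file_end should equal the size of the file, matching the observation that it runs from the Dynamic Loader Info to the end of the file.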

The following subsections look at several important load commands, as background for understanding program startup, dynamic loading, and reverse engineering.

LC_SEGMENT_64(__PAGEZERO)

The contents of this load command are shown in the figure below:

The segment covers the virtual address range 0x0 to 0x100000000, that is, the first 4 GB. The file's own virtual address space starts at 0x100000000, so all code and data are loaded above 4 GB. The segment's file size is 0, so it takes up no space in the file, and it has no read, write, or execute permission. Any access through a null pointer, or through a pointer truncated to 32 bits, therefore lands in this range, and the kernel raises an exception such as EXC_BAD_ACCESS.

LC_SEGMENT_64(__LINKEDIT) & LC_DYLD_INFO_ONLY

The __LINKEDIT segment command gives the virtual address, file offset, and protection of the dynamic link data, while LC_DYLD_INFO_ONLY gives the offset and size of the rebase, bind, and export information.

LC_SYMTAB

The LC_SYMTAB load command is defined as follows:

struct symtab_command {
    uint32_t cmd;     /* LC_SYMTAB */
    uint32_t cmdsize; /* sizeof(struct symtab_command) */
    uint32_t symoff;  /* symbol table offset */
    uint32_t nsyms;   /* number of symbol table entries */
    uint32_t stroff;  /* string table offset */
    uint32_t strsize; /* string table size in bytes */
};

This command tells the linker (static or dynamic) the location and size of the symbol table and the string table.

Each symbol table entry has the following structure, defined in <mach-o/nlist.h>:

struct nlist_64 {
    union {
        uint32_t n_strx;   /* index into the string table */
    } n_un;
    uint8_t  n_type;       /* type flag, see below */
    uint8_t  n_sect;       /* section number or NO_SECT */
    uint16_t n_desc;       /* see <mach-o/stab.h> */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};
  • n_un.n_strx: the index of the symbol's name in the string table
  • n_sect: the number of the section that contains the symbol (for symbols defined in this file, valid values start at 1 and go up to 255; for external symbols it is 0)
  • n_value: the address value of the symbol (which changes along with its section during linking)
  • n_type: an 8-bit compound field:
    • bit[5:8] (N_STAB, 0xe0): if nonzero, the symbol is a debugging symbol; see mach-o/stab.h for the possible values
    • bit[4:5] (N_PEXT, 0x10): 1 indicates that the symbol is a private external symbol
    • bit[1:4] (N_TYPE, 0x0e): the symbol type
      • N_UNDF (0x0): undefined
      • N_ABS (0x2): the symbol is an absolute address, which the linker will not change later
      • N_SECT (0xe): a local symbol, that is, a symbol defined in the current Mach-O file
      • N_PBUD (0xc): a prebound symbol
      • N_INDR (0xa): the symbol is the same as another symbol; n_value is an index into the string table that gives the name of that other symbol
    • bit[0:1] (N_EXT, 0x01): the symbol is external, that is, it is either defined externally or defined locally but usable from outside
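The bit layout of n_type is easiest to see with masks. The sketch below splits an n_type byte into its parts; the mask and type values are the ones used by <mach-o/nlist.h>, copied here so the example is self-contained, while struct ntype_info, decode_n_type, and n_type_name are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_STAB 0xe0   /* bits 5..7: debugging (stab) entry if any is set */
#define N_PEXT 0x10   /* bit 4: private external */
#define N_TYPE 0x0e   /* bits 1..3: symbol type */
#define N_EXT  0x01   /* bit 0: external */

#define N_UNDF 0x0
#define N_ABS  0x2
#define N_INDR 0xa
#define N_PBUD 0xc
#define N_SECT 0xe

/* Hypothetical holder for the decoded parts of n_type. */
struct ntype_info {
    bool    is_stab;
    bool    is_private_extern;
    bool    is_extern;
    uint8_t type;             /* one of N_UNDF, N_ABS, ... */
};

struct ntype_info decode_n_type(uint8_t n_type)
{
    struct ntype_info info;
    /* When is_stab is true the whole byte is a stab code and the
     * remaining fields should not be interpreted. */
    info.is_stab           = (n_type & N_STAB) != 0;
    info.is_private_extern = (n_type & N_PEXT) != 0;
    info.is_extern         = (n_type & N_EXT) != 0;
    info.type              = (uint8_t)(n_type & N_TYPE);
    return info;
}

const char *n_type_name(uint8_t type)
{
    switch (type) {
    case N_UNDF: return "N_UNDF";
    case N_ABS:  return "N_ABS";
    case N_INDR: return "N_INDR";
    case N_PBUD: return "N_PBUD";
    case N_SECT: return "N_SECT";
    default:     return "unknown";
    }
}
```

For instance, an n_type of 0x0f decodes as N_SECT with the external bit set, which is what an exported function defined in the file typically carries, while 0x01 is an undefined external symbol that dyld must bind.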

LC_DYSYMTAB

The LC_DYSYMTAB load command is defined as follows:

struct dysymtab_command {
    uint32_t cmd;	/* LC_DYSYMTAB */
    uint32_t cmdsize;	/* sizeof(struct dysymtab_command) */
    uint32_t ilocalsym;	/* index to local symbols */
    uint32_t nlocalsym;	/* number of local symbols */
    uint32_t iextdefsym;/* index to externally defined symbols */
    uint32_t nextdefsym;/* number of externally defined symbols */
    uint32_t iundefsym;	/* index to undefined symbols */
    uint32_t nundefsym;	/* number of undefined symbols */
    uint32_t tocoff;	/* file offset to table of contents */
    uint32_t ntoc;	/* number of entries in table of contents */
    uint32_t modtaboff;	/* file offset to module table */
    uint32_t nmodtab;	/* number of module table entries */
    uint32_t extrefsymoff;	/* offset to referenced symbol table */
    uint32_t nextrefsyms;	/* number of referenced symbol table entries */
    uint32_t indirectsymoff; /* file offset to the indirect symbol table */
    uint32_t nindirectsyms;  /* number of indirect symbol table entries */
    uint32_t extreloff;	/* offset to external relocation entries */
    uint32_t nextrel;	/* number of external relocation entries */
    uint32_t locreloff;	/* offset to local relocation entries */
    uint32_t nlocrel;	/* number of local relocation entries */
};

The command gives the locations and sizes of the parts of the dynamic symbol table: the local, externally defined, and undefined symbol groups, and the indirect symbol table (indirectsymoff and nindirectsyms).

Use otool -I XXX to print the contents of the indirect symbol table.

Each indirect symbol entry identifies the symbol name, the section the symbol belongs to, and the indirect address of the symbol. Indirect symbols are used by the __stubs, __got, and __la_symbol_ptr sections.

The section headers of __TEXT.__stubs, __DATA_CONST.__got, and __DATA.__la_symbol_ptr inside their LC_SEGMENT_64 commands contain an Indirect Sym Index (reserved1) field, which gives the index of the section's first entry in the indirect symbol table, as shown in the figure below:

In this example, the symbols of __la_symbol_ptr start at entry 26 of the indirect symbol table.
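Because reserved1 is the index of the section's first entry, the indirect symbol index of any other entry follows from its position in the section. The sketch below shows that calculation; indirect_index is a hypothetical helper, and the section address and size in the usage note are made-up values. The entry size is 8 for the pointer sections of a 64-bit file; for __stubs it is assumed to be the stub size stored in reserved2.

```c
#include <stdint.h>

/* Return the indirect symbol table index of the entry that starts at
 * entry_addr, or -1 if entry_addr is not the start of an entry in the
 * section described by sect_addr/sect_size. */
int64_t indirect_index(uint64_t sect_addr, uint64_t sect_size,
                       uint32_t reserved1, uint32_t entry_size,
                       uint64_t entry_addr)
{
    if (entry_size == 0 || entry_addr < sect_addr ||
        entry_addr - sect_addr >= sect_size)
        return -1;

    uint64_t delta = entry_addr - sect_addr;
    if (delta % entry_size != 0)
        return -1;                       /* points into the middle of an entry */

    /* The first entry maps to reserved1, the next to reserved1 + 1, ... */
    return (int64_t)reserved1 + (int64_t)(delta / entry_size);
}
```

With reserved1 = 26 as in the example and a hypothetical section starting at 0x100003000, the third pointer (at 0x100003010) corresponds to entry 28 of the indirect symbol table.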

LC_LOAD_DYLINKER

This load command contains the path of the dynamic linker that is needed to start the program; as the figure shows, on x86_64 it is /usr/lib/dyld.

Segment & Section

Section data structure

struct section { /* for 32-bit architectures */
    char        sectname[16];    /* name of this section */
    char        segname[16];    /* segment this section goes in */
    uint32_t    addr;        /* memory address of this section */
    uint32_t    size;        /* size in bytes of this section */
    uint32_t    offset;        /* file offset of this section */
    uint32_t    align;        /* section alignment (power of 2) */
    uint32_t    reloff;        /* file offset of relocation entries */
    uint32_t    nreloc;        /* number of relocation entries */
    uint32_t    flags;        /* flags (section type and attributes)*/
    uint32_t    reserved1;    /* reserved (for offset or index) */
    uint32_t    reserved2;    /* reserved (for count or sizeof) */
};
  • sectname: the name of the section, such as __text or __stubs
  • segname: the segment the section belongs to, such as __TEXT
  • addr: the starting address of the section in memory
  • size: the size of the section
  • offset: the file offset of the section
  • align: the byte alignment, as a power of 2
  • reloff: the file offset of the relocation entries
  • nreloc: the number of relocation entries
  • flags: the type and attributes of the section

Common sections are shown in the following table:

Section | Use
__TEXT.__text | Main program code
__TEXT.__cstring | C language strings
__TEXT.__const | Constants declared with the const keyword
__TEXT.__stubs | Stub (placeholder) code, used to redirect calls to lazy and non-lazy symbols. The section is marked S_SYMBOL_STUBS. References from code in the __TEXT segment to external function symbols in dylibs go through these stubs. Each stub is an indirect jmp that jumps through an entry in the __la_symbol_ptr section.
__TEXT.__stub_helper | The final target of a stub when the real symbol address has not been found yet
__TEXT.__objc_methname | Objective-C method names
__TEXT.__objc_methtype | Objective-C method types
__TEXT.__objc_classname | Objective-C class names
__TEXT.__eh_frame | Debugging auxiliary information
__TEXT.__unwind_info | Information used for exception handling
__DATA.__data | Initialized mutable data
__DATA.__la_symbol_ptr | Lazy binding pointer table; the pointers in the table initially all point into __stub_helper
__DATA.__nl_symbol_ptr | Non-lazy binding pointer table; each entry points to a symbol that the dynamic linker resolved during loading
__DATA.__got | Global offset table
__DATA.__const | Constant data that needs relocation at load time
__DATA.__cfstring | Core Foundation strings (CFStringRefs) used by the program
__DATA.__bss | BSS; stores uninitialized global and static variables, often referred to as static memory allocation
__DATA.__common | Uninitialized symbol declarations
__DATA.__mod_init_func | Initializer functions, called before main
__DATA.__mod_term_func | Terminator functions, called after main returns
__DATA.__objc_classlist | Objective-C class list
__DATA.__objc_protolist | Objective-C protocol list
__DATA.__objc_imageinfo | Objective-C image information
__DATA.__objc_selrefs | Objective-C selector references
__DATA.__objc_protorefs | Objective-C protocol references
__DATA.__objc_superrefs | Objective-C superclass references

__DATA.__got

The contents of the __DATA.__got section look like the figure below:

The section is a table. Each entry is an address-sized slot that holds the address of one non-lazy symbol, and in the file every entry has the value 0. The table exists to hold the addresses of symbols whose targets cannot be determined at link time. When the image is loaded, the dynamic linker dyld binds the symbol that corresponds to each entry and writes its real address into the slot. dyld determines which symbol belongs to which entry from the indirect symbol table described above, which records the symbol name and the indirect address.

__DATA.__la_symbol_ptr

The contents of the __DATA.__la_symbol_ptr section look like this:

Each entry initially points into the __TEXT.__stub_helper section, whose code eventually jumps to dyld_stub_binder. The dyld_stub_binder symbol is itself a non-lazy symbol pointer in the __got section; it is a function defined in dyld_stub_binder.s and provided by dyld.

dyld_stub_binder looks up the real address of the called symbol, writes it into the __la_symbol_ptr entry, and then jumps to that address.

__TEXT.__stubs

The contents of the __TEXT.__stubs section are as follows:

This section is also a table, and each entry is a short piece of code called a “symbol stub”. Run otool -v XXX -s __TEXT __stubs to disassemble it:

Each stub is a jmpq instruction. Taking the first stub as an example, the address it jumps through is computed as follows:

0x100003000 = 0x100001dbc(rip) + 0x1244

This address lies in the __la_symbol_ptr section, whose entry ultimately leads to dyld_stub_binder.
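The rip-relative arithmetic can be reproduced by decoding the stub bytes. This sketch assumes the usual x86_64 stub form, jmpq *disp32(%rip), which is encoded as the bytes ff 25 followed by a 32-bit little-endian displacement; rip is the address of the next instruction, six bytes after the start of the stub. stub_slot_address is a hypothetical helper, and the stub start address 0x100001db6 in the usage note is inferred from the rip value above rather than taken from the figure.

```c
#include <stdint.h>

/* Decode an x86_64 symbol stub of the form `jmpq *disp32(%rip)` and
 * compute the address of the pointer slot it jumps through.
 * Returns 0 on success, -1 if the bytes are not that instruction. */
int stub_slot_address(uint64_t stub_addr, const unsigned char insn[6],
                      uint64_t *slot)
{
    if (insn[0] != 0xff || insn[1] != 0x25)
        return -1;

    uint32_t raw = (uint32_t)insn[2] | ((uint32_t)insn[3] << 8) |
                   ((uint32_t)insn[4] << 16) | ((uint32_t)insn[5] << 24);
    int32_t disp = (int32_t)raw;          /* the displacement is signed */

    uint64_t rip = stub_addr + 6;         /* address of the next instruction */
    *slot = rip + (uint64_t)(int64_t)disp;
    return 0;
}
```

For a stub at 0x100001db6 with displacement 0x1244, rip is 0x100001dbc and the slot is 0x100003000, matching the calculation above. On arm64 the stubs use a different instruction sequence, so this decoder does not apply there.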

Loader info

The loader info consists of the following parts: the dynamic loader info (Dynamic Loader Info, which holds the rebase, weak binding, lazy binding, and export information; its load command is LC_DYLD_INFO_ONLY), the function start address table (Function Starts; its load command is LC_FUNCTION_STARTS), the symbol table (Symbol Table), the dynamic symbol table (Dynamic Symbol Table), the table of non-instruction data in the code segment (Data in Code Table), the string table (String Table, made of null-terminated strings), and the code signature (Code Signature), as shown in the figure below:

Dynamic Loader Info

Because of address space layout randomization (ASLR) and position-independent executables (PIE), the address at which the program is loaded in memory is random, so internal addresses have to be fixed up during dynamic linking. The rebase data describes which locations hold references into the Mach-O file itself and must be adjusted, while the bind data describes which locations refer to external symbols and must be bound. The lazy bind data describes the symbols that are bound late, on first use rather than at startup, to speed up launch. The export data describes the symbols that are visible to the outside world. Each of these is encoded as a sequence of opcodes, immediate values, and ULEB128/SLEB128-encoded offsets.
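ULEB128 stores an unsigned integer seven bits at a time, least significant group first, with the top bit of each byte set when more bytes follow. A minimal decoder, given here as a sketch with the hypothetical name decode_uleb128 and a simplified error convention, looks like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one ULEB128 value from p[0..len).
 * Stores the result in *value and returns the number of bytes consumed,
 * or 0 if the input is truncated or does not fit in 64 bits. */
size_t decode_uleb128(const unsigned char *p, size_t len, uint64_t *value)
{
    uint64_t result = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len; i++) {
        if (shift >= 64)
            return 0;                           /* too many bytes */
        result |= (uint64_t)(p[i] & 0x7f) << shift;
        if ((p[i] & 0x80) == 0) {               /* top bit clear: last byte */
            *value = result;
            return i + 1;
        }
        shift += 7;
    }
    return 0;                                   /* ran out of input */
}
```

The three bytes e5 8e 26 decode to 624485. The offsets that follow opcodes such as REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB are read this way; SLEB128 is the signed variant and additionally sign-extends from the last group.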

PIE (position-independent executable) is a technique for producing executables that do not depend on a fixed load address. If the executable is built as PIE, the address at which it will be loaded into memory cannot be predicted. PIE has a sibling, PIC (position-independent code), which likewise allows compiled code to be loaded at an arbitrary memory address. The difference is that PIC is used when building dynamic libraries (.so files on Linux) and PIE is used when building executables.

For example, a rebase operation finds an address and adds the load offset (slide) to the value stored there. The opcode and the immediate value are obtained by ANDing each data byte with REBASE_OPCODE_MASK (0xF0) and REBASE_IMMEDIATE_MASK (0x0F). For instance, the data byte at 0x100004000 is 0x11; its opcode is 0x10 = 0x11 & 0xF0, which corresponds to REBASE_OPCODE_SET_TYPE_IMM, and its immediate value is 0x01 = 0x11 & 0x0F, that is, type = 1 (REBASE_TYPE_POINTER). The logic for each opcode and immediate value can be found in the dyld source code.
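The masking step can be written out directly. The sketch below splits a rebase byte into opcode and immediate and names the opcode; the constants follow the values in loader.h and are copied here so the example is self-contained, while rebase_opcode_name and rebase_immediate are hypothetical helper names. It only decodes a single byte and does not execute the rebase program.

```c
#include <stdint.h>

#define REBASE_OPCODE_MASK     0xF0
#define REBASE_IMMEDIATE_MASK  0x0F
#define REBASE_TYPE_POINTER    1

#define REBASE_OPCODE_DONE                               0x00
#define REBASE_OPCODE_SET_TYPE_IMM                       0x10
#define REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB        0x20
#define REBASE_OPCODE_ADD_ADDR_ULEB                      0x30
#define REBASE_OPCODE_ADD_ADDR_IMM_SCALED                0x40
#define REBASE_OPCODE_DO_REBASE_IMM_TIMES                0x50
#define REBASE_OPCODE_DO_REBASE_ULEB_TIMES               0x60
#define REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB            0x70
#define REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB 0x80

/* Low four bits of a rebase byte: the immediate operand. */
uint8_t rebase_immediate(uint8_t byte)
{
    return (uint8_t)(byte & REBASE_IMMEDIATE_MASK);
}

/* High four bits of a rebase byte, returned as the opcode's name. */
const char *rebase_opcode_name(uint8_t byte)
{
    switch (byte & REBASE_OPCODE_MASK) {
    case REBASE_OPCODE_DONE:
        return "REBASE_OPCODE_DONE";
    case REBASE_OPCODE_SET_TYPE_IMM:
        return "REBASE_OPCODE_SET_TYPE_IMM";
    case REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB:
        return "REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB";
    case REBASE_OPCODE_ADD_ADDR_ULEB:
        return "REBASE_OPCODE_ADD_ADDR_ULEB";
    case REBASE_OPCODE_ADD_ADDR_IMM_SCALED:
        return "REBASE_OPCODE_ADD_ADDR_IMM_SCALED";
    case REBASE_OPCODE_DO_REBASE_IMM_TIMES:
        return "REBASE_OPCODE_DO_REBASE_IMM_TIMES";
    case REBASE_OPCODE_DO_REBASE_ULEB_TIMES:
        return "REBASE_OPCODE_DO_REBASE_ULEB_TIMES";
    case REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB:
        return "REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB";
    case REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB:
        return "REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB";
    default:
        return "unknown";
    }
}
```

For the byte 0x11 from the example, rebase_opcode_name gives REBASE_OPCODE_SET_TYPE_IMM and rebase_immediate gives 1, which is REBASE_TYPE_POINTER. A full interpreter would loop over the bytes, read ULEB128 operands where the opcode calls for them, and stop at REBASE_OPCODE_DONE.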

Note: the Actions shown in MachOView are misleading. Rebase and bind data are read byte by byte, in order, until all of the data has been consumed. Why MachOView annotates them the way it does is unclear; this note will be updated once that is confirmed.

Dynamic Symbol Table

The indirect symbols in the dynamic symbol table form a table in which each entry is the index of a symbol in the symbol table, as shown in the figure below:

Here the entry is 0x3c = 60, which refers to the symbol at index 60 of the symbol table. The symbol table starts at 0x4380 and each entry takes 0x10 bytes, so the symbol is at 0x4740 = 0x4380 + 0x10 * 0x3c, which is the entry for the _CFRunLoopAddSource symbol.
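The lookup from an indirect symbol table entry to a symbol table entry can be sketched as follows. indirect_to_nlist_offset is a hypothetical helper that works on a file image held in memory; it assumes a little-endian 64-bit file (16-byte nlist_64 records) and treats the special values INDIRECT_SYMBOL_LOCAL and INDIRECT_SYMBOL_ABS as having no symbol table entry.

```c
#include <stddef.h>
#include <stdint.h>

#define INDIRECT_SYMBOL_LOCAL 0x80000000u
#define INDIRECT_SYMBOL_ABS   0x40000000u
#define NLIST64_SIZE          16u        /* sizeof(struct nlist_64) */

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read entry i of the indirect symbol table and compute the file offset
 * of the nlist_64 record it refers to. indirectsymoff/nindirectsyms come
 * from LC_DYSYMTAB, symoff/nsyms from LC_SYMTAB.
 * Returns 0 on success, -1 if i is out of range or has no symbol. */
int indirect_to_nlist_offset(const unsigned char *file, size_t file_len,
                             uint32_t indirectsymoff, uint32_t nindirectsyms,
                             uint32_t i, uint32_t symoff, uint32_t nsyms,
                             uint64_t *nlist_off)
{
    if (i >= nindirectsyms)
        return -1;

    uint64_t entry_off = (uint64_t)indirectsymoff + 4u * (uint64_t)i;
    if (entry_off + 4 > file_len)
        return -1;

    uint32_t sym_index = read_le32(file + entry_off);
    if (sym_index & (INDIRECT_SYMBOL_LOCAL | INDIRECT_SYMBOL_ABS))
        return -1;                       /* entry does not name a symbol */
    if (sym_index >= nsyms)
        return -1;

    *nlist_off = (uint64_t)symoff + (uint64_t)sym_index * NLIST64_SIZE;
    return 0;
}
```

With symoff = 0x4380 and an entry value of 0x3c, the function yields 0x4740, the same result as the calculation above.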

String Table

The string table contains all symbol names, each terminated by a null byte, as shown in the figure below:

The String Table Index field of a symbol table entry is the index of the symbol's name within the string table.
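Resolving a name is then a bounds-checked pointer offset into the string table. A minimal sketch, with symbol_name as a hypothetical helper and the string table already loaded into memory:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the name stored at index n_strx of a string table of strsize
 * bytes, or NULL if the index is out of range or the name is not
 * null-terminated inside the table. */
const char *symbol_name(const char *strtab, uint32_t strsize, uint32_t n_strx)
{
    if (n_strx >= strsize)
        return NULL;
    /* Make sure a terminator exists before the end of the table. */
    if (memchr(strtab + n_strx, '\0', strsize - n_strx) == NULL)
        return NULL;
    return strtab + n_strx;
}
```

Here strtab points at the bytes at file offset stroff, strsize is the strsize field of LC_SYMTAB, and n_strx comes from the nlist_64 entry of the symbol.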
