Preface

This article analyzes the Mach-O file format (the binary executable format used on Apple platforms). You have probably come across Mach-O files in everyday development, but many developers are not sure what they actually are. This article looks at where they come from and how they are structured internally.

1. Mach-O

  1. Whatever the high-level language (C, Objective-C, Swift, and so on), the first step is to generate an AST (abstract syntax tree). Only the compiler front end differs (Clang, Swift, Rust).
  2. Intermediate code (IR) is then generated. Some front ends go through their own intermediate form first (for example SIL for Swift or MIR for Rust), but all of them end up as LLVM IR.
  3. Finally, the LLVM back end turns the IR into machine code, which is written out as a Mach-O file.

LLVM handles this whole process for us. For what LLVM is, see my previous article, "LLVM compilation process".

1.1 Common Mach-O file formats

  • Object files: .o
  • Library files
    • .a
    • .dylib
    • .framework
  • Executable files
  • dyld
  • .dSYM

1.1.1 Verification

.o, .out and executable files

Example 1: create a test.c file with the following contents:

#include <stdio.h>

int main() {
    printf("test\n");
    return 0;
}

First, verify the .o file.

⚠️ Note: if -c is not specified, clang produces an executable named a.out by default. If you get a "'stdio.h' file not found" error, specify the SDK with -isysroot.

Next, verify the a.out executable.

Then verify the executable file test2.

Generate a test3 executable directly.

Compare the MD5 of test2 and test3.

You can see that the MD5 of the generated executables is the same.

⚠️ Note: in principle, the MD5 of test3 should be the same as that of test2 and a.out, because the source code has not changed. A build that passes -isysroot may come out different, presumably because of which CommandLineTools installation is used (there is one in the system and one inside Xcode).

Example 2: create another file, test1.c, with the following contents:

#include <stdio.h>

void test1Func() {
    printf("test1 func \n");
}

Modify test.c:

#include <stdio.h>

void test1Func();

int main() {
    test1Func();
    printf("test\n");
    return 0;
}

Build demo, demo1 and demo2:

clang -o demo  test1.c test.c 
clang -c test1.c test.c 
clang -o demo1 test.o test1.o
clang -o demo2 test1.o test.o

Check their MD5.

The MD5 of demo1 and demo2 is different, because test.o and test1.o are passed to the linker in a different order.

1.2 Inspecting Mach-O with objdump

objdump --macho -d demo

The disassembly shows that the functions appear in a different order, which explains why the MD5 is different. This is similar to the Build Phases -> Compile Sources order in Xcode.

⚠️ Note: when the source files are given in a different order, the compiled binaries differ (although their size is the same), because the code is laid out in a different order inside the binary.

.a file

Create a static library project directly and inspect the result:

//find /usr -name "*.a"
file libTestLibrary.a
libTestLibrary.a: current ar archive random library

.dylib file
file /usr/lib/libprequelite.dylib
/usr/lib/libprequelite.dylib: Mach-O 64-bit dynamically linked shared library x86_64

⚠️ Note: dyld is not an executable. It is the dynamic linker, and it is launched by the system kernel.

dyld file
cd /usr/lib
file dyld
dyld: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamic linker x86_64] [i386:Mach-O dynamic linker i386]
dyld (for architecture x86_64): Mach-O 64-bit dynamic linker x86_64
dyld (for architecture i386):   Mach-O dynamic linker i386

.dSYM file
file TestDsym.app.dSYM
TestDsym.app.dSYM: directory

cd TestDsym.app.dSYM/Contents/Resources/DWARF

file TestDsym
TestDsym: Mach-O 64-bit dSYM companion file arm64

2. Project configuration

2.1 Checking the Type of the Mach-O file

The type of the Mach-O file can be seen in the project's build settings (the Mach-O Type setting).

You can also check it from the command line:

file <path to your Mach-O file>

In this example, two architectures are supported: arm64 and armv7. You can also see the supported architectures directly in Xcode. The architecture settings are under Build Settings -> Architectures.

The configuration options are:

  • Architectures: the supported architectures.
  • Build Active Architecture Only: by default, Debug builds compile only for the architecture of the current device, while Release builds compile for all supported architectures.
  • $(ARCHS_STANDARD): a build variable that stands for the currently supported standard architectures.

To change the architectures, edit the Architectures setting directly (for example, add armv7s).

2.2 Universal binaries

  • A format proposed by Apple in which a single binary file contains program code for multiple architectures.
  • The same package provides the best performance for each of the architectures it contains.
  • Because code for several architectures has to be stored, a universal binary is generally larger than a single-architecture binary.
  • Because the architectures share non-executable resources (everything outside the code), the size is not a simple multiple of the single-architecture version (in special cases, such as a package with only a small number of code files, it can be close to a multiple).
  • Only part of the code is used at run time, so no additional memory is needed to run it.

When we drag a universal binary into Hopper, it asks us to select which architecture to open.

2.3 lipo command

lipo is a tool for managing fat files: it can list the CPU architectures in a file, extract a specific architecture, and merge or split library files.

2.3.1 Viewing the architectures supported by a Mach-O file

lipo -info <Mach-O file>

lipo -info EvergrandeCustomerApp_Example
Architectures in the fat file: EvergrandeCustomerApp_Example are: armv7 arm64

2.3.2 Splitting out an architecture with lipo -thin

lipo <Mach-O file> -thin <architecture> -output <output file path>

2.3.3 Merging multiple architectures with lipo -create

lipo -create <Mach-O file 1> <Mach-O file 2> -output <output file path>

3. Mach-O file structure

Mach-O is the executable file format of macOS and iOS. The system loads files in this format to execute code. A Mach-O file is made up of the following parts:

  • Header: contains general information about the binary file
    • Byte order, architecture type, number of load commands, and so on
    • It lets you quickly check whether the file is 32-bit or 64-bit, which processor it targets, and what type of file it is
  • Load commands: a table with many entries
    • The entries describe the location of regions, the symbol table, the dynamic symbol table, and so on
  • Data: usually the largest part of the object file
    • Contains the actual data of each segment

3.1 Ways to view Mach-O files

There are two ways to view the structure of a Mach-O file:

  1. On the command line: otool -f <Mach-O file>
$ otool -f xxx.app/xxx
Fat headers
fat_magic 0xcafebabe
nfat_arch 2
architecture 0
    cputype 12
    cpusubtype 9
    capabilities 0x0
    offset 16384
    size 69642576
    align 2^14 (16384)
architecture 1
    cputype 16777228
    cpusubtype 0
    capabilities 0x0
    offset 69664768
    size 80306624
    align 2^14 (16384)

  2. With the MachOView visualization tool

3.2 Mach-O header structure

Fat header

First, let's look at the fat header.

A Mach-O file that contains multiple architectures starts with a fat header, which records the CPU type and subtype of each architecture. The offset and size fields give the offset and size of each architecture's slice within the binary.

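In the figure above, the offset and size of the armv7 slice are 16384 and 79315040 respectively, and the offset of the arm64 slice is 79347712. Note that 16384 + 79315040 = 79331424 < 79347712, so the arm64 slice does not start immediately after the armv7 slice ends. On the other hand, 79347712 - 16384 = 79331328 and 79331328 / (1024 * 16) = 4842, where 1024 * 16 is 16 KB. In other words, the arm64 slice starts on the next 16 KB boundary after the end of the armv7 slice, because:

A page is 16 KB on iOS, and the slices in a Mach-O file are page-aligned.

This confirms the page alignment, and it is also the reason why an LC_LOAD_DYLIB command can be inserted into the load commands.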

Mach-O header

The header of the arm64 slice is shown in the figure above. The corresponding structure in the dyld source (loader.h) is:

struct mach_header_64 {
    uint32_t    magic;      /* mach magic number identifier */
    cpu_type_t  cputype;    /* cpu specifier */
    cpu_subtype_t   cpusubtype; /* machine specifier */
    uint32_t    filetype;   /* type of file */
    uint32_t    ncmds;      /* number of load commands */
    uint32_t    sizeofcmds; /* the size of all the load commands */
    uint32_t    flags;      /* flags */
    uint32_t    reserved;   /* reserved */
};

Field        Description
magic        Magic number; quickly identifies whether the file is 64-bit or 32-bit
cputype      CPU type, such as ARM
cpusubtype   CPU subtype, such as ARM64 or ARMv7
filetype     File type, such as executable file
ncmds        Number of load commands
sizeofcmds   Total size of all load commands
flags        Flags for features supported by the binary, mainly related to system loading and linking
reserved     Reserved field, present only in the 64-bit header

The possible values of filetype are:

#define MH_OBJECT 0x1 /* relocatable object file */
#define MH_EXECUTE 0x2 /* demand paged executable file */
#define MH_FVMLIB 0x3 /* fixed VM shared library file */
#define MH_CORE 0x4 /* core file */
#define MH_PRELOAD 0x5 /* preloaded executable file */
#define MH_DYLIB 0x6 /* dynamically bound shared library */
#define MH_DYLINKER 0x7 /* dynamic link editor */
#define MH_BUNDLE 0x8 /* dynamically bound bundle file */
#define MH_DYLIB_STUB 0x9 /* shared library stub for static linking only, no section contents */
#define MH_DSYM 0xa /* companion file with only debug sections */
#define MH_KEXT_BUNDLE 0xb /* x86_64 kexts */
#define MH_FILESET 0xc /* a file composed of other Mach-Os to be run in the same userspace sharing a single linkedit. */

3.3 Load commands

After reading the header, dyld goes on to load and parse the load commands.

3.3.1 The load_command structure

The code is:

/*
 * The load commands directly follow the mach_header.  The total size of all
 * of the commands is given by the sizeofcmds field in the mach_header.  All
 * load commands must have as their first two fields cmd and cmdsize.  The cmd
 * field is filled in with a constant for that command type.  Each command type
 * has a structure specifically for it.  The cmdsize field is the size in bytes
 * of the particular load command structure plus anything that follows it that
 * is a part of the load command (i.e. section structures, strings, etc.).  To
 * advance to the next load command the cmdsize can be added to the offset or
 * pointer of the current load command.  The cmdsize for 32-bit architectures
 * MUST be a multiple of 4 bytes and for 64-bit architectures MUST be a multiple
 * of 8 bytes (these are forever the maximum alignment of any load commands).
 * The padded bytes must be zero.  All tables in the object file must also
 * follow these rules so the file can be memory mapped.  Otherwise the pointers
 * to these tables will not work well or at all on some machines.  With all
 * padding zeroed like objects will compare byte for byte.
 */
struct load_command {
    uint32_t cmd;       /* type of load command */
    uint32_t cmdsize;   /* total size of command in bytes */
};

Each load_command must contain:

  1. cmd: the type of the load command
  2. cmdsize: the total size of the load command in bytes

3.3.2 Details of each load_command

Let's take a closer look at what each load_command contains.

__PAGEZERO

The null pointer trap segment. Its purpose is to keep 64-bit addresses completely separate from 32-bit addresses (32-bit addresses lie below 4 GB, while 64-bit addresses start above 4 GB; 0xffffffff is the top of the 4 GB range). Several fields are worth noting:

  • Segment Name: __PAGEZERO. It occupies no data in the file (File size is 0) but has a VM Size (4 GB on arm64; smaller on armv7).
  • VM Addr: virtual memory address.
  • VM Size: virtual memory size, that is, the size occupied in memory at run time. It is normally the same as File size; __PAGEZERO is the exception.
  • File offset: offset of the data within the file.
  • File size: size of the data within the file.

Normally we locate an address at run time as VM Addr + ASLR.

__TEXT, __DATA, __LINKEDIT

They have roughly the same structure as __PAGEZERO and are used to map segments of the file (32-bit or 64-bit) into the process address space. There are three of them, corresponding to the sections (__TEXT + __DATA) and __LINKEDIT in the Data part, and they tell dyld how much space each one occupies.

LC_DYLD_INFO_ONLY

Dynamic linking information.

  • Rebase: the offset and size of the rebase (ASLR relocation) information. In this example, 336 bytes of data are loaded starting from Rebase Info Offset + ASLR.
  • Binding: binds external symbols.
  • Weak Binding: weak binding.
  • Lazy Binding: lazy binding; a symbol is bound when it is first used.
  • Export info: functions exposed to the outside.

LC_SYMTAB

Symbol table address.

  • Symbol Table Offset: offset of the symbol table
  • Number of Symbols: total number of symbols
  • String Table Offset: offset of the string table
  • String Table Size: size of the string table

LC_DYSYMTAB

Dynamic symbol table address.

It also contains indexes, counts, address offsets and other information.

LC_LOAD_DYLINKER

Specifies which dynamic linker loads the file. On iOS this is dyld.

LC_UUID

The UUID of the file, which uniquely identifies the Mach-O file.

LC_VERSION_MIN_IPHONEOS

The minimum supported operating system version.

LC_SOURCE_VERSION

Version number of the source code.

LC_MAIN

The entry address and stack size of the main program.

LC_ENCRYPTION_INFO_64

Encryption information.

LC_LOAD_DYLIB

Paths of the libraries the file depends on, including both system libraries and third-party libraries.

LC_RPATH

Search paths for the Frameworks libraries, expressed relative to:

  • @executable_path
  • @loader_path

LC_FUNCTION_STARTS

Function start address table.

LC_DATA_IN_CODE

A table of non-instruction data defined inside code sections.

LC_CODE_SIGNATURE

Code signature.

3.4 Data

Data contains sections (__TEXT + __DATA) and __LINKEDIT.

3.4.1 __TEXT

__TEXT is the code segment, which holds the code we write. Its main sections are:

  1. __text: the code section, where the compiled machine code is stored
  3. __stub_helper: helps with dynamic linking (dyld)
  4. __objc_methname: objc method names
  5. __cstring: string constants used in the code. For example, with #define kGeTuiPushAESKey @"DWE2#@e2!", the string DWE2#@e2! is stored in this section
  6. __objc_classname: objc class names
  7. __objc_methtype: objc method types
  8. __ustring
  11. __dof_RACSignal
  12. __dof_RACCompou

3.4.2 __DATA

__DATA is the data segment. Its main sections are:

  1. __got: stores the actual addresses of referenced symbols, similar to the dynamic symbol table; it holds the pointers of __nl_symbol_ptr.
  2. __la_symbol_ptr: lazy symbol pointers, that is, lazily loaded function pointer addresses (addresses of functions implemented in C code). Used together with __stubs and __stub_helper. The detailed mechanism is left for later.
  3. __mod_init_func: module initialization functions.
  4. __const: stores constant data, such as a const variable exported with extern.
  5. __cfstring: Core Foundation strings used in the code.
  6. __objc_classlist: the objc class list; stores class information.
  7. __objc_nlclslist: the list of Objective-C +load functions, which are executed before __mod_init_func.
  8. __objc_catlist: categories.
  9. __objc_nlcatlist: the list of +load functions of Objective-C categories.
  10. __objc_protolist: the objc protocol list.
  11. __objc_imageinfo: objc image information.
  12. Stores the objc_classData structure data. Used to map the addresses of class-related data, such as class names and method names.
  13. __objc_selrefs: referenced objc methods.
  14. __objc_protorefs: referenced objc protocols.
  15. __objc_superrefs: objc superclass references.
  17. __objc_ivar: pointers to objc ivars; stores properties.
  18. __objc_data: objc data needed by classes. Its main job is to map addresses in __objc_const in order to find class-related data.
  19. __data: stores protocols and static variables with fixed addresses (already initialized).
  20. __bss: stores uninitialized static variables, for example static NSThread *_networkRequestThread = nil;. Its size is the memory used while the application runs, not space actually occupied in the file, so it has to be excluded when calculating the file size.
  21. __common: stores exported global data. Similar to static variables, but without the static modifier. For example, NSDictionary* g_registerOrders in KSCrash is stored in __common.

3.4.3 __LINKEDIT

__LINKEDIT mainly contains:

  • Dynamic Loader Info: dynamic loading information
  • Function Starts: function start address table
  • Symbol Table
  • Dynamic Symbol Table
  • String Table
  • Code Signature

Verification

We know that class names can be found in __objc_classname and method names in __objc_methname, but how are the two connected? The mapping goes through __objc_classlist.

Two tools are needed to verify this: MachOView and Hopper.

  1. Open the Mach-O file in MachOView and look at __objc_classlist.

Take the first address, 102725E28, and look it up in Hopper (press G to search).

Double-click it to reach the corresponding __objc_class.

The corresponding definition of __objc_class is:

typedef struct objc_class{
        struct __objc_class* isa;
        struct __objc_class* superclass;
        struct __objc_cache* cache;
        struct __objc_vtable* vtable;
        struct __objc_data* data;
}objc_class;

  1. The first member is the isa pointer, which points to the metaclass; the corresponding address is 102badf90.

  2. The second member is a pointer to the superclass; the corresponding address is 0000000000000000.
  3. The fifth member points to __objc_data; double-click it, and the corresponding address is 102737e30.

Next, look at the definition of the data structure behind __objc_data:

#include <stdint.h>   /* uint32_t */

typedef struct objc_data{
    uint32_t flags;
    uint32_t instanceStart;
    uint32_t instanceSize;
    uint32_t reserved;
    void* ivarlayout;
    char* name;
    struct __objc_method_list* baseMethod;
    struct __objc_protos* baseProtocol;
    struct __objc_ivars* ivars;
    struct __objc_ivars* weakIvarLayout;   /* pointer-sized field */
    struct __objc_ivars* baseProperties;   /* pointer-sized field */
}objc_data;

Several key members:

  1. The sixth member, name, holds the class name; the corresponding address is 0x102445615.

The class name at this address is _AFURLSessionTaskSwizzling, so the class name has been found.

  2. The seventh member, baseMethod, holds all the methods of the class; in the same way, the corresponding address is 102737de0.

Then look at the definition of __objc_method_list:

#include <stdint.h>   /* uint32_t */

typedef struct objc_method_list{
    uint32_t flags;
    uint32_t count;
}objc_method_list;

The field we mainly use is count. Its value here is 0x3, which is also 3 in decimal, so the class has three methods.

The data structure of each method is:

typedef struct objc_method{
    char* name;
    char* signature;
    void* implementation;
}objc_method;

The objc_method_list structure takes up 8 (4 + 4) bytes, so the address of __objc_method_list, 0000000102737DE0, plus 8 bytes gives the address of the first method, 0000000102737DE8. Each objc_method structure takes up 24 (8 * 3) bytes, so adding 24 bytes gives the address of the second method, 0000000102737E00, and adding another 24 bytes gives the address of the third method, 0000000102737E18.

Then we look at the address of the first method, 0000000102737de8, in MachOView.

The first 8 bytes stored at 0000000102737de8 hold the address 0102331F27; search for that address in Hopper.

In the same way, the second 8-byte address is 01023326C4.

The third 8-byte address is 010232C51D.

To sum up: we start from an address in __objc_classlist and follow it to the __objc_class structure, then follow its data member to find the class name. Next we take the address of the baseMethod member that follows and find the objc_method_list. Finally, we step through memory in units of the objc_method structure size to find all the method names.

This is an example of how dyld loads a class name and associates it with its list of methods.

Conclusion

  • Mach-O is a file format
    • It covers executable files, static libraries, dynamic libraries, dyld, and so on
    • Executable files:
      • Universal binaries: a collection of multiple architectures
      • lipo command
        • -info: show the architectures
        • -thin: split out an architecture
        • -create: merge architectures
  • Mach-O structure
    • Header: used to quickly determine the CPU type and file type of a file
    • Load commands: instruct the loader (for example dyld) how to set up and load the binary data
    • Data: stores the data:
      • Code
      • Data
      • String constants
      • Classes
      • Methods