A tree that fills a man's embrace grows from a tiny shoot; a nine-story terrace rises from a mound of earth; a journey of a thousand miles begins with a single step. (Lao Tzu, Tao Te Ching)

To study how a closed-source system implements some piece of logic internally, you need to be able to read and understand assembly language. Assembly may also be the best choice for logic that demands high performance, and some logic can only be implemented in assembly. As an example of the last case: when we want to hook every OC method call, the hook must not corrupt the parameters passed to the original function, and it must call the original function at the right moment without knowing anything about that function's parameters. This can only be achieved in assembly language.

View the assembly code of the program

Most of the time we are not required to write assembly code or machine instructions ourselves, but being able to read simple assembly lets us peek into the underlying implementation and principles of the system. There are many disassemblers on the market that turn machine code into assembly or into high-level pseudocode. The downside is that most of them are static analysis tools, and the code they produce is not always entirely correct. Often we would rather debug or analyze a problem at run time, and in that situation being able to read assembly code works much better.

Three ways to look at assembly code

Xcode provides three ways to view program assembly code:

  1. While the program is running, choose Debug menu -> Debug Workflow -> Always Show Disassembly to switch between assembly mode and high-level language mode when a breakpoint is hit.
  2. Use the shortcut Alt + Command + \ to set a symbolic breakpoint on a system function, a third-party library function, or a method of a class. The debugger then switches to assembly mode when that function or method is called, which lets you read and understand its implementation.
  3. To see the assembly code generated from a high-level language file, choose Product menu -> Perform Action -> Assemble “XXXXX”. When the target is the simulator you see x86_64 assembly, and when the target is a device you see ARM assembly.

An introduction to the clang command

The third way above is actually carried out by the clang command. Clang is a C/C++/Objective-C compiler whose work covers preprocessing, parsing, optimization, code generation, assembly, and linking. Everything we do to build a program through the menus is done internally with clang. You can type man clang in a terminal to see all of the command's options and their descriptions. You can also press Command + 9 in an Xcode project to see the detailed log of each build, which includes the clang commands used to compile and link the program.

As you can see, both compiling the source code and linking the program are done with clang commands. Don’t be intimidated by the large number of compiler and linker options in those commands. They are simply the values set in the project's Build Settings.

For the complete list of build settings and their meanings, see: pewpewthespells.com/blog/builds…

We will cover only a few of the main options for the clang command:

clang [-arch <arm|arm64|x86_64|i386>] [-x <objective-c|objective-c++|c|c++|assembler-with-cpp>] [-L <library path>] [-I <header file path>] [-F <framework header path>] [-isysroot <system SDK path>] [-fobjc-arc|-fno-objc-arc] [-lxxx] [-framework XXX] [-Xlinker option] [-Xlinker value] [-E source file] [-rewrite-objc source file] [-c source file] [-S source file] [-filelist LinkFileList file] [-o output file]

1. General parameters

☞ -arch <arm|arm64|x86_64|i386>: the architecture of the generated code; choose one of the four.

☞ -x <objective-c|objective-c++|c|c++|assembler-with-cpp>: the language of the file being compiled; choose one of the five. When the option is omitted, clang infers the language from the file extension (objective-c for a .m file). This option is used at compile time.

☞ -I <header file path>: the search path for header files brought in with #import or #include.

☞ -L <library path>: the search path for dynamic or static library files. This option is used in the link phase.

☞ -F <framework header path>: the search path for the headers of frameworks brought in with #import.

☞ -isysroot <system SDK path>: the path of the system SDK that the program is built against. For example, -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS12.1.sdk means that the program is compiled or linked against the iOS 12.1 device SDK.

☞ -fobjc-arc | -fno-objc-arc: whether the program is compiled with ARC or with MRC.

☞ -lxxx: used only when linking; links the library named libxxx into the program.

☞ -framework XXX: used only when linking; links the framework named XXX into the program.

☞ -Xlinker option -Xlinker value: passes an option to the linker. The two must appear as a pair; together they mean option = value.

2. Preprocessing

☞ -E source file -o output file: preprocesses the source code. Preprocessing expands every #include and #import header, expands all macro definitions, and resolves conditional compilation directives. You can also view the preprocessed result of a source file from Product menu -> Perform Action -> Preprocess “XXXXX”.
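To see what the preprocessor does, here is a minimal C sketch. The file name square.c, the macros, and the DEMO_DEBUG flag are made up for illustration. Running clang -E square.c -o square.i on it should leave no trace of SQUARE or SIDE in the output, only their expansions, and only one branch of the conditional should survive.

```c
#include <assert.h>

/* Hypothetical macros; after `clang -E` only their expansions remain. */
#define SQUARE(x) ((x) * (x))
#define SIDE 3

/* The preprocessor rewrites the body to: return ((3) * (3)); */
int square_of_side(void)
{
    return SQUARE(SIDE);
}

/* Conditional compilation is resolved too: only one branch is kept. */
int has_debug_code(void)
{
#ifdef DEMO_DEBUG
    return 1;
#else
    return 0;
#endif
}
```

Comparing square.c with square.i side by side is a quick way to confirm that macros no longer exist once compilation proper begins.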

3. Generate C++ code

☞ -rewrite-objc source file: converts OC code into the corresponding C++ implementation and writes a C++ file with the .cpp suffix into the current directory. With this option you can learn how ARC and blocks are implemented and called, how the various OC keywords, class properties, and methods are implemented, how class methods are defined, how the runtime mechanism works, and so on. It gives us a peek at many of the internals of iOS. There is a common error you may encounter when using this command:

In file included from xxxx.m:9:
xxxx.h:9:29: fatal error: module 'UIKit' not found
#pragma clang module import UIKit /* clang -E: implicit import for #import <UIKit/UIKit.h> */
                     ~~~~~~~^~~~~
1 warning and 1 error generated.


This happens mainly because the system SDK cannot be found, so specify the SDK path with the -isysroot option. Here is an example:

clang -rewrite-objc -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS12.1.sdk xxxx.m

The path after -isysroot must be the path of the matching system SDK, and the value of -arch must be an architecture that the SDK in that path supports.

4. Generate assembly code

☞ -S source file -o output file: generates assembly code from a source file. The source file follows -S, and the output file after -o is the resulting assembly file, usually with the .s extension. Remember to use -arch here as well to specify the target architecture.
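A small file is enough to try this option. The sketch below assumes a file named add.c (a made-up name, as are the function names). Running clang -S add.c -o add.s writes the assembly for the host architecture into add.s; adding -arch and -isysroot as described above selects another target.

```c
#include <assert.h>

/* A function small enough that its assembly is easy to read. */
long add_longs(long a, long b)
{
    return a + b;
}

/* A loop, to see how compare-and-branch instructions are generated. */
long sum_array(const long *values, int count)
{
    assert(count >= 0);
    long total = 0;
    for (int i = 0; i < count; i++) {
        total = add_longs(total, values[i]);
    }
    return total;
}
```

Generating the file once without optimization and once with -O2 is a simple way to see how much the optimizer changes the instructions.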

5. Compile

☞ -c source file -o output file: compiles a source file. The source file follows -c, and -o names the resulting object file, which has the .o extension.

6. Link

☞ -filelist LinkFileList file -o output file: linking takes all of the .o object files as input. For convenience, the paths of these .o files are saved in a file with the .LinkFileList extension, and -filelist followed by that file specifies the whole set of object files. The output file after -o is the resulting executable.

Adding assembly code to a project

You can also add assembly code directly to an Xcode project and write programs and functions in assembly. To add an assembly file, choose File menu -> New -> File… and select Assembly File from the list. Assembly files usually have the .s extension, and the new file is empty, ready for you to write assembly code in. Breakpoints can be set in assembly code for debugging. Because iOS supports several architectures, you can use predefined macros in assembly code to distinguish x86_64, ARM, and ARM64, as in the following code:

// You can bring in header files with #include, just as in a high-level language.
#include <xxx.h>

// 32-bit ARM architecture
#ifdef __arm__
// instructions and data definitions

// arm64 architecture
#elif __arm64__
// instructions and data definitions

// 32-bit x86 architecture
#elif __i386__
// instructions and data definitions

// x86_64 architecture
#elif __x86_64__
// instructions and data definitions

// other architectures
#else

#endif
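The same predefined macros are available in C, so a quick way to check which branch a given build takes is a sketch like the following. The function name arch_name is invented for the example, and the extra __aarch64__ test is an addition so that the file also reports correctly with non-Apple toolchains.

```c
#include <assert.h>
#include <string.h>

/* Returns the name of the architecture this file was compiled for. */
const char *arch_name(void)
{
#if defined(__arm64__) || defined(__aarch64__)
    return "arm64";
#elif defined(__arm__)
    return "arm";
#elif defined(__x86_64__)
    return "x86_64";
#elif defined(__i386__)
    return "i386";
#else
    return "unknown";
#endif
}
```

Building the file for the simulator and for a device should return different strings, which mirrors how the assembler picks one block of the .s file.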

Once an assembly file has been added to the project, you need to know how to write assembly code. A detailed description of the instructions themselves is too large a topic to cover here, so this section introduces only the commonly used assembly keywords, to help you read and write programs.

Common assembly syntax

Assembly code in Xcode uses AT&T and ARM assembly syntax, and its keywords all begin with a period (.). Writing assembly code is mostly a matter of defining data and writing instructions. An assembly file can also include other files and use preprocessor directives as in C, and it can reference variables, symbols, and functions defined in a high-level language.

1. Comments

Comments in assembly code are written the same way as in C/C++/OC. In addition, ARM assembly has its own line comment, written with a semicolon (;) after the code, and x86_64 assembly typically writes line comments with ##.

2. Sections

Sections are the unit in which both instructions and data are managed. In the Mach-O file format used by iOS, data and instructions are stored in units of segments and sections. Any code or data is always defined within a section. Every section belongs to a segment, and every section has its own name. A section is defined with the following keyword and syntax:

.section <segment name>,<section name>,<section attributes>

The same segment name and section name may appear in several .section directives. The data and code of a section begin at its .section directive and end where the next section definition begins. When generating code, the system finally stores all content with the same segment name and section name together. In general, all instruction code is defined in the __TEXT segment and all data is defined in the __DATA segment. If the assembly code does not specify a section, data and code go into __TEXT,__text by default. The system also provides two shorthand keywords for defining the code section and the data section:

// Code section definition, equivalent to .section __TEXT,__text
.text
// Data section definition, equivalent to .section __DATA,__data
.data

Besides the names, section definitions in disassembled code also carry attributes such as regular, pure_instructions, no_dead_strip, cstring_literals, and so on. These attributes have the same meaning as the flags field of the struct section_64 structure in the Mach-O file format. The values that flags can take are the macros whose names start with S_ in <mach-o/loader.h>.
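The same segment and section pair can also be chosen from C. The sketch below uses the section attribute that clang and GCC provide; the section name __mydata, the macro, and the variable are invented for the example, and the non-Apple branch exists only so that the file also builds elsewhere. The used attribute keeps the variable from being discarded, which plays a role similar to the no_dead_strip attribute seen in disassembly.

```c
#include <assert.h>

#if defined(__APPLE__)
/* Mach-O: "segment,section", the same pair that .section takes. */
#define IN_MY_SECTION __attribute__((used, section("__DATA,__mydata")))
#else
/* Other object formats: a single section name (assumption for non-Apple builds). */
#define IN_MY_SECTION __attribute__((used, section("mydata")))
#endif

/* Stored in the custom section instead of the default data section. */
static int build_number IN_MY_SECTION = 42;

int read_build_number(void)
{
    return build_number;
}
```

Generating the assembly for this file with -S should show a .section directive carrying the chosen names just before the variable's label.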

3. Labels and symbols

A label is a readable name for an address offset; it is an alias of an address. Labels exist to make program code more readable. A label can be defined and then referenced from instructions, and it can also be referenced from data definitions. Labels are defined as follows:

label_name_1:
// code and data
label_name_2:
// code and data

Labels can be thought of as file-local pointer variables. Labels defined in data sections are used as addresses for accessing variables, and labels defined in code sections are used as targets of jump instructions. For example:

.data
AGE:                        // label definition
.long 13
.text
LAB1:                       // label definition
movl AGE(%rip), %eax        // label usage (.long is 4 bytes, so a 32-bit load)
jmp LAB1                    // label usage

You can also define direction labels (numeric local labels), whose names consist only of digits. The same number may be defined several times. In a jump, the number followed by b means the nearest label with that number defined before the current instruction, and the number followed by f means the nearest label with that number defined after the current instruction, as in the code below:

// x86_64 demo code that defines direction labels and shows how to jump to them.
.text
mov %rax, %rax
1:                  // a
mov %rax, %rax
2:                  // b
mov %rax, %rax
jmp 1b              // jumps backward to a
jmp 2b              // jumps backward to b
jmp 1f              // jumps forward to c
jmp 2f              // jumps forward to d
1:                  // c
mov %rax, %rax
2:                  // d
mov %rax, %rax
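Direction labels are especially handy in assembly embedded in C, because the compiler may copy an inline block several times and a named label would then be defined twice. The following is a sketch under that assumption; the function name sum_down_from is made up, and the plain C loop is a fallback for architectures the example does not cover.

```c
#include <assert.h>

/* Adds n + (n-1) + ... + 1; returns 0 when n <= 0. */
long sum_down_from(long n)
{
    long total = 0;
#if defined(__x86_64__)
    __asm__ volatile(
        "test %1, %1\n"     /* set flags from n */
        "jle  2f\n"         /* n <= 0: jump forward to the next label 2 */
        "1:\n"
        "add  %1, %0\n"     /* total += n */
        "dec  %1\n"         /* n -= 1; sets the zero flag when n reaches 0 */
        "jnz  1b\n"         /* jump backward to the nearest label 1 */
        "2:\n"
        : "+r"(total), "+r"(n)
        :
        : "cc");
#elif defined(__arm64__) || defined(__aarch64__)
    __asm__ volatile(
        "cmp  %1, #0\n"
        "b.le 2f\n"         /* forward reference */
        "1:\n"
        "add  %0, %0, %1\n"
        "subs %1, %1, #1\n"
        "b.ne 1b\n"         /* backward reference */
        "2:\n"
        : "+r"(total), "+r"(n)
        :
        : "cc");
#else
    for (; n > 0; n--) {
        total += n;
    }
#endif
    return total;
}
```

Both assembly branches use label 1 for the loop head and label 2 for the exit, so the same numbers can appear again in any other inline block in the file without a clash.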

A label is only an alias for an address offset within a file and can be referenced only inside the file that defines it. To be referenced and accessed from outside, the label must be declared as a symbol. The externally accessible functions and global variables defined in a high-level language file are all symbols. The memory address of a function or of a global variable is just an address, and an address can be given an alias with a label, so to make a label accessible from outside you declare the label name as a symbol. Just as high-level languages distinguish static functions and variables from global ones, assembly language has two kinds of symbol declaration:

// A global symbol, visible to external programs, which can reference and access it.
.global global_symbol_name
global_symbol_name:

// A private external symbol, which can only be referenced and accessed within the program.
.private_extern private_external_symbol_name
private_external_symbol_name:

The symbol name is the same as the label name. The compiler prefixes C function names, global variable names, and other symbols with an underscore _. A name in a high-level language therefore corresponds to a real symbol with a leading underscore, so it is best to give symbols and labels declared in assembly a leading underscore as well, and to omit that underscore when declaring them in the high-level language, as in the following example:

// xxx.s
// Define a global variable symbol _testSymbol in the data section.
.data
.global _testSymbol
_testSymbol:
.int 10

.............................................
// xxx.m
// The high-level language declares and uses this symbol.
#include <stdio.h>

extern int testSymbol;

int main(int argc, char *argv[])
{
   printf("testSymbol = %d", testSymbol);
   return 0;
}

At the same time, when referring to symbols defined by high-level languages in assembly code, an underscore prefix should also be added.
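The prefix rule can be checked from a single C file. This sketch relies on the predefined macro __USER_LABEL_PREFIX__, which clang and GCC expand to the platform's symbol prefix (an underscore on Apple platforms, nothing on Linux). The helper macros and the function read_test_symbol are hypothetical names, and the variable is defined in file-scope assembly rather than in a separate .s file to keep the example in one place.

```c
#include <assert.h>

/* Build "<prefix>name" as a string literal; the prefix is "_" on Apple platforms. */
#define STR2(x) #x
#define STR(x) STR2(x)
#define ASM_SYM(name) STR(__USER_LABEL_PREFIX__) #name

/* File-scope assembly: defines the prefixed symbol in the data section. */
__asm__(
    ".data\n"
    ".globl " ASM_SYM(testSymbol) "\n"
    ".p2align 2\n"
    ASM_SYM(testSymbol) ":\n"
    "    .long 10\n"
    ".text\n"
);

/* The C declaration uses the name without the prefix. */
extern int testSymbol;

int read_test_symbol(void)
{
    return testSymbol;
}
```

If the prefix were left out of the assembly on an Apple platform, the linker would report testSymbol as undefined, which is the usual symptom of forgetting the underscore.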

4. Alignment

Because of how memory is addressed and accessed, some code or data must be stored at an address that is a multiple of some number; this is called alignment. The alignment keywords are as follows:

// The address here is aligned to a multiple of 8 (2^3).
// .align and .p2align are equivalent.
.align 3
.p2align 3
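For comparison, C can request the same alignment with the C11 _Alignas specifier. The sketch below is only an illustration of the idea; the buffer and the helper names are made up. It checks alignment the same way the assembler defines it, as a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Ask for 8-byte (2^3) alignment, like .align 3 in the assembly above. */
static _Alignas(8) unsigned char buffer[5];

/* Returns 1 when the address is a multiple of 2^power, otherwise 0. */
int is_aligned(const void *address, unsigned power)
{
    uintptr_t mask = ((uintptr_t)1 << power) - 1;
    return ((uintptr_t)address & mask) == 0;
}

const void *aligned_buffer(void)
{
    return buffer;
}
```

The low bits of an aligned address are all zero, which is why the check can be a single mask operation.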

5. Macro definitions

Like C, assembly language supports macro definitions for code reuse. The syntax of a macro definition is as follows:

.macro macro_name
// Any other assembly code and keywords can be written here.
// Macros can take arguments; inside the macro, arguments are always numbered from $0.
.endmacro

To use a defined macro, simply write the macro name in the appropriate place. If the macro has arguments, they follow the macro name and are separated by commas. Here is an example of macro definition and use:

// Define the macro
.macro Test
mov x0, $0
mov x1, $1
.endmacro

// Use the macro
Test 10, 20

6. Data definitions

Data definitions are similar to variable definitions in C, and assembly supports several data types. The syntax of a data definition is as follows:

.<data type> value

The data types are as follows:

Type      Description                                Example
.byte     A single byte                              .byte 0x10
.long     A long integer, 4 bytes                    .long 0x10
.quad     A quadword, 8 bytes                        .quad 0x10
.asciz    A string that ends with a 0 byte           .asciz "Hello world!"
.ascii    A string that does not end with a 0 byte   .ascii "Hello world!"
.space    Empty bytes; the count follows             .space 4
.short    A short integer, 2 bytes                   .short 0x10
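The sizes in the table can be cross-checked against their C counterparts. The following sketch pairs each directive with a fixed-width C definition; all variable names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* C counterparts of the directives in the table. */
uint8_t  one_byte    = 0x10;   /* .byte  0x10 */
uint16_t two_bytes   = 0x10;   /* .short 0x10 */
uint32_t four_bytes  = 0x10;   /* .long  0x10 */
uint64_t eight_bytes = 0x10;   /* .quad  0x10 */

char with_terminator[]      = "Hello world!";  /* .asciz: 12 characters plus a 0 byte */
char without_terminator[12] = "Hello world!";  /* .ascii: the 12 characters only */
unsigned char zero_filled[4];                  /* .space 4 */
```

The difference between .asciz and .ascii shows up as one byte: sizeof reports 13 for the first array and 12 for the second.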

The value in a data definition can be a constant, an expression, or a label. To give a data definition a name, like a variable name, combine it with a label. For example:

name:
.asciz "Ouyang Big Brother"
age:
.long 13
nickname:
.quad name          // the nickname variable is a pointer that refers to the same content as name

To access the variables named by the labels above from a block of code, use instructions like the following:

// x86_64
leaq name(%rip), %rax
movl age(%rip), %ebx
movq nickname(%rip), %rcx

// arm64
adrp x0, name@PAGE
add x0, x0, name@PAGEOFF
adrp x1, age@PAGE
add x1, x1, age@PAGEOFF
ldr w1, [x1]        // age is 4 bytes, so load it into a 32-bit register
adrp x2, nickname@PAGE
add x2, x2, nickname@PAGEOFF
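It may help to read those instructions against the C they correspond to. This is only a rough equivalence sketched for orientation; the accessor function names are made up, and the comments refer to the x86_64 instructions shown above.

```c
#include <assert.h>
#include <stdint.h>

char    name[]   = "Ouyang Big Brother";  /* name:     .asciz "..." */
int32_t age      = 13;                    /* age:      .long 13     */
char   *nickname = name;                  /* nickname: .quad name   */

/* What `leaq name(%rip), %rax` computes: the address of the label. */
const char *address_of_name(void)
{
    return name;
}

/* What `movl age(%rip), %ebx` loads: the 4-byte value stored at the label. */
int32_t value_of_age(void)
{
    return age;
}

/* What `movq nickname(%rip), %rcx` loads: the pointer stored at the label. */
const char *value_of_nickname(void)
{
    return nickname;
}
```

The distinction between taking a label's address (leaq, or adrp plus add) and loading what is stored there (mov, or ldr) is the same as the distinction between name and *name in C.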

7. Function definitions

Assembly language has no special keyword for defining a function; there are only blocks of code, and all executable code is stored in the code section. A function call is really a jump to the first address of the function's code. Calls to functions in the same file can therefore be made through labels, and calls to functions in other files through symbols. Parameters are handled according to the ABI rules for passing function arguments. For details, refer to my introduction to CPU registers in iOS.

The following is an addition function for the x86_64 architecture that returns the sum of its two parameters:

.text
.global _add
.align 3
_add:
movq %rdi, %rcx     // first parameter (%rcx is a scratch register; %rbx would have to be saved and restored)
movq %rsi, %rax     // second parameter
addq %rcx, %rax     // the return value is left in %rax
ret
LExit_add:
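To try a function like this without creating a separate .s file, it can be written as file-scope assembly inside a C file. The sketch below does that for x86_64 with the System V calling convention (an assumption; other calling conventions pass arguments in different registers) and falls back to plain C elsewhere. The name asm_add and the helper macros are invented for the example.

```c
#include <assert.h>

/* Build "<prefix>name"; __USER_LABEL_PREFIX__ is "_" on Apple platforms. */
#define STR2(x) #x
#define STR(x) STR2(x)
#define ASM_SYM(name) STR(__USER_LABEL_PREFIX__) #name

#if defined(__x86_64__) && !defined(_WIN32)
/* File-scope assembly that defines the function body. */
__asm__(
    ".text\n"
    ".globl " ASM_SYM(asm_add) "\n"
    ".p2align 3\n"
    ASM_SYM(asm_add) ":\n"
    "    movq %rdi, %rax\n"   /* first argument */
    "    addq %rsi, %rax\n"   /* add the second argument; result stays in rax */
    "    ret\n"
);
long asm_add(long a, long b);   /* declared without the prefix */
#else
/* Fallback so that the sketch also builds on other architectures. */
long asm_add(long a, long b)
{
    return a + b;
}
#endif
```

From the caller's side nothing distinguishes asm_add from an ordinary C function, which is the point of following the ABI rules.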

8. Writing instructions

Assembly instructions themselves are not covered here; a whole book would not be enough. You can turn to books on assembly programming, but the best approach is to read the manual of each CPU architecture:

  • ARM 32-bit reference manual

  • ARM 64-bit reference manual

  • x86_64 reference manual

9. Pseudo-conditional statements

Assembly language has instructions for comparing and jumping, but pseudo-conditional statements can still be used to make code more readable. Their conditions are evaluated when the file is assembled, not at run time. The syntax for pseudo-conditional statements is as follows:

.if logical expression
.elseif logical expression
.else
.endif

10. CFI: call frame information directives

These pseudo-instructions all begin with .cfi. They record the call frame (stack frame) information of a function and are used mainly for exception handling. For details, see blog.csdn.net/permike/art…

Referencing symbols defined in assembly files

Assembly source files have no .h header file declarations. To use a function or global variable defined in assembly from another file, declare the symbol at the top of that source file:

// xxxxx.m
// Function declaration
extern void <function symbol without the underscore>(<parameter list>);
// Variable declaration
extern <type> <variable symbol without the underscore>;

Embed assembly code in a high-level language

We can also embed assembly code in a high-level language. The main purpose of embedding is to optimize performance. There are also things a high-level language cannot do, such as obtaining the address of the currently executing instruction or reading the values of status registers and other special registers. There are even scenarios in which assembly code solves multithreading problems that high-level languages can only solve with locks. For the embedding syntax and rules, I'll be a little lazy and point directly to this link:

blog.csdn.net/pbymw8iwm/a…

The article describes the embedding rules in detail. Here are three concrete examples:

  • High-level language variables serve as inputs and outputs of the embedded assembly code
long add(long a, long b)
{
    long c = 0;

#if __arm64__
     __asm__(
             "ldr x11, %1\n"
             "ldr x12, %2\n"
             "add %0, x11, x12\n"
             :"=r"(c)
             :"m"(a),"m"(b)
             :"x11","x12"          // registers overwritten by the code
             );
    
#elif __x86_64__
    
    __asm__(
            "movq %1,%%rdi\n"
            "movq %2,%%rsi\n"
            "addq %%rdi,%%rsi\n"
            "movq %%rsi,%0\n"
            :"=r"(c)
            :"m"(a),"m"(b)
            :"rdi","rsi","cc"      // registers and flags overwritten by the code
            );
    
#else
        c = a + b;
#endif
    
    return c;
}
  • The value of a special system register is output to a variable in a high-level language
// Prints the address of the current instruction and the current thread ID.
void foo()
{
    unsigned long pc = 0;
    unsigned long threadid = 0;
    
#if __arm64__
    // TPIDRRO_EL0 holds the thread ID kept by the kernel; the special instruction mrs reads it.
    __asm__("adr x0, #0\n"
              "stur x0, %0\n"
              "mrs %1,TPIDRRO_EL0\n"
              :"=m"(pc),"=r"(threadid)
              :
              :"x0"                // register overwritten by the code
              );
    
#elif __x86_64__
    // x86 CPUs do not have a special register that holds the thread ID.
    __asm__("leaq (%%rip), %%rdi\n"
            "movq %%rdi, %0\n"
            :"=m"(pc)
            :
            :"rdi"                 // register overwritten by the code
            );
#else
    // NSCAssert is the variant of NSAssert for C functions.
    NSCAssert(0, @"oops!");
#endif
    
    NSLog(@"pc=%lu, threadid=%lu", pc, threadid);
}
  • Lock-free multithreaded variable access. Assume the program defines two variables, x and y. Thread A reads the two values for processing, and thread B writes their latest values. The two variables are related and must always be written and read together. In a high-level language, both threads have to take a lock wherever they read or write the pair to keep them in sync. On the ARM architecture, the ldp and stp instructions can read or write both values in a single instruction, which gives the best performance because no lock is needed. Note that the architecture guarantees the pair access to be atomic as a whole only on cores that implement FEAT_LSE2 (ARMv8.4 and later) and only when the address is 16-byte aligned.
// Assume that the x and y variables are stored in the critical array.
// The 16-byte alignment keeps the pair inside one aligned block for ldp/stp.
long critical[2] __attribute__((aligned(16)));

void read(long *px, long *py)
{
#if __arm64__
    __asm__(
            "ldp x9, x10, %2\n"
            "stur x9,%0\n"
            "stur x10,%1\n"
            :"=m"(*px),"=m"(*py):"m"(critical):"x9","x10"
           );  
#else
    // Other architectures must lock when reading.
    *px = critical[0];
    *py = critical[1];
#endif
}

void write(long x, long y)
{
#if __arm64__
    __asm__(
            "stp %1, %2, %0":"=m"(critical):"r"(x),"r"(y)
           );
#else
    // Other architectures must lock when writing the two variables.
    critical[0] = x;
    critical[1] = y;
#endif
}
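The comments in the non-ARM branches say that other architectures must lock. A minimal sketch of what that locked path could look like with a POSIX mutex follows; the names critical_pair, pair_lock, locked_read, and locked_write are hypothetical, and this is one possible way to do it rather than the article's own code.

```c
#include <assert.h>
#include <pthread.h>

/* The pair that must always be read and written together. */
static long critical_pair[2];
static pthread_mutex_t pair_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: both values change while the lock is held. */
void locked_write(long x, long y)
{
    pthread_mutex_lock(&pair_lock);
    critical_pair[0] = x;
    critical_pair[1] = y;
    pthread_mutex_unlock(&pair_lock);
}

/* Reader: sees either the old pair or the new pair, never a mix. */
void locked_read(long *px, long *py)
{
    assert(px != 0 && py != 0);
    pthread_mutex_lock(&pair_lock);
    *px = critical_pair[0];
    *py = critical_pair[1];
    pthread_mutex_unlock(&pair_lock);
}
```

The lock makes the two stores and the two loads indivisible with respect to each other, which is the property the ldp and stp version aims to get from the hardware instead.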
