Preface

Recently, the Amap app completed a startup optimization project that exceeded expectations, cutting startup time on both platforms by more than 65%. On iOS, startup now takes less than 400 milliseconds on an iPhone 7. As the product folks put it after trying it, it is so fast it takes some getting used to. Adding up the time this saves users every day gives quite a sense of achievement, so this article is a summary of the work.

(The illustrations in this article are all hand-drawn by our versatile technical colleague.)

Multidimensional performance analysis during startup

To optimize, the first thing to do is analyze the various performance dimensions of the startup phase: main thread time, CPU, memory, I/O, and network. This gives a more complete picture of the overhead of the startup phase and helps you spot unreasonable method calls.

The faster you want startup to be, the more method calls should be executed on demand, spreading out the startup load. Only the initialization of libraries that everything else depends on, such as the network library and the crash-reporting library, should stay in the startup path; the rest of the functionality that would otherwise be pre-loaded can be deferred until after the startup phase.

What are the startup types, and what are the startup phases?

Startup types are as follows:

  • Cold: the app starts with no process in memory.

  • Warm: the app was recently terminated and is started again; part of it is still in memory, but no process exists.

  • Resume: the app was never terminated, only suspended; it is still entirely in memory and its process still exists.

Analysis is usually done on cold starts. To keep the test environment stable, you sometimes need to pick devices that give stable readings. On iOS, the iPhone 7 has mid-range performance and very consistent results, which makes it a good choice. On Android, the vivo series is relatively stable, while data from Huawei and Xiaomi devices fluctuates more.

Besides the device model, controlling the temperature of the test machine is also very important: once the device gets too hot, the system lowers the CPU frequency and skews the test data. Sometimes airplane mode is turned on and network requests are mocked to reduce the impact of an unstable network. It is best to restart the device, sign out of the iCloud account, and let the phone sit for a while before testing again; that gives more accurate numbers.

The point of understanding the startup phases is to narrow the scope of the analysis and decide which phases matter most for user experience, so that the user sees content and can interact with the app sooner.

Put simply, iOS startup is divided into loading the Mach-O and runtime initialization. When loading a Mach-O, the system checks the magic number at the start of the file to decide how to load it; the following four values can appear (a minimal check is sketched after the list):

  • 0xfeedface corresponds to the macro MH_MAGIC in loader.h

  • 0xfeedfacf corresponds to the macro MH_MAGIC_64

  • The macro MH_CIGAM is NXSwapInt(MH_MAGIC), the byte-swapped form

  • The macro MH_CIGAM_64 is NXSwapInt(MH_MAGIC_64), the byte-swapped 64-bit form
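
As a quick illustration, here is a minimal sketch in C that reads a file's magic number and classifies it (the file path comes from the command line; error handling is kept to the essentials):

#include <stdio.h>
#include <stdint.h>
#include <mach-o/loader.h>

int main(int argc, const char *argv[]) {
    if (argc < 2) return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }
    uint32_t magic = 0;
    // The first four bytes decide how the rest of the file will be interpreted.
    if (fread(&magic, sizeof(magic), 1, fp) == 1) {
        switch (magic) {
            case MH_MAGIC:    printf("32-bit Mach-O\n"); break;
            case MH_MAGIC_64: printf("64-bit Mach-O\n"); break;
            case MH_CIGAM:    printf("32-bit Mach-O, byte-swapped\n"); break;
            case MH_CIGAM_64: printf("64-bit Mach-O, byte-swapped\n"); break;
            default:          printf("not a thin Mach-O file\n"); break;
        }
    }
    fclose(fp);
    return 0;
}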

Mach-O files come in the following types:

  • Intermediate object file (MH_OBJECT)

  • Executable binary (MH_EXECUTE)

  • VM shared library file (MH_FVMLIB)

  • Core file (MH_CORE)

  • Preload (MH_PRELOAD)

  • Dynamic Shared Libraries (MH_DYLIB)

  • Dynamic linker (MH_DYLINKER)

  • Static link stub files (MH_DYLIB_STUB)

  • Symbol and debug information files (MH_DSYM)

Once the Mach-O type is determined, the kernel forks a process and execve starts the loading: it checks the Mach-O header, then maps dyld and the program into the address space according to the load commands. dyld performs rebasing, binding, lazy binding (through dyld_stub_binder), and symbol exporting, and it can also be hooked through DYLD_INSERT_LIBRARIES.

For a lazily bound symbol, dyld_stub_binder resolves the real address from dyld's binding information in the corresponding segment, writes that address into la_symbol_ptr, and the stub then jumps to it. dyld loads all dependent libraries and performs non-lazy binding of the symbols each dynamic library exports, which are stored in a trie structure. Binding is the process of resolving references to functions and data in other modules, in other words importing symbols.

A trie, also called a digital tree or prefix tree, is a search tree. Lookup takes O(m), where m is the length of the key. By comparison, a hash table is usually O(1) but O(N) in the worst case, and it still spends O(m) computing the hash. The drawback of hashing is that it allocates a large chunk of memory, and the more content it holds, the more memory it needs. A trie is not only fast to look up but also fast to insert and delete, which makes it ideal for storing predictive text or autocomplete dictionaries.
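
To make the complexity claim concrete, here is a bare-bones character trie in C. This is only a sketch for intuition, not dyld's actual export trie, which also stores data at the nodes and compresses shared edges:

#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 256

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET];  // one slot per possible byte
    bool isEnd;                           // true if a key ends at this node
} TrieNode;

static TrieNode *trieNodeCreate(void) {
    return calloc(1, sizeof(TrieNode));
}

// Insert walks one node per character, so it costs O(m) for a key of length m.
static void trieInsert(TrieNode *root, const char *key) {
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        if (!node->children[*p]) node->children[*p] = trieNodeCreate();
        node = node->children[*p];
    }
    node->isEnd = true;
}

// Lookup is also O(m), independent of how many keys are stored.
static bool trieContains(TrieNode *root, const char *key) {
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        node = node->children[*p];
        if (!node) return false;
    }
    return node->isEnd;
}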

To further reduce the space it occupies, a trie, which is a deterministic finite automaton, can be compressed into a deterministic acyclic finite state automaton (DAFSA), which merges identical branches and takes up less space.

For larger data sets, further optimizations are possible: alphabet reduction, which reinterprets the original strings as longer strings over a smaller alphabet; a singly linked representation, in which each node stores its symbol, a pointer to its children, and a pointer to the next sibling; or storing the alphabet as a 256-bit bitmap covering the ASCII range.

Even though the trie is heavily optimized, too many symbols still add cost. For dynamic libraries, do not export too many symbols: keep the public symbols few and the private symbols hidden. This makes the library easier to maintain and keep version-compatible, and it shortens the time needed to dynamically load the library into the process.
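
Two common ways to keep a dynamic library's export list small are hidden symbol visibility and an exported-symbols file. A sketch (the symbol names and file name are made up for illustration):

// Default symbols to hidden (e.g. with the "Symbols Hidden by Default" /
// GCC_SYMBOLS_PRIVATE_EXTERN build setting) and mark only the public API as visible:
__attribute__((visibility("default"))) void PublicEntryPoint(void);  // exported
__attribute__((visibility("hidden")))  void InternalHelper(void);    // not exported

// Alternatively, list the public symbols in a file and point the Exported Symbols File
// build setting (EXPORTED_SYMBOLS_FILE) at it, which passes -exported_symbols_list to the linker.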

Next, functions marked with the constructor attribute are executed. Here's an example:

#include <stdio.h>

__attribute__((constructor))
static void prepare() {
    printf("%s\n", "prepare");
}

__attribute__((destructor))
static void end() {
    printf("%s\n", "end");
}

void showHeader() { 
    printf("%s\n", "header");
}

Running results:

ming@mingdeMacBook-Pro macho_demo % ./main "hi"
prepare
hi
end

The runtime initialization process is divided into:

  • Load classes and category extensions.

  • Load C++ static objects.

  • Call the +load function.

  • Execute main.

  • Execute Application initialization, up to application:didFinishLaunchingWithOptions:.

  • Initialize and render the first frame; by the time viewDidAppear finishes, the user can see the interface.

So the analysis of the startup phase ends at viewDidAppear. We had already optimized the stage before Application initialization, but the effect was not obvious and brought no essential improvement. Therefore, this round of optimization mainly analyzes the multi-dimensional performance of the stage from Application initialization to viewDidAppear.

In fact, there are plenty of tools to choose from. Apple's System Trace gives a comprehensive view of system behavior, showing low-level system thread and memory scheduling, which helps analyze problems with locks, threads, memory, and system calls. In general, System Trace makes it clear how the app uses system resources at every moment.

System Trace can show thread states, whether threads are being used appropriately relative to the number of CPU cores, and whether a thread is running, suspended, context switching, being interrupted, or being preempted. It also shows time spent on virtual memory work, such as allocating physical memory, decompressing compressed memory, and file cache activity. It can even show the device's thermal state.

After including sys/kdebug_signpost.h in your code, pair kdebug_signpost_start with kdebug_signpost_end. These two functions take five parameters: the first is an event code (an id you choose), the last selects the color shown in Instruments, and the three in between are reserved, free-form arguments.
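
A minimal sketch of how the pairing looks (the event code and the color value 2 are arbitrary choices for illustration):

#include <sys/kdebug_signpost.h>

static const uint32_t kSignpostHomeSetup = 1;  // any event code, as long as start and end match

static void setupHomePage(void) {
    kdebug_signpost_start(kSignpostHomeSetup, 0, 0, 0, 2);  // last argument picks the color in Instruments
    // ... expensive startup work ...
    kdebug_signpost_end(kSignpostHomeSetup, 0, 0, 0, 2);
}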

XCTest in Xcode 11 also provides an API for measuring launch performance, introduced alongside Apple's launch-optimization material at WWDC 2019:

That session also shows how App Launch, the newest Instruments template, analyzes startup performance. However, App Launch currently cannot automate storing launch data, averaging it, diffing it, filtering it, or correlating it with other data.

The following analyzes the main thread time, CPU, network, memory, I/O and other dimensions:

  • Main thread time

The dimension end users feel most directly is main thread time, so that is the most important one to analyze. You can use Messier, an Objective-C method tracing tool developed by everettjf, to get main thread method timings directly:

It generates a trace JSON file for analysis. Alternatively, see this code:

GCDFetchFeed/SMCallTraceCore.c at master · ming1016/GCDFetchFeed

which manually hooks objc_msgSend to produce per-method timing data for Objective-C. There is also an instrumentation approach that works on the compiler IR (which keeps compilation reasonably fast) and inserts a timing function before and after each method.

Later in this article, I'll describe a tool I developed to further analyze this data and monitor the time taken by startup methods.

Hooking every method call is useful for detailed analysis, but it noticeably inflates total startup time, so getting an accurate measure of the startup phase still relies on manually placed instrumentation points.

As the analysis of startup time deepens, more and more manual instrumentation points get added, which itself affects measurement accuracy; the problem is especially visible when many teams and modules are involved. In practice, each team usually only looks at its own modules or a few related ones when checking startup time, so the instrumentation points can be grouped by module and combined flexibly to cover the different needs.

  • CPU

Why analyze other performance dimensions in addition to main thread method time when startup is slow?

First consider how a slow startup shows itself: the interface responds slowly, the network is slow (large payloads and many requests), and the CPU is overloaded or down-clocked (many parallel tasks and heavy computation). Clearly there are many factors affecting startup, and they need to be considered together.

For the CPU, the WWDC session

What’s New in Energy Debugging – WWDC 2018 – Videos – Apple Developer

introduces using the Energy Log to check CPU power consumption: if a thread keeps the CPU above 80% for three minutes in the foreground, or one minute in the background, it is flagged as high energy use, and the offending thread stack is recorded for analysis. MetricKit also collects power and performance statistics and reports them every 24 hours. Mattt has a post about it on NSHipster:

So how do you get detailed CPU usage, down to how much CPU each thread uses?

There are several ways. The thread is the basic unit of CPU scheduling and allocation, so CPU usage is broken down to the thread level. task_threads returns all of the task's threads in its act_list array, and thread_info returns each thread's basic information in a thread_basic_info_t structure, which includes the thread's run time, run state, and scheduling priority, as well as its CPU usage in the cpu_usage field.

See:

objective c – Get detailed iOS CPU usage with different states – Stack Overflow

GitHub – Tencent/GT

Both contain code for reading CPU usage.
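
Putting the pieces from those references together, a minimal sketch of the per-thread approach looks like this (the function name is illustrative and error handling is trimmed):

#import <mach/mach.h>

// Sums cpu_usage across all threads of the current task.
// The result is in cores, so it can exceed 1.0 on multi-core devices.
static float appCPUUsage(void) {
    thread_act_array_t threads;
    mach_msg_type_number_t threadCount = 0;
    if (task_threads(mach_task_self(), &threads, &threadCount) != KERN_SUCCESS) {
        return -1;
    }
    float total = 0;
    for (mach_msg_type_number_t i = 0; i < threadCount; i++) {
        thread_basic_info_data_t info;
        mach_msg_type_number_t infoCount = THREAD_BASIC_INFO_COUNT;
        if (thread_info(threads[i], THREAD_BASIC_INFO, (thread_info_t)&info, &infoCount) == KERN_SUCCESS
            && !(info.flags & TH_FLAGS_IDLE)) {
            total += info.cpu_usage / (float)TH_USAGE_SCALE;  // cpu_usage is scaled by TH_USAGE_SCALE
        }
    }
    vm_deallocate(mach_task_self(), (vm_address_t)threads, threadCount * sizeof(thread_t));
    return total;
}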

Overall CPU usage can be obtained with the host_statistics function, which fills a host_cpu_load_info structure; its cpu_ticks array holds the number of clock ticks the CPU has spent in each state. Summing the CPU_STATE_USER, CPU_STATE_NICE, and CPU_STATE_SYSTEM ticks and dividing by the total, including idle ticks, gives the overall CPU usage.
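
A sketch of the overall usage via host_statistics. Note that the cpu_ticks values are cumulative since boot, so real monitoring should diff two samples taken some interval apart:

#import <mach/mach.h>

static float deviceCPUUsage(void) {
    host_cpu_load_info_data_t load;
    mach_msg_type_number_t count = HOST_CPU_LOAD_INFO_COUNT;
    if (host_statistics(mach_host_self(), HOST_CPU_LOAD_INFO, (host_info_t)&load, &count) != KERN_SUCCESS) {
        return -1;
    }
    natural_t user   = load.cpu_ticks[CPU_STATE_USER];
    natural_t nice   = load.cpu_ticks[CPU_STATE_NICE];
    natural_t system = load.cpu_ticks[CPU_STATE_SYSTEM];
    natural_t idle   = load.cpu_ticks[CPU_STATE_IDLE];
    natural_t total  = user + nice + system + idle;
    return total == 0 ? 0 : (float)(user + nice + system) / (float)total;
}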

You can also get the number of CPU cores from NSProcessInfo's activeProcessorCount. Analysis of online data shows that phones of the same model running the same system version can perform very differently, because the system lowers the CPU frequency when the device overheats or the battery has degraded too much.

So if the CPU frequency can be read as well, extra optimizations can be applied on down-clocked phones to keep the experience smooth. You can refer to:

Github.com/zenny-chen/…

  • Memory

To get the actual memory usage of your app, see the WebKit source code:

Jetsam determines the app's memory usage and kills the app if it exceeds the threshold; the code Jetsam uses to compute that threshold is in the kernel source.

The device's total physical memory can be obtained from NSProcessInfo's physicalMemory.
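
WebKit's footprint code is built on the phys_footprint field of task_vm_info, which is also the figure Jetsam compares against its limit. A minimal sketch of both numbers (function names are illustrative):

#import <Foundation/Foundation.h>
#import <mach/mach.h>

// The app's physical memory footprint in bytes.
static int64_t appMemoryFootprint(void) {
    task_vm_info_data_t vmInfo;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    if (task_info(mach_task_self(), TASK_VM_INFO, (task_info_t)&vmInfo, &count) != KERN_SUCCESS) {
        return -1;
    }
    return (int64_t)vmInfo.phys_footprint;
}

// The device's total physical memory in bytes.
static unsigned long long devicePhysicalMemory(void) {
    return [NSProcessInfo processInfo].physicalMemory;
}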

  • Network

For network monitoring, you can use fishhook to hook the low-level network library CFNetwork. Network conditions are complicated, so it helps to define a few key time-related indicators (one way to read them is sketched after the list):

  • DNS time
  • SSL time
  • First packet time
  • Response time
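
If hooking CFNetwork is not an option, the same indicators can also be read from NSURLSessionTaskMetrics (iOS 10+). This is not the hook-based approach described above and only covers requests that go through NSURLSession, but it is a simple starting point. A sketch, placed in an NSURLSessionTaskDelegate:

- (void)URLSession:(NSURLSession *)session
              task:(NSURLSessionTask *)task
didFinishCollectingMetrics:(NSURLSessionTaskMetrics *)metrics {
    for (NSURLSessionTaskTransactionMetrics *m in metrics.transactionMetrics) {
        // Some dates are nil for cached or reused connections; a real implementation should check.
        NSTimeInterval dns       = [m.domainLookupEndDate timeIntervalSinceDate:m.domainLookupStartDate];
        NSTimeInterval ssl       = [m.secureConnectionEndDate timeIntervalSinceDate:m.secureConnectionStartDate];
        NSTimeInterval firstByte = [m.responseStartDate timeIntervalSinceDate:m.requestStartDate];
        NSTimeInterval response  = [m.responseEndDate timeIntervalSinceDate:m.responseStartDate];
        NSLog(@"DNS %.0f ms, SSL %.0f ms, first byte %.0f ms, response %.0f ms",
              dns * 1000, ssl * 1000, firstByte * 1000, response * 1000);
    }
}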

With these indicators in place, network problems become much easier to analyze. The startup phase fires off a large number of network requests, so HTTP performance matters a lot. The following WWDC sessions cover networking:

Your App and Next Generation Networks – WWDC 2015 – Videos – Apple Developer

Networking with NSURLSession – WWDC 2015 – Videos – Apple Developer

Networking for the Modern Internet – WWDC 2016 – Videos – Apple Developer

Advances in Networking, Part 1 – WWDC 2017 – Videos – Apple Developer

Advances in Networking, Part 2 – WWDC 2017 – Videos – Apple Developer

Optimizing Your App for Today’s Internet – WWDC 2018 – Videos – Apple Developer

  • I/O

For I/O, you can use Frida:

Frida – A world-class dynamic instrumentation framework | Inject JavaScript to explore native apps on Windows, macOS, GNU/Linux, iOS, Android, and QNX

This dynamic binary instrumentation technique injects custom code while the program is running to capture data such as I/O duration and data size. Frida works on other platforms as well.

More information on multidimensional analysis can be found in past WWDC presentations. Here is a list of WWDC sessions on startup optimization since 2016, and every one of them is worth watching.

Using Time Profiler in Instruments – WWDC 2016 – Videos – Apple Developer

Optimizing I/O for Performance and Battery Life – WWDC 2016 – Videos – Apple Developer

Optimizing App Startup Time – WWDC 2016 – Videos – Apple Developer

App Startup Time: Past, Present, and Future – WWDC 2017 – Videos – Apple Developer

Practical Approaches to Great App Performance – WWDC 2018 – Videos – Apple Developer

Optimizing App Launch – WWDC 2019 – Videos – Apple Developer

Deferred task management

After analyzing main thread method time and the other performance dimensions as described above, methods that do not have to run during startup can be executed on demand or deferred.

Deferred tasks cannot simply all be dumped onto the main thread in one go right after startup, or the user would see the page but still be unable to interact with it. So how should they be scheduled? Create four queues:

  • Asynchronous serial queue
  • Asynchronous parallel queue
  • Idle main thread serial queue
  • Asynchronous serial queue at idle time

Tasks that depend on each other go into the asynchronous serial queue. Tasks on the asynchronous parallel queue can be executed in batches, for example with dispatch groups, limiting the number of tasks per batch so that momentary spikes in CPU, threads, and memory do not interfere with the user's interactions on the main thread. You can also define a limited number of serial queues, each dedicated to one specific kind of work, which likewise prevents sudden spikes in cost from making the UI unresponsive. If dispatch_semaphore_t is used, priority inversion can occur when the semaphore blocks the main queue, so reduce the use of dispatch_semaphore_t to ensure QoS propagation; a dispatch group can be used instead with essentially the same performance and functionality. A sketch of the batching pattern follows the coobjc link below. Asynchronous code can be written directly against the GCD interfaces or with Alibaba's coroutine framework:

coobjc GitHub – alibaba/coobjc
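
A minimal sketch of the batching pattern with dispatch groups (the queue names and the batch size of 4 are arbitrary examples; scheduling happens off the main thread so the wait never blocks it):

#import <Foundation/Foundation.h>

static void runDeferredTasks(NSArray<dispatch_block_t> *tasks) {
    dispatch_queue_t workQueue = dispatch_queue_create("com.example.launch.deferred", DISPATCH_QUEUE_CONCURRENT);
    dispatch_queue_t schedulerQueue = dispatch_queue_create("com.example.launch.scheduler", DISPATCH_QUEUE_SERIAL);
    dispatch_async(schedulerQueue, ^{
        NSUInteger batchSize = 4;  // cap concurrent work to avoid CPU, thread, and memory spikes
        for (NSUInteger start = 0; start < tasks.count; start += batchSize) {
            dispatch_group_t group = dispatch_group_create();
            NSUInteger end = MIN(start + batchSize, tasks.count);
            for (NSUInteger i = start; i < end; i++) {
                dispatch_group_async(group, workQueue, tasks[i]);
            }
            // Wait for this batch to finish before dispatching the next one.
            dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        }
    });
}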

The idle-time queues work by listening to the main thread's runloop state: tasks in the idle queue start executing at kCFRunLoopBeforeWaiting and stop at kCFRunLoopAfterWaiting.
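
A simplified sketch of the idle main-thread queue (the names are illustrative; unlike the full implementation described above, it simply runs one queued task each time the runloop is about to sleep instead of draining until kCFRunLoopAfterWaiting):

#import <Foundation/Foundation.h>

static NSMutableArray<dispatch_block_t> *idleTasks;

static void registerIdleObserver(void) {
    idleTasks = [NSMutableArray array];
    CFRunLoopObserverRef observer = CFRunLoopObserverCreateWithHandler(
        kCFAllocatorDefault, kCFRunLoopBeforeWaiting, true, 0,
        ^(CFRunLoopObserverRef obs, CFRunLoopActivity activity) {
            // Run one task per idle tick so a long backlog cannot block the next event.
            if (idleTasks.count > 0) {
                dispatch_block_t task = idleTasks.firstObject;
                [idleTasks removeObjectAtIndex:0];
                task();
            }
        });
    CFRunLoopAddObserver(CFRunLoopGetMain(), observer, kCFRunLoopCommonModes);
    CFRelease(observer);
}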

How to maintain after optimization?

Winning the gains is easier than keeping them. For example, when I first joined a new team I cut the package size by 48 MB, but defending that for more than a year took tooling as well as determination. For startup optimization, every performance dimension needs to be monitored, but you still need a quick and convenient way to locate a problem once it has been detected. My idea is to list the time-consuming methods of the startup phase along a timeline, each entry showing the method name, call hierarchy, class, module, and owner. For convenience, it should also be easy to view the method's source code.

Next, I'll walk through the tool I developed to achieve this.

  • Parsing JSON

As mentioned earlier, once the method timings have been exported as JSON in the Chrome Trace format, the first step is to parse that data. The JSON looks something like this:

{"name":"[SMVeilweaa]upVeilState:"."cat":"catname"."ph":"B"."pid": 2381,"tid": 0."ts": 21}, {"name":"[SMVeilweaa]tatLaunchState:"."cat":"catname"."ph":"B"."pid": 2381,"tid": 0."ts": 4557}, {"name":"[SMVeilweaa]tatTimeStamp:state:"."cat":"catname"."ph":"B"."pid": 2381,"tid": 0."ts": 4686}, {"name":"[SMVeilweaa]tatTimeStamp:state:"."cat":"catname"."ph":"E"."pid": 2381,"tid": 0."ts": 4727}, {"name":"[SMVeilweaa]tatLaunchState:"."cat":"catname"."ph":"E"."pid": 2381,"tid": 0."ts": 5732}, {"name":"[SMVeilweaa]upVeilState:"."cat":"catname"."ph":"E"."pid": 2381,"tid": 0."ts": 5815},...Copy the code

You can turn this into a flame graph with Chrome's trace-viewer. The name field carries the class, method, and parameter information; the cat field can hold additional performance data; ph is B at the start of a method and E at its end; ts is the timestamp.

Many projects execute a huge number of methods during startup, most of which take very little time. You can filter out methods that take less than 10 milliseconds to focus the analysis.

The times are also color-coded. External time is time not spent in any child method, that is, time spent in system methods or third-party methods without source; it is computed as the parent method's time minus the total time of its children.

Filtering out the short method calls makes problem methods easier to find. However, some methods are cheap per call but are called many times, so their total time adds up; this case also needs to show up on the display page. In addition, when external time is high, or when a method is unfamiliar, you have to search the project source for the method to analyze it, and for very common method names that means wading through a lot of irrelevant matches.

So two more things need to be done: first, accumulate each method's call count and total time and show them on the page; second, pull the method's source code out of the project so it can be shown on the page with a click.

The complete idea is as follows:

  • Display method source code

To display source code on the page, you first parse the .xcworkspace file to find all of the project's .xcodeproj files, then parse each .xcodeproj to collect the paths of all .m and .mm source files, and finally parse the source itself to extract each method's body for display.

Parsing .xcworkspace

If you open the .xcworkspace package, the main file inside is contents.xcworkspacedata, whose content is XML:

<?xml version="1.0" encoding="UTF-8"?>
<Workspace
   version = "1.0">
   <FileRef
      location = "group:GCDFetchFeed.xcodeproj">
   </FileRef>
   <FileRef
      location = "group:Pods/Pods.xcodeproj">
   </FileRef>
</Workspace>

Parsing .xcodeproj

The path of each .xcodeproj is found in the location attribute of a FileRef node. Each .xcodeproj file lists the project's source files, so to obtain the source of a method for display, the first step is to extract the paths of all source files in the project.

The contents of an .xcodeproj file are quite extensive and need to be parsed piece by piece.

Since .xcodeproj files contain many useful comments, extra structures are designed to hold both values and comments. The idea is to decide whether the next level is a key-value structure or an array based on the XcodeprojNode type: if the node type is dicStart, the children form a key-value structure; if it is arrStart, they form an array. When a dicEnd matching the opening dicStart is reached, recursion moves down to the next level of the tree. arrEnd needs no recursion, because arrays in .xcodeproj only hold value-type data.

Once you have the basic node tree structure, you can design a structure for each section of the .xcodeproj. The main sections are:

  • PBXBuildFile: files to build, each ultimately associated with a PBXFileReference.

  • PBXContainerItemProxy: Deployed element.

  • PBXFileReference: all kinds of files, source code, resources, libraries and other files.

  • PBXFrameworksBuildPhase: Used to build the framework.

  • PBXGroup: a folder that can be nested and contains the relationship between files and folders.

  • PBXNativeTarget: Indicates the setting of Target.

  • PBXProject: Project setting with information needed to compile the Project.

  • PBXResourcesBuildPhase: Compile resource files, including xib, storyboard, PList, and images.

  • PBXSourcesBuildPhase: Compile the source file (.m).

  • PBXTargetDependency: dependencies of a Target.

  • PBXVariantGroup:.storyboard file.

  • XCBuildConfiguration: Xcode builds configuration, corresponding to the content of Xcode’s Build Setting panel.

  • XCConfigurationList: Build configuration-related, including project files and target files.

Once you have the section structure of the .xcodeproj, you can start working out the paths of all the source files. As listed above, PBXGroup holds the folder and file relationships: the key of a PBXGroup entry is a folder and its value is the collection of files it contains. So you can design an XcodeprojSourceNode structure to store those folder and file relationships.

Next you need the full file paths. The folder path is built with the recusiveFatherPaths function; the thing to watch for is handling the "../" segments in folder paths.

Parsing .m and .mm files

For Objective-C parsing you can refer to LLVM, but since all we need here is to locate each method's source, it can be done by hand. First, take a look at how LLVM defines tokens before tokenizing. The definition file is here:

Opensource.apple.com/source/lldb…

Based on that definition, I designed the token structure; the main part is as follows:

// Delimiters: [](){}.&=*+-<>~!/%^|?:;,#@
public enum OCTK {
    case unknown                               // not a token
    case eof                                   // end of file
    case eod                                   // end of line
    case codeCompletion                        // code completion marker
    case cxxDefaultargEnd                      // C++ default argument end marker
    case comment                               // comment
    case identifier                            // e.g. abcde123
    case numericConstant(OCTkNumericConstant)  // integer or float such as 0x123; used for evaluation, not needed for code analysis
    case charConstant                          // 'a'
    case stringLiteral                         // "foo"
    case wideStringLiteral                     // L"foo"
    case angleStringLiteral                    // <foo>, needs care to distinguish from the less-than operator
    // standard definitions
    case punctuators(OCTkPunctuators)          // punctuation
    case keyword(OCTKKeyword)                  // keywords
    case atKeyword(OCTKAtKeyword)              // @ keywords
}

The full definition is here:

MethodTraceAnalyze/ParseOCTokensDefine.swift

Github.com/ming1016/Me…

The tokenization process can be seen in LLVM's implementation:

clang: lib/Lex/Lexer.cpp Source File

Clang.llvm.org/doxygen/Lex…

My tokenizer mainly splits the input character by character on these delimiters, with special handling for comments and strings: a whole comment becomes one token, and a complete string becomes one token. My tokenizer code:

MethodTraceAnalyze/ParseOCTokens.swift

Github.com/ming1016/Me…

Since all we want are class names and method bodies, parsing only needs to handle class definitions and method definitions. The syntax tree node is designed as follows:

public struct OCNode {
    public var type: OCNodeType
    public var subNodes: [OCNode]
    public var identifier: String    // identifier
    public var lineRange: (Int, Int) // line range
    public var source: String        // corresponding source code
}

// Node type
public enum OCNodeType {
    case `default`
    case root
    case `import`
    case `class`
    case method
}

lineRange records the range of lines the method occupies in its file, so the code can be pulled out of the file and stored in the source field.

Parsing the syntax tree requires defining the different states of the parsing process:

private enum RState {
    case normal
    case eod                   // end of line
    case methodStart           // method starts
    case methodReturnEnd       // method return type ends
    case methodNameEnd         // method name ends
    case methodParamStart      // method parameter starts
    case methodContentStart    // method body starts
    case methodParamTypeStart  // method parameter type starts
    case methodParamTypeEnd    // method parameter type ends
    case methodParamEnd        // method parameter ends
    case methodParamNameEnd    // method parameter name ends
    case at                    // @
    case atImplementation      // @implementation

    case normalBlock           // block {} outside an OC method, for C functions
}

The code that fully resolves the class and line range of the method is here:

MethodTraceAnalyze/ParseOCNodes.swift

Parsing the .m and .mm files serially is too slow to accept for a large project, so multiple files are read and parsed in parallel, in batches. Testing showed that batches of around 60 files make the best use of my machine's CPU (2.5 GHz dual-core Intel Core i7), with memory usage of only about 60 MB; a project with more than 10,000 .m files can be parsed in roughly two and a half minutes.

A dispatch group wait is used to make sure one batch finishes before moving on to the next.

Now that each method's source is available, it can be matched against the methods in the trace. For the page display, only a small piece of JavaScript is needed to show the corresponding method's source on click.

The page display

Before the HTML page is generated, the newlines and spaces in the code are replaced with their HTML equivalents.

let allNodes = ParseOC.ocNodes(workspacePath: "/Users/ming/Downloads/GCDFetchFeed/GCDFetchFeed/GCDFetchFeed.xcworkspace")
var sourceDic = [String: String]()
for aNode in allNodes {
    sourceDic[aNode.identifier] = aNode.source
        .replacingOccurrences(of: "\n", with: "</br>")
        .replacingOccurrences(of: " ", with: "&nbsp;")
}

A p tag is used to display the source code; the method's execution order number plus the method name is used as the p tag's id, and display: none; hides it initially. The method name is wrapped in an a tag whose click handler runs a piece of JavaScript that reveals the corresponding code when clicked. The JavaScript looks like this:

function sourceShowHidden(sourceIdName) {
    var sourceCode = document.getElementById(sourceIdName);
    sourceCode.style.display = "block";
}

The final effect is shown below:

Combining dynamic and static analysis like this, later versions can also be diffed to find which methods' implementations have changed and show that on the page. You can go further and statically analyze which methods call I/O functions, spawn new threads or queues, and so on, and display that on the page for easier analysis.

You may have noticed that this method analysis tool does not reuse any existing wheels, even though libraries exist for JSON, XML, .xcodeproj, and Objective-C syntax parsing. The reason is that different libraries use very different languages and technologies, and if one of them stops being updated when a format changes, the whole tool suffers. The tool's work is mostly parsing, so writing the parsing myself keeps the functionality focused, avoids useless features, reduces the amount of code to maintain, and when a format does change, the parsing can be updated independently. More importantly, it is a chance to get hands-on with the syntax design of these formats.

Conclusion

This article has summarized the technical means of startup optimization. In general, the determination to keep optimizing startup matters far more than the techniques; it decides how much more can be optimized. There are many techniques, and in my view the difference between good and bad ones is only efficiency: in the worst case, you could still solve the problem by manually timing things one by one.