Moment For Technology

Douyin Quality Construction - iOS startup optimization

Posted on April 2, 2023, 7:13 p.m. by Anahita Varma
Category: ios Tag: ios


Startup is the first impression an app gives its users. The slower the launch, the more likely users are to leave, so a fast launch is an indispensable part of the user experience. Startup optimization covers a lot of ground, too much for a single article, so it is split into two parts: principles and practice. This article is the practice part.

Principles: Douyin Quality Construction - iOS startup optimization principles

How do you approach startup optimization?

Before getting into the details, consider this: if you were asked to optimize startup, how would you go about it?

This is a fairly big question. As with similar cases, we can break it into several smaller ones:

  • How does startup actually perform for users in production?
  • How do we find the points that can be optimized?
  • After optimizing, how do we keep the gains?
  • Is there mature experience to draw on, and how does the industry do it?

These correspond to the three parts of this article: monitoring, tools, and best practices.


Monitoring

Startup instrumentation points

To monitor startup, you need to be able to obtain the startup time in code. The starting point is always the same: the time the process was created.

The end point should correspond to the moment the user perceives the launch image disappearing, that is, the first frame. Douyin uses the following scheme:

  • iOS 12 and below: viewDidAppear of the root view controller
  • iOS 13+: applicationDidBecomeActive

Apple's official measurement uses the first CA::Transaction::commit, but that is implemented inside the system frameworks. The points Douyin uses are very close to it.

Staged monitoring

A single instrumentation point for total startup time is clearly not enough to troubleshoot problems in production. Stage-level points can be combined with individual points. The following is Douyin's current monitoring scheme:

The call order of +load methods and static initializers depends on the link order. By default, CocoaPods links pods in ascending order of pod name, so the +load methods and initializers of a pod whose name starts with AAA run first.
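As a rough illustration of staged monitoring, the sketch below turns a list of stage-end timestamps into per-stage durations measured from process creation. It is a minimal sketch, not Douyin's implementation: `StageMark` and `stageDurations` are hypothetical names, and the stages are assumed to be reported in chronological order.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One instrumentation point: the stage that just ended and when it ended.
struct StageMark {
    std::string stage;
    double timestampMs;
};

// Converts ordered stage-end timestamps into per-stage durations,
// starting from the process creation time.
std::vector<std::pair<std::string, double>> stageDurations(
    double processStartMs, const std::vector<StageMark>& marks) {
    std::vector<std::pair<std::string, double>> durations;
    double previous = processStartMs;
    for (const StageMark& m : marks) {
        assert(m.timestampMs >= previous);  // marks must be chronological
        durations.emplace_back(m.stage, m.timestampMs - previous);
        previous = m.timestampMs;
    }
    return durations;
}
```

Reporting both the total and the per-stage values makes it possible to tell which stage a production regression came from.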

Non-intrusive monitoring

The company's APM team provides a non-intrusive startup monitoring solution that breaks startup into coarse-grained, business-agnostic stages: process creation, earliest +load, didFinishLaunching start, and first render of the first screen.

The first three points in time are easy to obtain non-intrusively:

  • Process creation: use the sysctl system call to get the timestamp of process creation
  • Earliest +load: as in the staged monitoring above, prefix the pod name with AAA so that its +load runs first
  • didFinishLaunching start: the monitoring SDK is generally initialized very early in startup, so its initialization time is used as the didFinishLaunching start time
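The process creation timestamp from the first bullet can be read roughly as follows. This is a sketch under assumptions: it uses the Darwin `sysctl` interface with the `KERN_PROC_PID` query, `processStartSeconds` is a hypothetical helper name, and on non-Apple platforms it simply returns 0.0.

```cpp
#include <cassert>
#if defined(__APPLE__)
#include <sys/sysctl.h>
#include <unistd.h>
#endif

// Returns the process creation time in seconds since the Unix epoch,
// or 0.0 when it is unavailable (non-Apple platform or sysctl failure).
double processStartSeconds() {
#if defined(__APPLE__)
    struct kinfo_proc info;
    size_t size = sizeof(info);
    int mib[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, getpid()};
    if (sysctl(mib, 4, &info, &size, nullptr, 0) != 0 || size == 0) {
        return 0.0;
    }
    // p_starttime is the time at which the kernel created the process.
    const struct timeval& tv = info.kp_proc.p_starttime;
    return static_cast<double>(tv.tv_sec) + tv.tv_usec / 1e6;
#else
    return 0.0;
#endif
}
```

Subtracting this value from the timestamp taken at the launch end point gives the total startup time.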

For the first render completion time, we want to align with MetricKit, that is, obtain the time at which CA::Transaction::commit() is called.

Through Runloop source code analysis and offline debugging, we found the order, from earliest to latest, in which CA::Transaction::commit(), CFRunLoopPerformBlock, and kCFRunLoopBeforeTimers occur:

The callbacks for the latter two points in time can be obtained in didFinishLaunching, either by registering a block with the Runloop or by adding a BeforeTimers observer:

// Option 1: register a block on the main Runloop
CFRunLoopRef mainRunloop = [[NSRunLoop mainRunLoop] getCFRunLoop];
CFRunLoopPerformBlock(mainRunloop, (__bridge CFTypeRef)NSDefaultRunLoopMode, ^{
    NSTimeInterval stamp = [[NSDate date] timeIntervalSince1970];
    NSLog(@"runloop block launch end:%f", stamp);
});

// Option 2: register a kCFRunLoopBeforeTimers observer
CFRunLoopRef mainRunloop = [[NSRunLoop mainRunLoop] getCFRunLoop];
CFRunLoopActivity activities = kCFRunLoopAllActivities;
CFRunLoopObserverRef observer = CFRunLoopObserverCreateWithHandler(kCFAllocatorDefault, activities, YES, 0, ^(CFRunLoopObserverRef observer, CFRunLoopActivity activity) {
    if (activity == kCFRunLoopBeforeTimers) {
        NSTimeInterval stamp = [[NSDate date] timeIntervalSince1970];
        NSLog(@"runloop beforetimers launch end:%f", stamp);
        // Only the first callback is needed, so remove the observer
        CFRunLoopRemoveObserver(mainRunloop, observer, kCFRunLoopCommonModes);
    }
});
CFRunLoopAddObserver(mainRunloop, observer, kCFRunLoopCommonModes);

After actual measurement, we settled on the following non-intrusive scheme for first-screen rendering:

  1. On iOS 13 and later, registering a kCFRunLoopBeforeTimers callback on the Runloop gives the more accurate time for the completion of the app's first-screen rendering.
  2. On systems below iOS 13, injecting a block with CFRunLoopPerformBlock gives the more accurate time for the completion of the app's first-screen rendering.

Monitoring cycle

The life cycle of an app version can be divided into three stages: development, grayscale release, and production. The purpose and method of monitoring differ at each stage.

Development phase

The main purpose of monitoring during development is to prevent regressions. For this there is offline automated monitoring, which runs real startup performance tests so that problems are found and fixed as early as possible.

A scheduled task triggers the pipeline: it first builds a package in release mode and then runs an automated test. The results are reported when the test finishes, so the overall trend can be tracked on a dashboard.

If a regression is found, an alert is sent first, then the MR that introduced the regression is located by binary search, and finally automatically generated flame graphs and Instruments help pinpoint the problem.
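The binary search over MRs can be sketched as below. This is a minimal illustration, not the actual pipeline: `firstRegressedBuild` and the `isRegressed` callback (which would build and measure one MR) are hypothetical, and the sketch assumes the regression persists in every build after the one that introduced it.

```cpp
#include <cassert>
#include <functional>

// Returns the index of the first regressed build in (good, bad].
// Precondition: build `good` is known to be fast and build `bad` known to be slow.
int firstRegressedBuild(int good, int bad,
                        const std::function<bool(int)>& isRegressed) {
    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (isRegressed(mid)) {
            bad = mid;   // regression was introduced at or before mid
        } else {
            good = mid;  // regression was introduced after mid
        }
    }
    return bad;
}
```

With 100 MRs between two scheduled runs, this needs about 7 measured builds rather than 100.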

So how do we make sure the test results are stable and reliable?

The answer is to control the variables:

  • Turn off iCloud, do not sign in with an Apple ID, and use airplane mode
  • Cool the device with a fan and use an MFi-certified cable
  • Restart the phone and let it sit for a while before starting the next test
  • Take multiple measurements and compute the mean and variance
  • Mock the A/B experiment variables on the startup path
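The "mean and variance" step can be sketched as follows. `LaunchStats` and `summarize` are hypothetical names, and the sketch uses the population variance; a real pipeline might prefer the sample variance or drop outliers first.

```cpp
#include <cassert>
#include <vector>

struct LaunchStats {
    double mean;
    double variance;
};

// Population mean and variance of repeated launch-time measurements (ms).
LaunchStats summarize(const std::vector<double>& samplesMs) {
    assert(!samplesMs.empty());
    double sum = 0.0;
    for (double s : samplesMs) sum += s;
    const double mean = sum / samplesMs.size();

    double squaredDeviation = 0.0;
    for (double s : samplesMs) squaredDeviation += (s - mean) * (s - mean);
    return {mean, squaredDeviation / samplesMs.size()};
}
```

A high variance is a sign that the environment is not under control, in which case the mean should not be trusted as evidence of a regression.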

In practice, we found that the iPhone 8 gave the most stable results, followed by the iPhone X, while the iPhone 6 was quite unstable.

In addition to automated testing, you can add approval gates to the development process to prevent launch regressions. Changes that require approval include:

  • New dynamic libraries
  • New +load methods and static initializers
  • New startup tasks (through code review)

Fine-grained code review is not recommended: unless the reviewer knows the business well, it is generally impossible to tell from the code whether a change will cause a regression.

Grayscale and production

The strategies for grayscale and production are similar: watch the aggregate data and configure alerts. Aggregate monitoring and alerting depend heavily on the company's infrastructure. If no such infrastructure exists, the MetricKit data in Xcode also shows launch time: open Xcode - Window - Organizer - Launch Time.

The aggregate data is statistical in nature, and it follows some statistical patterns:

  • In the first few days after a release, startup is slower, because since iOS 13 the first launch after an app update has to create the launch closure, which is a slow process
  • Releasing a new version makes the PCT50 of older versions slower: low-performance devices upgrade more slowly, so their share among the users of older versions grows
  • Adjusting the sampling rate affects PCT50. For example, in some regions the iPhone 6 makes up a large share of devices, so raising the sampling rate in those regions increases the share of low-performance devices in the data

Given this, we usually control for region, device model, and version, and sometimes also look at how startup time changes over time.
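To make the PCT50 effects above concrete, here is a small sketch of a percentile computation. It assumes the nearest-rank definition, which may differ from what a given monitoring platform uses, and `percentile` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least
// pct percent of all samples are less than or equal to it.
double percentile(std::vector<double> samples, double pct) {
    assert(!samples.empty() && pct > 0.0 && pct <= 100.0);
    std::sort(samples.begin(), samples.end());
    const size_t rank =
        static_cast<size_t>(std::ceil(pct / 100.0 * samples.size()));
    return samples[rank - 1];
}
```

For example, with launch times of 700, 800, and 900 ms the PCT50 is 800 ms. Adding two samples from slow devices (1500 and 1600 ms) moves it to 900 ms even though no code changed, which is why the device mix has to be controlled before comparing versions.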


Tools

Once monitoring is in place, we need to find points that can be optimized, and that requires tools. There are two main categories: Instruments and in-house tools.

Time Profiler

Time Profiler is one of the most commonly used tools in day-to-day performance analysis. You typically select a time range and look at the time aggregated by call stack. But Time Profiler is only suitable for coarse-grained analysis. To see why, look at how it works:

By default, Time Profiler samples every 1 ms, collects only the call stacks of running threads, and aggregates them statistically. For example, method3 is not captured by any of the five samples shown in the figure below, so method3 does not appear in the aggregated stacks. The time shown in Time Profiler is therefore not the actual execution time of the code, but the time for which the stack appeared in the samples.
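The sampling behavior can be imitated with a small simulation. This is only a model of the idea, not how Instruments is implemented: `Span` and `sampledTimeUs` are hypothetical, and a single thread with non-overlapping methods is assumed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A method occupying the half-open interval [startUs, endUs) on one thread.
struct Span {
    std::string method;
    long startUs;
    long endUs;
};

// At every sampling tick, attributes one full interval to whichever
// method is running at that instant.
std::map<std::string, long> sampledTimeUs(const std::vector<Span>& timeline,
                                          long intervalUs, long totalUs) {
    assert(intervalUs > 0);
    std::map<std::string, long> estimate;
    for (long t = 0; t < totalUs; t += intervalUs) {
        for (const Span& s : timeline) {
            if (t >= s.startUs && t < s.endUs) {
                estimate[s.method] += intervalUs;
                break;
            }
        }
    }
    return estimate;
}
```

If method1 runs for 1800 µs, method3 for 150 µs, and method2 for 3050 µs, five samples at 1 ms intervals attribute 2000 µs to method1 and 3000 µs to method2, and method3 never appears because it falls entirely between two ticks.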

Time Profiler supports some additional options. If the reported time differs greatly from the actual time, you can enable them:

  • High Frequency: reduces the sampling interval
  • Record Kernel Callstacks: records the kernel's call stacks
  • Record Waiting Threads: records threads that are blocked

System Trace

Since Time Profiler only supports coarse-grained analysis, is there a fine-grained tool? The answer is System Trace.

For fine-grained analysis we need to mark a short period of time, which can be done with Points of Interest. System Trace is also useful for analyzing virtual memory and thread state:

  • Virtual Memory: focus on Page In events, because the startup path triggers many of them and they are relatively expensive
  • Thread State: focus on the blocked and preempted states, and remember that the main thread is not always running
  • System Load: threads have priorities, and the number of high-priority threads should not exceed the number of CPU cores


os_signpost

os_signpost is an API introduced in iOS 12 for marking time periods in Instruments. It has no impact on startup. Combined with the staged monitoring described earlier, it lets us split startup into multiple phases in Instruments and analyze specific problems with the other templates:

Combined with swizzling, os_signpost can do some surprisingly useful things, such as hooking all +load methods or hooking UIImage methods to see how long each one takes.

Other Instruments templates

Besides these, a few other templates are commonly used:

  • Static Initializer: analyzes C++ static initializers
  • App Launch: a template added in Xcode 11 that combines Time Profiler and System Trace
  • Custom Instrument: the simplest way to build a custom instrument is to use os_signpost as the data source of the template and make simple customizations. For details, see the WWDC sessions.

Flame graph

Flame graphs are useful for analyzing time-related performance bottlenecks because they plot the time spent in business code directly. They can also be generated automatically and then diffed, so they can be used for automated attribution.

There are two common ways to implement flame graphs:

  • Hook objc_msgSend
  • Instrument at compile time

In essence, both record a point at the beginning and at the end of each method to learn how long the method takes, and then convert the data to Chrome's standard JSON trace format for analysis. Note that even if mmap is used to write the file, the instrumentation still introduces some measurement error, so what looks like a problem is not necessarily one and needs to be double-checked.
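The conversion step can be sketched as follows. It emits Chrome trace "complete" events (`"ph":"X"`, with `ts` and `dur` in microseconds) as a JSON array. `MethodRecord` and `toChromeTrace` are hypothetical names, the process id is hard-coded to 1, and method names are assumed not to need JSON escaping.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One begin/end pair recorded by the hook or the compile-time instrumentation.
struct MethodRecord {
    std::string name;
    long long beginUs;
    long long endUs;
    int tid;
};

// Serializes the records as a JSON array of Chrome trace complete events.
std::string toChromeTrace(const std::vector<MethodRecord>& records) {
    std::ostringstream out;
    out << "[";
    for (size_t i = 0; i < records.size(); ++i) {
        const MethodRecord& r = records[i];
        assert(r.endUs >= r.beginUs);
        if (i > 0) out << ",";
        out << "{\"name\":\"" << r.name << "\",\"ph\":\"X\",\"ts\":" << r.beginUs
            << ",\"dur\":" << (r.endUs - r.beginUs)
            << ",\"pid\":1,\"tid\":" << r.tid << "}";
    }
    out << "]";
    return out.str();
}
```

The resulting file can be opened in Chrome's trace viewer, and two such files from different builds can be diffed method by method.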

Best practices

Overall approach

The overall approach to optimization comes down to four steps:

  1. Delete the startup item, which is the most direct fix
  2. If it cannot be deleted, try to defer it: either run it on first access or find a suitable time after startup to warm it up
  3. If it cannot be deferred, try concurrency, using multiple cores and threads
  4. If concurrency does not work, try to make the code itself run faster

The rest of this part is divided around main: first the optimizations before and after main, then how to optimize Page In, and finally some unconventional optimizations that place high demands on the architecture.

Before main

The startup process before main is as follows:

  • Load dyld
  • Create the launch closure (needed after an app update or a phone restart)
  • Load dynamic libraries
  • Rebase, bind, and Runtime initialization
  • +load and static initialization

Dynamic libraries

Reducing the number of dynamic libraries reduces both the time spent creating the launch closure and the time spent loading the libraries. It is recommended to keep the number of dynamic libraries below 6.

The recommended approach is to convert dynamic libraries to static libraries, which also reduces package size. Another approach is to merge dynamic libraries, but in practice this is rarely feasible. Finally, do not link libraries you do not need (including system libraries), because they slow down closure creation.

Removing unused code

Removing unused code reduces the time spent on rebase, bind, and Runtime initialization. So how do you find code that is no longer needed and remove it? There are two approaches: static scanning and production statistics.

The simplest static scan is based on AppCode, but AppCode indexes very slowly on large projects. Another kind of static scan is based on Mach-O:

  • __objc_selrefs and __objc_classrefs store the selectors and classes that are referenced
  • __objc_classlist lists all classes, from which all of their selectors can be reached

The set difference of the two tells you which classes and selectors are unused. However, Objective-C supports runtime (dynamic) calls, so the result needs a second check before anything is deleted.
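Once the class names have been extracted from the two sections (the extraction itself is not shown and would need a Mach-O parser), the difference step is a plain set difference. `unreferencedClasses` is a hypothetical helper and the class names in the usage are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Classes that are defined in the binary but never statically referenced:
// the contents of __objc_classlist minus the contents of __objc_classrefs.
std::vector<std::string> unreferencedClasses(
    const std::set<std::string>& classList,
    const std::set<std::string>& classRefs) {
    std::vector<std::string> unused;
    std::set_difference(classList.begin(), classList.end(),
                        classRefs.begin(), classRefs.end(),
                        std::back_inserter(unused));
    return unused;
}
```

The output is only a candidate list: a class created through NSClassFromString or referenced from a storyboard shows up here even though it is used, which is why the second check matters.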

The other way to find unused code is to collect statistics in production. There are three main methods:

  • ViewController usage rate: collected by hooking the corresponding lifecycle methods
  • Class usage rate: traverse all classes at runtime and check the Objective-C Runtime flag that indicates whether a class has been accessed
  • Line-level usage rate: requires compile-time instrumentation, which hurts both package size and execution speed

The first two have a high ROI, and class-level usage data is sufficient most of the time.

+load migration

Besides the time taken by the methods themselves, +load also causes a large number of Page Ins. +load is also a threat to the app's stability, because crashes inside it cannot be captured.

For example, many DI containers need to bind protocols to classes, so the registration has to happen early in startup (in +load):

+ (void)load {
    [DICenter bindClass:IMPClass.class toProtocol:@protocol(SomeProtocol)];
}

With clang attributes, this process can be migrated to compile time:

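typedef struct {
    const char *cls;
    const char *protocol;
} _di_pair;
#if DEBUG  // the macro name DI_SERVICE and the initializer are illustrative reconstructions
#define DI_SERVICE(PROTOCOL_NAME, CLASS_NAME) \
__used static Class<PROTOCOL_NAME> _DI_VALID_METHOD(void) { return [CLASS_NAME class]; } \
__attribute((used, section(_DI_SEGMENT "," _DI_SECTION))) static _di_pair _DI_UNIQUE_VAR = { #CLASS_NAME, #PROTOCOL_NAME };
#else
#define DI_SERVICE(PROTOCOL_NAME, CLASS_NAME) \
__attribute((used, section(_DI_SEGMENT "," _DI_SECTION))) static _di_pair _DI_UNIQUE_VAR = { #CLASS_NAME, #PROTOCOL_NAME };
#endif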

The principle is simple: the macro provides the interface, the class and protocol names are written into a specific section of the binary at compile time, and at runtime the pairs are read back to find out which class a protocol is bound to.

You may have noticed a seemingly useless method, _DI_VALID_METHOD. It exists only in debug builds, where it lets the compiler check that the class really conforms to the protocol.

Static initializer migration

Static initializers, like +load methods, cause a lot of Page Ins. They usually come from C++ code, such as networking or effects libraries. Others are introduced through header files, which can be confirmed by looking at the preprocessed output.

A few typical migration approaches:

  • Convert std::string to const char *
  • Move static variables inside functions, because a function-local static is initialized the first time the function is called

// Before: initialized at load time
namespace {
    static const std::string bucket[] = {"apples", "pears", "meerkats"};
}
const std::string GetBucketThing(int i) {
    return bucket[i];
}

// After: initialized on the first call
std::string GetBucketThing(int i) {
    static const std::string bucket[] = {"apples", "pears", "meerkats"};
    return bucket[i];
}

After main

Launcher

Startup needs a framework to control it. Douyin uses a lightweight, centralized solution:

  • There is a configuration repository for startup tasks, which contains only the order and the thread of each task
  • Each business repository implements the BootTask protocol to declare that something is a startup task

The startup task execution process is as follows:

Why is a launcher needed?

  • Global concurrent scheduling: for example, tasks A and B run concurrently and task C waits for both to finish. Scheduling through the framework also reduces the number of threads and keeps priorities under control
  • Deferred execution: the launcher gives the business suitable moments to initialize in a warm-up fashion
  • Fine-grained monitoring: the time taken by every task can be monitored, which also benefits offline automated monitoring
  • Control: changes to the order of startup tasks, and additions and deletions, can all be gated through code review
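The "A and B run concurrently, C waits for both" scheduling can be sketched as a planning step that groups tasks into waves. This is an illustration rather than Douyin's launcher: `BootTask` here is a plain struct that merely borrows the protocol's name, `planWaves` is hypothetical, and actually dispatching each wave onto threads is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

struct BootTask {
    std::string name;                    // assumed unique
    std::vector<std::string> dependsOn;  // names of tasks that must finish first
};

// Groups tasks into waves. Every task in a wave depends only on tasks in
// earlier waves, so the tasks within one wave may run concurrently.
std::vector<std::vector<std::string>> planWaves(const std::vector<BootTask>& tasks) {
    std::set<std::string> done;
    std::vector<std::vector<std::string>> waves;
    while (done.size() < tasks.size()) {
        std::vector<std::string> wave;
        for (const BootTask& t : tasks) {
            if (done.count(t.name)) continue;
            const bool ready = std::all_of(
                t.dependsOn.begin(), t.dependsOn.end(),
                [&](const std::string& d) { return done.count(d) > 0; });
            if (ready) wave.push_back(t.name);
        }
        if (wave.empty()) {
            throw std::invalid_argument("cyclic, unknown, or duplicate task dependency");
        }
        done.insert(wave.begin(), wave.end());
        waves.push_back(wave);
    }
    return waves;
}
```

Keeping the dependency data in one central configuration is what makes this kind of global planning possible, and it is also the place where a reviewer can see every change to startup order.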

Third-party SDKs

Some third-party SDKs take a lot of time at startup. For example, after Douyin removed Fabric, PCT50 startup time improved by about 70 ms.

Besides removal, many SDKs can be deferred, such as sharing and login SDKs. You can also evaluate an SDK's impact on startup performance before integrating it. If the impact is significant, report it to the SDK provider and ask for changes. Providers of paid SDKs in particular are usually willing to cooperate.

High-frequency methods

Some methods take little time individually but are called many times on the startup path, so their cumulative cost is not low. Reading configuration from Info.plist is an example:

+ (NSString *)plistChannel {
    return [[[NSBundle mainBundle] infoDictionary] objectForKey:@"CHANNEL_NAME"];
}

The fix is simply to add an in-memory cache. Problems of this kind usually become visible in Time Profiler when a longer time range is selected.
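The in-memory cache can be as small as a function-local static. The sketch below shows the idea in C++ rather than Objective-C; `readChannelFromPlist` is a hypothetical stand-in for the expensive lookup, and `lookupCount` exists only to show how often it runs.

```cpp
#include <cassert>
#include <string>

static int lookupCount = 0;

// Stand-in for the expensive Info.plist lookup (hypothetical).
std::string readChannelFromPlist() {
    ++lookupCount;
    return "AppStore";
}

// The function-local static is computed on the first call only;
// its initialization is thread-safe since C++11.
const std::string& cachedChannel() {
    static const std::string channel = readChannelFromPlist();
    return channel;
}
```

However many times `cachedChannel()` is called on the startup path, the underlying lookup runs once. In Objective-C the same effect is usually achieved with a static variable guarded by `dispatch_once`.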

Locks

Locks affect startup time because a child thread sometimes takes a lock first, and the main thread then has to wait for the child thread to release it. Be aware, too, that the system has many hidden global locks, such as those in dyld and the Runtime. Here is an example:

This is a main-thread block caused by UIImage imageNamed:

As the stack on the right shows, imageNamed triggers dlopen, and dlopen waits for dyld's global lock. With the Thread State events in System Trace, you can find the next event after the thread was blocked. That event marks the thread becoming runnable again, which happens because another thread released the lock:

By analyzing what the background thread was doing at that moment, we learn why the lock was held and how to optimize it.

Number of threads

Both the number and the priority of threads affect startup time. Priority is set through QoS: tasks on child threads that the main thread has to wait for should be set to User Interactive or User Initiated.

// GCD queue
dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, QOS_CLASS_UTILITY, -1);
dispatch_queue_t queue = dispatch_queue_create("com.custom.utility.queue", attr);

// NSOperationQueue
operationQueue.qualityOfService = NSQualityOfServiceUtility;

The number of high-priority threads should not exceed the number of CPU cores, which can be checked with System Load in System Trace.

The number of threads also affects startup time, but it is hard to control threads globally on iOS. For example, background threads started by internal and third-party libraries are hard to control, whereas threads used by business code can be controlled through startup tasks.

Having many threads is not a problem in itself, as long as not many of them execute concurrently. You can use System Trace to check the time spent on context switches and determine whether the number of threads is a startup bottleneck.

Images

Startup inevitably uses a lot of images. Is there a way to reduce the time spent loading them?

Manage images with asset catalogs instead of putting them directly in the bundle. Asset catalogs are optimized at compile time so that loading is faster, and loading an image from an asset catalog is also faster than loading it from the bundle, because UIImage imageNamed has to search the bundle to find the image. When loading from an asset catalog, most of the time is spent on the first image, because the index has to be built. Putting the images used at startup into a small, separate asset catalog reduces this cost.

Creating a UIImage requires IO, and the image is decoded when the first frame is rendered. This time can be saved by preloading (creating the UIImages) on a child thread in advance.

As shown below, images are only used in the later "RootWindow creation" and "first frame render" stages, so a preloading startup task can be started on a child thread early in the startup process.


Fishhook

Fishhook is a library for hooking C functions, but the first call into it is expensive, so it is best kept out of production builds. As shown in the following figure, Fishhook traverses multiple Mach-O segments to find the mapping between function pointers and symbol names, which has the side effect of triggering a lot of Page Ins. For large apps, this takes 200 ms+ on a cold start on the iPhone X.

If you have to use Fishhook, call it from a child thread, and do not call it directly inside the _dyld_register_func_for_add_image callback. That callback runs while dyld's global mutex is held. System libraries frequently call dlsym and dlopen on the main thread during startup, and those calls need the same lock internally, which produces exactly the situation described above: a child thread blocking the main thread.

First frame rendering

Apps differ in their business form, and the ways to optimize first frame rendering differ accordingly. A few common optimization points:

  • LottieView: Lottie is Airbnb's library for playing After Effects animations, but loading the animation JSON and reading the images is slow. You can show a static image first and start the animation after startup, or fill Lottie's cache with the images and JSON on a child thread in advance
  • Lazy initialization: do not create views up front and set them to hidden, which is a bad habit
  • AutoLayout: AutoLayout is also relatively expensive, but this code usually carries a lot of historical baggage. Evaluate the ROI to decide whether to switch to frames
  • Loading animation: apps usually have a loading animation. It is better not to use a GIF for it: measured offline, loading a 60-frame GIF takes close to 70 ms

Other tips

A few other tips to keep in mind for startup optimization:

Do not delete the tmp/ directory, because iOS 13+ stores the launch closure there. If it is deleted, the closure is re-created on the next launch, and creating a closure is slow. Next is IO optimization: common approaches are using mmap to make IO faster, or preloading data early in startup.

Some other points take significantly longer on the iPhone 6:

  • WebView User Agent: fetch it on the first launch and cache it, then refresh the cache at the end of each launch
  • KeyChain: fetch it lazily or preload it
  • VolumeView: remove it

The iPhone 6 is a watershed below which performance falls off a cliff. On the iPhone 6 you can switch off some user interactions in exchange for the core experience (remember to verify with an A/B test).

Page In time

The startup path triggers a large number of Page Ins. Is there a way to reduce this time?

Section renaming

The App Store encrypts the TEXT segment of an uploaded app, and decryption happens on Page In, which is time-consuming. To avoid this cost, the ld option -rename_section can be used to move content out of the TEXT segment into other segments.

Douyin's renaming scheme:

This optimization only helps on systems below iOS 13: iOS 13 optimized the decryption process so that Page In no longer needs decryption, which is one of the reasons startup is faster on iOS 13.

Binary rearrangement

Since the startup path triggers a large number of Page Ins, is there a way to reduce their number?

Startup exhibits locality: only a small number of functions are used during startup, and they are scattered across the binary, so the data brought in by each Page In is poorly utilized. If the functions used at startup are laid out contiguously in the binary, the number of Page Ins goes down and startup time improves:

In the following diagram, methods 1 and 3 are used at startup, and executing them requires two Page Ins. If methods 1 and 3 are placed next to each other, only one Page In is needed, which speeds up startup.

The linker ld has an option, -order_file, that lays out the binary in the order of the symbols listed in a file. There are two main ways to obtain the symbols used at startup:

  • Douyin's approach: a static scan finds +load methods and C++ static initializers, and hooking objc_msgSend collects the Objective-C symbols.
  • Facebook's approach: LLVM function instrumentation collects the startup-path symbols during grayscale, and the order_file is generated from the symbols used by most users.

Facebook's LLVM function instrumentation was built specifically for order_file generation, and the code they developed for LLVM has since been merged into LLVM's main branch.

Facebook's approach is more fine-grained and its order_file is the optimal solution, but the workload is large. Douyin's approach does not require compiling from source or changing the existing build environment and process, so it is the least invasive. The drawback is that it only covers about 90% of the symbols.
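Whichever way the startup symbols are collected, the last step is to write them out one per line, in first-use order and without duplicates, which is the form -order_file expects. The sketch below shows that step only. `makeOrderFile` is a hypothetical helper and the trace in the usage is made up.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Builds order_file contents from a startup call trace: one symbol per line,
// ordered by first call, with repeated calls dropped.
std::string makeOrderFile(const std::vector<std::string>& startupTrace) {
    std::unordered_set<std::string> seen;
    std::string contents;
    for (const std::string& symbol : startupTrace) {
        if (seen.insert(symbol).second) {  // true only on the first occurrence
            contents += symbol;
            contents += '\n';
        }
    }
    return contents;
}
```

For a trace of `+[AAMonitor load]`, `_main`, `-[AppDelegate application:didFinishLaunchingWithOptions:]`, `_main`, the output contains the first three symbols in that order, and the second `_main` is dropped.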

Grayscale is a stage that every optimization should make good use of: many new optimization schemes carry uncertainty and need to be verified in grayscale first.

Unconventional solutions

Dynamic library lazy loading

Earlier we mentioned that the amount of code can be reduced by removing it. Is there a way to reduce the amount of code loaded at startup without removing any? The answer is lazy loading of dynamic libraries.

What is a lazily loaded dynamic library? An ordinary dynamic library is linked directly or indirectly by the main binary, so it is loaded at startup. If a library is only packaged into the app and not linked, it is not loaded automatically at startup. When its contents are needed at runtime, it is loaded manually, on demand.

Lazily loading dynamic libraries requires changes at both compile time and runtime.

Dynamic libraries such as A.framework are lazily loaded because they are not linked, directly or indirectly, by the main binary. Dynamic libraries inevitably share some dependencies, and packaging those into Shared.framework solves the common-dependency problem.

At runtime the library is loaded through -[NSBundle load], which in essence calls the underlying dlopen. So when should the manual load be triggered?

Dynamic libraries fall into two types: business and functional. A business library has a UI entry point, so the loading logic can be consolidated inside the router. Callers then do not need to know that the library is lazily loaded, and fault tolerance is easier to handle. Functional libraries (such as qr.framework) are a bit different: they have no UI entry point and need to maintain their own wrapper:

  • The app depends directly on the wrapper, so callers do not know that the dynamic library is lazily loaded
  • The wrapper encapsulates the dynamic-call logic and invokes the library through dlsym and similar mechanisms

Besides reducing the amount of code loaded at startup, lazy loading also guards against long-term startup regressions caused by newly added business code, because business initialization happens on first access.

The scheme has other benefits too: after the conversion to dynamic libraries, local compile time and other metrics improve considerably. The downside is some sacrifice in package size, but lazily loaded libraries can be optimized with segment compression to offset the loss.

Background Fetch

Background Fetch lets the system launch the app in the background at intervals. For time-sensitive apps (such as news apps), data can be refreshed in the background, which speeds up feed loading and improves the user experience.

So why does this kind of "background keep-alive" improve startup speed? Consider a typical case:

  1. The system launches the app in the background
  2. The background app is killed because of memory pressure
  3. The user launches the app right away. Because the caches are still there, this is a hot launch
  4. The system launches the app in the background again
  5. This time the user taps the app while it is still alive in the background, so this launch is simply a return from background to foreground

These two typical scenarios show why Background Fetch improves startup speed:

  • It raises the ratio of hot launches to cold launches
  • A return from background to foreground is counted as a launch, because from the user's point of view it is one

Background launch has some caveats: DAU statistics, advertising, and even A/B experiment grouping logic are affected and need a lot of adaptation. A launcher is often needed to support this, because tasks that normally run in didFinishLaunching must be deferred, on a background launch, until the app first returns to the foreground.


Summary

Finally, here are a few points that we consider important in any optimization:

  • Optimize as a white box: know why something is slow and which parts have been optimized.
  • Production data is the compass of optimization and the only way to measure its effect. A/B experiments are recommended for verifying the impact on the business.
  • Do not forget to build regression prevention, especially when the business iterates quickly. Otherwise optimization may not keep pace with regressions.
  • Invest in architecture for the long term. A good architecture protects basic performance such as startup over time.

Join us

We are the team responsible for research and development of the Douyin client's basic capabilities and for exploring new technologies. We focus on engineering/business architecture, R&D tools, build systems, and related areas, supporting rapid business iteration while ensuring the R&D efficiency and engineering quality of a very large team. We continuously explore performance and stability, and strive to provide the best possible basic experience for hundreds of millions of users around the world.

If you are passionate about technology, you are welcome to join Douyin's basic technology team and help build an app used by hundreds of millions of people. We are currently hiring in Shanghai, Beijing, Hangzhou, and Shenzhen. For an internal referral, please contact [email protected] with the email subject: name - years of work - Douyin - Basic technology - iOS/Android.

Welcome to Bytedance Technical Team

Resume mailing address: [email protected]
