APM (Application Performance Monitoring) is the practice of monitoring and managing the performance and availability of software applications, and it is critical to keeping an application running continuously and stably. This article looks at APM from the perspective of an iOS App: how to monitor performance accurately, how to report the data, and related technical points.

App performance is one of the most important factors affecting user experience. Performance problems include crashes, network request errors or timeouts, slow UI response, main-thread lag, high CPU and memory usage, and high power consumption. Most of these problems are caused by developers misusing thread locks, system functions, programming conventions, or data structures. The key to solving them is to find and locate the problem as early as possible.

This article first summarizes what causes these performance problems and how to collect the relevant data. Once collected, APM data is uploaded to a server according to certain policies by a data reporting mechanism; the server consumes the data and produces reports. A separate article summarizes how to build a flexible, configurable, and powerful data reporting component.

1. Lag monitoring

Lag (stuttering) happens when the main thread cannot respond to user interaction in time. It directly affects the user experience, so App lag monitoring is an important part of APM.

The number of frames rendered per second (FPS) is 60 on iPhone and 120 on some iPad models, and FPS is often used as a reference metric for lag monitoring. Why only a reference? Because on its own it is not accurate. First, let's see how to obtain FPS. CADisplayLink is a system timer that fires at the same rate as the screen refresh, created with [CADisplayLink displayLinkWithTarget:self selector:@selector(p_displayLinkTick:)]. Take a look at the example code below.

_displayLink = [CADisplayLink displayLinkWithTarget:self selector:@selector(p_displayLinkTick:)];
[_displayLink setPaused:YES];
[_displayLink addToRunLoop:[NSRunLoop currentRunLoop] forMode:NSRunLoopCommonModes];
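The snippet above only schedules the link; the tick callback still has to turn ticks into an FPS figure. Below is a minimal, platform-neutral sketch of that bookkeeping in C — the `FPSCounter` type and `fps_tick` function are illustrative names, not system API, and timestamps are assumed to be positive.

```c
#include <assert.h>

/* Sketch of what a CADisplayLink tick callback would do: count ticks,
   and once at least one second of link timestamps has accumulated,
   emit frames / elapsed and reset the window. */
typedef struct {
    double lastTime; /* timestamp when the current window opened (0 = not started) */
    int    count;    /* frames seen in the current window */
} FPSCounter;

/* Returns the measured FPS when a >= 1 second window completes, else -1. */
double fps_tick(FPSCounter *c, double timestamp) {
    if (c->lastTime == 0) {      /* first tick: open the window */
        c->lastTime = timestamp;
        return -1;
    }
    c->count++;
    double delta = timestamp - c->lastTime;
    if (delta < 1.0) return -1;  /* window not finished yet */
    double fps = c->count / delta;
    c->lastTime = timestamp;     /* start the next window */
    c->count = 0;
    return fps;
}
```

Feeding this 60 evenly spaced ticks per second yields roughly 60 FPS, which is exactly why the number is only a reference: it tells you frames were produced, not why one was missed.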

As the code shows, the CADisplayLink object is added to a mode of the specified RunLoop. But counting frames only reflects the CPU side; perceived smoothness is the result of the whole rendering pipeline: CPU + GPU. Keep reading.

1. Screen rendering principle

Let's start with how old CRT displays work. The CRT electron gun scans line by line from top to bottom; when the scan is complete the monitor presents a frame, and the gun returns to its initial position for the next scan. To synchronize the display with the system's video controller, the display (or other hardware) generates a series of timing signals using a hardware clock. When the gun moves to a new line, ready to scan, the monitor emits a horizontal synchronization signal, or HSync; when a frame has been drawn and the gun has returned to its starting position, just before the next frame, the display emits a vertical synchronization signal, or VSync. The display refreshes at a fixed rate, which is the frequency of the VSync signal. Although today's displays are mostly LCD screens, the principle remains the same.

Typically, a picture on the screen is produced by the CPU, GPU, and display working together as described above. The CPU computes what needs to be displayed (view creation, layout calculation, image decoding, text drawing, etc.) according to the code the engineer wrote, then submits the result to the GPU, which performs layer composition and texture rendering and writes the rendered result into the frame buffer. The video controller then reads the frame buffer line by line, paced by the VSync signal, and sends the data through digital-to-analog conversion to the display.

With only one frame buffer, reading from and writing to the buffer interfere with each other. To solve this efficiency problem, display systems introduce a second buffer: the double buffering mechanism. The GPU pre-renders one frame into a buffer for the video controller to read; once the next frame is rendered into the other buffer, the GPU simply points the video controller at that second buffer. Efficiency improves.

Double buffering improves efficiency but introduces a new problem: if the GPU submits a newly rendered frame to the other frame buffer and repoints the video controller while the controller has not yet finished reading, i.e. the screen content is only partially displayed, the video controller will show the lower half of the new frame on the screen, tearing the picture.

To solve this, GPUs usually have a mechanism called VSync. With VSync enabled, the GPU waits for the VSync signal from the video controller before rendering a new frame and updating the frame buffer. This mechanism fixes tearing and improves smoothness, but it requires more computing resources.

Answering a question

With VSync enabled, the GPU waits until the video controller sends the VSync signal before it renders a new frame and updates the frame buffer.

Imagine the display showing the first and second frames. With double buffering, the GPU first renders frame one into a frame buffer and points the video controller at that buffer, displaying the first frame. After the first frame has been displayed, the video controller sends the VSync signal; on receiving it, the GPU renders the second frame and points the video controller at the second frame's buffer.

So it looks as if the second frame only starts rendering after the first frame has been displayed and the VSync signal has been sent. Is that really true? 😭 What do you think? Of course not 🐷 — otherwise double buffering would be pointless.

The answer is below.

When the first VSync signal arrives, a frame is rendered into a frame buffer but not displayed. When the second VSync signal arrives, the first rendered result is read (the video controller's pointer points at the first frame buffer), while a new frame is rendered into the second frame buffer. When the third VSync signal arrives, the contents of the second frame buffer are read (the pointer moves to the second frame buffer), while the third frame is rendered into the first frame buffer, and the cycle repeats.
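The rotation described above can be modeled in a few lines. This toy sketch (every name here is invented for illustration) tracks, for each VSync, which buffer the video controller scans out and which buffer the GPU renders into; with `NUM_BUFFERS` set to 2 it reproduces the double-buffer cycle in the text, and raising it to 3 models triple buffering.

```c
#include <assert.h>

#define NUM_BUFFERS 2   /* 2 = double buffering; 3 = triple buffering */

typedef struct {
    int displayed[16]; /* buffer index scanned out on each vsync (-1 = none yet) */
    int rendered[16];  /* buffer index the GPU rendered into on each vsync */
} VsyncTrace;

void run_vsyncs(VsyncTrace *t, int n) {
    for (int v = 0; v < n && v < 16; v++) {
        /* the GPU renders frame v into the next buffer in the ring */
        t->rendered[v] = v % NUM_BUFFERS;
        /* the display shows the frame rendered one vsync earlier;
           on the very first vsync nothing has been completed yet */
        t->displayed[v] = (v == 0) ? -1 : (v - 1) % NUM_BUFFERS;
    }
}
```

Running six VSyncs shows the pattern from the text: frame one is rendered on the first signal but shown on the second, while the GPU is already filling the other buffer.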

For more details (access may require a VPN), see the Wikipedia article on multiple buffering.

2. Causes of lag

When the VSync signal arrives, the system's graphics service notifies the App through CADisplayLink and similar mechanisms, and the App's main thread starts computing the display content on the CPU (view creation, layout calculation, image decoding, text drawing, etc.). The computed content is then submitted to the GPU, which transforms, composites, and renders the layers, submits the result to the frame buffer, and waits for the next VSync signal to display it. Under the VSync mechanism, if the CPU or GPU has not finished its work within one VSync cycle, that frame is discarded and displayed at the next opportunity, while the screen keeps showing the previously rendered image. This is how CPU or GPU overload causes the interface to stall.

Current iOS devices use a double buffering mechanism, and some use triple buffering; mainstream Android devices now use triple buffering, having used a single buffer in the early days. Below is an illustration of iOS triple buffering.

Many things consume CPU and GPU resources: frequent object creation, property adjustment, file reading, view hierarchy adjustment, layout calculation (with AutoLayout, the more views there are, the harder the linear equations are to solve), image decoding (think of loading large images), image rendering, text rendering, database access (choosing optimistic vs. pessimistic locking for read/write scenarios), lock usage (for example, misusing spin locks wastes CPU), and so on. Developers rely on experience to find the optimal solution for each (which is not the focus of this article).

3. How does APM monitor lag and report it?

CADisplayLink-based FPS alone is definitely not enough; it is for reference only. In general there are two ways to monitor lag: listening to RunLoop state callbacks, and having a child thread ping the main thread.

3.1 Monitoring RunLoop state

A RunLoop listens for input sources and dispatches their handling: networks, input devices, periodic or delayed events, asynchronous callbacks, and so on. A RunLoop receives events from two kinds of sources: input sources that deliver asynchronous events, usually messages from another thread or a different application, and timer sources that deliver synchronous events at a scheduled time or repeating interval.

The RunLoop status is shown below

Step 1: Notify observers that the RunLoop is about to enter the loop, then enter the loop.

if (currentMode->_observerMask & kCFRunLoopEntry)
    // Notify observers: RunLoop is about to enter the loop
    __CFRunLoopDoObservers(rl, currentMode, kCFRunLoopEntry);
// Enter the loop
result = __CFRunLoopRun(rl, currentMode, seconds, returnAfterSourceHandled, previousMode);

Step 2: Keep the thread alive with a do-while loop. Notify observers that the RunLoop is about to trigger Timer callbacks, then Source0 callbacks, then execute the added blocks.

if (rlm->_observerMask & kCFRunLoopBeforeTimers)
    // Notify observers: RunLoop is about to trigger Timer callbacks
    __CFRunLoopDoObservers(rl, rlm, kCFRunLoopBeforeTimers);
if (rlm->_observerMask & kCFRunLoopBeforeSources)
    // Notify observers: RunLoop is about to trigger Source callbacks
    __CFRunLoopDoObservers(rl, rlm, kCFRunLoopBeforeSources);
// Execute the added blocks
__CFRunLoopDoBlocks(rl, rlm);

Step 3: After the RunLoop triggers the Source0 callbacks, if a Source1 is ready, it jumps to handle_msg to process the message.

// If there is a port-based Source1 in the ready state, process it directly and jump to handle_msg
if (MACH_PORT_NULL != dispatchPort && !didDispatchPortLastTime) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI
    msg = (mach_msg_header_t *)msg_buffer;
    if (__CFRunLoopServiceMachPort(dispatchPort, &msg, sizeof(msg_buffer), &livePort, 0, &voucherState, NULL)) {
        goto handle_msg;
    }
#elif DEPLOYMENT_TARGET_WINDOWS
    if (__CFRunLoopWaitForMultipleObjects(NULL, &dispatchPort, 0, 0, &livePort, NULL)) {
        goto handle_msg;
    }
#endif
}

Step 4: After the callbacks have fired, notify observers that the RunLoop is about to go to sleep.

Boolean poll = sourceHandledThisLoop || (0ULL == timeout_context->termTSR);
// Notify observers: the RunLoop's thread is about to sleep
if (!poll && (rlm->_observerMask & kCFRunLoopBeforeWaiting))
    __CFRunLoopDoObservers(rl, rlm, kCFRunLoopBeforeWaiting);
__CFRunLoopSetSleeping(rl);

Step 5: After going to sleep, the thread waits for a mach_port message to wake it again. Only the following four events can wake it:

  • A port-based Source1 event arrives
  • A Timer fires
  • The RunLoop's own timeout expires
  • The RunLoop is explicitly woken by its caller
do {
    if (kCFUseCollectableAllocator) {
        // objc_clear_stack(0);
        // <rdar://problem/16393959>
        memset(msg_buffer, 0, sizeof(msg_buffer));
    }
    msg = (mach_msg_header_t *)msg_buffer;
    __CFRunLoopServiceMachPort(waitSet, &msg, sizeof(msg_buffer), &livePort, poll ? 0 : TIMEOUT_INFINITY, &voucherState, &voucherCopy);
    if (modeQueuePort != MACH_PORT_NULL && livePort == modeQueuePort) {
        // Drain the internal queue. If one of the callout blocks sets the timerFired flag, break out and service the timer.
        while (_dispatch_runloop_root_queue_perform_4CF(rlm->_queue));
        if (rlm->_timerFired) {
            // Leave livePort as the queue port, and service timers below
            rlm->_timerFired = false;
            break;
        } else {
            if (msg && msg != (mach_msg_header_t *)msg_buffer) free(msg);
        }
    } else {
        // Go ahead and leave the inner loop.
        break;
    }
} while (1);

Step 6: When woken, notify observers that the RunLoop's thread has just been woken up.

// Notify observers: the RunLoop's thread has just been woken up
if (!poll && (rlm->_observerMask & kCFRunLoopAfterWaiting))
    __CFRunLoopDoObservers(rl, rlm, kCFRunLoopAfterWaiting);
// Handle messages
handle_msg:;
__CFRunLoopSetIgnoreWakeUps(rl);

Step 7: Once awake, the RunLoop processes the message that woke it:

  • If a Timer fired, trigger the Timer's callback
  • If it was dispatch, execute the block
  • If it was a Source1 event, process that event
#if USE_MK_TIMER_TOO
// If a Timer has fired, handle the Timer callback
else if (rlm->_timerPort != MACH_PORT_NULL && livePort == rlm->_timerPort) {
    CFRUNLOOP_WAKEUP_FOR_TIMER();
    // On Windows, we have observed an issue where the timer port is set before the time which we requested it to be set. For example, we set the fire time to be TSR 167646765860, but it is actually observed firing at TSR 167646764145, which is 1715 ticks early. The result is that, when __CFRunLoopDoTimers checks to see if any of the run loop timers should be firing, it appears to be 'too early' for the next timer, and no timers are handled.
    // In this case, the timer port has been automatically reset (since it was returned from MsgWaitForMultipleObjectsEx), and if we do not re-arm it, then no timers will ever be serviced again unless something adjusts the timer list (e.g. adding or removing timers). The fix for the issue is to reset the timer here if CFRunLoopDoTimers did not handle a timer itself. 9308754
    if (!__CFRunLoopDoTimers(rl, rlm, mach_absolute_time())) {
        // Re-arm the next timer
        __CFArmNextTimerInMode(rlm, rl);
    }
}
#endif
// If there are blocks dispatched to the main queue, execute them
else if (livePort == dispatchPort) {
    CFRUNLOOP_WAKEUP_FOR_DISPATCH();
    __CFRunLoopModeUnlock(rlm);
    __CFRunLoopUnlock(rl);
    _CFSetTSD(__CFTSDKeyIsInGCDMainQ, (void *)6, NULL);
#if DEPLOYMENT_TARGET_WINDOWS
    void *msg = 0;
#endif
    __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__(msg);
    _CFSetTSD(__CFTSDKeyIsInGCDMainQ, (void *)0, NULL);
    __CFRunLoopLock(rl);
    __CFRunLoopModeLock(rlm);
    sourceHandledThisLoop = true;
    didDispatchPortLastTime = true;
}
// If a port-based Source1 emitted an event, handle the event
else {
    CFRUNLOOP_WAKEUP_FOR_SOURCE();
    // If we received a voucher from this mach_msg, then put a copy of the new voucher into TSD. CFMachPortBoost will look in the TSD for the voucher. By using the value in the TSD we tie the CFMachPortBoost to this received mach_msg explicitly without a chance for anything in between the two pieces of code to set the voucher again.
    voucher_t previousVoucher = _CFSetTSD(__CFTSDKeyMachMessageHasVoucher, (void *)voucherCopy, os_release);
    CFRunLoopSourceRef rls = __CFRunLoopModeFindSourceForMachPort(rl, rlm, livePort);
    if (rls) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI
        mach_msg_header_t *reply = NULL;
        sourceHandledThisLoop = __CFRunLoopDoSource1(rl, rlm, rls, msg, msg->msgh_size, &reply) || sourceHandledThisLoop;
        if (NULL != reply) {
            (void)mach_msg(reply, MACH_SEND_MSG, reply->msgh_size, 0, MACH_PORT_NULL, 0, MACH_PORT_NULL);
            CFAllocatorDeallocate(kCFAllocatorSystemDefault, reply);
        }
#elif DEPLOYMENT_TARGET_WINDOWS
        sourceHandledThisLoop = __CFRunLoopDoSource1(rl, rlm, rls) || sourceHandledThisLoop;
#endif
    }
}

Step 8: Decide whether to enter the next loop based on the current RunLoop state. When it is stopped externally or the loop times out, it does not continue; otherwise it enters the next iteration.

// The entry parameters said to return after handling an event
if (sourceHandledThisLoop && stopAfterHandle) {
    retVal = kCFRunLoopRunHandledSource;
}
// The loop timed out
else if (timeout_context->termTSR < mach_absolute_time()) {
    retVal = kCFRunLoopRunTimedOut;
}
// The RunLoop was stopped externally
else if (__CFRunLoopIsStopped(rl)) {
    __CFRunLoopUnsetStopped(rl);
    retVal = kCFRunLoopRunStopped;
}
else if (rlm->_stopped) {
    rlm->_stopped = false;
    retVal = kCFRunLoopRunStopped;
}
// There are no sources or timers left in the mode
else if (__CFRunLoopModeIsEmpty(rl, rlm, previousMode)) {
    retVal = kCFRunLoopRunFinished;
}

The complete annotated RunLoop source is here. RunLoop uses Source1 to receive system events from mach ports and Source0 to handle user events; after a Source1 receives a system event, its handler essentially ends up calling the Source0 event handler.

The six RunLoop states

typedef CF_OPTIONS(CFOptionFlags, CFRunLoopActivity) {
    kCFRunLoopEntry,          // about to enter the loop
    kCFRunLoopBeforeTimers,   // about to trigger Timer callbacks
    kCFRunLoopBeforeSources,  // about to trigger Source0 callbacks
    kCFRunLoopBeforeWaiting,  // about to sleep, waiting for a mach_port message
    kCFRunLoopAfterWaiting,   // just woken up by a mach_port message
    kCFRunLoopExit,           // about to exit the loop
    kCFRunLoopAllActivities   // all state changes
};

A RunLoop can block its thread in two places: before it goes to sleep, if executing a method takes too long, or after it wakes up, if handling the received message takes too long so it cannot proceed to the next step. If this happens on the main thread, the user sees a lag.

So if the kCFRunLoopBeforeSources state (before sleep) or the kCFRunLoopAfterWaiting state (after waking) does not change within a set time threshold, we can judge the main thread to be stuck. At that point we dump the stack to reconstruct the crime scene, and then solve the lag.

Start a child thread that continuously monitors the main RunLoop's state and considers it stuck when the threshold is exceeded n times in a row. It then dumps the stack and reports it (with some policy; data handling is covered in the next part).

WatchDog has different values in different states.

  • Launch: 20s
  • Resume: 10s
  • Suspend: 10s
  • Quit: 6s
  • Background: 3 min (before iOS 7 an app could request 10 min; it was later reduced to 3 min; by repeatedly requesting, up to 10 min)

The lag threshold is set with the WatchDog mechanism in mind: the APM threshold must be smaller than WatchDog's, so values in the 1–6 second range are common.

long dispatch_semaphore_wait(dispatch_semaphore_t dsema, dispatch_time_t timeout) can be used to check whether the main thread is blocked: it returns zero on success, or non-zero if the timeout occurred. A non-zero return therefore means the main thread was blocked past the timeout.

Why kCFRunLoopBeforeSources and kCFRunLoopAfterWaiting? Because most lags happen between kCFRunLoopBeforeSources and kCFRunLoopAfterWaiting — for example, in-app events of type Source0.

The RunLoop detection flow chart is as follows:

The key code is as follows:

// Create an observer context
CFRunLoopObserverContext context = {0, (__bridge void *)self, NULL, NULL};
_observer = CFRunLoopObserverCreate(kCFAllocatorDefault, kCFRunLoopAllActivities, YES, 0, &runLoopObserverCallBack, &context);
// Add the new observer to the main thread's RunLoop
CFRunLoopAddObserver(CFRunLoopGetMain(), _observer, kCFRunLoopCommonModes);
_semaphore = dispatch_semaphore_create(0);
__weak __typeof(self) weakSelf = self;
// Monitor the duration of each RunLoop state on a background queue
dispatch_async(dispatch_get_global_queue(0, 0), ^{
    __strong __typeof(weakSelf) strongSelf = weakSelf;
    if (!strongSelf) {
        return;
    }
    while (YES) {
        if (strongSelf.isCancel) {
            return;
        }
        // Non-zero means the semaphore was not signaled within the threshold
        long semaphoreWait = dispatch_semaphore_wait(self->_semaphore, dispatch_time(DISPATCH_TIME_NOW, strongSelf.limitMillisecond * NSEC_PER_MSEC));
        if (semaphoreWait != 0) {
            if (self->_activity == kCFRunLoopBeforeSources || self->_activity == kCFRunLoopAfterWaiting) {
                if (++strongSelf.countTime < strongSelf.standstillCount) {
                    continue;
                }
                // The main thread stalled: dump the stack and upload data to the server according to policy.
                // Stack dumps are covered below; reporting is covered in "Building a powerful, flexible, configurable data reporting component"
                // (https://github.com/FantasticLBP/knowledge-kit/blob/master/Chapter1%20-%20iOS/1.80.md)
            }
        }
        strongSelf.countTime = 0;
    }
});

3.2 A child thread pinging the main thread

Start a child thread, create a semaphore with initial value 0 and a Boolean flag initialized to YES. Dispatch a task to the main thread that sets the flag to NO, then let the child thread sleep for the threshold time. When it wakes, it checks whether the flag was successfully set to NO: if not, the main thread is stuck, so it dumps the stack and, combined with the data reporting mechanism, uploads the data to the server according to policy. Data reporting is covered in the article on building a powerful, flexible, configurable data reporting component.

while (self.isCancelled == NO) {
    @autoreleasepool {
        __block BOOL isMainThreadNoRespond = YES;
        dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);
        dispatch_async(dispatch_get_main_queue(), ^{
            isMainThreadNoRespond = NO;
            dispatch_semaphore_signal(semaphore);
        });
        [NSThread sleepForTimeInterval:self.threshold];
        if (isMainThreadNoRespond) {
            if (self.handlerBlock) {
                self.handlerBlock(); // dump the stack and report
            }
        }
        dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);
    }
}

4. Stack dumps

Getting the method stack is the tricky part, so let's think it through. [NSThread callStackSymbols] returns the call stack of the current thread, but when a lag is detected from a child thread there is no way to use it to get the main thread's stack: you cannot fetch another thread's call stack from an arbitrary thread with that API. Time for a quick review.

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a thread in a computer program. It is also called the execution stack, program stack, control stack, runtime stack, or machine stack. The call stack tracks the point to which each active subroutine should return control when it finishes executing.

A Wikipedia search for "Call Stack" turns up the diagram and example below. The figure represents a stack divided into several stack frames, each corresponding to one function call. The blue part at the bottom is the DrawSquare function, which during its execution calls the DrawLine function, shown in green.

A stack frame consists of three parts: function parameters, the return address, and local variables. First, when DrawLine is called from inside DrawSquare, the arguments DrawLine needs are pushed. Second comes the return address (control information: when function A calls function B, the address of the instruction after the call in A is the return address). Third, the function's local variables are also stored on the stack.

The Stack Pointer indicates the current top of the stack. Most operating systems grow the stack downward, so the stack pointer holds the lowest in-use address. The address the Frame Pointer points to stores the caller's saved frame pointer, with the return address saved alongside it.

In most operating systems, each stack frame also holds the frame pointer of the previous stack frame. So, knowing the current frame's Stack Pointer and Frame Pointer, we can recursively walk all the way down to the bottom of the stack.
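That recursion can be sketched without any platform API by modeling just the two fields the unwinder reads at each frame pointer. The `Frame` struct below is a toy stand-in for the real stack layout (real code would start from the FP register in `_STRUCT_MCONTEXT`):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of a stack frame: the caller's saved frame pointer
   plus a return address -- exactly what the stack walk reads. */
typedef struct Frame {
    struct Frame *previous;   /* saved frame pointer of the caller (NULL at the bottom) */
    void         *returnAddr; /* where control returns after this call */
} Frame;

/* Walk from the innermost frame down the chain, collecting return
   addresses -- this list of addresses is the backtrace.
   Returns the number of frames captured. */
int backtrace_from_fp(Frame *fp, void **out, int max) {
    int n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->returnAddr;
        fp = fp->previous;     /* follow the saved frame pointer */
    }
    return n;
}
```

Symbolication is a separate step: each collected address is later mapped back to a function name via the binary's symbol table.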

The next step is to get the Stack Pointer and Frame Pointer of every thread, then walk backwards frame by frame to reconstruct the crime scene.

5. Mach Task knowledge

Mach task:

When an App runs it has a corresponding Mach Task, and that task may have multiple threads executing simultaneously. OS X and iOS Kernel Programming describes a task as: a container object through which virtual memory space and other resources are managed, including devices and other handles. In short, a Mach Task is a machine-independent abstraction of a thread's execution environment.

A task can be understood as a process; it contains a list of its threads.

task_threads: returns all threads of target_task in the act_list array, with act_listCnt giving the count.

kern_return_t task_threads
(
  task_t target_task,
  thread_act_array_t *act_list,         // list of thread pointers
  mach_msg_type_number_t *act_listCnt   // number of threads
)

thread_info:

kern_return_t thread_info
(
  thread_act_t target_act,
  thread_flavor_t flavor,
  thread_info_t thread_info_out,
  mach_msg_type_number_t *thread_info_outCnt
);

How to get thread stack data:

kern_return_t task_threads(task_inspect_t target_task, thread_act_array_t *act_list, mach_msg_type_number_t *act_listCnt); retrieves all threads, but the threads it returns are the lowest-level Mach threads.

For each thread, kern_return_t thread_get_state(thread_act_t target_act, thread_state_flavor_t flavor, thread_state_t old_state, mach_msg_type_number_t *old_stateCnt); retrieves its full state, filled into a parameter of type _STRUCT_MCONTEXT. Two of this function's parameters vary by CPU architecture, so macros are needed to hide the differences between CPUs.

The _STRUCT_MCONTEXT structure stores the current thread's Stack Pointer and the Frame Pointer of the topmost stack frame, from which the whole thread call stack can be traced.

However, the above methods operate on kernel threads, while what we work with is NSThread, so we need to map kernel threads to NSThreads.

The "p" in pthread stands for POSIX, the Portable Operating System Interface. Every system has its own threading model, and different systems have different thread-manipulation APIs, so POSIX provides the abstract pthread and its associated APIs. These APIs are implemented differently on different operating systems but accomplish the same functionality.

The system provides task_threads and thread_get_state to operate on kernel threads, and each kernel thread is uniquely identified by an ID of type thread_t. A pthread's unique identifier is of type pthread_t. Converting between kernel threads and pthreads (thread_t and pthread_t) is easy, because pthreads were designed to abstract kernel threads.

The pthread_create method creates a thread whose entry callback is nsthreadLauncher.

static void *nsthreadLauncher(void* thread)  
{
    NSThread *t = (NSThread*)thread;
    [nc postNotificationName: NSThreadDidStartNotification object:t userInfo: nil];
    [t _setName: [t name]];
    [t main];
    [NSThread exit];
    return NULL;
}

NSThreadDidStartNotification is actually the private string @"_NSThreadDidStartNotification".

<NSThread: 0x...>{number = 1, name = main}

For an NSThread to be matched to its kernel thread, the only one-to-one correspondence is the name. The pthread API pthread_getname_np can also fetch the kernel thread's name; the np suffix stands for "not POSIX" (non-portable), so it cannot be used across platforms.

The approach: store the NSThread's original name, change its name to some random value (e.g. a timestamp), then iterate over the kernel threads' pthread names; when a name matches, that kernel thread corresponds to the NSThread. Finally, restore the thread's original name. The main thread is special: since pthread_getname_np cannot be used for it this way, its thread_t is captured in a load method and compared directly.
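The round-trip at the heart of that trick can be sketched with the POSIX thread-name API. Note the hedge: the two-argument pthread_setname_np below is the Linux (glibc) signature; on Darwin the setter takes only a name and applies to the current thread, and the APM code would compare the read-back name against each thread_t instead.

```c
#define _GNU_SOURCE          /* pthread_{set,get}name_np on glibc */
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Give the calling thread a unique throwaway name, read the
   kernel-level thread name back, and check that it matches --
   the same correspondence the NSThread matching relies on.
   Linux truncates thread names to 15 characters. */
int name_roundtrip_matches(const char *tag) {
    char readback[64] = {0};
    pthread_t self = pthread_self();
    if (pthread_setname_np(self, tag) != 0) return 0;  /* set the unique name */
    if (pthread_getname_np(self, readback, sizeof(readback)) != 0) return 0;
    return strcmp(readback, tag) == 0;                 /* found "our" thread */
}
```

In the real monitor the `tag` would be the random timestamp string, and a successful match identifies which kernel thread backs which NSThread before the original name is restored.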

static mach_port_t main_thread_id;  
+ (void)load {
    main_thread_id = mach_thread_self();
}

2. Monitor App startup time

1. Monitor App startup time

App launch time is one of the most important factors affecting user experience, so we need to quantify how fast an App starts. Startup is classified into cold startup and hot startup.

Cold start: the App is not running yet; the whole App must be loaded and constructed and the application initialized. Cold start has the most room for optimization. Cold start timing is measured up to the application:didFinishLaunchingWithOptions: method, where Apps generally initialize their various SDKs and their own state.

Hot start: the App is already running in the background (a common scenario: the user presses the Home button while using the App, then opens it again) and is woken to the foreground by some event. The App receives the foreground transition in applicationWillEnterForeground:.

The idea is simple:

  • In the monitoring class's load method, record the current time
  • Listen for the UIApplicationDidFinishLaunchingNotification posted when App startup completes
  • On receiving the notification, record the current time again
  • The difference between step 1 and step 3 is the App startup time

mach_absolute_time is a CPU/bus-dependent function that returns the number of CPU clock ticks; its value does not advance while the system sleeps. The two tick values captured before and after need to be converted to seconds, which requires a timebase obtained from mach_timebase_info.

mach_timebase_info_data_t g_apmmStartupMonitorTimebaseInfoData = {0, 0};
mach_timebase_info(&g_apmmStartupMonitorTimebaseInfoData);
uint64_t timelapse = mach_absolute_time() - g_apmmLoadTime;
double timeSpan = (timelapse * g_apmmStartupMonitorTimebaseInfoData.numer) / (g_apmmStartupMonitorTimebaseInfoData.denom * 1e9);
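The conversion in isolation: ticks × numer ÷ (denom × 1e9) yields seconds. In the check below, the numer/denom pair 125/3 is only an assumed example ratio for illustration, not a value to rely on across devices.

```c
#include <assert.h>
#include <stdint.h>

/* mach_absolute_time ticks -> seconds, using the timebase that
   mach_timebase_info would fill in on a real device. */
double ticks_to_seconds(uint64_t ticks, uint32_t numer, uint32_t denom) {
    return ((double)ticks * numer) / ((double)denom * 1e9);
}
```

With numer=125 and denom=3, exactly 24,000,000 ticks equal one second (24e6 × 125 / 3e9 = 1.0), which is why skipping the timebase and treating ticks as nanoseconds gives wildly wrong numbers on such hardware.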

2. Online monitoring measures startup time; optimizing it happens during development.

To optimize startup time, you need to know exactly what happens during the startup phase and address each part accordingly.

The pre-main stage is defined as the period from App launch until the system calls the main function; the main stage runs from the main function entry to the first screen's viewDidAppear.

App startup process:

  • Parse Info.plist: load related information such as the launch screen; set up the sandbox; check permissions
  • Load the Mach-O: for a fat binary, find the slice matching the current CPU architecture; load all dependent Mach-O files (recursively invoking the Mach-O loading procedure); fix up internal and external pointer references, such as strings and functions; load category methods; load C++ static objects and call Objective-C +load() methods; execute functions declared __attribute__((constructor))
  • Run the program: call main(); call UIApplicationMain(); call applicationWillFinishLaunching

The pre-main stage

The main stage

2.1 Loading dylibs

For every dynamic library it loads, dyld needs to:

  • Analyze the dynamic libraries it depends on
  • Find the Mach-O file for the dynamic library
  • Open the file
  • Validate the file
  • Register the file's signature with the system kernel
  • Call mmap() for each segment of the dynamic library

Optimization:

  • Reduce non-system library dependencies
  • Use static libraries instead of dynamic ones
  • Merge non-system dynamic libraries into one dynamic library

2.2 Rebase && Binding

Optimization:

  • Reduce the number of Objc classes, reduce the number of selectors, and remove unused classes and functions
  • Reduce the number of c++ virtual functions
  • Switch to Swift struct (essentially reducing the number of symbols)

2.3 Initializers

Optimization:

  • Use +initialize instead of +load
  • Do not use __attribute__((constructor)) to explicitly mark a method as an initializer; instead let it run when the initialization method is first called, e.g. with dispatch_once, pthread_once(), or std::call_once(). In other words, initialize at first use, deferring part of the work, and try not to use C++ static objects
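The advice above in miniature: guard one-time setup with pthread_once so the cost is paid at first use rather than at launch. dispatch_once and std::call_once follow the same pattern; the names here are illustrative.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_initCount = 0;

/* Expensive one-time setup that would otherwise run in an
   __attribute__((constructor)) during launch. */
static void do_setup(void) {
    g_initCount++;
}

/* Callers pay the setup cost only on the first call;
   later calls just read the already-initialized state. */
int shared_resource(void) {
    pthread_once(&g_once, do_setup);
    return g_initCount;
}
```

No matter how many times (or from how many threads) shared_resource is called, do_setup runs exactly once, and it runs only if the resource is actually used.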

2.4 Influencing factors of pre-main stage

  • The more dynamic libraries are loaded, the slower they start.
  • The more ObjC classes, the more functions, the slower the startup.
  • The larger the executable, the slower it starts.
  • The more C constructor functions there are, the slower the startup.
  • The more C++ static objects, the slower the startup.
  • The more + loads of ObjC, the slower it starts.

Optimization means:

  • Reduce dependence on unnecessary libraries, both dynamic and static; if possible, convert dynamic libraries into static libraries; if you must rely on dynamic libraries, merge multiple non-system dynamic libraries into one
  • Check whether each framework should be Required or Optional. If the framework exists on every iOS version the App currently supports, set it to Required; otherwise set it to Optional, bearing in mind that Optional incurs some additional checks
  • Merge or delete OC classes and functions. To clear unused classes from the project, use AppCode's code inspection to find classes not used anywhere (you can also analyze the linkmap file, but the accuracy is not high)

There is an open source project called FUI that does a very good job of identifying classes that are no longer used, with high accuracy. Its only problem is that it cannot handle classes provided by dynamic and static libraries, or C++ class templates.

  • Remove useless static variables
  • Delete methods that are never called or are obsolete
  • Defer anything that does not have to happen in +load to +initialize. Try not to use C++ virtual functions (building the vtable has a cost)
  • Keep class and method names short: every class and method name is stored as a string in the __cstring section, so name length affects the executable size

Because of Objective-C's dynamic nature, the Objective-C object model stores class and method names as strings, since runtime reflection looks up the class or method to call by name.

  • Replace __attribute__((constructor)) functions, C++ static object initialization, and ObjC +load methods with dispatch_once()
  • Compressing images down to the smallest size the designer will accept is a bonus.

Why does compressing images speed up startup? Because loading ten or twenty images, large and small, at launch is perfectly normal; smaller images mean less IO, so startup is naturally faster. A reliable compression tool is TinyPNG.

2.5 Main phase optimization

  • Reduce the work done during launch initialization. Load lazily whatever can be lazy, move to background initialization whatever can run in the background, delay whatever can be delayed so it does not block the main thread during startup, and directly delete the code of business features that are already offline
  • Optimize code logic. Eliminate unnecessary logic and code to reduce the time consumed by each process
  • The startup phase uses multithreading to initialize, maximizing CPU performance
  • Use pure code rather than XIBs or storyboards to describe the UI, especially the main UI framework such as the TabBarController. XIBs and storyboards still have to be parsed into code before the page can be rendered, which is an extra step.
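The "move initialization to the background" advice above can be sketched with a plain worker thread; in a real app this would be dispatch_async onto a background queue, and the function names here are mine:

```c
#include <assert.h>
#include <pthread.h>

/* Sketch: run non-critical launch work on a background thread so it does
 * not block first-frame rendering on the main thread. */
static int noncritical_done = 0;

static void *noncritical_init(void *arg) {
    (void)arg;
    noncritical_done = 1;   /* stand-in for warming caches, prefetching config */
    return NULL;
}

int launch(void) {
    pthread_t worker;
    if (pthread_create(&worker, NULL, noncritical_init, NULL) != 0) return -1;
    /* ... the main thread would build the first screen here, in parallel ... */
    pthread_join(worker, NULL);  /* joined only to make this sketch deterministic;
                                    a real app would not block on it */
    return noncritical_done;
}
```

The main thread stays free to build the first screen while the worker runs, which is the "maximize CPU performance with multithreading" point in practice.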

2.6 Speeding up startup with binary reordering

What is a page fault? When a page of virtual memory is accessed but the corresponding physical page does not exist (has not yet been loaded into physical memory), a page fault occurs. Handling it takes time, on the order of milliseconds.

When do a large number of page faults occur? When an application is first launched.

The code required for startup is scattered across page 1, page 2, page 3… of the virtual memory. This has a large impact on startup time, so the solution is to pack the code needed at startup onto as few pages as possible (binary reordering), avoiding page faults and optimizing App startup time.
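The arithmetic behind this is simple: the fewer pages a byte range spans, the fewer first-touch faults it can cause. A small sketch (the function name is mine; page size is passed in, since it is 16 KB on modern iOS devices but varies elsewhere):

```c
#include <assert.h>

/* How many VM pages the byte range [start, start + len) spans. Startup code
 * scattered across many pages costs a page fault on each first touch;
 * binary reordering packs that code onto as few pages as possible. */
long pages_spanned(unsigned long start, unsigned long len, unsigned long page_size) {
    if (len == 0 || page_size == 0) return 0;
    unsigned long first = start / page_size;
    unsigned long last = (start + len - 1) / page_size;
    return (long)(last - first + 1);
}
```

For example, 1000 bytes of startup code that happen to straddle a page boundary touch two pages, while the same 1000 bytes packed inside one page touch only one.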

Binary reordering speeds up App startup by reducing page faults, each of which can cost milliseconds.

The time when an App takes a large number of page faults is first launch, so the optimization is to gather the methods that affect App startup onto one page or a few pages of virtual memory. An Xcode project lets developers specify an Order File, and the linker lays symbols out in the order listed in that file. You can verify the layout in the linkmap file (set the Order File and Write Link Map File options in Build Settings in Xcode).
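For reference, an Order File is just a plain text list of symbols, one per line, in the order the linker should lay them out. A hypothetical example (these symbol names are illustrative, not from a real project):

```
# app.order — passed to the linker via Build Settings ▸ Order File
+[AppDelegate load]
_main
-[AppDelegate application:didFinishLaunchingWithOptions:]
-[HomeViewController viewDidLoad]
```

Symbols listed in the file are placed first and in this order; symbols not listed keep their default layout.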

In fact, the difficulty is how to capture the methods actually called during startup. The code may be Swift, blocks, C, or OC, so a runtime hook will definitely not cover everything and fishhook will not either; Clang instrumentation (compile-time stubs) can meet the need.
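A sketch of the Clang instrumentation approach: compiling with `-fsanitize-coverage=func,trace-pc-guard` makes the compiler insert a call to `__sanitizer_cov_trace_pc_guard` at every function entry, regardless of whether the function is Swift, a block, C, or OC. Recording the PCs in call order (and symbolicating them, e.g. with dladdr) yields the symbol list for the Order File. The callback bodies below are my own minimal version:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Called once per instrumented module at load time; assigns each guard a
 * nonzero id so we can tell instrumented entry points apart. */
void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint32_t n = 0;
    for (uint32_t *g = start; g < stop; g++) {
        if (*g == 0) *g = ++n;
    }
}

/* Called at every instrumented function entry. The return address is the PC
 * inside the function that just started executing. */
void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    if (*guard == 0) return;        /* this function was already recorded */
    void *pc = __builtin_return_address(0);
    fprintf(stderr, "hit guard %u at pc %p\n", *guard, pc);
    *guard = 0;                     /* record each function only once */
}
```

With the flag enabled, running the app through its launch path and dumping the recorded, symbolicated PCs in order produces exactly the startup-ordered symbol list the previous section's Order File needs.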

3. CPU usage monitoring

1. The CPU architecture

Central Processing Unit (CPU). The mainstream architectures in the market include ARM (ARM64), Intel (x86), and AMD. Among them, Intel uses Complex Instruction Set Computer (CISC) and ARM uses Reduced Instruction Set Computer (RISC). The difference lies in different CPU design concepts and methods.

Early CPUs all used the CISC architecture, designed to complete the required computing task with the fewest possible machine-language instructions. Take multiplication as an example: on a CISC CPU, a single instruction MUL ADDRA, ADDRB can multiply the numbers at memory addresses ADDRA and ADDRB and store the result in ADDRA. Reading the data from ADDRA and ADDRB into registers and writing the product back to memory all depend on the CPU's internal design, so the CISC architecture increases CPU complexity and manufacturing requirements.

The RISC architecture requires software to spell out the steps. For example, the multiplication above is implemented as MOVE A, ADDRA; MOVE B, ADDRB; MUL A, B; STR ADDRA, A. This architecture reduces CPU complexity and allows more powerful CPUs to be produced at the same level of process technology, but it demands more of the compiler design.

The current market reality is that most iPhones are based on the ARM64 architecture, and ARM has low power consumption.

2. Obtain thread information

How do you monitor CPU usage?

  • Start a timer and run the following logic repeatedly at the configured interval
  • Get the current task, then get all thread information from it (thread count, thread array)
  • Check whether the CPU usage of any thread exceeds the preset threshold
  • If a thread's usage exceeds the threshold, dump its stack
  • Assemble the data and report it

Thread information structure

```objectivec
struct thread_basic_info {
    time_value_t    user_time;      /* user run time */
    time_value_t    system_time;    /* system run time */
    integer_t       cpu_usage;      /* scaled cpu usage percentage (max 1000) */
    policy_t        policy;         /* scheduling policy in effect */
    integer_t       run_state;      /* run state */
    integer_t       flags;          /* various flags */
    integer_t       suspend_count;  /* suspend count for thread */
    integer_t       sleep_time;     /* number of seconds that thread
                                       has been sleeping */
};
```
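Note that cpu_usage is not a plain percentage: it is scaled so that TH_USAGE_SCALE (1000 on Darwin) represents 100%. A small conversion helper makes the threshold comparison explicit (the function name is mine):

```c
#include <assert.h>

/* thread_basic_info.cpu_usage is scaled: TH_USAGE_SCALE (1000) means 100%.
 * Convert to a plain percentage before comparing against the alert
 * threshold. */
float cpu_usage_percent(int scaled_usage) {
    const int kUsageScale = 1000;   /* TH_USAGE_SCALE on Darwin */
    return scaled_usage * 100.0f / kUsageScale;
}
```

This is why the monitoring code below divides cpu_usage by 10 before comparing it with a percentage threshold.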

Stack dumping was discussed in the stack restoration section; don't forget to review that analysis before reading the code below.

```objectivec
thread_act_array_t threads;
mach_msg_type_number_t threadCount = 0;
const task_t thisTask = mach_task_self();
kern_return_t kr = task_threads(thisTask, &threads, &threadCount);
if (kr != KERN_SUCCESS) {
    return;
}
for (int i = 0; i < threadCount; i++) {
    thread_info_data_t threadInfo;
    thread_basic_info_t threadBaseInfo;
    mach_msg_type_number_t threadInfoCount = THREAD_INFO_MAX; // must be initialized to the buffer capacity
    kern_return_t kr = thread_info((thread_inspect_t)threads[i], THREAD_BASIC_INFO, (thread_info_t)threadInfo, &threadInfoCount);
    if (kr == KERN_SUCCESS) {
        threadBaseInfo = (thread_basic_info_t)threadInfo;
        // 1. Skip idle threads; cpu_usage is scaled to 1000, so /10 yields a percentage
        if (!(threadBaseInfo->flags & TH_FLAGS_IDLE)) {
            integer_t cpuUsage = threadBaseInfo->cpu_usage / 10;
            if (cpuUsage > CPUMONITORRATE) {
                NSMutableDictionary *CPUMetaDictionary = [NSMutableDictionary dictionary];
                NSData *CPUPayloadData = [NSData data];
                NSString *backtraceOfAllThread = [BacktraceLogger backtraceOfAllThread];
                CPUMetaDictionary[@"MONITOR_TYPE"] = APMMonitorCPUType;
                // 2. Payload: a JSON object whose STACK_TRACE key holds the Base64-encoded stack trace
                NSData *CPUData = [SAFE_STRING(backtraceOfAllThread) dataUsingEncoding:NSUTF8StringEncoding];
                NSString *CPUDataBase64String = [CPUData base64EncodedStringWithOptions:0];
                NSDictionary *CPUPayloadDictionary = @{@"STACK_TRACE": SAFE_STRING(CPUDataBase64String)};
                NSError *error;
                // NSJSONWritingOptions must be 0: the server splits records on \n,
                // and passing 0 produces a JSON string without \n
                NSData *parsedData = [NSJSONSerialization dataWithJSONObject:CPUPayloadDictionary options:0 error:&error];
                if (error) {
                    APMMLog(@"%@", error);
                    return;
                }
                CPUPayloadData = [parsedData copy];
                // 3. Report the data, as described in "Build a powerful, flexible, configurable data reporting component"
                //    (https://github.com/FantasticLBP/knowledge-kit/blob/master/Chapter1%20-%20iOS/1.80.md)
                [[HermesClient sharedInstance] sendWithType:APMMonitorCPUType meta:CPUMetaDictionary content:CPUPayloadData];
            }
        }
    }
}
```

This article is too long, so it is split into several chapters; please see the other chapters for the full, coherent version.