Requirements

Logs are essential for troubleshooting online issues. Many problems are intermittent: on the same system version and the same device, the user can reproduce an issue while developers cannot, even following the same steps on the same hardware. Such problems cannot simply remain unresolved, so we use logs to track them down. A fault may come from the backend or from client logic; recording logs at key points lets us locate it quickly.

Suppose we have a million daily users and 1% of them hit a problem, whether a business error or a playback failure rather than a crash. That is 10,000 users, which is a lot of users. And most users who run into a problem do not contact customer service; they simply switch to another platform.

Although we now have Kibana network monitoring, it can only tell us whether a network request went wrong: whether the user hit the server at a given time and whether the data the server returned was correct. To locate problems in business logic, the client still needs to record its own logs.

The status quo

The project already had a logging system, but it suffered from two problems, one business-facing and one technical. On the business side, users had to manually export logs and send them to customer service, which disturbed users unnecessarily; most users refused the request and never exported anything. On the technical side, the existing logging code was messy and performed so poorly that we did not dare keep logging enabled online, because it caused the player to stutter.

In addition, the current system only records logs proactively in the DEBUG environment; it is disabled online. When a problem occurs online, the user has to enable recording manually, and recording lasts only three minutes. Because of all these problems, people rarely use the logs, and online issues are hard to troubleshoot.

The project design

Approach

Given these problems, I planned to build a new logging system to replace the existing one. Its positioning is simple: it purely records business logs. It does not record crashes or analytics events ("buried points"); those can be future extensions. The system records three types of logs: business logs, player logs, and network logs.

Log collection uses an active retrieval policy. A user's UID is entered on the log platform, and a retrieval command is delivered to that user's device over a long-lived connection. After receiving the command, the client filters logs by the given criteria, writes them into separate files by day, compresses them, and uploads them to the backend.

On the log platform you can search for logs by specified conditions and download the files to view them. To make the logs easy for developers to read, the logs extracted from the database are written out as .txt files, and those files are what get uploaded.

API design

The logging API should be simple enough that the business layer can use it just like NSLog. So I designed the API as macro definitions whose call style is identical to NSLog: a very simple call.

#if DEBUG
#define SVLogDebug(frmt, ...)   [[SVLogManager sharedInstance] mobileLogContent:(frmt), ##__VA_ARGS__]
#else
#define SVLogDebug(frmt, ...)   NSLog((frmt), ##__VA_ARGS__)
#endif

There are three types of logs: business logs, player logs, and network logs. Each type has its own macro, and each macro writes a different type value to the database, so user logs can be filtered by type. A sketch of the macro variants follows the list.

  • Business log: SVLogDebug.
  • Player log: SVLogDebugPlayer.
  • Network log: SVLogDebugQUIC.
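For illustration, the other two macros might be declared along these lines, funneling into the same manager with a type parameter. This is a minimal sketch: the SVLogType enum and the logType:content: selector are my assumptions, not the project's actual API.

// Hypothetical type enum mirroring the three log categories (illustrative).
typedef NS_ENUM(NSInteger, SVLogType) {
    SVLogTypeBusiness,  // SVLogDebug
    SVLogTypePlayer,    // SVLogDebugPlayer
    SVLogTypeQUIC       // SVLogDebugQUIC
};

// Assumed macro variants; each writes a different type value to the database.
#define SVLogDebugPlayer(frmt, ...) \
    [[SVLogManager sharedInstance] logType:SVLogTypePlayer content:(frmt), ##__VA_ARGS__]
#define SVLogDebugQUIC(frmt, ...) \
    [[SVLogManager sharedInstance] logType:SVLogTypeQUIC content:(frmt), ##__VA_ARGS__]

Call sites then read just like NSLog, e.g. SVLogDebugPlayer(@"first frame rendered, vid=%@", vid).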

Eviction strategy

Writing to the database is not enough; we also need an eviction strategy. The strategy has to balance the volume of logs kept against their usefulness: enough logs to troubleshoot problems, but not so many that they eat disk space. Uploaded logs are therefore deleted immediately after upload. Beyond that, the eviction rules are as follows.

  1. A log is kept for at most three days; anything older is deleted. The check runs after the application starts, on a background thread.
  2. Logs also have a maximum total size. When the threshold is exceeded, logs are deleted oldest first. We set the threshold to 200MB, which is normally not exceeded. A sketch of both rules follows this list.
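Here is a minimal sketch of the two eviction rules, written against the raw sqlite3 C API for brevity (the project itself goes through WCDB). The table name `log`, the `createTime` column semantics, and the one-tenth deletion step are assumptions for illustration; only the three-day window and 200MB threshold come from the rules above.

// Rule 1: delete logs older than three days (createTime assumed to be epoch seconds).
static void SVEvictExpiredLogs(sqlite3 *db) {
    NSTimeInterval cutoff = [NSDate date].timeIntervalSince1970 - 3 * 24 * 60 * 60;
    NSString *sql = [NSString stringWithFormat:
        @"DELETE FROM log WHERE createTime < %f;", cutoff];
    sqlite3_exec(db, sql.UTF8String, NULL, NULL, NULL);
}

// Rule 2: when the database exceeds 200MB, delete the oldest rows first.
static void SVEvictOverflowLogs(sqlite3 *db, unsigned long long fileSizeInBytes) {
    const unsigned long long kMaxBytes = 200ULL * 1024 * 1024; // 200MB threshold
    if (fileSizeInBytes <= kMaxBytes) return;
    // Drop the oldest tenth of the rows per pass (the proportion is illustrative).
    sqlite3_exec(db,
                 "DELETE FROM log WHERE rowid IN "
                 "(SELECT rowid FROM log ORDER BY createTime ASC "
                 "LIMIT (SELECT COUNT(*) / 10 FROM log));",
                 NULL, NULL, NULL);
}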

Recording basic information

Key information also matters during troubleshooting. The user's current network environment and configuration items, for example, affect how the code executes. So we also record some user configuration and network-environment information to ease troubleshooting, but nothing privacy-sensitive such as latitude and longitude.

The database

The old plan

The previous logging solution was implemented with DDLog and had serious performance problems. It wrote logs as NSData: a .txt file was created in the sandbox, a file handle wrote to it locally, and after each write the handle seeked to the end of the file so the next log could be appended there. Handling logs as NSData meant frequent local file writes, and one or more handle objects were kept alive in memory.

Another problem with this approach is that, because it writes binary data directly into local .txt files, there is no way to filter or otherwise query the logs. The extensibility is very poor, so we planned to implement the new logging scheme with a database.

Scheme selection

I compared the mainstream databases on the iOS platform and found that WCDB has the best overall performance, better than FMDB in several respects. Because it is implemented in C++, it also avoids the extra cost of Objective-C message sending and forwarding.

According to WCDB's official comparison with FMDB: FMDB is a thin wrapper around SQLite and performs much like using SQLite directly, while WCDB is deeply optimized on top of SQLCipher, and its overall performance is higher than FMDB's. The performance comparison below comes from the official WCDB documentation.

  • Single reads: WCDB is about 5% slower than FMDB (measured as continuous reads in a for loop).
  • Single writes: WCDB is 28% faster than FMDB (continuous writes in a for loop).
  • Batch writes: the gap is most obvious; WCDB is 180% faster than FMDB (one batch task writing a batch of data).

The data shows that WCDB is much faster than FMDB for writes, and writing is by far the most frequent operation for local logging, so WCDB fits our needs exactly and is the most appropriate choice for the new database. Moreover, WCDB was already used in the project's exposure-tracking module, which had proven the scheme feasible and performant.

Table design

The table design is very simple: just the four fields below, with different log types distinguished by type. New log types can be added by extending the project. Because a database is used, extensibility is good. Illustrative DDL follows the field list.

  • Index: the primary key, used for indexing.
  • Content: the log content.
  • CreateTime: the time the log was created.
  • Type: the log type, used to distinguish the three kinds of logs.
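For reference, illustrative DDL matching this field list. The exact names and column types are assumptions (WCDB generates the schema from model bindings); "index" is quoted because it is an SQL keyword.

CREATE TABLE IF NOT EXISTS log (
    "index"    INTEGER PRIMARY KEY AUTOINCREMENT,  -- primary key
    content    TEXT,                               -- the recorded log line
    createTime REAL,                               -- creation time, epoch seconds
    type       INTEGER                             -- 0 business / 1 player / 2 network
);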

Database optimization

We are a video application whose main features are playback, download, and upload, and all of them record large numbers of logs to ease online troubleshooting. Keeping the database from growing too large was therefore an important point in my design of the logging system.

To gauge the log volume, I ran extensive tests against the three modules: playing video for two days and nights, downloading 40 episodes of a TV series, and uploading several HD videos, accumulating roughly 50,000 log entries. The database folder had grown past 200MB, which is quite large, so the database needed optimization.

Looking inside the database folder, there are three files: db, shm, and wal. The db file itself was not large; it was the database's WAL log file that had grown. So we need to call sqlite3_wal_checkpoint to merge WAL content into the database and shrink the wal and shm files. However, WCDB does not expose a direct checkpoint method; after some investigation, I found that closing the database triggers a checkpoint.

I listen for the terminate notification when the application exits and defer the actual handling as late as possible, which ensures no logs are missed while still closing the database when the program exits. After verification, the optimized database's disk footprint is very small: with 143,987 log entries, the database file is 34.8MB, and the exported logs are 13.6MB, compressing down to 1.4MB.
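A minimal sketch of that shutdown hook. UIApplicationWillTerminateNotification is the standard UIKit notification; the closeDatabase method is a hypothetical wrapper around WCDB's close call, not the project's actual API.

// In the log manager: register once, e.g. from -init.
- (void)registerTerminateObserver {
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(handleWillTerminate:)
                                                 name:UIApplicationWillTerminateNotification
                                               object:nil];
}

- (void)handleWillTerminate:(NSNotification *)note {
    // Closing the database triggers a WAL checkpoint, merging the
    // wal/shm contents back into the main db file and shrinking them.
    [self closeDatabase]; // hypothetical method wrapping WCDB's close
}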

WAL mode

To understand the database more deeply, let's briefly cover WAL mode. WAL (write-ahead logging) was added to SQLite in version 3.7 but is not enabled by default. WCDB for iOS enables WAL automatically and adds some optimizations of its own.
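For reference, enabling WAL on a plain SQLite connection is a single pragma; WCDB issues the equivalent for you when opening a database.

// Switch the connection's journal mode to write-ahead logging.
sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);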

The WAL file optimizes concurrent access from multiple threads. Without it, in the traditional DELETE mode, reads and writes are mutually exclusive: to prevent readers from seeing half-written data, reads must wait until the write completes. The wal file exists to solve this concurrent read/write problem, and the shm file is an index into the wal file.

Comparing SQLite's two commonly used modes, DELETE and WAL, each has its strengths. DELETE reads and writes db pages directly in the same file, so reads and writes exclude each other and concurrency is not supported. WAL appends new db pages, so writing data is fast and concurrent access is supported: a write only appends data and never needs to read the current db page.

Because the db pages DELETE mode touches are discrete, DELETE mode performs much worse on batch writes, which is why WCDB does better there. On the other hand, WAL mode must consult both the db file and the wal file when reading, which hurts read performance somewhat; that is why WAL query performance is lower than DELETE's.

In WAL mode you need to control the number of db pages in the wal file, otherwise the file size grows unchecked. The wal file does not grow forever: by SQLite's design, a checkpoint operation merges wal content into the db file. But queries are blocked while this synchronization runs, so checkpoints cannot run too frequently. WCDB sets a threshold of 1000 pages: when the wal file reaches 1000 pages, it performs a checkpoint.

This 1000 is an empirical value from the WeChat team. Too large a threshold hurts read/write performance and wastes disk space; too small, and checkpoints run so often that they block reads and writes.

#define SQLITE_DEFAULT_WAL_AUTOCHECKPOINT  1000

sqlite3_wal_autocheckpoint(db, SQLITE_DEFAULT_WAL_AUTOCHECKPOINT);

int sqlite3_wal_autocheckpoint(sqlite3 *db, int nFrame){
#ifdef SQLITE_OMIT_WAL
  UNUSED_PARAMETER(db);
  UNUSED_PARAMETER(nFrame);
#else
#ifdef SQLITE_ENABLE_API_ARMOR
  if( !sqlite3SafetyCheckOk(db) ) return SQLITE_MISUSE_BKPT;
#endif
  if( nFrame>0 ){
    sqlite3_wal_hook(db, sqlite3WalDefaultHook, SQLITE_INT_TO_PTR(nFrame));
  }else{
    sqlite3_wal_hook(db, 0, 0);
  }
#endif
  return SQLITE_OK;
}

You can also set a size limit on the journal file via journalSizeLimit. The default is -1, meaning no limit; with a limit set, the portion beyond it will be overwritten. Avoid modifying the wal file by hand, as that may corrupt it.

i64 sqlite3PagerJournalSizeLimit(Pager *pPager, i64 iLimit){
  if( iLimit>=-1 ){
    pPager->journalSizeLimit = iLimit;
    sqlite3WalLimit(pPager->pWal, iLimit);
  }
  return pPager->journalSizeLimit;
}

Command delivery

Logging platform

Logs should be reported without the user noticing and uploaded automatically, with no active cooperation required. Moreover, not every user's logs need to be reported, only those of users with problems, to avoid wasting server storage. To solve both problems we built a log platform that tells the client to upload logs by issuing an upload command.

Our log platform is relatively simple: you enter a UID to send an upload command to that user, and after the client uploads its logs, you can also query them by UID. As shown in the figure above, when sending the command you can choose the log type and time range; the client filters on these parameters after receiving the command, and defaults are used if none are chosen. The same three parameters can also be used when searching.

The log platform is backed by a service. When you click the button to send the upload command, the service posts a JSON message to the long-connection channel containing the parameters above; other fields can be added to it later. Logs are uploaded per day, so you can search by day, and clicking Download lets you preview the log content.
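The delivered message might look like the following. This is a hypothetical shape built from the parameters described above (the time range reuses the example from the upload section below); the actual field names on the wire are not documented here.

{
  "command": "uploadLog",
  "uid": "123456789",
  "logTypes": [0, 1, 2],
  "startTime": "2021-04-21 19:00",
  "endTime": "2021-04-23 21:00"
}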

Long connection channel

To deliver commands we reuse the existing long-lived connection. When a user reports a problem we record the user's UID, and if engineering needs logs to investigate, we issue a command through the log platform.

The command goes to the shared long-connection service backend, which delivers it over the long-connection channel. If the command reaches the client, the client replies with an ACK telling the channel it has received the command, and the channel removes it from its queue. If the user does not have the App open at that moment, the command is re-delivered the next time the user opens the App and connects to the channel.

Pending upload commands stay in the queue for at most three days; after that they lose their timeliness, and the problem has likely been resolved through other means.
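A minimal sketch of the client side of this handshake, under the assumption of a hypothetical channel API (SVLongLinkChannel and both selectors are illustrative, not our actual long-connection SDK):

// Called by the (hypothetical) long-connection channel when a message arrives.
- (void)channel:(SVLongLinkChannel *)channel didReceiveMessage:(NSDictionary *)message {
    if ([message[@"command"] isEqualToString:@"uploadLog"]) {
        // ACK first so the channel can remove the command from its queue.
        [channel sendAckForMessage:message]; // hypothetical API
        // Then filter, compress, and upload logs per the command's parameters.
        [[SVLogManager sharedInstance] uploadLogsWithParameters:message];
    }
}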

Silent push

When the user opens the App, the log command can be delivered over the long-connection channel. But there is another scenario, and it is the most common one: the user does not open the App at all. How to report logs in that case is something we are still exploring.

At the time we also investigated Meituan's log retrieval, whose plan included a silent-push strategy. After investigation, however, we found silent push to be basically meaningless for us: it only covers a few niche scenarios, such as the App having been killed by the system or suspended in the background, which are not common cases for us. I also talked with our push team, whose feedback was that silent push has delivery-rate problems, so we did not adopt the silent-push strategy.

Uploading logs

Sharded upload

When we designed the solution, the backend did not support displaying logs directly; they can only be downloaded as files. So we agreed with the backend to upload log files split by day. For example, if the retrieval range is 19:00 on April 21 through 21:00 on April 23, the logs are split into three files, one per day: the first day's file contains only logs from 19:00 onward, and the last day's file contains only logs up to 21:00.

This is also a sharded upload policy: each day's log file is compressed into a zip and then uploaded. On one hand this keeps the upload success rate up, since overly large files lower the success rate; on the other, it serves as file segmentation. We observed that after zip compression each file stays under 500KB. That 500KB is an empirical value from our earlier video slice uploads, balancing upload success rate against shard count.

Log file names are timestamp combinations, accurate to the minute, to make time filtering on the server easy. Uploads are submitted as form data, and local logs are deleted after a successful upload. On failure we retry: each shard is attempted at most three times, and if all three attempts fail, this upload fails for that shard and it can be uploaded again another time.
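A minimal sketch of the per-shard retry loop. The uploadShard:completion: method is a hypothetical wrapper around the form-data request; only the three-attempt limit and delete-after-success behavior come from the text above.

// Try a single zip shard up to three times before giving up.
- (void)uploadShardAtURL:(NSURL *)zipURL attempt:(NSInteger)attempt {
    [self uploadShard:zipURL completion:^(BOOL success) { // hypothetical wrapper
        if (success) {
            // Shard delivered: its local copy can now be deleted.
            [[NSFileManager defaultManager] removeItemAtURL:zipURL error:nil];
        } else if (attempt < 3) {
            [self uploadShardAtURL:zipURL attempt:attempt + 1];
        }
        // After three failures, leave the shard on disk for a later retry.
    }];
}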

Security

To keep log data safe, upload requests go over HTTPS, but that alone is not enough: the plaintext inside the SSL pipe can still be obtained in other ways. So the transmitted data itself must be encrypted, and when choosing the encryption policy we picked a symmetric cipher for performance.

The symmetric key, however, is delivered via asymmetric encryption, and each upload command carries its own unique key. The client compresses the logs, encrypts the zip, and then shards and uploads the encrypted file. On receipt, the server decrypts the file with the key and unzips it.
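A minimal sketch of the symmetric step using CommonCrypto's AES. Key and IV handling are simplified, and the actual cipher, mode, and key exchange in the project may differ; this only illustrates encrypting the zipped data before sharding.

#import <CommonCrypto/CommonCryptor.h>

// Encrypt zipped log data with AES-256-CBC (PKCS7 padding).
// `key` must be 32 bytes and `iv` 16 bytes, delivered per upload command.
static NSData *SVEncryptLogData(NSData *plain, NSData *key, NSData *iv) {
    size_t outCapacity = plain.length + kCCBlockSizeAES128;
    NSMutableData *out = [NSMutableData dataWithLength:outCapacity];
    size_t outLength = 0;
    CCCryptorStatus status = CCCrypt(kCCEncrypt, kCCAlgorithmAES, kCCOptionPKCS7Padding,
                                     key.bytes, kCCKeySizeAES256, iv.bytes,
                                     plain.bytes, plain.length,
                                     out.mutableBytes, outCapacity, &outLength);
    if (status != kCCSuccess) return nil;
    out.length = outLength;
    return out;
}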

Proactive reporting

After the new log system launched, we found the retrieval success rate was only 40%, because some users went silent after reporting a problem, or never opened the App again afterward. Analyzing how users report problems, there are two main channels. One: the user files feedback from the system Settings, talks with customer service, and then engineering steps in to investigate. Two: the user reports a problem through feedback groups, the App Store review section, operations staff, and similar channels.

Log retrieval applies to both channels, but the first has a specific trigger: the user taps into the feedback interface. So for users reporting problems in that scenario we added proactive reporting: when the user taps the feedback button, the client proactively uploads the logs generated in the three days before that moment. This raised the log reporting success rate to about 90%. With the higher success rate, more teams are motivated to adopt the log module, which makes online troubleshooting easier.

Manual export

Logs can also be exported manually: open the debugging page, select the logs to share, bring up the system share sheet, and share them through whichever channel. Before the new log system, this was how users were asked to export logs for customer service; you can imagine how disruptive that was for users.

Manual export still exists, but it is now only used by development and QA at the debug stage; it is no longer needed for anything online.