background

Meituan Dianping is the world’s leading one-stop lifestyle service platform, providing online and offline e-commerce network to over 600 million consumers and over 4.5 million quality merchants. Meituan Dianping’s business covers more than 200 diversified categories and 2,800 city and county networks, leading the market in catering, food delivery, hotel travel, beauty, family, leisure and entertainment. With a big platform comes a big responsibility. On mobile, how to quickly locate and solve online problems to improve user experience is a great challenge for us. Online occasionally occur one page can’t open, the new activity to rob a single button click no response, can not login, can’t place the order, because the Android fragmentation model, the network environment, ROM, complex and varied operating system version, the local environment, these personalized usage scenario is difficult in local recurrence, feedback and questions are vague description, It is difficult to quickly locate and solve problems. To this end, we developed Holmes, a dynamic log system, hoping that it can help us follow the clues of online bugs like Sherlock Holmes and discover the truth behind them.

Existing solutions

  • Send temporary packages to users for installation
  • QA tries to reproduce the problem
  • Online debug Debugging tool
  • Manual buried point recovery in advance

Disadvantages of existing methods

  • Temporary delivery: the user cooperation process is cumbersome, and it takes a long time to solve the problem
  • QA replay: Try existing models and find it difficult to replay personalized scenes
  • Online debug: The network environment is unstable, and the code is confused. Debugging costs are high, and users cannot accept too much time
  • Manual buried point: the coverage is limited, can not be predicted in advance, and due to the large volume of business, multi-area cooperative development, business types, etc., it is difficult to unified buried point scheme, and in troubleshooting a large number of manual buried point will produce a lot of redundant reported data, looking for a useful buried point information is like looking for a needle in a haystack

Objective demands

  • Get online logs quickly
  • You don’t need a lot of burials or even no burials
  • Accurate problem field logs

implementation

Dynamic logs provide a solution to quickly locate online problems that are difficult to locate. The user’s mobile phone automatically generates the log information about the execution of the method in advance. When the user needs to troubleshoot the user’s problems, the user’s operation path is reproduced by delivering the precise retrieval user logs through signaling. The dynamic log system also supports dynamic code delivery, so as to dynamically analyze the object snapshot at run time, dynamically add buried points and other functions, and analyze user problems in complex scenarios.

Automatic submerged point

Automatic burying point is the online App automatically generates logs, how to automatically generate logs? We insert method for the pile method to record the execution path (the call stack), at the beginning of a method to insert a pile of code, when the run method will record the method signature, process, thread, such as time to form a complete implementation information (here we call TraceLog), will be credited to the DB TraceLog waiting for the signal sent to get data.

Public void onCreate(Bundle Bundle) {if (holmes.isenable (....) ) { Holmes.invoke(....) ; return; } super.onCreate(bundle); setContentView(R.layout.main); }Copy the code

The historical data

Tracelog forms a historical execution path of code. Once an online problem occurs, you can retrieve historical user data for troubleshooting. Tracelog has the following advantages:

  1. Tracelog is automatically generated without manual burying by the developer
  2. Piling covers all business codes, and Tracelog is not restricted by Proguard inline method. Piling is before Proguard, so the piling code will also be inlined after the method is inlined, so that the complete execution path information compared with the original code will be recorded
  3. Retrievable logs can be collected forward and backward based on a central point (for example, if there is no response after clicking the button, only the code execution path after clicking the button is retrieved to analyze the problem), which can avoid reporting a lot of useless logs, reduce the time and complexity of troubleshooting problems

This section describes the working flow of Tracelog

Method run produce method call log first passes through the checker for testing, the checker contains thread detection method and detection (interference) reduce information, filter is similar to regular task thread testing for this thread has been constantly produce logs, methods can be in a certain time the frequency of the detection method calls, to filter out the method called frequently, If the method is not filtered, it performs asynchronous processing. Secondly, it obtains a Tracelog object from the object pool. The Tracelog objects enter the production queue, including assembly time, thread, and serial number, and enter the consumption queue.

Tracelog Data display

Log retrievals are displayed on the Trace platform in chronological order:

The problem summary

Our platform deployment implemented several versions and summarized many cases. It is found that Tracelog analysis can only narrow the scope of problems in most scenarios. However, many problems are determined to be the exception of a method. At this time, it is necessary to know the execution information of the method, for example: Input parameters, current object fields, return values, and other information to determine the execution logic of the code. Only Tracelog can feel as if it is just a foot away from the door. How can I get a snapshot of the memory generated when the method is running? This is the dynamic power of dynamic logging.

Dynamic distributed

Signaling is delivered to the target user, a code is dynamically executed and the result is reported. We use Lua script to obtain the snapshot information of the object when the method is running. Why Lua? The Lua runtime library is very small and can call Java code and the language is concise and easy to understand. Dynamic execution of Lua has three important opportunities: immediate execution, pre-method execution, and post-method execution.

  • Immediate execution: The system immediately executes and reports the result after receiving the signaling
  • Pre-method execution: Lua scripts are executed before a method is executed to dynamically obtain input parameters, object fields, and other information
  • Post-method execution: Lua scripts are executed after a method is executed to dynamically retrieve information about return values, input parameter changes, and object field changes

There are some problems with executing Lua scripts after methods. We only pile before methods. If we pile after methods, we can solve the problem of executing Lua scripts after methods, but this increases the code volume and affects the number of proGuard inline methods.



We use reflection to perform current method, when entering method insert pile code does not directly in front of the execution method of experience through reflection calls himself in a pile of code, thus achieved a dynamic AOP functions can execute the script after method, this method also has a same problem, is there will be a dead loop, The solution to this problem is simply to mark reflection calls when they are performed to avoid the problem of an infinite loop.

What else can we make a script do? In addition to obtaining object snapshot information, it also adds DB query, reporting ordinary text, ShardPreferences query, obtaining Context object, query permissions, adding burial-point to local, uploading files and other comprehensive capabilities, and Lua script is far more than that. Lua scripts can be used to invoke Java methods to simulate code logic, enabling deeper dynamic capabilities.

Display dynamically delivered data

Object data:



Permission Information:



The DB data:

Technical challenges

Dynamic log in the process of development encountered a lot of technical difficulties, we encountered a lot of problems in the implementation of the program, the following is a review of the problems and solutions.

The problem of large data volume

  • The main thread is stuck
  1. Since logs are generated by multiple threads at the same time, thread synchronization security should be considered. The use of synchronized or lock can ensure the safety of synchronization, but at the same time, it also brings about the problem of mutual exclusion between multiple threads, causing the main thread to wait and get stuck. Here, CAS technology solution is used to realize the self-defined data structure, ensuring the safety of thread synchronization and solving the problem of mutual exclusion between multiple threads.
  2. Because too much data is generated, a large amount of I/O is generated when the DB is being stored. As a result, the CPU is occupied for a long time, affecting the CPU usage of other threads. To solve this problem, the first is to adopt thread filtering and method filtering to reduce the generation of useless logs, and reduce the level of processing threads do not compete with the main thread CPU time, and then to reduce the frequency of I/O data batch processing, and the original Delete+ INSERT operation in the database operation to update+ INSERT. Tracelog stores 300,000 pieces of data (about the records that meituan App has used more than six times). If the number reaches 300,000, part of the early data will be deleted and new data will be written. The longer the operation, the more delete operations, and the larger the CPU resource ratio. A practical comparison of database operations shows that it is more efficient to update data directly after 300,000 using update (I won’t go into too much detail here). Our optimizations went from more than 40% CPU to around 20% CPU and then down to less than 10% CPU on low – and mid-range machines.
  • Creating too many objects leads to frequent GC

Log is generated a Tracelog object, a large number of log will result in frequent GC, we use the object pool for this problem to make the object reuse, thus reducing the object creation reduce the frequency of GC, object pooling is similar to the android OS. Message. Obtain () works.

  • Too many interference logs affect analysis problems

We have filtered out most of the interference logs, but there are still some methods that cause interference logs when code is executed frequently. For example: custom View library, log type library, monitoring type library, etc., the log of these methods will affect our DB storage space, resulting in the retention of too much normal method execution path, in this case, it is likely that the development of this concern has been washed out of the log. How to solve this problem? When inserted pile allows developers to configure some filtering or identification rules to decided whether to deal with this method, the thrust increases a binary parameters on pile method, and then according to the configuration of the rules on the corresponding bit is set to 0 or 1, the method performs when only need one exclusive or operation will know whether this method should be recorded, This added recognition has almost no effect on the execution time of the original method. The log generated by this method is exactly what the developers expect. After several tests, our log can retain the complete behavior of the user more than 6 times, and the CPU usage is reduced to less than 5%.

Performance impact

Does staking logging every method have an impact on the code’s method time? Finally, we tested the time consuming and CPU usage of the method respectively on low-end models.

  • The test time affected by the method is between 55 and 65ms on average for 1 million times, and the method execution time is very small
  • The CPU consumption test is within 5%, as shown in the following figure:

  • The test of memory usage is about 56kB, as shown in the following figure:

Object snapshot

When the method is running, the object snapshot is obtained to keep the field log. To extract the object snapshot, it is necessary to clone an object in depth (in order to prevent the object from being changed before the information is completely recorded, which will affect the final judgment of the code execution). There are the following methods for clone object in Java:

  • Implement a Clone interface
  • Implement a serialization interface
  • Use Gson serialization

Both the Clone interface and the serialization interface have the same problem. It is possible that this object does not implement the corresponding interface, so deep Clone cannot be carried out, and the implementation of clone interface also cannot achieve deep Clone. Java serialization has IO problems and low execution efficiency. In the end, only the Gson serialization method will work, but Gson also has many pitfalls. If an object has the same fields as the parent class, Gson will overwrite the fields of the parent class in the sequence. If two objects have scenes that reference each other, then the Gson serialization will simply loop in an infinite loop.

How to solve these problems? Finally, we refer to some open source library schemes and some API of Java system to develop a deep Clone library, and define our own data objects and use Gson to solve the problem of object snapshot. Depth of clone, which make use of the Java system API, to create out the outline of a target object, and then using the reflection copies of all fields in the object to the shell on the object, finally the shell object can form with the original object exactly the same thing, at the same time for Android increase some types of special treatment, In order to improve the speed, basic type, set, map and other system types are quickly processed, and the objects completed by Clone are directly processed by snapshot.

conclusion

Dynamic logs have no cost for service development and do not disturb users. When troubleshooting an online problem, the method execution path may directly reflect the cause of the problem, or at least narrow the scope of the problem code to a method, and then use the dynamic delivery of Lua scripts to determine the location of the problem code. Dynamic distributed function of dynamic logging can also be used as a kind of ability, for other needs dynamic execution code or the basis of dynamic data library, for example: meet some hard to solve the collapse of the scene, in addition to the normal stack information, at the same time also can according to different types of collapse, a dynamic collection and some other auxiliary information to help troubleshoot problems.

The authors introduce

Shao Fei is a senior engineer of Meituan-Dianping. He joined Meituan-Dianping in 2015 and has been responsible for user behavior log SDK, performance monitoring SDK, dynamic log, and service abnormal monitoring

Chen Tong is a senior engineer of Meituan-Dianping. She joined Meituan-Dianping in 2015 and has been responsible for infrastructure development such as code static check, network layer optimization and dynamic log

Li Feng, senior engineer of Meituan-Dianping, joined Meituan-Dianping in 2017 and is now mainly responsible for dynamic log performance optimization

recruitment

The client technology team of meituan platform is recruiting technical experts for a long time. Interested students can send their resumes to fangjintao#meituan.com. For details, please click: Detailed JD








If you answer “thinking questions”, find mistakes in the article, or have questions about the content, you can leave a message to us at the background of wechat public account (Meituan-Dianping technical team). Each week, we will select one “excellent responder” and give a nice small gift. Scan the code to pay attention to us!