1. Why use traffic recording and playback?

1.1 Business status of vivo

In recent years, vivo's Internet business has developed rapidly. Meanwhile, vivo's mobile phone shipments have ranked among the top in China, and after years of accumulation the user base is very large. As a result, many of the applications preinstalled on vivo phones, such as the browser, short video, live streaming, news, and the app store, face high-concurrency traffic and run on complex systems. These user-facing systems place very high demands on user experience, so quality assurance for these services is a top priority.

1.2 Test pain points

As our business grows in scale and complexity, problems and challenges arise. Among them, “How do we ensure that existing services remain correct after the system is modified during iteration, upgrade, or even refactoring?” is one of the big problems we are trying to solve.

Simple business systems can be covered by regular automated testing tools plus manual testing. For complex systems, regression testing becomes a difficult undertaking. Take our recommendation system as an example: one recommendation system serves dozens of recommendation scenarios. How do you change one scenario without affecting the others?

Previously, we solved the problem by writing automated test cases, but there are many pain points in manually written test cases:

  • It is difficult to write test cases, construct data, and simulate real user behavior.

  • Some code logic is difficult to verify with test scripts. For example, when the code sends a message, a script cannot verify that the message content is correct.

  • Manually constructed cases can hardly cover every scenario of the system, so cases are easily omitted.

  • As the complexity of system deployment increases, the cost of environment maintenance is high.

To improve regression-testing efficiency for these complex business systems during iteration, we carried out continuous exploration.

1.3 Scheme Exploration

Based on the characteristics of vivo's Internet systems, we researched a wide range of industry solutions and listed the following requirements: the new solution should be simple and efficient, so that users can use it easily without needing to understand much; the cost of onboarding a business should be low enough to allow rapid regression testing; and the solution should be general and extensible enough to adapt to changing system architectures.

We studied the technical solutions of some leading Internet companies and found that traffic recording and playback is a very good choice. A number of leading companies in the industry have made good progress and delivered real value with this technology, which gave us both reference points and confidence. We therefore explored traffic recording and playback in more depth and put it into production as our Moonlight Box platform.

2. What is traffic recording and playback?

Before introducing specific practices, what is traffic recording and playback?

Traffic recording and playback copies real online traffic (recording) and then replays it as mock requests in a test environment (playback) to verify the code logic. After online traffic is collected and played back in the test environment, each sub-call and the result of the entry call are compared one by one to find out whether the interface code has a problem.

Using this mechanism for regression testing has many advantages. First, replacing hand-written test cases with recorded traffic is simple and efficient, and it easily produces a rich set of test cases. Second, playing back online traffic faithfully simulates real user behavior and avoids the discrepancies of manually written cases. In addition, object-level comparison between recorded data and playback data verifies the system logic more deeply and in finer detail. Finally, recorded traffic needs no maintenance and is ready to use once a service is connected, which is very convenient.

3. Moonlight Box Platform

Traffic recording and playback is excellent in theory, but it is not easy to implement, and there are many problems to solve. The following sections introduce how it is implemented in vivo's Internet systems, the problems we encountered, and how we solved them.

3.1 Underlying Architecture

The vivo Moonlight Box platform builds on the open source JVM-Sandbox-Repeater project, with secondary development and customization on top of it. The platform includes two modules, the server and the Java Agent. The overall architecture is shown in the following figure.

3.1.1 Service Architecture

The following figure shows the overall business architecture of our server, which can be divided into task management, data management, coverage analysis, configuration management, alarm monitoring and other modules.

  • The task management module manages user recording and playback tasks, including task start and stop, task progress, task status, etc.

  • The data management module is used to manage the traffic data recorded and played back by users, and analyze the data.

  • The coverage analysis module collects users' regression coverage metrics.

  • The configuration management module is used to configure the global parameters of the system and applications.

  • The alarm monitoring module is used to analyze Agent performance indicators.

There are also other modules, such as message notification.

3.1.2 Agent architecture

The following figure shows the overall architecture of the Agent module, which is the core of the traffic recording and playback process. The Agent is implemented with bytecode enhancement and consists of four layers:

  • At the bottom is the base container layer, which is the standard Java-Agent implementation.

  • Above the container layer is the dependency layer, which introduces the third-party resources we need and implements bytecode instrumentation, class-loading isolation, and class metadata management.

  • On top of the dependency layer is the basic capability layer, which implements basic atomic functions such as recording and playback plug-in management, data management, data comparison, sub-call mocking, operation monitoring, and configuration loading.

  • The top layer is the business logic layer, which combines the basic functions into complete business units. Currently, in addition to traffic recording and playback, Moonlight Box also supports dependency analysis, data mocking, and other functions.

3.2 Startup process of Moonlight Box

The most important part of starting a recording or playback task is to deliver our Agent to the designated service machine non-intrusively and attach it to the service process automatically.

The startup process of Moonlight Box is shown in the figure below. Users first configure recording and playback tasks on the Moonlight Box platform. After configuration is complete, the configuration is stored in the database, and the startup script and the vivo Repeater Agent package are delivered to the user-configured machine through VCS (a job scheduling platform developed by vivo). The shell script is then executed, which launches the Sandbox and attaches the Agent to the target JVM. The Agent then creates a JVM Sandbox in the target JVM through reflection, and the Sandbox loads multiple modules through SPI.

The most important of these is the vivo Repeater Module, which loads plug-ins via SPI. These plug-ins ultimately enhance code in the target JVM through ASM-based bytecode instrumentation, and the enhanced code intercepts traffic and delivers it to storage for recording and playback.

The above execution process allows users to perform complex traffic recording and playback functions with only a small amount of information configured on the console. The detailed recording and playback process is described below.

3.3 Traffic Recording Process

The following shows the traffic recording process. The call link of one request includes an entry call and several sub-calls. Recording binds the entry call and its sub-calls into one complete call record through a unique ID. Moonlight Box finds the appropriate code points (key entries and exits) of the entry call and sub-calls, enhances the code at these points with bytecode instrumentation to intercept the call, records its input parameters and return value, and then generates a recording identifier according to the call type (such as Dubbo or HTTP). When the call completes, Moonlight Box collects the call record for the whole request. It then desensitizes and serializes the data, and finally encrypts it and sends it to the server for storage.

Recording is a complex process. Along the way we hit a number of pitfalls and problems; a few of the more important ones are shared below.
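As a rough illustration of how an entry call and its sub-calls might be bound into one record through a unique ID, here is a minimal sketch. All class and method names (CallRecord, TraceRecord, TraceRecorder) are hypothetical and are not the Moonlight Box or JVM-Sandbox-Repeater API; bytecode interception, desensitization, serialization, and encryption are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch only; every name here is hypothetical.
final class CallRecord {
    final String type;       // call type, e.g. "http" or "dubbo"
    final String identity;   // e.g. a URI or interface#method
    final Object request;    // input parameters
    final Object response;   // return value

    CallRecord(String type, String identity, Object request, Object response) {
        this.type = type;
        this.identity = identity;
        this.request = request;
        this.response = response;
    }
}

final class TraceRecord {
    final String traceId = UUID.randomUUID().toString(); // unique ID binding entry and sub-calls
    final List<CallRecord> subCalls = new ArrayList<>();
    CallRecord entry;
}

final class TraceRecorder {
    private static final ThreadLocal<TraceRecord> CURRENT = new ThreadLocal<>();

    // Would be invoked by the interceptor when an entry call begins.
    static void beginEntry() {
        CURRENT.set(new TraceRecord());
    }

    // Would be invoked by the interceptor when a sub-call returns.
    static void onSubCall(CallRecord call) {
        TraceRecord trace = CURRENT.get();
        if (trace != null) {          // sub-calls outside a recorded entry are ignored
            trace.subCalls.add(call);
        }
    }

    // Would be invoked when the entry call returns; yields the complete record,
    // or null if no recording was open on this thread.
    static TraceRecord finishEntry(CallRecord entry) {
        TraceRecord trace = CURRENT.get();
        CURRENT.remove();
        if (trace != null) {
            trace.entry = entry;
        }
        return trace;
    }
}
```

In this sketch the record lives in a ThreadLocal, so it only holds together while the whole request runs on one thread.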

3.3.1 Difficulty 1: Full GC

At the beginning, Full GC occurred in vivo's internal systems when using Moonlight Box. Analysis showed that the recorded interface made too many Guava calls, so a single request produced too much recorded data, which triggered Full GC. This is because all recorded data for a request is held in memory until recording of that request finishes, so once the request or its sub-calls become too large, frequent Full GC is likely. In addition, some high-concurrency systems have many interfaces, and recording several high-concurrency interfaces at the same time puts pressure on performance. Therefore, we optimized the performance of Moonlight Box as follows:

  • Strictly limit the number of concurrent recordings and the number of sub-calls per request;

  • Monitor the recording process and degrade on exceptions;

  • Merge the recording of identical sub-calls to reduce the number of sub-calls;

  • Monitor recording cache usage in real time and degrade promptly once it exceeds the warning line.

After continuous optimization, recording now runs smoothly, and Full GC caused by excessive traffic or similar problems no longer occurs.
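The first of these limits, a cap on concurrent recordings and on sub-calls per request, could look roughly like the sketch below. The class RecordingLimiter and its thresholds are assumptions made for illustration; the source does not describe how Moonlight Box implements its limits.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; names and thresholds are hypothetical.
final class RecordingLimiter {
    private final Semaphore concurrentRecordings;
    private final int maxSubCallsPerTrace;

    RecordingLimiter(int maxConcurrentRecordings, int maxSubCallsPerTrace) {
        this.concurrentRecordings = new Semaphore(maxConcurrentRecordings);
        this.maxSubCallsPerTrace = maxSubCallsPerTrace;
    }

    // Returns a ticket if recording may start, or null if the request
    // should pass through unrecorded because the concurrency cap is reached.
    Ticket tryStart() {
        return concurrentRecordings.tryAcquire() ? new Ticket() : null;
    }

    final class Ticket {
        private final AtomicInteger subCalls = new AtomicInteger();
        private final AtomicBoolean finished = new AtomicBoolean();
        private volatile boolean degraded;

        // Returns false once the sub-call budget is exhausted; the caller
        // is then expected to drop the buffered data for this request.
        boolean allowSubCall() {
            if (subCalls.incrementAndGet() > maxSubCallsPerTrace) {
                degraded = true;
            }
            return !degraded;
        }

        boolean isDegraded() {
            return degraded;
        }

        // Frees the concurrency slot exactly once, even if called repeatedly.
        void finish() {
            if (finished.compareAndSet(false, true)) {
                concurrentRecordings.release();
            }
        }
    }
}
```

Rejecting with tryAcquire instead of blocking keeps the business thread from ever waiting on the recorder, which matches the idea of degrading recording rather than slowing down the service.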

3.3.2 Difficulty 2: Connecting the call link

Traffic recording and playback relies on an identifier carried in the thread context. Many vivo systems have custom business thread pools or use third-party frameworks with their own thread pools (such as Hystrix), which causes the identifier to be lost so that the whole call link cannot be connected.

Moonlight Box initially relied on the basic capability of JVM-Sandbox-Repeater, which stores the recording identifier in a ThreadLocal, to connect the entire call link when no thread pool is involved. For thread pools, we used our own Agent to automatically enhance Java thread pools so that they pass the recording and playback identifier along, but this conflicted with the thread-pool enhancement done by the company's call chain Agent and crashed the JVM. This approach was a dead end.

In the end, we decided to cooperate with the company's call chain team and pass the recording identifier through the call chain's Tracer context. Both sides made some changes; in particular, the two Agents adjusted their instrumentation points for HTTP and Dubbo. ForkJoinPool thread pools are not yet supported, and we will continue working on them.
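To show why a ThreadLocal identifier is lost at a thread-pool boundary and what carrying it across means, here is a small generic sketch. It wraps a task by hand and is not the Tracer-based mechanism described above, whose details are not given in the source; RecordContext and its methods are hypothetical names.

```java
import java.util.concurrent.Callable;

// Illustrative sketch only; not the call chain Tracer mechanism.
final class RecordContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) {
        TRACE_ID.set(traceId);
    }

    static String get() {
        return TRACE_ID.get();
    }

    static void clear() {
        TRACE_ID.remove();
    }

    // Captures the submitting thread's identifier now and installs it
    // around the task when a pool thread eventually runs it.
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = get();       // read on the submitting thread
        return () -> {
            String previous = get();         // whatever the worker held before
            set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }
}
```

A task submitted without wrap sees no identifier on the worker thread, while pool.submit(RecordContext.wrap(task)) sees the submitter's identifier. An agent that enhances thread pools automates this wrapping, which is also why two agents enhancing the same pool classes can collide.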

3.3.3 Difficulty 3: Data security

The third is how to ensure the data security of recorded traffic, since many systems handle sensitive data. Users can configure the fields to be desensitized on the Moonlight Box platform, and the Agent desensitizes these fields in memory according to the configuration when recording traffic, ensuring data security during transmission and storage. In addition, Moonlight Box strictly controls permissions for viewing traffic details to prevent cross-project data queries.
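Field-level desensitization driven by configuration might be sketched as follows. The Desensitizer class, the "***" mask, and the map-based payload are assumptions for illustration; the source does not specify the masking rule or the data model the Agent uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the mask value and data model are assumptions.
final class Desensitizer {
    private final Set<String> sensitiveFields;

    Desensitizer(Set<String> sensitiveFields) {
        this.sensitiveFields = sensitiveFields;
    }

    // Returns a masked copy of the payload; nested maps are handled recursively
    // and the original payload is left untouched.
    Map<String, Object> mask(Map<String, Object> payload) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : payload.entrySet()) {
            Object value = e.getValue();
            if (sensitiveFields.contains(e.getKey())) {
                out.put(e.getKey(), "***");
            } else if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                out.put(e.getKey(), mask(nested));
            } else {
                out.put(e.getKey(), value);
            }
        }
        return out;
    }
}
```

Masking a copy before serialization means the business object itself is never modified, so recording cannot change the behavior of the request being recorded.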

3.3.4 Difficulty 4: Traffic deduplication

The fourth is traffic deduplication. Sometimes a business team records a large amount of identical traffic when using the Moonlight Box platform, which makes subsequent playback slow and troubleshooting inefficient. We therefore considered how to reduce duplicate traffic as much as possible while preserving interface coverage. The current approach is to deduplicate based on the request's input parameters and the executed call stack. During recording, the Agent deduplicates traffic according to the deduplication configuration to ensure that the stored traffic data is unique. This mechanism greatly reduces the amount of recorded traffic in some scenarios and improves the efficiency of traffic use.
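One way to deduplicate on input parameters plus the executed call stack is to hash both into a fingerprint and keep a set of fingerprints already stored. The sketch below assumes this approach; TrafficDeduplicator, the use of SHA-256, and the in-memory set are illustrative choices, not details from the source.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; the fingerprint format is an assumption.
final class TrafficDeduplicator {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given combination of entry,
    // request parameters, and call stack is observed.
    boolean shouldStore(String entryIdentity, String serializedRequest, List<String> callStack) {
        String fingerprint = entryIdentity + '\n' + serializedRequest + '\n' + String.join(">", callStack);
        return seen.add(sha256(fingerprint));
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Including the call stack in the fingerprint keeps two requests with the same parameters when they exercised different code paths, which is what preserves coverage while removing repeats.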

3.4 Traffic Playback Process

The following figure shows the process of traffic playback. Playback fetches the entry call of recorded traffic and invokes the iterated system with it again, then verifies that the system's logic is correct. Unlike recording, playback mocks external calls and does not actually access the database. During playback, the input parameters of the recorded and replayed sub-calls are compared: if they are inconsistent, the playback request is blocked; if they are consistent, the recorded result of the sub-call is returned as the mock. When playback completes, it also produces a response, which we compare with the originally recorded response. The entry comparison together with the sub-call comparisons determines whether the system under test is correct.

Playback is an even more complex process, because recording and playback generally run against different versions of a system in different environments, which may differ greatly. If this is not handled properly, the playback success rate will be low. At first, the success rate of all applications connected to Moonlight Box was relatively low. After long-term optimization and careful operation, the playback success rate continued to improve. Here are some of the pitfalls and our strategies.
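The compare-then-mock rule for sub-calls can be sketched as below. ReplaySession is a hypothetical name, and the sketch simplifies by keeping one recorded sub-call per identity and comparing serialized strings; it only illustrates the rule described above, not the platform's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; one recorded sub-call per identity is a simplification.
final class ReplaySession {
    // identity -> {recorded request, recorded response}
    private final Map<String, String[]> recordedSubCalls = new HashMap<>();
    private final List<String> mismatches = new ArrayList<>();

    void addRecordedSubCall(String identity, String request, String response) {
        recordedSubCalls.put(identity, new String[] {request, response});
    }

    // Would be invoked instead of the real external call during playback.
    // Consistent parameters yield the recorded result; otherwise playback is blocked.
    String onSubCall(String identity, String request) {
        String[] recorded = recordedSubCalls.get(identity);
        if (recorded == null || !recorded[0].equals(request)) {
            mismatches.add(identity);
            throw new IllegalStateException("playback blocked, sub-call mismatch: " + identity);
        }
        return recorded[1];
    }

    // The run passes only if every sub-call matched and the entry responses agree.
    boolean passed(String recordedEntryResponse, String replayedEntryResponse) {
        return mismatches.isEmpty() && recordedEntryResponse.equals(replayedEntryResponse);
    }
}
```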

3.4.1 Difficulty 1: Time difference

The first difficulty is the impact of time differences. Some business logic depends on time. Because recording and playback happen at different times, this time-dependent logic behaves differently in many scenarios, leading to playback failure. We did some research on this and eventually managed to align playback time with recording time.

For the native method System.currentTimeMillis(), the Agent dynamically modifies the bytecode of method bodies, proxies business calls to this method, and replaces them with a time-acquisition method defined by the platform in advance. Once this is solved, classes such as Date are easy to handle. In addition, non-native time methods such as those of LocalDateTime in JDK 8 are relatively easy to mock. These mechanisms eliminate most time-difference problems in business logic and the playback failures they cause.
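The platform-defined time method that replaced call sites are redirected to could be as simple as the sketch below. PlaybackClock is a hypothetical name, and the bytecode rewriting that redirects System.currentTimeMillis() calls to it is not shown; this only illustrates the target of that redirection.

```java
// Illustrative sketch only; the bytecode rewriting itself is not shown.
final class PlaybackClock {
    private static final ThreadLocal<Long> RECORDED_MILLIS = new ThreadLocal<>();

    // Would be called when playback of one recorded request starts on this thread.
    static void enterPlayback(long recordedMillis) {
        RECORDED_MILLIS.set(recordedMillis);
    }

    // Would be called when playback of that request ends.
    static void exitPlayback() {
        RECORDED_MILLIS.remove();
    }

    // What a rewritten call site would invoke instead of System.currentTimeMillis():
    // the recorded time during playback, the real time otherwise.
    static long currentTimeMillis() {
        Long recorded = RECORDED_MILLIS.get();
        return recorded != null ? recorded : System.currentTimeMillis();
    }
}
```

Falling back to the real clock outside playback keeps ordinary traffic on the same machine unaffected.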

3.4.2 Difficulty 2: System noise reduction

The second difficulty is how to deal with system noise. Many systems have common noise fields, such as traceId and sequenceId, which may also cause playback failures. At first we checked them one by one when a service was onboarded, which was inefficient. Later, the platform supported configuring noise fields at the global, application, and interface levels. Many common noise fields can be configured directly at the global level, so a service only needs to configure its own noise fields when it is onboarded.

3.4.3 Difficulty 3: Unified environment

The third difficulty is environmental differences. Take vivo's Internet systems as an example: recording is generally performed in the online environment, and playback in the test and pre-release environments. At the beginning, many playbacks failed because the environments were inconsistent, which lowered the overall success rate. We explored a series of solutions. When Moonlight Box records online, it also records a snapshot of the online environment configuration. During offline playback, this online configuration automatically replaces the offline environment configuration, which keeps configuration-center data consistent. In addition, for configuration data held in system memory, Moonlight Box supports synchronizing the in-memory data through a configured interface. With these solutions, we have largely kept the online and offline environments consistent and significantly reduced playback failures caused by environment configuration.
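A three-level noise configuration can be pictured as three sets of field names that are merged before the recorded and replayed results are compared. The NoiseFilter class below is a hypothetical sketch over flat maps; how the platform stores or applies its noise configuration is not described in the source.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; flat maps stand in for real response objects.
final class NoiseFilter {
    private final Set<String> global = new HashSet<>();
    private final Map<String, Set<String>> perApp = new HashMap<>();
    private final Map<String, Set<String>> perInterface = new HashMap<>();

    void addGlobal(String field) {
        global.add(field);
    }

    void addForApp(String app, String field) {
        perApp.computeIfAbsent(app, k -> new HashSet<>()).add(field);
    }

    void addForInterface(String iface, String field) {
        perInterface.computeIfAbsent(iface, k -> new HashSet<>()).add(field);
    }

    // Compares two results after dropping every noise field that applies
    // at the global, application, or interface level.
    boolean matches(String app, String iface, Map<String, Object> recorded, Map<String, Object> replayed) {
        Set<String> ignore = new HashSet<>(global);
        ignore.addAll(perApp.getOrDefault(app, Collections.emptySet()));
        ignore.addAll(perInterface.getOrDefault(iface, Collections.emptySet()));

        Map<String, Object> left = new HashMap<>(recorded);
        Map<String, Object> right = new HashMap<>(replayed);
        left.keySet().removeAll(ignore);
        right.keySet().removeAll(ignore);
        return left.equals(right);
    }
}
```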

3.4.4 Difficulty 4: Sub-call matching

The fourth difficulty is sub-call matching. The matching policy we started with could not meet the requirements of complex business scenarios. As a result, sub-calls often matched nothing or matched the wrong record, which made playback difficult. Later we specified different matching strategies for different types of sub-calls: cache calls are matched by cache key; HTTP calls are matched by URI; Dubbo calls are matched by interface, method name, parameter types, and so on. In addition, if several identical sub-calls match, we compare both the system call stack and the request parameters and combine these two dimensions to find the most likely match. These refined strategies improve the matching success rate.
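The two-stage idea, first a type-specific key and then call stack plus request parameters as a tie-breaker, might look like the following sketch. RecordedSubCall, SubCallMatcher, and the scoring weights are assumptions for illustration; the source does not say how the two dimensions are combined.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; the scoring weights are arbitrary.
final class RecordedSubCall {
    final String type;          // "cache", "http", "dubbo", ...
    final String matchKey;      // cache key, URI, or interface#method(paramTypes)
    final List<String> stack;   // call stack captured at record time
    final String request;       // serialized request parameters
    final String response;      // recorded result used as the mock

    RecordedSubCall(String type, String matchKey, List<String> stack, String request, String response) {
        this.type = type;
        this.matchKey = matchKey;
        this.stack = stack;
        this.request = request;
        this.response = response;
    }
}

final class SubCallMatcher {
    static Optional<RecordedSubCall> match(List<RecordedSubCall> recorded, String type, String matchKey,
                                           List<String> stack, String request) {
        RecordedSubCall best = null;
        int bestScore = -1;
        for (RecordedSubCall candidate : recorded) {
            // Stage 1: the type-specific key must agree.
            if (!candidate.type.equals(type) || !candidate.matchKey.equals(matchKey)) {
                continue;
            }
            // Stage 2: prefer the candidate that agrees on call stack and parameters.
            int score = (candidate.stack.equals(stack) ? 2 : 0)
                      + (candidate.request.equals(request) ? 1 : 0);
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

When two recorded calls share the same URI but were made from different places in the code, the call stack decides which recorded response is returned.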

3.4.5 Difficulty 5: Troubleshooting

The fifth difficulty is troubleshooting. Recording and playback is a very complex process, and problems caused by the Agent running on a business machine are very difficult to analyze and troubleshoot. To improve troubleshooting efficiency, we support several approaches:

1) A call link diagram for playback analysis, which is explained in detail below;

2) Output of the detailed commands and parameters used to start a task. With the task startup command and its parameters, we can conveniently start and simulate online recording and playback tasks locally, which improves troubleshooting efficiency;

3) One-click local installation of the Agent. After modifying the Agent code locally, we can install the new Agent from the local machine to the remote test environment with one click.

In addition to these features, we have developed a number of productivity tools, which are not covered here.

3.5 Rich protocol support

vivo has many types of business, and technology stacks differ between businesses. These systems need corresponding plug-ins to be adapted to our platform. Through continuous improvement, we now support the following dozens of plug-ins, covering most common middleware.

3.6 Other features of Moonlight Box Platform

3.6.1 Visualizing Call Links

At first, the platform developers had to help troubleshoot every playback failure, which was time-consuming and laborious. To address this, we provide some visual operation and maintenance tools. One of them is the call link analysis diagram. We track and record the recording and playback process in detail and help users analyze the execution process through the call link diagram. When a failure occurs, users can clearly see where it happened and its root cause, which improves troubleshooting efficiency.

3.6.2 Regression code coverage

One advantage of Moonlight Box is that recorded traffic easily achieves high coverage. But how can we verify that the playback traffic really covers the various scenarios of the business system, so that users can trust Moonlight Box and go online with confidence?

To solve this problem, Moonlight Box provides statistics on regression code coverage. We use the internal coco-Server platform to calculate full and incremental code coverage. To isolate the coverage produced by traffic playback, we call an interface to clear the coverage data in the machine's memory before playback. With this method, playback traffic may still be mixed with other traffic. Once the coco-Server platform supports traffic coloring to distinguish traffic sources, this concern will go away.

3.6.3 Timing Recording and Playback

Although the operation of traffic recording and playback is very simple, it is still cumbersome for teams that use it frequently. In particular, some releases involve many systems, and recording and playing back multiple systems by hand is inefficient. To improve efficiency, Moonlight Box supports scheduled recording and playback tasks. Scheduled tasks can record and play back in batches, which reduces manual operation costs and improves the platform experience.

3.7 Other applications of Moonlight Box

In addition to automated testing, we have also explored other applications. The first is load testing: users can analyze recorded traffic on the Moonlight Box platform to generate a load-testing model. The second is problem localization: online problems are played back offline with Moonlight Box to help testers and developers reproduce the scene of the problem. The last is security analysis: regularly recording test-environment traffic provides security engineers with traffic samples and helps identify security risks in business systems.

4. Core indicators

Onboarding to the Moonlight Box platform is very simple, and a business can complete its initial onboarding within 10 minutes. In less than a year, the platform has been connected to nearly 200 business systems, many of which are among the most critical applications in vivo's Internet systems. More than 10,000 recording and playback runs were completed in one year. Since businesses adopted Moonlight Box, the platform has caught dozens of problems across different businesses before they reached production, effectively reducing the number of online incidents. In many scenarios, traffic recording and playback with the Moonlight Box platform improved the productivity of testers and developers by more than 80%, exceeding our overall goals.

5. Future planning

Our future planning focuses on two aspects: feature planning and open source collaboration.

5.1 Function planning

We have completed the platform's basic functionality, but problems such as efficiency remain. In the future, we will focus on optimization in the following two directions:

1) We hope to achieve precision testing, which avoids a full playback of recorded data every time and further reduces playback time. Precision testing analyzes the changed code, determines its scope of impact, and then selects only the corresponding traffic for playback, which reduces the amount of traffic replayed.

2) We hope to integrate with CI/CD in vivo's Internet systems. When a business system is released to the pre-release environment, recording and playback tasks can be triggered automatically, so that risks to the system are identified before going online and user efficiency improves.
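The traffic-selection step of precision testing could, under simple assumptions, reduce to a set intersection: keep a recorded request only if the methods it executed overlap with the methods touched by the change. TrafficSelector and the method-name strings below are hypothetical, and how the changed methods and per-request method sets are obtained is outside this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; inputs are assumed to be computed elsewhere.
final class TrafficSelector {
    // Keeps only the traces whose executed methods intersect the changed methods.
    static List<String> select(Map<String, Set<String>> methodsByTraceId, Set<String> changedMethods) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : methodsByTraceId.entrySet()) {
            if (!Collections.disjoint(e.getValue(), changedMethods)) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}
```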

5.2 Open Source Co-creation

Open source is the future of software. We have benefited from open source, and we look forward to actively participating in open source projects and contributing to the community. We participate in the open source github.com/alibaba/jvm… project and have become a core contributor to the community. In the first phase we contributed five important plug-ins that the community did not have. In the future, we plan to gradually give back some of the core capabilities of Moonlight Box to the community, as outlined below.

Author: Liu YanJiang, Xu Weiteng, Vivo Internet Server Team

This article is based on Mr. Liu YanJiang’s speech at the “2021 Vivo Developer Conference”. Reply [2021VDC] to the official account to get materials on the topics of the Internet technology sub-conference.