Background

In Meituan's values, "customer focus" carries great weight, so we have become increasingly intolerant of service failures. With the company's business in a stage of rapid growth, every failure is a significant loss. The IT infrastructure is highly complex, and problems can occur at the network, server, operating system, or application level. In this context, we need to give our services a comprehensive "physical examination" to ensure the stability of Meituan's many business services and deliver a high-quality user experience. Concretely, we rely on the following technical means to help everyone "eat better, live better":

  • Verify the stability and scalability of the service under peak traffic.
  • Verify the stability of newly launched features.
  • Conduct fault drills such as degradation and alerting.
  • Make more accurate capacity assessments for online services.

Full-link stress testing runs against the real online environment and real business scenarios, simulating massive user requests to exercise the entire system. In the early days, before full-link stress testing existed, the main approaches were:

  • Initiate service calls against a single machine or cluster online.
  • Record online traffic and replay it against a single machine.
  • Adjust routing weights to drain extra traffic onto a few machines.

However, these approaches make it difficult to test the entire service cluster comprehensively. If the health of the whole cluster is inferred from partial results, the real performance level of the system is easily overgeneralized and cannot be evaluated accurately. The main reasons are:

  • Testing only the core services involved does not cover all the links.
  • Systems are chained together through shared basic services such as Nginx, Redis caches, databases, disks, and networks, and problems in these basic services cannot be exposed by stress testing a single service.

Taking all of this into account, full-link stress testing is the only way to accurately evaluate the performance of the whole system. Today, all of the company's core business lines are connected to full-link stress testing, with tens of thousands of test runs per month on average, helping the business ride out several traffic peaks smoothly.

The solution

Quake is a company-level full-link stress-testing platform. Its goal is to provide comprehensive, safe, and realistic stress testing of the entire link and help businesses make more accurate capacity assessments. This translates into the following requirements for Quake:

  • Provide the ability to simulate real online traffic
    • Unlike a DDoS attack, which may need only a single request, a stress test has to reflect real application scenarios. To reproduce user behavior faithfully, we need to capture real online traffic for the test.
  • Be able to build a stress-testing environment quickly
    • The environment here is the online environment: if the test runs offline, factors such as cluster size, database size, and network conditions cannot be reproduced, even before considering whether the machine configurations match, so the results have little reference value.
  • Support multiple stress-test types
    • Besides the standard HTTP protocol, the test types need to support Meituan's internal RPC and mobile protocols.
  • Provide real-time monitoring and overload protection during the test
    • Full-link stress testing requires watching service status in real time, especially when probing limits. This demands precise QPS control, second-level monitoring, preconfigured circuit breaking and degradation, and the ability to locate problems quickly.

Quake overall architecture design

Quake is a full-link stress-testing system that brings together data construction, stress-test isolation, scenario management, dynamic control, process monitoring, and test reports, with the ability to simulate real traffic and generate load in a distributed fashion. By simulating the real business scenarios of massive users, it puts high load on the business ahead of time, detects application performance bottlenecks in all dimensions, and ensures that business peaks are handled smoothly.

Architecture diagram

The overall architecture of Quake is divided into:

  • Quake-Web: the stress-test management console, responsible for data construction, environment preparation, scenario management, dynamic adjustment during the test, and presentation of test reports.
  • Quake-Brain: the scheduling center, responsible for scheduling stress-test resources, distributing tasks, and managing machine resources.
  • Quake-Agent: the stress-test engine, responsible for simulating the various kinds of test traffic.
  • Quake-Monitor: the monitoring module, which collects test results and monitors service metrics.

Core functions of the management console

Data construction

Traditionally, testers maintained their own batches of test data by hand. This approach has major drawbacks: the maintenance cost is high, and the data it produces is not diverse enough. In a real business scenario, you need to replay the traffic generated during peak hours; only under that kind of traffic can the problems the system may run into be truly exposed.

Quake provides two main ways to construct data, for HTTP and for RPC:

Collect HTTP service access logs

For HTTP services, the Nginx layer generates request access logs. We collect these logs in a unified way to meet the need for stress-test traffic data. The architecture is as follows:

  • S3 is the final log storage platform

At the bottom layer, Hive is used as the data warehouse tool, so that businesses can construct data on the platform with a simple SQL-like language. Quake filters the data from the warehouse and stores it in S3 as word-list files for the test.

  • Word list: the metadata needed for the test; each line represents one request, including method, path, params, header, body, and so on.

Real-time recording of online RPC traffic

For RPC services, the volume of calls far exceeds that of HTTP, so it is not feasible to log every call online. Here we record online traffic in real time. Using the recording capability provided by the RPC framework, we enable recording on a subset of machines in the cluster. Based on the interface and method names to be recorded, request data is reported to the Broker, which records the traffic, generates the final stress-test word list, and uploads the file to the storage platform (S3).

  • RPC Client: the caller of the service
  • Server: the service provider
  • Broker: the buffer server that receives the recorded traffic
  • S3: the final storage platform for the traffic

Other optimizations:

  • Traffic parameter offsetting

In some scenarios, the constructed traffic cannot be used directly; values such as user IDs and mobile phone numbers need to be offset. Quake provides a variety of substitution rules, including arithmetic operations, bounded intervals, random numbers, and time-based values.
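As a minimal sketch (the interface and class names are illustrative, not Quake's actual rule engine), a substitution rule can be modeled as a small strategy interface; an offset rule shifts numeric IDs into a test range, while an interval rule draws a bounded random value:

```java
import java.util.concurrent.ThreadLocalRandom;

/** A single field-substitution rule applied to word-list entries (illustrative sketch). */
interface SubstitutionRule {
    String apply(String originalValue);
}

/** Shifts numeric IDs (e.g. user IDs) by a fixed offset so test data never collides with real users. */
class OffsetRule implements SubstitutionRule {
    private final long offset;
    OffsetRule(long offset) { this.offset = offset; }
    @Override public String apply(String originalValue) {
        return String.valueOf(Long.parseLong(originalValue) + offset);
    }
}

/** Replaces the value with a random number inside a bounded interval. */
class RandomIntervalRule implements SubstitutionRule {
    private final long lower, upper;
    RandomIntervalRule(long lower, long upper) { this.lower = lower; this.upper = upper; }
    @Override public String apply(String originalValue) {
        return String.valueOf(ThreadLocalRandom.current().nextLong(lower, upper));
    }
}
```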

  • Sharding of word-list files

The word-list files produced by data construction need to be physically sharded, keeping the shard sizes as uniform as possible and bounded. The main reason is that the subsequent test is driven by a distributed stress cluster: given the speed at which a single machine can pull a word list and the limit on how large a word list it can load, sharding the word list helps the task scheduler allocate it more sensibly.

Stress-test isolation

The biggest difference between online and offline stress testing is that an online test must be safe and controllable: it must not affect normal users and must not pollute the online environment with test data. The first problem to solve is identifying test traffic and passing that identity through the whole link. Once the traffic is tagged, each service and middleware can group by the tag and apply the shadow-table scheme accordingly.

Pass-through of the test identifier

For a single service, test traffic is easy to identify: a special test flag is placed in the request header, and HTTP and RPC services work the same way. What is hard is keeping the test flag intact across the entire call link.

  • Pass-through across threads:

For services that involve multithreaded invocation, the test identifier must not be lost when crossing threads. The main thread writes the test flag into a ThreadLocal of the current thread based on the incoming request (ThreadLocal gives each thread its own copy of the variable). By using InheritableThreadLocal, variables in the parent thread are passed to child threads, so the test flag propagates. For thread pools, the pool is also wrapped: when a task is submitted, the extra ThreadLocal variables are captured, and they are swapped in while the task executes.
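A minimal sketch of the idea, assuming a simple boolean flag (class and method names are illustrative, not Quake's actual context classes): InheritableThreadLocal covers threads spawned directly by the request thread, while a wrapper captures and restores the flag for pooled threads.

```java
import java.util.concurrent.ExecutorService;

/** Illustrative sketch of propagating a stress-test flag across threads. */
public final class TestFlagContext {
    // InheritableThreadLocal covers threads created directly by the request thread.
    private static final ThreadLocal<Boolean> FLAG =
            new InheritableThreadLocal<Boolean>() {
                @Override protected Boolean initialValue() { return Boolean.FALSE; }
            };

    public static void mark(boolean isTest) { FLAG.set(isTest); }
    public static boolean isTest() { return FLAG.get(); }

    /**
     * Thread pools reuse threads, so InheritableThreadLocal alone is not enough:
     * capture the flag at submit time and restore it around the task body.
     */
    public static Runnable wrap(Runnable task) {
        final boolean captured = isTest();
        return () -> {
            boolean previous = isTest();
            FLAG.set(captured);           // swap in the submitter's flag
            try {
                task.run();
            } finally {
                FLAG.set(previous);       // put the worker thread back the way it was
            }
        };
    }

    // Usage sketch: executor.submit(TestFlagContext.wrap(() -> callDownstreamService()));
    private TestFlagContext() { }
}
```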

  • Pass-through across services:

For cross-service invocation, the architecture team modified all the middleware involved. Using the inter-service context-passing feature of Mtrace (the company's unified distributed tracing system), a test-identifier attribute is added on top of the context that is already transmitted, so the test flag is always carried along. The following diagram shows upstream and downstream calls through Mtrace:

Link diagnosis

Because link relationships are complex, the links involved in one stress test can be very convoluted. It is often hard to know which indirectly depended-on services are involved, and if any node has a problem, such as a middleware version that does not meet the requirement, the test identifier fails to pass through. Quake therefore provides link-matching analysis: through the platform, it can tentatively send the requests that need to be tested and, based on the data provided by Mtrace, help the business quickly locate the nodes where pass-through of the test flag failed.

  • Overview of link diagnostics

  • Link diagnosis detail view

Stress-test service isolation

Large stress tests are usually run late at night, during off-peak hours, and the people responsible are asked to watch the metrics of their own systems so that normal online usage is not affected. For routine tests, Quake offers a safer and more convenient approach: during off-peak hours, when machines are largely idle, we quickly build an online stress-test group for the whole link according to business needs, carving out a batch of idle machines for the test. Isolating normal traffic from test traffic at the machine level reduces the impact of the test on the service cluster.

Building on the identifier pass-through mechanism, the Quake platform offers different isolation policies based on IP address, machine count, or percentage. The business only needs to provide the service name; enabling and disabling the isolation is handled by Quake.

Stress-test data isolation

A trickier issue is stress testing write requests, which would otherwise write large amounts of dirty data into the real database. We borrowed Alibaba's original "shadow table" isolation idea. Its core is to use the same online database, including sharing the database's memory resources, because that stays closest to the real situation, while writes carrying the test flag go into a separate "shadow table".
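A minimal sketch of the shadow-table routing idea, reusing the TestFlagContext sketch above (the suffix and table names are assumptions, not the actual scheme):

```java
/**
 * When the current request carries the stress-test flag, reads and writes are redirected
 * to a shadow table that lives in the same database as the real one.
 */
public final class ShadowTableRouter {
    private static final String SHADOW_SUFFIX = "_shadow";   // assumed naming convention

    /** Returns the physical table this request should hit. */
    public static String route(String logicalTable) {
        return TestFlagContext.isTest() ? logicalTable + SHADOW_SUFFIX : logicalTable;
    }

    // Usage sketch inside a DAO:
    // String sql = "INSERT INTO " + ShadowTableRouter.route("t_order") + " (order_id, user_id) VALUES (?, ?)";
    private ShadowTableRouter() { }
}
```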

A similar idea applies to KV storage. For MQ (message queues), which involve both production and consumption, the business can choose, based on its actual needs, to drop messages carrying the test flag at the producer side, or to receive them at the consumer side and then ignore them.
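A consumer-side sketch of the same idea, with illustrative interfaces rather than the company's actual MQ client API: messages carrying the test flag are acknowledged and dropped before any business side effects occur.

```java
/** Illustrative message abstraction; real MQ clients expose their own types. */
interface Message {
    String header(String key);
    byte[] body();
}

class ShadowAwareConsumer {
    private static final String TEST_FLAG_HEADER = "x-stress-test";  // assumed header name

    /** Returns true if the message was handled; stress-test messages are acked and dropped. */
    boolean onMessage(Message msg) {
        if ("true".equals(msg.header(TEST_FLAG_HEADER))) {
            // Business chose to ignore test traffic on the consumer side: ack without side effects.
            return true;
        }
        return handleBusinessMessage(msg);
    }

    private boolean handleBusinessMessage(Message msg) {
        // ... real consumption logic ...
        return true;
    }
}
```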

Core design of the scheduling center

As the brain of the whole stress-testing system, the scheduling center manages all stress-test tasks and engines. Using its scheduling algorithm, it splits each task into several plans that can run on a single engine and sends the plans to different engines as instructions, which then execute the test.

Resource calculation

Different scenarios require different machine resources. Taking an HTTP service as an example, for two requests whose request/response bodies are both under 1 KB, the maximum load a single machine can generate is completely different when the response time is under 50 ms versus around 1 s. Many factors affect load-generation capacity, so the scheduling center calculates the required resources from the parameters of the stress-test model (a rough sketch follows the list below).

Key reference data include:

  • The QPS the test is expected to reach.
  • The average response time and the request/response body size of the tested requests.
  • The size of the word list and the number of shards.
  • The stress-test type.
  • The data center (machine room) required for the test.
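As a back-of-the-envelope illustration only (the formula, baselines, and names below are assumptions, not Quake's actual algorithm), the number of agents could be derived from the test model roughly like this:

```java
/** Rough sketch: slower interfaces and larger payloads reduce what a single agent can sustain. */
public final class AgentEstimator {

    /**
     * @param targetQps      QPS the test is expected to reach
     * @param avgResponseMs  average response time of the tested interface
     * @param avgBodyBytes   average request/response body size
     * @param perAgentMaxQps measured ceiling of one agent for this protocol type
     */
    public static int estimateAgents(int targetQps, int avgResponseMs, int avgBodyBytes, int perAgentMaxQps) {
        double latencyPenalty = Math.max(1.0, avgResponseMs / 50.0);   // assumed 50 ms baseline
        double payloadPenalty = Math.max(1.0, avgBodyBytes / 1024.0);  // assumed 1 KB baseline
        double effectivePerAgent = perAgentMaxQps / (latencyPenalty * payloadPenalty);
        return (int) Math.ceil(targetQps / Math.max(1.0, effectivePerAgent));
    }

    // e.g. estimateAgents(50_000, 100, 2_048, 10_000) -> number of agents to reserve for the plan
    private AgentEstimator() { }
}
```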

Event injection mechanism

Because a stress test is always changing dynamically, the business adjusts the load according to the actual state of the system. Over the whole process there are many kinds of events: adjusting QPS, triggering a circuit breaker, starting fault injection, starting code-level performance profiling, and so on. There are also several ways events get triggered, including manual triggering by users and the system's own protection mechanisms kicking in. We therefore made corresponding optimizations in the architecture; the general structure is as follows:

At the code level, we use the observer and chain-of-responsibility patterns: the concrete condition that triggers an event is the observed subject, and the subject's subscribers generate a series of execution events according to the condition type. Within the execution of an event, the chain-of-responsibility pattern splits up the respective processing logic cleanly, which makes later maintenance and extension easier.
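A minimal sketch of that wiring, with illustrative class names: the event bus plays the observer role, and each subscriber owns a handler chain built with the chain-of-responsibility pattern.

```java
import java.util.ArrayList;
import java.util.List;

enum EventType { ADJUST_QPS, CIRCUIT_BREAK, FAULT_INJECTION, PROFILING }

class PressureEvent {
    final EventType type;
    final Object payload;
    PressureEvent(EventType type, Object payload) { this.type = type; this.payload = payload; }
}

/** Chain of responsibility: each handler deals with one concern and passes the event on. */
abstract class EventHandler {
    private EventHandler next;
    EventHandler setNext(EventHandler next) { this.next = next; return next; }
    void handle(PressureEvent event) {
        if (process(event) && next != null) next.handle(event);
    }
    /** Returns true if the chain should continue. */
    protected abstract boolean process(PressureEvent event);
}

/** Observer: the event source notifies all subscribers, each owning its own handler chain. */
class EventBus {
    private final List<EventHandler> subscribers = new ArrayList<>();
    void subscribe(EventHandler chainHead) { subscribers.add(chainHead); }
    void publish(PressureEvent event) { subscribers.forEach(h -> h.handle(event)); }
}
```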

Machine management

The scheduling center manages all load-generating machines, which are distributed across several data centers in Beijing and Shanghai. The machines are deployed as containers, which lays the groundwork for dynamic scaling, gray (canary) upgrades, and removal of abnormal nodes later on.

  • Dynamic scaling

Because demand for stress testing has peaks and troughs, the platform pre-deploys a number of machines for everyday business tests, and when the resources requested by a test are insufficient, it dynamically scales out in containerized fashion as needed. This saves machine resources on the one hand and makes upgrades easier on the other: it is not hard to imagine that upgrading 50 machines costs far less than upgrading 200.

  • Gray (canary) upgrades

The pool maintains hundreds of machines, which are hard to upgrade all at once when needed. Our previous approach was to take machines offline when no business test was running and then redeploy them in batches, a time-consuming and painful process. We therefore introduced gray upgrades: each engine instance carries a version, and machine selection prefers machines on the stable version. Unused machines are replaced in batches according to their current usage state, and once the new version has passed benchmark and regression tests, the selection strategy is switched to the latest version. This makes the upgrade process smooth, stable, and invisible to the business.

  • Removal of abnormal nodes

The scheduling center maintains heartbeat detection with every load-generating machine and can remove and replace abnormal nodes. This removal capability is essential during a test, because all machine behavior must remain under control: if a machine cannot respond when the load needs to be reduced or stopped, the consequences can be severe.

Stress-test engine optimization

For the engine, Quake uses a home-grown implementation, chosen for extensibility and performance. On extensibility, the main requirement is support for multiple protocols, which we will not elaborate on here. On performance, we did a great deal of optimization to make sure the engine can generate enough requests per second.

Performance issues

Conventional stress-test engines use the BIO approach: multiple threads simulate concurrent users, and each thread works in a request-wait-response loop.

The communication model:

The main problem with this approach is that thread resources are completely wasted while waiting. The problem is even more severe in combined mode. (Combined mode simulates a sequence of user actions: for example, a request group might include logging in, adding to the shopping cart, creating an order, paying, and checking the payment status. These requests are sequential, each depending on the result of the previous one.) If a group contains five sequential requests of 200 ms each, completing the group takes 1 s, so the maximum QPS of a single machine equals the maximum number of threads it can create: 500 threads top out at roughly 500 groups per second. A machine can only create a limited number of threads, and switching between threads is costly, so the maximum QPS per machine under this communication model is limited.

The second problem with this model is that controlling load by thread count is too coarse. If the response is very fast, only tens of milliseconds, adding a single thread may raise QPS by nearly 100. QPS cannot be controlled precisely by adjusting thread counts, which is dangerous when probing a system's limits.

IO model optimization

Let's first look at how NIO works. From the perspective of a client initiating requests, IO events exist as connection-ready events (OP_CONNECT), read-ready events (OP_READ), and write-ready events (OP_WRITE). All IO events are registered with a Selector, which listens for and handles them uniformly; the Selector uses IO multiplexing.

With the NIO processing mechanism understood, the optimization idea is this: issue a fixed number of requests per second according to the preset QPS, and handle the subsequent reads and writes in a non-blocking way, eliminating the per-request waiting of the BIO model. The optimized logic is as follows:
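As a rough illustration of this idea (not the engine's actual code), the sketch below uses Apache HttpAsyncClient, which the engine also relies on for HTTP, to fire a fixed number of requests per second and handle responses asynchronously via callbacks:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;

/** Illustrative sketch: issue `qps` requests per tick, never block waiting for responses. */
public class RateControlledSender {
    private final CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
    private final ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();

    public void start(String url, int qps) {
        client.start();
        // Every second, fire exactly `qps` requests; responses come back on IO reactor threads.
        ticker.scheduleAtFixedRate(() -> {
            for (int i = 0; i < qps; i++) {
                client.execute(new HttpGet(url), new FutureCallback<HttpResponse>() {
                    @Override public void completed(HttpResponse response) { /* hand off to stats */ }
                    @Override public void failed(Exception ex)             { /* count the error   */ }
                    @Override public void cancelled()                      { /* count the cancel  */ }
                });
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```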

Optimization 1: Adopt the Reactor multithreaded model

Here the time is mostly spent on IO read and write events. To issue as many test requests as possible per unit time, we separate connection events from read/write events: connection events are handled by a single-threaded Selector, while read/write events are handled by multiple Worker threads, each of which also works in NIO style with its own Selector for IO reads and writes. Each Worker thread has its own event queue, and data is isolated between them, mainly to avoid the performance overhead of data synchronization.

Optimization 2: Separate business logic from IO read/write events

The business logic here is mainly the processing of request results, including sampling and reporting request data, parsing and validating test results, and matching request conversion rates. If this logic ran in the Worker threads, it would inevitably slow down IO reads. Because the Selector processes a ready IO event on a single thread, that processing should be as simple and fast as possible; otherwise it delays the handling of other ready events and can even cause queue backlogs and memory problems.

Memory optimization

Another important metric for the engine is Full GC time, because frequent Full GCs in the engine make the actual load curve (QPS) jitter, which in turn inflates the measured response time of the tested service and makes the real QPS fluctuate around the preset value. In severe cases, a long Full GC directly prevents the preset QPS from being reached.

Here is a set of load curves distorted by Full GC:

To address the GC problem, we optimized along two dimensions: the application's own memory management and the JVM parameters.

Allocate memory objects properly

  • Optimizing the request-object loading mechanism:

The engine first loads the word-list data into memory and then builds request objects from it to send. If the word list is large, the data gets promoted to the old generation, and if the old generation fills up too much, Full GCs happen frequently. In that case, streaming loading can be used instead: keep a bounded number of requests in a queue and load while replaying, which keeps the memory footprint under control.
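A minimal sketch of streaming word-list loading, with illustrative class names and an assumed window size: a bounded queue keeps only a small window of request lines in memory while the engine replays them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamingWordList implements AutoCloseable {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000); // bounded window
    private final BufferedReader reader;

    public StreamingWordList(Path wordListFile) throws IOException {
        this.reader = Files.newBufferedReader(wordListFile, StandardCharsets.UTF_8);
        // A single loader thread refills the queue while the engine drains it.
        Thread loader = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    buffer.put(line);              // blocks when the window is full
                }
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "word-list-loader");
        loader.setDaemon(true);
        loader.start();
    }

    /** The engine takes the next request line; blocks briefly if the loader is behind. */
    public String next() throws InterruptedException {
        return buffer.take();
    }

    @Override public void close() throws IOException { reader.close(); }
}
```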

  • Creating and releasing request objects quickly:

Assuming a single machine runs at 10,000 QPS, the engine creates 10,000 request objects per second, which should be destroyed within the next second after being processed. If destruction is too slow, large numbers of effectively dead objects get promoted to the old generation, so processing of the response result must avoid time-consuming operations to ensure request objects are released quickly.

We deliberately did not reuse request objects. The basic information of a request takes little memory, but once it is converted into the object actually sent, it occupies far more space than the original data; this is true for both HTTP and RPC. In addition, when Apache HttpAsyncClient is used as the asynchronous framework for HTTP requests, the Response object of the actual request is attached to the request object. In other words, after a request object receives its result, its memory footprint grows by the size of the response. Reusing request objects would therefore easily lead to memory leaks.

JVM parameter tuning

Here we take the JVM's CMS collector as an example. For a high-concurrency scenario that generates a large number of objects at once, all of them very short-lived, we need to do the following (an illustrative flag set follows the list):

  • Increase the size of the young generation so there is enough room for new objects. Of course, if the old generation becomes too small as a result, Full GCs will happen frequently.
  • Appropriately raise the tenuring threshold for promotion to the old generation, reducing the chance that short-lived objects are promoted; at the same time, control the size of the survivor spaces in the young generation, because if they are too small, objects that do not fit will be promoted early.
  • Trigger old-generation collection early: if you wait until the old generation is full, collection may start too late and easily turn into a long Full GC. A common choice is to start collecting at around 70% occupancy. In addition, trigger a Young GC before the remark phase, which shortens the application pause during remarking and prevents a large number of dead objects from surviving into the old generation after the collection completes.
  • Set the number of GCs after which compaction is performed. Compaction is in many cases the main cause of long GCs, because memory compaction is handled single-threaded by Serial Old and can be very slow, especially when the old generation lacks space.
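For illustration only (the article does not give the engine's actual settings, so the heap sizes and values below are assumptions), a CMS flag set along these lines expresses the four points above:

```
-Xms4g -Xmx4g
-Xmn2g                                   # large young generation for the flood of short-lived request objects
-XX:SurvivorRatio=4                      # roomier survivor spaces so objects are not promoted prematurely
-XX:MaxTenuringThreshold=8               # give short-lived objects more chances to die young
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70    # start collecting the old generation at ~70% occupancy
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSScavengeBeforeRemark             # run a Young GC before remark to shorten the pause
-XX:CMSFullGCsBeforeCompaction=5         # compact periodically rather than on every Full GC
```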

Monitoring module

Stress testing inevitably affects online services, especially tests that probe the system's limits. We need second-level monitoring as well as a reliable circuit-breaking and degradation mechanism.

Client monitoring

The stress-test engine aggregates data every second and reports it to the monitoring module, which performs statistical analysis on everything reported. This analysis has to happen in real time so that second-level monitoring is available on the client side. The monitored data includes the TP percentile lines of response time, QPS curve fluctuations, error rates, sampled log analysis, and so on.
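A minimal sketch of a per-second window on the client side, with simplified percentile math and illustrative names; in practice a fresh window would be swapped in every second and the snapshot shipped to the monitoring module:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SecondWindow {
    private final List<Long> latenciesMs = Collections.synchronizedList(new ArrayList<>());
    private final AtomicInteger errors = new AtomicInteger();

    void record(long latencyMs, boolean success) {
        latenciesMs.add(latencyMs);
        if (!success) errors.incrementAndGet();
    }

    /** Called by a once-per-second reporter thread; returns a line ready to ship to the monitor. */
    String snapshot() {
        List<Long> copy = new ArrayList<>(latenciesMs);
        Collections.sort(copy);
        int n = copy.size();
        long tp99 = n == 0 ? 0 : copy.get(Math.min(n - 1, (int) (n * 0.99)));
        double errorRate = n == 0 ? 0.0 : errors.get() / (double) n;
        return String.format("qps=%d tp99=%dms errorRate=%.2f%%", n, tp99, errorRate * 100);
    }
}
```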

  • Real-time QPS curve

  • Error rate statistics

  • Sampled logs

Server monitoring

In addition to monitoring and analyzing the results reported by the engine, Quake also integrates the company's internal monitoring components: the Falcon system (open-sourced by Xiaomi) for machine metrics and the CAT system (open-sourced by Meituan) for service performance. Quake provides a unified configuration and management service so that businesses can easily observe the health of the whole system from within Quake.

Circuit breaker protection mechanism

Quake provides circuit-breaker protection on both the client (load-generating) side and the server (tested service) side.

The first is client-side circuit breaking. Quake analyzes the monitoring data in real time against thresholds customized by the business; when a threshold is hit, the task scheduler sends instructions to the engine to reduce QPS or abort the test outright, preventing the tested system from being overwhelmed.

The tested service also has its own circuit-breaker mechanism. Quake integrates the company's internal circuit-breaker component (Rhino) to provide circuit breaking, degradation, and rate limiting during the test. Quake also supports failure drills during a stress test, where faults are injected manually to verify the system-wide degradation plan.

Project summary

Finally, a few lessons learned from working on the Quake project:

Small steps, fast iteration

Before Quake, Meituan already had an internal stress-testing platform (Ptest) positioned as single-service performance testing. We analyzed Ptest's problems: its engine capacity was very limited, and in Meituan's early days, if two large business lines needed to run tests at the same time, machine resources often fell short and the business sides had to coordinate with each other. The upfront cost of preparing a test was also too high: users had to construct their own word lists, and for RPC services they additionally had to upload IDL files, which was very tedious.

Quake targeted these pain points, and the team spent a little more than a month building the first version and getting it running quickly. At the time, Maoyan (Cat's Eye) had a stress test coming up and became Quake's first user, and the test went well. After that we kept an average pace of one iteration every two weeks, gradually adding machine isolation, shadow-table isolation, data transformation rules, circuit-breaker protection, code-level performance profiling, and more.

Quick response

At the beginning of the project we faced a lot of customer-service work: not only usage questions, but also bugs and defects in the system itself. Back then, whenever a business line ran a large-scale stress test, our team was on standby to solve problems on the spot. Once the system stabilized, we adopted a customer-service rotation, with one team member handling support in each iteration to ensure quick responses to the problems businesses ran into. This is especially necessary in a project's early days: if the business teams have a poor experience, the project gets a poor reputation, which makes later promotion much harder.

Project promotion

This is probably a challenge for every internal project: in many cases, how well it is promoted can make the difference between life and death. In the early stage we piloted Quake in a few representative business lines; problems encountered in the pilots, and good ideas and suggestions from business colleagues, could be iterated on and shipped quickly. We then kept widening the pilot scope: large business groups including Meituan Waimai (food delivery), Maoyan, Hotel & Travel, and Finance ran several rounds of full-process, large-scale full-link stress tests on Quake.

As Quake's overall functionality matured and the pain points of Ptest (the previous stress-testing system) were addressed, we gradually rolled it out across all business lines and ran internal training. According to the data collected so far, more than 90% of Meituan's businesses have migrated from Ptest to Quake, and the overall usage statistics have improved significantly compared with Ptest.

An open ecosystem

Quake's goal was to be the company's full-link stress-testing platform, but we did not try to do everything ourselves. Some teams in the company had moved earlier and done a lot of exploratory work of their own. That is actually a good thing: if everything had to be done by the platform, the demands would be endless, and many things pushed up to the platform level might never get solved.

Quake therefore exposes many APIs for other platforms to build on, leaving highly business-specific customization to the business platforms themselves. The platform provides only the basic capabilities and data support, and our team focuses its core effort on the things that are most valuable to the platform.

Working across teams

Many teams were involved in the project, and many of the components mentioned above required support from the architecture team. Across teams, a "win-win" mentality matters. Many of the monitoring components, circuit-breaker components, and performance-analysis tools used by the Quake platform were new products from sibling teams. Integrating them into Quake, on the one hand, saved us from reinventing the wheel; on the other hand, it helped those teams promote and improve their own products.

About the author

Geng Jie is a senior engineer at Meituan Dianping. He joined Meituan-Dianping in 2017 and has successively been responsible for the full-link stress-testing project and the MagicDB database proxy project. He currently leads the overall development and promotion of these two projects, working to improve the company's overall development efficiency and quality.

Recruitment

The team is hiring engineers in Java, Go, algorithms, AI, and other technical fields, based in Beijing and Shanghai. Interested candidates are welcome to send their resumes to [email protected].