One morning I was awakened by a call from a customer, extremely anxious, asking whether a CDN could help them survive their "one-hour flash sale" (seckill). They had run such a sale the day before, but the concurrency was so high that the service went down and users complained.

To get a clear picture, I asked the customer three questions:

(1) What were the symptoms of the outage?

(2) What is the basic architecture of the business?

(3) What is the peak concurrency during the flash sale?

Following these clues, we first reconstructed the application scenario together:

Business architecture diagram of an e-commerce company

The company, a peer-to-peer (P2P) wealth-management site, regularly runs "on-the-hour" sales in which users snap up high-interest wealth-management products. As shown in the figure above, end-user requests pass through front-end load balancing to the Web servers that run the actual e-commerce logic. Behind them sit eight Redis instances on VMs, caching business data such as user profiles, product information, and user billing records. The persistent data lands in MySQL, which uses only simple database/table sharding and read/write splitting.

Before a flash sale, risk-control and operations staff select the products and flag them in the database; the product team then publishes the sale, and end users rush to buy.

The company's traffic comes mainly from mobile. It is light at ordinary times, but a flash sale instantly generates a flood of requests, with peak concurrency exceeding 100,000 (bots included). This concurrency is concentrated on two types of interfaces:

The product-refresh interface, similar to GET /get_fprod.php?uid={$1}&pid={$2}&sid={$3}.

The order interface, similar to GET /order_fprod?uid={$1}&pid={$2}&oid={$3}&sid={$4}. Requests to this interface make up less than 1% of traffic, but a large share of them time out with 504.

Here uid is the user ID, pid is the wealth-management product ID, oid is the order number, and sid is a random token that varies per client.

Interpretation

Based on the scenario established with the customer, we drew these preliminary conclusions:

(1) The customer's business is mostly mobile, and the product renders its UI on the client via APIs. There are almost no static resources and bandwidth is low, so a traditional CDN cannot offload any of the pressure;

(2) The flood of timeouts (502/504) during the flash sale shows that user requests exceeded the servers' capacity, which called for scaling out.

On these two points, I did not suggest that the company purchase a CDN service; I recommended expanding capacity instead. But as our analysis of the business deepened, we gradually noticed some strange things.

"Weird" phenomena

(1) Load between the database master and slaves was extremely unbalanced. MySQL's management tools showed the master handling as much as 80% of the query volume.

(2) Load across the Redis cache nodes was extremely unbalanced. Redis INFO showed one instance receiving more than 90% of the requests while the others stayed nearly idle.
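
A quick way to surface this kind of skew is to poll INFO across the nodes and compare per-node throughput. A minimal sketch with the redis-py client, assuming hypothetical node addresses:

```python
# Sketch: poll INFO on each cache node to compare per-node throughput.
# Assumes the redis-py client; node addresses are placeholders.
import redis

NODES = [(f"10.0.0.{i}", 6379) for i in range(1, 9)]  # the 8 cache VMs

stats = []
for host, port in NODES:
    r = redis.Redis(host=host, port=port, socket_timeout=1)
    ops = r.info("stats")["instantaneous_ops_per_sec"]
    stats.append((host, ops))

total = sum(ops for _, ops in stats) or 1
for host, ops in sorted(stats, key=lambda s: -s[1]):
    print(f"{host}: {ops} ops/s ({100 * ops / total:.1f}%)")
```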

These anomalies piqued the interest of engineers on both sides; they might be the key! As the analysis went on, the cause of the first phenomenon emerged: the company does not use a database middleware layer to route MySQL requests, as some large e-commerce platforms do. Instead, read/write splitting is implemented on the business-code side by a language-level framework. This brings two drawbacks:

(1) Programmers can bypass the framework during development, so read/write splitting is never truly enforced.

(2) Product managers demand real-time data, pushing developers to modify business logic in ways that sacrifice the split, so reads and writes both land on the master.

Cache hotspot diagram

Then the cause of the second phenomenon became clear: during a flash sale, an enormous number of users access a tiny set of products. When those products' pids hash to the same Redis instance, the cache nodes become hot-spotted, and all requests end up concentrated on one Redis, the bottleneck of the whole service!
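
The effect is easy to reproduce. A toy simulation, with made-up product IDs and plain modulo sharding over eight nodes:

```python
# Sketch: why a few hot product IDs overwhelm one cache node under
# plain modulo sharding. Product IDs and traffic mix are made up.
import random
from collections import Counter

NUM_NODES = 8
hot_pids = ["fprod_1024", "fprod_2048"]          # the products on sale
cold_pids = [f"fprod_{i}" for i in range(5000)]  # everything else

hits = Counter()
for _ in range(100_000):
    # During the sale, ~95% of requests target the hot products.
    hot = random.random() < 0.95
    pid = random.choice(hot_pids if hot else cold_pids)
    hits[hash(pid) % NUM_NODES] += 1

for node, count in hits.most_common():
    print(f"node {node}: {count} requests ({count / 1000:.1f}%)")
```

Within a run, each hot pid maps to one fixed node, so roughly 95% of the traffic lands on at most two of the eight instances.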

Suit the remedy to the case

  1. Use database middleware for read/write splitting and horizontal-scaling control

There are many benefits to using database middleware, the most important being the ability to hide database details from the business layer and gain greater control over the business. Of course, introducing a database middle tier has an obvious drawback: adding a layer of components to the overall architecture runs against the "simple and effective" design principle. For many Internet companies it is not unusual to go without a database middle tier in the early or even middle stages, but once the business grows past a certain point, introducing one does more good than harm.

Based on experience, we recommended MySQL Router to the customer. It covers the simple requirements here: connection reuse, load balancing, and read/write splitting.

MySQL Router architecture diagram

The figure above shows the official MySQL Router architecture. As you can see, MySQL Router's strength is its plug-in design; it ships with a series of plug-ins ready to use.

Besides MySQL Router, many open-source database middlewares are available, including those from Alibaba and Meituan. A database middle tier not only solves performance problems but also helps with security: auditing, rate limiting, even blocking SQL injection and bad SQL statements.
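
From the application's point of view, the change is small: instead of choosing a master or replica in code, it just connects to the Router's ports. A minimal sketch with mysql-connector-python, assuming the Router's default classic-protocol ports 6446 (read-write) and 6447 (read-only); host, credentials, schema, and table are placeholders:

```python
# Sketch: let MySQL Router do the read/write splitting instead of app code.
# Assumes mysql-connector-python; host, credentials, and schema are
# placeholders. Ports 6446/6447 are the Router's documented defaults.
import mysql.connector

def get_conn(readonly: bool):
    return mysql.connector.connect(
        host="db-router.internal",        # hypothetical Router host
        port=6447 if readonly else 6446,  # 6446: primary, 6447: replicas
        user="app", password="***", database="fprod_db",
    )

# A write (placing an order) is routed to the primary.
rw = get_conn(readonly=False)
cur = rw.cursor()
cur.execute("UPDATE fprod SET stock = stock - 1 WHERE pid = %s AND stock > 0", ("2048",))
rw.commit()
rw.close()

# A read (product page) is load-balanced across replicas.
ro = get_conn(readonly=True)
cur = ro.cursor()
cur.execute("SELECT name, rate FROM fprod WHERE pid = %s", ("2048",))
print(cur.fetchone())
ro.close()
```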

2. Use API acceleration to relieve pressure on the servers

Cache imbalance is a thornier problem. In flash-sale mode, users frequently access the information of a few wealth-management products; when that cached data sits on a single node, a huge number of requests instantly converge on one or a few nodes. This is the root cause of the cache imbalance. The problem is not unique to e-commerce flash sales; it appears in any service with instant hotspot access. Weibo is an example: its interfaces slow down, or the service even goes down, during celebrity scandals for the same underlying reason. The moment news breaks, one post spreads massively in a short time, everyone opens the same post ID, and all the traffic concentrates on a single Redis node.

How can this be solved? A cache typically shards on the key of a data structure; when a large number of users access only one or a few keys, load across the Redis cache nodes becomes unbalanced. Whether that hurts the service depends on the concurrency, but it is always a serious risk. The customer proposed a fix: split each product's cache data across multiple keys, reducing the chance that they all land on the same cache node. But this approach has major drawbacks:

(1) Code must be modified: logic that one GET used to complete now takes several.

(2) Every routine GET/SET gets multiplied: to cover the 1% of traffic that is hot, the other 99% of ordinary operations pay the extra cost too, badly violating the 80/20 rule.
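
A minimal sketch of the proposed splitting makes drawback (2) concrete: the value is duplicated under N suffixed keys so the copies scatter across nodes, reads pick one copy at random, and every write now has to touch all N (redis-py assumed; the host is a placeholder):

```python
# Sketch of the proposed hot-key splitting, to make its cost concrete.
# The value is duplicated under N suffixed keys so the copies scatter
# across nodes; reads pick one copy, writes must touch all N.
import random
import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder host
N = 8  # number of key replicas per product

def set_fprod(pid: str, data: bytes) -> None:
    for i in range(N):                  # one SET becomes N SETs
        r.set(f"fprod:{pid}:{i}", data)

def get_fprod(pid: str) -> bytes:
    # Reads spread across the replicas (and thus across nodes).
    return r.get(f"fprod:{pid}:{random.randrange(N)}")
```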

Given these problems, we recommended that the customer use the "API acceleration" feature of Baishan Cloud's CLN-X to solve it.

API acceleration

API acceleration is completely different from the link acceleration of a traditional CDN: it optimizes API requests by caching the content APIs return, combined with TCP WAN-optimization techniques. Baishan's API acceleration caches each API's response at edge nodes across the whole network with millisecond-level TTLs, rotating response data through node memory with an LRU (Least Recently Used) eviction policy. During a "hot event," the hottest content stays resident on the edge nodes, so when a client calls the API, the edge node can return the result directly without going back to the origin. The overall architecture is as follows:

API Acceleration architecture diagram

The API acceleration service provides acceleration at the network edge: it can cache API results, and it accelerates back-to-origin transport for API requests.

The conventional wisdom is that dynamic resources (APIs) cannot be cached, but Baishan's view is that "any resource can be cached; only the expiration times differ." Common static resources get long expiration times. It is not that APIs cannot be cached; they just need short ones. An API that quotes stock prices, for example, can use a 50-millisecond TTL: a 100-meter sprinter's reaction time is 100-200 ms, so 50 ms of staleness is imperceptible on PC or mobile.

Without a cache, 10,000 users hitting the backend within one second means the backend endures 10,000 concurrent requests. With a 50 ms cache TTL, back-end concurrency theoretically drops to 20 (1 s / 50 ms = 20), about 1/500 of the load; every other request is answered directly by the cache server.
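
The arithmetic is easy to check with a toy in-process cache. This sketch illustrates the mechanism only; it is not Baishan's implementation:

```python
# Sketch: a toy cache with a 50 ms TTL, showing how 10,000 requests in
# one second collapse to ~20 origin hits.
_cache = {}        # key -> (expires_at, value)
origin_calls = 0

def fetch_origin(key: str) -> str:
    global origin_calls
    origin_calls += 1
    return f"payload-for-{key}"

def cached_get(key: str, now: float, ttl: float = 0.05) -> str:
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]                  # served from cache, origin untouched
    value = fetch_origin(key)
    _cache[key] = (now + ttl, value)
    return value

# 10,000 requests spread evenly across one simulated second.
for i in range(10_000):
    cached_get("/get_fprod.php?pid=2048", now=i / 10_000)

print(f"origin handled {origin_calls} of 10,000 requests")  # prints 20
```

One origin fetch per 50 ms window: 20 misses, 9,980 cache hits.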

To sum up, Baishan's API acceleration gives customers millisecond-level caching: it improves end-user response times without hurting the experience, and it cuts the service load on the origin.

API acceleration also supports user-defined cache rules based on the query string, headers, and path, so the cache key can match the business closely. For this scenario we set the following rule:

For GET /get_fprod.php?uid={$1}&pid={$2}&sid={$3}: each wealth-management product has its own ID, and product information does not vary with the user ID or the per-client random token, so the cache key can ignore {$1} and {$3}. The edge node caches on /get_fprod.php?pid={$2} with a millisecond-level TTL.
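
Key normalization of this kind boils down to dropping the per-user parameters before the request is hashed. A sketch; which parameters are ignorable is a per-API decision:

```python
# Sketch: derive the edge cache key by dropping per-user parameters.
from urllib.parse import urlsplit, parse_qsl, urlencode

IGNORED = {"uid", "sid"}  # product data does not vary with these

def cache_key(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return f"{parts.path}?{urlencode(sorted(kept))}"

print(cache_key("/get_fprod.php?uid=42&pid=2048&sid=a9f3"))
# -> /get_fprod.php?pid=2048  (the same key for every user)
```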

How is the expiration time chosen? It depends on the service: we analyze anonymized logs provided by the customer, start with an initial TTL of 500 milliseconds, and then account for RTT so the setting suits the WAN environment; the API acceleration service captures and updates RTT automatically in real time.

Actual results

We configured API acceleration on the customer's main bottleneck interfaces and compared peak-time behavior with the service on and off along two dimensions:

(1) Average end-user response time and the proportion of HTTP 200 responses;

(2) Average load of the service clusters.

As Figure A shows, the average end-user response time during peak periods shrank from about 3 seconds to under 40 milliseconds. Figure B shows the proportion of HTTP 200 responses rising from around 70% to 100% at peak. Figure C shows back-end CPU idle climbing from around 10% to around 97% at peak. The measured comparison shows that API acceleration markedly reduces average response time and improves user experience, and is especially effective at unloading the back-end servers: with API acceleration on, back-end CPU idle stayed above 91%.

Follow-up suggestions

The database imbalance and the Redis cache imbalance are resolved, but several areas can still be improved:

1. Use a queue service to make requests asynchronous

Today the customer's final database writes go straight to MySQL with no queue in between. We recommend a queue service to buffer peak requests: under heavy traffic it can schedule requests and cap the concurrency that actually reaches the database, effectively shielding the database backend.
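
The shape of the idea, as an in-process sketch; a real deployment would put a message broker between the tiers, and the names and limits here are illustrative:

```python
# Sketch: a bounded queue between the web tier and MySQL so peak traffic
# queues up instead of hammering the database.
import queue
import threading

order_queue = queue.Queue(maxsize=10_000)  # back-pressure past this point
DB_WORKERS = 8                             # caps concurrency reaching MySQL

def write_order_to_mysql(order) -> None:
    pass  # the real INSERT, via a pooled connection, goes here

def db_worker() -> None:
    while True:
        order = order_queue.get()
        try:
            write_order_to_mysql(order)
        finally:
            order_queue.task_done()

for _ in range(DB_WORKERS):
    threading.Thread(target=db_worker, daemon=True).start()

def handle_order_request(order) -> bool:
    try:
        order_queue.put_nowait(order)  # accept the order, defer the write
        return True
    except queue.Full:
        return False                   # shed load: ask the user to retry
```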

2. Use an API firewall to block malicious bots

The user logs contain abundant, unmistakable traces of scanners such as SQLMap and fimap; they do not visibly hurt the service, but they do burn server resources. For security and efficiency, we advise blocking this scanning in front of the load balancer. Beyond malicious bots, behaviors such as order grabbing and brushing (fake orders) also hurt the business, so we recommend API protection to identify and intercept them.
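
The lowest-effort scanners announce themselves in the User-Agent header (sqlmap does by default), so a first-pass filter at the load balancer can be as simple as the sketch below. The signature list is illustrative, and forged headers need the heavier API-protection layer:

```python
# Sketch: drop obvious scanner traffic before it reaches the web tier.
# Signatures are illustrative; this only clears the low-effort noise.
SCANNER_SIGNATURES = ("sqlmap", "fimap", "nikto", "nmap")

def is_scanner(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in SCANNER_SIGNATURES)

def should_block(headers: dict) -> bool:
    return is_scanner(headers.get("User-Agent", ""))
```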

3. Design for service degradation at the product layer

The customer's business has no service-degradation design and no feature-priority tiers, so basic services such as the database and cache are shared indiscriminately by everything. A serious incident, such as a flash sale penetrating the cache and hitting the database directly, would take the whole service down. The business should be reorganized and basic services partitioned by priority: the home screen, product list, purchase, and order functions come first; non-essential features such as comments and billing history come second. Under heavy back-end load, the minor features can be dropped when necessary to shed load and keep the service stable.
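
One common realization is a priority-tagged feature switch driven by back-end load. The feature names, tiers, and thresholds below are illustrative:

```python
# Sketch: priority-tagged features with load-based shedding, so the core
# purchase path survives while optional features are switched off first.
FEATURE_PRIORITY = {
    "home_screen": 0, "product_list": 0, "purchase": 0, "order": 0,  # core
    "comments": 2, "billing_history": 2,                             # optional
}

def feature_enabled(feature: str, backend_load: float) -> bool:
    """backend_load in [0, 1]: shed tier 2 above 70%, all but core above 90%."""
    tier = FEATURE_PRIORITY.get(feature, 2)  # unknown features are optional
    if backend_load > 0.9:
        return tier == 0
    if backend_load > 0.7:
        return tier < 2
    return True
```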

Conclusion

Handling something like a "one-hour flash sale" is a complex project. The problems the customer exposed here, unbalanced database load and unbalanced cache load, can be solved with a database middle tier and API acceleration, ultimately achieving the desired results.

The flash-sale case above is just one typical application scenario for API acceleration; I will give a more systematic analysis of API acceleration in future articles.