In this article, Yu Pi shares the experience, thoughts, and reflections from his internship at Tencent, where he went from zero to urgently launching a million-level high-concurrency system within one week.

Spend five minutes on this article and you will:

  1. Get a better sense of real working environments and day-to-day working states

  2. Learn the design ideas and technology selection behind high-concurrency systems

  3. Learn how to communicate with multiple parties at work

  4. Learn how to cooperate with testers

  5. Learn how to handle emergencies

  6. Learn how to review and summarize after the fact

  7. Feel the pain and struggle of the author's work

Preface

Before the Chinese New Year, I took on an urgent project with my mentor. After working overtime to finish the first phase, the results were remarkable, so we started developing the second phase right after the holiday, with the goal of going live within a week. Because of the complicated business logic, the tight deadline, the shortage of manpower, and the many partner teams involved, the project was very difficult and challenging, so my mentor and I started working a 007 schedule.

Remote work certainly made the 007 schedule easier to sustain, and I spent most of that time coding, even dreaming about code.

Project introduction

First, let me introduce the project and the systems I was responsible for. The business details of the project cannot be disclosed, so the business is stripped away and only the key system model is introduced, as shown below:

As shown, I was responsible for a status flow system and a query system, as well as the database services they depend on.

The status flow system logically modifies the status field of a record in the database and, after a successful modification, sends notifications to other services based on the new status.

The query system, as the name implies, queries data from the database, and covers basic functions such as authentication and querying.

Let's first analyze some of the system's difficulties:

  1. The query system is a high fan-in service invoked by many other services, which brings three problems:

    High concurrency: the business sides' aggregated requests were evaluated to generate high-concurrency traffic on the order of millions.

    Compatibility: how to design an API that meets every business side's requirements while remaining easy to understand.

    Complex integration: discussing interfaces with several business-side colleagues at the same time is very complicated.

  2. The business logic of the status flow system is quite complex.

  3. The status flow system, the query system, and the other business sides interact with one another (sending notifications, calling each other), which imposes high requirements on latency, fault tolerance, and consistency.

With the difficulties analyzed, the next step before writing any code is to produce a feasible technical design.

Design ideas

In real work, it is necessary to write a detailed technical design. Excellent engineers consider various scenarios, evaluate risks, estimate workload, and record open problems in the design document. This not only helps them sort out their own thinking, but also gives others something to review and be convinced by (for example, if you estimate seven days to go live, who will trust you without a plan?).

According to the 80/20 rule, in a complex system the time spent writing the technical design and sorting out ideas versus actually writing code is roughly 8:2.

The design follows the principle of "fit the business": there is no best architecture, only the architecture most suitable for the business. Don't over-design!

In addition, given the urgency of the project and the cost of manpower, ensure availability first, then strive for perfection.

Some simple designs are skipped here. Below are the key designs and technology selections for the system's difficulties and business requirements:

1. High concurrency

When it comes to high concurrency, the first things that come to mind are caching and load balancing.

Load balancing is simply "throw money at machines!" However, saving machines and cutting costs while improving efficiency for the company is the belief of every back-end engineer, and that comes down to technology selection, architecture design, and coding. The goal is to squeeze the maximum number of concurrent requests out of each machine's limited resources.

Technology selection

The selections were as follows:

Programming framework: Jersey, a lightweight RESTful framework, together with Guice, a lightweight dependency injection library

Web server: Grizzly, a high-performance, lightweight NIO server

Cache: CKV+, Tencent's in-house massive distributed storage system (supports the Redis protocol and has a data monitoring platform)

Database sharding (splitting by database and table): the company's in-house infrastructure, not detailed here

Load balancing: the lightweight reverse proxy Nginx plus L5 load balancing; millions of concurrent requests required adding more than ten machines

CDN and pre-warming: supports efficient file downloads

Among these, the cache is the key to absorbing high-concurrency traffic and deserves careful design.

Caching scheme

1. Data structure design

Anyone who has used a cache knows that the design of the cache key is very important. Design it around your business so that cache keys never conflict and are easy to assemble and look up. Here I concatenate the request parameters with the interface's unique ID to form the key, and the paginated query interface can reuse the cache of the full query interface.
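As a concrete illustration of this key scheme, here is a minimal Java sketch (the method and parameter names are my own, not the project's actual code): the interface's unique ID is concatenated with the sorted request parameters, so the same logical query always maps to the same key.

```java
import java.util.Map;
import java.util.TreeMap;

public final class CacheKeys {

    private CacheKeys() {
    }

    /** Builds "interfaceId:a=1:b=2" style keys; parameter names and values are illustrative. */
    public static String buildCacheKey(String interfaceId, Map<String, String> params) {
        StringBuilder key = new StringBuilder(interfaceId);
        // TreeMap sorts parameters by name so "a=1, b=2" and "b=2, a=1" produce the same key.
        for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
            key.append(':').append(e.getKey()).append('=').append(e.getValue());
        }
        return key.toString();
    }
}
```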

2. Cache degradation

If the key cannot be found or the Redis connection fails, query the database directly.
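A minimal read-through sketch of this degradation rule, assuming Jedis as the cache client (the project used CKV+, which speaks the Redis protocol); the DAO interface is a placeholder, not the project's real code:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.exceptions.JedisException;

public class UserQueryService {

    /** Hypothetical DAO for the underlying database, just to make the sketch self-contained. */
    public interface UserDao {
        String findUserJsonById(String userId);
    }

    private final JedisPool jedisPool;
    private final UserDao userDao;

    public UserQueryService(JedisPool jedisPool, UserDao userDao) {
        this.jedisPool = jedisPool;
        this.userDao = userDao;
    }

    public String queryUserJson(String cacheKey, String userId) {
        try (Jedis jedis = jedisPool.getResource()) {
            String cached = jedis.get(cacheKey);
            if (cached != null) {
                return cached;                      // cache hit
            }
        } catch (JedisException e) {
            // Degradation: if the cache lookup or connection fails, fall through to the database.
        }
        return userDao.findUserJsonById(userId);    // cache miss or cache failure
    }
}
```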

3. Cache updates

When the database changes, the corresponding cache entries need to be deleted. Because some request parameters are optional, the same record can map to several possible keys. For example, with two request parameters a and b, the key may be built from a alone or from a and b together.

For interfaces whose request fields are fixed (all fields required), updating the cache is simple: assemble the single unique key and delete it.

For interfaces whose request fields are not fixed (some fields are optional), you can use Redis's SCAN command for a range scan (not KEYS!), or assemble all possible keys in a loop. For example, run SCAN to clear all caches prefixed with user1.
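Here is a minimal sketch of prefix-based cleanup with SCAN using the Jedis client; the prefix and batch size are illustrative:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class CacheCleaner {

    /** Deletes all keys with the given prefix using SCAN instead of the blocking KEYS command. */
    public void deleteByPrefix(Jedis jedis, String prefix) {
        String cursor = ScanParams.SCAN_POINTER_START;                 // "0"
        ScanParams params = new ScanParams().match(prefix + "*").count(100);
        do {
            ScanResult<String> result = jedis.scan(cursor, params);
            for (String key : result.getResult()) {
                jedis.del(key);
            }
            // Note: older Jedis 2.x versions expose the cursor as getStringCursor().
            cursor = result.getCursor();
        } while (!"0".equals(cursor));                                 // "0" means the scan finished
    }
}
```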

4. Cache penetration

Write the result to the cache whether the returned list is empty or not. However, when the service returns many different error codes, this approach is not recommended because of its complexity and cost.
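A minimal sketch of the "cache empty results" idea, with an illustrative placeholder value and TTLs (not the project's real numbers):

```java
import java.util.function.Function;

import redis.clients.jedis.Jedis;

public class PenetrationGuard {

    private static final String EMPTY_PLACEHOLDER = "__EMPTY__";

    /** dbLoader is a placeholder for the real database query. */
    public String query(Jedis jedis, String cacheKey, Function<String, String> dbLoader) {
        String cached = jedis.get(cacheKey);
        if (EMPTY_PLACEHOLDER.equals(cached)) {
            return null;                                    // known-empty result, skip the database
        }
        if (cached != null) {
            return cached;
        }
        String fromDb = dbLoader.apply(cacheKey);
        if (fromDb == null) {
            jedis.setex(cacheKey, 60, EMPTY_PLACEHOLDER);   // cache the miss briefly
        } else {
            jedis.setex(cacheKey, 600, fromDb);
        }
        return fromDb;
    }
}
```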

2. Compatibility

Compatibility mainly depends on interface design. To be compatible with multiple business sides, request and response parameters need to be designed as flexibly as possible. When designing the interface, be sure to align it with all the business sides; otherwise, one improperly designed field may sink the whole thing.

Here are three tips:

  1. Provide the documentation via an accessible link so callers can review it at any time (for example, Tencent Docs).

  2. Keep request parameters few and easy to understand; do not force compatibility through overly complex parameters. If necessary, provide a customized interface for a particular business side.

  3. Make response parameters as complete as reasonably possible (better a few more than too few). Remember that every newly added return field requires a code change, and appropriately redundant fields avoid that problem.

3. Message notification

As mentioned in the difficulties above, the status flow system interacts with the query system and other business sides by sending notifications. When the status changes, the other businesses need to be notified and the query system must update its cache immediately, so real-time delivery is a hard requirement.

There were two initial options:

  1. Each system provides a callback interface for receiving notifications. This guarantees real-time delivery, but it tightly couples the systems and hinders expansion.

  2. Use a message queue to decouple the systems and deliver messages asynchronously.

In the end the second option was adopted, using TubeMQ (a trillion-scale distributed messaging middleware developed by Tencent and incubated as an Apache open-source project), for the following reasons:

  1. The notification data from the status flow system may later be consumed by additional consumers; a message queue makes scaling easier and is less intrusive to the code.

  2. The message queue persists messages

  3. TubeMQ supports consumer load balancing and offers high performance

  4. TubeMQ has huge capacity and can store trillions of messages

  5. Using a component developed in-house makes it easier to establish a unified standard across the company

Technology and solution selection should not only serve current business needs but also take a forward-looking perspective.

4. Risk assessment

Before choosing middleware or a framework, understand it as thoroughly as possible and assess the potential risks. Most companies have their own knowledge bases, so make good use of internal resources, or search Google/Baidu.

Here I assessed the risks posed by TubeMQ from the perspectives of message reliability, message ordering, message duplication, and monitoring alarms, and found some potential issues. For example, if a consumer fails to consume a message indicating a data status change, the cache is not updated in time, leading to inconsistency between the database and the cache.

So how do we avoid these risks? I designed a scheme for message reliability and data consistency from the perspectives of both the message queue producer and the consumer.

The solution

Producer message reliability:
  1. TubeMQ guarantees that messages are delivered and automatically resends them on failure.

  2. Sending a message triggers a callback, in which the send and acknowledgement status can be checked; messages that fail to send can be put into a retry queue.

Consumer message reliability and data consistency:
  1. Retry at most three times when consumption fails

  2. If consumption still fails after the retries, write a log so the message is not lost

  3. A scheduled task reads these logs and raises an alarm
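A minimal sketch of these consumer-side rules; the handler and message types are generic placeholders rather than the actual TubeMQ consumer API:

```java
import java.util.logging.Logger;

public class RetryingConsumer {

    private static final Logger LOG = Logger.getLogger(RetryingConsumer.class.getName());
    private static final int MAX_RETRIES = 3;

    /** Placeholder for the real business handler that processes a message body. */
    public interface Handler {
        void handle(String messageBody) throws Exception;
    }

    public void consume(String messageBody, Handler handler) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                handler.handle(messageBody);
                return;                                                   // consumed successfully
            } catch (Exception e) {
                LOG.warning("consume failed, attempt " + attempt + ": " + e.getMessage());
            }
        }
        // Final failure: record the message so the scheduled task can pick it up and raise an alarm.
        LOG.severe("MESSAGE_CONSUME_FAILED body=" + messageBody);
    }
}
```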

The development process

There is actually not much to say about the development process itself: just write code according to the established technical design.

Here are a few development tips:

  1. If you are working on multiple projects at the same time, keep a separate Git branch for each project and commit and merge in batches; otherwise it is very tiring for others to read your submitted code!

  2. Report metrics for each request, such as the request count, request duration, and failure count, for monitoring and statistics (see the sketch after this list).

  3. Write plenty of logs. Detailed and clear logs help you locate faults quickly.
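Here is a minimal sketch of per-request reporting with a JAX-RS filter (Jersey implements JAX-RS; the javax namespace and the printed output are assumptions, since the real system would push these numbers to the company's monitoring platform):

```java
import java.util.concurrent.atomic.AtomicLong;

import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.container.ContainerResponseContext;
import javax.ws.rs.container.ContainerResponseFilter;
import javax.ws.rs.ext.Provider;

@Provider
public class RequestMetricsFilter implements ContainerRequestFilter, ContainerResponseFilter {

    private static final AtomicLong TOTAL = new AtomicLong();
    private static final AtomicLong FAILED = new AtomicLong();

    @Override
    public void filter(ContainerRequestContext request) {
        // Remember when the request started so the response filter can compute the duration.
        request.setProperty("metrics.start", System.nanoTime());
    }

    @Override
    public void filter(ContainerRequestContext request, ContainerResponseContext response) {
        Object started = request.getProperty("metrics.start");
        if (!(started instanceof Long)) {
            return;
        }
        long costMs = (System.nanoTime() - (Long) started) / 1_000_000;
        TOTAL.incrementAndGet();
        if (response.getStatus() >= 500) {
            FAILED.incrementAndGet();
        }
        System.out.printf("path=%s cost=%dms total=%d failed=%d%n",
                request.getUriInfo().getPath(), costMs, TOTAL.get(), FAILED.get());
    }
}
```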

Problem solving

Many problems are not detected during local development and only surface in the test and online environments. Solving them feels like a roller coaster, and the usual loop is: test => develop => test => go live => develop => test, and so on.

Two tips:

  1. When you encounter a problem, don't panic; take a few deep breaths first. The problem can be solved, and if it can't, you might be the one who gets "solved"!

  2. After solving a problem, don't get excited; take a few deep breaths first, because you will create new problems, and the new problems are often worse!

Here are some of the problems that left the deepest impression on me (Yu Pi).

1. Transaction commit error?

Cause: a function called inside a transaction opened its own transaction, so one transaction was nested inside the other, breaking isolation.

Solution: modify the code to ensure transaction isolation.
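The article does not say which transaction framework was involved, so purely as an illustration (assuming Spring's @Transactional, which the project may not have used), one common way to keep a nested call from silently joining the outer transaction is to run it with REQUIRES_NEW:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class StatusFlowService {

    private final AuditService auditService;

    public StatusFlowService(AuditService auditService) {
        this.auditService = auditService;
    }

    @Transactional
    public void changeStatus(long orderId, int newStatus) {
        // ... update the status row inside the outer transaction ...
        auditService.recordChange(orderId, newStatus);
    }
}

@Service
class AuditService {

    // REQUIRES_NEW suspends the caller's transaction and commits this one independently,
    // so a rollback in either transaction no longer drags the other one down.
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void recordChange(long orderId, int newStatus) {
        // ... insert an audit record ...
    }
}
```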

2. Dependencies exist, but the project fails to start?

Cause: multiple versions of the same JAR were on the classpath, so when the Java code used reflection to dynamically generate classes, it could not tell which JAR's classes to use.

Solution: delete the unnecessary JAR packages.

3. The cache is not updated immediately

Cause: the actual number of cache keys reached up to ten million, so updating the cache with the SCAN command was far too slow, taking more than 20 seconds.

Solution: change the cache update scheme. Instead of SCAN, assemble all the possible keys in the business code and delete them one by one.
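A minimal sketch of the replacement scheme: enumerate every key that the optional parameters could have produced and delete them directly. The parameter names and value ranges here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;

public class CacheInvalidator {

    /** Deletes all keys a given user's record could have been cached under. */
    public void invalidateUser(Jedis jedis, String interfaceId, String userId) {
        List<String> keys = new ArrayList<>();
        String base = interfaceId + ":userId=" + userId;
        keys.add(base);                                           // only the required parameter
        for (String status : new String[]{"0", "1", "2"}) {
            keys.add(base + ":status=" + status);                 // required + one optional parameter
            for (String type : new String[]{"A", "B"}) {
                keys.add(base + ":status=" + status + ":type=" + type);
            }
        }
        for (String key : keys) {
            jedis.del(key);                                       // delete each candidate key directly
        }
    }
}
```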

Think that's the end of it? Don't forget the tip above:

"After solving a problem, don't get excited; take a few deep breaths first, because you will create new problems, and the new problems are often worse!"

4. The cache is still not updated immediately?

Cause: one business side requires strong data consistency: the status in the cache and in the database must be identical. Although the cache is updated within milliseconds, it is still not strictly consistent in real time.

Solution: provide that business side with a customized interface that skips the cache and queries the database directly, ensuring the data returned is always the latest value.

5. Requests get stuck

After the service had been running for a while, all requests got blocked! My heart could not take it.

Cause: printing the thread information with jstack and analyzing the thread dump showed that the connection pool was exhausted because the Jedis cache client connections were never released. New request threads kept waiting for a Jedis connection to be freed, so everything got stuck.

Solution: add code to release the Jedis connections.
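A minimal sketch of the fix, assuming a JedisPool: since Jedis 2.x the client implements Closeable, so try-with-resources returns the connection to the pool even when an exception is thrown.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class CacheClient {

    private final JedisPool pool;

    public CacheClient(String host, int port) {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(200);   // cap the pool so exhaustion becomes visible instead of silent
        this.pool = new JedisPool(config, host, port);
    }

    public String get(String key) {
        // try-with-resources closes the Jedis instance, returning the connection to the pool.
        try (Jedis jedis = pool.getResource()) {
            return jedis.get(key);
        }
    }
}
```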

6. An alarm fires while analyzing logs in the online environment: disk I/O usage exceeds 99%

Cause: the cat command was mistakenly used to view an original, unsplit log file. The file was too large (tens of GB), which overwhelmed the disk I/O.

Solution: use the less, tail, or head commands instead of cat, and delete the large log files that had already been backed up.

7. The process crashes and exits

Troubleshooting: a JVM process crash usually leaves an error log, but none was found, and the investigation hit a dead end. There was nothing to do but pray the problem would not recur. And then it really did not. Thank you!

Cause: it later turned out that someone had manually killed the process. Okay then, ***.

8. A message notification in the online environment was sent successfully, so why is there no expected data update?

First check whether the message was consumed, then check whether it was processed correctly.

Check: the online logs showed that the messages were never consumed. But the monitoring dashboard showed the message had been consumed by a machine in the test environment!

Cause: the test environment and the online environment belonged to the same consumer group. When a message arrived, only one consumer in the group could consume it, and it was consumed by the test environment, so the online data was never updated.

The problem was discovered late at night the day before launch. It was too late to apply for a new consumer group, so we could only take down the test environment's service first. The next day, once the new consumer groups were approved, consumer groups were separated by environment so that each environment consumes messages independently, successfully avoiding message competition.

9. Report! The traffic is too big to hold!

Cause: not enough machines; urgent scale-out was needed.

Solution: ten new machines were urgently requested. After the initial configuration was complete and the new machines were deployed, the supported concurrency went up.

Tip: when you need to perform the same operation on many machines, there are two quick ways to do it.

  1. Use the parallel-execution feature of an SSH client to send commands to all machines at once (XShell supports this)

  2. After configuring one machine, run the rsync command to synchronize the configuration to the other machines

10. The day before launch, you tell me there is a problem with the interface design?

Cause: a serious communication problem!

At work, some colleagues may be too busy to review the interface design in time, and only once they get around to it do they repeatedly @ you and message you privately. Don't let this happen!

Solution: an emergency conference call and a group chat to re-check the design together.

11. There’s a bug online!

An online bug is a serious matter that requires an urgent response. Get up even if you are dreaming!

Cause: the test environment can never be identical to the online environment, so testing cannot catch every problem. That is why a pre-release environment is usually needed for verification: it uses online data but runs on an independent server, so the online data is not affected.

Solution: emergency troubleshooting located the problem, and it was fixed successfully in three minutes!

There is a certain technique to fixing bugs. Here is my personal troubleshooting path:

Screenshot/problem description => the corresponding request => whether the bug can be reproduced (in close cooperation with the testers) => the data => the data source (whether the real data matches the interface data) => the data processing logic

To explain:

Usually the people who find problems are users, operations staff, or testers. They will throw you a question or a screenshot of the problem. At that point, you should quickly map the problem to the function it corresponds to (that is, the corresponding request/interface), and then ask the reporter to provide as much information as possible (request parameters, time, and so on).

If the problem has existed for a long time, ask whether a fresh case can be created to reproduce it; then you only need to look at the latest logs, which makes troubleshooting much easier.

Once the request is located, analyze which data in the request and response is abnormal, that is, locate the key data; then locate the source of that data (the database or the cache) and check whether the response data matches the real data source. If they are inconsistent, there may be a problem in the business logic's data processing, which calls for further analysis.

Advice for efficient communication: when describing a problem, let the data speak whenever possible, and provide screenshots along with complete data, requests, and other information to help others analyze it.

12. Some incorrect data appears online

This was a foreseeable problem. Fortunately, email alarms had been configured in the project to report information about the erroneous data, and the amount of bad data was small.

Solution: after fixing the bug that caused the incorrect data, write a program that loops over all the error records and generates the corresponding requests, then execute those requests manually to refresh the affected data online.
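A minimal sketch of such a repair script, assuming the failed record IDs are exported to a local file and that a hypothetical refresh endpoint exists (both the file name and the URL are made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ErrorDataReplayer {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One erroneous record ID per line, exported from the email alarm data (an assumption).
        List<String> errorIds = Files.readAllLines(Path.of("error_ids.txt"));
        for (String id : errorIds) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.internal/refresh?id=" + id.trim()))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("id=" + id + " status=" + response.statusCode());
        }
    }
}
```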

Suggestion: consider risks as much as possible during design, and set up tiered alarm policies according to problem severity (SMS > email > IM).

13. An online machine hits OOM!

Three days after launch, some online machines even hit OOM (heap memory overflow), making the service unavailable. Investigation showed that the version of a third-party middleware we were using had a bug. So before adopting a component, do thorough research and risk assessment and choose a correct, stable version.

Blood lessons

  1. Solve as many problems as possible in the test environment; online problems are very hard on the heart.

  2. Don't be overly optimistic and assume everything is fine once it's online. Check more and stay alert.

  3. When using third-party dependencies, strictly check the dependency version numbers and make sure the versions are stable. Using old or inconsistent versions can cause serious bugs!

If you find a problem after launch, you will go through the following process, which I (ironically) call the "happy" process.

For example, when you find a bug in the DB service, you may only need to change one line of DB service code. Even so, you have to:

  1. Modify that one line of DB service code

  2. Run the unit tests

  3. Package the DB service into a dependency package

  4. Update the DB service dependency in the status flow system and the query system (bump the version number / refresh the local cache to pull the latest package)

  5. Re-deploy the status flow system and the query system to the test environment

  6. Possibly hand it back to the testers for regression testing

  7. After the tests pass, commit the code of the status flow system and the query system again and initiate a CR (code review)

  8. Have a colleague or the team lead review the code through the CR

  9. Merge the branches

  10. Publish the status flow system and the query system to the online environment, verifying machine by machine (rolling deployment)

  11. Find a new bug again

It is exhausting, but steps 2, 6, and 8 involve idle waiting time. Use that time to do other work and to write down your work items, problems, and so on.

Summary and reflections

First, a summary of the time spent on each stage of this project:

Understanding requirements: 5%

Development: 15%

Communicating and confirming issues: 30%

Testing and validation: 30%

Launch and verification: 20%

Fixing bugs took roughly 60% of the time spent in the latter few stages.

Problems during the project:

  1. I did not participate in the requirement review in the early stage, so I had too little information.

  2. The night before launch, interfaces were still being aligned on the fly? That should have been settled during the design-communication stage.

  3. About 80% of the time was spent communicating, querying data, providing data, and validating it.

  4. The business sides started joint integration testing before their own testing was finished, so the same bug was found by multiple parties who repeatedly @-ed me, making bug fixing inefficient.

  5. Unfamiliarity with the in-house middleware led to a high time cost.

  6. I lacked the big-picture view needed to anticipate some of the problems in advance.

  7. Insufficient research on the middleware and failure to check the dependency version numbers at the start led to the online OOM.

What went well:

  1. Close cooperation and mutual understanding with the testers made testing efficient

  2. A detailed interface document was written for the query system and uploaded to the company's knowledge base for everyone to consult at any time

  3. An urgent online bug fix completed in 3 minutes

  4. 30 minutes from acceptance to going live

  5. When a middleware problem was found, I communicated with the other side immediately and designed a low-cost solution with no impact on them

  6. Actively helped other colleagues query data and troubleshoot problems

  7. Wrote scripts to efficiently fix part of the erroneous data

Growth and gains:

  1. Ability to stay up late under pressure ↑

  2. Design thinking ability ↑

  3. Communication skills ↑

  4. Problem-solving skills ↑

  5. Familiarity with advanced commands ↑

  6. Familiarity with middleware ↑

  7. Cluster management ability ↑

  8. Ability to push back on requirements ↑

  9. Ability to banter ↑

  10. Ability to brag 🐂 ↑

Follow-up

After the project went live, through review and summary I found areas in the project worth optimizing and thought about some more robust mechanisms, which will be implemented gradually. For example:

1. Some configuration is duplicated across the two systems

Currently the same configuration is synchronized by copy-paste. That is simple, but the drawback is obvious: if one system's configuration is changed and the other's is forgotten, errors will occur.

In fact, a configuration center can be introduced to centrally manage the configuration files of multiple systems, with support for online modification, multiple environments, gray releases, configuration version rollback, and so on.

You can use Alibaba's Nacos or Ctrip's Apollo, both of which provide an interface for managing configuration.
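For example, a minimal sketch based on the Nacos Java SDK quick start (an assumption; verify the API against the SDK version you actually use) that reads a shared config once and listens for changes:

```java
import java.util.Properties;
import java.util.concurrent.Executor;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;

public class SharedConfigDemo {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("serverAddr", "127.0.0.1:8848");      // Nacos server address (placeholder)

        ConfigService configService = NacosFactory.createConfigService(props);
        String dataId = "common.properties";            // config shared by both systems (placeholder)
        String group = "DEFAULT_GROUP";

        String content = configService.getConfig(dataId, group, 5000);
        System.out.println("initial config: " + content);

        configService.addListener(dataId, group, new Listener() {
            @Override
            public Executor getExecutor() {
                return null;                             // run the callback on the SDK's thread
            }

            @Override
            public void receiveConfigInfo(String configInfo) {
                System.out.println("config changed: " + configInfo);   // reload in place here
            }
        });
    }
}
```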

2. The process-crash problem must be taken seriously!

At the very least, the process can be monitored in real time and restarted automatically when it exits.

There are two ways to do this:

  1. Use a process-management tool such as Supervisor or Monit to supervise and restart processes

  2. Write shell scripts and use scheduled tasks to periodically check the process status and restart it. It is recommended to connect scheduled tasks to a distributed task scheduling platform: when there are many scheduled tasks, visual management and convenient control and scheduling become necessary.

3. Ensure message queue reliability

  1. Message retransmission mechanism: as mentioned in the scheme above, design a retransmission queue (or dead-letter queue) whose messages are sent first the next time sending happens. However, to avoid infinite retransmission, set a retransmission threshold for each message.

  2. Email alarm: if the number of retransmissions exceeds the threshold, send an email alarm directly instead of putting the message back into the queue.
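A minimal sketch of these two rules together; the alarm mailer is a placeholder interface, and the threshold is an illustrative number:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class RetransmissionQueue {

    private static final int MAX_RETRIES = 5;           // illustrative threshold

    /** Placeholder for the real alarm/email component. */
    public interface AlarmMailer {
        void send(String subject, String body);
    }

    static final class PendingMessage {
        final String body;
        int attempts;

        PendingMessage(String body) {
            this.body = body;
        }
    }

    private final ConcurrentLinkedQueue<PendingMessage> queue = new ConcurrentLinkedQueue<>();
    private final AlarmMailer mailer;

    public RetransmissionQueue(AlarmMailer mailer) {
        this.mailer = mailer;
    }

    /** Called from the producer's send callback when a message fails to send. */
    public void onSendFailed(PendingMessage msg) {
        msg.attempts++;
        if (msg.attempts > MAX_RETRIES) {
            // Over the threshold: stop requeueing to avoid an infinite retry loop and alarm instead.
            mailer.send("Message send failed " + msg.attempts + " times", msg.body);
        } else {
            queue.offer(msg);
        }
    }

    /** Messages waiting here are sent first on the next send cycle. */
    public PendingMessage pollNext() {
        return queue.poll();
    }
}
```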

The job really is simple yet not simple. Who says back-end work is just CRUD?

That's nearly 7,000 words. I hope you enjoyed this article!
