In this article, Yupi shares the experience, thinking and lessons from his internship at Tencent, where he took a high-concurrency system handling millions of requests from zero to an urgent launch in one week.

Take 5 minutes to read this article and you will learn:

1. Get a better picture of the real working environment and working conditions

2. Learn the design ideas and technology choices behind a high-concurrency system

3. Learn how to communicate with multiple parties at work

4. Learn how to work with testers

5. Learn how to handle emergencies

6. Learn how to run a retrospective afterwards

7. Feel the pain and struggle of the author’s work

Preface

Before the Lunar New Year, my mentor and I took over an urgent project. We worked overtime to finish the first phase, and the results were remarkable, so right after the holiday we started the second phase with the goal of going live within a week. The business logic was complicated, the schedule was tight, we were short of people and many partner teams were involved, so the project was very difficult and challenging. My mentor and I therefore started working “007” (from 0:00 to 0:00, 7 days a week).

Working remotely certainly made the 007 schedule easier to sustain, and for most of that time I was writing code even in my dreams.

Project introduction

First, I will introduce the project and the systems I was responsible for. Naturally, the project background and business details cannot be disclosed, so the business is stripped away here and only the key system model is introduced, as shown below:

As shown, I was responsible for a state flow system and a query system, as well as the database services they depend on.

The state flow system modifies the status field of a record in the database according to the business logic and, after a successful modification, sends notifications to other services according to the new status.

The query system, as the name implies, queries data from the database and provides basic functions such as authentication and querying.

First, let us analyze the difficulties in the system:

1. The query system is a high fan-in service that is invoked by many other services. This raises three problems:

High concurrency: After evaluation, the requests from all business sides added together are expected to reach the order of millions of concurrent requests.

Compatibility: How to design an API that meets the requirements of every business side while remaining easy to understand.

Complex coordination: Discussing interfaces with several business-side colleagues at the same time is very complicated.

2. The business logic of the state flow system is quite complex.

3. The state flow system, the query system and other business sides interact with one another (sending notifications, calling each other), which places high demands on latency, fault tolerance and consistency.

With the difficulties analyzed, the next step, before writing any code, is to write a feasible technical plan.

Design ideas

In real work, it is necessary to write a detailed technical plan. Excellent engineers consider various scenarios, evaluate risks, estimate the workload and record open problems in the technical plan. This not only helps them sort out and summarize their own ideas, but also gives others something to refer to and be convinced by (for example, if you estimate that the launch takes 7 days, who will believe you without a plan?).

Following the 80/20 rule, in a complex system the ratio of time spent on writing the technical plan and working out the design to time spent on actual coding is about 8:2.

Design follows the principle of “fit the business”: there is no best architecture, only the architecture that suits the business best. Don’t over-design!

In addition, given the urgency of the project and the cost of manpower, ensure availability first and strive for perfection afterwards.

Some simple designs are skipped here. Below are the key designs and technology choices that address the system’s difficulties and the business requirements:

1. High concurrency

When it comes to high concurrency, the first thing that comes to mind is caching and load balancing.

Load balancing, put bluntly, means spending money on more machines. However, every back-end engineer believes in saving machines and cost for the company, and that depends on technology selection and architecture design. The goal is to make full use of each machine’s resources so that it can withstand as many concurrent requests as possible.

The selection is as follows:

Programming framework: Jersey, a lightweight RESTful framework, together with Guice, a lightweight dependency injection library

Web server: Grizzly, a high-performance, lightweight NIO server

Cache: CKV+, a massive distributed storage system developed by Tencent (supports the Redis protocol and comes with a data monitoring platform)

Database sharding (splitting databases and tables): the company’s in-house infrastructure, not covered in detail here

Load balancing: the lightweight reverse proxy server Nginx plus L5 load balancing; handling millions of concurrent requests requires adding more than ten machines

CDN and cache warm-up: supports an efficient file download service

Among these, the cache is the key to withstanding high-concurrency traffic and deserves careful design.

Caching scheme

1. Data structure design

Anyone who has used a cache knows that the design of the cache key is very important. Depending on your business, make sure that cache keys do not conflict and are easy to look up. Here I concatenate the request parameters and the interface’s unique ID to form the key. The paginated query interface can also reuse the cache of the full query interface.

2. Cache degradation

If the key is not found or the Redis connection fails, query the database directly.

3. Cache updates

When the database changes, the corresponding cache entries need to be deleted. Because some request parameters are optional, the key to delete is not unique. For example, with two request parameters A and B, the cached key may have been built from A alone or from A and B.

For an interface with fixed request fields (all fields are required), updating the cache is simple: build the unique key and delete it.

For an interface whose request fields are not fixed (some fields are optional), you can either use the Redis SCAN command to scan by pattern (not KEYS!) or loop over and build all possible keys. For example, run SCAN to clear all cache entries whose keys start with user1.
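The "build all possible keys" approach can be sketched as follows. This is only an illustration, not the project's actual code: the class name CacheKeys, the key layout (interface ID followed by sorted name=value pairs) and the interface ID queryUser are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: build cache keys and enumerate every key that an update may need to delete. */
class CacheKeys {

    /** Joins the interface ID and the sorted parameters, e.g. "queryUser:a=1:b=2" (assumed layout). */
    static String buildKey(String interfaceId, Map<String, String> params) {
        StringBuilder key = new StringBuilder(interfaceId);
        // Sorting by parameter name makes the same request always produce the same key.
        for (Map.Entry<String, String> e : new TreeMap<String, String>(params).entrySet()) {
            key.append(':').append(e.getKey()).append('=').append(e.getValue());
        }
        return key.toString();
    }

    /** Returns one key per subset of the optional parameters (2^n keys for n optional parameters). */
    static List<String> allPossibleKeys(String interfaceId,
                                        Map<String, String> required,
                                        Map<String, String> optional) {
        List<Map.Entry<String, String>> opts =
                new ArrayList<>(new TreeMap<String, String>(optional).entrySet());
        List<String> keys = new ArrayList<>();
        for (int mask = 0; mask < (1 << opts.size()); mask++) {
            Map<String, String> params = new TreeMap<>(required);
            for (int i = 0; i < opts.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    params.put(opts.get(i).getKey(), opts.get(i).getValue());
                }
            }
            keys.add(buildKey(interfaceId, params));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Parameter a is required and b is optional, so two keys may exist in the cache.
        System.out.println(allPossibleKeys("queryUser", Map.of("a", "1"), Map.of("b", "2")));
    }
}
```

The number of keys doubles with every optional parameter, so this only stays cheap while the interface has a handful of optional fields.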

4. Cache penetration

To prevent cache penetration, the query result is written to the cache whether or not the returned list is empty. However, this method is not recommended when the service can return multiple error codes, because the complexity and cost become high.
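Cache degradation and the caching of empty results can be combined in one read path. Below is a minimal sketch under stated assumptions: the Cache interface and MapCache are in-memory stand-ins for a Redis-protocol client, and the database is represented by a plain function. None of these names come from the original project.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: a read path that degrades to the database and also caches empty results. */
class CachedQuery {

    /** Stand-in for the cache client; a real client may throw when the connection fails. */
    interface Cache {
        List<String> get(String key);
        void put(String key, List<String> value);
    }

    /** In-memory cache used only for the demo. */
    static final class MapCache implements Cache {
        private final Map<String, List<String>> data = new ConcurrentHashMap<>();
        public List<String> get(String key) { return data.get(key); }
        public void put(String key, List<String> value) { data.put(key, value); }
    }

    private final Cache cache;
    private final Function<String, List<String>> database; // hypothetical database lookup

    CachedQuery(Cache cache, Function<String, List<String>> database) {
        this.cache = cache;
        this.database = database;
    }

    List<String> query(String key) {
        try {
            List<String> cached = cache.get(key);
            if (cached != null) {
                return cached; // cache hit, including a cached empty list
            }
        } catch (RuntimeException e) {
            // Cache degradation: the cache is unavailable, so answer from the database.
            return load(key);
        }
        List<String> rows = load(key);
        try {
            cache.put(key, rows); // empty lists are cached too, so repeated misses stop reaching the database
        } catch (RuntimeException e) {
            // A failed cache write must not fail the request.
        }
        return rows;
    }

    private List<String> load(String key) {
        List<String> rows = database.apply(key);
        return rows == null ? Collections.<String>emptyList() : rows;
    }

    public static void main(String[] args) {
        CachedQuery q = new CachedQuery(new MapCache(), key -> null); // a database that finds nothing
        System.out.println(q.query("user1")); // first call reaches the database
        System.out.println(q.query("user1")); // second call is served from the cache
    }
}
```

In a real deployment the cached empty result would normally get a short expiry, so that data created later becomes visible; the sketch leaves expiry out.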

2. Compatibility

Compatibility mainly depends on interface design. To be compatible with multiple business sides, request and response parameters need to be as flexible as possible. When designing the interface, be sure to align it with all the business sides; otherwise, a single badly designed field can ruin everything.

Here are three tips:

1. Provide documents with accessible links for callers to look up immediately (such as Tencent documents).

2. The request parameters should not be too many and should be easy to understand. Do not set too complicated parameters to force compatibility.

3. Return as many response parameters as possible (better too many than too few). Every time you add a return field later you have to change the code, and a reasonable number of redundant fields avoids this problem.

3. Message notification

As mentioned in the difficulties above, the state flow system, the query system and other business sides send notifications to one another. When a state changes, other services need to be notified and the query system needs to update its cache immediately, so the messages have a high real-time requirement.

Two options were considered at first:

1. Each system provides a callback interface for receiving notifications. It can ensure real-time performance, but the tight coupling between the systems is not conducive to expansion.

2. Use message queues to decouple applications and send asynchronous messages.

In the end, the second option was adopted, and TubeMQ (a trillion-scale distributed messaging middleware developed by Tencent, open-sourced and incubated at Apache) was selected, for the following reasons:

1. The notification data of the state flow system may have other consumers later. A message queue makes this easy to extend and is less intrusive to the code.

2. Message queues persist messages

3. TubeMQ supports consumer load balancing with high performance

4. TubeMQ has a large capacity, which can store trillions of messages

5. Supporting the company’s in-house components makes it easier to form a unified standard

Technology and solution selection should not only focus on current business needs, but also be forward-looking.

4. Risk assessment

Before choosing a middleware or framework, learn as much about it as possible and assess the potential risks. Companies generally have their own knowledge base, so make good use of internal resources, or search Google or Baidu.

Here I assessed the risks of TubeMQ from the perspectives of message reliability, message ordering, message duplication, and monitoring and alerting, and found some possible risks. For example, when a consumer fails to consume a message indicating a change of data state, the cache is not updated in time, which leaves the database and the cache inconsistent.

So, how to avoid risk? I designed a solution for message reliability and data consistency from the perspective of message queue producers and consumers.

The solution

Producer message reliability:

1. TubeMQ guarantees that a message is delivered and automatically resends it if sending fails.

2. When sending finishes, a callback is triggered. In the callback, you can check the send status and confirm the message.

Consumer message reliability and data consistency:

1. If consumption fails, retry at most three times

2. If consumption still fails, write the message to a log to ensure that it is not lost

3. A scheduled task reads the log and consumes the message again; if consumption fails again, an alarm is raised
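The first two consumer-side steps can be sketched like this. It is a simplified illustration, not TubeMQ client code: the handler stands for the business logic (for example deleting a cache entry), an in-memory list stands in for the durable failure log, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: retry a failed message a bounded number of times, then record it for later replay. */
class RetryingConsumer {
    static final int MAX_RETRIES = 3;

    private final Consumer<String> handler;                   // business logic for one message
    private final List<String> failedLog = new ArrayList<>(); // stand-in for a durable failure log

    RetryingConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the message was processed, false if it ended up in the failure log. */
    boolean onMessage(String message) {
        // One initial attempt plus at most MAX_RETRIES retries.
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                // Consumption failed; fall through and try again.
            }
        }
        failedLog.add(message); // not lost: a scheduled task can read this log and replay the message
        return false;
    }

    List<String> failedMessages() {
        return failedLog;
    }

    public static void main(String[] args) {
        RetryingConsumer consumer = new RetryingConsumer(m -> {
            throw new IllegalStateException("cache unavailable");
        });
        System.out.println(consumer.onMessage("state-changed:42")); // false, all attempts failed
        System.out.println(consumer.failedMessages());              // the message is kept for replay
    }
}
```

A production version would usually wait between retries and make the handler idempotent, since a message may be processed more than once.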

The development process

There is actually not much to say about the development process: write the code according to the technical plan.

Here are a few tips:

1. If you are working on multiple projects at the same time, you can create a separate Git branch for each project.

2. Report data for each request, such as the number of requests, request duration, and number of failed requests, for monitoring statistics.

3. Log generously. Detailed and clear logs help you locate faults quickly

Problem solving

Many problems cannot be detected during local development and only show up in the test and online environments. Solving problems is like riding a roller coaster, and the usual cycle is: test => develop => test => go live => develop => test, and so on.

Two tips:

1. When you encounter a problem, don’t panic. Take a few deep breaths, because problems can always be solved. And if this one cannot be solved, you may be the one who gets “solved”!

2. After solving the problem, don’t get excited and take a few deep breaths, because you will create new problems, and often the new problems are worse!

Here are some of the problems that left a deep impression on Yupi.

1. Transaction commit error?

Cause: A function called inside a transaction opened a transaction of its own, so one transaction was nested inside another, which broke isolation.

Solution: Modify the code to ensure transaction isolation.

2. The dependency exists, but the project fails to start?

Cause: Multiple versions of the same JAR package were present, so when Java code used reflection to generate classes dynamically, it could not tell which JAR’s classes to use.

Solution: Delete the unnecessary JAR packages.

3. The cache is not updated immediately

Cause: The actual number of cache keys reached 10 million. As a result, updating the cache with the SCAN command was far too slow and took more than 20 seconds.

Solution: Change the way the cache is updated. Instead of using the SCAN command, build all possible keys in the business code and delete them one by one.

Think that’s the end of the matter? Don’t forget the tips above:

“Once you’ve solved the problem, don’t get excited, take a few deep breaths, because you’ll create new problems, and often the new problems are worse!”

4. Cache still not updated immediately?

Cause: A service requires strong data consistency. The status in the cache and database must be the same. The cache, although updated in milliseconds, is not consistent in real time.

Solution: Customize an interface for the service side. This interface does not query the cache but directly searches the database to ensure that the data queried is the latest value.

5. Request stuck

After the service runs for some time, all requests are blocked! The heart can’t stand it.

Cause: After printing thread information with jstack and analyzing the thread dump file, I found that the connection pool was exhausted because the code using the Jedis cache client never released its connections manually. As a result, every new request thread kept waiting for a Jedis connection to be released and got stuck.

Solution: Add code to release the Jedis connection.
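Why a missing release blocks every request can be shown with a toy pool. This is a sketch built only on the standard library; TinyPool and its methods are hypothetical and are not the Jedis API. The same try-with-resources pattern should carry over to Jedis, where a connection borrowed from a JedisPool is handed back when it is closed.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Sketch: a tiny connection pool showing why every borrowed connection must be returned. */
class TinyPool {
    private final Semaphore permits;

    TinyPool(int size) {
        this.permits = new Semaphore(size);
    }

    /** A borrowed connection; close() hands its permit back to the pool. */
    final class Connection implements AutoCloseable {
        private boolean closed;

        String get(String key) {
            return "value-of-" + key; // placeholder for a real cache read
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }

    /** Waits up to timeoutMillis for a free connection instead of blocking forever. */
    Connection borrow(long timeoutMillis) {
        try {
            if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("connection pool exhausted");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        return new Connection();
    }

    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(1);
        // try-with-resources returns the connection even if the body throws.
        try (Connection c = pool.borrow(100)) {
            System.out.println(c.get("user1"));
        }
        System.out.println(pool.available()); // the single connection is back in the pool
    }
}
```

Borrowing with a timeout is a second line of defense: a leak then surfaces as a clear error instead of silently hanging every request.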

6. While analyzing logs in the online environment, an alarm fired: disk I/O usage exceeded 99%.

Cause: The cat command was mistakenly used to view the original, unsplit log file. The file was too large (tens of GB), which overwhelmed the disk I/O directly.

Solution: Use the less, tail and head commands instead of cat, and delete large log files that have already been backed up.

7. The process exits unexpectedly

Troubleshooting: Normally a JVM process leaves an error log when it exits unexpectedly, but none could be found, and the investigation hit a dead end. There was nothing to do but pray that the problem would not recur. And then the problem really did not recur, thank goodness!

Cause: It later turned out that someone had killed the process manually. Okay, ***.

8. The notification was sent successfully in the online environment, so why was the data not updated as expected?

First check whether the message was consumed, and then check whether it was processed correctly.

Check: The online logs showed that the messages had not been consumed. But the monitoring dashboard showed that the message had been consumed by a machine in the test environment!

Cause: The test environment and the online environment belonged to the same consumer group. When a message arrived, only one consumer in the consumer group could consume it, and it was consumed by the test environment, so the online data was not updated.

The problem was discovered late on the night before the launch. It was too late to apply for a new consumer group, so we could only shut down the service in the test environment first. The next day, after applying for consumer groups, each environment was assigned its own consumer group, so that each group consumes messages independently and the competition for messages is avoided.
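The effect of sharing a consumer group can be reproduced with a toy model. The sketch below is not TubeMQ code and makes simplifying assumptions: every group sees every message, and within a group each message goes to exactly one member, chosen round-robin. The group naming scheme in groupFor is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: a toy broker showing why test and online consumers need separate consumer groups. */
class ConsumerGroupDemo {

    /**
     * Every group sees every message, but within a group each message goes to
     * exactly one member (round-robin here). Returns consumer name -> messages received.
     */
    static Map<String, List<String>> deliver(List<String> messages, Map<String, List<String>> groups) {
        Map<String, List<String>> received = new TreeMap<>();
        for (List<String> members : groups.values()) {
            for (String member : members) {
                received.put(member, new ArrayList<>());
            }
        }
        for (int i = 0; i < messages.size(); i++) {
            for (List<String> members : groups.values()) {
                String target = members.get(i % members.size());
                received.get(target).add(messages.get(i));
            }
        }
        return received;
    }

    /** One group per environment, e.g. "query-system-prod" (hypothetical naming scheme). */
    static String groupFor(String service, String env) {
        return service + "-" + env;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("m1", "m2");
        // Shared group: the two environments compete, so the online consumer misses m2.
        System.out.println(deliver(messages, Map.of("query-system", List.of("prod", "test"))));
        // One group per environment: both consumers receive every message.
        System.out.println(deliver(messages, Map.of(
                groupFor("query-system", "prod"), List.of("prod"),
                groupFor("query-system", "test"), List.of("test"))));
    }
}
```

Deriving the group name from the environment in configuration, rather than hard-coding it, makes it harder for two environments to end up in the same group by accident.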

9. Report! The traffic is too heavy to handle!

Cause: There were not enough machines, and capacity had to be expanded urgently.

Solution: Ten new machines were requested urgently. Once they had been configured and deployed, the system could handle more concurrent requests.

Tip: When the same operation has to be done on multiple machines, there are two quick ways to do it.

1. Use the parallel operation feature of an SSH client to send the same commands to all machines at once (XShell supports this)

2. After configuring one machine, run the rsync command to synchronize the configuration to the other machines

10. The day before launch, you tell me there is something wrong with the interface design?

Cause: Serious communication problems!

At work, some colleagues are so busy with their own tasks that they do not check the interface design carefully. Only when they are done do they @ you repeatedly and message you privately. We must not do that!

Solution: Hold an emergency conference call and create a group chat to go through the design together.

11. There’s a bug online!

An online bug is a serious matter that requires an urgent response. Get up even if you are in the middle of a dream!

Cause: The test environment may not be identical to the online environment, so it may not reveal every problem. For this reason, a pre-release environment is usually needed for verification. It uses online data but runs on independent servers, so that online data is not affected.

Solution: Emergency troubleshooting located the problem, and it was fixed successfully within three minutes!

Fixing bugs takes some skill. Here is my personal troubleshooting path:

Screenshot/problem => Request => Whether the bug can be reproduced, closely cooperating with the test => Data => Data source (whether the real data is consistent with the interface data) => Data processing

Explanation:

Problems are usually found by users, operations staff or testers, who send over a question or a screenshot of the problem. At this point we should quickly work out which function (that is, which request or interface) the problem corresponds to, and then ask the person who reported it to provide as much information as possible (such as the request parameters and the time).

If the problem occurred a long time ago, you can ask whether a case that reproduces it can be created. In this way, you only need to look at the latest logs, which makes troubleshooting easier.

Once the request is located, we analyze which data in the request and response is abnormal, that is, we locate the key data. Then we locate the source of that data (the database or the cache) and check whether the response data is consistent with the real data source. If they are inconsistent, the business logic may be processing the data incorrectly and needs further analysis.

Advice for efficient communication: When describing a problem, let the data speak. Along with screenshots, provide complete data, requests and other information to help others analyze the problem.

12. Some incorrect data appears online

This is a predictable problem. Fortunately, email alarms have been configured in the project to report information about error data, and the amount of error data is not large.

Solution: After fixing the bug that caused the incorrect data, write a program that loops over all the error records and generates the corresponding request code, and then execute the generated requests manually to refresh the asynchronous data online.

Suggestion: Risk should be considered as far as possible in the design, and hierarchical alarm policies can be made according to the severity of the problem (SMS > email > communication software).

13. Online machine OOM!

Three days after the launch, a problem appeared: some online machines even ran out of memory (OOM, heap memory overflow), which made the service unavailable. Investigation showed that the version of a third-party middleware in use had bugs. Therefore, before using a component, it is necessary to research it thoroughly and assess the risks in order to select the correct version.

Lessons learned the hard way

1. Solve as many problems as possible in the test environment, because online problems are very unfriendly to the heart.

2. Don’t be blindly optimistic and think it’s ok to go online. Check more and stay alert.

3. When using a third-party dependency, check the dependency version number to ensure that the dependency version is stable. Using older or inconsistent versions can cause serious bugs!

If you find a problem after launch, you go through the following process, which I call the HAPY process.

For example, when you find a bug in your DB service, you only need to change one line of DB service code. However, you have to:

1. Modify a line of DB service code

2. Run unit tests

3. The DB service is packaged into a dependency package

4. Modify the dependency package of “State Flow System” and “Query System” on DB service (change version number/update local cache to pull the latest package)

5. Re-release Status Flow System and Query System to the test environment

6. You may have to ask the testers to run regression tests again

7. After passing the test, submit the code of “State Flow System” and “Query System” again and initiate CR (code review)

8. Have a colleague or Leader read the code and pass CR

9. Merge branches

10. Release “state flow system” and “query system” to the online environment, and verify each machine (rolling deployment).

11. Find new bugs again

It is extremely painful, but steps 2, 6 and 8 involve idle waiting time. We can use this time to do other work or to write down what we did, the problems we ran into, and so on.

Conclusion

First of all, here is the share of time taken by each stage of this project:

Understand requirements: 5%

Development: 15%

Communication confirmation issues: 30%

Testing and validation: 30%

On-line and verification: 20%

Bug fixing runs through the last few stages and takes about 60% of the time.

Problems in the project process:

1. I did not take part in the requirements review in the early stage and knew very little about the background.

2. Interfaces were still being aligned at the last minute, the night before the launch. This should have been confirmed during the design and communication stage.

3. About 80% of the time was spent communicating, querying data, providing data and validating it.

4. Joint testing with other teams started before our own testing was finished, so the same bug was found by several parties and reported with repeated @ mentions, which made bug fixing inefficient.

5. Unfamiliarity with the in-house middleware led to a high time cost.

6. My view of the big picture was not broad enough to anticipate some possible problems.

7. Lack of research on middleware and failure to check dependent version number at the beginning led to OOM of online machine.

What I feel good about:

1. I cooperated closely with the testers and we understood each other well, so testing was efficient

2. Wrote detailed interface documents for the query system and uploaded them to the company’s knowledge base for real-time reference

3. Emergency fix online bug in 3 minutes

4. Took a requirement from acceptance to launch in 30 minutes

5. Communicated with the counterparty immediately when finding middleware problems and designed low-cost solutions that had no impact on them

6. Actively help other students to query data and troubleshoot problems

7. Wrote scripts to fix part of the incorrect data efficiently

Growth and gains:

1. Ability to stay up late under pressure

2. Design thinking ability

3. Communication skills

4. Problem-solving skills

5. Familiarity with advanced commands ↑

6. Middleware familiarity ↑

7. Cluster management ability ↑

8. Ability to turn down requirements

9. Complaining ability ↑

10. Bragging ability ↑

Follow-up

After the project was launched, through summary and review, I found areas worthy of optimization in the project and also thought about some more sound mechanisms, which will be gradually implemented. Such as:

1. Some configurations of the two systems are the same

Currently, identical configuration is kept in sync by copy and paste. This method is simple, but the disadvantage is obvious: if the configuration of one system is changed and the other system is forgotten, errors can occur.

Instead, you can introduce a configuration center that manages the configuration files of multiple systems centrally and supports manual modification, multiple environments, grayscale release, configuration version rollback, and so on.

You can use Alibaba’s Nacos or Ctrip’s Apollo, both of which provide a UI for managing configuration.

2. Unexpected process exits must be taken seriously!

You can monitor the process in real time and restart it automatically.

There are two ways to do this:

1. Use a tool, such as Supervisor or Monit, to manage and restart processes

2. Write shell scripts and perform scheduled tasks to periodically observe process status and restart processes. It is recommended that scheduled tasks be connected to a distributed task scheduling platform. When scheduled tasks are numerous, visual management and convenient control and scheduling are necessary.

3. Ensure message queue reliability

1. Message retransmission mechanism: As described in the plan, design a retransmission queue and send the messages in it first. To avoid retransmitting forever, set a retransmission threshold for each message.

2. Email alarm: If the number of retransmissions exceeds the threshold, send an email alarm directly and do not put the message back into the queue.
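The two mechanisms above can be sketched together. This is an illustration under assumptions, not the design that was implemented: the threshold of 3, the class RetransmitQueue and the sender predicate (which returns true when the broker accepts the message) are all hypothetical, and an in-memory list stands in for the email alarm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Sketch: a retransmission queue with a per-message threshold and an alarm list. */
class RetransmitQueue {
    static final int MAX_RETRANSMITS = 3; // assumed threshold

    private static final class Pending {
        final String body;
        final int retransmits; // retransmissions already made for this message

        Pending(String body, int retransmits) {
            this.body = body;
            this.retransmits = retransmits;
        }
    }

    private final Predicate<String> sender; // returns true when the message was accepted
    private final Deque<Pending> retryQueue = new ArrayDeque<>();
    private final List<String> alarms = new ArrayList<>(); // stand-in for the email alarm

    RetransmitQueue(Predicate<String> sender) {
        this.sender = sender;
    }

    /** Sends queued retransmissions first, then the new message. */
    void send(String body) {
        flushRetries();
        attempt(body, 0);
    }

    /** Retransmits each message that is currently waiting, once. */
    void flushRetries() {
        int waiting = retryQueue.size();
        for (int i = 0; i < waiting; i++) {
            Pending p = retryQueue.poll();
            attempt(p.body, p.retransmits + 1);
        }
    }

    private void attempt(String body, int retransmits) {
        if (sender.test(body)) {
            return;
        }
        if (retransmits >= MAX_RETRANSMITS) {
            alarms.add(body); // threshold reached: raise an alarm and do not queue again
        } else {
            retryQueue.add(new Pending(body, retransmits));
        }
    }

    int pending() {
        return retryQueue.size();
    }

    List<String> alarms() {
        return alarms;
    }

    public static void main(String[] args) {
        RetransmitQueue queue = new RetransmitQueue(body -> false); // a broker that always rejects
        queue.send("state-changed:42");
        for (int i = 0; i < MAX_RETRANSMITS; i++) {
            queue.flushRetries();
        }
        System.out.println(queue.alarms());
    }
}
```

In practice flushRetries would be driven by a timer so that a message is retransmitted even when no new message arrives.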

The work looks simple but really is not. Who says back-end development is just CRUD?

END

Original address: mp.weixin.qq.com/s/j-D16UMks…

A Tencent engineer’s first-hand account: launching a system with million-level concurrency in one week