About the author: formerly technical director at Alibaba, Daily Youxian, and other Internet companies, with 15 years of e-commerce and Internet experience.

For more practical content, follow the WeChat public account: Erma Reading.

Many people get asked this in interviews: what system failures have you encountered, and how did you resolve them? Below are a number of online failure cases I have summarized from 15 years of Internet R&D experience. I believe they can help you deal calmly with the interviewer's questions!

There are not many images in this article, but the content is packed with substance! Focus on understanding; the interview is secondary. Learn to apply what you read!

Fault 1: Quickly diagnosing frequent JVM Full GC

Before we share this example, let’s talk about which scenarios lead to frequent Full GC:

  1. Memory leaks (a code problem keeps object references from being released, so the objects cannot be collected in time)

  2. Infinite loops

  3. Large objects

Large objects in particular account for more than 80% of these cases.

So where do large objects come from?

  1. An overly large result set from a database (MySQL, or a NoSQL database such as MongoDB)

  2. A large object returned by a third-party interface

  3. An oversized message from a message queue

In my years of Internet experience, the vast majority of cases are caused by large database result sets.

Ok, now let’s introduce this online failure:

Without any new release, the POP service suddenly started doing Full GCs like crazy. The heap memory monitoring showed no memory leak, and rolling back to the previous version didn't help either. Embarrassing!!

The conventional approach is to dump the heap with jmap (jmap -dump:format=b,file=<filename> <pid>), then use a tool such as MAT to figure out which objects are taking up a lot of space, and finally examine the relevant references to find the problem code. This approach takes a long time to locate the problem, and if the affected service is a key one, the fault may remain unlocated and unresolved for far too long.

So here is how we did it. While one of us analyzed the heap snapshot in the usual way, another checked the network IO monitoring of the database server. If the database server's network IO has risen sharply and the time point matches, you can basically conclude that a large database result set caused the Full GC. We then asked a DBA to quickly locate the offending large SQL (very easy for a DBA; one who can't do it should be fired, haha), and once the SQL was found, locating the code was trivial. In this way we located the problem quickly. It turned out that a required parameter of an interface was not passed in and was not validated either, so two conditions were dropped from the SQL statement and a single query pulled out tens of thousands of records. What a pit! Much faster, isn't it? Haha, five minutes.

At that time, the DAO layer was implemented based on Mybatis, and the SQL statement with the problem was as follows:

<select id="selectOrders" resultType="com.***.Order" >select * from user where 1=1<if test=" orderID ! = null ">and order_id = #{orderID}</if ><if test="userID ! =null">and user_id=#{userID}</if ><if test="startTime ! =null">and create_time >= #{createTime}</if ><if test="endTime ! =null">and create_time <= #{userID}</if ></select>Copy the code

This interface was meant to query orders by orderID or userID, and at least one of orderID and userID was required. But neither parameter was passed in; only startTime and endTime were. So a single select pulled out tens of thousands of records.

Therefore, be careful when using if test in MyBatis; a moment's carelessness can bring disaster. Later we split the SQL above into two statements:

Query orders by order ID:

<select id="selectOrderByID" resultType="com.***.Order" >select * from user whereorder_id = #{orderID}</select>
Copy the code

Query orders by user ID:

<select id="selectOrdersByUserID" resultType="com.***.Order" >select * from user whereuser_id=#{userID}<if test="startTime ! =null">and create_time >= #{createTime}</if ><if test="endTime ! =null">and create_time <= #{userID}</if ></select>Copy the code

Fault 2: Memory leaks

Before introducing an example, consider the difference between a memory leak and a memory overflow.

Out of memory: occurs when a program no longer has enough memory to use. Once memory runs out, the program basically cannot keep working.

Memory leak: occurs when a program fails to release memory it no longer needs, so memory usage gradually rises. A memory leak generally does not stop the program right away, but if the leak keeps accumulating until it hits the memory limit, an out-of-memory error follows. In Java, a memory leak means the GC cannot fully reclaim objects, so heap usage creeps up a little after each collection. Below is a monitoring chart of a JVM memory leak; you can see that heap usage after each GC is higher than after the previous one.

(Image from the Internet)

The memory leak happened in a scenario where a local cache (a framework developed by the company's infrastructure team) was used to store product data. There were not that many products, a few hundred thousand items. If you cache only the hot items, memory usage stays modest, but caching the full catalog takes more memory than we had. We initially set a 7-day expiration time on every cache record to make sure most cached items were hot ones. However, after a refactoring of the local cache framework, the expiration time was dropped. With no expiration time, the local cache grew over time and a lot of cold data was loaded into it. One day we received an alarm that heap memory usage was too high. We quickly dumped a heap snapshot with jmap (jmap -dump:format=b,file=<filename> <pid>), analyzed the snapshot with the Eclipse MAT tool, and found a huge number of product records sitting in the local cache. Once the problem was located, the architecture team quickly added the expiration time back and the service was restarted node by node.
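The company's cache framework is not public, but here is a minimal sketch of the same idea using Guava Cache (the library choice, class names and sizes are my own assumptions for illustration). The point is simply that an expiration time and a size bound keep a local cache from turning into the kind of leak described above:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class ProductCacheSketch {

    static class Product {
        final long id;
        final String name;
        Product(long id, String name) { this.id = id; this.name = name; }
    }

    // Entries expire 7 days after being written and the cache is bounded in size.
    // Without expireAfterWrite/maximumSize, cold products accumulate forever and
    // the cache itself becomes the memory leak.
    private static final Cache<Long, Product> PRODUCT_CACHE = CacheBuilder.newBuilder()
            .expireAfterWrite(7, TimeUnit.DAYS)
            .maximumSize(100_000)
            .build();

    public static void main(String[] args) {
        PRODUCT_CACHE.put(1L, new Product(1L, "demo product"));
        System.out.println(PRODUCT_CACHE.getIfPresent(1L).name);
    }
}

Whatever framework you use, the key point is that every cached record must eventually expire or be evicted.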

Thanks to the server memory and JVM heap monitoring we had added, we caught the memory leak in time. Otherwise the leak would have kept accumulating until one day an OOM hit, and that would have been miserable. So besides operations monitoring of CPU, memory and so on, a technical team should treat JVM monitoring as very important too.

Fault 3: Idempotency problem

Many years ago, when I was a Java programmer at a large e-commerce company, I developed a points service. The business logic was that after a user completed an order, the order system sent a message to the message queue; the points service consumed the message, credited the user, and added the newly generated points to the user's existing points.

Because of network issues, messages can be sent repeatedly, which leads to repeated consumption. I was still a workplace rookie at the time and did not consider this situation. As a result, duplicate credits occasionally occurred after launch, that is, a user was given points two or more times for a single completed order.

Later we added a points record table. Before crediting the user for each consumed message, we first check the points record table by order number; only if there is no record for that order do we add the points. This is what is known as "idempotency": repeating an operation does not change the final result. In real development, many scenarios that involve retries or repeated consumption rely on idempotency to keep the result correct. For example, to avoid duplicate charges, payment interfaces must also be idempotent.
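Here is a minimal sketch of that idempotent check using plain JDBC; the table names t_points_record and t_user_points are hypothetical, and in practice a unique index on order_id plus a transaction gives the same guarantee under concurrent duplicate messages:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PointsConsumer {

    // Credit points for an order only if no points record exists for it yet.
    public void onOrderCompleted(Connection conn, long orderId, long userId, int points) throws SQLException {
        // 1. Has this order already produced a points record?
        try (PreparedStatement check = conn.prepareStatement(
                "select 1 from t_points_record where order_id = ?")) {
            check.setLong(1, orderId);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) {
                    return; // duplicate message: the order was already credited
                }
            }
        }
        // 2. Insert the record and credit the user.
        try (PreparedStatement insert = conn.prepareStatement(
                "insert into t_points_record (order_id, user_id, points) values (?, ?, ?)")) {
            insert.setLong(1, orderId);
            insert.setLong(2, userId);
            insert.setInt(3, points);
            insert.executeUpdate();
        }
        try (PreparedStatement credit = conn.prepareStatement(
                "update t_user_points set points = points + ? where user_id = ?")) {
            credit.setInt(1, points);
            credit.setLong(2, userId);
            credit.executeUpdate();
        }
    }
}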

Fault 4: Cache avalanche

We often run into situations where a cache needs to be initialized. For example, we once went through a user system refactoring in which the table structure changed and so did the cached data. After the refactoring, the cache had to be initialized before going live by batch-loading user information into Redis. Each user cache record was given an expiration time of 1 day; after expiration, the latest data is queried from the database and written back to Redis. Everything was fine during the grayscale launch, so it was soon released in full. The whole process went smoothly and the coders were very happy.

The next day, however, disaster struck! At a certain point in time, all kinds of alarms started pouring in. The user system's responses suddenly became very slow, sometimes with no response at all. Looking at the monitoring, the user service CPU shot up (IO wait was high), MySQL traffic skyrocketed, pressure on the MySQL server surged, and the Redis cache hit ratio dropped to the floor. Relying on our powerful monitoring (operations monitoring, database monitoring, APM full-link performance monitoring), we quickly located the problem: a huge number of user records in Redis expired at the same time, so requests for user information could not be served from Redis and penetrated to the database in bulk, instantly putting enormous pressure on it. The user service and other associated services were affected as well.

This kind of concentrated cache expiration, which lets a large number of requests hit the database at the same time, is known as a "cache avalanche". If performance testing does not cover the moment the cache expires, it will not catch the problem. So please take note.

Therefore, when you need to initialize cache data, make sure the expiration times of the cache records are spread out. For example, for this user information we can use a large fixed value plus a small random value, say an expiration time of 24 hours plus a random 0 to 3600 seconds.
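A minimal sketch of that randomized expiration, assuming Jedis as the Redis client (the key format is only an example):

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class UserCacheWarmup {

    // Base TTL of 24 hours plus a random 0-3600 seconds, so records warmed up
    // in the same batch do not all expire at the same moment.
    public static void cacheUser(Jedis jedis, long userId, String userJson) {
        int ttlSeconds = 24 * 3600 + ThreadLocalRandom.current().nextInt(3600);
        jedis.setex("user:" + userId, ttlSeconds, userJson);
    }
}

The same idea applies with any Redis client: the important part is the random offset added to the fixed TTL.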

Fault 5: Disk I/O blocking threads

The problem occurred in the second half of 2017. For a period of time, the Gaea service would become unusually slow and then recover by itself; each episode lasted anywhere from a few seconds to a few tens of seconds.

If the slow responses were persistent, it would be easy: grab the thread stacks with jstack and locate the problem quickly. But each episode lasted a few tens of seconds at most, and it was sporadic: once or twice a day, sometimes only once every few days, at unpredictable times. Having someone stare at the system and run jstack by hand was obviously not realistic.

Well, since the manual method was not practical, we automated it: a shell script runs jstack every 5 seconds, periodically rolls the output into a new log file, and keeps only a limited number of log files.

The Shell script is as follows:

#!/bin/bash
# Grab a jstack thread dump of the target Java process every 5 seconds,
# rolling the output into a new log file every 100 dumps.
num=0
log="/tmp/jstack_thread_log/thread_info"
cd /tmp
if [ ! -d "jstack_thread_log" ]; then
    mkdir jstack_thread_log
fi
while ((num <= 10000)); do
    # Find the PID of the gaea Java process
    ID=`ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}'`
    if [ -n "$ID" ]; then
        jstack $ID >> ${log}
    fi
    num=$(( $num + 1 ))
    mod=$(( $num % 100 ))
    if [ $mod -eq 0 ]; then
        back=$log$num
        mv $log $back
    fi
    sleep 5
done

The next time the slow responses occurred, we pulled up the jstack log file for that point in time and found many threads blocked on logback's log output. We then trimmed the logging and switched log output to asynchronous, and the problem was solved. I suggest you keep this script; it will come in handy when you hit similar problems!

Fault 6: Database deadlock

Before analyzing the case, let's review MySQL InnoDB. In the InnoDB engine the primary key is a clustered index, meaning the leaf nodes of the B+ tree store both the index values and the data records; the rows live together with the primary key index. The leaf nodes of a normal (secondary) index store only the primary key value. A query that goes through a secondary index first reaches the secondary index leaf node, then uses the primary key stored there to look up the clustered index leaf node and fetch the actual row. This extra step is known as a "back to the table" lookup.

The failure occurred in the order system of our mall. A scheduled task runs every hour and cancels orders that have remained unpaid past the allowed time. The customer service back office can also cancel orders in batches.

The order table T_ORDER has the following structure:

id            Order ID, primary key
status        Order status
created_time  Order creation time

id is the table's primary key, and there is a normal (secondary) index on the created_time field.

Clustered index (primary key id):

id (index)   status   created_time
1            UNPAID   2020-01-01 07:30:00
2            UNPAID   2020-01-01 08:33:00
3            UNPAID   2020-01-01 09:30:00
4            UNPAID   2020-01-01 09:39:00
5            UNPAID   2020-01-01 09:50:00

Secondary index (created_time field):

created_time (index)   id (primary key)
2020-01-01 09:50:00    5
2020-01-01 09:39:00    4
2020-01-01 09:30:00    3
2020-01-01 08:33:00    2
2020-01-01 07:30:00    1

The scheduled task runs every hour and cancels unpaid orders from an earlier two-hour window; for example, at 11 a.m. it cancels unpaid orders created between 8 and 10 a.m. The SQL statement is as follows:

update t_order set status = 'CANCELLED' where created_time > '2020-01-01 08:00:00' and created_time < '2020-01-01 10:00:00' and status = 'UNPAID'

Customer service batch cancellation SQL is as follows:

update t_order set status = 'CANCELLED' where id in (2, 3, 5) and status = 'UNPAID'

A deadlock can occur when the above two statements are executed simultaneously. Let’s analyze why.

For the first statement (the scheduled task), the created_time secondary index entries are located and locked first, and then the corresponding primary key index entries are located and locked.

Step 1: lock the created_time secondary index entries.

Step 2: lock the primary key index entries.

The second statement (the customer service SQL) uses the primary key index directly and locks the primary key index entries directly.

As you can see, the scheduled task SQL locks the primary keys in the order 5, 4, 3, 2 (following the descending created_time index), while the customer service SQL locks them in the order 2, 3, 5. When the scheduled task has locked 5 and 4 and goes to lock 3, it finds 3 already locked by the customer service SQL and has to wait; at the same time the customer service SQL, already holding locks on 2 and 3, tries to lock 5, which is held by the scheduled task. The two SQL statements wait for each other's locks, and a "deadlock" occurs.

The solution is to make the two SQL statements acquire locks in a consistent order. Alternatively, change the customer service batch cancellation so that each SQL statement cancels only a single order and the program executes the statement repeatedly; if the number of orders in a batch is small, this clumsy method is also feasible (see the sketch below).
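A minimal sketch of that one-order-per-statement approach with plain JDBC (the class name and details are illustrative, not the actual production code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class OrderCancelService {

    // Cancel orders one at a time so each statement locks only a single primary
    // key row, instead of acquiring many row locks in an order that can conflict
    // with the scheduled task.
    public void cancelOrders(Connection conn, List<Long> orderIds) throws SQLException {
        String sql = "update t_order set status = 'CANCELLED' where id = ? and status = 'UNPAID'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Long id : orderIds) {
                ps.setLong(1, id);
                ps.executeUpdate();
            }
        }
    }
}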

Fault 7: Domain name hijacking

Let's first look at how DNS resolution works. When we visit www.baidu.com, we first ask the DNS server to resolve www.baidu.com to the IP address of Baidu's server, and then access the website at that IP address over HTTP. DNS hijacking is a form of Internet attack: by attacking or forging the DNS server, the attacker resolves the target website's domain name to another IP address. As a result, requests cannot reach the target site or are redirected to another site, as shown in the diagram below:

Here is an example of DNS hijacking that we have experienced.

Look at the red box in the picture: the image at the top should be a product picture, but an advertisement is shown instead. Was the wrong picture used? No, DNS was hijacked. The page was supposed to show product images stored on the CDN, but the request was hijacked and an advertisement image linked from another website was shown instead. Because the CDN image links used the insecure HTTP protocol at the time, they were easy to hijack. They were later changed to HTTPS, and the problem was solved.

Of course, domain name hijacking comes in many forms, and HTTPS cannot avoid every problem. So, besides various security protections, many companies prepare backup domain names that they can switch to at any time in case of domain hijacking.

Fault 8: Bandwidth exhaustion

A system becoming inaccessible because its bandwidth is exhausted is rare, but it deserves attention. Let's look at an incident we ran into some time ago.

Here is the scenario. At a social e-commerce company, every shared product image carries a unique QR code used to identify both the product and the sharer, so the QR codes have to be generated programmatically; initially we generated them with Java on the server side. In the early days traffic was low and the system had no problems. But one day operations suddenly launched a promotion with unprecedented discounts, and page views instantly grew dozens of times. The problems followed: the network bandwidth was completely saturated, and with bandwidth exhausted, many page requests responded slowly or not at all. The reason was that the number of QR codes generated also instantly grew dozens of times, and since each QR code is an image, this put enormous pressure on the bandwidth.

How did we solve it? If the server can't handle it, think about the client: generate the QR codes in the client app and make full use of the users' phones. Android, iOS and React all have SDKs for generating QR codes. This not only solves the bandwidth problem but also frees the CPU the server was spending on QR code generation (generating a QR code takes a fair amount of computation, so the CPU cost is noticeable).
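As an illustration, here is a minimal sketch of generating a QR code with the open-source ZXing library (the library choice, URL and sizes are assumptions for the example, not necessarily what the app actually used):

import com.google.zxing.BarcodeFormat;
import com.google.zxing.WriterException;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.QRCodeWriter;

public class QrCodeSketch {

    // Encode the share URL into a QR code matrix on the device; the app then
    // renders the matrix into a Bitmap (Android) or an image view on iOS.
    public static BitMatrix encodeShareUrl(String shareUrl) throws WriterException {
        return new QRCodeWriter().encode(shareUrl, BarcodeFormat.QR_CODE, 300, 300);
    }

    public static void main(String[] args) throws WriterException {
        BitMatrix matrix = encodeShareUrl("https://example.com/item/123?sharer=456");
        System.out.println("QR matrix: " + matrix.getWidth() + "x" + matrix.getHeight());
    }
}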

Internet bandwidth is very expensive, so we still need to use it sparingly!

The cases shared in this article all come from my own experience. I hope they are helpful to readers.
