With the growth of the domestic Internet industry, trillion-record super clusters are no longer as rare as they were a few years ago, but they are still uncommon, and the opportunity to troubleshoot the performance of such a cluster is rarer still.

This performance troubleshooting of a Hadoop NameNode serving more than a trillion records involved the largest cluster, the longest investigation, the heaviest workload, and the most hair loss I have experienced since starting my company. In the end I even had to turn to an outside expert for help.

So, as soon as the problem was solved, I wrote down the whole troubleshooting process and summarized it, hoping it will be helpful to everyone who reads it. Enjoy:

Background

It all started with recent customer feedback that the performance of the LSQL database had suddenly deteriorated. Queries and retrievals that used to return within seconds were now spinning and getting stuck. Since this was a sudden change, the on-site staff first checked for business changes, but found nothing. As the first trillion-record project my company took on after it was founded, I attached great importance to it and went to the site as soon as possible.

A quick introduction to the platform architecture: the bottom layer uses Hadoop for distributed storage, the middle layer uses the LSQL database, and real-time data is imported through Kafka. The daily data volume is 50 billion records, the retention period is 90 days, and there are more than 4,000 tables in total. The largest single table holds nearly 2 trillion records, the total data volume is nearly 5 trillion records, and the storage footprint is 8 PB.

The platform mainly supports full-text retrieval, multi-dimensional queries, geo-location retrieval, and data-collision analysis. Some workloads also involve statistics and analytics, and there is a very small amount of data export and multi-table joins.

The Investigation

① Before Departure: Preliminary fault localization

Before starting my own business, I worked on the Hermes system at Tencent, where the real-time data volume reached 360 billion records per day, and later close to one trillion per day. Without trying to humble-brag, I will just say that I felt like Liang Qichao giving his speech at Peking University: I don't know much about super-large clusters, but I do know a little! Facing this system of 50 to 100 billion records per day, I was wondering whether to book a same-day return ticket......

To locate the problem quickly, I asked the site for some logs and jstack dumps before departure. The initial conclusion was a Hadoop NameNode bottleneck. We had optimized the NN many times before, so there was nothing mysterious here, just familiar ground.
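For context, the stack sampling I asked the site to run was nothing fancy; a rough sketch follows (the PID lookup, sample count, and interval are illustrative, not the exact commands used on site):

```bash
# Capture the NameNode stack every few seconds; threads that show up
# repeatedly as RUNNABLE or BLOCKED across samples are the interesting ones.
NN_PID=$(jps | awk '$2 == "NameNode" {print $1}')   # illustrative way to find the NN process id
for i in $(seq 1 20); do
  jstack -l "$NN_PID" > "nn-jstack.$(date +%H%M%S).txt"
  sleep 5
done
```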

Here is the stack analysis from that time; I am sure anyone reading it would be just as confident that Hadoop was clearly the thing stalling.

② First Day: Trying to adjust log4j

The first thing I did after arriving on site was to repeatedly capture the Hadoop NameNode stack with jstack. The conclusion was that the stall was indeed on the NN. The NN holds a global lock, and all read and write operations queue up and wait on it, as shown below:

1. Where is it stuck?

More than 1,000 threads were waiting on this lock. Let's take a closer look: what is the thread that holds the lock actually doing?

2. Problem analysis

Clearly, there is a bottleneck in logging, and it blocks for too long.

  1. The log4j pattern should not include [%L]: to obtain the line number it creates a Throwable, which is a heavyweight object in Java.

  2. Logs are written too frequently, and the disk cannot flush fast enough.

  3. Log4j holds a global lock, which limits throughput.

3. Adjustment plan

  1. The customer's Hadoop is version 2.6.0, which has many known problems in its log handling, so we applied the patches for issues the community has explicitly confirmed:

issues.apache.org/jira/browse…

NN slowdown caused by logging.

issues.apache.org/jira/browse…

Move logging outside the lock to avoid blocking on it.

issues.apache.org/jira/browse…

Logging in processIncrementalBlockReport severely affects NN performance.

  2. Disable all NameNode INFO-level logs

We observed that the global lock blocks the NN whenever there is a large amount of log output.

The change here is to suppress log output at the log4j level and disable all NameNode INFO-level logging.

  3. Remove the [%L] parameter from the log4j output pattern

This parameter creates a new Throwable object just to obtain the line number. That object is expensive, and creating it in large quantities hurts throughput.

  4. Enable asynchronous audit logging

Set dfs.namenode.audit.log.async to true to make audit logging asynchronous.
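Taken together, items 2 to 4 boil down to a couple of configuration edits. A minimal sketch, assuming a stock Hadoop log4j.properties with the usual RFA appender (the exact logger and appender names on the customer's cluster may differ):

```properties
# log4j.properties on the NameNode host (illustrative)
# Raise noisy NameNode loggers above INFO so routine operations stop logging.
log4j.logger.org.apache.hadoop.hdfs.server.namenode=WARN
log4j.logger.org.apache.hadoop.hdfs.StateChange=WARN
log4j.logger.BlockStateChange=WARN

# Pattern WITHOUT %L: resolving the line number forces log4j to build a Throwable.
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

```xml
<!-- hdfs-site.xml: write the audit log asynchronously -->
<property>
  <name>dfs.namenode.audit.log.async</name>
  <value>true</value>
</property>
```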

4. Optimization result

After these changes, the log4j delays were indeed gone, but Hadoop throughput was still stuck on the lock.

③ Second Day: Optimize du, then troubleshoot and resolve the remaining delays

Picking up where we left off yesterday:

1. After fixing the log4j problem, we continued capturing jstack dumps and found the following position:

2. Code analysis showed that a lock is held here, and it was confirmed that this blocks all other access:

3. Studying the code further, we found that the behavior is controlled by the following parameter:

(In version 2.6.5, the default value is 5000, so this issue no longer exists)

The core logic of this parameter is that, when its value is greater than zero, the lock is released after processing a certain number of files, allowing other requests to proceed. The problem only exists in Hadoop 2.6.0 and has been fixed in later versions.
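The screenshot with the exact property is not reproduced here; my understanding is that it is the getContentSummary yield limit, which later releases expose as dfs.content-summary.limit, so treat the property name below as an assumption rather than the exact setting from this cluster:

```xml
<!-- hdfs-site.xml (assumed property name: dfs.content-summary.limit) -->
<!-- When greater than 0, getContentSummary() yields the namesystem lock after
     counting this many items, so a du over a huge directory no longer blocks
     every other request until the whole traversal finishes. -->
<property>
  <name>dfs.content-summary.limit</name>
  <value>5000</value>
</property>
```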

4. Solutions

  1. Apply the official patch:

issues.apache.org/jira/browse…

  2. Remove all use of Hadoop du inside LSQL

5. Why apply the patch

In version 2.6.5 you can configure the sleep time yourself, and the default is 500 ms, while in version 2.6.0 the sleep time is 1 ms. I was worried that this is too short and would still cause problems.

Continuing with the original approach, I checked all the jstack dumps again. There were no hot application threads in the NameNode's jstack output, yet it was still stuck switching between the read and write lock. This shows that:

1. Every function inside the NameNode had been optimized to the point where jstack could hardly catch anything;

2. The stack traces only showed roughly 1,000 threads constantly switching between the read and write lock, meaning the request concurrency on the NN was very high and lock context switching between threads had become the main bottleneck.

So the main idea at this point was to reduce the frequency of NN calls.

④ Third Day: Minimize the frequency of NN requests

On the third day on site, a rainstorm suddenly hit. "Love is like blue sky and white clouds, clear and sunny, then a sudden storm"...... just like my mood.

To reduce the frequency of NN requests, I tried several approaches:

1. Enable shard merging for the different tables in the LSQL database

Considering that there are more than 4,000 tables on site, each with more than 1,000 shards being written concurrently, too many files may be written at the same time, driving the NN request rate too high. So we merged the shards of the smaller tables; with fewer files being written, the request rate naturally drops.

2. Work with the on-site staff to clear out unnecessary data and reduce the load on the Hadoop cluster. After the cleanup, the number of file blocks in the cluster dropped from nearly 200 million to 130 million, which is a substantial reduction.

3. Adjust the frequency of a series of heartbeat interactions with the NN, such as the blockManager-related parameters.

4. Change the NN's internal lock from a fair lock to a non-fair lock.

The following parameters were involved in the adjustment (a configuration sketch follows the list):

  • dfs.blockreport.intervalMsec changed from 21600000 (6 hours) to 259200000 (3 days): the full block report, i.e. the "full heartbeat"

  • dfs.blockreport.incremental.intervalMsec (incremental block report interval) changed from 0 to 300 ms, so that reports are batched as much as possible (older versions do not have this parameter)

  • dfs.namenode.replication.interval changed from 3 seconds to 60 seconds, reducing the scan frequency

  • dfs.heartbeat.interval changed from the default 3 seconds to 60 seconds, reducing the heartbeat frequency

  • dfs.namenode.invalidate.work.pct.per.iteration changed from 0.32 to 0.15 (15% of nodes), reducing the number of nodes scanned per iteration
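Put into configuration form, the adjustments above look roughly like this. A sketch of hdfs-site.xml; the values mirror the list, and the non-fair lock switch via dfs.namenode.fslock.fair is my assumption about how item 4 was applied:

```xml
<!-- hdfs-site.xml: reduce how often DataNodes and background scans hit the NN -->
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>259200000</value>   <!-- full block report every 3 days instead of 6 hours -->
</property>
<property>
  <name>dfs.blockreport.incremental.intervalMsec</name>
  <value>300</value>         <!-- batch incremental reports (not present in older releases) -->
</property>
<property>
  <name>dfs.namenode.replication.interval</name>
  <value>60</value>          <!-- replication scan every 60s instead of 3s -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>60</value>          <!-- DataNode heartbeat every 60s instead of 3s -->
</property>
<property>
  <name>dfs.namenode.invalidate.work.pct.per.iteration</name>
  <value>0.15</value>        <!-- scan 15% of nodes per iteration instead of 32% -->
</property>
<property>
  <name>dfs.namenode.fslock.fair</name>
  <value>false</value>       <!-- assumed property for the non-fair namesystem lock -->
</property>
```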

Stacks involved in this adjustment:

The end result: the problem remained. I was at my wit's end and did not know what to do next.

⑤ Fourth Day: No way out; consider a traffic-diversion plan

Having stayed up three nights in a row, on the morning of the fourth day I reported the situation to the company and the customer, admitted frankly that I was out of ideas, and proposed activating Plan B:

1. Enable the Hadoop federation scheme and spread the load across multiple NameNodes;

2. Immediately modify the LSQL database to work against multiple Hadoop clusters, that is, build a second identical cluster. LSQL starts 600 processes: 300 keep requesting the old cluster and 300 are diverted to the new cluster, splitting the load.

The advice from home (and from the company) was to go back to bed and make the decision once I was sane again.

The customer suggested continuing the investigation: the system had been running stably for more than a year, so there was no reason for it to fail suddenly, and they hoped we would dig further.

Like most system failures that get resolved with a single reboot, I decided to take a nap, hoping the problem would have solved itself by the time I woke up.

With no other options, I turned to a former comrade-in-arms: Gao, the expert who was specifically responsible for HDFS back when I was at Tencent. His mastery of Hadoop rivals my familiarity with hair-loss prevention tips, and his experience optimizing huge clusters is the kind you rarely get to draw on. I figured that if even he could not help me sort this out, probably nobody could, and at least I would not have struggled in vain.

Gao first asked about the basic state of the cluster and gave me a number of practical suggestions. What excited me most was his conclusion that our cluster was definitely not hitting its performance ceiling.

⑥ The Last Day: Analyze every function that takes the NN's lock

This time I did not look directly at the JMX metrics, fearing the results might not be accurate. Instead, I used BTrace to find out which threads were locking the NN so frequently that it became overloaded.
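I no longer have the original BTrace script, but the idea was roughly the following: attach a probe to the NameNode entry point for incremental block reports and count how often it fires. The class and method names below (NameNodeRpcServer.blockReceivedAndDeleted) are what I would probe today and should be treated as an assumption about the customer's 2.6.0 build, not a verbatim copy of what ran on site:

```java
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

// Rough BTrace sketch: count how often the NameNode receives incremental
// block reports from DataNodes. Class/method names are assumptions for 2.6.0.
@BTrace
public class IncrementalReportCounter {
    private static long calls;

    @OnMethod(clazz = "org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer",
              method = "blockReceivedAndDeleted")
    public static void onReport() {
        calls++;                       // fires on every incremental block report
    }

    // Print and reset the counter every 10 seconds.
    @OnTimer(10000)
    public static void dump() {
        println(strcat("incremental block reports in last 10s: ", str(calls)));
        calls = 0;
    }
}
```

It would be attached to the live NameNode with something like `btrace <namenode-pid> IncrementalReportCounter.java`, and the same pattern can be pointed at any other NN method you suspect.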

After three hours of analysis, I finally made a surprising discovery: the call frequency of processIncrementalBlockReport was extremely high, far higher than any other path. Isn't this the logic behind the DataNode's incremental heartbeat? Why was it firing so often? Hadn't I already changed the heartbeat settings? Did they not take effect?

After carefully reading the Hadoop code, I found the catch: this report is sent immediately every time data is written or deleted, and in the customer's Hadoop version the heartbeat parameters I had set do not cover this path at all, so setting them had no effect. I urgently searched for a patch and finally found one. It fixes not only the reporting rate but also the locking: by reducing how often the lock is taken, it cuts context switches and improves NN throughput.

After applying the patch, the NN's throughput visibly went up. Not only did NN access stop stalling, but the real-time Kafka consumption rate jumped from about 4 billion records per hour to 10 billion per hour, and ingestion performance also doubled. With this patch in place, the problem was fundamentally solved.

The root cause is the single global lock inside the HDFS NameNode, which makes the lock extremely heavy: every request has to acquire it before the NN can do any work, so there is intense lock contention. As a result, as soon as a large import or deletion hammers the NN's lock, the NameNode has to handle a flood of requests at once and every other user's workload is affected immediately. The main effect of this patch is to make the incremental block report take the lock asynchronously, so that deletion and reporting operations no longer stall queries.

For a detailed description of the change, please refer to:

blog.csdn.net/androidlush…

Summary

Finally, to wrap up this troubleshooting effort, here is a summary of the causes of the problem and the lessons learned:

① Causes of the problem

The system had been running smoothly, and the sudden problems were mainly caused by the following:

1. Users deleted a large number of files, putting pressure on Hadoop

  • The disks were about to fill up, so a large batch of data was recently cleaned up in one go

  • Hadoop had recently been unstable, and a large number of files were deleted in bulk

2. The daily data volume has increased significantly recently (after Hadoop was tuned, the backlogged data was re-ingested, and the ingestion counts were tallied from the logs)

3. A backlog of data to consume

Because several days of data had piled up during this tuning, Kafka was being consumed at full speed, and full-speed consumption puts significant pressure on the NN.

4. Snapshot and Mover impact on Hadoop

  • Clearing snapshots released a large number of data blocks, which amounts to mass deletion

  • The Mover added a large number of data blocks and caused the system to delete many file blocks on the SSDs. With more nodes involved, heartbeats also became more frequent, and the burst of processIncrementalBlockReport calls put heavy pressure on the NN

② My suggestions

1. Never give up easily!

On the fourth day of the investigation, after trying all kinds of solutions, I also thought about giving up and concluded the failure was unsolvable. At moments like this, it is worth discussing the problem with colleagues, or even former colleagues and leaders; they may bring different ideas and inspiration. Trust the wisdom of the group!

2. Understand how Hadoop works; that was the key to this round of tuning

(1) When files are deleted from HDFS, the NameNode only removes the directory entries and appends the data blocks to be deleted to the pending deletion blocks list. The next time a DataNode heartbeats, the NameNode sends the deletion command together with that list. So when a huge amount of data is deleted at once, the pending deletion blocks list becomes very long and can block for long enough to cause timeouts.

(2) When we import data, the client writes it to DataNodes, and each DataNode reports the received blocks to the NN immediately via processIncrementalBlockReport. The more data is written, the more frequent these calls; the more machines and processes, the more often the NN is called. That is why the asynchronous-lock patch is effective here.

3. Most importantly: never use Hadoop 2.6.0!!

To paraphrase the Hadoop community: other versions have a few bugs, while this one has a lot. So the first thing I did after returning was urge the customer to upgrade to a newer version as soon as possible.

PS: If you want to take Hadoop performance to the next level, I recommend upgrading to Hadoop 3.2.1 and enabling federation mode. We have compiled the issues and pitfalls you may run into.
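For reference, the minimal shape of a federated setup is just a list of nameservices plus an RPC address per NameNode. A sketch with placeholder nameservice IDs and hostnames, not values from this project:

```xml
<!-- hdfs-site.xml: minimal (non-HA) federation sketch with two NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```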

If you want to push past the performance bottleneck once more, we have also prepared a hands-on article on overcoming the router's performance bottlenecks.

Search for "soft" on WeChat official accounts to get our hands-on technical articles as soon as they are published.