During the operation and maintenance of an HBase cluster, problems such as abnormal RegionServer crashes, increased service write latency, or service write failures can occur. Drawing on the author’s experience, this chapter enumerates several common problems from real production environments and introduces the basic ideas for troubleshooting them. In addition, this section describes the logs in the HBase system and summarizes how to use monitoring and log tools to troubleshoot problems, providing a general routine for troubleshooting.

RegionServer downtime

Case 1: RegionServer breaks down due to a long Full GC

A long Full GC is the most common cause of RegionServer breakdown. To analyze such problems, follow the troubleshooting process below:

Symptom: An alarm is received indicating that the RegionServer process has exited.

1. Locate the cause of the breakdown

Step 1: Generally, the cause cannot be found from monitoring alone. You need to search the RegionServer log for two kinds of keywords: “a long garbage collecting pause” or “ABORTING region server”. In a long Full GC scenario, searching for the first keyword retrieves the following:

     2019-06-14T17:22:02.054 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20542ms
     GC pool 'ParNew' had collection(s): count=1 time=0ms
     GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=20898ms
     2019-06-14T WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 20936ms instead of 100ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

Step 2: In general, the CMS GC policy produces a serious Full GC in two scenarios: Concurrent Mode Failure and Promotion Failure. Checking the GC log at the corresponding point in time shows which one occurred:

     2017-06-14T17:22:02.054+0800: 21039.790: [Full GC 2017-06-14T17:22:02.054+0800: 21039.790: [CMS 2017-06-14T17:22:02.054+0800: 21041.477: [CMS-concurrent-mark: 1.767/1.782 secs] [Times: user=14.01 sys=0.00, real=1.79 secs] (concurrent mode failure): 25165780K->25165777K(25165824K), 18.4242160 secs] 26109489K->26056746K(26109568K), [CMS Perm : 48563K->48534K(262144K)], 18.4244700 secs] [Times: user=28.77 sys=0.00, real=18.42 secs]
     2017-06-14T17:22:20.473+0800: 21058.215: Total time for which application threads were stopped: 18.4270530 seconds

At this point, it is almost certain that a CMS GC in concurrent mode failure mode caused the long application pause.

2. Breakdown cause analysis

Cause and effect analysis: a CMS GC in concurrent mode failure mode triggers a long stop-the-world pause in the JVM, so the upper-layer application is suspended for a long time. As a result, the session established between the RegionServer and ZooKeeper times out, and once the session times out, ZooKeeper notifies the Master to kick the RegionServer out of the cluster.

What is a GC in concurrent mode failure mode, and why does it cause such a long pause? Assume that the HBase system is running a CMS collection to reclaim old-generation space. While the collection is in progress, a batch of objects is promoted from the young generation, but unfortunately the old generation has no room left to accommodate them. In this scenario, the CMS collector stops working, the system enters stop-the-world mode, and the collection algorithm degenerates into a single-threaded copying algorithm that re-copies all surviving objects in the entire heap to S0 and frees all other space. Obviously, this takes a long time.

3. Solutions

Since the problem is caused by the old generation not being reclaimed in time, letting the CMS collector start its collection a bit earlier can largely prevent it from happening. The JVM provides the parameter -XX:CMSInitiatingOccupancyFraction=N to control when a CMS collection is triggered, where N is the percentage of old-generation memory that must be in use before the collection starts; the smaller the value, the earlier the collection begins, for example 60.
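
As a purely illustrative sketch (assuming JDK 8 with the CMS collector), the corresponding options might be placed in hbase-env.sh roughly as follows; the heap sizes are placeholders and need to be adapted to the actual RegionServer memory:

     # Sketch of CMS tuning for the RegionServer in hbase-env.sh -- values are illustrative, not prescriptive
     export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
         -Xms32g -Xmx32g -Xmn4g \
         -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=60 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

Adding -XX:+UseCMSInitiatingOccupancyOnly keeps the JVM from adaptively overriding the configured threshold, so the CMS cycle reliably starts at the chosen occupancy, and the GC logging flags make pauses like the one above easy to spot.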

In addition, it is recommended to check whether BlockCache is enabled in off-heap mode, whether the JVM startup parameters are reasonable, and whether off-heap memory is being used properly instead of leaving everything under JVM heap management.
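
For reference only, a minimal sketch of enabling the off-heap BucketCache in hbase-site.xml is shown below; the size is a placeholder (in MB) and must fit within the direct memory granted via -XX:MaxDirectMemorySize:

     <!-- Sketch: off-heap BucketCache; the size value is illustrative -->
     <property>
       <name>hbase.bucketcache.ioengine</name>
       <value>offheap</value>
     </property>
     <property>
       <name>hbase.bucketcache.size</name>
       <value>8192</value>
     </property>

Moving the BlockCache off-heap shrinks the on-heap working set, which in turn reduces the pressure on the CMS collector.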

Case 2: RegionServer breaks down due to a serious system bug

Scans over large fields cause RegionServer breakdown

Symptom: The RegionServer process exits

1. Locate the cause of the breakdown

Step 1: Check the logs. First look for GC-related messages; if there are none, continue searching for the keyword “abort”, and the suspicious log line is found: “java.lang.OutOfMemoryError: Requested array size exceeds VM limit”.

Step 2: Verify against the source code. Find the FATAL-level log together with its stack trace, then locate the corresponding source code or search the Internet for the keyword. It turns out that the exception occurred while the scan result was being sent back to the client: because the amount of data was very large, the requested array size exceeded the maximum value allowed by the JVM (Integer.MAX_VALUE - 2).

2. Cause and effect analysis

Due to a bug in the HBase system, in some scenarios the JVM throws an OutOfMemoryError when HBase requests an oversized array, and the RegionServer breaks down as a result.

3. Essential cause analysis

This problem can be considered an HBase bug: HBase should not request an array that exceeds the threshold allowed by the JVM. On the other hand, it can also be considered improper usage by the business side:

  • The table rows are too wide and the scan does not limit the number of columns returned, so a single row may contain a huge number of columns and exceed the array threshold.
  • The KeyValue values are too large and the scan does not limit the size of the returned results, so the returned data exceeds the array threshold.

4. Solutions

On the server side, you can limit the result size with hbase.server.scanner.max.result.size; on the client side, you can also limit the size of the returned results when issuing the scan (Scan.setMaxResultSize), as sketched in the example below.
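
The following is a minimal, hedged sketch of the client-side limits using the HBase Java client API; the table name and the concrete size/batch values are hypothetical and only show where the limits are applied (the server-side cap, hbase.server.scanner.max.result.size, is set in hbase-site.xml):

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.TableName;
     import org.apache.hadoop.hbase.client.Connection;
     import org.apache.hadoop.hbase.client.ConnectionFactory;
     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.client.ResultScanner;
     import org.apache.hadoop.hbase.client.Scan;
     import org.apache.hadoop.hbase.client.Table;

     public class BoundedScanExample {
         public static void main(String[] args) throws Exception {
             Configuration conf = HBaseConfiguration.create();
             try (Connection conn = ConnectionFactory.createConnection(conf);
                  Table table = conn.getTable(TableName.valueOf("test_table"))) { // hypothetical table name
                 Scan scan = new Scan();
                 scan.setMaxResultSize(2 * 1024 * 1024L); // cap data returned per RPC at ~2 MB
                 scan.setBatch(100);                      // return at most 100 columns of a wide row per Result
                 scan.setCaching(50);                     // rows (or row chunks) fetched per RPC round trip
                 try (ResultScanner scanner = table.getScanner(scan)) {
                     for (Result r : scanner) {
                         // process each Result; wide rows arrive in chunks because of setBatch
                     }
                 }
             }
         }
     }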

HBase write exception

Example: HDFS scale-down (DataNode decommissioning) causes some write exceptions

Symptom: The business reports that some write requests time out abnormally. At that moment, multiple DataNodes in the HDFS cluster underlying HBase are being decommissioned.

1. Locate the cause of the write exception

Step 1: In theory, a smooth decommissioning process should not be noticeable to upper-layer services.

Step 2: Check the monitoring of the HBase cluster nodes. The node I/O load is high during the decommissioning period.

It is preliminarily determined that the abnormal writes are related to the high I/O load during the decommissioning period.

Step 3: View the RegionServer log at the relevant point in time and search for “Exception”; two key lines are found:

     2020-04-24 13:03:16,685 WARN [ResponseProcessor for block bp-1655660372-10.x.x.x-1505892014043:blk_1093094354_19353580] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1655660374-10.x.x.x-1505892014043:blk_1093094354_19353580
     java.io.IOException: Bad response ERROR for block BP-1655660372-10.x.x.x-1505892014043:blk_1093094354_19353580 from datanode 10.x.x.x:50010
         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:828)
     2020-04-24 13:03:16,700 INFO [sync.0] wal.FSHLog: Slow sync cost: 13924 ms, current pipeline: [10.x.x.x:50010, 10.x.x.x:50010]

HLog takes too long to execute sync (13924ms) and the write response is blocked.

Step 4: Check the DataNode logs. It turns out that writing blocks to disk is slow, and the following exception is printed:

     2020-04-24 13:03:16,686 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: newsrec-hbase10.dg.163.org:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.x.x.x:55637 dst: /10.x.x.x:50010

2. Cause analysis of the write exception

  • When multiple DataNodes are decommissioned at the same time, a large number of data blocks have to be re-replicated, so the bandwidth and I/O pressure on all nodes in the cluster increase sharply.
  • The high node I/O load makes the DataNodes slow to flush data blocks to disk, so HLog syncs in HBase become slow or time out; with heavy cluster write pressure, writes pile up and eventually time out.

3. Solutions

  • DataNodes should be decommissioned during off-peak hours.
  • Do not decommission multiple DataNodes at the same time; otherwise, the I/O pressure will spike within a short period of time.

Fault analysis in HBase operation and maintenance

Production line problems are the mentors of system operation and maintenance engineers. The reason for saying this is that analyzing problems lets us accumulate more methods for locating them, gives us a deeper understanding of how the system works, and even exposes us to areas of knowledge we could never have touched before. It is like exploring an unknown problem in an unknown world: the deeper you go, the more you see of a world that others cannot see. So when a problem occurs on the production line, we must seize the opportunity to track it down to its source. It is no exaggeration to say that a large part of a technician’s core competence is the ability to locate and solve problems.

In fact, solving the problem is just the end result. From the moment you receive the alarm and see the problem to the final fix, there are three stages: problem location, problem analysis, and problem repair. Problem location means using technical means to find the essential cause that triggered the problem; problem analysis means working out the whole chain of events from the underlying principles; problem repair depends on problem analysis.

1. Locate the problem

Locating the triggering cause of a problem is the key to solving it. The basic process for locating problems is as follows:

  • Monitor and analyze indicators. Many problems can be answered intuitively and directly from the monitoring interface. For example, the business reports that read latency became very high at a certain point in time; the first response is to check whether there is any anomaly in system I/O, CPU, or bandwidth. If the I/O utilization becomes abnormally high at that point in time, you can almost confirm that this is the cause of the read performance deterioration. I/O utilization is not the root cause, but it is an important link in the chain of the problem, and the next step is to explore why I/O utilization was abnormal at that point in time.

There are many monitoring indicators that are useful for locating problems, and they can be classified into basic system indicators and service-related indicators. Basic system indicators include system I/O utilization, CPU load, and bandwidth. Service indicators include RegionServer read/write TPS, average read/write latency, request queue length/compaction queue length, MemStore memory changes, and BlockCache hit ratio.

  • Analyze logs. For system performance problems, monitoring indicators may help, but for system exception problems, monitoring indicators may not provide any clue; in that case you need to analyze the logs. The core HBase system logs are the RegionServer logs and the Master logs. In addition, GC logs, HDFS logs (NameNode and DataNode logs), and ZooKeeper logs are helpful for analyzing problems in specific scenarios.

For log analysis, you do not need to read the logs from beginning to end. You can directly search for keywords such as “Exception”, “ERROR”, or even “WARN”, and analyze the logs based on the time period.

  • Online help. Analyzing monitoring indicators and logs usually pays off, but in some cases we see an “Exception” and do not know what it means, so we turn to the Internet for help. First, search for the relevant log messages with a search engine; most of the time you can find related articles, because a problem you run into has very likely been run into by someone else. If there is still no clue, turn to professional forums such as Stack Overflow, hbase-help.com, and HBase-related discussion groups. Finally, you can also send an email to the community to consult its technical staff.
  • Source code analysis. After the problem is resolved, it is recommended to reconfirm it against the source code.

2. Problem analysis

Solving an unknown problem is like a journey into an unknown world. The process of locating a problem is walking into that unknown world: the farther you go, the more you see, and the bigger the world becomes. And yet, if you do not carefully sort out the dazzling scenery along the way, you will have nothing to say when someone later asks you about that place.

Problem analysis is the reverse process of problem location: starting from the most essential cause of the problem and combining it with the working principles of the system, you keep deducing until you can explain the abnormal behavior the system exhibited. To analyze this process clearly, you need not only the monitoring information and exception logs but also a good grasp of how the system works. Looking back, only by understanding the principles of the system clearly can you analyze the problem clearly.

3. Problem repair

If you can explain the ins and outs of the problem clearly, it will be easy to give a targeted solution. This should be the easiest part of the whole exploration. There is no problem that cannot be fixed; if there seems to be one, it is because the problem has not been analyzed clearly.

Reference: HBase Principles and Practices


A growing operations and maintenance implementation engineer, focusing on platform implementation, SLA management, tool construction, and DevOps development.