Li Yu (alias Juding) is a speaker at the WOTA Global Architecture and Operation Technology Summit. She is currently a senior technical expert in Alibaba's Search Business Division and a PMC member & Committer of the open source HBase community. An open source enthusiast, she focuses mainly on distributed system design, big data infrastructure, and related fields. For three consecutive years she has designed and developed HBase/HDFS-based storage systems to handle the Double 11 traffic peak, and she has rich hands-on experience in running large-scale production clusters.


HBase, as the core storage system for Taobao's index building and online machine learning platform, is an important part of Alibaba's search infrastructure. This article describes the history, scale, and application scenarios of HBase in Alibaba Search, along with the problems encountered and the optimizations made.


History, scale, and service capability of HBase in Alibaba Search


History: Alibaba Search started using HBase in 2010 and has gone through more than ten HBase versions since then. The version currently in use is a heavily optimized fork of the community version. Community version 1.1.2 is not recommended because it has serious performance problems; the experience with later versions is much better.


Cluster scale: At present, Alibaba Search alone runs more than 3,000 HBase nodes, with the largest single cluster exceeding 1,500 nodes. Alibaba Group as a whole has far more nodes than that.


Service capability: On last year's Singles' Day (Double 11), the peak throughput of the Alibaba Search offline cluster exceeded 40 million requests per second, and the peak throughput of a single node reached 100,000 requests per second. Even with CPU usage above 70%, a single CPU core could still sustain 8,000+ QPS.


HBase's role and application scenarios in Alibaba Search


Role: HBase is the core storage system of Alibaba Search. It is closely integrated with the computing engines and mainly serves search and recommendation.

HBase application search and recommendation process


The figure shows how HBase serves the search and recommendation process. In index construction, commodity data stored in online databases such as MySQL, together with all online data generated by users, is imported into HBase in streaming mode and provided to the search engine for index building. In recommendation, the machine learning platform Porsche stores model and feature data in HBase, and user click data is also written into HBase in real time. The model is then updated through online training to improve the accuracy and effectiveness of online recommendation.


Application Scenario 1: Index construction. Taobao and Tmall draw on many different online data sources, which grow with the number of online sellers on Taobao and the number of visiting users.


Index building application scenarios


As shown above, we export data from HBase in bulk at night for the search engine to build the full index. During the day, online goods and user information change constantly; this dynamic data is updated to HBase in real time, triggering incremental index building to keep search results fresh.


At present, end-to-end latency can be kept within seconds; that is, inventory changes, product listings, and other information become searchable on the user side shortly after they are updated on the server side.
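As a rough illustration of the real-time update path, the following is a minimal sketch that writes one changed item into HBase with the standard Java client; the table, family, and column names are hypothetical, and the actual incremental index trigger sits downstream of this write.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RealtimeItemUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("item_index"))) {
            // One real-time update: the changed fields of an item, keyed by item id.
            Put put = new Put(Bytes.toBytes("item#123456"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("price"), Bytes.toBytes("2999"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("stock"), Bytes.toBytes("87"));
            table.put(put);
            // A downstream incremental index builder picks up this change
            // (e.g. via the streaming pipeline described above) and refreshes the index.
        }
    }
}
```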


Abstract view of the index building scenario


As shown in the figure above, the entire index building process can be abstracted into a continuous update process: the full and incremental data can be viewed as a join over the various online data sources, all of which are updated in real time. The whole thing is a long-running, continuous process, and this is where HBase is combined with streaming computing engines.


Application Scenario 2: Machine learning. Here is a simple example: a user wants to buy a mobile phone for about 3,000 yuan, so they filter Taobao results by that price, but find nothing they like. When the user searches again later, a machine learning model puts phones around 3,000 yuan near the top of the results; that is, the previous search behavior is used to influence the ranking of the next search results.


Analyzing online logs


As shown in the preceding figure, online logs are parsed into goods and user data, fed into a distributed, persistent message queue, and then stored in HBase. Updates are generated from the click logs of online users, and the corresponding models are updated accordingly for machine learning training, forming an iterative loop.


Problems and optimization of HBase search application in Ali


HBase architecture layers. Before discussing problems and optimizations, the HBase architecture can be divided into the following parts:


HBase architecture diagram


At the top is the API, the application programming interface layer. RPC, the remote procedure call protocol, is divided into two parts: the client initiates access and the server handles it. MTTR failover, Replication (data replication), table handling, and so on all belong to distributed management. The Core in the middle is the core data-processing part, handling writes and queries, and the bottom layer is HDFS, the distributed file system. HBase has run into many problems in Alibaba Search, and the following sections describe the bottlenecks and optimizations around RPC, asynchrony and throughput, GC and latency spikes, I/O isolation, and I/O utilization.


Problem and optimization 1: RPC bottleneck and optimization


RPC Server thread model


The actual problem on the RPC server side was that the original RpcServer threading model is inefficient. As shown in the figure above, each individual step is usually fast, but a request is handed off between several different threads, which makes the whole process inefficient. After the server was rewritten on top of Netty, threads are reused much more efficiently in the HBase RpcServer: the average RPC response time dropped from 0.92 ms to 0.25 ms, and throughput nearly doubled.
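To illustrate the idea (not Alibaba's actual implementation), here is a minimal Netty-based server skeleton in which a small, fixed set of event-loop threads is reused for all connections instead of handing each request across several dedicated threads; the handler logic and class name are hypothetical placeholders.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class NettyRpcServerSketch {
    public static void main(String[] args) throws Exception {
        // A small boss group accepts connections; a shared worker group (the event loop)
        // both reads requests and writes responses, so a request no longer hops
        // across separate reader/handler/responder threads.
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(new SimpleChannelInboundHandler<ByteBuf>() {
                            @Override
                            protected void channelRead0(ChannelHandlerContext ctx, ByteBuf req) {
                                // Placeholder: decode the RPC request, execute it, and write
                                // the response back on the same event-loop thread.
                                ctx.writeAndFlush(req.retain());
                            }
                        });
                    }
                });
            b.bind(60020).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```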


Problem and optimization 2: asynchrony and throughput


The actual problems on the RPC client side are that stream computing has strict real-time requirements, second-level latency spikes are unavoidable in a distributed system, the synchronous mode is sensitive to those spikes, and its throughput has a bottleneck. The optimization is to implement a non-blocking client based on Netty, with callbacks based on protobuf's non-blocking Stub/RpcCallback. When integrated with Flink, the measured throughput increased by 2x compared with the synchronous mode.
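As a rough illustration of the callback style (the stub interface and method below are hypothetical and not the actual Alibaba client API), a protobuf non-blocking stub lets the caller hand in an RpcCallback and return immediately, so many requests can be in flight at once:

```java
import com.google.protobuf.RpcCallback;

public class AsyncClientSketch {
    /** Hypothetical non-blocking stub: returns immediately, result delivered via callback. */
    interface AsyncGetStub {
        void get(byte[] row, RpcCallback<byte[]> done);
    }

    static void fetch(AsyncGetStub stub, byte[] row) {
        // The calling thread (e.g. a Flink operator) is not blocked while the RPC is
        // in flight; pipelining many requests is what lifts the throughput ceiling.
        stub.get(row, new RpcCallback<byte[]>() {
            @Override
            public void run(byte[] value) {
                // Handle the response (or a null/error marker) on the callback thread.
                System.out.println("got " + (value == null ? 0 : value.length) + " bytes");
            }
        });
    }
}
```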


Problem and optimization 3: GC and latency spikes


As shown in the figure above, the actual problem is that with the high I/O throughput of PCIe SSDs, the read cache is evicted and refilled much faster, and the on-heap cache memory is not reclaimed promptly, resulting in frequent CMS GC and even Full GC. The optimization is an end-to-end off-heap read path, which reduced the frequency of Full and CMS GC by more than 200% and increased read throughput by more than 20%.



As shown in the figure above, in an online result the QPS was 17.86M before the optimization and 25.31M after.
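The core idea of the off-heap read path described above is to keep cached blocks in memory that the JVM garbage collector does not scan. Below is a toy sketch using NIO direct buffers; it is illustrative only and not the HBase BucketCache implementation.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy off-heap block cache: block bytes live in direct memory, only small keys stay on-heap. */
public class OffHeapBlockCacheSketch {
    private final Map<String, ByteBuffer> blocks = new ConcurrentHashMap<>();

    public void cacheBlock(String blockKey, byte[] data) {
        // allocateDirect places the bytes outside the Java heap, so evicting and
        // replacing blocks at SSD speed no longer churns the old generation.
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data).flip();
        blocks.put(blockKey, buf);
    }

    public byte[] getBlock(String blockKey) {
        ByteBuffer buf = blocks.get(blockKey);
        if (buf == null) return null;
        byte[] copy = new byte[buf.remaining()];
        buf.duplicate().get(copy); // duplicate() keeps the cached buffer's position untouched
        return copy;
    }
}
```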


Problem and optimization 4: I/O isolation and optimization


HBase is sensitive to I/O: when a disk is saturated, large latency spikes can occur. In a mixed compute-and-storage deployment, the shuffle data generated by MapReduce jobs and HBase's own Flush/Compaction both contribute heavy I/O.


How do we mitigate these effects? Using HDFS's Heterogeneous Storage feature: the WAL (write-ahead log) and the HFiles of important business tables use the ALL_SSD policy, the HFiles of ordinary business tables use the ONE_SSD policy, and Bulkload is made to honor the specified storage policy. At the same time, the MapReduce temporary data directory (mapreduce.cluster.local.dir) is configured to use only SATA disks.
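A minimal sketch of applying such policies through the HDFS client API (the paths below are hypothetical; in practice the policies can also be set with the hdfs storagepolicies shell command):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        // WALs and HFiles of critical tables: every replica on SSD.
        dfs.setStoragePolicy(new Path("/hbase/WALs"), "ALL_SSD");
        dfs.setStoragePolicy(new Path("/hbase/data/default/critical_table"), "ALL_SSD");
        // Ordinary tables: one replica on SSD, the remaining replicas on HDD.
        dfs.setStoragePolicy(new Path("/hbase/data/default/ordinary_table"), "ONE_SSD");
    }
}
```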


Latency-spike reduction after I/O isolation on an HBase cluster


Throttling Compaction and Flush, together with per-CF Flush, can be used to limit their I/O impact on HBase. The green lines, from left to right, show the p999 values of response time, processing time, and wait time; taking response time as an example, 99.9% of requests take no more than 250 ms.


Problem and optimization 5: I/O utilization



HDFS writes three replicas. A common machine model has 12 HDDs, and the I/O capability of SSDs is far higher than that of HDDs. As shown in the figure above, the actual problem is that a single WAL cannot make full use of the disks' I/O capability.



As shown in the figure above, to make full use of the I/O, regions can be grouped by an appropriate mapping to achieve multiple WALs (multi-WAL). Namespace-based WAL grouping also supports I/O isolation between applications. In online measurements, write throughput increased by 20% on all-HDD machines and by 40% on all-SSD machines, and the mean online write response latency decreased from 0.5 ms to 0.3 ms.
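Conceptually, the grouping is just a stable mapping from a region (or its namespace) to one of N WAL instances; the sketch below is illustrative only and not the actual HBase region-grouping implementation.

```java
/** Toy mapping of regions to a fixed number of WAL groups. */
public class WalGroupingSketch {
    private final int numGroups;

    public WalGroupingSketch(int numGroups) {
        this.numGroups = numGroups;
    }

    /** Bounded grouping: hash the encoded region name into one of numGroups WALs. */
    public int walGroupFor(String encodedRegionName) {
        return Math.floorMod(encodedRegionName.hashCode(), numGroups);
    }

    /** Namespace-based grouping: each namespace gets its own WAL group for I/O isolation. */
    public int walGroupFor(String namespace, String[] allNamespaces) {
        for (int i = 0; i < allNamespaces.length; i++) {
            if (allNamespaces[i].equals(namespace)) {
                return i % numGroups;
            }
        }
        return 0; // unknown namespaces fall back to the first group
    }
}
```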


Open Source & the Future


Why embrace open source? First, imagine what would happen if everyone hid their optimizations, treating them as an advantage over others. If people share their strengths instead, they get positive feedback in return. Second, HBase teams are usually small, and losing staff can cause significant damage; if the work is contributed to the community, the maintenance cost of the code drops greatly. Doing open source together is far better than one company doing it alone, so we choose to contribute.

In the future, on the one hand, Alibaba Search will make the RPC server asynchronous as well, use the HBase kernel inside streaming computing, and provide an embedded mode for HBase. On the other hand, it will try replacing the HBase kernel with a new DB engine to achieve higher performance.


The above content is compiled from the lecture given by Li Yu (Juding) in the "Big Data System Architecture" session at WOTA 2017.