Preface

HBase is a non-relational, column-oriented, distributed NoSQL database built on Hadoop, with a design derived from Google’s BigTable. It is a distributed storage system offering high reliability, high performance, and high scalability for real-time read/write and random access to large-scale data sets, and it is widely used in big-data applications. HBase transparently shards stored data, which gives it strong storage and compute scalability.

Since 2017, Zhihu has gradually adopted HBase to store various kinds of online business data and has built application models and data computation tasks on top of the HBase service. Over the past two years, Zhihu’s core architecture team has built an HBase service management platform on top of the open-source container orchestration system Kubernetes. After nearly two years of iteration, it has grown into a fairly complete automated HBase operations system that supports rapid cluster deployment, smooth scaling up and down, fine-grained monitoring of HBase components, and fault tracing.

Background

Zhihu’s experience with HBase is not long. In early 2017, HBase was mainly used for offline workloads, such as algorithms, recommendation, anti-cheating, and the storage and computation of basic data-warehouse data, accessed through MapReduce and Spark. At that time, Zhihu’s online storage was mainly MySQL and Redis:

  • MySQL: stores most business data. As data volumes grow, some tables need to be split, and sharding brings complexity that some businesses would rather have hidden from them; for historical reasons, some table designs also keep data in the relational database that would be better stored in another form, and those businesses want to migrate it. In addition, MySQL runs on SSDs: performance is good, but so is the cost;
  • Redis: provides large-scale caching and some storage capability. Redis performs extremely well, but its main limitations are that resharding data is complicated and that memory is expensive.


Given the limitations of these two online storage systems, we wanted to build an online NoSQL storage service to complement them. During selection we also considered Cassandra, and some businesses tried it early on; after operating Cassandra for a while, a neighboring team ran into many problems, and today everything except the tracing-related systems has moved off Cassandra.

Starting from the existing offline storage system, and after weighing stability, performance, code maturity, integration with upstream and downstream systems, industry usage, and community activity, we chose HBase as one of the supporting components of Zhihu’s online storage.

HBase On Kubernetes

In the early days, Zhihu had only one cluster for offline computing; all services ran in it, and HBase was deployed alongside YARN, Impala, and other offline workloads. Daily offline jobs seriously interfered with HBase reads and writes, and HBase was monitored only at the host level, so when problems occurred, troubleshooting was difficult and recovery was slow. In this situation we needed to rebuild a system dedicated to online services.

In such a scenario, our requirements for an online HBase service were clear:

Isolation:

  • From the business side’s perspective, each business’s service should be isolated from the others, with access rights handed to the business itself, to avoid misoperation and interference between services.
  • SLAs for response time and service availability can be set according to each business’s needs.
  • Resource allocation and BlockCache configuration can be adapted per business, with business-level monitoring and alerting so that problems can be located and handled quickly.


Resource usage: from the operation and maintenance (O&M) perspective, allocate resources appropriately to improve CPU, memory, and disk utilization.

Cost control: the team should get the greatest O&M benefit at the lowest cost, which calls for convenient APIs to flexibly apply for, scale, manage, and monitor HBase clusters. Cost here covers both machines and engineers: at the time the online system was maintained by a single engineer.

Putting these requirements together, and drawing on our team’s earlier experience building infrastructure platforms, the ultimate goal was to turn the HBase service into a platform-level basic component offered to upstream businesses. This is also one of the working principles of Zhihu’s technology platform department: make every component as much of a black box to the business as possible, exposed as interfaces and services, while keeping usage and monitoring granularity as precise, detailed, and comprehensive as possible. With that in mind, we built an online HBase management and operations system.

Why Kubernetes?

As mentioned above, we wanted to turn the whole HBase platform into a service, which raises the question of how to manage and operate the HBase system. Zhihu has rich experience with microservices and containers: at the time, all of our online services had already been migrated to containers, with more than ten thousand business containers running smoothly on Bay, our MESOS-based container management platform (see [1]). The team was also actively containerizing infrastructure, and had already run the basic message-queue component Kafka on Kubernetes (see [2]). We therefore decided to use Kubernetes to manage and schedule HBase resources.

Kubernetes [3] is Google’s open-source container cluster management system, the open-source counterpart of Borg, Google’s long-standing internal large-scale container management technology. Kubernetes provides resource management and scheduling for components across various dimensions, resource isolation for containers, HA for each component, and a fairly complete networking solution. It is designed as an ecosystem platform for building components and tools that make it easy to deploy, scale, and manage applications. With Kubernetes behind us, we soon had our first working version ([4]).

The first generation

The platform uses the Kubernetes (K8S for short) API to set up multiple logically isolated HBase clusters on a shared physical cluster. Each HBase cluster consists of a group of Masters and a group of RegionServers (RS for short). The clusters share one HDFS storage cluster, while each has its own independent Zookeeper cluster. All clusters are managed by a management system called Kubas ([4]).


First-generation architecture


The module definition

To build an HBase cluster on K8S, the HBase components are described in terms of basic K8S resources. The relevant K8S resources are:

  • Node: a host node, which can be a physical machine or a VM;
  • Pod: a set of closely related containers, the basic scheduling unit in K8S;
  • ReplicationController: a controller for a set of Pods that guarantees Pod count, health, and elastic scaling.


Based on our earlier experience with Kafka on K8S, and for the sake of high availability and scalability, we did not adopt the multiple-containers-in-one-Pod deployment mode. Instead, we use one ReplicationController to define each type of HBase component, namely the Master and RegionServer in the figure above, plus the Thriftserver created on demand. With these concepts, a minimal HBase cluster on K8S can be defined as follows:

  • 2 × Master ReplicationController;
  • 3 × RegionServer ReplicationController;
  • 2 × Thriftserver ReplicationController (optional).

High availability and fault recovery

As a system serving online business, high availability and failover had to be considered in the design. In the overall design, we consider availability and failover at the component level, the cluster level, and the data storage level.

Component level

HBase itself already provides many failover and recovery mechanisms:

  • Zookeeper cluster: designed from the ground up for availability;
  • Master: multiple Master nodes register with the Zookeeper cluster for HA and active-Master failover;
  • RegionServer: stateless in itself; when a node fails and goes offline, its regions are automatically migrated, with no adverse impact on service availability;
  • Thriftserver: at the time, most businesses were written in Python or Golang and accessed HBase through Thrift; since a Thriftserver is itself a single point, we put a group of Thriftservers behind HAProxy (see the client sketch after this list);
  • HDFS: composed of NameNodes and DataNodes, with NameNode HA enabled to keep the HDFS cluster available.
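To make the Thriftserver access path concrete, here is a minimal Python sketch of a client going through an HAProxy frontend that balances over a group of Thriftservers; the hostname, port, and table name are illustrative assumptions rather than values from our actual setup.

# Minimal sketch: access HBase through a HAProxy-fronted Thriftserver group.
# "haproxy.hbase.example" and the "demo" table are hypothetical placeholders.
import happybase

# HAProxy listens on the Thrift port and forwards to any healthy Thriftserver.
connection = happybase.Connection(host="haproxy.hbase.example", port=9090)

table = connection.table("demo")
table.put(b"row-1", {b"cf:greeting": b"hello hbase"})
print(table.row(b"row-1"))

connection.close()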

Cluster level

  • Pod container failure: Pods are maintained by ReplicationControllers; the K8S ControllerManager watches component state stored in etcd, and when the number of replicas drops below the desired count it automatically starts a new Pod to take over the work.
  • Kubernetes cluster crash: this scenario has actually occurred in our production environment. To guard against it, for businesses with high SLA requirements we mix in a small number of physical machines alongside the containerized deployment, so that the impact on important businesses stays controllable in extreme scenarios.


Data level

All HBase clusters built on K8S share one HDFS cluster; data availability is guaranteed by the multiple replicas that the HDFS cluster keeps.

Implementation details

Resource allocation

In the initial stage, each physical node has two 12-core CPUs, 128 GB of memory, and 4 TB of disk. The disks are used to build HDFS, while the CPUs and memory are used to run the HBase-related nodes in the K8S environment.

The Master component manages the HBase cluster and the Thriftserver component acts as a proxy, so the resources of these two components are allocated as fixed quotas.

When designing resource allocation for the RegionServer component, we considered two ways of defining resources:


Resource allocation mode


Allocation based on business requirements:

  • Evaluate QPS and SLA from the business’s own description and configure the parameters, including BlockCache and region size and count, specifically for that business;
  • The advantage is per-business optimization: resources can be used fully and the business’s resource cost is reduced;
  • The drawback is higher management cost: every business has to be evaluated, which is unfriendly to platform maintainers and also requires the business engineers to understand HBase.


Unified specification of resource allocation:

  • CPU and memory are allocated according to preset quotas, with several profile sizes offered and the CPU and memory configuration of each profile packaged together;
  • Capacity expansion simply adds RegionServers, so the configuration stays stable, O&M cost is low, and troubleshooting is convenient;
  • The limitation is that businesses with particular access patterns, such as CPU-heavy computation, large KV storage, or MOB requirements, still need special customization.


At that time, few online businesses were being onboarded, so RegionServers were configured per business. In the production environment, a single business uses one uniformly configured group of RegionServers; there are no RegionServer groups with mixed configurations.

Parameter configuration

The base image is built on CDH 5.5.0 with HBase 1.0.0:

# Example for hbase dockerfile
# install cdh5.5.0-hbase1.0.0
ADD hdfs-site.xml /usr/lib/hbase/conf/
ADD core-site.xml /usr/lib/hbase/conf/
ADD env-init.py /usr/lib/hbase/bin/
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
ENV HBASE_HOME /usr/lib/hbase
ENV HADOOP_PREFIX /usr/lib/hadoop
ADD env-init.py /usr/lib/hbase/bin/
ADD hadoop_xml_conf.sh /usr/lib/hbase/bin/

  • Fixed environment variables, such as JAVA_HOME and HBASE_HOME, are injected into the container image via ENV;
  • HDFS-related configuration, such as hdfs-site.xml and core-site.xml, is added to the Docker image in advance; during the build it is placed into the HBase configuration directory so that the HBase service can reach HDFS with the correct settings;
  • HBase-specific configuration, such as the Zookeeper cluster address, the HDFS data directory path, heap size, and the GC parameters that components depend on at startup, has to be modified based on the information passed in to the Kubas service. A typical example of these parameters is as follows:


REQUEST_DATA = {
    "name": "test-cluster",
    "rootdir": "hdfs://namenode01:8020/tmp/hbase/test-cluster",
    "zkparent": "/test-cluster",
    "zkhost": "zookeeper01,zookeeper02,zookeeper03",
    "zkport": 2181,
    "regionserver_num": "3",
    "codecs": "snappy",
    "client_type": "java",
    "cpu": "1",
    "memory": "30",
    "status": "running",
}

When the Kubas service starts the Docker container, the startup command uses hadoop_xml_conf.sh and env-init.py to modify hbase-site.xml and hbase-env.sh and complete the final configuration injection, as shown below:

source /usr/lib/hbase/bin/hadoop_xml_conf.sh
&& put_config --file /etc/hbase/conf/hbase-site.xml --property hbase.regionserver.codecs --value snappy
&& put_config --file /etc/hbase/conf/hbase-site.xml --property zookeeper.znode.parent --value /test-cluster
&& put_config --file /etc/hbase/conf/hbase-site.xml --property hbase.rootdir --value hdfs://namenode01:8020/tmp/hbase/test-cluster
&& put_config --file /etc/hbase/conf/hbase-site.xml --property hbase.zookeeper.quorum --value zookeeper01,zookeeper02,zookeeper03
&& put_config --file /etc/hbase/conf/hbase-site.xml --property hbase.zookeeper.property.clientPort --value 2181
&& service hbase-regionserver start && tail -f /var/log/hbase/hbase-hbase-regionserver.log

Network communication

For networking, we use Kubernetes’s native network mode: each Pod has its own IP address and containers can communicate with each other directly. On top of this, we added automatic DNS registration and deregistration to the Kubernetes cluster, using the Pod’s identifying name as its domain name and synchronizing the records to the global DNS when Pods are created, restarted, or destroyed.

We ran into a problem here: in the Docker network environment, DNS could not reverse-resolve a container’s IP address back to its domain name. As a result, after startup a RegionServer registered its service name with the Master and the Zookeeper cluster inconsistently, and the same RegionServer ended up registered with the Master twice; the Master and RegionServer could then no longer communicate properly, and the cluster could not provide service.

After studying the source code and experimenting, we modify the /etc/hosts file before the RegionServer service starts in the container to mask the hostname information injected by Kubernetes. With this change, container-based HBase clusters start and initialize successfully, but it sacrifices some O&M convenience: on the Master web UI provided by HBase, RegionServer records appear as IP addresses, which makes monitoring and troubleshooting less convenient.
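The hosts-file workaround can be sketched roughly as follows. This is an illustrative reconstruction rather than our actual startup hook, and the filtering rule is an assumption about what such a hook would check.

# Illustrative sketch: strip the Kubernetes-injected hostname entry from
# /etc/hosts before starting the RegionServer, so that HBase does not pick
# up the unresolvable injected name. The filter condition is an assumption.
import socket

hostname = socket.gethostname()

with open("/etc/hosts") as f:
    lines = f.readlines()

# Drop the entry Kubernetes added for this Pod's own hostname.
kept = [line for line in lines if hostname not in line.split()]

with open("/etc/hosts", "w") as f:
    f.writelines(kept)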

Remaining problems

The first-generation architecture went into production successfully, but after nearly ten cluster services had been onboarded, it faced the following problems:

HBase cluster management was cumbersome:

  • The HDFS storage capacity had to be determined manually in advance and an independent Zookeeper cluster had to be requested separately; in the early days, several HBase clusters shared one Zookeeper cluster to save effort, which was inconsistent with our original design;
  • The container identifier was inconsistent with the RegionServer address registered in the HBase Master, which made fault localization harder;
  • Each individual RegionServer ran in its own ReplicationController (RC for short), so scaling meant adding or removing single-RegionServer RCs rather than making full use of RC features.


HBase configuration:

  • The original design lacked flexibility: hbase-site.xml and hbase-env.sh, which carry the HBase service configuration, were baked into the Docker image;
  • The HDFS-related hdfs-site.xml and core-site.xml configuration files were also baked into the image, because the original design assumed all HBase clusters would share one HDFS cluster for storage; bringing up an HBase cluster that depends on a different HDFS cluster through the Kubas service required rebuilding the image.


HDFS isolation:

  • As the number of HBase clusters grew, different HBase clusters placed different I/O loads on HDFS, so the HDFS clusters that HBase depends on needed to be separated;
  • The hdfs-site.xml and core-site.xml files were tied to specific Docker images, and the versions of those images were managed entirely by the developers themselves; the initial implementation did not take these issues into account.


Monitoring and operations:

  • Metrics were insufficient: changes in on-heap and off-heap memory and per-region and per-table access information were not collected or aggregated;
  • Hotspot regions could not be located within a short time;
  • Newly added or decommissioned components could only be discovered by scanning the Kubas service database, and component anomalies such as a RegionServer going offline or restarting, or a Master switchover, could not be reported in time.


Refactoring

To address the problems in the initial version of the architecture and to streamline HBase management, we reviewed the existing design and upgraded it using newer Kubernetes features. The whole Kubas management system was rewritten in Golang (the initial version was developed in Python), and several basic microservices for monitoring and operations were built on top of it, greatly improving the flexibility of deploying HBase on Kubernetes. The architecture is shown below:


Second generation architecture diagram


Deployment & ConfigMap

Deployment

  • Deployment is a Kubernetes concept: a declarative description of updates to Pods and ReplicaSets. It is used to replace ReplicationController, inherits all of ReplicationController’s functionality, and adds further management features;
  • In the new Kubas management system, Deployment is used instead of ReplicationController to manage Pods: one Deployment deploys a whole group of RegionServers rather than one ReplicationController per RegionServer, which makes cluster deployment and scaling far more flexible (see the sketch after this list);
  • Each Deployment is injected with labels for various dimensions of information, such as the cluster it belongs to, the service type, and the owning application.
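To illustrate the one-Deployment-per-component-group idea, the following is a minimal sketch using the official Kubernetes Python client; the cluster name, namespace, labels, image, and startup command are placeholders modeled on the examples in this article, not the real Kubas definitions.

# Minimal sketch: one Deployment describes a whole group of RegionServers.
# Names, labels, namespace, and image are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

labels = {"cluster": "test-cluster", "component": "regionserver", "app": "kubas"}

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="test-cluster-regionserver", labels=labels),
    spec=client.V1DeploymentSpec(
        # Scale the whole RegionServer group by changing this one number.
        replicas=3,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="regionserver",
                        image="dockerimage.zhihu.example/apps/example-regionserver:v1.1",
                        command=["bash", "-c",
                                 "service hbase-regionserver start && "
                                 "tail -f /var/log/hbase/hbase-hbase-regionserver.log"],
                    )
                ]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="online", body=deployment)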



Deployment


ConfigMap

  • ConfigMap is the Kubernetes resource object for storing configuration files. Through ConfigMap, external configuration can be mounted into a specified path inside a container before it starts, providing configuration to the programs running in it;
  • After the refactor, all HBase component configuration is stored in ConfigMaps. The system administrator prepares several HBase configuration templates in advance and stores them as ConfigMaps in the K8S system;
  • When a business requests an HBase service, the administrator renders the HBase configuration files for that cluster’s components, such as hbase-site.xml and hbase-env.sh, from a configuration template plus the business’s resource requirements, and saves them into a ConfigMap (see the sketch after this list);
  • Finally, when a container starts, K8S mounts the configuration files in the ConfigMap to the paths specified in the Deployment;
  • Like Deployments, each ConfigMap carries labels that associate it with the corresponding cluster and application.
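A minimal sketch of storing a rendered hbase-site.xml in a labeled ConfigMap with the Kubernetes Python client is shown below; the XML content, names, and labels are placeholders rather than an actual Kubas template.

# Minimal sketch: save a rendered hbase-site.xml as a labeled ConfigMap.
# The XML snippet, names, and labels are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

hbase_site = """<configuration>
  <property><name>hbase.rootdir</name>
    <value>hdfs://namenode01:8020/tmp/hbase/test-cluster</value></property>
  <property><name>hbase.zookeeper.quorum</name>
    <value>zookeeper01,zookeeper02,zookeeper03</value></property>
</configuration>
"""

config_map = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(
        name="test-cluster-hbase-site",
        labels={"cluster": "test-cluster", "app": "kubas"},
    ),
    data={"hbase-site.xml": hbase_site},
)

client.CoreV1Api().create_namespaced_config_map(namespace="online", body=config_map)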



ConfigMap archive


Component Parameter Configuration

After the ConfigMap feature was introduced, the request for creating a cluster changed:

RequestData

{
    "name": "performance-test-rmwl",
    "namespace": "online",
    "app": "kubas",
    "config_template": "online-example-base.v1",
    "status": "Ready",
    "properties": {
        "hbase.regionserver.codecs": "snappy",
        "hbase.rootdir": "hdfs://zhihu-example-online:8020/user/online-tsn/performance-test-rmwl",
        "hbase.zookeeper.property.clientPort": "2181",
        "hbase.zookeeper.quorum": "zookeeper01,zookeeper02,zookeeper03",
        "zookeeper.znode.parent": "/performance-test-rmwl"
    },
    "client_type": "java",
    "cluster_uid": "k8s-example-hbase---performance-test-rmwl---example"
}

The config_template field specifies the configuration template used by the HBase cluster; all component configurations for that cluster are rendered from this template.

The template also predefines the basic runtime configuration of each HBase component, such as the component type, the startup command, the image to use, and the initial number of replicas:

servers:
{
    "master": {
        "servertype": "master",
        "command": "service hbase-master start && tail -f /var/log/hbase/hbase-hbase-master.log",
        "replicas": 1,
        "image": "dockerimage.zhihu.example/apps/example-master:v1.1",
        "requests": {
            "cpu": "500m",
            "memory": "5Gi"
        },
        "limits": {
            "cpu": "4000m"
        }
    },
}

To work with the ConfigMap mechanism, the Docker image stores the configuration files under a predefined path and symlinks them into HBase’s actual configuration path:

RUN mkdir -p /data/hbase/hbase-site
RUN mv /etc/hbase/conf/hbase-site.xml /data/hbase/hbase-site/hbase-site.xml
RUN ln -s /data/hbase/hbase-site/hbase-site.xml /etc/hbase/conf/hbase-site.xml
RUN mkdir -p /data/hbase/hbase-env
RUN mv /etc/hbase/conf/hbase-env.sh /data/hbase/hbase-env/hbase-env.sh
RUN ln -s /data/hbase/hbase-env/hbase-env.sh /etc/hbase/conf/hbase-env.sh

The build process

With Deployment, ConfigMap, and the Dockerfile changes in place, the HBase build process also improved:


HBase on Kubernetes construction process


  • Write the relevant Dockerfiles and build the base images for the HBase components;
  • Build the basic property configuration template for the HBase cluster to be created and customize its basic resources; through the Kubas API, this can be used to create the ConfigMap in the Kubernetes cluster;
  • When creating the cluster, call the Kubas API to render the detailed ConfigMaps for each HBase component from the previously built configuration template, and then create the Deployments in the Kubernetes cluster;
  • Finally, the containers started from the component images load their configuration from the ConfigMaps and come up as HBase components running on Kubernetes Nodes.


By combining K8S ConfigMap templates with Kubas API calls, we can deploy a minimal usable HBase cluster (2 Masters + 3 RegionServers + 2 Thriftservers) in a short time. When all hosts already have the Docker images cached, deploying and starting an HBase cluster takes less than 15 seconds.
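Tying the pieces together, a cluster-creation request in the RequestData format shown above would be submitted to the Kubas API roughly as follows; the endpoint URL and response handling are hypothetical, since the Kubas API itself is internal to Zhihu.

# Hypothetical sketch of submitting a cluster-creation request to Kubas.
# The endpoint URL is invented; only the body mirrors the RequestData format
# shown earlier in this article.
import requests

request_data = {
    "name": "performance-test-rmwl",
    "namespace": "online",
    "app": "kubas",
    "config_template": "online-example-base.v1",
    "properties": {
        "hbase.rootdir": "hdfs://zhihu-example-online:8020/user/online-tsn/performance-test-rmwl",
        "hbase.zookeeper.quorum": "zookeeper01,zookeeper02,zookeeper03",
    },
}

resp = requests.post("http://kubas.example.internal/api/clusters", json=request_data)
resp.raise_for_status()
print(resp.json())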

Before a dedicated front-end console was available, the Kubernetes Dashboard could be used to scale HBase cluster components up and down and to query, modify, update, and redeploy their configuration.

Resource control

After the refactor, the HBase service was opened to Zhihu’s internal businesses, and in a short time the number of HBase clusters grew to more than 30. With this growth, two problems gradually emerged:

  1. Rising operation and maintenance cost: the number of clusters that needed to be operated kept growing;
  2. Wasted resources: many businesses do not have large traffic volumes, yet to keep HBase highly available each cluster still needs at least 2 Masters and 3 RegionServers, and the Masters in particular usually run at very low load, which wastes resources.


To solve these two problems without breaking the resource isolation requirement, we added HBase RSGroup support to the HBase platform’s management system.

The optimized architecture is as follows:


The use of the RSGroup


Because the business side’s management of its own HBase clusters already provides isolation, further resource management was handed to the platform side. The platform monitors the metrics of each individual cluster, and if a business cluster’s load stays below a threshold for a period of time after going online, the platform works with the business to migrate that HBase cluster into a shared Mixed HBase cluster.

Conversely, if the load of an HBase business running in the Mixed HBase cluster increases and stays above the threshold for a period of time, the platform considers promoting that business back to a dedicated cluster.
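For reference, moving a business’s servers and tables into its own RSGroup inside the Mixed cluster uses the standard HBase RSGroup shell commands (assuming the RSGroup coprocessor is enabled). The sketch below drives them from Python; the group, server, and table names are placeholders.

# Illustrative sketch: carve out an RSGroup for one business inside a Mixed
# cluster using the standard HBase shell RSGroup commands. Group, server,
# and table names are hypothetical placeholders.
import subprocess

commands = "\n".join([
    "add_rsgroup 'business_a'",
    "move_servers_rsgroup 'business_a', ['regionserver01:16020']",
    "move_tables_rsgroup 'business_a', ['business_a_table']",
    "get_rsgroup 'business_a'",
])

# Feed the commands into an hbase shell session via stdin.
subprocess.run(["hbase", "shell"], input=commands, text=True, check=True)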

Multi-IDC optimization

As Zhihu’s business grew, its infrastructure was gradually upgraded to a multi-datacenter architecture. In this process, the HBase platform’s management model was upgraded again to support multiple data centers (IDCs). The basic architecture is shown in the figure below:


Multiple IDC access modes


  • A business’s HBase clusters run in multiple IDCs, and the business decides which IDC is primary and which are secondary; data in the secondary IDC clusters is synchronized by the platform’s data synchronization component;
  • The Kubas service in each IDC is responsible only for operations on its local Kubernetes cluster, including creating and deleting HBase clusters, expanding RegionServers, and other HBase component management; each Kubas deployment is tied to its own data center and connects only to the K8S cluster there;
  • The Kubas service in each IDC reports its local cluster information to the cluster discovery service and updates the primary/secondary information of the relevant clusters;
  • Clients access the HBase clusters in the different IDCs through a client SDK packaged by the platform. The SDK uses the cluster discovery service to determine the primary/secondary relationship of an HBase cluster, so that reads and writes can be separated and writes and modifications are directed to the cluster in the primary IDC (a rough sketch follows this list);
  • For cross-IDC data synchronization, we developed the WALTransfer component, rather than using HBase Replication, to provide incremental data synchronization.
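The read/write separation in the client SDK can be pictured roughly as below. This is a hypothetical sketch: the real SDK and the cluster discovery protocol are internal, and every name here is invented.

# Hypothetical sketch of the client-SDK idea: send writes to the primary-IDC
# cluster and reads to the local cluster. The discovery lookup and hostnames
# are invented placeholders.
import happybase


def discover(cluster_name):
    # Placeholder for the platform's cluster discovery service; it would
    # return the Thrift endpoints of the primary and local clusters.
    return {
        "primary": ("hbase-primary-idc.example", 9090),
        "local": ("hbase-local-idc.example", 9090),
    }


class MultiIDCTable:
    """Write to the primary-IDC cluster, read from the local one."""

    def __init__(self, cluster_name, table_name):
        endpoints = discover(cluster_name)
        self._write_table = happybase.Connection(*endpoints["primary"]).table(table_name)
        self._read_table = happybase.Connection(*endpoints["local"]).table(table_name)

    def put(self, row, data):
        return self._write_table.put(row, data)

    def row(self, row):
        return self._read_table.row(row)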


Data synchronization

Many business scenarios require data synchronization across HBase clusters, for example between offline and online HBase clusters, or between clusters in different IDCs. HBase data can be synchronized either by full replication or by incremental replication.


HBase Data Synchronization


On the Zhihu HBase platform, data is synchronized between HBase clusters in two ways:

HBase Snapshot:

We use HBase Snapshots for full data replication, mainly in scenarios where offline data is synchronized into online clusters.
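For reference, a full copy via snapshots typically looks like the following. The sketch drives the standard HBase snapshot tooling from Python; the table name, snapshot name, and destination path are placeholders.

# Illustrative sketch of full replication with HBase snapshots: take a
# snapshot on the source cluster, then export it to the destination
# cluster's HDFS with the standard ExportSnapshot tool. Table, snapshot,
# and HDFS paths are hypothetical placeholders.
import subprocess

# 1. Take a snapshot of the source table via the HBase shell.
subprocess.run(
    ["hbase", "shell"],
    input="snapshot 'demo_table', 'demo_table_snapshot'\n",
    text=True,
    check=True,
)

# 2. Copy the snapshot files to the destination cluster's HBase root dir.
subprocess.run(
    [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", "demo_table_snapshot",
        "-copy-to", "hdfs://destination-namenode:8020/hbase",
        "-mappers", "16",
    ],
    check=True,
)

# 3. On the destination cluster, restore the table with
#    clone_snapshot 'demo_table_snapshot', 'demo_table' in the HBase shell.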

WALTransfer:

WALTransfer is used for incremental data synchronization between HBase clusters. For incremental replication we do not use HBase Replication; instead we use our own WALTransfer component.

WALTransfer reads the WAL file list provided by the source HBase cluster, locates the WAL files on the corresponding HDFS cluster, and writes the incremental HBase data to the destination cluster in order. The details will be covered in a future article.

Monitoring

As the refactored architecture diagram above shows, we added many modules to the Kubas service, most of which belong to the monitoring and management functions of the HBase platform.

Kubas-Monitor component

This basic monitoring module discovers new HBase clusters by polling, and subscribes to the Zookeeper cluster to discover each HBase cluster’s Master and RegionServer group.

It collects RegionServer metrics, including the following (a polling sketch follows this list):

  • Region information: number of online regions, number of stores, store file size, store file index size, and memstore hits and misses;
  • BlockCache information: BlockCache used and free space, cumulative misses, and hit ratio;
  • Read/write request statistics: maximum and minimum read/write response times, read/write distribution per table, read/write data volume, and read/write failure counts;
  • Compaction and split information: queue lengths, operation counts, and durations;
  • Handler information: queue length, number of active handlers, and number of active readers.
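One common way to pull such RegionServer metrics is the JMX JSON endpoint that HBase exposes on the RegionServer info port (16030 by default); the sketch below is an illustration of that approach, not the actual Kubas-Monitor code, and the hostname and metric names are assumptions.

# Illustrative sketch: pull RegionServer metrics from the /jmx endpoint on
# the RegionServer info port. The hostname is a placeholder; the real
# Kubas-Monitor discovers RegionServers through Zookeeper.
import requests

def fetch_regionserver_metrics(host, port=16030):
    resp = requests.get("http://{}:{}/jmx".format(host, port), timeout=5)
    resp.raise_for_status()
    return resp.json()["beans"]

beans = fetch_regionserver_metrics("regionserver01.example")

# Pick out the server-level bean, which carries region counts, store file
# sizes, BlockCache counters, request statistics, and so on.
for bean in beans:
    if bean.get("name") == "Hadoop:service=HBase,name=RegionServer,sub=Server":
        print(bean.get("regionCount"), bean.get("storeFileSize"))
        print(bean.get("blockCacheHitCount"), bean.get("totalRequestCount"))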


Other metrics, such as container CPU and memory usage, come from the Kubernetes platform’s monitoring, while disk I/O and disk usage come from host-level monitoring.



HBase Monitoring


Kubas-Region-Inspector component


  • Collects region-level information for HBase tables, uses the HBase API to obtain data statistics for each region, and aggregates the region data into a data table;
  • Combined with open-source components, this produces a region distribution map of the HBase cluster, which is used to locate region hotspots.



HBase Region distribution monitoring


The monitoring information collected by the modules above is basically sufficient to describe the state of the HBase clusters running on Kubernetes and to help operators locate and fix faults.

Future Work

As the company’s business grows rapidly, Zhihu’s HBase platform keeps iterating and improving. In the short term, we will further improve the management and service capability of the Zhihu HBase platform in the following areas:

  • Cluster security and stability: add HBase permission support to improve security isolation in multi-tenant access scenarios;
  • Self-service cluster building: provide a user data management system so that business users can build HBase clusters themselves and add plug-ins such as Phoenix;
  • O&M automation: automatic cluster scaling, automatic hotspot detection and migration, and so on.
