This article is part of the ByteDance Infrastructure Practices series, in which the technical teams and experts of ByteDance's infrastructure department share practical experience and lessons learned from building and evolving the company's infrastructure, and exchange ideas with fellow engineers. ByteDance has extensive storage requirements for massive volumes of structured and semi-structured data. This article takes an in-depth look at the evolution of ByteDance's distributed table storage systems, starting from concrete business needs, and shows how we are working our way toward the top of the industry.

Overview

Distributed table storage systems are widely used in industry. Google has released two generations of them, Bigtable and Spanner, which serve virtually all table storage requirements of its internal products and external cloud services. HBase, the open-source implementation of Bigtable, is widely used by companies at home and abroad; on top of it, the open-source graph database JanusGraph, the time-series database OpenTSDB, the geographic information database GeoMesa, and the relational database Phoenix all use HBase for data storage. ByteDance also used HBase as its table storage service. In practice, however, HBase exhibited high latency and low availability under large data volumes and heavy load. The ByteDance storage team therefore surveyed a large number of open-source systems and eventually decided to build its own distributed table storage system: one compatible with the HBase data model and semantics, strongly consistent, highly available, low-latency, high-throughput, globally ordered, and friendly to both SSDs and HDDs, in order to meet the technical challenges brought by ByteDance's rapid business growth. Today the first-generation table storage system, Bytable1.0, runs stably as the underlying data store for search and recommendation, and our team is evolving the second-generation system, Bytable2.0, on this foundation.

1. Background

With the launch and growth of Toutiao's full-web search project, the business needed a globally ordered, large-capacity, and efficient table storage system to store all links and web pages on the Internet, and to ensure that any change on the Internet could be reflected in the table storage system in real time. Google, the leader in search engine technology, built Bigtable for exactly this purpose and used it in many of its projects, including Search, Earth, and Finance; after Google published the Bigtable paper, many other search companies followed suit. Our team initially used HBase, Bigtable's open-source implementation, to serve search. Updating the link graph of the entire web in real time demands sufficiently high availability and sufficiently low latency. Because the data volume is enormous, the data is split into a very large number of shards, and the latency and availability of the cluster as a whole degrade sharply as the number of shard instances grows, which places even higher demands on the latency and availability of each individual shard. HBase, however, suffers from high tail latency and low availability and could not meet these requirements, so our team began to research and select a new technology.


2. Bytable1.0: the first-generation table storage system

2.1 Technology Selection

In our scenario, the volume of web link and page data is so large that using SSDs exclusively would greatly increase storage cost, so the system must run efficiently on HDDs as well as SSDs. Several globally ordered, open-source distributed table storage systems are widely used today, such as HBase, TiDB, and CockroachDB. We had already verified that HBase could not meet our requirements because of its tail latency and availability problems. Under our requirements, TiDB and CockroachDB have two drawbacks: (1) data migration requires multi-way merging and scanning of data; (2) each write must go to two logs. Both cause frequent disk-head seeks, which is unfriendly to HDDs. Since none of these systems solved our problems, our team finally decided, given the company's available hardware and software resources and our business scenarios, to independently develop Bytable1.0: a highly available, strongly consistent, low-latency, high-throughput, globally ordered distributed table storage system that is friendly to both SSDs and HDDs and compatible with the HBase data model and semantics. When designing the solution, we followed the principle of keeping the design as simple as possible while meeting business requirements. Distributed table storage systems in the industry often adopt a shared storage architecture, but at the time our internal distributed file storage system could not provide sufficiently high availability and SLAs. To meet the service's high availability requirements, we decided to manage local disks directly instead of relying on a distributed file system.

2.2 System Architecture

Let us introduce the system architecture of Bytable1.0. As shown in the figure below, it consists of three modules: Master, PlacementDriver, and TabletServer. The Client communicates with the Master to obtain meta information about Tablets (as in Bigtable, a Tablet is a data shard of a table, corresponding to a KeyRange of that table) and the addresses of the TabletServers hosting them, and then communicates with the TabletServers directly to read and write data. Much like the NameNode and DataNode in HDFS, these modules handle the control plane and the data plane respectively. We divide Bytable1.0 into a control plane (Master), a decision plane (PlacementDriver), and a data plane (TabletServer), and describe the internal design of each plane in detail below.

Bytable1.0 Global architecture diagram

2.2.1 Control Plane: Master

The Master drives the concrete workflows of Tablet operations: active leader switchover, passive leader switchover, migration, splitting, and merging. It also exposes Tablet placement information to the Client. If the Master does not receive a heartbeat from a TabletServer within a specified period, it considers that TabletServer offline and initiates a round of leader election for every Leader Tablet on it. The election process is similar to Raft's, except that this logic is lifted into the Master for unified management: the Master first collects the latest log index of each replica of the Tablet, then selects a sufficiently up-to-date replica as the leader candidate, asks each replica of the Tablet to vote, and, once the vote succeeds, sends a become-leader request to the candidate. Because the Master takes over all control plane work, its logic and RPC interactions are quite complex; since its performance requirements are modest, we chose Go to reduce development cost and improve development efficiency.
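The Go sketch below illustrates the flavor of this Master-driven election at a very high level; the types, field names, and majority check are assumptions made for the example and are not the actual Bytable implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaState is the per-replica information the Master collects for a
// Tablet before electing a leader. The names here are hypothetical.
type ReplicaState struct {
	Addr     string
	LogIndex uint64 // index of the last replicated log entry
	Alive    bool
}

// pickLeaderCandidate chooses the alive replica with the newest log,
// requiring that a majority of replicas were reachable so that no
// committed entry can be lost by the choice.
func pickLeaderCandidate(replicas []ReplicaState) (string, error) {
	alive := 0
	best := -1
	for i, r := range replicas {
		if !r.Alive {
			continue
		}
		alive++
		if best == -1 || r.LogIndex > replicas[best].LogIndex {
			best = i
		}
	}
	if alive*2 <= len(replicas) {
		return "", errors.New("no majority of replicas reachable")
	}
	return replicas[best].Addr, nil
}

func main() {
	replicas := []ReplicaState{
		{Addr: "ts-1:9000", LogIndex: 120, Alive: true},
		{Addr: "ts-2:9000", LogIndex: 118, Alive: true},
		{Addr: "ts-3:9000", LogIndex: 121, Alive: false}, // offline server
	}
	candidate, err := pickLeaderCandidate(replicas)
	if err != nil {
		panic(err)
	}
	// In the real system the Master would now collect votes from the
	// replicas and send a become-leader request to the candidate.
	fmt.Println("leader candidate:", candidate)
}
```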

2.2.2 Decision Plane: PlacementDriver

The Master drives the workflow of a given operation, but it does not decide whether to initiate that operation (with the exception of passive leader switchover); that is the responsibility of the PlacementDriver (PD). The PD scans the Master to obtain per-shard load information, computes decisions by running various policies, and sends those decisions to the Master for execution. Because policies are updated relatively frequently, splitting them into a separate module means that a policy update does not require a rolling restart of the Master and is therefore completely transparent to users. To reduce the cost of writing complex policies, the PD is also developed in Go.
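As a rough illustration, a PD decision loop could look like the sketch below; the TabletLoad and Decision types, the thresholds, and the policies are invented purely for the example.

```go
package main

import "fmt"

// TabletLoad is the per-shard load information the PD obtains from the
// Master; the fields and thresholds here are hypothetical.
type TabletLoad struct {
	TabletID  string
	SizeBytes int64
	WriteQPS  int64
}

// Decision is what the PD sends back to the Master for execution.
type Decision struct {
	TabletID string
	Action   string // "split", "merge", "migrate", ...
}

// decide applies two toy policies: split tablets that are too large or
// too hot. Real policies would also consider merges and migrations.
func decide(loads []TabletLoad, maxSize, maxQPS int64) []Decision {
	var out []Decision
	for _, l := range loads {
		if l.SizeBytes > maxSize || l.WriteQPS > maxQPS {
			out = append(out, Decision{TabletID: l.TabletID, Action: "split"})
		}
	}
	return out
}

func main() {
	// In the real PD this would run in a loop, scanning the Master over
	// RPC at a fixed interval; here we fabricate one round of input.
	loads := []TabletLoad{
		{TabletID: "t1", SizeBytes: 96 << 30, WriteQPS: 500},
		{TabletID: "t2", SizeBytes: 8 << 30, WriteQPS: 50000},
	}
	for _, d := range decide(loads, 64<<30, 20000) {
		fmt.Printf("propose %s for tablet %s\n", d.Action, d.TabletID)
	}
}
```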

2.2.3 Data Plane: TabletServer

The TabletServer hosts the data of multiple Tablets and performs the actual reads and writes. Each Tablet forms a replication group with its counterpart Tablets on other TabletServers, and data is replicated among the members of the group using a simplified Raft protocol. The simplified protocol implements only Log Replication, moving Leader Election and Membership Change to the Master. This separates the data plane from the control plane and simplifies data plane logic. Centralizing control plane operations in the Master also allows more sophisticated leader election strategies, and the many per-group Raft heartbeats can be merged into the Master's heartbeat to reduce heartbeat overhead.
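A minimal sketch of the log-replication-only half of the protocol, seen from the follower side, is shown below (written in Go for readability even though the real TabletServer is C++; the structures are illustrative, not the real implementation).

```go
package main

import "fmt"

// Entry is one replicated log record. Term and Index follow the usual
// Raft conventions; the struct is illustrative only.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// FollowerLog accepts AppendEntries-style requests. Because leader
// election and membership change live in the Master, the follower only
// has to check log consistency and append.
type FollowerLog struct {
	entries []Entry
}

// Append mimics the log-matching check of Raft's AppendEntries: the
// entry preceding the new batch must match prevIndex/prevTerm.
func (f *FollowerLog) Append(prevIndex, prevTerm uint64, batch []Entry) bool {
	if prevIndex > 0 {
		if int(prevIndex) > len(f.entries) {
			return false // gap: leader must back up and resend
		}
		if f.entries[prevIndex-1].Term != prevTerm {
			return false // divergence: conflicting suffix must be replaced
		}
	}
	f.entries = append(f.entries[:prevIndex], batch...)
	return true
}

func main() {
	var f FollowerLog
	ok := f.Append(0, 0, []Entry{{Term: 1, Index: 1, Data: []byte("put k v")}})
	fmt.Println("appended:", ok, "log length:", len(f.entries))
}
```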

The data path is the critical path of the system and demands high performance to increase throughput and reduce latency. At the language level we chose C++ for efficient and predictable performance. We also made many optimizations in the log engine and the data engine. Most systems on the market today, such as MySQL, MongoDB, TiDB, and CockroachDB, write both a replication log and an engine log. On HDDs, writing two logs doubles the write traffic and causes the disk head to seek back and forth frequently, limiting write performance. Our implementation therefore drops the engine log and keeps only the replication log, avoiding the head-seek problem. In addition, TiDB and CockroachDB use RocksDB to merge multiple groups' replication logs into one store, but this introduces unnecessary MemTable inserts and SST Flush & Compaction overhead and is not efficient. Our team therefore built a dedicated WAL storage engine that merges multiple groups' replication logs into a single sequential log write, so the HDD head does not seek, no Compaction is needed, and the engine can reach the full sequential write bandwidth of the HDD.
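A highly simplified sketch of such a merged WAL is shown below: entries from many replication groups are appended to one file in arrival order, each tagged with its group ID and log index so that per-group logs can be reconstructed on recovery. The record format and names are assumptions for illustration only, not the real engine.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"os"
)

// appendRecord writes one record with an illustrative layout:
// groupID | index | payload length | payload.
func appendRecord(w *bufio.Writer, groupID, index uint64, payload []byte) error {
	var hdr [24]byte
	binary.LittleEndian.PutUint64(hdr[0:8], groupID)
	binary.LittleEndian.PutUint64(hdr[8:16], index)
	binary.LittleEndian.PutUint64(hdr[16:24], uint64(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

func main() {
	f, err := os.OpenFile("merged.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	w := bufio.NewWriter(f)

	// Logs from different replication groups are interleaved in one
	// sequential file, so the disk head never seeks between per-group
	// log files and no background Compaction is required.
	records := []struct {
		group, index uint64
		payload      string
	}{
		{1, 101, "put a 1"},
		{7, 55, "put b 2"},
		{1, 102, "delete a"},
	}
	for _, r := range records {
		if err := appendRecord(w, r.group, r.index, []byte(r.payload)); err != nil {
			panic(err)
		}
	}
	if err := w.Flush(); err != nil {
		panic(err)
	}
	// A real engine would fsync before acknowledging the write.
	if err := f.Sync(); err != nil {
		panic(err)
	}
	fmt.Println("appended", len(records), "records to merged.wal")
}
```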

For the data storage engine, our team adopted RocksDB, a popular and widely proven storage engine. Unlike TiDB and CockroachDB, Bytable1.0 maps each Tablet to its own RocksDB instance. During data migration, RocksDB files can be read, transferred, and ingested directly, with no extra scanning, file generation, or file injection logic. This avoids the head-seek problem caused by multi-way merging during a scan, the potential Write Stall caused by RocksDB file ingestion landing in L0, and the write delays caused by RocksDB file ingestion not being allowed to run concurrently with normal writes; the result is fast, lightweight data migration. In addition, the size of each LSM-Tree can be controlled appropriately, avoiding mismatches between an LSM-Tree's size and its RocksDB parameters.

Using one RocksDB instance per Tablet introduces the problem of implementing Split and Merge efficiently. For a Split, the original RocksDB instance must be divided into two. Our implementation exploits a property of the LSM-Tree: we first hard-link all existing engine files into the new instance while writes continue, then stop writes, hard-link the small number of incremental files produced in the meantime, and re-enable writes. The unavailable window is only the time needed for the few incremental hard links. For a Merge, we modified RocksDB to record ingested files and their GlobalSequence in the Manifest at ingestion time, which allows files from another RocksDB instance to be ingested directly. Like the Split flow, the Merge flow ingests incrementally to keep the unavailable window short.
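The two-phase hard-link idea can be sketched as follows. This is a toy illustration on plain files, not the actual RocksDB-level implementation; the pause/resume hooks are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkNewFiles hard-links every file in src that is not yet present in
// dst. Hard links are cheap, so the bulk of the work can happen while
// writes are still flowing.
func linkNewFiles(src, dst string) (int, error) {
	entries, err := os.ReadDir(src)
	if err != nil {
		return 0, err
	}
	linked := 0
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		target := filepath.Join(dst, e.Name())
		if _, err := os.Stat(target); err == nil {
			continue // already linked during the full phase
		}
		if err := os.Link(filepath.Join(src, e.Name()), target); err != nil {
			return linked, err
		}
		linked++
	}
	return linked, nil
}

func main() {
	src, dst := "tablet-src", "tablet-new"
	os.MkdirAll(src, 0o755)
	os.MkdirAll(dst, 0o755)
	os.WriteFile(filepath.Join(src, "000001.sst"), []byte("data"), 0o644)

	// Phase 1: full hard link while writes continue.
	n, _ := linkNewFiles(src, dst)
	fmt.Println("full phase linked", n, "files")

	// pauseWrites() would go here in the real system.
	// Phase 2: incremental hard link of files created since phase 1;
	// this is the only window during which writes are paused.
	n, _ = linkNewFiles(src, dst)
	fmt.Println("incremental phase linked", n, "files")
	// resumeWrites() would go here.
}
```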

Compared with an implementation in which multiple Tablets share one RocksDB instance, Bytable1.0 gains in the stability and efficiency of migration and in the size control of a single LSM-Tree, at the cost of some jitter during splits and merges.


The following figure compares the latency of the Bytable1.0 TabletServer and the HBase RegionServer on the same machine under a mixed read/write workload (50% reads, 50% writes). Bytable1.0 significantly reduces both average latency and p99 latency.


2.3 Optimization Details

2.3.1 Point-read Optimization

Because we encapsulate the table data model on top of RocksDB's KV model (a point read of a row becomes a prefix scan over several KV entries), we cannot use RocksDB's built-in BloomFilter to speed up point reads. Instead, during Flush and Compaction we manually generate a BloomFilter for each SST and store it in that SST's TableProperties, and we turn off the built-in BloomFilter to save space. At query time, the TableFilter callback looks up the manually generated BloomFilter of the corresponding SST to decide whether the SST can be skipped, which optimizes point-read performance.
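The sketch below shows only the filter logic itself, in Go rather than C++ and with a toy Bloom filter; in the real system a filter like this would be serialized into each SST's TableProperties at Flush/Compaction time and consulted through RocksDB's TableFilter callback on the read path.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rowBloom is a toy Bloom filter over row keys. Everything here is in
// memory; the real filter lives inside each SST's properties.
type rowBloom struct {
	bits []uint64
	k    int
}

func newRowBloom(nbits, k int) *rowBloom {
	return &rowBloom{bits: make([]uint64, (nbits+63)/64), k: k}
}

// hashes derives k bit positions from one FNV hash via double hashing.
func (b *rowBloom) hashes(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	out := make([]uint64, b.k)
	for i := 0; i < b.k; i++ {
		out[i] = (h1 + uint64(i)*h2) % uint64(len(b.bits)*64)
	}
	return out
}

func (b *rowBloom) Add(rowKey string) {
	for _, p := range b.hashes(rowKey) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

func (b *rowBloom) MayContain(rowKey string) bool {
	for _, p := range b.hashes(rowKey) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false // definitely not in this SST: skip it
		}
	}
	return true // possibly present: the SST must be read
}

func main() {
	// Build the filter while flushing an SST...
	f := newRowBloom(1<<16, 4)
	f.Add("row-00042")

	// ...and consult it on the read path before touching the SST's
	// data blocks.
	fmt.Println(f.MayContain("row-00042")) // true
	fmt.Println(f.MayContain("row-99999")) // almost certainly false
}
```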


2.3.2 Hotspot Write Optimization

Because we use Raft as the replication protocol and Raft's Apply phase is serial, write hotspots normally have to be handled by splitting the hot Tablet to scale out. Since our split and merge operations are relatively heavy, hotspot-split decisions should not be made too aggressively or too quickly. To scale the write performance of a single Tablet, we implemented a new RocksDB MemTable: the write path only enqueues the record into a write queue, which is O(1), a background thread pool concurrently inserts the queued data into the actual SkipList, and reads use a ReadIndex-style wait until the relevant entries have been applied. This raises the write ceiling of a single Tablet from the capacity of a single thread to the capacity of the whole machine.
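The idea can be sketched roughly as follows. A locked map stands in for the SkipList, and per-record completion channels stand in for the ReadIndex-style wait on a log position; all names are illustrative, not the real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// record is one queued write plus a completion signal. The real system
// waits on a ReadIndex (a log position) rather than per-record channels;
// channels keep this sketch small and correct.
type record struct {
	key, value string
	done       chan struct{}
}

// queuedMemTable makes the foreground write path O(1): Put only
// enqueues, and a pool of background workers performs the actual insert.
type queuedMemTable struct {
	queue chan record
	mu    sync.RWMutex
	data  map[string]string // stand-in for the real SkipList
}

func newQueuedMemTable(workers int) *queuedMemTable {
	m := &queuedMemTable{queue: make(chan record, 1024), data: map[string]string{}}
	for i := 0; i < workers; i++ {
		go func() {
			for r := range m.queue {
				m.mu.Lock()
				m.data[r.key] = r.value
				m.mu.Unlock()
				close(r.done) // signal that this write is now readable
			}
		}()
	}
	return m
}

// Put is the fast path seen by the Raft Apply thread: enqueue and return.
func (m *queuedMemTable) Put(key, value string) <-chan struct{} {
	done := make(chan struct{})
	m.queue <- record{key: key, value: value, done: done}
	return done
}

// Get waits until the write it depends on has been applied, then reads.
func (m *queuedMemTable) Get(key string, applied <-chan struct{}) (string, bool) {
	<-applied
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

func main() {
	mt := newQueuedMemTable(4)
	applied := mt.Put("user:1", "alice")
	if v, ok := mt.Get("user:1", applied); ok {
		fmt.Println(v)
	}
}
```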


2.3.3 Write Backpressure Optimization

When a single Tablet is written to very quickly, one replica is usually slightly slower than the others because of small differences in hardware resources, and it gradually falls far enough behind that it has to reload a snapshot. If one of the two fast replicas then goes down, the Tablet temporarily becomes unwritable (the remaining replica is still reloading its snapshot). To avoid this, under certain conditions we apply backpressure on behalf of the slow node: we limit the overall write rate so that the slow replica does not fall too far behind, improving availability in this extreme case. We also set certain thresholds so that backpressure is not triggered by a replica that is slow because of a failing disk.
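A minimal sketch of such a policy is shown below; the lag thresholds and the delay formula are invented for illustration and are not Bytable's actual parameters.

```go
package main

import "fmt"

// Lag is measured in log entries the follower is behind the leader.
// The thresholds below are invented for illustration.
const (
	backpressureLag = 10_000  // start throttling above this lag
	giveUpLag       = 200_000 // beyond this, assume a faulty disk and stop throttling
)

// writeDelayFactor returns a multiplier (>= 1) applied to the leader's
// write pacing. Lag within [backpressureLag, giveUpLag] slows writes
// proportionally; lag beyond giveUpLag is treated as a disk fault and
// no longer holds back the whole group.
func writeDelayFactor(lag uint64) float64 {
	switch {
	case lag < backpressureLag:
		return 1.0
	case lag >= giveUpLag:
		return 1.0
	default:
		return 1.0 + float64(lag-backpressureLag)/float64(backpressureLag)
	}
}

func main() {
	for _, lag := range []uint64{100, 25_000, 500_000} {
		fmt.Printf("lag=%d -> delay factor %.2f\n", lag, writeDelayFactor(lag))
	}
}
```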


2.4 Distributed Transactions

Some business scenarios require distributed transactions. Following Percolator's transaction model, our team has implemented a distributed transaction solution on top of Bytable, which will be described in detail in a follow-up article.

3. Bytable2.0: the second-generation table storage system in pursuit of the ultimate design

With Bytable1.0 launched and stabilized, the business placed higher demands on the cost, stability, and performance of the table storage system. On the cost side, a three-replica deployment consumes nearly twice the storage space of an Erasure-Coded (EC) distributed file system, and running Compaction on each of the three replicas likewise consumes twice as much CPU. From the perspective of resource utilization, read/write load and disk space usage are often mismatched: some machines have spare CPU but full disks, while others have spare disks but saturated CPUs. From the perspective of integrating with the offline Hadoop ecosystem, backups, exports, and bulk data loading are currently cumbersome. From the perspective of stability, splitting and merging still cause some latency jitter, and we want to push tail latency optimization to the limit. From the perspective of operational complexity, a distributed file system lets our service become stateless; combined with Kubernetes scheduling, we no longer need to handle machine and disk failure repair and re-onlining, which greatly reduces operational cost. It therefore became imperative to design and implement a distributed table storage system based on shared storage.

3.1 Technology Selection

Unlike Bytable1.0, which aimed only to meet the business needs of the moment, with Bytable2.0 we want to search for the optimal solution in a wider design space and build a world-class distributed table storage system. As ByteDance's distributed file storage evolved toward serving online workloads, we gained ByteStore, a distributed file storage system suitable for online use. We also developed ByteJournal, a distributed log storage system based on a Quorum protocol, with Paxos-like out-of-order log replication and commit to minimize latency. In addition, ByteJournal offers an advanced configuration in which a log is written to 5 replicas, commits once 2 replicas acknowledge, and requires 4 replicas for recovery, providing even better tail latency than a standard Paxos deployment. This time we chose to stand on the shoulders of giants: Bytable2.0 uses ByteStore to store engine files and ByteJournal to manage replication logs, building on these two distributed storage systems to realize a shared-storage table storage system. In addition, international businesses need globally consistent table storage and want to set and change the replication configuration of specific data at fine granularity, so as to control the read and write latency (that is, the distance the data must travel) of that data.

3.2 System Architecture

The system design of Bytable2.0 borrows ideas from Spanner, Google's second-generation distributed table storage system. In this section we introduce the system architecture of Bytable2.0. As shown in the figure below, Bytable2.0 is divided into a data read/write layer (TabletServer), a global management layer (GlobalMaster), a compute scheduling layer (GroupMaster), and a data scheduling layer (PlacementDriver).

The main differences between Bytable2.0 and Bytable1.0 are as follows:

  1. Unlike Bytable1.0, which stores Tablet distribution information in the Master, Bytable2.0 creates a corresponding MetaTable for each table and stores it in a TabletServer. One benefit is that access to each table's Tablet distribution information is isolated; another is that if a MetaTable ever becomes a bottleneck, it can be made splittable by adding a RootTable in a future implementation.
  2. Unlike Bytable1.0, where one Tablet corresponds to one replication group, in Bytable2.0 a replication group contains multiple Tablets: a TabletServer holds one Replica of each of several ReplicationGroups, and each ReplicationGroup holds multiple Tablets. Data from multiple non-contiguous KeyRanges can therefore be placed in the same ReplicationGroup, which makes fine-grained partitioning easier and allows data from non-contiguous KeyRanges to be processed locally within a single ReplicationGroup. The replication granularity of a ReplicationGroup is thus decoupled from how the KeyRange is cut: even if the KeyRange is split into a very large number of pieces, the number of ReplicationGroups remains bounded.
  3. Instead of the single Master of Bytable1.0, Bytable2.0 introduces the GlobalMaster and GroupMaster components. The GlobalMaster records the meta information of all tables and ReplicationGroups. Each GroupMaster synchronizes the meta information managed by the GlobalMaster and stores a redundant local copy, but manages only the TabletServer instances and Replica placement within its own Region, achieving intra-region autonomy so that inter-region network conditions cannot affect intra-region availability.

The main differences between Bytable2.0 and HBase are as follows:

  1. HBase supports synchronous or asynchronous primary/secondary replication between clusters; reads are consistent within a cluster but not across clusters. Bytable2.0 supports cross-region data replication while guaranteeing global consistency.
  2. In HBase, a Region maps one-to-one to a KeyRange, so replicating a KeyRange at fine granularity produces a large number of small Regions, which HBase does not handle well. Bytable2.0 supports fine-grained replication settings at the Tablet level.

When reading or writing data, the Client first asks the GroupMaster for the locations of the replicas of the ReplicationGroup that holds the table's MetaTable, then queries the MetaTable to find the ReplicationGroupId that the target Tablet belongs to, then asks the GroupMaster for the replica locations of that ReplicationGroup, and finally sends the read or write to the corresponding TabletServer based on this location information.
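The lookup chain can be sketched as follows; the interfaces, method names, and id scheme are hypothetical stand-ins for the real RPC clients. A real client would cache both levels of this lookup and refresh them on a miss or a stale-location error.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real RPC clients.
type groupMaster interface {
	ReplicaLocations(groupID string) []string
}

type metaTable interface {
	LookupGroup(table, rowKey string) string // rowKey -> ReplicationGroupId
}

// route resolves a (table, rowKey) pair to the TabletServer addresses
// that can serve it: GroupMaster -> MetaTable -> GroupMaster.
func route(gm groupMaster, meta metaTable, table, rowKey string) []string {
	// Step 1: locate the replicas serving the table's MetaTable (a real
	// client would then query one of them; the fake below skips that hop).
	_ = gm.ReplicaLocations("meta/" + table)
	// Step 2: the MetaTable tells us which group owns the row's Tablet.
	groupID := meta.LookupGroup(table, rowKey)
	// Step 3: resolve that group's replica locations and return them.
	return gm.ReplicaLocations(groupID)
}

// Fake implementations so the example runs.
type fakeGM struct{}

func (fakeGM) ReplicaLocations(groupID string) []string {
	return []string{groupID + "@ts-1:9000", groupID + "@ts-2:9000"}
}

type fakeMeta struct{}

func (fakeMeta) LookupGroup(table, rowKey string) string { return "rg-42" }

func main() {
	addrs := route(fakeGM{}, fakeMeta{}, "webpages", "com.example/index.html")
	fmt.Println(addrs)
}
```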

Bytable2.0 Global architecture diagram

For small, single-region deployments the architecture above is heavier than necessary. Because the modules are well separated, the GlobalMaster, GroupMaster, and PlacementDriver can be compiled into a single Master binary; to reduce deployment and maintenance complexity, such a cluster can then consist of only the Master and TabletServer services, as shown in the figure below.

Bytable2.0 simplified deployment diagram

3.2.1 Data Read/Write Layer: TabletServer

Unlike Bytable1.0, where a ReplicationGroup and a Tablet correspond one to one, a TabletServer in Bytable2.0 hosts replicas of several ReplicationGroups, and each ReplicationGroup contains several Tablets. This design makes splitting and merging Tablets very light: only metadata is modified, with no pause at all. Migration no longer means moving a Tablet from one TabletServer to another; instead, the Tablet's data is imported from one ReplicationGroup into another, and the corresponding data is extracted lightly from SST files by Range.

On the write path, the Leader Replica first writes the WAL through ByteJournal, and once the log write succeeds it applies the write to its MemTable; Follower Replicas read the WAL from ByteJournal and apply it to their own MemTables. The Leader Replica periodically performs a Checkpoint, writing an SST file to ByteStore. When the Leader and Follower Replicas share the same distributed file system, only the Leader performs Checkpoint and Compaction; when they do not, a Compaction Leader is elected for each distributed file system so that Checkpoint and Compaction run once on each. This eliminates the extra CPU cost of every replica compacting independently, and Follower Replicas address the lack of hot-standby replicas in earlier versions of HBase.
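The ordering on the leader can be sketched as below; the journal, memTable, and fileStore interfaces are hypothetical stand-ins for the ByteJournal and ByteStore clients, and the method names are invented for this sketch.

```go
package main

import "fmt"

// Hypothetical stand-ins for the ByteJournal and ByteStore clients.
type journal interface {
	Append(entry []byte) (index uint64, err error)
}

type memTable interface {
	Apply(index uint64, entry []byte)
	Size() int
}

type fileStore interface {
	WriteSST(name string, data []byte) error
}

// leaderWrite shows the ordering on the leader: the entry must be
// durable in the shared WAL before it becomes visible in the MemTable.
func leaderWrite(j journal, mt memTable, entry []byte) (uint64, error) {
	idx, err := j.Append(entry)
	if err != nil {
		return 0, err
	}
	mt.Apply(idx, entry)
	return idx, nil
}

// maybeCheckpoint flushes the MemTable to an SST in shared storage.
// Only the leader (or the elected Compaction Leader) runs this, so the
// flush happens once per shared file system rather than once per replica.
func maybeCheckpoint(mt memTable, fs fileStore, threshold int, isCompactionLeader bool) error {
	if !isCompactionLeader || mt.Size() < threshold {
		return nil
	}
	// A real checkpoint would serialize the MemTable contents; here we
	// only write a placeholder blob.
	return fs.WriteSST("checkpoint-000123.sst", []byte("serialized memtable"))
}

// Minimal fake implementations so the example compiles and runs.
type fakeJournal struct{ next uint64 }

func (f *fakeJournal) Append(entry []byte) (uint64, error) { f.next++; return f.next, nil }

type fakeMemTable struct{ n int }

func (m *fakeMemTable) Apply(index uint64, entry []byte) { m.n += len(entry) }
func (m *fakeMemTable) Size() int                        { return m.n }

type fakeStore struct{}

func (fakeStore) WriteSST(name string, data []byte) error {
	fmt.Println("wrote", name, "to shared storage")
	return nil
}

func main() {
	j, mt, fs := &fakeJournal{}, &fakeMemTable{}, fakeStore{}
	idx, _ := leaderWrite(j, mt, []byte("put row1 col v"))
	fmt.Println("committed at log index", idx)
	_ = maybeCheckpoint(mt, fs, 8, true)
}
```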

On the read path, both the Leader Replica and Follower Replicas read SST data from the corresponding ByteStore, merge it with the MemTable in memory, and return the result to the user. A Follower Replica can either serve a read at a specified version, or use ReadIndex to obtain the latest log index from the Leader Replica and wait until it has applied up to that index locally before serving the read.
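The essence of the follower-side ReadIndex wait is sketched below; the polling loop and names are simplifications for illustration, not the real implementation (which would be notified by the apply loop rather than poll).

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// followerState tracks how far this follower has applied the shared log.
type followerState struct {
	applied atomic.Uint64
}

// readIndexWait asks the leader for its current commit index (passed in
// here as leaderIndex) and waits until the local apply position catches
// up before the read is served.
func (f *followerState) readIndexWait(leaderIndex uint64, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for f.applied.Load() < leaderIndex {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(time.Millisecond)
	}
	return true
}

func main() {
	f := &followerState{}
	// Simulate the background apply loop catching up to the leader.
	go func() {
		for i := uint64(1); i <= 100; i++ {
			f.applied.Store(i)
			time.Sleep(100 * time.Microsecond)
		}
	}()
	if f.readIndexWait(100, time.Second) {
		fmt.Println("follower caught up; safe to read at index 100")
	}
}
```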


3.2.2 Global Management Layer: GlobalMaster

The GlobalMaster is the global management module of the entire cluster. It exposes operational interfaces to operators and handles the creation and deletion of tables and ColumnFamilies. It stores the meta information of all tables and ColumnFamilies, as well as the meta information of every ReplicationGroup belonging to each table. GroupMasters synchronize meta information from the GlobalMaster to learn which Replicas they need to schedule and then manage the ReplicationGroup replicas in their own Regions.


3.2.3 Compute Scheduling Layer: GroupMaster

The GroupMaster is the management module within a Region. It manages the TabletServer instances in its Region and the Replicas assigned to them by each ReplicationGroup. Besides handling heartbeats and keepalive for the TabletServers registered with it, it derives the set of Replicas it must schedule from the metadata synchronized from the GlobalMaster, performs scheduling across the registered TabletServers, and instructs each TabletServer to open or close the corresponding Replica.


3.2.4 Data Scheduling Layer: PlacementDriver

The PlacementDriver is the decision module for data scheduling. It communicates with the TabletServers to collect each TabletServer's load information, scans the MetaTable of each table, performs load-balancing calculations, and ultimately decides whether particular Tablets need to be split, merged, or migrated.

3.3 Future Outlook

3.3.1 Offline Compaction

Compared with streaming reads and writes, Compaction is a batch, offline-style operation. Running it inside the online service lowers the resource utilization and QoS of the online service. We therefore plan to allow the Compaction process to be submitted to Yarn or another offline computing platform, reducing CPU consumption in the online service. The Alibaba Cloud X-Engine team adopted a similar approach, offloading Compaction from the CPU to FPGAs, but that approach depends heavily on specific machine models and hardware; using a scheduling platform such as Yarn to schedule FPGA resources could be a more reasonable way to fully utilize them.

3.3.2 Analytical Storage Engine

In Spanner: Becoming a SQL System, Google describes an in-block columnar analytical engine for Spanner that works well in certain scenarios. TiDB likewise provides analytical query capability by attaching ClickHouse-based nodes (ClickHouse is Yandex's open-source analytical database) through Raft. Bytable2.0 also has some analytical usage scenarios, and we plan to provide an analytical storage engine in the future, evolving it into an HTAP system.

3.3.3 Multi-Model Database

The database market today offers a variety of API interfaces, such as Redis, MongoDB, and MySQL. Meanwhile, in the open-source world, the graph database JanusGraph, the time-series database OpenTSDB, the geographic information database GeoMesa, and the relational database Phoenix all use HBase as their underlying storage. This suggests that Bytable2.0 can also serve as a general-purpose storage service in the future, with a variety of API layers implemented on top of it, providing these APIs with a large-capacity, low-cost storage solution.

3.3.4 Exploring New Distributed Transaction Mechanisms

Industrial implementations of distributed transactions usually trade off between throughput and latency. Bytable's current distributed transaction solution uses the Percolator transaction model, which depends on a single-point timestamp service (TSO) and makes its own trade-offs between throughput and latency. In the follow-up distributed transaction scheme for Bytable2.0, we hope to further address the throughput limits and cross-region latency of the Percolator model used in Bytable1.0 and explore new solutions.

4. Summary

With the rapid growth of ByteDance's search and recommendation scenarios and increasingly rich business requirements, we quickly built the first-generation distributed table storage system, Bytable1.0, which solved a large number of business problems. Because of the simplifications and trade-offs made in that first phase, issues such as cost and scalability still needed optimization, and we address them in the second-generation table storage system, Bytable2.0. We welcome anyone interested to join the table storage team to build the ultimate distributed table storage solution and, together, a solid infrastructure that safeguards ByteDance's rapid growth.

5. References

[1] Chang, Fay, et al. “Bigtable: A Distributed Storage System for Structured Data.” ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 1-26.

[2] Apache HBase: An open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.

[3] TiDB: An open source distributed HTAP database compatible with the MySQL protocol.

[4] Taft, Rebecca, et al. “CockroachDB: The Resilient Geo-Distributed SQL Database.” Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020.

[5] Dong, Siying, et al. “Optimizing Space Amplification in RocksDB.” CIDR. Vol. 3. 2017.

[6] Corbett, James C., et al. “Spanner: Google’s Globally Distributed Database.” ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 1-22.

[7] Bacon, David F., et al. “Spanner: Becoming a SQL system.” Proceedings of the 2017 ACM International Conference on Management of Data. 2017.

[8] Huang, Gui, et al. “X-Engine: An optimized storage engine for large-scale E-commerce transaction processing.” Proceedings of the 2019 International Conference on Management of Data. 2019.


Bytedance infrastructure team

ByteDance's infrastructure team supports the smooth operation of ByteDance's products serving hundreds of millions of users, including Douyin, Toutiao, Xigua Video, and Huoshan Video, providing guarantees and driving force for the rapid and stable development of ByteDance and its businesses.

Within the company, the infrastructure team is mainly responsible for building ByteDance's private cloud, managing clusters of tens of thousands of servers, running mixed deployments of tens of thousands of compute/storage units and online/offline workloads, and supporting the stable storage of EB-scale data.

Culturally, the team embraces open source and innovative software and hardware architectures. If you are interested, please visit job.bytedance.com or contact [email protected].
