Takeaway: This article presents the complete practice of BaikalDB at Tongcheng-Elong. It distills BaikalDB into six core features, covered in three parts: BaikalDB high availability and HTAP feature practice, BaikalDB high performance and scalability practice, and thoughts on BaikalDB's low cost. We hope it is helpful.

The full text is 14,032 words; estimated reading time is 19 minutes.

I. BaikalDB high availability and HTAP feature practices

We began investigating the open-source NewSQL database BaikalDB in 2019, aiming to solve practical problems we had encountered, such as slow OLAP queries running on row-store databases, and a cross-data-center high-availability deployment scheme that needed improvement. Over the past six months of research and practice, we developed and contributed the column storage feature to the community, and used BaikalDB to deploy column-store OLAP services, row-store OLTP services, and a dual-center high-availability deployment, which effectively solved these problems. Here we share the related experience, hoping it serves as a reference for readers facing similar problems.

1. BaikalDB selection considerations

1.1 The Industry is laying out NewSQL

1.2 Comparison of core technologies of NewSQL database

  • Note 1: ShardingSphere's Paxos replication based on MySQL MGR has not yet been released.

  • Note 2: TiDB 3.0 supports both optimistic and pessimistic transactions.

  • Note 3: Due to the author's limited resources, a number of NewSQL systems were not compared: Amazon Aurora, Aliyun PolarDB, AnalyticDB, Apple FoundationDB, CockroachDB, Huawei GaussDB, Yandex ClickHouse, etc.

1.3 NewSQL technology selection

Path selection:

  • Pure self-development: limited capability and limited investment

  • Pure open source: Cannot meet customization needs in a timely manner

  • Cloud service: for security and cost reasons, core business will run in our own IDC and on K8s in the short term

  • Semi-self-development: our choice. Don't reinvent the wheel: let the community provide the main functionality, and concentrate our limited strength on the company's specific needs. NewSQL candidates include TiDB, BaikalDB, CockroachDB, and so on.

Among these open-source databases, we chose BaikalDB for the following reasons:

  • Similar background: BaikalDB comes from Baidu's Fengchao (Phoenix Nest) advertising team, whose business grew through the whole progression from a single machine, to sharded databases and tables, to a distributed system, so it has faced problems similar to ours.

  • Battle-tested: it is used by multiple businesses of Baidu's advertising platform, at thousand-node cluster scale and PB-level data scale, so we can adopt it with controllable risk.

  • Technology stack match: BaikalDB is a refined ~100,000 lines of C++ with few dependencies (bRPC, braft, RocksDB), a friendly community, and easy deployment, matching our own stack.

  • Relatively complete features: it basically meets our needs, so we can focus on company-specific requirements.

1.4 BaikalDB profile

BaikalDB is a distributed relational HTAP database open-sourced by Baidu (github.com/baidu/BaikalDB). It supports random real-time reads and writes of PB-scale structured data.

The structure is as follows:

Among them:

  • BaikalStore is responsible for data storage, organized by region. Three regions on three different stores form a Raft group, implementing three replicas; with multi-instance deployment, region data migrates automatically when a store instance goes down.

  • BaikalMeta is responsible for meta information management, including partitions, capacity, permissions, and balancing. Raft guarantees its own three-replica deployment. A Meta outage only affects data scaling and migration, not data reads and writes.

  • BaikalDB is responsible for front-end SQL parsing and query-plan generation and execution. It is stateless and deployed as homogeneous multiple instances; instance failures do not affect service as long as the remaining instances can bear the QPS.

Core features:

  • Strong consistency: Implement distributed transactions at the Read Committed level to ensure ACID properties in the database

  • High availability: the Multi-Raft protocol ensures consistency across data replicas, with self-healing on minority-node failure, cross-data-center deployment, and multi-site active-active; availability > 99.99%, RPO = 0, RTO < 30s

  • High scalability: share-nothing architecture separating storage from compute; online scaling completes within 5 minutes, and dynamic schema changes take effect in 30 seconds

  • High performance: 1,000 tables; a single table up to 1 billion rows and 1,000 columns; QPS > 10K with point-query P95 < 100ms

  • Ease of use: Compatible with the MySQL 5.6 protocol
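The availability figures above follow from Raft's majority rule: a write commits once a quorum of a region's three peers has persisted it, so any single peer (or room) can fail without losing acknowledged data. A minimal illustrative sketch of the commit rule, not BaikalDB's implementation:

```python
def quorum(n_peers: int) -> int:
    """Smallest majority of a replica group."""
    return n_peers // 2 + 1

def can_commit(acks: int, n_peers: int) -> bool:
    """A write is durable once a majority of peers acknowledged it."""
    return acks >= quorum(n_peers)

def available(alive: int, n_peers: int) -> bool:
    """The group keeps serving while a majority of peers is alive."""
    return alive >= quorum(n_peers)

# Three replicas: one failure is tolerated, two are not.
assert can_commit(2, 3) and not can_commit(1, 3)
assert available(2, 3) and not available(1, 3)
```

This is why RPO stays 0 on minority failure: a committed write already lives on a surviving majority.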

2. BaikalDB online migration process

We have deployed an online BaikalDB cluster with 50 storage nodes and a data scale of 10 billion rows. This chapter focuses on the steps of migrating services to BaikalDB to ensure a smooth launch and seamless migration. The launch process can be divided into the following stages:

2.1 Listing feature development

Since the first business we launched was an analytical workload suited to column storage, we developed and submitted the column storage feature with the community's help and guidance. The principle is described in the column storage engine documentation.

2.2 O&M Tools

  • Deployment tool: For the time being, the online deployment script is mainly developed by ourselves, and the company K8S will be connected in the future.

  • Monitoring tool: Prometheus. BaikalDB is very easy to connect to Prometheus; see the documentation on exporting monitoring metrics to Prometheus.

  • Data synchronization tool: Canal + our data synchronization platform; the principle is conventional and not expanded here.

  • Physical backup tool: SSTBackup; see the documentation on SST-based backup and restore. It supports full + incremental physical backup in an internal data format, applicable only to BaikalDB.

  • Logical backup tool: BaikalDumper, modeled on mysqldump; it exports SQL statements that can be imported into other databases. Currently only full export is supported.

  • Benchmark tool: see the BaikalDB Sysbench documentation for instructions.

  • Missing tool: data subscription tool based on MySQL Binlog, under development.

2.3 Data Migration

Data is synchronized in full + incremental mode: the full phase uses batch inserts, and the incremental phase works like a MySQL binlog subscription. A total of 8 billion rows took about 3 days to synchronize in full, and incremental synchronization lag stabilized within 10ms.
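The full + incremental flow can be pictured as: bulk-load a snapshot, then replay ordered binlog-style events until the replica catches up. A toy model of that flow (the helper names are illustrative, not the actual Canal pipeline):

```python
def full_sync(snapshot: dict) -> dict:
    """Full phase: bulk-copy the source snapshot."""
    return dict(snapshot)

def apply_incremental(replica: dict, events: list) -> dict:
    """Incremental phase: replay ordered binlog-like (op, key, value) events."""
    for op, key, value in events:
        if op in ("insert", "update"):
            replica[key] = value
        elif op == "delete":
            replica.pop(key, None)
    return replica

source = {1: "a", 2: "b"}
replica = full_sync(source)
# Changes that happened after the snapshot are replayed in order.
replica = apply_incremental(
    replica, [("update", 2, "b2"), ("insert", 3, "c"), ("delete", 1, None)])
assert replica == {2: "b2", 3: "c"}
```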

2.4 Service Testing

After data synchronization, BaikalDB holds a data set equivalent to the online one. The business test stage focuses on its ability to support the online system's real SQL, for which we adopt full traffic playback.

The diagram below:

Through live playback of real online traffic, we mainly verified:

  • 100% compatibility with the SQL the services use

  • Peak service performance carrying capacity

  • 7*24 hour stability

2.5 Service Launch

After the preceding steps, the only operation required for service launch is to change the database connection configuration.

2.6 O&M and Monitoring

The figure below shows Prometheus monitoring screenshots of some indicators after launch. QPS, response time, and period-over-period changes are all very stable.

2.7 Precautions

Features of BaikalDB that are still being perfected:

  • Subquery: under development

  • INFORMATION_SCHEMA: graphical tool support and system variable support are incomplete; under development

  • Row-column coexistence: a table can be stored either in rows or in columns, but not both at once; under development

  • Distributed clock: affects the transaction isolation level and consistent Follower reads; under development

  • Views: planned

  • Triggers, stored procedures, and other traditional relational database features: not planned

What to note when using BaikalDB:

  • Data modeling: the database does not replace the user's responsibility for good data modeling. In our actual tests, good data models, table structure design, use of primary keys and indexes, and SQL statements performed 10 times better than bad ones.

  • Write amplification: related to the number of RocksDB levels; it is recommended to keep the data size of a single store within 1T;

  • Parameter tuning: the default settings are reasonable; modify only a few parameters as required:

```
export TCMALLOC_SAMPLE_PARAMETER=524288                 # 512K
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=209715200  # 200M; turn up to avoid thread contention and improve performance
-bthread_concurrency=200      # suggested: (CPU cores - 4) * 10
-slow_query_timeout_s=60      # slow query threshold
-peer_balance_by_ip=false
-max_background_jobs=24       # suggested: CPU cores - 4
-cache_size=64M               # suggested: no more than 40% of single-instance memory
```
  • Large transaction limits: row lock limit per_txn_max_num_locks, default 1,000,000; DB-to-Store RPC packet size limit max_body_size, default 256M; message body limit of 2G, which is protobuf's ceiling and cannot be modified. It is recommended to keep the data affected by a single transaction within 10M;

  • Reserved buffer: it is recommended to keep resource utilization around 40%, since in disaster recovery a single room must bear double the load. Stores whose disk_used_percent exceeds 80% are excluded when replicas are allocated.
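Given the per-transaction lock and RPC-size limits above, large backfills are best split into bounded batches. A hedged sketch of such a splitter (the helper and its limit values are ours, chosen well under the quoted defaults, not a BaikalDB API):

```python
def split_batches(rows, max_rows=100_000, max_bytes=64 * 1024 * 1024,
                  row_size=lambda r: len(str(r))):
    """Yield row batches that stay under both a row-count and a byte-size cap."""
    batch, size = [], 0
    for row in rows:
        rs = row_size(row)
        if batch and (len(batch) >= max_rows or size + rs > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += rs
    if batch:
        yield batch

# 250K rows insert as three transactions instead of one oversized one.
batches = list(split_batches(range(250_000), max_rows=100_000))
assert [len(b) for b in batches] == [100_000, 100_000, 50_000]
```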

3. High availability and HTAP deployment solution

Our BaikalDB cluster is deployed across four IDC rooms in two cities, supporting both row-store OLTP services and column-store OLAP services. This chapter shows how we designed the deployment to leverage BaikalDB's high availability and HTAP capabilities.

Diagram of dual-center HTAP deployment:

  • Note 1: Meta node deployment is omitted in the figure

  • Note 2: Multiple Store and DB nodes are actually deployed for each IDC.

In BaikalDB, the region is the basic storage unit; each table consists of at least one region, and the peers of each region together form a Raft group. Each store can host both row-store and column-store tables. When creating a table, a business can choose the replica distribution for its scenario: for example, a business in city A can choose 2 replicas in city A and 1 in city B, while a business in city B can choose 1 replica in city A and 2 in city B.

Assume two services use the BaikalDB cluster. Service A is an OLAP service deployed in city A, whose table ctable is represented by Region1; service B is an OLTP service deployed in city B, whose table rtable is represented by Region2. The configuration process is as follows:

  1. Initialize Meta room information
```
echo -e "add logical rooms bj, sz\n"
curl -d '{
  "op_type": "OP_ADD_LOGICAL",
  "logical_rooms": {"logical_rooms": ["bj", "sz"]}
}' http://$1/MetaService/meta_manager

echo -e "insert bj physical rooms\n"
curl -d '{
  "op_type": "OP_ADD_PHYSICAL",
  "physical_rooms": {"logical_room": "bj", "physical_rooms": ["bj1", "bj2"]}
}' http://$1/MetaService/meta_manager

echo -e "insert sz physical rooms\n"
curl -d '{
  "op_type": "OP_ADD_PHYSICAL",
  "physical_rooms": {"logical_room": "sz", "physical_rooms": ["sz1", "sz2"]}
}' http://$1/MetaService/meta_manager
```
  2. Set the physical room for each BaikalStore
```
vim store/conf/gflag
-default_physical_room=bj1
```
  3. Set the physical room for each BaikalDB
```
vim db/conf/gflag
-default_physical_room=bj1
```
  4. When creating a table, specify the replica policy and storage type as required
```sql
-- Service A is an OLAP workload deployed in city A and uses a column-store table:
CREATE TABLE `TestDB`.`ctable` (
  `N_NATIONKEY` INTEGER NOT NULL,
  `N_NAME` CHAR(25) NOT NULL,
  `N_REGIONKEY` INTEGER NOT NULL,
  `N_COMMENT` VARCHAR(152),
  PRIMARY KEY (`N_NATIONKEY`)
) ENGINE=Rocksdb_cstore COMMENT='{
  "comment": "this is a column-store table",
  "resource_tag": "bizA",
  "namespace": "TEST_NAMESPACE",
  "dists": [ {"logical_room": "bj", "count": 2}, {"logical_room": "sz", "count": 1} ]
}';

-- Service B is an OLTP workload deployed in city B and uses a row-store table:
CREATE TABLE `TestDB`.`rtable` (
  `N_NATIONKEY` INTEGER NOT NULL,
  `N_NAME` CHAR(25) NOT NULL,
  `N_REGIONKEY` INTEGER NOT NULL,
  `N_COMMENT` VARCHAR(152),
  PRIMARY KEY (`N_NATIONKEY`)
) ENGINE=Rocksdb COMMENT='{
  "comment": "this is a row-store table",
  "resource_tag": "bizB",
  "namespace": "TEST_NAMESPACE",
  "dists": [ {"logical_room": "bj", "count": 1}, {"logical_room": "sz", "count": 2} ]
}';
```
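The replica policy in the COMMENT field is plain JSON, so it can be generated programmatically when many tables share a layout. A small sketch that assembles such a comment (the helper is ours, not a BaikalDB API; the keys mirror the DDL above):

```python
import json

def table_comment(comment: str, resource_tag: str, namespace: str, dists: dict) -> str:
    """Build the JSON COMMENT string used in the CREATE TABLE statements above."""
    return json.dumps({
        "comment": comment,
        "resource_tag": resource_tag,
        "namespace": namespace,
        "dists": [{"logical_room": room, "count": n} for room, n in dists.items()],
    })

c = table_comment("olap table", "bizA", "TEST_NAMESPACE", {"bj": 2, "sz": 1})
parsed = json.loads(c)
assert parsed["resource_tag"] == "bizA"
assert sum(d["count"] for d in parsed["dists"]) == 3  # three replicas in total
```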

Advantages:

  • Disaster recovery: on minority-node failure, whether at the machine, room, or city level, RPO (data loss) equals 0s and RTO (recovery time) is under 30s.

  • Async write: with Raft, a write returns once a majority of peers have succeeded. Although one of the three peers sits in the other city with higher latency, a write normally takes effect as soon as the two same-city peers finish.

  • Follower Read: each city has at least one replica. For read services deployed in both cities, BaikalDB offers nearby reads: route selection prefers a region replica in the same logical room as the DB node, ensuring read performance in both cities.

  • HTAP capability: businesses can choose row-store or column-store tables per scenario, and every store can host both kinds.

  • Resource isolation: if HTAP workloads might interfere with each other, businesses can group stores with the resource_tag field. For example, give store1 resource_tag = bizA, and store1 will only be assigned regions of tables created with resource_tag = bizA.
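The nearby-read routing behind Follower Read can be pictured as: among a region's peers, prefer one in the DB node's own logical room, falling back to the leader otherwise. An illustrative sketch (our own model, not the actual routing code):

```python
def pick_read_peer(peers, db_room):
    """peers: list of (peer_id, logical_room, is_leader). Prefer a local peer."""
    local = [p for p in peers if p[1] == db_room]
    if local:
        return local[0][0]
    # No replica in our room: fall back to the leader.
    return next(p[0] for p in peers if p[2])

peers = [("s1", "bj", True), ("s2", "bj", False), ("s3", "sz", False)]
assert pick_read_peer(peers, "sz") == "s3"   # sz DB reads its local copy
assert pick_read_peer(peers, "bj") == "s1"
assert pick_read_peer(peers, "gz") == "s1"   # no local copy: go to the leader
```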

Limitations:

  • DR capability: when a majority of nodes fail, RPO can reach 3s. A degradation policy can be added to BaikalDB's replica allocation strategy: if a majority of rooms fail, allocate replicas to the surviving minority rooms so that RPO remains 0.

  • Async write: writes issued from the minority city still see extra latency, though this rarely happens. For example, if two services must write the same table, one of them is necessarily writing remotely; in that case we suggest splitting into two tables for writes and using a union or view for reads.

  • Follower Read: can be enhanced to consistent reads on followers. The distributed clock feature (in development) is needed to compensate for followers lagging behind the leader.

II. BaikalDB high performance and scalability practices

Core features

This is also our order of focus when promoting the system to businesses:

  • Must have: matching the business scenario accurately (1. consistency) and running stably (2. high availability)

  • Had better: holding more data (3. scalability) and running fast (4. high performance)

  • Should have: friendly to use (5. high compatibility) and cost-saving (6. low cost)

In short: correct and stable first, big and fast second, friendly and cheap last.

This article shares and summarizes BaikalDB's performance and scalability data through two actual test cases run before the businesses went live.

1. Benchmarking an OLTP scenario on row storage

1.1 Test Objectives

If BaikalDB is considered a product, the purpose of benchmarking is to write its product specification. With the participation of the performance-test team, we ran a two-month benchmark and wrote the following specification on BaikalDB's "outer packaging":

  • With 1,000 nodes, the designed cluster supports a single-node data volume of 1T and an overall cluster capacity of 1P.

  • In the benchmark tests, single-node cluster performance is at least 2,000 QPS for writes with latency within 50ms, and at least 4,000 QPS for reads with latency within 20ms.

  • The external interface is compatible with MySQL 5.6.

  • Based on the Multi Raft protocol, the data copy consistency is guaranteed. The failure of a few nodes does not affect the normal operation.

  • Share-nothing architecture with storage and compute separated; the impact of online scaling is limited to automatic internal data balancing and is transparent externally; new fields take effect in 30s with no impact on the cluster.

1.2 Test Range

  • Performance test (row storage; large table with 104 fields, small table with 63 fields; total cluster base data of 1TB, 2TB, 3TB)

  • Comparison test against a MySQL baseline

  • Effect of table-structure field count (large table 104 fields, small table 63 fields)

  • Impact of total basic data size (1TB, 2TB, and 3TB)

  • Effect of table structure ("auto-increment primary key", "sharding key as global index", "sharding key as primary key")

  • Scalability testing under pressure

  • Add a node (store)

  • Removing a node (store)

  • Dynamically add columns

1.3 Test Environment

3 Meta, 5 DB, and 5 Store machines for the baseline, plus one more machine as the node to add or remove (CentOS 7.4, 32-core 2.4GHz, 128GB memory, 1.5T SSD)

Test deployment diagram:

1.4 Main Indicators

  • Optimal capacity (KTPS):

  • Definition: the maximum throughput that can be sustained for two consecutive minutes

  • Preconditions: a 5-machine cluster (3 Meta, 5 DB, 5 Store); average read response time under 20ms and average write response time under 50ms

  • Maximum throughput: DML operation requests per second, in KTPS (thousands of operation requests per second)

  • Response time: the time from sending a DML operation to receiving the returned result, in milliseconds

  • DiskIOUtil (disk usage): the percentage of a second spent on I/O operations, i.e. the fraction of time the I/O queue is non-empty

  • If the value is close to 100%, too many I/O requests are generated, the I/O system is fully loaded, and the disk may have a bottleneck.

  • If the value is greater than 30%, I/O pressure is high and read/write waits occur

  • Optimal capacity determination method

As concurrency rises, system performance eventually degrades; we analyze the curve to identify the performance interval and inflection point, and thereby determine the capacity threshold.
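The determination rule above — the highest throughput whose latency still meets the SLA — can be sketched as a scan over measured (concurrency, throughput, latency) samples. The sample values below are synthetic, for illustration only:

```python
def optimal_capacity(samples, max_latency_ms):
    """samples: list of (concurrency, tps, avg_latency_ms).
    Return the sample with the highest throughput whose latency meets the SLA."""
    ok = [s for s in samples if s[2] <= max_latency_ms]
    return max(ok, key=lambda s: s[1]) if ok else None

# Synthetic load curve: throughput saturates while latency keeps climbing.
curve = [(8, 30_000, 5), (16, 55_000, 9), (32, 70_000, 18), (64, 72_000, 45)]
best = optimal_capacity(curve, max_latency_ms=20)  # read SLA: avg < 20ms
assert best == (32, 70_000, 18)  # the knee: past it, latency breaks the SLA
```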

1.5 Test Conclusion

  • The performance test

  • Read: with the sharding key as primary key, 5-node read capacity is 72K+ TPS, 85% better than MySQL; the bottleneck is CPU

  • Write: with the sharding key as primary key, 5-node write capacity is 9.6K+ TPS, similar to MySQL (between 85% and 120% of it); the bottleneck is disk I/O

  • Scalability test

  • Add node: front-end throughput is stable

  • Removing a node: first ensure the remaining cluster capacity can absorb the removed node's load

  • Adding a column: a new column takes effect in 22 seconds (on 125 million rows of base data)

1.6 Performance Test Details

Compare test with mysql benchmark:

Influence of number of table structure fields:

Impact of the total base data size of the cluster:

Table structure influence:

Scalability test details

Continuously add or subtract nodes:

Continuous service added column:

Column-add test curve (after 22 seconds all INSERT statements using the new column succeed; the red curve shows the failure count dropping to 0)

2. Testing an OLAP scenario on column storage

2.1 Test Background

The indicator monitoring system monitors real-time online service data and generates health indicators; an alarm fires when an indicator exceeds its threshold. The data set is about 50 tables and 2 billion rows. More than 10 of its SQL queries are aggregate queries retrieving no more than 4 columns, with range predicates over a time interval — a typical OLAP scenario. It previously ran on a row-store distributed database, so we ran a comparison test against BaikalDB in column-store mode using real online data; the two DB clusters had identical configuration and bore the same write load during the query-performance tests.

2.2 Test results:

The results show that BaikalDB's column storage effectively reduces disk I/O and speeds up table scans and computation for SQL that queries few columns of a wide table with aggregation, which is exactly how SQL is commonly written in OLAP scenarios.

3. Performance and scalability: summary and thoughts

3.1 Perspectives of Performance Analysis

  • Resource bottleneck perspective

  • It is recommended to limit the size of a single query or transaction to about 10M; excessive data volume triggers the transaction and RPC packet-size limits.

  • Full table export should use the dedicated streaming syntax: /*{"full_export":true}*/ select * from tb where id > x limit 1000;

  • When io.util is saturated, L0 files in RocksDB cannot be compacted, so space amplification grows rapidly and the disk fills up quickly. Related parameter: max_background_jobs=24

  • It is recommended that a single RocksDB instance's data stay under 1T. With default settings, exceeding 1T adds another level to RocksDB and write amplification grows by 10. If a server's disk is much larger than 1T, consider single-node multi-instance deployment.

  • Saturated io.util can also prevent memtables from flushing in time, filling up memory (store).

  • Too many slow queries pile up inside stores and can fill up memory (store). Related DB parameter: slow_query_timeout_s=60s

  • The store cache directly affects the cache hit ratio, and excessive misses increase I/O time; size it according to instance memory (up to the 40% guideline above). Related parameter: cache_size=64M

  • export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=209715200 # 200M; increasing this helps when thread contention is high.

  • CPU: becomes the bottleneck when read QPS is too high (DB, store); monitor the vars/bthread_count metric

  • IO: becomes the bottleneck when write QPS is too high (store); monitor io.util and the store/rocksdb/LOG log

  • Mem, Disk, Network: the remaining resource dimensions to watch (the main memory and disk cases are covered by the notes above)

3.1.1 User perspective

  • Join primary key vs increment primary key

  • BaikalDB shards by primary key, so the primary key choice determines the physical data distribution. A good practice is to design keys so queries touch as few shards as possible, e.g. a composite key (schoolid, collegeid, classid) for school, college, and class.

  • BaikalDB implements auto-increment primary keys mainly for MySQL compatibility. Global auto-increment goes through a single Raft group on Meta, which adds RPC overhead and limits scalability; moreover, inserts on a globally auto-incremented key concentrate on the last region, which can become a bottleneck.

  • Global vs. local indexes

  • A global index is sharded independently of the main table. Its advantage is that it can locate the target shard directly from the index key; its disadvantage is extra RPC cost. A local index lives with the main table's shards, so the shard must be determined from the primary key, otherwise the query is broadcast.

  • Local indexes are recommended for small tables.

  • For large tables (> 100 million rows), use a local index when the search condition is prefixed with the primary key; otherwise use a global index.

  • Row store vs column store

  • Wide-table, few-column queries (query columns / total columns below 20%) with aggregation — typical OLAP requests — use column storage

  • In other cases, we use row store
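The selection rules above (local vs global index, row vs column store) are simple enough to encode as a checklist. A sketch of the heuristic exactly as stated, with the thresholds quoted from the text:

```python
def choose_index(table_rows, condition_has_pk_prefix):
    """Local index for small tables; for large tables (>100M rows),
    local only when the predicate is prefixed by the primary key."""
    if table_rows <= 100_000_000:
        return "local"
    return "local" if condition_has_pk_prefix else "global"

def choose_storage(query_cols, total_cols, is_aggregate):
    """Column store for wide-table, few-column aggregate queries (<20% of columns)."""
    if is_aggregate and query_cols / total_cols < 0.2:
        return "column"
    return "row"

assert choose_index(1_000_000, False) == "local"
assert choose_index(500_000_000, False) == "global"
assert choose_storage(4, 104, True) == "column"   # the 104-field wide table above
assert choose_storage(50, 63, False) == "row"
```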

3.1.2 Implementation perspective

  • Balancing: BaikalDB periodically performs leader balance and peer balance. Heavy data migration during balancing can affect performance, typically in these scenarios:

  • Adding or deleting a node

  • Import data in batches

  • In case of machine failure

  • SQL optimizer: currently mostly RBO with partial CBO. If performance relates to the execution plan, tune with EXPLAIN and index hints.

3.1.3 Scalability considerations

  • Stability

  • Meta: if the Meta leader dies, DDL is unavailable; if the auto-increment leader dies, inserts into auto-increment-key tables are unavailable.

  • Store: if reads are fine but half of the store leaders die at once, writes must wait until the leaders have migrated to healthy nodes.

  • Scale-out: in practice, expansion is relatively stable

  • Scale-in: rare, but a single-center failure in a dual-center setup is effectively a 50% scale-in; stability in that case still deserves attention and optimization, mainly including:

  • Capacity limits:

  • The calculation logic is as follows

  • In theory, with 20G of Meta memory and 1T of disk per store, the system can manage 10K store nodes and 10P of data.

  • In fact, baidu’s largest cluster manages 1000 store nodes, with 1T disks per store, totaling about 1P of total cluster space.

  1. Region metadata ≈ 0.2K; with 20G of Meta memory, total manageable regions = 20G / 0.2K = 100M

  2. Region data ≈ 100M; with 1T of disk, regions per store = 1T / 100M = 10K

  3. Total manageable data = total regions × region data = 100M × 100M = 10P

  4. Total manageable stores = total regions / regions per store = 100M / 10K = 10K
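The capacity limits above are straightforward arithmetic; a sketch that reproduces them (decimal units, per-region sizes as quoted):

```python
K, M, G, T, P = 10**3, 10**6, 10**9, 10**12, 10**15

meta_mem    = 20 * G    # Meta node memory
region_meta = 200       # ~0.2K of metadata held in Meta per region
store_disk  = 1 * T     # disk per store
region_data = 100 * M   # data per region

total_regions     = meta_mem // region_meta     # regions Meta can track
regions_per_store = store_disk // region_data   # regions one store holds
total_data        = total_regions * region_data
total_stores      = total_regions // regions_per_store

assert total_regions == 100 * M     # 100M regions
assert regions_per_store == 10 * K  # 10K regions per store
assert total_data == 10 * P         # 10P of data
assert total_stores == 10 * K       # 10K stores
```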

  • Linearity

  • Fixed-range queries: O(1)

  • Full scans with predicate pushdown to stores: O(n / DB concurrency)

  • Full scans without pushdown: O(n)

  • JOIN queries: O(n²)

4. Afterword

  • The performance and scalability data in this article come from actual projects rather than standardized TPC-C/TPC-H tests, but the test projects are representative to a degree.

  • BaikalDB test data is based on the V1.0.1 release, and the latest V1.1.2 release has more optimizations:

  • Optimized seek performance; measured range-scan speed doubled;

  • Partitioned Index Filters are used to further improve the memory usage efficiency.

  • Added (partial) statistics-based cost estimation for index selection;

  • Raft replication and split execution of transaction multiple statements improves the concurrency of followers.

III. Thoughts on BaikalDB's low cost

This is also our order of focus when promoting the system to businesses:

1. Must have: matching the business scenario accurately (1. consistency) and running stably (2. high availability);

2. Had better: holding more data (3. scalability) and running fast (4. high performance);

3. Should have: friendly to use (5. high compatibility) and cost-saving (6. low cost).

In short: correct and stable first, big and fast second, friendly and cheap last.

As the final article of the series, this one is about cost. Strong consistency and high availability are the user's concern — does it meet functional requirements; scalability and high performance are the boss's concern — is the specification worth the investment; compatibility and cost are the project implementer's concern, because they determine how hard the project is to push forward.

If you recognize the NewSQL trend and see real business-scenario demand in your company, this article discusses the obstacles to implementation and suggests a cost perspective for estimating the project's workload. For example, if your survey finds that all potential users in the company run on MySQL, then MySQL protocol compatibility directly determines user willingness, user learning cost, business code migration cost, and ecosystem tooling cost — and with 10+ potential users, those costs are magnified tenfold. If most potential services use PostgreSQL, it is best to choose a PG-compatible NewSQL. This is why compatibility is discussed under the category of low cost.

1. Cost classification

Narrow cost: the software and hardware costs of developing, licensing, and operating the database. It is quantifiable, for example:

1. Development investment: 24 person-months

2. License: 100,000 per year

3. O&M investment: 2 DBAs

4. Hardware cost: 10 servers

Broad cost: the total cost generated in organizing and implementing the database engineering practice. It relates to project management and implementation, and includes:

1. Learn about development costs

2. Cost of testing and verification

3. Cost of service migration

4. User habits

5. Operation and maintenance tools

6. Software maturity

7. Technology foresight

Narrow cost answers whether the thing is worth doing; broad cost is the work needed to get it done. As of this article's publication, BaikalDB has been implemented in more than 10 of the company's businesses. How do we evaluate the actual costs and benefits of these applications (narrow)? What work should be done during implementation (broad)? These two questions are discussed below.

2. Narrow cost can in theory be cut 100-fold

Since BaikalDB is open source, development cost is negligible and deployment is simple, so its narrow cost is mainly hardware. Combined with the company's dual-center construction and Kubernetes cloud-native platform strategy, we designed a "two cities, four centers, three replicas" BaikalDB deployment that is expected to cut hardware cost by up to 100x in theory. The plan is as follows:

1. Note 1: Meta node deployment is omitted in the figure

2. Note 2: Multiple Store and DB nodes are actually deployed for each IDC.

3. Note 3: The row-column coexistence feature is still under development

The figure above adopts a fully symmetrical dual-center deployment: each data center has two peer IDC rooms. Using BaikalDB's concepts of logical rooms and physical rooms, this deployment provides city-level disaster recovery. Row/column storage at the Raft-group level with only 3 replicas provides HTAP capability (configuration details are described in the HTAP deployment section of the first article). Going forward, the row-column coexistence feature (not yet implemented) will further merge application scenarios and reduce replica counts, saving more cost. The savings of this plan come from three aspects:

2.1 HTAP reduces cost 5-fold

To serve different application scenarios, data used to be synchronized to different components such as ETL pipelines, ES, HBase, Kafka, and TiDB, magnifying the storage of core services more than 5-fold. Given the current usage pattern — typically 1 OLTP primary plus 5+ OLAP stores plus a data synchronization platform per core service — BaikalDB can serve all these scenarios at once and eliminates synchronization lag between heterogeneous sources. Since 5+ heterogeneous copies merge into 1, the expected gain is 5x.

2.2 Unit resource efficiency: a fourfold cost reduction

  • In OLTP scenarios, performance is improved by 85%

BaikalDB's storage layer uses RocksDB's LSM-tree. Compared with InnoDB's B+ tree, it makes a different trade-off among read, write, and space amplification, giving more balanced read/write performance. Writes are optimized through out-of-place updates, in effect sacrificing global ordering and some storage space for write performance; point queries are accelerated by SSD-friendly random reads and an efficient cache; range queries are optimized through merge/compaction and concurrent reads from SSDs. According to the benchmark results in the second article of this series (performance testing), row-store BaikalDB improves overall performance in OLTP scenarios by about 85%.
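The LSM idea can be shown with a toy model (illustrative only, and far simpler than RocksDB): out-of-place writes land in an in-memory memtable, full memtables are flushed as immutable sorted runs, and point lookups search newest data first:

```python
# Toy LSM tree: out-of-place updates go to an in-memory memtable;
# full memtables are flushed as immutable sorted runs (like SSTs).
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []            # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        # Writes never modify old data in place; they just add a newer version.
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Read amplification: search newest-first until the key is found.
        # A real engine prunes runs with bloom filters and block caches.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            if key in run:
                return run[key]
        return None

db = ToyLSM()
db.put("a", 1); db.put("a", 2); db.put("b", 3)
print(db.get("a"))  # 2 -- the newest version wins
```

The B+ tree makes the opposite trade: updates rewrite pages in place, keeping data globally ordered and reads cheap, at the cost of random write I/O.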

  • In OLAP scenarios, column storage improves performance by 10 times

In OLAP scenarios, SQL queries typically touch only a few columns of a wide table and use aggregate functions, so the data is well suited to column storage. An example:

cstore vs rstore demo:

```sql
SELECT CounterID, count() FROM hits GROUP BY CounterID ORDER BY count() DESC LIMIT 20;
```

A row store (rstore) must read every column of every row to answer this query, while a column store (cstore) scans only the CounterID column.

With BaikalDB's column storage engine, our first OLAP service (aggregate queries over roughly 10 billion rows) ran 10 times faster.
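The difference can be mimicked in a few lines (a toy simulation, not BaikalDB internals): for the `GROUP BY CounterID` query above, a columnar layout scans one column while a row layout must read every field of every row:

```python
from collections import Counter

# Toy table with 4 columns, held both row-wise and column-wise.
rows = [(1, "u1", "url1", 10), (2, "u2", "url2", 20), (1, "u3", "url3", 30)]
columns = {"CounterID": [r[0] for r in rows]}  # a column store keeps each column separately

# Row store: answering the query touches every cell of every row.
row_cells_read = sum(len(r) for r in rows)

# Column store: only the CounterID column is scanned.
col_cells_read = len(columns["CounterID"])

# GROUP BY CounterID ORDER BY count() DESC LIMIT 20
top = Counter(columns["CounterID"]).most_common(20)
print(top, row_cells_read, col_cells_read)  # [(1, 2), (2, 1)] 12 3
```

With a real wide table of dozens of columns, the gap between cells read grows with the column count, which is where the order-of-magnitude OLAP speedup comes from.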

Services generally consist of one OLTP scenario and n OLAP scenarios. On average, BaikalDB can improve unit resource efficiency by about 4 times.

2.3 Cloud-native elasticity: a fivefold cost reduction

Take MySQL as an example: because scaling out and especially scaling back in is difficult, enough buffer capacity must be kept to handle a few business peaks (such as Singles' Day, which happens only once a year). Hardware investment therefore cannot track traffic dynamically, and elasticity is poor. The company is aware of this and is actively building a cloud-native platform based on Kubernetes (see the figure below); for details, see the article on K8s practice in Tongcheng-Elong Cloud Native [4].

BaikalDB's share-nothing, cloud-native architecture combines well with K8s scheduling and is expected to raise resource utilization to 40%, a roughly fivefold elasticity gain. Utilization is not pushed above 80% because the two centers must serve as mutual backups: after a city-level failure, a single center has to absorb double the load.

In summary, total saving = HTAP saving × unit-resource-efficiency saving × resource-utilization saving = 5 × 4 × 5 = 100 times.

3. Broadly defined costs are hard to quantify, but there can be no weak link

Even when the narrow cost estimate says a project is worth doing, we still need to think from a project-management or delivery perspective about how to land it smoothly. Every company and every project is different, so this assessment is hard to quantify, but these factors must be considered during implementation: any link that falls short can sink the project. Here we call the work required to push the project through the "generalized cost", and we summarize and score our practical experience. Scores are out of 10, and higher is better. Unlike the earlier performance tests, this evaluation is quite subjective and is for reference only.

3.1 Learning and development cost: 9 points

Learning BaikalDB mainly involves its dependencies and the BaikalDB code itself.

BaikalDB has very few dependencies. There are three main ones:

1. brpc: an Apache project and the most widely used industrial RPC framework inside Baidu;

2. braft: Baidu's open-source, industrial-grade implementation of the Raft consensus protocol and replicated state machine;

3. RocksDB: a KV storage engine embodying the best work of Google and Facebook engineers;

All three open-source projects have relatively mature communities, and each is worth studying in its own right.

BaikalDB itself is a purely open-source distributed database: about 100,000 lines of C++ with a concise, well-organized code structure. Documentation is sparse and still being improved, but the project maintains a good coding style and the code is readable; most of the implementation can be understood by reading the code directly, while a few parts require database and distributed-systems theory plus some paper reading. The modular abstractions and layering stay clear and simple. Core objects such as the logical planner (LogicalPlanner), expressions (ExprNode), and execution operators (ExecNode) have only one layer of inheritance. This notably thin "glue layer" keeps the learning cost of the software low and reflects the KISS principle (Keep It Simple, Stupid!) of Unix open-source culture.
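That flat design can be illustrated with a hypothetical sketch (the names are borrowed from the article, the behavior is invented for illustration and is not BaikalDB's actual code): every operator derives directly from a single base, with no deep hierarchy in between:

```python
# Hypothetical one-layer operator hierarchy in the spirit of ExecNode.
class ExecNode:
    def __init__(self, *children):
        self.children = children
    def run(self, rows=None):
        raise NotImplementedError

class ScanNode(ExecNode):      # leaf: produces rows
    def __init__(self, rows):
        super().__init__()
        self.rows = rows
    def run(self, rows=None):
        return list(self.rows)

class FilterNode(ExecNode):    # exactly one level deep, like every operator
    def __init__(self, child, pred):
        super().__init__(child)
        self.pred = pred
    def run(self, rows=None):
        return [r for r in self.children[0].run() if self.pred(r)]

class LimitNode(ExecNode):
    def __init__(self, child, n):
        super().__init__(child)
        self.n = n
    def run(self, rows=None):
        return self.children[0].run()[: self.n]

# Plans compose by nesting nodes, not by subclassing subclasses.
plan = LimitNode(FilterNode(ScanNode(range(10)), lambda r: r % 2 == 0), 3)
print(plan.run())  # [0, 2, 4]
```

Because each operator sits directly under the base class, adding a new operator never requires navigating an inheritance chain, which is what keeps the learning cost low.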

All in all, BaikalDB is a very good DB to learn from: 9 points.

3.2 Test verification cost: 7 points

BaikalDB's testing and verification workload is similar to that of other DBs; 7 points for meeting expectations.

3.3 Business migration cost: 8 points

The cost of business migration mainly consists of data migration and SQL rewriting. Because BaikalDB is compatible with the MySQL protocol and the company has built a data synchronization platform around the MySQL ecosystem, data migration cost is low. For business code developed against MySQL, most SQL needs no rewriting; rewrites are mainly needed for MySQL syntax that BaikalDB does not yet support (subqueries, some system functions) and for slow queries. Business migration cost: 8 points.

3.4 User habits: 7 points

1. Graphical tools: Navicat, IDEA's built-in MySQL UI, and DataGrip can be used for simple table browsing and SQL execution, but not for complex management operations;

2. Integration with existing company systems (work orders, permissions, one-stop query, etc.): not yet connected;

3. Acceptance of new concepts (new data files, deployment mode, resource isolation, multi-tenancy, etc.).

3.5 O&M Tools: 7 points

  • Backup tool:

Hot backup: Backup and recovery based on SST

Cold backup: exports data logically via SQL statements, for example:

/*{"full_export":true}*/ select * from tb where id > x limit 1000;

  • Monitoring tool: Prometheus

  • Operation and maintenance script: script[5]

  • Deployment tool: Ansible

  • Pressure testing tool: Sysbench

  • Subscription tool: Binlog feature in development

  • Synchronization tool: Canal
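The cold-backup statement above relies on keyset pagination (`id > x limit 1000`). The driving loop might look like this sketch, where `fetch_batch` is a hypothetical stand-in for issuing that SQL through a MySQL-protocol client against a fake in-memory table:

```python
# Sketch of a keyset-paginated logical export loop. fetch_batch stands in
# for executing the full_export SELECT over a MySQL-protocol connection.
DATA = {i: f"row-{i}" for i in range(1, 2501)}  # fake table keyed by id

def fetch_batch(last_id, limit=1000):
    """Return up to `limit` rows with id > last_id, in id order."""
    ids = sorted(i for i in DATA if i > last_id)[:limit]
    return [(i, DATA[i]) for i in ids]

def export_all():
    last_id, out = 0, []
    while True:
        batch = fetch_batch(last_id)
        if not batch:
            break
        out.extend(batch)
        last_id = batch[-1][0]   # resume after the largest id seen so far
    return out

print(len(export_all()))  # 2500
```

Keyset pagination keeps each batch cheap regardless of table size, unlike `LIMIT offset, n`, whose cost grows with the offset.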

3.6 Software maturity: 7 points

3.7 Technical foresight: 9 points

BaikalDB provides PB-scale distributed scaling, dynamic schema changes, distributed transactions, multi-site active-active deployment, resource isolation, cloud-native K8s support, and HTAP. These match the company's future needs and plans, so we give 9 points.

4. Afterword

This concludes the series of articles on BaikalDB practice at Tongcheng-Elong. As an open-source NewSQL database only two years old, BaikalDB is still very young and has plenty of room to improve. At the same time, as a latecomer it borrows design ideas from many predecessors and enjoys a certain latecomer's advantage. BaikalDB's implementation is simple, its features are powerful, and its community is professional and friendly; whether for code study or business use there is much room to grow, and interested readers are welcome to participate. Given the author's limited knowledge, understanding and corrections are appreciated if anything in the text is wrong.

Recruitment Information:

The R&D Department of Baidu's Business Platform is mainly responsible for building the platforms behind Baidu's commercial products, covering advertising, landing-page hosting, full-domain data insight, and other core business directions. It is committed to helping customers and ecosystem partners keep growing through platform-based technical services, and to becoming the business service platform customers rely on most.

Whether you work on back end, front end, or algorithms, many positions are waiting for you. You are welcome to submit your resume; the R&D Department of Baidu's Business Platform looks forward to your joining!

Resume email: [email protected] (Note [Baidu Business])

