This article is reprinted from the WeChat public account Huatai Securities Digital Technology.

Original link: mp.weixin.qq.com/s/Hp-ZJLdvd…

The authors are the distributed database project team of Huatai Securities Digital Technology, which is mainly responsible for building Huatai Securities' distributed database systems. The team members all have many years of database experience, a deep understanding of how databases work, and a wealth of accumulated practical experience.

Problems such as the limited storage capacity of traditional databases, the high cost of capacity expansion, server pressure, and long response times have gradually become prominent, and distributed databases emerged to meet this need. At the end of 2016, Huatai Securities began investigating distributed database products. In recent years, the state has also continuously raised its strategic requirements for independent and controllable information technology and has fostered the domestic database industry, which not only improves autonomy and controllability but also continuously reduces enterprises' operating costs. After extensive comparison, TiDB was chosen. This article introduces the technical characteristics of TiDB, matters that need attention during development, and the progress of TiDB practice at Huatai Securities.

1. Technical characteristics of TiDB

1.1 TiDB profile

TiDB is an open-source distributed NewSQL database that combines the best features of traditional RDBMSs and NoSQL. Its design is inspired by Google Spanner and F1. TiDB aims to cover 100% of OLTP scenarios and 80% of OLAP scenarios, with more complex OLAP analysis handled through TiSpark. TiDB shields applications from the intrusiveness of sharding schemes such as splitting databases and tables, so developers no longer need to care about details such as how data is sliced and can focus on business development, which greatly improves R&D productivity.

1.2 Overall Architecture

TiDB adopts a shared-nothing distributed cluster architecture that separates computing from storage, consisting of three core components: TiDB Server, PD Server, and TiKV Server. In addition, the TiSpark component addresses users' complex OLAP needs. The overall architecture is shown in the figure below:

  • TiDB Server

Responsible for receiving SQL requests, handling SQL-related logic, locating through PD the TiKV addresses that store the data needed for a computation, interacting with TiKV to fetch the data, and finally returning the result. TiDB Server is stateless: it does not store data, it only performs computation. It can be scaled out without limit and provides a unified access address externally through load-balancing components (such as LVS, HAProxy, or F5).

  • Placement Driver (PD) Server

PD Server is the management module of the whole cluster. It is itself a multi-replica cluster built on the Raft protocol to ensure data consistency and high availability. It has three main tasks: first, storing the cluster's metadata (which TiKV node a given Key is stored on); second, scheduling and load-balancing the TiKV cluster (data migration, Raft group leader transfer, and so on); third, allocating globally unique, monotonically increasing transaction IDs.

  • TiKV Server

TiKV Server is responsible for storing data. Externally, TiKV is a distributed key-value storage engine that supports transactions. The basic unit of data storage is the Region: each Region stores the data of one Key range (from StartKey to EndKey), and each TiKV node is responsible for multiple Regions. TiKV uses the Raft protocol for replication to maintain data consistency and high availability. Replicas are managed per Region: the replicas of a Region, located on different nodes, form a Raft group and are copies of each other. Data load balancing across TiKV nodes is scheduled by PD at the Region level.

  • TiSpark

TiSpark is the main component by which TiDB addresses users' complex OLAP requirements. It runs Spark SQL directly on TiKV, TiDB's storage layer, combining the advantages of the TiKV distributed cluster with the big-data community ecosystem. With TiSpark, TiDB can support both OLTP and OLAP in a single system, sparing users the trouble of data synchronization.

1.3 Core Features

TiDB has the following core features:

  • Highly compatible with MySQL

For business systems without transaction-conflict scenarios, migrating from MySQL to TiDB is easy and in most cases requires no code changes; a sharded MySQL cluster (already split across databases and tables) can also be migrated in real time using TiDB's tools.

  • Horizontal elastic scaling

Horizontal scaling covers two aspects: computing capacity and storage capacity. TiDB Server is responsible for handling SQL requests; as the business grows, you can simply add TiDB Server nodes to improve overall processing capacity. TiKV is responsible for storing data; as the data volume grows, more TiKV Server nodes can be deployed to increase data capacity. PD schedules between TiKV nodes at the Region level and migrates part of the data to newly added nodes. So in the early days of a business, you can deploy only a small number of service instances and add TiKV or TiDB instances as needed as business volume grows.

  • Support for distributed transactions

TiDB supports standard ACID transactions, which are distributed through two-phase commit and optimistic locking.
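As a minimal sketch (the account table and values are hypothetical), an explicit transaction is written just as in MySQL; because TiDB uses optimistic locking, a write-write conflict with another transaction surfaces at COMMIT time rather than while the statements run:

BEGIN;
UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;
-- Under optimistic locking, a conflicting write by another transaction is
-- detected here, during the two-phase commit, and may cause the commit to
-- fail or to be retried automatically (see section 2.1).
COMMIT;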

  • High availability

TiDB, TiKV, and PD can all tolerate the failure of some instances without affecting the availability of the whole cluster. TiDB is stateless and can be deployed as multiple instances, with load-balancing components in front providing service externally. When a single instance fails, the sessions running on that instance are affected; from the application's perspective, a single request fails and service resumes after reconnecting. PD is a cluster that keeps its data consistent through the Raft protocol. When a single instance fails, if that instance is not the Raft leader, the service is not affected at all; if it is the Raft leader, a new leader is elected and service recovers automatically. PD cannot serve requests during the election, which takes roughly three seconds. TiKV is a cluster that maintains data consistency via the Raft protocol, with load-balancing scheduling handled by PD. When a node fails, all Regions stored on that node are affected: for Regions whose leader is on the failed node, service is interrupted until a new leader is elected; for Regions that only have followers there, service is not affected. When a failed TiKV node cannot be recovered within a certain period, PD migrates its data to other TiKV nodes.

  • One-stop HTAP solution

As a typical OLTP database, TiDB, combined with its solid OLAP performance and TiSpark, provides a one-stop Hybrid Transactional and Analytical Processing (HTAP) solution: a single store handles both OLTP and OLAP workloads, without the traditionally tedious ETL process.

  • Cloud native SQL database

TiDB is a database designed for the cloud, supporting public, private and hybrid clouds.

2. TiDB development notes

As a young NewSQL database, TiDB has achieved remarkable results in many respects. In terms of compatibility in particular, TiDB is compatible with more than 90% of MySQL's behavior, which lays a good foundation for the smooth migration of business systems. However, we still need to be careful about the remaining roughly 10% of incompatible behaviors to avoid risks to business systems.

2.1 Transactions

(1) Isolation level of TiDB

Unlike many traditional databases, TiDB supports the Snapshot Isolation (SI) level, implemented with optimistic locking plus MVCC. SI is roughly equivalent to the Repeatable Read (RR) isolation level, but there are some differences, as follows:

① TiDB's SI isolation level can avoid phantom reads, whereas the RR isolation level of the ANSI/ISO SQL standard cannot guarantee this.

A phantom read occurs when transaction A first queries and obtains n records matching a condition, and then transaction B modifies records outside those n records, or inserts m new records, so that they now match transaction A's query condition. When transaction A issues the same query again, it finds n + m matching records: a phantom read has occurred.
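A hedged two-session sketch of this scenario (table t and its data are hypothetical); under TiDB's SI, transaction A's second read still sees its original snapshot:

-- Session A
BEGIN;
SELECT * FROM t WHERE c > 10;    -- returns n rows
-- Session B
INSERT INTO t (c) VALUES (11);   -- adds a row matching A's condition
COMMIT;
-- Session A, same transaction
SELECT * FROM t WHERE c > 10;    -- still n rows under TiDB's SI;
                                 -- the ANSI RR standard would permit n + 1 here
COMMIT;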

② TiDB's SI isolation level cannot avoid Write Skew; the SELECT ... FOR UPDATE syntax needs to be used to avoid it.

Write skew occurs when two concurrent transactions each read two different but related rows, each updates one of the rows it read (a different row from the other transaction), and both commit. If there is a constraint linking the two rows, the final result may violate that constraint. The classic "black and white ball" example is often used to illustrate write skew; a sketch is given below:
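A minimal sketch of this example (the balls table and its data are hypothetical): there is one white ball and one black ball; transaction A turns white balls black while transaction B turns black balls white. Run serially, the two balls end up the same color, but under SI both transactions commit and the colors are merely swapped, a result that no serial order could produce:

-- Session A: turn every white ball black
BEGIN;
SELECT id, color FROM balls;                            -- sees one white, one black
UPDATE balls SET color = 'black' WHERE color = 'white';
COMMIT;
-- Session B, concurrently: turn every black ball white
BEGIN;
SELECT id, color FROM balls;                            -- also sees one white, one black
UPDATE balls SET color = 'white' WHERE color = 'black';
COMMIT;
-- Both commits succeed: each transaction updated a row the other only read,
-- so there is no write-write conflict under SI, and the colors end up swapped.
-- Fix: lock the rows that are read but not written, for example by starting
-- each transaction with SELECT * FROM balls FOR UPDATE, so that the two
-- transactions conflict and one of them fails or waits.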

③ TiDB cannot avoid Lost Updates under the default configuration.

A lost update occurs when two transactions A and B both read the same record and update the value of the same column. If A commits before B, then after B commits, A queries again and finds that its update has been lost.

TiDB does not avoid lost updates in its default configuration because the later-committing transaction is automatically retried on a transaction conflict (the retry obtains the latest TSO, which effectively restarts the transaction). Setting tidb_disable_txn_auto_retry to 1 avoids lost updates; after this change, however, conflicting transactions fail and are rolled back instead of being retried automatically.
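For example, automatic retry can be disabled at the session or global level (a minimal sketch; the variable is the one named above, which defaults to 0 in the versions discussed here):

SET @@session.tidb_disable_txn_auto_retry = 1;
-- or, for all new connections:
SET @@global.tidb_disable_txn_auto_retry = 1;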

(2) The affected rows returned by DML statements in an explicit transaction cannot be trusted

As with all distributed databases that use optimistic locking, in explicitly executed transactions (autocommit disabled with AUTOCOMMIT = 0, or the transaction explicitly started with a BEGIN statement), the affected rows count returned by a DML operation is not guaranteed to match the number of rows actually affected when the transaction finally commits.

In the case sketched below, transaction B loses its update under concurrency, so the affected rows it reported cannot be relied on.
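A reconstructed sketch for illustration (the table, rows, and values are hypothetical; assume t initially contains one row with id = 1 and val = 0):

-- Session A
BEGIN;
UPDATE t SET val = 1 WHERE id = 1 AND val = 0;   -- reports affected rows: 1
-- Session B
BEGIN;
UPDATE t SET val = 2 WHERE id = 1 AND val = 0;   -- also reports affected rows: 1
-- Session A
COMMIT;                                          -- succeeds
-- Session B
COMMIT;
-- With automatic retry, B's UPDATE is re-executed with a new timestamp; it now
-- matches 0 rows, and the commit succeeds without changing anything, so the
-- "affected rows: 1" that B saw earlier was wrong.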

This is because, in an explicitly executed transaction, the DML operations and the commit are executed separately. During the commit, if a retry is triggered by a transaction conflict, an unreachable TiKV node, network instability, and so on, TiDB obtains a new timestamp and re-executes the DML operations of the transaction. The original SI isolation level can then exhibit non-repeatable reads and phantom reads similar to the Read Committed (RC) isolation level after a retry. Because the retry mechanism is internal, if the transaction ultimately commits successfully, the user is generally unaware that a retry occurred and therefore must not use affected rows as a condition in the program's execution logic. In an implicit transaction (a single SQL statement committed on its own), the statement returns the result after the commit, so affected rows in an implicit transaction can be trusted.

(3) Spring's PROPAGATION_NESTED propagation behavior is not supported

The PROPAGATION_NESTED behavior supported by Spring starts a nested transaction, which is a child transaction started independently within the current transaction. A savepoint is recorded when the nested transaction starts; if the nested transaction fails, the transaction rolls back to the savepoint state. The nested transaction is part of the outer transaction and is committed together with it when the outer transaction commits. The following example shows the savepoint mechanism:

mysql> BEGIN;
mysql> INSERT INTO T2 VALUES(100);
mysql> SAVEPOINT svp1;
mysql> INSERT INTO T2 VALUES(200);
mysql> ROLLBACK TO SAVEPOINT svp1;
mysql> RELEASE SAVEPOINT svp1;
mysql> COMMIT;
mysql> SELECT * FROM T2;
+------+
| ID |
+------+
| 100 |
+------+

TiDB does not support savepoints and therefore does not support the PROPAGATION_NESTED behavior. If an application based on the Java Spring framework uses PROPAGATION_NESTED, it must be adjusted on the application side to remove the nested-transaction logic.

(4) Restrictions on large transactions

When a log-based database faces a large transaction, the available log capacity has to be increased manually so that a single transaction does not fill up the log. In TiDB, large transactions that modify data can cause problems because of the requirements of distributed two-phase commit. TiDB therefore places some limits on transaction size to reduce this impact:

  • Each key-value pair does not exceed 6MB

  • The total number of key-value pairs does not exceed 300,000

  • The total size of key-value pairs cannot exceed 100MB

A row of data is one key-value pair, and each row's entry in an index is one key-value pair; if a table has two indexes, three key-value pairs are written for every inserted row. The best practice is to break a large transaction into paginated SQL statements and commit them in batches. In TiDB, pagination can be implemented with ORDER BY combined with LIMIT and OFFSET.

update tab set value = 'new_value' where id in (select id from tab order by id limit 0, 10000);
commit;
update tab set value = 'new_value' where id in (select id from tab order by id limit 10000, 10000);
commit;
update tab set value = 'new_value' where id in (select id from tab order by id limit 20000, 10000);
commit;
...

2.2 Auto-increment ID

TiDB's AUTO_INCREMENT IDs are guaranteed only to be increasing and unique, not to be allocated consecutively. TiDB currently allocates IDs in batches, so if data is inserted through multiple TiDB servers at the same time, the allocated auto-increment IDs will not be consecutive. When multiple threads concurrently insert data through different TiDB servers, a row inserted later may receive a smaller auto-increment ID. In addition, TiDB allows AUTO_INCREMENT to be specified only on columns of integer type, and a table allows only one column with the AUTO_INCREMENT attribute.
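A small sketch (the table is hypothetical): AUTO_INCREMENT sits on a single integer column, and IDs allocated through different TiDB servers come from different cached batches, so they are unique but neither consecutive nor necessarily ordered by insertion time across servers:

CREATE TABLE t_order (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  info VARCHAR(64)
);
-- Inserts routed through one TiDB server may, for example, receive IDs 1, 2, 3, ...
-- while inserts routed through another server receive IDs from a later batch.
INSERT INTO t_order (info) VALUES ('a'), ('b');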

2.3 Uniqueness constraint

As with other databases, primary keys and unique indexes in TiDB impose uniqueness constraints on the data in a table, but with the following differences:

  • Primary keys in TiDB must be declared when the table is created. The current version (v2.1.0) cannot add, modify, or delete the primary key of an existing table; unique indexes do not have this restriction

  • The Drop Column operation does not support dropping primary key columns

TiDB does not support foreign keys, so all statements that create foreign keys should be removed from table structures. Cascading operations across multiple tables that relied on foreign keys need to be implemented in the application.

2.4 Indexes

As with table data, table indexes in TiDB are stored as KV pairs in the storage engine: each row's entry in an index is one KV pair. For example, for a table with two indexes, three KV pairs are written for every row of data inserted.
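For instance (a hypothetical table with two secondary indexes), each insert below writes three KV pairs: one for the row itself and one for each index entry:

CREATE TABLE t_trade (
  id BIGINT PRIMARY KEY,
  account_id BIGINT,
  trade_date DATE,
  INDEX idx_account (account_id),
  INDEX idx_date (trade_date)
);
-- One row KV pair plus two index KV pairs are written for this insert.
INSERT INTO t_trade VALUES (1, 1001, '2019-04-30');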

TiDB supports primary key indexes, unique indexes, and ordinary secondary indexes, each of which can consist of a single column or multiple columns (composite indexes). TiDB currently (v2.1.0) does not support bidirectional indexes, full-text indexes, or global indexes on partitioned tables.

In TiDB, the query predicates =, >, <, >=, <=, like '...%', not like '...%', in, not in, <>, !=, is null, and is not null can use an index, and the optimizer decides whether an index is actually used. The query predicates like '%...', like '%...%', not like '%...', not like '%...%', and <=> cannot use an index.
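Two hedged examples (the table and the index on name are hypothetical; the optimizer still makes the final choice):

-- Assuming an index exists on the name column:
SELECT * FROM t_user WHERE name LIKE 'zhang%';    -- prefix match, can use the index
SELECT * FROM t_user WHERE name LIKE '%zhang%';   -- leading wildcard, cannot use the index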

TiDB currently (v2.1.0) cannot combine two indexes on the same table within a single query.

As in any other database, composite indexes in TiDB are designed, as a general rule, with the most selective columns placed first, so that SQL execution can narrow down to as few rows as possible as early as possible. In the current version (v2.1.0 and earlier), it is also important to note that a range query on an earlier column in a composite index stops the use of the subsequent index columns, as illustrated below. In the following query:

select a,b,c from tablename where a <predicate> '<value1>' and b <predicate> '<value2>' and c <predicate> '<value3>';

If the predicate on column a is = or in, then the query condition on b can use the composite index (a,b,c). For example: select a,b,c from tablename where a = 1 and b < 5 and c = 'ABC';

Similarly, if the predicates on both a and b are = or in, then the query condition on c can use the composite index (a,b,c). For example: select a,b,c from tablename where a in (1,2,3) and b = 5 and c = 'ABC';

If the predicate on column a is neither = nor in, then the query condition on b cannot use the composite index (a,b,c). In this case, the condition on b is applied without the index to the data already filtered by the condition on a. For example: select a,b,c from tablename where a > 1 and b < 5 and c = 'ABC';

This is because, in TiDB, once a column in a composite index is used for a range query, the conditions on the subsequent index columns are evaluated by a non-index scan within the data range filtered by the preceding columns.

To sum up, when designing composite indexes in TiDB, place the most selective columns first where possible, and, because a range query on an earlier column stops the use of the later index columns, place columns that frequently appear in range queries toward the end of the index.
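A sketch of that guideline, reusing the table and columns from the examples above (the index statement itself is illustrative): with equality predicates on the leading columns and the range condition on the last column, all index columns can be used, while a range on a leading column stops index use for the columns after it:

CREATE INDEX idx_abc ON tablename (a, b, c);
-- Equality on a and b, range only on the trailing column c: all three columns usable.
SELECT a, b, c FROM tablename WHERE a = 1 AND b = 5 AND c > 'ABC';
-- Range on the leading column a: only a's condition uses the index; the
-- conditions on b and c are applied without the index inside a's range.
SELECT a, b, c FROM tablename WHERE a > 1 AND b = 5 AND c = 'ABC';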

2.5 Write Hotspot

TiKV is a KV store partitioned by range: the Key of each KV pair determines which Region it is written to. For tables with a non-integer primary key or no primary key, TiDB uses an implicit auto-increment row ID, so a large number of inserts writes data into a single Region and causes a write hotspot. By setting the table-level option SHARD_ROW_ID_BITS (as shown below), the row IDs can be scattered across multiple Regions, alleviating the write-hotspot problem. However, if the value is set too large, the number of RPC requests increases, raising CPU and network overhead.

SHARD_ROW_ID_BITS = 4 means 2^4 = 16 shards

SHARD_ROW_ID_BITS = 6 means 2^6 = 64 shards

SHARD_ROW_ID_BITS = 0 means 2^0 = 1 shard, which is the default

CREATE TABLE statement example:

CREATE TABLE t (c int) SHARD_ROW_ID_BITS = 4;

ALTER TABLE example:

ALTER TABLE t SHARD_ROW_ID_BITS = 4;

Partitioning a table can spread the data of one table into multiple physical tables. By designing partitioning rules properly, hot writes can be avoided.
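A minimal sketch, assuming range partitioning on a business column (the table and partition boundaries are illustrative); each partition is a separate physical table, so concurrent inserts spread across partitions rather than piling into one Region:

CREATE TABLE t_flow (
  account_id BIGINT,
  amount DECIMAL(16, 2),
  created_at DATETIME
)
PARTITION BY RANGE (account_id) (
  PARTITION p0 VALUES LESS THAN (1000000),
  PARTITION p1 VALUES LESS THAN (2000000),
  PARTITION p2 VALUES LESS THAN (3000000),
  PARTITION pmax VALUES LESS THAN (MAXVALUE)
);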

2.6 Unsupported Features

TiDB is compatible with MySQL in most cases, but some features are not currently supported because they are not well implemented in distributed environments. For example:

  • Stored procedures

  • Views

  • Triggers

  • User-defined functions

  • Foreign key constraints

  • Full-text indexes

  • Spatial indexes

  • Non-UTF-8 character sets

  • CREATE TABLE tblName AS SELECT stmt syntax

  • …

3. Practice progress

We have been investigating and testing TiDB since the beginning of 2017 and have by now tested its functionality and performance against multiple business systems. Over that time, TiDB has grown from an immature product with unstable operation, weak performance, and missing peripheral tools into a financial-grade distributed NewSQL database that runs stably, scales performance linearly with the number of nodes, offers rich peripheral tooling, and has an active community.

In late April 2019, we launched our first TiDB production cluster: six servers in a "3 TiDB Server + 3 TiKV Server" layout, with PD Server co-located on the TiDB Server machines. Each server is configured with 128 GB of memory, two 12-core CPUs, and six 960 GB SSDs in RAID 10.

The business system connected to the production TiDB cluster is being brought online in the following stages:

1) Use the TiDB Lightning tool to import the existing historical data up to the previous month into the TiDB cluster in the form of files

2) On the launch day, use the TiDB Lightning tool to import the current month's data into the TiDB cluster in the form of files

3) After go-live, the business side double-writes and uses Kafka to synchronize newly generated data to the TiDB production cluster

4) After several months of stable operation, query traffic is gradually cut over to TiDB

5) After several further months of stable operation, both query and write traffic are cut over to TiDB, and new data is synchronized back to the original Mycat+MySQL environment through application-level double writes

6) The original Mycat+MySQL environment is taken offline completely

We are currently in the third stage. Since going live, the TiDB cluster has been running stably, with QPS peaking at 34,000. Write speed is comparable to that of the original MySQL environment, there is no data backlog in Kafka, and system resource usage is balanced with headroom to spare.

Conclusion

Judging from the results of our practice, TiDB as a NewSQL database does show clear technical advantages. Its MySQL compatibility makes adapting business systems much cheaper, and its support for distributed transactions allows business systems to access TiDB as if it were a single MySQL instance. The multi-replica mechanism based on the Raft protocol strongly guarantees data consistency and availability, and its cloud-native design makes scaling out and scaling in very convenient, freeing up a great deal of operations staff time.

However, we also need to recognize its shortcomings. TiDB's biggest disadvantage is that it is still young and many functions are not yet complete. Our approach is therefore to try it on a small scale first, roll it out on non-trading systems, and then expand the scope once it has been running stably.