Last October, we released TiDB version 1.0. By then we had been working day and night for two and a half years, and we considered version 1.0 ready for use in production environments. Over the following six months, we worked tirelessly on version 2.0 while keeping version 1.0 stable and adding necessary new features to it. Today, after half a year and 6 RC versions, TiDB 2.0 GA is officially released.

Version 2.0 Planning

In the planning phase of version 2.0, we thought hard about what this version should deliver. Based on the needs of our existing users, technology trends, and community feedback, we decided that version 2.0 needed to focus on the following:

  • **Ensure the stability and correctness of TiDB.** These two properties are the most basic requirements of any database software. A database is the cornerstone of a business, so any jitter or error can have a huge impact on that business. A large number of users already run TiDB in production, their data volumes are growing, and their businesses are evolving. We care a great deal about how to keep a TiDB cluster running stably over the long term, how to reduce system jitter, and how to schedule intelligently, so we have done a lot of research and analysis in these areas.

  • **Improve the query performance of TiDB on large data volumes.** Many of the users we talk to have hundreds of GB, or even hundreds of TB, of data. On the one hand, their data keeps growing; on the other hand, they want to run real-time queries over it. Improving query performance on large data volumes is therefore very helpful to users.

  • **Optimize TiDB for ease of use and maintainability.** The TiDB system as a whole is relatively complex, and it is harder to operate, maintain, and use than a stand-alone database. We therefore want to make TiDB as convenient to use as possible, for example by simplifying deployment, upgrades, and scaling, and by making system anomalies as easy to locate as possible.

Guided by these three principles, we have made many improvements. Some are visible from the outside, such as the significant improvement in OLAP performance, a large number of new monitoring items, and various optimizations to the operation and maintenance tools. Many others are hidden inside the database, where they improve its stability and correctness.

Correctness and stability

After the 1.0 release, we began building and refining Schrodinger, our automated test platform, and moved away entirely from deploying test clusters by hand. We also added a large number of new test cases that cover every layer, from RocksDB at the bottom, through Raft and transactions, up to SQL.

In our Chaos testing, we introduced more fault injection tools, such as using SystemTap to delay I/O, and we added fault injection tests at specific points in the business logic code, to make sure that TiDB keeps running stably under abnormal conditions.
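As an illustration of what code-level fault injection can look like, here is a minimal Go sketch. It is not TiDB's actual mechanism: the registry, the `Enable`/`Disable`/`inject` helpers, the `commit` function, and the injection-point name `commit/before-write` are all hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInjected is the hypothetical error returned by an enabled injection point.
var ErrInjected = errors.New("injected: disk write failed")

// failpoints maps an injection-point name to the error it should return.
// A name that is absent from the map injects nothing.
var (
	mu         sync.RWMutex
	failpoints = map[string]error{}
)

// Enable makes the named injection point return err.
func Enable(name string, err error) {
	mu.Lock()
	defer mu.Unlock()
	failpoints[name] = err
}

// Disable turns the named injection point off again.
func Disable(name string) {
	mu.Lock()
	defer mu.Unlock()
	delete(failpoints, name)
}

// inject returns the error configured for name, or nil if it is disabled.
func inject(name string) error {
	mu.RLock()
	defer mu.RUnlock()
	return failpoints[name]
}

// commit stands in for a piece of business logic. The injection point sits
// right before the write, so a test can simulate a failure at that exact step.
func commit(store map[string]string, key, value string) error {
	if err := inject("commit/before-write"); err != nil {
		return err // the write is skipped, so the store stays unchanged
	}
	store[key] = value
	return nil
}

func main() {
	store := map[string]string{}

	Enable("commit/before-write", ErrInjected)
	fmt.Println(commit(store, "k", "v"), len(store))

	Disable("commit/before-write")
	fmt.Println(commit(store, "k", "v"), len(store))
}
```

A test enables the injection point, runs the logic, and checks that the system fails cleanly (here, that the store is left untouched) before disabling the point and confirming that normal operation resumes.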

We had already done a lot of TLA+ proof work and some simple testing, and after 1.0 we began using TLA+ systematically to verify that our implementation is correct by design.

For the storage engine, to improve the stability and performance of large-scale clusters, we optimized the Raft process and introduced new features such as Region Merge and Raft Learner. We optimized the hot spot scheduling mechanism, which now collects more information and uses it to make more reasonable scheduling decisions. We also optimized RocksDB’s performance, using features such as DeleteFilesInRanges to reclaim space more efficiently, reduce disk load, and use disk resources more evenly.

OLAP performance optimization

In version 2.0, we refactored the SQL optimizer and execution engine to select the optimal query plan as quickly as possible and execute it as efficiently as possible.

Version 1.0 had already moved from a rule-based query optimizer to a cost-based one, but it was not perfect. In version 2.0, we made statistics more accurate and more promptly updated, and we improved the SQL optimizer’s capabilities: query cost estimation is more accurate, the analysis of complex filter conditions is more detailed, correlated subqueries are handled more elegantly, and the selection of physical operators is more flexible and accurate.
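To make the idea of cost-based selection concrete, here is a deliberately simplified Go sketch. It is not TiDB's optimizer: the two candidate plans, the `choosePlan` function, and the cost factors `seqRowCost` and `indexRowCost` are assumptions invented for this illustration. The point it shows is that the choice depends on the estimated selectivity, which is why accurate statistics matter.

```go
package main

import "fmt"

// Plan is a candidate physical plan with its estimated cost.
type Plan struct {
	Name string
	Cost float64
}

// Hypothetical per-row cost factors: reading a row sequentially is cheap,
// while fetching a row through an index involves a more expensive lookup.
const (
	seqRowCost   = 1.0
	indexRowCost = 4.0
)

// choosePlan compares a full table scan with an index scan.
// totalRows and selectivity (the fraction of rows matching the filter)
// would come from table statistics in a real optimizer.
func choosePlan(totalRows, selectivity float64) Plan {
	tableScan := Plan{Name: "TableScan", Cost: totalRows * seqRowCost}
	indexScan := Plan{Name: "IndexScan", Cost: totalRows * selectivity * indexRowCost}
	if indexScan.Cost < tableScan.Cost {
		return indexScan
	}
	return tableScan
}

func main() {
	// A selective filter favors the index; a broad filter favors the table scan.
	fmt.Println(choosePlan(1000000, 0.01).Name)
	fmt.Println(choosePlan(1000000, 0.5).Name)
}
```

With these assumed factors, the index scan wins only while fewer than a quarter of the rows match. If the statistics misestimate the selectivity, the optimizer picks the wrong side of that threshold.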

In this release, the SQL execution engine introduced a new internal data representation called Chunk. A Chunk holds a batch of rows rather than a single row, and the values of each column are stored contiguously in memory, which makes memory usage more compact. This brings several benefits: 1. Memory consumption is significantly reduced; 2. Memory is allocated in batches, reducing GC overhead; 3. Data is passed between operators in batches, reducing call overhead; 4. In some scenarios, vectorized computation becomes possible and CPU cache misses are reduced.
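The layout described above can be sketched in Go as follows. This is a minimal illustration under simplifying assumptions, not TiDB's actual Chunk implementation: it supports only int64 columns, and the type and method names (`Int64Column`, `NewChunk`, `AppendRow`, `SumColumn`) are hypothetical.

```go
package main

import "fmt"

// Int64Column stores all values of one column contiguously in one slice.
type Int64Column struct {
	values []int64
}

// Chunk is a simplified batch of rows stored column by column.
type Chunk struct {
	columns []Int64Column
}

// NewChunk allocates each column once, with room for capacity rows,
// instead of allocating memory row by row.
func NewChunk(numCols, capacity int) *Chunk {
	cols := make([]Int64Column, numCols)
	for i := range cols {
		cols[i].values = make([]int64, 0, capacity)
	}
	return &Chunk{columns: cols}
}

// AppendRow appends one row. The caller must pass exactly one value
// per column.
func (c *Chunk) AppendRow(row ...int64) {
	for i, v := range row {
		c.columns[i].values = append(c.columns[i].values, v)
	}
}

// NumRows returns the number of rows currently held in the chunk.
func (c *Chunk) NumRows() int {
	if len(c.columns) == 0 {
		return 0
	}
	return len(c.columns[0].values)
}

// SumColumn aggregates one column in a tight loop over contiguous memory,
// the access pattern that is friendly to the CPU cache.
func (c *Chunk) SumColumn(col int) int64 {
	var sum int64
	for _, v := range c.columns[col].values {
		sum += v
	}
	return sum
}

func main() {
	chk := NewChunk(2, 1024)
	chk.AppendRow(1, 10)
	chk.AppendRow(2, 20)
	chk.AppendRow(3, 30)
	fmt.Println(chk.NumRows(), chk.SumColumn(1)) // prints "3 60"
}
```

An operator that receives such a chunk processes the whole batch in one call, rather than being invoked once per row.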

With these two changes, TiDB’s performance in OLAP scenarios has improved greatly, as the TPC-H comparison results show (github.com/pingcap/doc…). Every query runs faster in 2.0, and most are several times or even orders of magnitude faster. In particular, some queries that could not complete in 1.0 run well in 2.0.

Ease of use and operation

To make TiDB easier to install and use, we have also made a number of optimizations in monitoring, operations, and tools.

In terms of monitoring, we added more than 100 monitoring items and exposed some runtime information through HTTP interfaces and SQL statements, to help with tuning the system and locating problems.

In terms of operation and maintenance, we optimized our tools to simplify operational procedures, reduce their complexity, and lessen their impact on online services. The tools also offer a wider range of functions, including automatic deployment of the Binlog component and support for TLS.

Version update

A rolling upgrade from TiDB 1.0 to 2.0 is available, as described in this document.

One more thing

We also released TiSpark 1.0 GA at the same time. Take a look.