Author | Huang Dongxu

This article, part of the “2021 Technology Roundup” series, focuses on the important developments in the database field in 2021.

“2021 Year-end Technical Inventory” is a major project launched by Digger, covering Serverless, Service Mesh, the big front end, databases, artificial intelligence, low code, edge computing, and many other technical fields. Looking back at the past and ahead to the future, it reviews the development of IT technology in 2021, takes stock of the year’s major events, and forecasts the trends to come. At the same time, we are kicking off our 15th tech-themed essay call, inviting you to share what you see as the tech trends of 2022.

2021 was a golden year for global data technology, and also a year of vigorous development for Chinese databases. In 2021 the database industry made great progress in technology, ecosystem, and the expansion of industry scenarios, and more and more people are paying attention to the field. A new generation of massive, real-time, scalable data architecture has become the key support for enterprise digital innovation. Beyond architecture, the biggest change in China’s database industry this year is the rise of open source, whose value is now drawing serious attention from mainstream enterprise users.

Technical changes: cloud technology and HTAP grew rapidly, Hadoop gradually faded

The biggest change in data technology is likely the impact of cloud infrastructure on databases, and the biggest change over the next few years will be the cross-innovation between distributed databases and cloud infrastructure. Among recent developments in the database industry, the thing that has changed it most profoundly is not the engineering question of “do you write good code?” but the change in the layer underneath the database.

In the past, when we thought about database software or system software, we started from an assumption: the software runs on specific hardware, such as a computer; even in a distributed database, each node is still an ordinary computer. That assumption has now changed. When our next generation is old enough to learn to program, they will not see hardware the way we do now, the CPU, hard disk, and network; what they will see is an S3 API provided by AWS. This change is not just a change in where software runs; more importantly, the underlying logic of architecture and programming has changed.

The cloud’s impact on infrastructure and software is profound. At PingCAP, the clearest sign of this is that there is probably far more product investment in TiDB Cloud now than in the database kernel.

The second big change on the technology side is that HTAP has gone mainstream. We often talk about being “data-driven”, and the most urgent challenges for enterprises are the “massive scale” and “fragmentation” of data. On the one hand, enterprises need to handle huge volumes of data, which is a challenge in itself; on the other hand, the fragmentation of data systems makes it hard to sift out the valuable parts. In the past, facing the two broad needs of online transaction processing and online analytical processing, databases split into many subdivided categories, each with its own learning cost and each acting as its own fiefdom. This created fragmented data islands and complex technology stacks. Data islands greatly limit the comprehensive use of data, and they reduce an enterprise’s overall capacity for real-time aggregation, real-time analysis, and real-time decision-making.

Against the backdrop of accelerating digitization, industry-leading companies have spent the past two years pushing breakthroughs in application scenarios and technology to break through the limitations of these fragmented data islands and merge online transactions (OLTP) with analytics (OLAP), a combination we call HTAP. The converged value HTAP brings lies in simplifying the technology stack and strengthening real-time capabilities. HTAP makes real-time, interactive BI a reality for many enterprises, providing one-stack data services for high-growth companies and digital innovation scenarios.

We published a paper in 2020, “TiDB: A Raft-based HTAP Database”, describing the theoretical basis of HTAP. This one-stack architecture had been tried before, but only in the last two years has HTAP truly become mainstream technology, widely used in real-time digital scenarios, thanks to innovation in the theoretical architecture and the maturing of the infrastructure.
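
To make the one-stack idea concrete, here is a minimal sketch in Go, using the standard database/sql package with the MySQL-protocol driver that TiDB speaks. The `ALTER TABLE ... SET TIFLASH REPLICA` statement is real TiDB syntax for adding a columnar replica; the DSN, table, and query are hypothetical placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
)

func main() {
	// Hypothetical DSN; TiDB's default SQL port is 4000.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ask TiDB to maintain one columnar (TiFlash) replica of the table.
	// Analytical queries can then be served from TiFlash while OLTP
	// traffic keeps hitting the row-based TiKV replicas.
	if _, err := db.Exec(`ALTER TABLE orders SET TIFLASH REPLICA 1`); err != nil {
		log.Fatal(err)
	}

	// The optimizer picks the row store or the column store per query;
	// a large aggregation like this one is a natural TiFlash candidate.
	var revenue float64
	row := db.QueryRow(`SELECT SUM(amount) FROM orders WHERE year = 2021`)
	if err := row.Scan(&revenue); err != nil {
		log.Fatal(err)
	}
	fmt.Println("2021 revenue:", revenue)
}
```

The point of the sketch is that both workloads go through one SQL interface and one copy of the data's source of truth, instead of an ETL pipeline between two systems.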

The third change is less a technological breakthrough than an “unloading”. In 2021, the Apache Software Foundation (ASF) moved at least 19 of its open source projects into the Apache Attic (the ASF’s archive for retired projects), 10 of which belong to the Hadoop ecosystem. My personal feeling is that Hadoop as a big data stack is largely gone, as the need for real-time data analysis and a unified stack grows.

A database is a kind of basic software that must be combined with business scenarios. We often say that real scenarios are the best architects: only with scenarios and users does database technology have the opportunity to progress and develop. The gradual decline of Hadoop has opened new opportunities for emerging databases, such as distributed technologies and NewSQL databases. In some ways this is a big step forward.

Ecosystem changes: open source databases became mainstream and the dominant model for building an ecosystem

From an ecosystem point of view, 2021 was the year open source first entered the mainstream enterprise software market in China.

We have always said that open source is the best path to success for basic software. At the beginning of 2021, open source was written into China’s 14th Five-Year Plan for the first time, and as a result it has attracted more and more attention from people beyond the developer community. In the database space, deployments of open source databases surpassed commercial databases for the first time, and Databricks and MongoDB more than doubled in valuation or market capitalization compared with 2020. The IPO boom of global open source software companies has arrived.

PingCAP has been open source from its inception and has learned a great deal from it. From an ecosystem point of view, the open source development model accumulates users quickly. TiDB 1.0 was released in November 2017; from its birth to now, we know the names of more than 2,000 users and more than 1,600 contributors. In the CNCF contribution ranking, PingCAP has placed sixth in the world over the past two years. From a technical point of view, open source speeds up product iteration: roughly 40% of the TiDB codebase is rewritten every year. That speed of iteration is made possible by the open source community; no single team, company, or enterprise building a database from scratch could ever evolve that fast.

Academia and industry are also becoming more connected through open source. This year SIGMOD, the most prestigious international academic conference in the database field, was held in China for the first time, in Xi’an, a sign that China’s database development now carries real influence worldwide. For our own part, PingCAP hosted the VLDB Summer School in 2021 and cooperated with the CCF Database Committee to guide students through the theory and engineering of distributed transactions, based on TiDB Talent Plan. In this way, the development practice of distributed databases flows from industrial practice into academic research. China has some of the most challenging database application scenarios in the world; through large-scale user practice, TiDB has been validated by more and more global users and is on the way to defining the de facto standard for distributed databases. We hope this collaboration lets practice empower academia, cultivates more database talent, and keeps the technology at the leading edge.

DBaaS in practice

The cloud is reshaping the business logic of databases

The development of technology and the evolution of the ecosystem have drawn more people onto the database race track; the OLAP field in particular has produced many excellent startups. When thinking about value, most people instinctively focus on core application scenarios or core data storage, because the profit model of commercial databases used to be, essentially, collecting protection money: the more important the scenario being “protected”, the higher the fee. In the open source + cloud era, however, this logic changes fundamentally:

  • Open source databases have matured to meet most of the core business scenario requirements
  • The cloud can standardize delivery
  • The user base of open source databases is huge

Based on these three premises, you will find a new line of thought beyond chasing core scenarios: follow the common paths of the user journey, regardless of whether a scenario is “core”; optimize the user (developer) experience to the extreme; use cloud infrastructure to keep costs low; and finally rely on traffic entry points plus product-led growth (PLG) to spread virally. The commercial success of HashiCorp’s cross-cloud deployment tool Terraform in 2021 demonstrates this.

Over the last two years I have also restated PingCAP’s mission: to make our services available to developers around the world, Anywhere, with Any Scale. I have therefore been thinking hard about how to scale up and commercialize a popular open source database.

The premise of scalability must be a cloud-based SaaS service. If two years ago there were still doubts about the reliability and maturity of the cloud, in 2022 there are none.

Of course, going from DB to DBaaS is not simply a matter of swapping the underlying resources for cloud resources. More and more things must be considered: cost reduction and efficiency on the technical side, operations automation, multi-tenant management, data security and compliance, as well as pricing models and commercialization strategy.

Next, taking TiDB as an example, I will share some experience and thinking from putting DBaaS into practice.

Cost savings: a disaggregated architecture design

Cloud native technology ultimately has to solve the problem of cost.

In the past, the boundary between computation and storage in TiDB was blurry, which made it difficult to handle scenarios with different compute-to-storage ratios. In an on-premises deployment, increasing storage capacity means adding storage nodes, and because of hardware constraints, CPU and network bandwidth grow along with the disks, wasting resources. This is a problem every shared-nothing database faces.

Up in the cloud, things are different. AWS’s block storage service EBS, for example, especially the gp3 series, delivers the same IOPS at the same cost no matter which machine it is attached to, with very good performance and cloud-native integration. To exploit gp3’s characteristics, can we push the compute-storage boundary down from TiKV into the storage service itself, so that both TiDB and most of TiKV become compute units? That would be far more flexible.
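
A minimal sketch of what “storage decoupled from the machine” looks like in practice, using the AWS SDK for Go v2 to create a gp3 volume whose size, IOPS, and throughput are provisioned independently; the zone and the numbers are illustrative, not a recommendation.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// With gp3, capacity, IOPS, and throughput are three independent dials:
	// you pay for what you provision, regardless of the attached instance.
	out, err := client.CreateVolume(context.TODO(), &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-west-2a"), // illustrative zone
		VolumeType:       types.VolumeTypeGp3,
		Size:             aws.Int32(500),  // GiB
		Iops:             aws.Int32(6000), // decoupled from size
		Throughput:       aws.Int32(500),  // MiB/s, also decoupled
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("created volume:", aws.ToString(out.VolumeId))
}
```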

The cloud’s cost savings don’t stop there. The real money in the cloud is CPU; the bottleneck will be computing, not capacity. Clusters and instances can be optimized with Spot Instances on top of shared resource pools, storage services can be selected on demand, and specific scenarios can be served by tailored combinations of EC2 instance types; serverless computing and elastic compute resources all become possible.
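
On the compute side, the sketch below (same AWS SDK for Go v2, hypothetical AMI ID) requests a one-time Spot instance instead of on-demand capacity, the kind of cheap, reclaimable building block a shared resource pool can be optimized around.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Spot capacity is billed at the much lower spot price; the trade-off
	// is that AWS may reclaim it, so it suits stateless or retryable work.
	out, err := client.RunInstances(context.TODO(), &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // hypothetical AMI
		InstanceType: types.InstanceTypeM5Xlarge,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		InstanceMarketOptions: &types.InstanceMarketOptionsRequest{
			MarketType: types.MarketTypeSpot,
			SpotOptions: &types.SpotMarketOptions{
				SpotInstanceType: types.SpotInstanceTypeOneTime,
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("launched:", aws.ToString(out.Instances[0].InstanceId))
}
```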

Beyond the separation of compute and storage, as far as I can see, the network, memory, and even the CPU cache will be disaggregated in the future. An application, especially a distributed application, never demands hardware resources in a fixed ratio. It is like cooking: with only one ingredient on hand you cannot do much, but with many raw materials you can combine them freely to taste. That is the opportunity the cloud brings.

Security

In addition to cost, security on the cloud is also a critical issue. The public clouds officially supported by TiDB are AWS and GCP. Users on the cloud run inside their own VPCs, and there is the additional step of connecting those networks to ours. We cannot see the user’s data, yet the user can access their business with high performance. How do we ensure security?

Security on the cloud is completely different from security off the cloud. A particularly simple example: off the cloud, you mainly need to consider RBAC permissions inside the database, but on the cloud it is far more complicated, requiring a complete user security architecture from the network down to storage. The key to doing security well on the cloud is never to reinvent it yourself, because homegrown mechanisms almost always have holes. So we now take full advantage of the security mechanisms the cloud provides, such as key management and security rules. Conveniently, these services are clearly priced; you just build them into the billing model.
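
One example of “not reinventing security yourself” is envelope encryption with the cloud’s key management service. The hedged sketch below uses AWS KMS via the Go SDK v2 to generate a data key under a hypothetical key alias; the plaintext key encrypts data locally and is then discarded, while the wrapped copy is stored alongside the data.

```go
package main

import (
	"context"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kms"
	"github.com/aws/aws-sdk-go-v2/service/kms/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := kms.NewFromConfig(cfg)

	// Ask KMS for a fresh AES-256 data key under a (hypothetical) master key.
	dk, err := client.GenerateDataKey(context.TODO(), &kms.GenerateDataKeyInput{
		KeyId:   aws.String("alias/tidb-demo-key"), // hypothetical alias
		KeySpec: types.DataKeySpecAes256,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Encrypt locally with the plaintext key, then keep only the wrapped copy.
	block, _ := aes.NewCipher(dk.Plaintext)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	sealed := gcm.Seal(nonce, nonce, []byte("customer data"), nil)

	// Store `sealed` and dk.CiphertextBlob together; later, kms.Decrypt
	// unwraps the data key so the ciphertext can be opened again.
	fmt.Printf("encrypted %d bytes; wrapped key is %d bytes\n",
		len(sealed), len(dk.CiphertextBlob))
}
```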

Operation and maintenance automation

Another important point in building a DBaaS, closely related to cost, is operations automation. The cloud is a scale business, and one of the hardest parts of the domestic database business right now is delivery: for some offline data centers, a big client may well demand 20 people on site. What we want is to support a 1,000-customer system with a 10-person delivery team, which is a prerequisite for scale. On the cloud, TiDB is deployed with Kubernetes, and we use Gardener to manage multiple Kubernetes clusters.

Kubernetes

What are the steps to turning TiDB into a cloud service? The first is to turn every human operation into code. When TiDB needs to scale out, can the system expand itself rather than a human expanding it? When TiDB needs failure recovery, can the machine take part instead of a person? We have converted all of TiDB’s operations into a Kubernetes Operator, which means TiDB’s operations are automated. Kubernetes also masks the interface differences between cloud vendors, since every cloud vendor provides a Kubernetes service.
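
To make “operations as code” concrete: a Kubernetes Operator is essentially a reconcile loop that keeps observing actual state and nudging it toward desired state. This is a heavily simplified sketch built on controller-runtime, not the real tidb-operator (github.com/pingcap/tidb-operator); it merely scales a StatefulSet toward a replica count declared in a hypothetical annotation, but the pattern is the same one that automates scaling and recovery for whole TiDB clusters.

```go
package main

import (
	"context"
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconciler drives actual state toward desired state; no human runs commands.
type reconciler struct {
	client.Client
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var sts appsv1.StatefulSet
	if err := r.Get(ctx, req.NamespacedName, &sts); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Desired replicas come from a hypothetical annotation; a real operator
	// would read them from a custom resource such as TidbCluster.
	want, err := strconv.ParseInt(sts.Annotations["example.com/desired-replicas"], 10, 32)
	if err != nil {
		return ctrl.Result{}, nil // annotation absent or malformed: nothing to do
	}
	desired := int32(want)
	if sts.Spec.Replicas != nil && *sts.Spec.Replicas == desired {
		return ctrl.Result{}, nil // already converged
	}
	sts.Spec.Replicas = &desired
	return ctrl.Result{}, r.Update(ctx, &sts) // the loop, not a person, scales it
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.StatefulSet{}).
		Complete(&reconciler{mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```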

Pulumi

As I just said, if the logic of deployment, operations, and scheduling lives in people’s heads, it will be unstable and unmaintainable. Our philosophy is to solidify into code anything that can be turned into code and never rely on people: even opening a server or buying a virtual machine becomes a script written with Pulumi.
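
Strictly speaking, Pulumi is not a programming language but an infrastructure-as-code toolkit driven from a general-purpose language. The hedged sketch below uses its Go SDK to “buy a virtual machine” as code, with a placeholder AMI ID; running `pulumi up` converges real infrastructure to this description.

```go
package main

import (
	"github.com/pulumi/pulumi-aws/sdk/v5/go/aws/ec2"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Buying a server is a declaration in code, not a ticket to a human.
		srv, err := ec2.NewInstance(ctx, "tidb-node", &ec2.InstanceArgs{
			Ami:          pulumi.String("ami-0123456789abcdef0"), // placeholder AMI
			InstanceType: pulumi.String("m5.xlarge"),
			Tags:         pulumi.StringMap{"role": pulumi.String("tikv")},
		})
		if err != nil {
			return err
		}
		ctx.Export("privateIp", srv.PrivateIp)
		return nil
	})
}
```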

Gardener

TiDB uses Gardener’s API to manage and control Kubernetes clusters in different regions, and each Kubernetes cluster can be divided among TiDB clusters for different tenants, forming a multi-cloud, multi-region, multi-AZ system. One benefit of this architecture is that users can enable TiDB on demand with the cloud service provider and in the geographic region where their applications live, keeping the technology stack unified.

Being a tenant across clouds becomes smooth and easy: with Gardener, cross-cloud migration for a user is little more than moving a Kubernetes cluster from cloud A to cloud B, whether that is AWS or GCP.
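
In Gardener’s model, each managed Kubernetes cluster is a “Shoot” resource in the garden cluster, so fleet management becomes ordinary Kubernetes API calls. A rough sketch using the dynamic client; the kubeconfig path and project namespace are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig pointing at the garden cluster (path is illustrative).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/garden-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// A Shoot is just a custom resource: one entry per managed cluster,
	// whether it lives on AWS, GCP, or elsewhere.
	shoots := schema.GroupVersionResource{
		Group:    "core.gardener.cloud",
		Version:  "v1beta1",
		Resource: "shoots",
	}
	list, err := dyn.Resource(shoots).
		Namespace("garden-tidb-cloud"). // hypothetical project namespace
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range list.Items {
		fmt.Println("managed cluster:", s.GetName())
	}
}
```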

Business SLA

There are also many things to consider around the SLA; this is what TiDB is doing and will continue to do.

TiDB has a large number of overseas customers whose demands on the database differ greatly from domestic users’, and cross-data-center deployment is a hard requirement for them. Given current data security requirements in various countries, there are many restrictions on where data may move, so compliant cross-data-center capabilities matter for a database. Facing Europe’s GDPR, for example, if the regulated data can be kept in Europe and only the unregulated data allowed out, a lot of trouble is saved. We believe this capability will also become a critical requirement for Chinese manufacturers and customers, including manufacturing and domestic compliance scenarios.

This capability is easy to implement on the cloud. AWS, for example, is natively multi-AZ and multi-region; without worrying about the underlying layer, a user only needs a few clicks in the console to open machines in another data center, and the data is replicated over. There is much more to consider for global data distribution, or for global and local transactions.
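
On the database side, TiDB’s answer is Placement Rules in SQL, introduced as an experimental feature around TiDB 5.3: data placement is declared as a policy and attached to tables, so a GDPR-style “keep these rows in Europe” constraint becomes plain DDL. A hedged sketch follows; the region labels (which must match the labels configured on the TiKV stores), table, and DSN are hypothetical.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Declare where replicas may live (hypothetical region labels).
	if _, err := db.Exec(`CREATE PLACEMENT POLICY eu_only
		PRIMARY_REGION="eu-west-1"
		REGIONS="eu-west-1,eu-central-1"`); err != nil {
		log.Fatal(err)
	}

	// Any table bound to the policy keeps all of its replicas in Europe.
	if _, err := db.Exec(`CREATE TABLE eu_users (
		id BIGINT PRIMARY KEY,
		email VARCHAR(255)
	) PLACEMENT POLICY=eu_only`); err != nil {
		log.Fatal(err)
	}
}
```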

TiDB has prepared for this ahead of time, and these capabilities are coming soon.

To provide services on the cloud, technology matters, but compliance is the prerequisite. Ecosystem integration on the cloud has one main line: follow the data. The upstream, the downstream, and the control of data are the three most important points. Upstream of TiDB are MySQL and data files in S3; downstream, it only needs to support synchronization to Kafka or other message queue services. For data management and control, overseas users especially prefer integrating with platforms like Datadog and Confluent over one-stop control through the database vendor.

Future

There are a few technologies I am particularly bullish on. The first is the cloud-native form of distributed, highly scalable OLTP databases. I don’t think AWS Aurora, or the current mainstream of so-called cloud-native databases, has reached its final form; it is just a cloudified version of standalone database technology. How distributed databases will really integrate with cloud-native infrastructure is still an open question, and the essence of that question is how distributed databases go from On Cloud to In Cloud.

The second is AI for DB. The learned index, for example, is one of the areas I follow closely. For frequently accessed databases and tables, could we, in the future on the cloud, use machine learning methods to help users build better indexes, speed up queries, and use the database more efficiently?
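
The idea behind a learned index is simple enough to sketch: instead of walking a B-tree, fit a model that maps a key to its approximate position in the sorted data, then correct with a small bounded search. Below is a toy version in Go with a single straight-line model; real systems in the learned-index literature use piecewise models, and everything here is illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// linearIndex predicts a key's position as pos ~= slope*key + intercept
// and remembers the worst prediction error seen while "training".
type linearIndex struct {
	keys             []uint64
	slope, intercept float64
	maxErr           int
}

func build(keys []uint64) *linearIndex {
	idx := &linearIndex{keys: keys}
	n := len(keys)
	// Toy model: a straight line through the first and last points of the
	// key-to-position mapping.
	idx.slope = float64(n-1) / float64(keys[n-1]-keys[0])
	idx.intercept = -idx.slope * float64(keys[0])
	for i, k := range keys {
		pred := int(idx.slope*float64(k) + idx.intercept)
		d := pred - i
		if d < 0 {
			d = -d
		}
		if d > idx.maxErr {
			idx.maxErr = d // the error bound guarantees the window below
		}
	}
	return idx
}

// lookup predicts a position, then binary-searches only the small window
// that the recorded error bound guarantees must contain the key.
func (idx *linearIndex) lookup(key uint64) (int, bool) {
	pred := int(idx.slope*float64(key) + idx.intercept)
	lo := pred - idx.maxErr
	if lo < 0 {
		lo = 0
	}
	hi := pred + idx.maxErr + 1
	if hi > len(idx.keys) {
		hi = len(idx.keys)
	}
	i := lo + sort.Search(hi-lo, func(j int) bool { return idx.keys[lo+j] >= key })
	return i, i < len(idx.keys) && idx.keys[i] == key
}

func main() {
	keys := []uint64{2, 3, 5, 8, 13, 21, 34, 55, 89, 144}
	idx := build(keys)
	pos, ok := idx.lookup(34)
	fmt.Println(pos, ok) // 6 true
}
```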

The third is the storage engine. The research I am doing now is to break the database storage engine into a highly microservice-like system and move each module onto the cloud. When compaction happens in the LSM-tree data structure we currently use, it causes performance jitter. On the cloud, however, with serverless computing and shared storage, it is possible to separate compaction from serving so that the resource consumption of compaction barely affects foreground performance. Much more can be done inside the compaction process, including building the learned index I mentioned above.
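
A rough sketch of the “separate compaction from serving” idea: reads always go to an immutable snapshot of sorted runs, while a merge worker, which in a cloud design could be a stateless service over shared storage, produces the compacted run off to the side and publishes it with one atomic pointer swap. The serving path never blocks on the merge. The types and scheduling here are invented for illustration; this is not TiKV’s actual engine.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// A run is an immutable sorted slice, standing in for an SST file.
type run []int

// snapshot is the set of runs visible to readers; swapped atomically.
var snapshot atomic.Pointer[[]run]

// get serves reads from the current snapshot; it never waits on compaction.
func get(key int) bool {
	for _, r := range *snapshot.Load() {
		i := sort.SearchInts(r, key)
		if i < len(r) && r[i] == key {
			return true
		}
	}
	return false
}

// compact merges all runs off to the side. In a cloud design this work
// could run in a separate serverless worker over shared storage, so its
// CPU and I/O never contend with the serving path.
func compact() {
	old := *snapshot.Load()
	var merged run
	for _, r := range old {
		merged = append(merged, r...)
	}
	sort.Ints(merged)
	// Publish the compacted result with a single atomic pointer swap.
	snapshot.Store(&[]run{merged})
}

func main() {
	runs := []run{{5, 9, 13}, {1, 9, 42}, {7, 8}}
	snapshot.Store(&runs)
	fmt.Println(get(42)) // true, served from the un-compacted runs
	compact()
	fmt.Println(get(42)) // still true, now from the single merged run
}
```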

All of this is premised on the database being cloud native. As a final note: starting in November 2021, TiDB offers developers a free 12-month trial on the cloud, with rapid deployment, HTAP support by default, compute isolation via containers, and dedicated block storage. Our website is tidbcloud.com, and we will support domestic clouds in the future. We look forward to your experience and feedback.

About the author:

Dongxu Huang, co-founder and CTO of PingCAP, is a senior basic-software engineer and architect. He previously worked at Microsoft Research Asia, NetEase Youdao, and Wandoujia; he specializes in distributed systems and database development and has deep experience and distinctive insight in distributed storage. A passionate open source enthusiast and open source author, his representative works include the distributed Redis caching solution Codis and the distributed relational database TiDB. In 2015 he founded PingCAP, where his main work has been designing and developing the open source NewSQL database TiDB from scratch. The project has accumulated more than 29,000 stars on GitHub, making it a top open source project in the field worldwide.

Related links:

Serverless: industry, academia, and community blossoming everywhere, with domestic vendors following fast

Kubernetes ecosystem: major releases turning inward, with security worth watching

Big front end: the front end enters deep water, and low code for developers continues to heat up

Year-end Service Mesh inventory: practicality first, ecosystem first

Rust year-end inventory | the ecosystem landscape (Part 1)

Rust year-end inventory | the ecosystem landscape (Part 2)