I still clearly remember an afternoon five years ago, when I was still at Wandoujia, chatting with Liu Qi and Cui Qiu about what the database of the future might look like. That conversation was like a seed, and looking at it today, it has grown lush and green. By convention, five years is an important milestone: not as long as ten years, nor as short as one or two, it is a good point at which to seriously review the past and look toward the future.

Five years ago, the idea was simple: build a better distributed database. From an academic point of view, we did not propose any astonishing magic algorithm, and the shared-nothing architecture we chose was nothing new in the industry at the time. What really excited me was this: what we wanted to build was the foundational software for an entire system organized around a Single Source of Truth. How to understand that sentence? I will come back to it later.

Data is central to the architecture

As an architect in the Internet industry, I deal with data of all kinds almost every day. Years of experience with different systems in different industries can be summarized in one sentence:

Data is central to the architecture.

If you think about it, everything we do really revolves around data: generating it, storing it, consuming it, moving it around. We only ever change the form of data and the mode of serving it according to different needs. Computer science students may remember a teacher saying: program = algorithm + data structure. Let me dare to imitate that sentence: system = business logic × data. A great many architectural problems originate at the data layer. Take the problems brought by the common "chimney" (siloed) systems, data islands in particular: in essence, the root cause is that the data layer was never connected. If you do not think about it from the perspective of data architecture, you are likely to treat the head when the head aches and the foot when the foot hurts, spending great effort for clumsy results. Conversely, if the data layer is governed well, it is like opening up the "Ren and Du meridians" in a kung fu novel: four ounces of effort can move a thousand catties.

But the ideal is plump, while reality is skinny. At least when we started five years ago, we did not think any existing system could truly handle data this way. Curious readers may ask: isn't there Hadoop? Isn't there NoSQL? Can't a relational database be sharded into multiple databases and tables? Indeed, these were almost all the candidates for the storage problem at the time, and their common characteristic is that none of them is perfect.

To be specific, none of these solutions covers all data application scenarios on its own. A complex business may need n of them at the same time to achieve full coverage. This is also why data pipelines like Kafka have become more popular in recent years as Internet businesses have grown more complex: from a data governance perspective, building roads between the "islands" is how you connect the various data platforms.

We wondered: could there be a system that covers as many scenarios as possible behind one unified interface?

We need a Single Source of Truth. Data sits in every corner of the application logic, and my ideal is that any data in the system can be accessed without restriction (leaving aside permissions and security, which are a separate problem). "Unrestricted" here is meant in a generalized sense. For example: capacity is unlimited, and the system can keep scaling as long as there are enough physical resources; there are no access model restrictions, so we can freely join and aggregate data; there are no consistency limitations; and operations require almost no human intervention.

An architecture centered on distributed databases

At that time, I was particularly fascinated by an American TV series: Person of Interest. In it there is a god-like artificial intelligence, The Machine, which collects and analyzes all the data in order to predict, or intervene in, people's future actions. The themes of the show are fairly orthodox, chivalry and righteousness and so on, but what fascinated me more was whether we could design a Machine ourselves. I am still no AI expert, but designing a database for The Machine seemed feasible. One of the more exciting things we have found over the years is:

An architecture centered around distributed databases is possible.

How should we understand this? As mentioned above, fragmentation at the data layer inevitably means that the business layer must take on extra complexity to compensate. Many engineers tend to think about the cost of maintaining systems linearly, but practical experience tells us otherwise: the complexity of one database versus ten databases is not a simple 10x. Once you take the flow of data between them into account, the maintenance cost can only be higher, and that is before counting the problems caused by heterogeneity.

What does a distributed database-centric architecture look like? Understandably, the core of the architecture is a storage system with broad enough scenario coverage and unlimited horizontal scaling capability. Most of the flow of data is confined within the database, so the application layer can be almost stateless: the central database takes care of most of the state, and each application can accelerate itself with its own cache. I want to stress that the reason I emphasized horizontal scaling above is that limited scalability is itself an important cause of fragmentation. We can never accurately predict the future; it is hard to imagine how our business will change even a year from now (think of this pandemic). The old adage is true: change is the only constant.

Another frequently asked question is why the cache layer should sit closer to the business layer, rather than letting the huge database at the center take responsibility for caching. My understanding is that only the business understands the business well enough to know which data to cache and with which policies, and for performance (low latency) it simply makes sense to cache closer to the business.
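To make the shape of this concrete, here is a minimal sketch in Go (my own illustration, not TiDB code) of a read-through cache living inside a stateless application process. The central database remains the single source of truth; the TTL policy and all names here are assumptions made up for the example.

```go
package cache

import (
	"sync"
	"time"
)

// entry is a cached value with an expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// ReadThroughCache sits inside the application process, in front of the
// central database. The application stays stateless in the important
// sense: losing this cache loses no truth, only some latency.
type ReadThroughCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	data   map[string]entry
	loader func(key string) (string, error) // e.g. a SELECT against the central DB
}

func NewReadThroughCache(ttl time.Duration, loader func(string) (string, error)) *ReadThroughCache {
	return &ReadThroughCache{ttl: ttl, data: make(map[string]entry), loader: loader}
}

// Get returns the cached value if it is still fresh; otherwise it loads
// the value from the database and caches the result.
func (c *ReadThroughCache) Get(key string) (string, error) {
	c.mu.Lock()
	if e, ok := c.data[key]; ok && time.Now().Before(e.expiresAt) {
		c.mu.Unlock()
		return e.value, nil
	}
	c.mu.Unlock()

	v, err := c.loader(key) // the single source of truth decides
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.data[key] = entry{value: v, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}
```

The point of the design is that the cache is disposable state: any replica of the application can be killed and restarted at will, and correctness still rests entirely with the database underneath.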

In contrast to the statement above that "the only constant is change," the biggest benefit of this architecture is that it responds to change without changing, or, in one simpler word: simplicity. Google actually figured this out a long time ago, because they understood what complexity really costs.

Another example is HTAP. If you follow the development of databases, you must be familiar with this word by now. In my opinion, the essence of HTAP lies precisely in the coverage mentioned above.

Traditional data architectures typically separate OLTP, OLAP, and offline warehouses, with each system doing its own job and synchronizing through separate pipelines (and sometimes ETL).

An HTAP system, by contrast, presents both workloads behind a single interface.

Although on the surface this looks like a simple unification of the interface layer, the implications are profound. First, the details of data synchronization are hidden inside the system, which means the database layer can decide for itself how to synchronize data. And since the OLTP and OLAP engines live in the same system, many details, such as transaction information, are not lost during synchronization, which means the internal analytical engine can do things a traditional OLAP system cannot. In addition, for the business layer, one less system means a more unified experience and a lower cost of learning and migration. Do not underestimate the power of unification.
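From the application side, the power of unification is easiest to see in code. Below is a hedged sketch in Go: a transactional write and an analytical aggregation travel over the same MySQL-compatible connection (TiDB speaks the MySQL wire protocol and listens on port 4000 by default; the orders table and its columns are invented for illustration).

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
)

func main() {
	// One connection string, one system: the hypothetical `orders`
	// table serves both workloads below.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// OLTP: a small transactional write.
	_, err = db.Exec(
		"INSERT INTO orders (user_id, amount) VALUES (?, ?)", 42, 99.5)
	if err != nil {
		log.Fatal(err)
	}

	// OLAP: an analytical aggregation over the same, freshly written
	// data. No ETL pipeline, no second system, no lost transaction info.
	var total float64
	err = db.QueryRow(
		"SELECT SUM(amount) FROM orders WHERE user_id = ?", 42).Scan(&total)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("total:", total)
}
```

In a traditional split architecture, the SELECT would run against a separate warehouse fed by an ETL pipeline, and the freshly inserted row might not become visible there for hours.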

Where is the future?

The above is what happened in the past five years, and it has almost come true step by step, just as we envisioned when we started the company. So what will happen in the next five years? As I learned more about the industry and the technology, there was at least one thing I knew for sure:

Flexible scheduling will be the core capability of future databases

No one can deny that the biggest change in IT in the last decade has been brought about by the cloud, and the revolution is still underway. What is the core capability of the cloud? I think it is elasticity. The granularity of computing resource allocation keeps getting finer, like moving from owning a house to renting one, or even to something as flexible as staying in a hotel. What does that mean? In essence, we no longer have to pay upfront for the imagined peak of our business.

In the past, whether we purchased servers or rented rack space, we had to commit money in advance; before the business peak ever arrived, those costs had already been paid. The advent of the cloud has turned elasticity into a fundamental capability of infrastructure, and I expect the same thing to happen with databases.

Many friends may ask: don't almost all databases nowadays claim to support transparent horizontal scaling? I hope you will not narrow "flexible scheduling" down to mere scalability; the emphasis is on the word "scheduling". A few examples may help:

  1. Can the database automatically recognize the workload and scale accordingly? For example, automatically purchasing machines in anticipation of a spike, creating more replicas of hot data, redistributing data, and expanding capacity ahead of time; then, after the business peak has passed, automatically recycling machines to shrink capacity (see the sketch after this list).

  2. Can the database sense business characteristics and determine data distribution based on access patterns? For example, if the data has obvious geographic characteristics (Chinese users are likely to access it from China, American users from the United States), the system automatically places the geographically distinct data in data centers in the corresponding regions.

  3. Can the database sense query types and access frequencies and automatically choose storage media for different kinds of data? For example, cold data automatically moves to cheaper storage such as S3, hot data stays on well-provisioned flash, and the exchange between hot and cold is completely transparent to the business side.
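None of the three capabilities above is claimed here as a finished feature; as a thought experiment, the sketch below shows the kind of control loop the first item implies, written in Go. Every name and threshold is hypothetical: Cluster stands in for whatever actuator can buy machines, add replicas of hot data, and recycle nodes after the peak.

```go
package scheduler

import "time"

// Metrics is a snapshot of the observed workload. In a real system these
// numbers would come from the database's own monitoring.
type Metrics struct {
	QPS          float64
	HotRegionQPS float64
}

// Cluster abstracts the actions a scheduler can take. The method names
// are invented; they stand in for "purchase machines", "create replicas
// of hot data", and "recycle machines after the peak".
type Cluster interface {
	Observe() Metrics
	AddNode() error
	RemoveNode() error
	AddHotReplica() error
}

// Run is the control loop: elasticity as a policy, not a pager duty.
func Run(c Cluster, scaleOutQPS, scaleInQPS, hotQPS float64, interval time.Duration) {
	for range time.Tick(interval) {
		m := c.Observe()
		switch {
		case m.QPS > scaleOutQPS:
			c.AddNode() // anticipate the spike, expand ahead of time
		case m.QPS < scaleInQPS:
			c.RemoveNode() // the peak has passed, shrink and save cost
		}
		if m.HotRegionQPS > hotQPS {
			c.AddHotReplica() // spread hot data across more copies
		}
	}
}
```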

Behind everything mentioned here lies a reliance on the capability of "elastic scheduling". I believe the cost of physical resources will keep falling, and the unit price of computing resources will fall with it. When storage and computing resources are no longer the problem, the problem becomes how to allocate resources efficiently. If efficient allocation is the goal, schedulability is obviously the foundation. Of course, as with all things, you have to walk before you can run. I believe that in the coming period we will see the first batch of new databases with such capabilities.

The next stage is intelligence

What does the further future look like? I don't know. But just as with The Machine, only enough data can give birth to intelligence. Just as we still know little about the universe and the ocean, our current understanding of data must be superficial; a great deal of data has never even been recorded, and a bigger mystery must be hidden inside the massive data we do have. I don't know what insights will be mined from data or how they will change our lives for the better, but I don't think it will be humans doing the mining. What this section describes may sound like science fiction, but I'd like to believe in a future where new intelligence emerges from a sea of data.

Epilogue

In the five years since starting the company, I have often looked back at our simplest starting point: to write a better database and solve, once and for all, the annoying problem of MySQL sharding. We have not deviated from that original aspiration, but along the journey we have seen a bigger world, step by step, and have become ever more capable of, and confident in, turning what we believe into reality:

I have a dream that in the future, software engineers will no longer have to work overtime to maintain the database, and all kinds of data-related problems will be automatically and properly handled by the database.

I have a dream that in the future, our data processing will no longer be fragmented, and any business system can store and obtain data conveniently.

I have a dream that in the future, when we face the flood of data, we will calmly respond to all changes.

I recently heard a quote that I personally like: Half of ambition is patience. Building a perfect database is not an overnight job, but I believe we are on the right track.

What's past is prologue.

Author: Dongxu Huang, co-founder and CTO of PingCAP, is a senior infrastructure software engineer and architect who previously worked at Microsoft Research Asia, NetEase Youdao, and Wandoujia. He specializes in distributed systems and database development, with rich experience and unique insights in the field of distributed storage. A passionate open source enthusiast and open source software author, his representative works include Codis, a distributed Redis caching solution, and TiDB, a distributed relational database. In 2015 he founded PingCAP, where his main work has been designing and developing the open source NewSQL database TiDB from scratch. The project has accumulated more than 23,000 stars on GitHub, making it a world-leading open source project in its field.


One More Thing

Over the next four weeks, the author of this article, Dongxu Huang, will host the "Future of Database" live series, describing the future of databases as we see it and sharing the design philosophy and practice of TiDB 4.0, a milestone on our journey of exploring that future. The first live session opens on Saturday night at 8 PM; for details, please visit: www.oschina.net/event/23157…