Abstract: On April 17 (Paris time), Alibaba Cloud's POLARDB made its international debut at ICDE 2018, where Alibaba Cloud also hosted a dedicated POLARDB technology session. At the conference, Alibaba Cloud presented its academic achievements, promoting Cloud Native Database as an industry standard.


The following is a transcript of the speech by Alibaba Cloud senior technical expert Cai Songlu:



Now I would like to introduce our cloud native database, POLARDB.

You may ask what exactly a “cloud native database” is, what the standard for a cloud native database is, and how and why we define it. For now, let’s put those questions aside; we’ll answer them later.



Now let’s take a look at what a cloud native database looks like: what is the threshold, and what are its characteristics? First, a cloud native database must deliver superior performance, with QPS in the millions. Second, it must support very large-scale storage, up to 100 TB. The ability to run with zero downtime on the cloud is also very important. Last but not least, the cloud is an ecosystem, and the database must be compatible with the open-source ecosystem.

A cloud native database is like a sports car. A sports car has many characteristics, such as its look and its speed, but a car with that look and speed is not necessarily a sports car. So, back to our question: what is the standard for a cloud native database? Before answering, let’s take a look at the history of databases.



Let me look at the history of database development from four dimensions. First, from the perspective of data scale, we are living in an era of data explosion. Second, database theory has evolved, especially around the CAP theorem, and we have made breakthroughs in algorithmic theory. Third, users and application scenarios are changing. Fourth, infrastructure is migrating from traditional IDCs to the cloud.



Why is data exploding, and how? Here I refer to the Internet Trends report from Mary Meeker, often called the “Queen of the Internet.” As you can see, the history of the Internet can be divided into three stages. The first stage is the PC Internet, where data was mainly produced by PCs. The second stage is the mobile Internet, where data is mainly produced by smartphones. The third stage is the IoT: from now on, almost all data will be generated by IoT devices, be it your watch, your fridge, the lights in your house, your car, or any other device.

With the explosion of data come many challenges. Massive data means the cost of storing and analyzing it rises dramatically, and moving and analyzing data becomes difficult. In short, data becomes hard to use.



In recent years there have been major changes in distributed systems theory, and CAP is one of them. The CAP theorem was introduced in 2000. In it, C stands for consistency, A for availability, and P for partition tolerance. Its core point is that, under P, C and A cannot both be satisfied at the same time. CAP is a great theorem, but there are many misconceptions about it. One of them is that C and A are mutually exclusive, so some systems choose to abandon one to satisfy the other, as many NoSQL databases do. In reality, the relationship between C and A is not 0 and 1, but 100% and 99%.

The P problem can itself be modeled as an A problem. Partitions are mainly caused by faults such as bad network cards, switches, and wrong routing configuration. Consider an example: a node is partitioned from the network because of a faulty network card, so requests to that node fail. If we can automatically remove that node, the requests will be retried and sent to a healthy node. In fact, we do the same thing when a node goes down. So in some cases, the P problem and the A problem can be treated as the same class of problem.

How can we automatically remove failed nodes while still maintaining data consistency? We can use the Paxos algorithm to achieve this goal. Paxos guarantees consistency and only requires a majority of nodes to acknowledge an operation, so when there are network partitions, crashes, or slow nodes, we can tolerate and automatically eliminate the problem nodes. If the nodes are split into several partitions and no partition holds a majority, we choose consistency at the expense of availability, but this rarely happens in the real world. So in most cases we can preserve both C and A by modeling the P problem as an A problem.
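To make the majority argument concrete, here is a minimal sketch, not POLARDB code, of how a quorum check behaves when some replicas are partitioned, crashed, or slow:

```python
# Minimal illustration: with 2f + 1 replicas, a write commits as long as a
# majority (f + 1) acknowledges it, so up to f faulty or partitioned nodes
# can be tolerated and removed automatically.

def majority(total_replicas: int) -> int:
    """Smallest number of acknowledgements that forms a quorum."""
    return total_replicas // 2 + 1

def can_commit(acks: int, total_replicas: int) -> bool:
    """A write is durable once a majority of replicas have acknowledged it."""
    return acks >= majority(total_replicas)

# 3 replicas: one faulty node is ignored, the write still commits.
assert can_commit(acks=2, total_replicas=3)
# If only a minority is reachable, we refuse to commit:
# consistency is chosen over availability, as described above.
assert not can_commit(acks=1, total_replicas=3)
```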



Twenty years ago, only governments and some big companies used databases. Now businesses of all sizes, from large companies to small and micro enterprises, run on databases, and their needs are changing.

First, the database must be elastic. Sometimes we run a marketing campaign, such as Black Friday, without knowing the real traffic for that day; sometimes a piece of news suddenly goes viral and traffic surges unpredictably. We want the database to scale elastically. Second, with globalization, trade is becoming more and more transparent. Many users run small businesses with thin margins and need more economical solutions. Customers today also have strict requirements and are sensitive to latency: the lower the database latency, the less loss it brings to the customer. Finally, the database must be able to respond quickly to every potential problem and recover quickly from failure.

In summary, the current demands on a database are: elasticity, low cost, high performance, and business continuity.



Now, everything is always online. Before the cloud era, data was scattered across IDCs; now it sits in data lakes. Data is generated online and used to train models at the same time, so data is produced online, analyzed online, and applied online. Our lives are now always online: clothing, food, housing, travel, work, social contact, entertainment, and so on can all be handled online; with a mobile phone you can get by without leaving home. And your customers are always online, all over the world, day and night, all the time.



At present, 70% of new enterprises in China are affected by data challenges. The main problems they face are: high cost, as they cannot afford commercial licenses or professional engineers; a strong will to develop but limited data capability; data backup, data mining, and troubleshooting are very difficult for them; and their data sits in islands scattered across many places, unable to be brought online and therefore wasted, yet it keeps exploding, making it hard to store, move, analyze, and use.

Based on the evolution and data challenges outlined above, we believe that a cloud-native database should meet the following criteria.



First, a cloud native database is not only a TP database but also an AP database: TP and AP are fused together, which we call HTAP, and we benefit a lot from this architecture. Second, the cloud native database must be serverless; with serverless, we can cut costs significantly. Finally, the cloud native database must be intelligent, like a consultant, capable of doing much of the diagnostic and administrative work, which improves the user experience and frees users from chores.

Now let’s elaborate.



In HTAP, TP and AP share one copy of storage, so there is no data delay for analysis. Because there is no need for data synchronization, we do not need to replicate data from the master node to a read-only node; the data is real time, which is very beneficial for applications with demanding latency requirements. When TP and AP share the same storage, the cost of at least one copy is saved. In the computing layer, AP and TP are isolated in separate containers, so AP has no impact on TP, and with this isolation, TP and AP can easily be scaled separately.



With serverless, product specification or version upgrades can be achieved at zero cost. Compute nodes run in lightweight containers, and client session lifecycles are relatively short, so during a rolling upgrade clients barely notice any change. Serverless also makes it easy to use resources on demand and pay by storage, with low compute costs, and you can specify different storage strategies for different business models: more memory and SSDs for busy businesses, and HDDs for idle ones, which drastically reduces costs.
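As an illustration of the “different storage strategies for different business models” idea, here is a hedged sketch; the tier names and thresholds are hypothetical and are not a POLARDB API:

```python
# Hypothetical storage-tier policy: busy workloads get memory + SSD, idle
# workloads are placed on HDD. Thresholds and tier names are illustrative
# only, not part of the POLARDB product interface.

def choose_storage_tier(qps: float, hot_data_ratio: float) -> str:
    """Pick a storage tier from simple workload statistics."""
    if qps > 10_000 or hot_data_ratio > 0.5:
        return "memory+ssd"     # latency-sensitive, busy business
    if qps > 100:
        return "ssd"            # moderate traffic
    return "hdd"                # idle business, optimize for cost

print(choose_storage_tier(qps=50_000, hot_data_ratio=0.7))  # memory+ssd
print(choose_storage_tier(qps=20, hot_data_ratio=0.05))     # hdd
```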



As for intelligence, the intelligent advisor analyzes the data generated by an instance, such as CPU, storage, and memory usage and water marks, and gives you SQL or index optimization suggestions. To outsiders the intelligent advisor looks like a DBA; it arms the user as if they were a professional DBA. When there is a problem on the data link, which is long and complex, we may not know exactly where the problem lies; with a monitoring and diagnosis system we can easily diagnose the whole link and quickly identify the root cause. The intelligent advisor also covers other functions such as cost control, security, and auditing.
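To illustrate what such an advisor might do, here is a toy sketch; the rules, metric names, and thresholds are invented for illustration and are not POLARDB’s actual advisor:

```python
# Toy "intelligent advisor" rules: look at instance metrics and slow-query
# statistics and emit human-readable suggestions. All names and thresholds
# below are made up for illustration.

def advise(metrics: dict, slow_queries: list[dict]) -> list[str]:
    suggestions = []
    if metrics.get("cpu_usage", 0) > 0.9:
        suggestions.append("CPU water mark is high: consider scaling up the compute node.")
    if metrics.get("storage_usage", 0) > 0.8:
        suggestions.append("Storage is above 80%: consider expanding storage or archiving cold data.")
    for q in slow_queries:
        if q.get("rows_examined", 0) > 100 * q.get("rows_returned", 1):
            suggestions.append(f"Query '{q['sql'][:40]}' scans far more rows than it returns: consider adding an index.")
    return suggestions

print(advise({"cpu_usage": 0.95},
             [{"sql": "SELECT * FROM orders WHERE user_id = 42",
               "rows_examined": 500_000, "rows_returned": 10}]))
```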



Under the guidance of the above standards and thresholds, we created a new cloud native database, POLARDB. I will elaborate on POLARDB from all angles: architecture, product design, future work, and more.



This is the big picture of the POLARDB architecture. The upper layer is the compute layer and the lower layer is the storage layer; storage and compute are separated. We also implemented ParallelRaft, a variant of Raft that supports out-of-order commits and is much faster than traditional Raft. At the same time we use a lot of new hardware to improve performance and reduce costs, and POLARDB is also designed for the next generation of hardware.



For the compute nodes, there is only one write entry point: the R/W node handles all types of requests, including reads and writes, while the other nodes are read-only and handle only read requests. For the storage nodes, we use the ParallelRaft algorithm to achieve consistent three-copy writes. Why do we separate storage and compute? First, we can choose different types of hardware for storage and compute: for the compute layer we focus on CPU and memory performance, and for the storage layer we focus on low-cost storage, so each can be customized and optimized. After separation, compute nodes are stateless and easy to migrate and fail over, and storage replication and HA can also be improved. Storage can now easily be pooled and managed in blocks, so there are no fragmentation issues and no imbalance, and it is easy to scale. It is also easy to implement serverless in this architecture; you don’t even need any compute nodes when your business is idle.



In POLARDB, all the logic runs in user space. A request is issued by the database engine, PFS maps it to a set of I/O requests, and PolarSwitch sends these requests to the storage layer via RDMA. The ParallelRaft leader then processes these requests, persists them to disk through SPDK, and synchronizes two copies of the data to two followers through RDMA. There is no system call on the whole path, no context switch, and no redundant data copy. Zero context switches plus zero copies make POLARDB an extremely fast database.
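Here is a simplified, self-contained simulation of that write path, following the description above; all names and steps are illustrative, and the real path runs in user space over RDMA and SPDK rather than in Python:

```python
# Simplified simulation of the write path: engine write -> PFS block
# mapping -> PolarSwitch routing -> ParallelRaft three-copy commit.

from dataclasses import dataclass

@dataclass
class BlockIO:
    offset: int
    data: bytes

def pfs_map(file_offset: int, payload: bytes, block_size: int = 4096) -> list[BlockIO]:
    """PFS maps an engine-level write into fixed-size block I/O requests."""
    return [BlockIO(file_offset + i, payload[i:i + block_size])
            for i in range(0, len(payload), block_size)]

def parallel_raft_write(io: BlockIO, followers: int = 2) -> bool:
    """Leader persists the entry locally (via SPDK in the real system),
    replicates it to two followers (via RDMA), and commits on a majority."""
    acks = 1 + followers          # leader's own write plus follower acks
    return acks >= 2              # majority of the three replicas

def write_path(file_offset: int, payload: bytes) -> bool:
    """PolarSwitch would route each block I/O to its ParallelRaft leader."""
    return all(parallel_raft_write(io) for io in pfs_map(file_offset, payload))

assert write_path(0, b"x" * 10_000)
```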



PFS is a static library linked into the database process. PFS is a distributed file system; its metadata is synchronized through the Paxos algorithm, so PFS allows parallel reads and writes. To speed up access, there is a metadata cache in the database process, and the cache is invalidated through version control. To help you understand PFS, here is an analogy between PFS and a traditional file system.
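The version-controlled metadata cache can be illustrated with a minimal sketch; the class names and semantics are assumptions for illustration, not the actual PFS protocol:

```python
# Version-based cache invalidation: every metadata update bumps a version;
# a database process trusts its cached metadata only while its cached
# version matches the authoritative one.

class MetadataStore:
    """Authoritative metadata (kept consistent across nodes, via Paxos in PFS)."""
    def __init__(self):
        self.version = 0
        self.files = {}

    def update(self, name: str, blocks: list[int]):
        self.files[name] = blocks
        self.version += 1            # any change invalidates stale caches

class MetadataCache:
    """Per-process metadata cache inside the database engine."""
    def __init__(self, store: MetadataStore):
        self.store = store
        self.version = -1
        self.files = {}

    def lookup(self, name: str) -> list[int]:
        if self.version != self.store.version:   # stale cache: refresh
            self.files = dict(self.store.files)
            self.version = self.store.version
        return self.files[name]

store = MetadataStore()
store.update("tablespace.ibd", [1, 2, 3])
cache = MetadataCache(store)
assert cache.lookup("tablespace.ibd") == [1, 2, 3]
store.update("tablespace.ibd", [1, 2, 3, 4])      # another writer extends the file
assert cache.lookup("tablespace.ibd") == [1, 2, 3, 4]
```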



We use ParallelRaft to synchronize the three copies of the data. ParallelRaft is a variant of Raft. Raft does not allow out-of-order commits or log holes; ParallelRaft relaxes this constraint. Transaction semantics and serialization are handled by the database engine layer, and we have a complete catch-up mechanism for log holes, which is a very complex process. We discuss it in detail in our VLDB 2018 paper, which has just been accepted.
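The intuition behind out-of-order commits can be shown with a toy sketch: block writes that touch disjoint ranges do not conflict, so they can be acknowledged independently. This is only an assumed simplification of the idea; the real conflict detection and hole catch-up mechanism is described in the VLDB 2018 paper:

```python
# Toy illustration: two block writes conflict only if their ranges overlap,
# so a later, non-overlapping write can be acknowledged before an earlier
# one, leaving a temporary log hole.

def overlaps(a_start: int, a_len: int, b_start: int, b_len: int) -> bool:
    return a_start < b_start + b_len and b_start < a_start + a_len

def can_ack_out_of_order(entry, earlier_unapplied_entries) -> bool:
    """An entry may be acknowledged ahead of earlier pending entries
    as long as it does not overlap any of them."""
    return all(not overlaps(entry[0], entry[1], e[0], e[1])
               for e in earlier_unapplied_entries)

# A write to blocks [100, 108) can be acked before a pending write to
# [0, 8); a write to [4, 12) cannot, because the ranges overlap.
assert can_ack_out_of_order((100, 8), [(0, 8)])
assert not can_ack_out_of_order((4, 8), [(0, 8)])
```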



In POLARDB, we use new hardware and push it to its limits. In addition to RDMA, we are also working on Open-Channel SSDs to maximize the value of SSDs, and we are accelerating the storage layer with 3D XPoint technology.



This is a comparison of sysbench benchmark results for POLARDB and RDS. The purple columns are POLARDB and the pink ones are RDS. Based on these numbers, with the same resources POLARDB’s average read performance is 6 times that of RDS, and its average write performance is 3 times that of RDS. So the improvement is huge.



Let’s take a look at POLARDB’s product features. In terms of performance, it easily scales to 1,000,000 QPS with latency under 2 ms, and the maximum storage capacity is 100 TB. For elasticity, it is very easy to scale both horizontally and vertically, with zero downtime and no interference during specification upgrades. For compatibility, it is 100% compatible with MySQL. In terms of availability, we choose consistency over availability when errors occur, so currently we promise three nines of availability and five nines of data reliability.



In POLARDB, reads and writes are separated onto different nodes. There is only one write node, which can handle both read and write requests, and there can be multiple read-only nodes.



For scalability, all nodes can be scaled vertically: the read/write node, the read-only nodes, instance storage, and the storage cluster. When failover happens between the read/write node and a read-only node, zero downtime can be achieved.



For data migration, you can start your database by loading a backup from OSS (similar to AWS S3), or you can replicate data from RDS in real time by using POLARDB as a slave of RDS. You can also use Data Transfer Service (DTS) to migrate from RDS and other databases to POLARDB.



For data reliability, you can perform failover across multiple availability zones within one region: you can start a standby instance in another availability zone and replicate data using the redo log. You can also fail over across regions, where we usually use the binlog for replication. For backup, we can take a snapshot in seconds and transfer it to OSS.



We still have a lot of work to do. At the engine level, we will support multiple writers. POLARDB is currently a shared-disk architecture, but in the future it will be a shared-everything architecture; for example, in the compute layer we will introduce a centralized cache role to improve query performance. We are also porting more databases onto POLARDB, such as more MySQL versions, PostgreSQL, and DocumentDB. In the storage layer, we will use 3D XPoint technology to improve I/O performance and Open-Channel technology to improve SSD performance. We will also push more engine-layer logic down to the storage layer to reduce I/O as much as possible, making the compute layer simpler.



Our basic philosophy about academia is that academic work should be grounded in engineering, and engineering should produce academic output, rather than doing pure engineering or pure academia. Our team has submitted several papers to SIGMOD and VLDB, and these papers will be published soon; there you will see more details about POLARDB and its distributed storage. We are also recruiting in Europe, with offices in Paris and Germany.