Author’s brief introduction

Chen Yunhai

Data platform architect at DaoCloud, with a long-time focus on database systems, distributed systems, blockchain, etc.

It was the best of times, it was the worst of times.

We have all kinds of technologies to choose from in this day and age.

Getting the name right: the Internet

In the booming field of computer technology, new technological forms and new business models keep emerging, dazzling to behold. From the burst of the dot-com bubble at the beginning of the century to the Internet industry's continued climb today, consumers and Internet practitioners have had nearly twenty years to consolidate the industry's foundations bit by bit, turning what was once a castle in the clouds into a real industry.

The word "Internet" has slowly evolved from a noun in the fields of computer networking and communications into an adjective: the Internet industry, Internet applications. It now stands for high concurrency, agile IT, rapid scaling, data-driven decisions, lean operation, and more, everything that sets it apart from traditional industries and traditional applications.

The shape of the PaaS platform supporting Internet applications has become increasingly clear. Container technology, represented by Docker, has overtaken virtual machines in momentum and provides a standard runtime environment for Internet applications. On top of containers, cluster management and orchestration systems such as Kubernetes and Swarm, together with surrounding projects (Spring Cloud, Tyk, Prometheus, etc.), have jointly built highly available PaaS platforms for Internet applications, covering load balancing, elastic scaling, service registration and discovery, application monitoring, and more.

On the other hand, the typical characteristics of Internet applications, such as highly concurrent traffic, large-scale data storage, and cross-region deployment, leave the traditional relational-database-based application architecture at a loss. The birth of NewSQL is closely tied to the requirements of today's Internet applications.

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

NewSQL is an ambitious attempt to match the scalability of NoSQL while retaining the ACID-compliant transactions that traditional relational databases excel at. But consider the history: from Edgar F. Codd's paper A Relational Model of Data for Large Shared Data Banks to the rise of NoSQL, database theory and engineering in academia and industry have gone through more than 40 years. So, for a family of systems billed as New, what exactly is new about NewSQL? Is it real progress, or just a marketing game?

The relational model and the SQL standard

Before the relational database became popular and became the de facto standard of the field, several data models each had their moment. They can basically be classified as navigational databases: physically they were built for tape storage, and logically they mapped some hierarchy of the physical world. For example, the hierarchical model organizes data in a tree structure, with each child node having exactly one parent; the organizational chart of a company is a familiar hierarchy.

A navigational database is a type of database in which records or objects are found primarily by following references from other objects.

One hierarchical-model database that must be mentioned is IBM's IMS (Information Management System), born in 1968. Legend has it that in 1966, NASA approached IBM for software that could effectively track and manage the tens of thousands of rocket parts for the moon missions.

If we could put a man on the Moon, could we also create a computer program to track the millions of rocket parts it takes?

Today, IMS is still going strong, playing a major role in manufacturing and banking and hosting community events from time to time. But it has faded from the sight of those of us who bill ourselves as Internet practitioners; after all, in an open-source, microservice, agile, asset-light Internet industry, it looks rather out of place.

Yet it made an indelible contribution to the whole database field, and even to computer science as a whole. To this day, even the most capable programmer will not lightly sink into the mire of managing and querying data inside the application itself.

It helped introduce the idea that an application’s code should be separate from the data that it operates on. This allows developers to write applications that only focus on the access and manipulation of data, and not the complications and overhead associated with how to actually perform these operations.

In 1970, Edgar Codd, also of IBM, published a heavyweight article in the history of database development: A Relational Model of Data for Large Shared Data Banks (https://cs.uwaterloo.ca/~david/cs848s14/codd-relational.pdf). In this famous paper, Edgar demonstrated on a mathematical basis that to achieve symmetric exploitation, i.e. letting users query the remaining unknown attributes from any known combination of attributes, several dependencies present in navigational databases must be eliminated:

  • Ordering dependence
  • Indexing dependence
  • Access path dependence

Many people attribute the success of the relational database to its elegant mathematical model, but the other side of the story cannot be ignored: in the chaotic period of database development, the relational model opened a side window. Navigational databases focused on writing data and retrieving it along predefined paths; the relational database let people see the dawn of data analysis.

Later, with the development of hardware, especially storage devices, disks supporting semi-random access appeared, and relational databases took to them like a duck to water. On this foundation grew relational algebra, the E-R model, the SQL standard, Codd's 12 rules, the data warehouse, and more, and the relational database has dominated the field for nearly 40 years.

The introduction of low-cost hard drives that provided semi-random access to data led to new models of database storage better suited to these devices. Among these, the relational database and especially SQL became the canonical solution from the 1980s through to about 2010.

The "relation" in relational database does not refer to the relationships between tables that we intuitively perceive through foreign keys. Relation is a mathematical concept, defined as follows:

Given n sets S1, S2, …, Sn, R is a set of n-tuples whose first element is taken from S1, the second from S2, and so on. We call R a relation over these n sets, and Sj the j-th domain of R.

Put briefly:

R is a subset of the Cartesian product S1 × S2 × … × Sn.
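As a quick illustration (a hypothetical sketch, not from Codd's paper), the score table used later in this article can be read as exactly such a relation: one subset of the Cartesian product of its domains.

```python
from itertools import product

S1 = {12, 19}             # student ids
S2 = {"John", "Lily"}     # names
S3 = {"A", "B"}           # grade levels

# The Cartesian product S1 x S2 x S3: every conceivable (id, name, level) triple
universe = set(product(S1, S2, S3))

# A relation R is simply one subset of that product: the rows that actually exist
R = {(12, "John", "B"), (19, "Lily", "A")}

assert R <= universe
print(len(universe), len(R))  # 8 conceivable tuples, 2 actual rows
```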

In industry, Oracle (1979), DB2 (1983), and the open-source MySQL and PostgreSQL of the 1990s are all well-known relational databases.

The next big event in the database world was the standardization of SQL. Because Edgar's famous paper only described a relational model and did not specify how to implement one, multiple relational databases appeared on the market, each with a different storage engine and its own way of organizing data. A procedural data-manipulation language was clearly inappropriate.

For example, a school organizes an examination. In order to record the scores of each student, the teachers in the Academic Affairs Office need to deal with the following details:

1. Storage location of the score table

2. Organization of columns

3. The separator

4. Decompression algorithm

…

You will agree that this is no good! It is far too intrusive for the application: we would no longer be writing applications, we would be writing trivial computation details!

SQL is a high-level, non-procedural data language. It lets users work on high-level data structures without specifying how the data is stored or understanding the concrete access methods. It presents a simple interface to the database, so that database systems with completely different internals, and even different databases, can all accept the same SQL for data input, manipulation, and administration.

You just tell the database what you want in this standardized language, instead of telling it step by step how to get it.

Going back to the score entry example, you just need to do the following to the database:

create table score(id int, name varchar(50), level char);

insert into score values(12, 'John', 'B');

insert into score values(19, 'Lily', 'A');

Query the list of students who received an ‘A’? A piece of cake:

select * from score where level = 'A';
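For completeness, the same declarative example can be run end to end with, say, Python's built-in sqlite3 module (the column types are adapted slightly, since SQLite uses text rather than varchar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table score(id int, name text, level char)")
conn.executemany("insert into score values(?, ?, ?)",
                 [(12, "John", "B"), (19, "Lily", "A")])

# Declarative: we state *what* we want; the engine decides how to fetch it
rows = conn.execute("select name from score where level = 'A'").fetchall()
print(rows)  # → [('Lily',)]
```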

Decoupling is an enduring topic in computing. Even with data locality to consider, the separation of storage and compute is a powerful trend. The decoupling of application and data gave birth to the database; the decoupling of programming from the execution platform gave birth to high-level languages and compilers; the decoupling of application from host platform gave birth to container technology and Docker.

A digression: remember that couplings exist to be decoupled, just as records exist to be broken.

Exploration in the Internet age

The Internet application is like a roaring beast; to people accustomed to traditional applications, it might as well be one of the monsters our ancestors described in The Book of Mountains and Seas.

Faced with the high concurrency, large capacity, and cross-region challenges of Internet applications, Sharding is the simplest divide-and-conquer solution. The basic idea: when confronted with the many-headed Hydra, gather nine hunters, each capable of dealing with one snake, and fight together.

Concretely, Sharding divides a database into several parts placed on different servers, enhancing database performance by distributing the load. Sharding can be horizontal or vertical. If the database contains many tables, different tables can be placed on different servers: that is vertical sharding. If a single table holds a large amount of data, it must be split horizontally and distributed across multiple servers. In Internet application scenarios, horizontal sharding is the norm, usually implemented with database middleware. Unless otherwise noted, Sharding below refers to horizontal sharding.

Due to its underlying principle, the Sharding scheme is hard to scale effectively. Suppose, for example, a large Internet application is estimated to need 5 sharded databases, with the data distribution strategy:

hash(some_field_in_record) % 5

Later, when seven database servers were required due to increased traffic, the corresponding data distribution strategy was as follows:

hash(some_field_in_record) % 7

To keep old data accessible, the data must be redistributed, and reloading such a large volume of data is a long and painful process. Internet applications generally cannot tolerate such a long service interruption. One can operate on a standby replica instead, but the process is still quite cumbersome.
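A rough sketch of why the modulo change hurts: hashing 100,000 hypothetical keys with both strategies shows that most records land on a different shard after the change. This toy script (all names are made up for illustration) simply measures the fraction:

```python
import hashlib

def shard(key: str, n: int) -> int:
    # Stable hash across runs (Python's built-in hash() is salted per process)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard(k, 5) != shard(k, 7)) / len(keys)
# A key keeps its shard only when hash % 5 == hash % 7, i.e. roughly 1 in 7 keys
print(f"{moved:.1%} of records must be relocated")
```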

Of course, there are consistent hashing algorithms that go further, but they tend to skew the load across shards, and there is no perfect solution in theory.
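For comparison, a minimal consistent-hash ring with virtual nodes (a textbook sketch under my own assumptions, not any particular middleware's implementation) moves far fewer records when shards are added, at the cost of the load skew mentioned above:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """A consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # The first virtual node clockwise from the key's hash owns the key
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(20_000)]
old = Ring([f"db{i}" for i in range(5)])
new = Ring([f"db{i}" for i in range(7)])
moved = sum(1 for k in keys if old.lookup(k) != new.lookup(k)) / len(keys)
print(f"{moved:.1%} of records move")  # only the slices handed to db5/db6, roughly 2/7
```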

In addition, the Sharding scheme seriously degrades transaction support, and most joins involving sharded tables require application developers to implement the logic themselves.

Sharding middleware works well for simple operations like reading or updating a single record. It is more difficult, however, to execute queries that update more than one record in a transaction or join tables.
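To see why cross-shard transactions are hard, here is a toy two-phase-commit sketch over two in-memory "shards" (all names and structures here are made up for illustration); the coordinator logic below is exactly the kind of code that otherwise lands in the application:

```python
# Toy two-phase commit across two shards (in-memory stand-ins, illustrative only)
class Shard:
    def __init__(self, data):
        self.data, self.staged = dict(data), None

    def prepare(self, deltas) -> bool:
        # Phase 1: validate without applying; vote yes only if no value goes negative
        if any(self.data.get(k, 0) + d < 0 for k, d in deltas.items()):
            return False
        self.staged = deltas
        return True

    def commit(self):
        # Phase 2: apply the staged changes
        for k, d in self.staged.items():
            self.data[k] = self.data.get(k, 0) + d
        self.staged = None

    def abort(self):
        self.staged = None

def transfer(src, dst, key_from, key_to, amount) -> bool:
    plan = [(src, {key_from: -amount}), (dst, {key_to: +amount})]
    if all(s.prepare(d) for s, d in plan):   # everyone votes yes -> commit
        for s, _ in plan:
            s.commit()
        return True
    for s, _ in plan:                        # any no-vote -> abort everywhere
        s.abort()
    return False

a, b = Shard({"alice": 100}), Shard({"bob": 50})
print(transfer(a, b, "alice", "bob", 200))  # False: alice cannot go negative
print(transfer(a, b, "alice", "bob", 30))   # True
print(a.data, b.data)                       # {'alice': 70} {'bob': 80}
```

A real implementation also has to survive coordinator and participant crashes between the two phases, which is where most of the complexity lives.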

Eventually some companies gave up on Sharding middleware and started developing their own database management systems, opening the way for NoSQL. Traditional database systems sacrifice high availability and performance for consistency and correctness, a trade-off that does not suit Web-scale Internet applications, which value availability and concurrent performance more. Unstructured data, shut out of relational databases, was another important driver. Some simple Internet applications merely write records repeatedly and look them up by primary key; for them the relational model and SQL are redundant, and writing and querying go through more efficient APIs. Originally, NoSQL was short for No more SQL.

Later, for ease of use, some systems gradually added partial SQL support. Since NoSQL also performed well in availability and performance, the name eventually evolved into Not Only SQL.

However, without the relational model, such SQL support covers only a small part of the standard. At the API level, NoSQL does not and cannot have a common standard: an application running on NoSQL system A cannot be migrated to NoSQL system B, which is commonly called technology-stack lock-in. NoSQL is more like a wild, unruly horse: Bole sees a swift steed in it, while others come away bruised all over.

Two of the best-known NoSQL systems are Google's Bigtable and Amazon's Dynamo. Both started as in-house services (now offered on their clouds), and other organizations rallied open-source communities around their design papers, producing several well-known systems including Cassandra, HBase, and MongoDB.

The return of NewSQL

It's funny how the computer industry evolves. People who understand their own needs, and who don't want existing systems to impose too many restrictions on them, start over and create something new that works for them. Later, to benefit everyone, generality is bolted on, interfaces and specifications are added, and the now-standardized product is in turn abandoned by the next round of pioneers.

The distributed architectures introduced by NoSQL greatly improved scalability, high availability, and performance over the traditional relational database, but at the cost of transaction support and the relational model. Moreover, most of these systems gave up strong consistency in favor of availability, settling for eventual consistency. Coupled with the lack of SQL and of a uniform API specification, it is hard for ordinary application developers to build Internet applications correctly on such systems. Application developers, including those inside Google, voiced similar complaints.

Developers of many OLTP applications found it difficult to build these applications without a strong schema system, cross-row transactions, consistent replication and a powerful query language.

OLTP application developers care about the database's high concurrency and transaction support. OLTP read-write workloads have the following typical characteristics:

1. Short-lived (i.e., no user stalls)

2.Touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins)

3. Repetitive (i.e., executing the same queries with different inputs)

Data analysis, data mining, and the like do not belong to OLTP, and are not the scenarios NewSQL targets.

NewSQL is a return of sorts, trying to pick up the ACID that NoSQL abandoned. ACID names the four properties a database transaction needs in order to execute correctly:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Consistency here is not the same concept as consistency in distributed systems. Traditional relational databases were mostly single-machine systems; consistency there means that a transaction, whether it succeeds or fails, never breaks any defined constraint on the data, such as a foreign-key constraint. Consistency in distributed systems does not mean that multiple copies of the same data are identical, but that different observers read the same data in the same way.
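A small sketch of this transactional notion of consistency, using Python's sqlite3 (the table and values are hypothetical): a transfer that would violate a CHECK constraint is rolled back as a whole, leaving the constraint intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account(name text primary key,"
             " balance int check(balance >= 0))")
conn.executemany("insert into account values(?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute("update account set balance = balance - 200"
                     " where name = 'alice'")
        conn.execute("update account set balance = balance + 200"
                     " where name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the overdraft violates the CHECK constraint; the transfer rolls back

balances = dict(conn.execute("select name, balance from account"))
print(balances)  # → {'alice': 100, 'bob': 50}
```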

With the introduction of transaction support, NewSQL has further added SQL support and distributed consistency support in order to better liberate developers.

NewSQLs enable applications to execute a large number of concurrent transactions to ingest new information and modify the state of the database using SQL (instead of a proprietary API). If an application uses a NewSQL DBMS, then developers do not have to write logic to deal with eventually consistent updates as they would in a NoSQL system.

In the article What's Really New with NewSQL?, the authors classify existing NewSQL databases into three categories:

New systems with new architectures

Distributed shared-nothing by design, with zero historical burden: multi-node concurrency control, multi-replica fault tolerance, distributed query processing and optimization.

Send the query to the data rather than bring the data to the query

They manage their own data storage, giving them direct and fine-grained control over the data, instead of relying on an existing distributed storage system or distributed file system.

Examples: ClustrixDB, CockroachDB, Google Spanner, MemSQL, NuoDB

Re-implemented Sharding middleware

Centralized middleware handles query routing, transaction coordination, data and replica placement, and node management; the data nodes handle storage and query execution, receiving read and write requests from the middleware and returning results.

The middleware presents a single logical database to the application without modifying the underlying DBMS, so applications built on a traditional RDBMS can migrate seamlessly, often without changing a line of code.

Examples: AgilData Scalable Cluster, MariaDB MaxScale, ScaleArc, ScaleBase.

Cloud databases with new architectures

Database as a Service (DBaaS): the cloud service provider is responsible for operations; users simply request resources on demand and pay for what they use.

Examples: Amazon Aurora, ClearDB.

Judging from the tone of the paper, we can guess the authors did not think much of category 2; interested readers can consult the original text. Some NewSQL systems have also noticed the benefits of protocol compatibility: CockroachDB speaks the PostgreSQL wire protocol, and ClustrixDB the MySQL protocol.

Really new?

Returning to the debate over the "New" in NewSQL: this part is rather technical, so I recommend that interested readers read the original text; here we give only a general summary. In What's Really New with NewSQL? (http://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf), the authors analyze NewSQL technology from the following aspects:

  • Main-memory storage
  • Data partitioning
  • Concurrency control
  • Secondary indexes
  • Replication
  • Crash recovery

The main takeaway from our analysis is that NewSQL database systems are not a radical departure from existing system architectures but rather represent the next chapter in the continuous development of database technologies. Most of the techniques that these systems employ have existed in previous DBMSs from academia and industry. But many of them were only implemented one-at-a-time in a single system and never all together. What is therefore innovative about these NewSQL DBMSs is that they incorporate these ideas into single platforms. Achieving this is by no means a trivial engineering effort. They are by-products of a new era where distributed computing resources are plentiful and affordable, but at the same time the demands of applications is much greater.

There is nothing new under the sun. The various techniques used in NewSQL have long been applied in one database or another; what NewSQL does is combine them in a single platform. The authors look forward to a new era for NewSQL, affirming its engineering achievement in particular; after all:

Distributed systems engineering is full of tradeoffs.

The light of the future: HTAP

Business support and data collection are only part of an enterprise's data loop. Activating digital assets and building data-driven, business-intelligent companies is a grand vision of today's Internet industry. However, since the underlying storage structures make it difficult to deliver fast analytics and fast inserts and updates at the same time, most enterprises still rely on a second class of system: OLAP.

Data flowing into the OLTP system is imported into the OLAP analytical database through ETL and similar processes. Report analysis, data mining, machine learning, and other means then produce decisions that feed back into and adjust the business, closing the data loop.

Maintaining two copies of the data across OLTP and OLAP multiplies the system's storage overhead through redundancy. Relying on ETL for data replication badly hurts the timeliness of the decision system, and scheduled ETL runs also put heavy pressure on the performance of both systems.

The pursuit of grand unification never ends. In What's Really New with NewSQL?, the authors predict that the next big trend in database systems will be the fusion of OLTP and OLAP, known as Hybrid Transactional/Analytical Processing (HTAP).

Cloudera's Kudu storage engine, launched in 2015, attempts exactly that, but according to Google Trends the project hasn't really caught on. Many NewSQL vendors have HTAP on their roadmap (some claim to be HTAP systems already, though I suspect there is still a long way to go), such as CockroachDB, ClustrixDB, and MemSQL.

Let me borrow a graph to illustrate the advantages of HTAP.

reference

[1] Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM. 13 (6): 377–387.

[2] Andrew Pavlo, Matthew Aslett. What's Really New with NewSQL? ACM SIGMOD Record, 45 (2), June 2016. doi:10.1145/3003665.3003674

[3] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's Globally-Distributed Database. In OSDI, 2012.

[4] David F. Bacon, Nathan Bales, Nico Bruno, Brian F. Cooper, Adam Dickinson, Andrew Fikes, Campbell Fraser, Andrey Gubarev, Milind Joshi, Eugene Kogan, Alexander Lloyd, Sergey Melnik, Rajesh Rao, David Shue, Christopher Taylor, Marcel van der Holst, and Dale Woodford. 2017. Spanner: Becoming a SQL System. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 331-343. DOI: https://doi.org/10.1145/3035918.3056103

[5] ClustrixDB. https://www.clustrix.com/

[6] CockroachDB. https://www.cockroachlabs.com/

[7] Navigational database. https://en.wikipedia.org/wiki/Navigational_database

[8] Kudu. https://kudu.apache.org/