Preface

In the era of the Internet and big data, more and more websites and application systems need to support massive data storage, highly concurrent requests, high availability, and high scalability. Traditional relational databases struggle to meet these requirements, which has led to the emergence of various NoSQL (Not Only SQL) databases.

This article analyzes the problems with traditional databases and how several types of NoSQL databases solve them, so that the right data storage technology can be selected for each business scenario.

Main text

1. Disadvantages of traditional databases

Disadvantage | Explanation
High I/O in big data scenarios | Because data is stored by row, even a computation that needs only one column makes the relational database read the entire row from the storage device into memory, which results in high I/O.
Inflexible structured storage | Data is stored as row records, so flexible data structures cannot be stored.
Table structure (schema) is inconvenient to extend | Changing the table structure requires executing DDL (data definition language) statements, which can lock the table and make some services unavailable.
Weak full-text search | Relational databases can only perform substring matching with LIKE. As the table grows, LIKE queries become very slow even when an index exists, because the match scans the table.
Difficult to store and process complex relational data | Traditional relational databases are not good at handling the relationships between data points.

2. Introduction to NoSQL

NoSQL, a general term for non-relational databases, can be understood as a strong complement to relational databases.

While NoSQL databases perform much better than relational databases in many respects, they often lack some features. The most common omission is transaction support. The four ACID properties required for correct execution of database transactions are as follows:

Letter | Name | Description
A | Atomicity | All operations in a transaction either complete or do not happen at all; a transaction never ends somewhere in between. If an error occurs during execution, the transaction is rolled back to the state before it began, as if it had never been executed.
C | Consistency | The integrity of the database is not compromised before the transaction begins or after it ends.
I | Isolation | The database allows multiple concurrent transactions to read, write, and modify data at the same time. Isolation prevents the data inconsistency that interleaved execution of concurrent transactions could otherwise cause.
D | Durability | After the transaction completes, its changes to the data are permanent and are not lost even if the system fails.
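
Atomicity is the property most often missing in the NoSQL systems discussed later, so it helps to see it in action first. The following is a minimal sketch using Python's built-in sqlite3 module; the accounts table, the names, and the transfer helper are assumptions made up for this illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance (IntegrityError).
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to bob succeeds first, but because the debit fails, the whole transaction is rolled back and both balances keep their original values.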

To address the shortcomings of traditional relational databases, the following are five common categories of NoSQL solutions:

3. Column database

A column database is a database whose storage architecture organizes data by column. It is mainly suited to batch data processing and ad hoc queries. Its counterpart, the row database, allocates storage by row and is mainly suited to processing small batches of data; it is often used for online transaction processing (OLTP).

The column storage feature of the column database can solve the high I/O problem of relational databases in some specific scenarios.

3.1. Basic principles

A traditional relational database stores data by row and is therefore called a row database; a column database stores data by column.

There are two ways to lay out a table in a storage system, and row storage is used in most cases. Row storage places each row in consecutive physical locations, much like traditional records and file systems.

Column storage instead places the values of each column in consecutive physical locations, just as row storage does for the fields of each row. The following figure is a graphical explanation of the two storage methods:
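
The two layouts can also be sketched in a few lines of Python. This is only an illustration: the three-row table and its column names are hypothetical, and flat Python lists stand in for consecutive physical locations.

```python
# Hypothetical three-row table used only for illustration.
rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Oslo"},
]
columns = ["id", "name", "city"]

# Row storage: all fields of one row sit next to each other.
row_layout = [row[c] for row in rows for c in columns]

# Column storage: all values of one column sit next to each other.
column_layout = {c: [row[c] for row in rows] for c in columns}

# In the row layout the city values are interleaved with the other fields;
# in the column layout they form one contiguous list.
cities_from_rows = row_layout[columns.index("city")::len(columns)]
cities_from_columns = column_layout["city"]
```

Both reads return the same values, but only the column layout keeps them adjacent, which is why a query over one column can avoid touching the rest of each row.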

3.2. Common column databases

3.2.1. HBase

HBase is an open source, non-relational, distributed database (NoSQL) modeled after Google’s BigTable and implemented in Java. It is part of the Apache Software Foundation’s Hadoop project, runs on top of the HDFS file system, and provides BigTable-like capabilities for Hadoop. As a result, it can store huge amounts of sparse data in a fault-tolerant way.

3.2.2. BigTable

BigTable is a compressed, high-performance, highly scalable data storage system built on the Google File System (GFS). It is used to store large-scale structured data and is suitable for cloud computing.

3.3. Related features

3.3.1. Advantages

  • Efficient storage space utilization

Column databases apply different compression algorithms according to the data characteristics of each column, so they achieve a much higher compression ratio than row databases. The compression ratio of an ordinary row database is typically about 3:1 to 5:1, while that of a column database is typically about 8:1 to 30:1.

A common approach is to compress data with a dictionary table:

As the figure shows, after dictionary compression the strings in the table become numbers. Because each distinct string appears only once in the dictionary table, the data is compressed.
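
Dictionary compression can be sketched as follows. This is a minimal illustration rather than the encoding used by any particular column database; the function names and the sample city column are assumptions.

```python
def dictionary_encode(values):
    """Return (dictionary, codes); each distinct string is stored once."""
    dictionary = []  # code -> string
    code_of = {}     # string -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

city_column = ["Beijing", "Shanghai", "Beijing", "Beijing", "Shanghai"]
dictionary, codes = dictionary_encode(city_column)
```

Here the five strings shrink to a two-entry dictionary plus the small integers [0, 1, 0, 0, 1]; the more often values repeat, the larger the saving.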

  • High query efficiency

Reading many values from the same column is efficient because the values of a column are stored together. A single disk operation can read all the data of the specified column into memory. The following figure illustrates the benefits of column storage (and data compression) through the execution of a single query.

The procedure is as follows:

  1. Look up the string in the dictionary table to find its corresponding number (only one string comparison is needed).
  2. Match that number against the column, and set the position of each match to 1.
  3. Combine the match results of the different columns with bitwise operations to obtain the indexes of the records that satisfy all the conditions.
  4. Use these indexes to assemble the final result set.
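
The four steps above can be sketched in Python, with an integer serving as the bitmap. The table contents and all function names are hypothetical, and a real engine would use compact bitmap structures rather than Python integers.

```python
def encode(values):
    """Dictionary-encode one column: returns (string -> code, codes)."""
    dictionary = sorted(set(values))
    code_of = {value: i for i, value in enumerate(dictionary)}
    return code_of, [code_of[value] for value in values]

def match_bitmap(code_of, codes, wanted):
    """Steps 1-2: look up the code once, then set bit i where row i matches."""
    code = code_of.get(wanted)
    bitmap = 0
    for i, c in enumerate(codes):
        if c == code:
            bitmap |= 1 << i
    return bitmap

def rows_from_bitmap(bitmap):
    """Return the row indexes whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Hypothetical table stored column by column.
names = ["Ann", "Bob", "Cy", "Dee"]
city = ["Oslo", "Rome", "Oslo", "Oslo"]
product = ["pen", "pen", "ink", "pen"]

city_codes = encode(city)
product_codes = encode(product)

# Step 3: AND the per-column bitmaps. Step 4: assemble rows from the indexes.
hits = match_bitmap(*city_codes, "Oslo") & match_bitmap(*product_codes, "pen")
result = [names[i] for i in rows_from_bitmap(hits)]
```

Only rows 0 and 3 satisfy both conditions, so the result is ["Ann", "Dee"]; the string comparison happens once per condition and everything else is integer and bit work.
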
  • Suitable for aggregation operations

  • Suitable for large amounts of data, not small data

3.3.2. Shortcomings

  • Not suitable for scanning small amounts of data

  • Not suitable for random updates

  • Not suitable for real-time operations with deletions and updates

  • ACID transactions are supported for single-row data. For multi-row transactions, normal rollback is not supported: Isolation and Durability are supported, but Atomicity and Consistency are not guaranteed.

3.4. Application Scenarios

This section uses HBase as an example to describe the application scenarios of column databases.

  • Suitable for large data volumes (on the order of 100 TB) that require fast random access.

  • Suitable for write-intensive applications with a large number of writes and relatively few reads per day, such as IM message history and game logs.

  • Suitable for applications that do not need complex query conditions. HBase supports only rowkey-based queries. Single-record and small-range queries are acceptable for HBase, but large-scale queries may suffer some performance impact because the data is distributed. HBase is not suitable for data models with joins, multilevel indexes, or complex table relationships.

  • Suitable for applications with high performance and reliability requirements. Because HBase itself has no single point of failure, its availability is very high.

  • Suitable for applications with large data volumes and unpredictable growth, and for applications that need to scale gracefully. HBase supports online scale-out, so even if the data volume surges over a period of time, scaling out HBase can meet the demand.

  • Suitable for storing structured and semi-structured data.

4. Key-value (K-V) database

4.1. Basic concepts

A key-value database stores data as key-value pairs; data is organized, indexed, and stored by key.

KV storage is well suited to business data that involves few relationships between records. It can effectively reduce the number of disk reads and writes, offers better read and write performance than SQL database storage, and solves the problem that relational databases cannot store flexible data structures.

4.2. Common K-V databases

4.2.1. Redis

Redis is an open source, networked, in-memory key-value store with optional persistence, written in ANSI C. Redis is one of the most popular key-value databases.

4.2.2. Cassandra

Apache Cassandra (commonly referred to as C* in the community) is an open source distributed NoSQL database system. Originally developed at Facebook to store simply formatted data such as inbox data, it combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo. Cassandra is a popular distributed structured data storage solution.

4.2.3. Memcached

Memcached is an open source, high-performance, distributed memory object caching system. It is used to speed up dynamic web applications and to reduce the load on relational databases. It uses non-blocking network I/O and can handle a very large number of connections. It works by allocating a region of memory and building a hash table over it, and Memcached manages this hash table itself.

Memcached is simple and powerful. Its simple design makes it quick to deploy and easy to troubleshoot, and it solves many of the problems of caching large amounts of data.

4.2.4. LevelDB

LevelDB is an embedded key-value database library developed by Google and distributed under an open source BSD license.

4.3. Related features

The features of K-V databases are described below, using Redis as an example:

4.3.1. Advantages

  • High performance

Redis can reach on the order of 100,000 TPS on a single machine.

  • Rich data types

Redis supports the String, Hash, List, Set, Sorted Set, Bitmap, and HyperLogLog data structures.

  • Rich features

Redis also supports publish/subscribe, notifications, and key expiration.

4.3.2. Shortcomings

  • Redis transactions do not support Atomicity or Durability (A, D); they support only Isolation and Consistency (I, C).

Redis transactions do not guarantee atomicity because they do not support rollback. Individual Redis commands, however, are atomic thanks to Redis’s single-threaded execution model.
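
The difference between atomic commands and a transaction without rollback can be shown with a toy model. The ToyStore class below is an assumption written for this article; it imitates the behavior described above and does not use or reproduce the real Redis API.

```python
class ToyStore:
    """Toy model of queued transactions without rollback (not Redis itself)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def incr(self, key):
        # Like a single command: it either fully applies or raises.
        value = self.data.get(key, 0)
        if not isinstance(value, int):
            raise TypeError("value is not an integer")
        self.data[key] = value + 1

    def exec_queued(self, commands):
        """Run queued commands in order; a failure does not undo earlier ones."""
        results = []
        for name, *args in commands:
            try:
                results.append(getattr(self, name)(*args))
            except TypeError as err:
                results.append(err)
        return results

store = ToyStore()
store.set("title", "hello")
results = store.exec_queued([
    ("set", "stock", 10),
    ("incr", "title"),  # fails: "title" holds a string
    ("incr", "stock"),  # still runs after the failure
])
```

After the batch runs, stock is 11 and title is unchanged: the failed command was skipped, but the commands before and after it took effect, which is exactly what an atomic transaction would not allow.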

4.4. Application Scenarios

4.4.1. Applicable Scenarios

  • Suitable for storing user information (such as sessions), configuration files, parameters, shopping carts, and so on. Such information is usually tied to an ID (the key).

4.4.2. Inapplicable Scenarios

  • Not suitable when data must be queried by value rather than by key. A key-value database offers no way to query by value.

  • Not suitable when relationships between data must be stored. A key-value database cannot associate the data under two or more keys.

  • Not suitable for scenarios that require transaction support. A key-value database cannot roll back after a failure.
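
The first limitation can be made concrete with a sketch in which a plain Python dict stands in for the key-value store. The session data and helper names are hypothetical; the point is that a lookup by key is a single probe, while answering a question about values means inspecting every entry, which a real key-value store may not even expose.

```python
# Hypothetical session data keyed by session ID.
sessions = {
    "session:1001": {"user": "ann", "cart": ["pen"]},
    "session:1002": {"user": "bob", "cart": []},
    "session:1003": {"user": "ann", "cart": ["ink", "pad"]},
}

def get_by_key(store, key):
    """Key lookup: a single hash-table probe."""
    return store.get(key)

def find_keys_by_value(store, predicate):
    """Value lookup: no index exists, so every entry must be inspected."""
    return [key for key, value in store.items() if predicate(value)]

one = get_by_key(sessions, "session:1002")
anns = find_keys_by_value(sessions, lambda value: value["user"] == "ann")
```

If queries such as "all sessions of user ann" are frequent, that is a sign the data belongs in a store with secondary indexes rather than in a key-value database.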

5. Document database

5.1. Basic concepts

A document database is a database that stores semi-structured data as documents. Document databases typically store data in JSON or XML format.

  • Because a document database is schemaless, data of any shape can be stored and read.

  • The data format is JSON or BSON. Because JSON data is self-describing, fields do not need to be defined before use, and reading a field that does not exist in a JSON document does not cause a syntax error as it would in SQL. This solves the problem that the table structure (schema) of a relational database is inconvenient to extend.

5.2. Common document databases

5.2.1. MongoDB

MongoDB is a database based on distributed file storage and written in C++. It is designed to provide a scalable, high-performance data storage solution for web applications.

MongoDB sits between relational and non-relational databases. Among non-relational databases, it is the most feature-rich and the most similar to a relational database.

5.2.2. CouchDB

CouchDB is a document-oriented distributed database developed in Erlang. It stores semi-structured data, similar to Lucene’s index structure.

CouchDB is a NoSQL database that provides a RESTful API, uses JSON as its storage format, JavaScript as its query language, and MapReduce and HTTP as its API. One notable feature is its multi-master replication capability. In addition, CouchDB is built on top of a powerful B-tree storage engine.

5.3. Related features

Take MongoDB as an example to describe the features of document-based databases:

5.3.1. Advantages

  • Adding fields is simple. Unlike a relational database, there is no need to first execute a DDL statement to modify the table structure; program code can read and write the new field directly.

  • Easy compatibility with historical data. Even if historical data lacks a new field, no error occurs; a null value is returned, and the code only needs to handle that case.

  • Easy to store complex data. JSON is a powerful description language that can describe complex data structures.
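
These three advantages can be illustrated with plain JSON documents and Python's json module. The documents and the email_of helper are hypothetical; a real document database would add storage and indexing on top of the same idea.

```python
import json

# Hypothetical user documents; the older one predates the "email" field.
stored = [
    '{"_id": 1, "name": "Ann"}',
    '{"_id": 2, "name": "Bob", "email": "bob@example.com",'
    ' "tags": ["vip", {"since": 2020}]}',
]
users = [json.loads(text) for text in stored]

def email_of(doc):
    # A missing field yields None instead of an error,
    # so documents written before the field existed still work.
    return doc.get("email")

emails = [email_of(user) for user in users]
nested = users[1]["tags"][1]["since"]  # nested structures need no extra tables
```

No schema change was needed to introduce the email and tags fields, and the older document simply reports None for the field it lacks.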

5.3.2. Shortcomings

Compared with traditional relational databases, the main shortcoming of document databases is weak transaction support across multiple records, specifically:

  • Atomicity: only single-row (single-document) atomicity is supported, not multi-row, multi-document, or multi-statement atomicity.

  • Isolation: only the Read Committed isolation level is supported, which may lead to non-repeatable reads and phantom reads.

  • Complex queries are not supported. For example, a join query requires multiple operations against the database.
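
The last point can be sketched as follows, assuming the database offers no join, so the application has to issue one query per collection and combine the results itself. The collections, field names, and the find helper are hypothetical stand-ins for real database calls.

```python
# Hypothetical collections modeled as lists of dicts.
orders = [
    {"_id": 1, "user_id": "u1", "total": 30},
    {"_id": 2, "user_id": "u2", "total": 12},
    {"_id": 3, "user_id": "u1", "total": 7},
]
users = [
    {"_id": "u1", "name": "Ann"},
    {"_id": "u2", "name": "Bob"},
]

def find(collection, predicate):
    """Stand-in for one query against one collection."""
    return [doc for doc in collection if predicate(doc)]

def orders_with_user_names(min_total):
    # Query 1: fetch the matching orders.
    matched = find(orders, lambda o: o["total"] >= min_total)
    # Query 2: fetch the users those orders refer to.
    ids = {o["user_id"] for o in matched}
    by_id = {u["_id"]: u for u in find(users, lambda u: u["_id"] in ids)}
    # Combine in application code what SQL would express as a join.
    return [(o["_id"], by_id[o["user_id"]]["name"]) for o in matched]

joined = orders_with_user_names(10)
```

Each extra collection in the query adds another round trip and more merging logic in the application, which is the cost this shortcoming refers to.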

5.4. Application Scenarios

5.4.1. Applicable Scenarios

  • The amount of data is large or will be large in the future.

  • The table structure is unclear and fields keep being added, as in content management systems and information management systems.

5.4.2. Inapplicable Scenarios

  • Transactions that span different documents are required. Document-oriented databases do not support transactions across documents.

  • Multiple documents require complex queries, such as join operations.

6. Full-text search engine

6.1. Basic Concepts

Traditional relational databases rely mainly on indexes to make queries fast. For full-text search, however, indexes are of little help, mainly for the following reasons:

  • Full-text search conditions can be combined arbitrarily; covering all the combinations with indexes would require a very large number of indexes.

  • The fuzzy matching used in full-text search cannot be served by an index, so only LIKE queries can be used, and a LIKE query scans the whole table, which is very inefficient.

Full-text search engines emerged to address the weak full-text search capability of relational databases.

6.2. Basic Principles

The technique behind full-text search engines is the inverted index, an indexing method whose basic principle is to build an index from words to documents. Its counterpart is the forward index, whose basic principle is to build an index from documents to words.

  • Suppose the following collection of documents is available:

  • The forward index is as follows:

As you can see, the forward index is used to query the contents of a document by its name.

  • A simple inverted index is as follows:

  • The inverted index with word frequency information is as follows:

As you can see, inverted indexes are used to find documents based on keywords.
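
Both kinds of index can be built in a few lines of Python. This is a minimal sketch: the three documents are hypothetical stand-ins for the example collection, and tokenization is a simple whitespace split, far simpler than what a real search engine does.

```python
from collections import defaultdict

# Hypothetical documents used only for illustration.
docs = {
    "doc1": "redis is a key value store",
    "doc2": "hbase is a column store",
    "doc3": "a store of store names",
}

# Forward index: document -> words.
forward = {name: text.split() for name, text in docs.items()}

# Inverted index with word frequency: word -> {document: count}.
inverted = defaultdict(dict)
for name, words in forward.items():
    for word in words:
        inverted[word][name] = inverted[word].get(name, 0) + 1

def search(*terms):
    """Return the documents that contain every term, sorted by name."""
    postings = [set(inverted.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

For example, search("column", "store") returns ["doc2"], and inverted["store"] records that the word occurs twice in doc3, which a ranking function could use to score that document higher.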

6.3. Common full-text search engines

6.3.1. ElasticSearch

ElasticSearch is a search engine based on Apache Lucene. It provides a distributed, multitenant-capable full-text search engine. ElasticSearch is developed in Java and provides a RESTful web interface. According to DB-Engines, ElasticSearch is the most popular enterprise search engine.

6.3.2. Solr

Solr is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (such as Word and PDF) handling. Solr is highly scalable and provides distributed search and index replication.

6.4. Related features

The features of full-text search engines are described below, using ElasticSearch as an example:

6.4.1. Advantages

  • High query efficiency, suitable for near-real-time processing of massive data.

  • Scalability. The cluster-based architecture makes horizontal scaling easy and can handle PB-level data.

  • High availability. ElasticSearch clusters are elastic: they can detect new or failed nodes and reorganize and rebalance data, ensuring that data stays safe and accessible.

6.4.2. Shortcomings

  • ACID transaction support is limited. ACID is supported for data within a single document. For transactions that involve multiple documents, normal rollback is not supported: Isolation (based on optimistic locking) and Durability are supported, but Atomicity and Consistency are not.

  • Support is weak for the kind of complex operations that relational databases perform by joining multiple tables through foreign keys.

  • There is a delay between writing and reading data. Written data becomes searchable after a delay of up to about 1 second.

  • Update performance is low, because an update is implemented underneath by deleting the old data and then inserting the new data.

  • The memory footprint is high because Lucene loads the index portion into memory.

6.5. Application Scenarios

6.5.1. Applicable Scenarios

  • Distributed search engine and data analysis engine.
  • Full-text search, structured search, and data analysis.
  • Near-real-time processing of massive data, where the data can be spread across multiple servers for storage and retrieval.

6.5.2. Inapplicable Scenarios

  • Data needs to be updated frequently.

  • Complex associated query is required.

7. Graph database

7.1. Basic Concepts

A graph database uses graph theory to store relationship information between entities. The most common example is the relationships between people in a social network. Relational databases do not work well for storing this kind of relationship data: queries are complex, slow, and take longer than expected.

The distinctive design of graph databases makes up for this defect and solves the problem that relational databases are weak at storing and processing complex relational data.

7.2. Common graph databases

7.2.1. Neo4j

Neo4j is a high-performance NoSQL graph database that stores structured data in a graph (network) rather than in tables. It is an embedded, disk-based Java persistence engine with full transactional features.

Neo4j can also be seen as a high-performance graph engine. Programmers work within an object-oriented, flexible network structure rather than rigid, static tables.

7.2.2. ArangoDB

ArangoDB is a native multi-model database system. The database system supports three important data models (key/value, document, and graph).

ArangoDB contains a database core and unified query language AQL (ArangoDB Query Language). The query language is declarative, allowing different data access patterns to be combined in a single query. ArangoDB is a NoSQL database system, but AQL is similar to SQL in many ways.

7.3. Basic principles

The principles of graph databases are described below, using Neo4j as an example:

  • Neo4j uses the concept of a graph in data structures for modeling.

  • The two most basic concepts in Neo4j are nodes and edges. Nodes represent entities, and edges represent relationships between entities. Both nodes and edges can have their own properties. Different entities are linked through various relationships to form complex object graphs.

For relational data, the storage structures of a relational database and a graph database are as follows:

Neo4j stores nodes using index-free adjacency, which means that each node holds pointers to its neighbors. As a result, a neighboring node can be found in O(1) time. In addition, according to the official description, edges are the most important element in Neo4j: they are first-class entities and are stored separately. This benefits speed when traversing the graph and makes it easy to traverse in either direction.
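
The idea of index-free adjacency can be sketched with ordinary object references. This is a conceptual illustration only, not Neo4j's actual storage format; the Node and Relationship classes, the connect helper, and the FOLLOWS relationship are all hypothetical.

```python
class Relationship:
    """An edge stored as its own record, with its own properties."""

    def __init__(self, kind, start, end, **props):
        self.kind = kind
        self.start = start
        self.end = end
        self.props = props

class Node:
    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct references, no global index lookup
        self.incoming = []

def connect(kind, start, end, **props):
    """Create a relationship and attach it to both of its endpoints."""
    rel = Relationship(kind, start, end, **props)
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel

ann, bob, cy = Node("Ann"), Node("Bob"), Node("Cy")
connect("FOLLOWS", ann, bob, since=2020)
connect("FOLLOWS", cy, bob, since=2021)

# Both directions are reachable from the node itself.
bob_followers = [rel.start.name for rel in bob.incoming]
ann_follows = [rel.end.name for rel in ann.outgoing]
```

Because each node carries references to its relationships, stepping to a neighbor costs the same no matter how many other nodes exist, and following incoming edges is as cheap as following outgoing ones.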

7.4. Related features

7.4.1. Advantages

  • High performance

Graph traversal is an algorithm specific to graph data structures: starting from a node, it follows the node’s connections to find neighboring nodes quickly and easily. Lookup cost is largely unaffected by the total size of the data, because neighbor queries always examine a limited, local part of the data rather than searching the whole database.
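
A bounded breadth-first traversal shows why the work stays local. The sketch below uses a hypothetical friendship graph held in a Python dict; within_hops is an illustrative helper, not a graph database API.

```python
from collections import deque

# Hypothetical social graph as adjacency lists.
friends = {
    "Ann": ["Bob", "Cy"],
    "Bob": ["Ann", "Dee"],
    "Cy": ["Ann"],
    "Dee": ["Bob", "Eve"],
    "Eve": ["Dee"],
    "Zed": [],  # unrelated node, never visited
}

def within_hops(graph, start, max_hops):
    """Breadth-first search that stops max_hops steps away from start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen

reachable = within_hops(friends, "Ann", 2)
```

The traversal from Ann with a limit of two hops reaches Bob, Cy, and Dee and never touches Eve or Zed, so adding more unrelated nodes to the graph would not change the amount of work.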

  • Flexibility of design

The natural extensibility of the data structure, along with its unstructured data format, makes graph database designs highly scalable and flexible. Adding nodes, relationships, and attributes as requirements change does not affect the normal use of the original data.

  • Development agility

The data model is straightforward, with little change from the requirements discussion to program development and implementation.

  • Full ACID support

Unlike other NoSQL databases, Neo4j also has full transaction management features and fully supports ACID transaction management.

7.4.2. Shortcomings

  • The number of nodes, relationships, and their properties is limited.

  • Splitting data across multiple machines (sharding) is not supported.

7.5. Application Scenarios

7.5.1. Applicable Scenarios

  • Applications centered on relationship data, such as social networks.

  • Recommendation engines. Representing the data as a graph is very helpful for producing recommendations.

7.5.2. Inapplicable Scenarios

  • Recording large amounts of event-based data, such as log records and sensor data.

  • Processing large-scale distributed data, as Hadoop does.

  • Structured data that belongs in a relational database.

  • Binary data storage.

Summary

When choosing between a relational database and a NoSQL database, several factors need to be considered:

  • Data volume
  • Concurrency
  • Real-time requirements
  • Consistency requirements
  • Read/write distribution
  • Data types
  • Security
  • Operational costs

A reference for database selection in common types of systems is as follows:

System type | Database selection
Enterprise internal management system | For example, an operations system. Data volume and concurrency are small, so consider a relational database first.
High-traffic Internet system | For example, an e-commerce product page. Consider a relational database for the back end and an in-memory database for the front end.
Log system | Consider a column database for the raw data and an inverted index for log search.
Search system | For example, site search (not general-purpose search) and product search. Consider a relational database for the back end and an inverted index for the front end.
Transactional system | For example, inventory management, trading, and accounting. Consider a relational database + cache database + consistency protocol.
Offline computing | For example, massive data analysis. Consider a column database; a relational database can also work.
Real-time computing | For example, real-time monitoring. Consider an in-memory database or a column database.

Whether RDB, NoSQL, or DRDB is chosen, design should start from requirements, with the business driving the architecture. The final data storage solution must be a comprehensive design that weighs all the trade-offs.

