This article was first published on the WeChat official account Meandni. Please credit the source clearly or request whitelisting before reprinting.

A database has a single job: it is the storage medium for data. Choosing the wrong one, however, can cause a sharp drop in performance, which is fatal for many enterprise applications. The choice of database type also shapes the design of other modules in the system, so database selection matters for the whole project. Requirements of this kind are usually called non-functional requirements (NFRs). For databases, there are three main factors to consider:

  • Data structure
  • Query patterns
  • Data size

There are already a variety of storage solutions on the market. In this article, we will discuss how to choose the one that best suits you.

Caching

If the project makes frequent calls to a database API or to a remote service with high latency, placing a cache between the client and the database is often the first thing to consider for reducing latency. Commonly used caching solutions include Memcached, Hazelcast, and Redis. They serve broadly the same purpose, but Redis is the most widely used and is regarded as the most stable caching solution in China.
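
The read path described above is often called cache-aside. The following is a minimal in-process sketch of that idea, not how Redis or Memcached work internally; `SlowStore` and `CacheAside` are hypothetical names invented for this illustration.

```python
import time


class SlowStore:
    """Hypothetical stand-in for a database or remote service with high latency."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1  # count how often the slow path is taken
        return self.rows.get(key)


class CacheAside:
    """Check the cache first; on a miss, read the store and remember the result."""

    def __init__(self, store, ttl_seconds=60.0):
        self.store = store
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the store is not touched
        value = self.store.get(key)  # cache miss: fall back to the slow store
        self._entries[key] = (value, now + self.ttl)
        return value


store = SlowStore({"user:1": "Alice"})
cache = CacheAside(store)
first = cache.get("user:1")   # miss, reads the store
second = cache.get("user:1")  # hit, served from memory
print(first, second, store.reads)
```

The second lookup never reaches the store, which is where the latency saving comes from. The expiry time (60 seconds here, an arbitrary choice) bounds how stale a cached value can get.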

File storage

Suppose you are building a product like Douyin (TikTok) or Bilibili and need to store a large number of images, videos, and similar files. A database alone may not meet this requirement, because what you are storing is files rather than ordinary data records. A database is designed for querying data, whereas a file is not queried; the whole file is simply fetched on demand. In this case, the solution that fits is object (blob) storage, such as Amazon S3. Blob storage is often combined with a CDN to reduce latency, so that content can be delivered quickly regardless of the user's location.

Text search

Large apps such as Taobao and JD.com offer content search that lets users filter and sort results by product type and brand. These features are usually built on search engines such as Solr or Elasticsearch, which also support fuzzy search. For example, they tolerate spelling mistakes in the user's query, which greatly improves the user experience.

A search engine is not a database, so an engine such as Elasticsearch should not be used as the primary data source. Instead, the data is kept in a database and loaded into Elasticsearch to reduce search latency, and the search feature is built on that index.
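
To give a feel for what typo-tolerant matching means, here is a small sketch using Python's standard `difflib` module. It is only an illustration of fuzzy matching over a tiny in-memory list; it is not how Elasticsearch or Solr implement fuzzy search, and the product names are made up.

```python
import difflib

# Hypothetical product names that would normally be loaded from the database.
product_names = ["macbook pro", "macbook air", "thinkpad x1", "surface laptop"]


def fuzzy_search(query, names, limit=3, cutoff=0.6):
    """Return up to `limit` names similar to the query, best match first."""
    return difflib.get_close_matches(query.lower(), names, n=limit, cutoff=cutoff)


results = fuzzy_search("Macbok Pro", product_names)  # note the misspelling
print(results)
```

Even with the misspelled query, the best match is the intended product. A real search engine adds indexing, ranking, and analysis on top of this basic idea.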

TSDB (Time Series Database)

A time series database is a database optimized for time-stamped data, that is, data recorded in chronological order. Such data is also called time series data.

Some systems are particularly sensitive to time, such as stock trading or financial analysis systems. They frequently need to analyze data within a given window, such as the past week, 10 days, month, or year. A TSDB can return this data within milliseconds, which is very hard for a traditional database to do.

OpenTSDB and InfluxDB are two commonly used time series databases.
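
One reason time-window queries can be fast is that time series data arrives in time order, so a window can be located by binary search instead of a full scan. The sketch below illustrates that idea with the standard `bisect` module; `TinySeries` is a hypothetical class for this example and does not reflect the internals of OpenTSDB or InfluxDB.

```python
import bisect
from datetime import datetime, timedelta


class TinySeries:
    """Keeps points sorted by timestamp so a time window is found by binary search."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, ts, value):
        # Assumption: points arrive in time order, as is typical for time series data.
        if self._times and ts < self._times[-1]:
            raise ValueError("points must be appended in time order")
        self._times.append(ts)
        self._values.append(value)

    def between(self, start, end):
        """Return all (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


series = TinySeries()
day0 = datetime(2024, 1, 1)
for i in range(30):  # one made-up price per day for 30 days
    series.append(day0 + timedelta(days=i), 100 + i)

# The last 7 days of the 30-day series.
last_week = series.between(day0 + timedelta(days=23), day0 + timedelta(days=29))
print(len(last_week), last_week[0][1], last_week[-1][1])
```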

Data warehouse

Many projects also need a store that can hold huge amounts of data. For example, Didi needs to keep all order information in order to analyze which cities and time periods see the most use. Such analytical systems are separate from the transactional systems that regular users interact with, so they can use an offline data warehouse. Hadoop is currently a mainstream data warehouse solution.

SQL or NoSQL

As stated at the beginning of this article, data structure is one of the important factors in choosing a database. If we want to store structured or tabular data, we can use a relational database.

Relational database

We also need to consider whether the database must provide ACID properties (Atomicity, Consistency, Isolation, Durability).

  • Atomicity: all operations in a transaction either succeed together or have no effect.

  • Consistency: the database is in a valid state both before and after the transaction.

  • Isolation: concurrent transactions run independently, and one transaction is not affected by another that is in progress. This lets the database handle concurrent transactions without causing data inconsistencies.

  • Durability: once a transaction completes, its changes are permanently written to disk and are not lost to a system failure.

If the project requires ACID, a relational database (RDBMS) such as MySQL, Oracle, or Postgres is needed. If ACID is not required, a non-relational database is also an option.
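
Atomicity is easiest to see in a failed transaction. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a relational database such as MySQL; the `accounts` table and the `transfer` function are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


ok = transfer(conn, "alice", "bob", 150)  # alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob is executed first and succeeds, but because the debit from alice violates the constraint, the whole transaction is rolled back and both balances are unchanged.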

For example, suppose the project needs to build a catalog of goods. Each item often has different attributes: medicines have a shelf life, appliances have an energy efficiency rating, and so on. The same applies to a user table in which each user may have different attributes. Such data cannot easily be expressed in tabular form, so a NoSQL database is a reasonable choice.

Document database

Beyond storing such data, we also need to query it, so the query pattern must be considered. We choose the database based on both the type of data stored and the type of queries. If the project holds a large amount of data with varied attributes and must serve many kinds of queries, a document database such as Couchbase or MongoDB is a good choice.

Elasticsearch and Solr can also be regarded as special-purpose document databases.
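
The document model can be pictured as a collection of records that do not share a fixed set of fields. The sketch below imitates that with plain Python dictionaries; `find` is a hypothetical helper for this example and is not the MongoDB or Couchbase API.

```python
# Each document carries only the attributes that make sense for that product.
products = [
    {"_id": 1, "name": "Ibuprofen", "category": "medicine", "shelf_life_months": 36},
    {"_id": 2, "name": "Refrigerator", "category": "appliance", "energy_rating": "A"},
    {"_id": 3, "name": "Air conditioner", "category": "appliance", "energy_rating": "B"},
]


def find(documents, **criteria):
    """Return documents whose fields equal every given criterion.

    A document that lacks a field simply does not match on it.
    """
    return [
        doc
        for doc in documents
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    ]


a_rated = find(products, energy_rating="A")
appliances = find(products, category="appliance")
print([doc["name"] for doc in a_rated], len(appliances))
```

Any attribute can be used as a query criterion without changing a schema first, which is the flexibility that makes document databases suit data with varied attributes and varied queries.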

If the data does not have varied attributes and the query types are limited, so that simple inserts and deletes are enough, but the data volume is very large, the situation is different. Driver locations at Didi are an example: this data grows every moment. In this case, we usually use a database with a columnar storage model, also known as a columnar database (columnar DB), such as Cassandra or HBase. In this kind of database, each column is identified by a column key, and each column key corresponds to several values, so it is easy to fetch the data for a given column.

Row-oriented and column-oriented storage

For small personal projects, we usually choose Cassandra because it is lightweight and easy to deploy. Its performance is comparable to that of HBase, which is built on Hadoop and is heavier to operate. Cassandra is a good choice when queries specify the key directly in the WHERE clause.

Suppose we store Didi's ride order data in Cassandra, with the driver's ID as the partition key. When we want to query the distance a driver covered in a specific period, Cassandra can go straight to the corresponding partition and return the data immediately. However, when we want to query a passenger's ride records for a certain date, Cassandra has to scan all partitions, because the customer ID is not the partition key, and performance drops sharply.

In this case, we can copy the same data into another table that uses a different partition key. When a query arrives with a customer ID and a date, we direct it to the table whose partition key is the customer ID. This approach suits workloads with few kinds of queries but a large volume of data: as long as the number of query types is small, Cassandra (and HBase) can scale almost indefinitely. If there are many query types, the data has to be copied again for each new partition key, which eventually becomes impractical.
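
The effect of the partition key can be shown with a toy model. In the sketch below, `PartitionedTable` is a hypothetical in-memory class, not the Cassandra driver or data model; it counts how many rows each query has to look at, and the ride data is made up.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table that groups rows by one partition key and counts rows scanned."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.partitions = defaultdict(list)
        self.rows_scanned = 0

    def insert(self, row):
        self.partitions[row[self.partition_key]].append(row)

    def query(self, field, value):
        if field == self.partition_key:
            # Fast path: jump straight to the single matching partition.
            rows = self.partitions.get(value, [])
            self.rows_scanned += len(rows)
            return list(rows)
        # Slow path: the field is not the partition key, so scan every partition.
        matches = []
        for rows in self.partitions.values():
            self.rows_scanned += len(rows)
            matches.extend(row for row in rows if row[field] == value)
        return matches


rides = [
    {"driver_id": "d1", "customer_id": "c1", "km": 5},
    {"driver_id": "d1", "customer_id": "c2", "km": 8},
    {"driver_id": "d2", "customer_id": "c1", "km": 3},
    {"driver_id": "d3", "customer_id": "c3", "km": 12},
]

by_driver = PartitionedTable("driver_id")
by_customer = PartitionedTable("customer_id")  # second copy of the same data
for ride in rides:
    by_driver.insert(ride)
    by_customer.insert(ride)

slow = by_driver.query("customer_id", "c1")    # scans all 4 rows
fast = by_customer.query("customer_id", "c1")  # scans only the 2 rows of c1
print(by_driver.rows_scanned, by_customer.rows_scanned)
```

Both queries return the same two rides, but the copy partitioned by customer ID reads only its own partition. The price is that every write now goes to two tables, and each additional query type needs another copy.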

If we cannot control the types of queries, we opt for something like MongoDB. If we only need to scale a small number of query types to very large volumes, Cassandra is a very good fit.

Database selection flowchart

We now have a general idea of which way to go. If you store structured data and need ACID properties, use a relational database (such as MySQL). If you store large amounts of data with many attributes, use a document database (such as MongoDB). If the data is very simple and there are few types of queries, use a columnar database (such as Cassandra). In a real project, however, it is not that simple.
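
The rules of thumb above can be written down as a small decision function. This is only a sketch of the simplified flow in this article; the function name, its parameters, and the order in which the rules are checked are assumptions made for illustration, and real projects need more nuance.

```python
def choose_database(structured, needs_acid, many_attributes, many_query_types):
    """Return a database family following the article's simplified rules."""
    if structured and needs_acid:
        return "relational (e.g. MySQL)"
    if many_attributes or many_query_types:
        return "document (e.g. MongoDB)"
    # Simple data with only a few kinds of queries.
    return "columnar (e.g. Cassandra)"


print(choose_database(structured=True, needs_acid=True,
                      many_attributes=False, many_query_types=False))
print(choose_database(structured=False, needs_acid=False,
                      many_attributes=True, many_query_types=True))
print(choose_database(structured=False, needs_acid=False,
                      many_attributes=False, many_query_types=False))
```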

Hybrid solutions

Take Taobao as an example. Suppose only one unit of a product is in stock but many users want to buy it. In the end it can be sold to only one user, which requires ACID properties and therefore a relational database such as MySQL. At the same time, the data volume keeps growing and the attributes of Taobao goods are highly varied, which calls for a NoSQL database such as Cassandra with its columnar storage model. Which one should we choose? In real projects, we often use both: for example, orders that have not yet been delivered are stored in MySQL, and once an order is complete it is moved to Cassandra for permanent storage.
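
The move from the transactional store to the archive can be sketched as follows. Here `sqlite3` stands in for MySQL and a plain dictionary stands in for Cassandra; the table layout, the `delivered` status value, and the `archive_completed` function are all assumptions made for this example.

```python
import sqlite3

live = sqlite3.connect(":memory:")  # stand-in for the MySQL order table
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)")
live.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "phone", "delivered"), (2, "laptop", "pending"), (3, "book", "delivered")],
)
live.commit()

archive = {}  # stand-in for a Cassandra table keyed by order ID


def archive_completed(live, archive):
    """Copy delivered orders to the archive, then delete them from the live table."""
    rows = live.execute(
        "SELECT id, item, status FROM orders WHERE status = 'delivered'"
    ).fetchall()
    for order_id, item, status in rows:
        # Write to the archive first so a failure never loses an order.
        archive[order_id] = {"item": item, "status": status}
    with live:  # delete all archived rows in one transaction
        live.executemany("DELETE FROM orders WHERE id = ?", [(row[0],) for row in rows])
    return len(rows)


moved = archive_completed(live, archive)
remaining = [row[0] for row in live.execute("SELECT id FROM orders ORDER BY id")]
print(moved, remaining, sorted(archive))
```

Copying before deleting means an interruption can leave an order in both stores but never in neither, and rerunning the function simply overwrites the archived copy.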

Requirements can become more complex still. Suppose we need to build a reporting system on customer purchases. Taobao goods usually come in different brands and versions sold to different customers, so a report does not cover a single product but a subset of products. This could be done with Cassandra or MySQL, but a better solution is a document database like MongoDB: we store a subset of the order data in MongoDB, recording which users bought how many of a particular item and on what date. Then, if we want to know how many people bought MacBooks in the last month, we can get the order IDs from MongoDB and use them to look up the remaining data in Cassandra or MySQL.
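
The two-step lookup can be sketched with in-memory stand-ins: a list of summary documents plays the role of MongoDB, and a dictionary keyed by order ID plays the role of Cassandra or MySQL. All names, fields, and sample orders below are made up for illustration.

```python
from datetime import date

# Stand-in for the MongoDB collection holding a subset of each order.
order_summaries = [
    {"order_id": 101, "item": "MacBook", "user": "u1", "quantity": 1, "date": date(2024, 5, 3)},
    {"order_id": 102, "item": "iPhone", "user": "u2", "quantity": 2, "date": date(2024, 5, 9)},
    {"order_id": 103, "item": "MacBook", "user": "u3", "quantity": 1, "date": date(2024, 4, 20)},
    {"order_id": 104, "item": "MacBook", "user": "u1", "quantity": 1, "date": date(2024, 5, 21)},
]

# Stand-in for the full order records kept in Cassandra or MySQL.
order_details = {
    101: {"address": "Beijing", "paid": True},
    102: {"address": "Shanghai", "paid": True},
    103: {"address": "Shenzhen", "paid": True},
    104: {"address": "Beijing", "paid": True},
}


def buyers_of(item, start, end):
    """Find matching orders in the summaries, then fetch full details by order ID."""
    hits = [
        summary
        for summary in order_summaries
        if summary["item"] == item and start <= summary["date"] <= end
    ]
    buyers = {summary["user"] for summary in hits}  # distinct users
    details = [order_details[summary["order_id"]] for summary in hits]
    return buyers, details


buyers, details = buyers_of("MacBook", date(2024, 5, 1), date(2024, 5, 31))
print(len(buyers), len(details))
```

The flexible query runs against the small summary documents only, and the heavier store is touched just for the order IDs that matched.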

Further reading

https://www.influxdata.com/time-series-database/

https://en.wikipedia.org/wiki/Column-oriented_DBMS

My WeChat official account: mp.weixin.qq.com/s/YKryOUF83…