
Traditional relational databases (in general, this refers to MySQL with InnoDB)

  • 1 TB: if the database is expected to grow to the TB level, consider splitting it into multiple MySQL databases.
  • 10 million rows or 10 GB: if a single table exceeds 10 million rows, or occupies more than 10 GB of disk space, consider splitting the table.
  • 1,000 writes per second: if the write rate of a single node exceeds 1,000 writes per second, consider introducing Redis or a message queue as a write buffer (depending on the business scenario) so that database writes become asynchronous.

Source: Liu Chunhui, "How does Shopee select databases?"
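The third rule above can be sketched in a few lines. This is a minimal illustration, assuming an in-process queue as a stand-in for Redis or a message queue: request handlers only enqueue changes, and a background worker flushes them to the database in batches. `flush_to_db` is a hypothetical placeholder for the real batched write path.

```python
import queue
import threading

write_buffer = queue.Queue()
flushed_batches = []          # records batch sizes, in place of real DB writes

def flush_to_db(batch):
    flushed_batches.append(len(batch))   # placeholder for a batched INSERT/UPDATE

def handle_request(row):
    write_buffer.put(row)     # the hot path only enqueues; no DB round-trip

def writer_worker(batch_size=100):
    batch = []
    while True:
        row = write_buffer.get()
        if row is None:       # sentinel: flush what's left and stop
            break
        batch.append(row)
        if len(batch) >= batch_size:
            flush_to_db(batch)
            batch = []
    if batch:
        flush_to_db(batch)

worker = threading.Thread(target=writer_worker)
worker.start()
for i in range(250):
    handle_request({"id": i})
write_buffer.put(None)
worker.join()
```

The trade-off is durability: rows sitting in the buffer are lost if the process dies, which is why a real deployment uses a persistent queue rather than process memory.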

A single database instance can only sustain a few thousand QPS, which clearly cannot meet the high-concurrency demands of Internet business scenarios. Sharding (splitting databases and tables) plus read-write splitting has therefore become the norm.

  • Database splitting: dividing one database into multiple databases deployed across multiple instances.
  • Read-write splitting: the primary handles write requests while replicas handle reads. Since most business scenarios are read-heavy rather than write-heavy, one primary with multiple replicas effectively reduces the load on any single instance.

If write requests (or data volume) keep growing and hit a bottleneck, the databases and tables are split further. Conversely, if read requests rise, more replicas are attached.
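Read-write splitting usually lives in the application layer or a middleware proxy. A minimal sketch, assuming hypothetical DSN strings in place of real connection pools: writes always go to the primary, and reads round-robin across the replicas.

```python
import itertools

PRIMARY = "mysql://primary:3306"
REPLICAS = ["mysql://replica-1:3306", "mysql://replica-2:3306"]
_replica_cycle = itertools.cycle(REPLICAS)

def route(sql: str) -> str:
    """Pick a backend for one statement: reads to replicas, writes to primary."""
    if sql.lstrip().upper().startswith("SELECT"):
        return next(_replica_cycle)   # spread read load across replicas
    return PRIMARY                    # INSERT/UPDATE/DELETE hit the primary
```

A real router also has to handle replication lag, e.g. by pinning a session to the primary right after it writes, so the user can read their own data.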

Complexity

In practice, however, online business is often far more complicated. As data volume and user counts grow, database splitting gradually runs into the following problems (table splitting alone is usually unproblematic when done along business lines):

  • Sharding complexity

    • Capacity estimation is difficult

      • A table initially designed without sharding in mind may need to be sharded as the business grows; if the sharding key is not an existing ID field, the cost of the change is high.

      Design the architecture for 10x growth, and the data model for 100x growth.

    • Consistency & integrity

      • Take orders as an example: the business filters data by buyer, seller, order status, payment method, and other dimensions. If the tables are sharded by the buyer dimension, queries along the seller dimension become difficult, and vice versa. The usual solution is to shard by the main query dimension and build heterogeneous indexes for the other dimensions, at which point keeping the different copies of the data consistent inevitably becomes hard.
    • Data skew

      • Social data such as likes and follows is often distributed unevenly once sharded by user dimension, so a few shards may hold far more data than the rest. These large shards are also read and write hotspots and easily become performance bottlenecks.

      Solution: better distribution can usually be achieved by taking the modulus with a large prime number and further splitting the hot sub-tables.

  • Query complexity

    • Multi-table queries
    • Multi-dimensional queries: see the order example above
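Two of the ideas above, large-prime modulo sharding and heterogeneous indexes, can be sketched together. This is an illustration with hypothetical numbers and in-memory lists standing in for tables: orders are sharded by `buyer_id`, and a seller-keyed index is written alongside so seller-dimension queries stay cheap.

```python
N_SHARDS = 97  # a large prime; power-of-two moduli amplify patterns in the IDs

def shard_for(key: int) -> int:
    return key % N_SHARDS

def write_order(order: dict, buyer_tables, seller_index):
    # Primary copy: sharded by the buyer dimension.
    buyer_tables[shard_for(order["buyer_id"])].append(order)
    # Heterogeneous index: the same order keyed by the seller dimension.
    # Keeping the two copies consistent (typically via binlog subscription
    # or transactional messages) is exactly the hard part noted above.
    seller_index[shard_for(order["seller_id"])].append(order["order_id"])

buyer_tables = {i: [] for i in range(N_SHARDS)}
seller_index = {i: [] for i in range(N_SHARDS)}
write_order({"order_id": 1, "buyer_id": 1001, "seller_id": 42},
            buyer_tables, seller_index)
```

Here the dual write is synchronous for simplicity; in production the index is usually populated asynchronously, which is why cross-copy consistency becomes a problem in its own right.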

Storage types

In practice, because of the complexity that sharding introduces, OLTP databases are rapidly being segmented by business scenario and domain.

By application type

  • Key-value type: id-value
  • Table type: ID Collection + time-value
  • Relational model: ID + ID Collection + time-value

Reference: Recognize resources

By storage model

  • B-tree: read-friendly, data stored in order. The LIRS algorithm divides the buffer pool into two levels: data first enters the first level, and if it is accessed two or more times within a short period it is considered hot and promoted to the second level.
  • Bitcask: write-friendly, data unordered. An in-memory index maps each key to the location of its value, while the actual keys and values are stored in disk files. A periodic merge operation performs garbage collection, and Bitcask uses hint files to speed up rebuilding the in-memory hash table.
  • LSM tree: write-friendly, data ordered. New changes are batched in memory and written to disk, where historical data is merged with the latest in-memory changes; garbage collection happens periodically during compaction.
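The Bitcask model is simple enough to sketch end to end. This is a toy illustration, not the real Bitcask implementation: values are appended to a log file, while an in-memory dict maps each key to the offset and length of its latest value, so a read is a single seek. Merges and hint files are omitted.

```python
import os
import tempfile

class TinyBitcask:
    def __init__(self, path):
        self.f = open(path, "a+b")   # append-only data file, readable
        self.index = {}              # key -> (offset, length), memory only

    def put(self, key: str, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)          # always append; old versions stay on disk
        self.f.flush()
        self.index[key] = (offset, len(value))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)          # one seek, one read
        return self.f.read(length)

path = os.path.join(tempfile.mkdtemp(), "data.log")
db = TinyBitcask(path)
db.put("a", b"hello")
db.put("a", b"world!")   # stale "hello" remains on disk until a merge
```

The second `put` shows why the periodic merge exists: the log keeps growing with dead versions, and garbage collection rewrites only the live ones.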

As fatigue with manual sharding sets in, newer distributed databases such as TiDB, unlike traditional relational databases, are increasingly being used in production environments.

Reference: Practice and challenges of TiDB with Zhihu's trillions of rows of business data