How many kinds of logs does MySQL have

  • Error log: records error messages, along with warnings and other diagnostic information.
  • General query log: records every request sent to the database, whether or not it executed correctly.
  • Slow query log: records every SQL statement whose running time exceeds a configured threshold.
  • Binary log (binlog): records all operations that change the database.
  • Relay log: on a replica, stores binlog events fetched from the master before they are replayed.
  • Transaction logs: InnoDB's redo log and undo log.

Four isolation levels for transactions

Read uncommitted, Read committed, Repeatable read, Serializable

B+ tree index and hash index difference

A B+ tree is a balanced multi-way tree: the height difference from the root to any leaf node is at most 1, the leaf nodes are linked to one another by pointers, and the keys are kept in order. A hash index applies a hash algorithm to the key to produce a hash value; a lookup does not walk from the root down to a leaf as in a B+ tree but needs only a single hash computation, and the entries are unordered. Advantage of hash indexes: equality queries, where a hash index has a clear edge (as long as there are not many duplicate keys, which are inefficient because of hash collisions). Hash indexes are unsuitable for the following scenarios: range queries are not supported; index-based sorting is not supported; the leftmost-prefix matching rule of composite indexes is not supported.
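A quick sketch of the difference, assuming a hypothetical lookup table. True HASH indexes exist on MEMORY tables; InnoDB accepts the USING HASH syntax but silently builds a BTREE:

```sql
-- MEMORY tables support real HASH indexes
CREATE TABLE lookup (
    id INT,
    name VARCHAR(20),
    INDEX idx_id (id) USING HASH
) ENGINE = MEMORY;

-- Equality query: one hash computation, very fast
SELECT name FROM lookup WHERE id = 42;

-- Range query: the hash index cannot help; a B+ tree index would
SELECT name FROM lookup WHERE id BETWEEN 10 AND 50;
```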

What are the storage engines

MyISAM, InnoDB, BDB (BerkeleyDB), Merge, Memory (Heap), Example, Federated, Archive, CSV, Blackhole, MaxDB, etc. MyISAM: mature, stable, easy to manage, and fast for reads; it uses table-level locking and lacks some features (transactions, etc.). InnoDB: supports transactions, foreign keys, and other features, with row-level locking; it takes more space and (before MySQL 5.6) does not support full-text indexes.

Clustered and non-clustered indexes

A B+ tree is a sequential storage structure, with the smallest key on the left and the largest on the right. Non-leaf nodes contain only the index (key) columns, while leaf nodes contain the index columns and the row data. This way of storing data together with the index is called a clustered index, and a table can have only one clustered index. If no primary key is defined, InnoDB chooses a unique non-null index instead; if there is none either, it implicitly defines a primary key as the clustered index. Non-clustered indexes (secondary indexes) store primary key IDs in their leaves, unlike MyISAM, whose indexes store data addresses.

Index failure scenario

  • LIKE statements beginning with “%” (fuzzy match with a leading wildcard)
  • OR statements where the column before or after the OR is not indexed
  • Implicit data type conversion (an unquoted value compared against a VARCHAR column may be cast to a number; see the sketch after this list)
  • Not-equal (!= / <>) queries
  • Columns involved in mathematical operations or wrapped in functions
  • Cases where MySQL decides a full table scan is faster than using the index
  • With a composite index, if the first condition is a range query, the next column cannot use the index even though it complies with the leftmost-prefix rule
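Hedged sketches of these failure cases, assuming a hypothetical user table with single-column indexes on phone (a VARCHAR) and name; EXPLAIN shows whether the index is chosen:

```sql
EXPLAIN SELECT * FROM user WHERE name LIKE '%li';       -- leading %: index unusable
EXPLAIN SELECT * FROM user WHERE phone = 13800001234;   -- implicit cast of a VARCHAR column: index unusable
EXPLAIN SELECT * FROM user WHERE phone = '13800001234'; -- quoted literal: index usable
EXPLAIN SELECT * FROM user WHERE LEFT(name, 2) = 'li';  -- function on the column: index unusable
```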

Covering index

A covering index means that, for a given query, an index contains (covers) the values of all the fields the query needs, so the query does not have to go back to the table.

A B+ tree query needs no back-to-table lookup when it hits either the clustered index or a covering index.

In a B+ tree index, a leaf node may store just the key value, or the key value together with the entire row of data; these are the non-clustered and clustered cases respectively. In InnoDB, only the primary key index is a clustered index. If there is no primary key, a unique key is chosen to build the clustered index; if there is no unique key either, a key is generated implicitly to build it. When a query uses the clustered index, the entire row is available at the leaf node, so no back-to-table query is needed.

Must a non-clustered index query go back to the table?

Not necessarily. It depends on whether all the fields the query needs are covered by the index. If they all are, there is no need to go back to the table.
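A minimal sketch, assuming a hypothetical user table: the first query is covered by the index (EXPLAIN shows Extra: Using index), the second must go back to the table for address:

```sql
CREATE INDEX idx_name_age ON user (name, age);

-- Covered: name and age both live in the index
EXPLAIN SELECT name, age FROM user WHERE name = 'li';

-- Not covered: address is only in the clustered index, so a back-to-table lookup is needed
EXPLAIN SELECT name, age, address FROM user WHERE name = 'li';
```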

Index condition pushdown

SELECT * FROM user WHERE name LIKE 'li%' AND age = 35 AND male = 0; With a composite index on (name, age, male), index condition pushdown (introduced in MySQL 5.6) lets the storage engine filter the age and male conditions while scanning the index, instead of returning every row whose name starts with 'li' to the server layer for filtering. This reduces the number of back-to-table lookups.
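A sketch, assuming the hypothetical composite index above; EXPLAIN shows "Using index condition" when the pushdown applies:

```sql
ALTER TABLE user ADD INDEX idx_name_age_male (name, age, male);

-- name LIKE 'li%' is a range, so age/male cannot narrow the scan,
-- but the engine can still test them inside the index before fetching rows
EXPLAIN SELECT * FROM user
WHERE name LIKE 'li%' AND age = 35 AND male = 0;
```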

Basic characteristics of transactions

Atomicity means that all operations in a transaction either all succeed or all fail. Consistency means that the database always moves from one consistent state to another consistent state. Isolation means that changes made by a transaction are invisible to other transactions until it finally commits. Durability means that once a transaction commits, its changes are stored permanently in the database.

What if there are multiple transactions going on at the same time

Concurrent transactions generally cause the following problems. Dirty read: transaction A reads uncommitted content from transaction B, and B later rolls back. Non-repeatable read: even if transaction A reads only content that other transactions have committed, two identical queries inside A can return different results, because transaction B committed a change in between. Phantom read: transaction A reads a range of rows twice while transaction B inserts a row into that range in between, producing a "phantom".

The isolation level of the transaction

Read uncommitted: data not yet committed by other transactions may be read, known as a dirty read. Read committed: only committed data is read, but two reads of the same data may return inconsistent results, called a non-repeatable read. Repeatable read: MySQL's default level; repeated reads return the same result, but phantom reads are still possible. Serializable: not normally used; every row read is locked, causing many timeouts and heavy lock contention.
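Checking and setting the level (the variable name assumes MySQL 8.0; on 5.7 it is tx_isolation):

```sql
SELECT @@transaction_isolation;

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET GLOBAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;
```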

READ UNCOMMITTED

  • Rarely used
  • No guarantee of consistency
  • Dirty read: using data that has never been validated (e.g. earlier versions, rollback)

Lock:

  • Execute in unlocked mode
  • Possible problems: dirty reads, phantom reads, non-repeatable reads

READ COMMITTED

  • Each query sets and reads its own fresh snapshot.
  • Only row-based binlog is supported at this level.
  • UPDATE optimization: semi-consistent read.
  • Non-repeatable read: without locking, other transactions' UPDATEs or DELETEs will affect the query result.
  • Phantom: locks taken do not cover the gaps, so other transactions can still INSERT.

Lock:

  • Locks index records without locking the gaps between them
  • Possible problems: phantom reads, non-repeatable reads

REPEATABLE READ

  • InnoDB’s default isolation level
  • Use the snapshot created when the transaction is first read
  • Multi-version technology

Lock:

  • With an equality query on a unique index, only the matched index record is locked, not the gaps.
  • Other query conditions lock the scanned index range: gap locks or next-key locks prevent other sessions from inserting into the range.
  • Possible problem: InnoDB does not fully rule out phantoms; locking reads are needed.

SERIALIZABLE

The strictest level: transactions execute serially, and it consumes the most resources. Review of the issues:

  • Dirty read: using data that has never been confirmed (e.g. an earlier version, later rolled back)
  • Non-repeatable read: without locking, other transactions' UPDATEs or DELETEs affect the result set
  • Phantom: even with record locks, the same query executed at different points in time produces different result sets

How are these solved? Raise the isolation level and use gap locks or next-key locks.

Undo log (rollback log)

  • Ensure atomicity of transactions
  • Uses: transaction rollback, consistent read, crash recovery
  • Records the undo actions required when the transaction is rolled back
  • An INSERT statement corresponds to a DELETE undo log entry
  • An UPDATE statement corresponds to a reverse UPDATE undo log entry

Save location:

  • System TABLESPACE (MySQL 5.7 default)
  • Undo tablespaces (MySQL 8.0 default)

Rollback segment

Redo log

  • Ensures transaction durability: protects against power failure or crash after a transaction commits but before its data pages are flushed to disk.
  • Written while the transaction executes, recording the changes made to data pages.
  • Performance gain: Write-Ahead Logging (WAL) writes the log sequentially before the data pages reach disk.
  • Log files: ib_logfile0, ib_logfile1
  • Log buffer: innodb_log_buffer_size
  • Forced flush: fsync()
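A quick way to inspect these settings on a live server:

```sql
SHOW VARIABLES LIKE 'innodb_log%';                    -- log buffer and log file sizes
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'; -- 1 = fsync on every commit (safest); 0/2 trade durability for speed
```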

What is ACID guaranteed by?

Atomicity (A) is guaranteed by the undo log, which records the information needed for rollback; when a transaction rolls back, the SQL statements already executed are undone. Consistency (C) is generally guaranteed at the code/application level. Isolation (I) is guaranteed by MVCC. Durability (D) is guaranteed by the redo log: MySQL records changes both in memory and in the redo log; when a transaction commits, the redo log is flushed to disk, and after a crash the database is recovered from the redo log.

What is a phantom read

A phantom read occurs when the same range query, executed twice within one transaction, returns different sets of rows because another transaction inserted matching rows in between.

What is MVCC?

MVCC: Multi-version concurrency control

  • Enables InnoDB to support READ COMMITTED and REPEATABLE READ.
  • Improves concurrency: queries are not blocked and do not wait for locks held by other transactions.
  • InnoDB keeps old versions of modified rows.
  • A query against data being updated by another transaction reads the previous version.
  • Each row carries version information that is updated on every change.
  • The technique is not universal in the database world: some databases, and some MySQL storage engines, do not support it.

Updating through the clustered index is an in-place replacement update; updating a secondary index is a delete plus an insert.

MVCC implementation mechanism

  • Hidden columns on each row (transaction ID, roll pointer)
  • Transaction list: holds transactions that have not yet committed; a transaction is removed from the list when it commits
  • Read view: one per SQL statement, including rw_trx_ids, low_limit_id, up_limit_id, low_limit_no, etc.
  • Rollback segment: old versions of rows are rebuilt dynamically from the undo log

InnoDB has three algorithms for row locking:

Record Lock:

A lock on a single index record.

Gap Lock:

A gap lock locks the open interval between index records: a range is locked, but not the records themselves. Its purpose is to prevent phantoms between two current reads in the same transaction. Gap locks exist only at the REPEATABLE READ level; MVCC combined with gap locks solves the phantom read problem.

Next-Key Lock:

A combination of record lock and gap lock: it locks a range and the record itself, covering a left-open, right-closed interval (a, b]. InnoDB uses this method for row queries; its main purpose is to solve the phantom problem.
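A sketch of a next-key lock under REPEATABLE READ, assuming a hypothetical table t with a non-unique index on age and existing rows with ages 20, 30, 40:

```sql
-- Session A: next-key locks cover (20, 30] plus the gap (30, 40)
START TRANSACTION;
SELECT * FROM t WHERE age = 30 FOR UPDATE;

-- Session B: this INSERT falls into the locked gap and blocks
-- until A commits, which is exactly what prevents phantoms
INSERT INTO t (age) VALUES (25);
```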

Deadlock

The necessary conditions for a deadlock to occur:

Mutual exclusion: a process requires exclusive control of the resources allocated to it, i.e. a resource is occupied by only one process at a time. Hold and wait: a process blocked while requesting resources holds on to the resources it has already acquired. No preemption: a resource a process has acquired cannot be taken away before the process finishes with it; it can only be released voluntarily. Circular wait: when a deadlock occurs, there must be a circular chain of processes each waiting for a resource held by the next.

Deadlock prevention

One-shot allocation: allocate all of a process's resources at once so it never issues further requests (breaks the hold-and-wait condition). Preemptible resources: if a process cannot obtain everything it needs, the resources it already holds may be released (breaks the no-preemption condition). Ordered allocation: the system numbers each resource class, every process requests resources in ascending order, and releases them in the opposite order (breaks the circular-wait condition).

Q: a table has an auto-increment primary key; rows with IDs 1-10 are inserted, the rows with IDs 7-10 are deleted, and then MySQL is restarted. What ID does the next inserted row get?

If the table is MyISAM, the ID is 11: MyISAM persists the maximum auto-increment value in its data file, so it survives a restart. If the table is InnoDB, the ID is 7: InnoDB (before MySQL 8.0) kept the maximum auto-increment ID only in memory, so restarting the database (or running OPTIMIZE on the table) loses the counter, which is then recomputed as MAX(id) + 1. Since MySQL 8.0, the counter is persisted in the redo log, so InnoDB also yields 11.
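A sketch reproducing the question (the InnoDB answer assumes a version before MySQL 8.0):

```sql
CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, v INT) ENGINE = InnoDB;
INSERT INTO t (v) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
DELETE FROM t WHERE id > 6;
-- ... restart mysqld here ...
INSERT INTO t (v) VALUES (99);
SELECT MAX(id) FROM t;  -- 7 on InnoDB < 8.0; 11 on MyISAM (and on InnoDB >= 8.0)
```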

How do I optimize SQL?

  • Select the correct storage engine.

MySQL, for example, includes the MyISAM and InnoDB storage engines, each with its pros and cons. MyISAM suits query-heavy applications but not heavy writes: even updating a single field locks the entire table, and no other process, not even readers, can operate until the update completes. On the other hand, MyISAM is extremely fast for computations like SELECT COUNT(*). InnoDB is a more complex storage engine and can be slower than MyISAM for small applications, but it supports row locking, which makes it better under heavy writes, and it supports more advanced features such as transactions.

  • Optimize the data type of the field

One rule to remember: smaller columns are faster. For small tables (such as dictionary or configuration tables), there is no reason to use INT as the primary key; MEDIUMINT, SMALLINT, or an even smaller TINYINT is more economical. If you don't need to track the time of day, DATE is much better than DATETIME. Of course, leave reasonable room for growth.

  • Add an index to the search field

Indexes do not have to be assigned only to primary keys or unique fields. If some field in your table will be searched frequently, index it, unless the field is a large text field, in which case a full-text index is the right tool.

  • Avoid SELECT *

The more data is read from the database, the slower the query becomes. And if your database server and web server are separate machines, it also increases network traffic. Even when you need most fields of a table, try not to use the * wildcard; get into the habit of listing the fields explicitly.

  • Use ENUM rather than VARCHAR

ENUM types are very fast and compact: internally an ENUM is stored like a TINYINT, but it displays as a string. This makes it ideal for fields holding a finite, fixed list of options. For example, if the values of fields such as gender, ethnicity, department, or status are finite and fixed, use ENUM instead of VARCHAR.
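A sketch of ENUM for small fixed value sets (hypothetical table and values):

```sql
CREATE TABLE employee (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    gender ENUM('male', 'female') NOT NULL,
    status ENUM('active', 'suspended', 'retired') NOT NULL DEFAULT 'active'
);
```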

  • Use NOT NULL whenever possible

Unless you have a very specific reason to use NULL, always keep your fields NOT NULL. NULL requires extra space and makes comparisons in your program more complex. This is not to say you can never use NULL; reality is complicated, and there will still be situations where you need it.

  • Fixed length tables are faster

If all fields in a table are "fixed length", the entire table is considered "static" or "fixed-length": that is, the table has no fields of types such as VARCHAR, TEXT, or BLOB. As soon as it includes one of these, the table is no longer a "fixed-length static table" and the MySQL engine handles it differently. Fixed-length tables improve performance because MySQL can search them faster: a fixed row length makes it easy to calculate the offset of the next row, so reads are naturally quicker. With variable-length rows, the engine must work out each row's length before it can locate the next one. Fixed-length tables are also easier to cache and to rebuild. The only side effect is wasted space: a fixed-length field takes its full space whether you use it or not.

How to design a high concurrency system

  • Database optimization, including reasonable transaction isolation level, SQL statement optimization, index optimization
  • Use caching to minimize database IO
  • Distributed database, distributed cache
  • Load balancing of servers

Optimization strategies for locking

  • Read/write separation
  • Subsection locking
  • Reduces the duration of lock holding
  • Multiple threads try to fetch resources in the same order

These are not absolute principles; apply them according to the situation. For example, lock granularity should not be made too fine, or the cost of threads acquiring and releasing many small locks may exceed that of simply taking one coarse lock.

SQL statement optimization methods

  • In the WHERE clause, table join conditions must come before other conditions, and conditions that filter out the most records should be placed at the end of the WHERE clause; HAVING runs last.
  • Replace IN with EXISTS and NOT IN with NOT EXISTS (see the sketch after this list).
  • Avoid calculations on indexed columns.
  • Avoid using IS NULL and IS NOT NULL on indexed columns.
  • Optimize queries to avoid full table scans; first consider indexes on the columns used in WHERE and ORDER BY.
  • Try to avoid NULL tests on fields in the WHERE clause, as this can make the engine abandon the index for a full table scan.
  • Avoid expression operations on fields in the WHERE clause as much as possible, as these can make the engine abandon the index for a full table scan.
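A hedged sketch of the IN-to-EXISTS rewrite, assuming hypothetical orders and users tables; on modern MySQL the optimizer often treats both forms the same, so verify with EXPLAIN:

```sql
-- IN form
SELECT * FROM orders o
WHERE o.user_id IN (SELECT u.id FROM users u WHERE u.city = 'Beijing');

-- Equivalent correlated EXISTS form
SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM users u
    WHERE u.id = o.user_id AND u.city = 'Beijing'
);
```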

What is a composite index? Why does column order in a composite index matter?

A composite index is a single index MySQL creates over multiple fields at once. To hit a composite index, conditions must match the columns one by one in the order they had when the index was created; otherwise the index cannot be used, because MySQL consumes index columns left to right. Therefore, when creating a composite index, pay attention to column order: in general, put the columns that are queried most often, or that have the highest selectivity, first. Further adjustments depend on the specific queries and table structure.
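A sketch of leftmost-prefix matching on a hypothetical composite index:

```sql
CREATE INDEX idx_a_b_c ON t (a, b, c);

-- Can use the index (a leftmost prefix is present):
SELECT * FROM t WHERE a = 1;
SELECT * FROM t WHERE a = 1 AND b = 2;
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- Cannot use the index (the leading column a is missing):
SELECT * FROM t WHERE b = 2 AND c = 3;
```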

Several problems of standalone MySQL database

With growing data volume, higher read/write concurrency, and rising availability requirements, a single-server MySQL faces: 1. limited capacity that is hard to scale; 2. read/write pressure and high QPS, where analytical workloads in particular interfere with transactional business; 3. a single point of failure, making high availability hard to guarantee.

Binlog format

  1. ROW
  2. STATEMENT
  3. MIXED
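Checking and switching the format (requires sufficient privileges; the global setting affects new sessions):

```sql
SHOW VARIABLES LIKE 'binlog_format';
SET GLOBAL binlog_format = 'ROW';  -- or 'STATEMENT' / 'MIXED'
```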

Splitting databases and tables (sharding)

First of all, there are two ways to split: vertical and horizontal. Generally speaking, split vertically first, then horizontally.

How do we keep IDs unique after splitting tables?

Since primary keys are auto-incremented by default, IDs from different tables would certainly collide. Several approaches: Set the step size: for example, with 1024 tables, set the auto-increment base and step so that primary keys from different tables never overlap (see the sketch after the list below). Distributed IDs: implement a distributed ID generation algorithm, or use an open-source one such as Snowflake. Stop querying by primary key after splitting: add a separate, globally unique business field as the key instead; for example, an order table's order number is unique, so no matter which table a row lands in, queries and updates are routed by the order number.

  • Auto-increment
  • Sequence
  • Simulated sequence
  • UUID
  • Timestamp/random number
  • Snowflake
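A sketch of the step-size approach using MySQL's built-in variables, assuming 1024 shards:

```sql
-- On shard 1:
SET GLOBAL auto_increment_increment = 1024; -- total number of shards
SET GLOBAL auto_increment_offset   = 1;     -- this shard's number

-- On shard 2:
SET GLOBAL auto_increment_increment = 1024;
SET GLOBAL auto_increment_offset   = 2;

-- Shard 1 then generates 1, 1025, 2049, ... and shard 2 generates 2, 1026, 2050, ...
```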

How to handle non-sharding_key query after table splitting?

  • You can build a mapping table. For example, how does a merchant query its order list when orders are sharded by user_id? You can't scan every table. Instead, maintain a mapping table of the merchant-user relationship: first query the merchant's users, then query orders by user_id.
  • The merchant side usually does not need real-time data. The order table can be synchronized to an offline (or real-time) data warehouse, a wide table built on top of it, and query services provided from something like ES (Elasticsearch).
  • If the data volume is modest, e.g. back-office queries, you can also scan all the tables with multiple threads and aggregate the results, or handle it asynchronously.

MySQL master-slave replication principle

  1. After the master commits a transaction, it writes the event to its binlog
  2. The slave connects to the master to obtain the binlog
  3. The master creates a dump thread and pushes binlog events to the slave
  4. The slave starts an I/O thread that reads the master's binlog and writes it to the relay log
  5. The slave starts an SQL thread that reads relay log events and replays them on the slave, completing synchronization
  6. The slave records its own binlog (a minimal wiring sketch follows)
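A minimal wiring sketch using the legacy syntax (MySQL 8.0.23+ prefers CHANGE REPLICATION SOURCE TO / START REPLICA); host and credentials are placeholders:

```sql
CHANGE MASTER TO
    MASTER_HOST     = '10.0.0.1',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'binlog.000001',
    MASTER_LOG_POS  = 4;
START SLAVE;
SHOW SLAVE STATUS\G
```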

Because MySQL's default replication mode is asynchronous, the master sends logs to the slaves without caring whether a slave has processed them. This creates a problem: if the master crashes while a slave has not yet caught up, and that slave is promoted to master, the unprocessed logs are lost. Two concepts address this.

  • Full synchronous replication

The master writes the binlog and forcibly synchronizes the log to the slaves, returning to the client only after all slaves have executed it; obviously, performance suffers severely this way.

  • Semi-synchronous replication

Unlike full synchronization, the logic of semi-synchronous replication is: a slave sends an ACK to the master after successfully writing the log, and the master considers the write complete once it receives at least one slave acknowledgement.
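A sketch of enabling it with the plugin names used by MySQL 5.7 and early 8.0 (8.0.26+ renames them to *_source / *_replica):

```sql
-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- On each slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
```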

Limitations of master-slave replication

  1. Master/slave delay problem
  2. The application side needs to work with the read/write separation framework
  3. High availability is not addressed

MySQL read/write separation

Dynamically switching data source version 1.0

Configure multiple data sources (e.g. two: master and slave) with Spring/Spring Boot. Based on Spring's AbstractRoutingDataSource and custom annotations such as readOnly, switch the data source automatically. Improvements: 1.2 supports configuring multiple slave libraries; 1.3 supports load balancing across multiple slaves. Disadvantages: 1) quite intrusive to the code; 2) replication lag leads to "write then read" inconsistency.

Database Framework version 2.0

The master-slave feature of ShardingSphere-JDBC: 1) SQL parsing and transaction management, with automatic read/write separation; 2) solves the "write then read" inconsistency problem. Disadvantages: 1) intrusive to the business system; 2) unfriendly to retrofitting existing old systems.

Read/Write Separation – Database Middleware version 3.0

The master-slave feature of MyCat/ShardingSphere-Proxy: 1) requires deploying a middleware, with the rules configured in the middleware; 2) emulates a MySQL server, so it is non-intrusive to the business system.

High availability of the MySQL

Why high availability?

1. Read/write separation improves read capacity. 2. Failover: combined with heartbeat retry in the service connection pool for reconnection and service continuity, failover capability reduces RTO and RPO.

What is failover? A related concept is disaster recovery (hot backup and cold backup). For a primary/secondary pair, failover simply means that when the primary node fails, a secondary node becomes the primary. For a whole cluster, common policies for keeping the service up are: 1. do not deploy multiple instances on the same host or rack; 2. deploy instances across machine rooms and availability zones.

High availability definition

High availability means less unserviceable time, generally measured by SLA/SLO. With 1 year = 365 days = 8760 hours:

  • 99%: allowed downtime 8760 × 1% = 87.6 hours
  • 99.9%: 8760 × 0.1% = 8.76 hours
  • 99.99%: 8760 × 0.01% = 0.876 hours ≈ 52.6 minutes
  • 99.999%: 8760 × 0.001% = 0.0876 hours ≈ 5.26 minutes

MySQL high availability 0: manual primary/secondary switchover

If the primary node fails, promote a secondary node to primary, reconfigure the other slave nodes, and modify the application's data source configuration. Disadvantages:

  1. Maybe the data is inconsistent.
  2. Human intervention is required.
  3. Intrusive code and configuration.

MySQL High Availability 1: Manual primary/secondary switchover

Use LVS + Keepalived for multi-node probing and request routing. With a VIP or DNS configured, the application configuration does not change. Disadvantages:

  1. Manually handle the primary/secondary switchover
  2. Extensive configuration and script definitions

MySQL High Availability 2: MHA

MHA (Master High Availability) is a relatively mature MySQL high-availability solution, developed by Yoshinori Matsunobu of DeNA (Japan, later at Facebook). It is an excellent suite for failover and master/slave promotion in MySQL high-availability environments. Written in Perl, it can generally complete a master/slave switchover within 30 seconds, and during the switchover it copies the primary node's logs directly over SSH. Disadvantages:

  1. SSH information needs to be configured
  2. At least three nodes are required

MySQL high availability 3: MGR *

If the primary node fails, a secondary node is automatically elected to become primary. No manual intervention is needed, and group replication guarantees data consistency. Disadvantages:

  1. Externally obtaining state changes requires reading the database.
  2. External LVS/VIP configuration is required.

MGR: MySQL Group Replication

High consistency: group replication is based on the distributed Paxos protocol, guaranteeing data consistency. High fault tolerance: automatic detection; the group keeps working as long as a majority of nodes are alive, with built-in split-brain protection. High scalability: group membership is updated automatically as nodes join or leave; a newly added node automatically syncs incremental data from the others until it catches up. High flexibility: single-primary and multi-primary modes are provided; in single-primary mode a new primary is elected automatically when the primary fails, with all writes going to the primary node, while multi-primary mode supports writes on multiple nodes.

MGR Usage Scenarios

Elastic replication; High availability sharding;

MySQL High Availability 4: MySQL InnoDB Cluster

A complete database-layer high-availability solution. MySQL InnoDB Cluster is a high-availability framework whose main components are:

  1. MySQL Group Replication: provides DB scaling, automatic failover, etc.;
  2. MySQL Router: lightweight middleware that provides failover of application connection targets;
  3. MySQL Shell: a new MySQL client with multiple interface modes, used to set up group replication and Router.

MySQL High Availability 5: Orchestrator

If the primary node fails, change a secondary node to the primary node. Features:

  1. Automatically discover MySQL replication topology and display it on the Web;
  2. Reconstruct the replication relationship, you can drag the graph on the Web to change the replication relationship;
  3. Detects failures of the primary node; recovery can be automatic or manual, with hooks for custom scripts;
  4. Supports replication management on command lines and web interfaces.

Why do we split the database

Rapid business growth causes data to expand quickly, and a single database can no longer keep up with Internet-scale business. Traditional solutions that centralize data on a single node struggle to satisfy massive-data scenarios in capacity, performance, availability, and operational cost.

In terms of performance, most relational databases use B+ tree indexes; once the data volume of a single database or table passes a threshold, the index gets deeper and each access needs more disk I/Os, so query performance drops. At the same time, highly concurrent requests make the centralized database the biggest bottleneck of the system.

In terms of availability, stateless services can scale out cheaply, which inevitably pushes the ultimate pressure onto the database. A single data node, or a simple master-slave architecture, finds this harder and harder to bear.

In terms of operational cost, once the data in an instance passes a threshold, the time needed for backup and recovery grows with the data size and becomes more and more uncontrollable.

Possible problems with single libraries

1. DDL cannot be executed: adding a column or an index, for example, directly affects online business and can leave the database unresponsive for a long time; 2. backups become impossible: similarly, a backup locks all tables of the database while exporting the data, which cannot be done at large volumes; 3. performance and stability suffer: the system gets slower and slower, and master-slave replication lag may spike at any time.

From read/write split to database split

The master-slave structure solves high availability and read scaling, but single-machine capacity stays the same and single-machine write performance remains unsolved. To increase capacity: split databases and tables into a distributed cluster of multiple databases serving as data shards. This reduces the write pressure on any single node and raises the data capacity ceiling of the whole system.

solution

  • Clone the whole system: replication and clustering; data replication in a master-slave structure gives backup and high availability
  • Split the business by decoupling different functions: vertical splitting of databases and tables; distributed servitization and microservices
  • Scale by splitting the data itself: data sharding; horizontal splitting of databases and tables; a distributed structure that can be expanded arbitrarily

Vertical splitting

Vertical partition => Distributed servitization => Microservice architecture

Splitting databases vertically

Vertical database splitting: splitting one database into multiple databases that each handle different business data. For example, splitting order data and product data into two independent libraries. This has a significant impact on the business system, because the data structure itself changes, and the SQL and table relationships must change with it. A formerly complex SQL statement that fetched a batch of orders together with their products in one go can no longer be used; the SQL and the program must be rewritten: first query the order database for the orders, collect all the product IDs they reference, then query the product database for those products, and finally assemble the result in business code.

Splitting tables vertically

Vertical table splitting: if a single table's data volume is too large, the table itself may need to be split. For example, a 200-column order master table can be split into several tables: an order table, order detail table, shipping information table, payment table, product snapshot table, and so on. The impact on the business system can sometimes be as large as building a new system, and for a high-concurrency online production system, the bigger and more core the change, the higher the risk of a major failure. So under normal circumstances we use this method as little as possible.

Advantages and disadvantages of vertical split

Advantages:

  1. Each library (table) is smaller and easier to manage and maintain
  2. Performance and capacity are improved
  3. After the change, system and data complexity is reduced
  4. It can serve as the basis for a microservice transformation

Disadvantages:

  5. More libraries are more complex to manage
  6. It is highly intrusive to the business system
  7. The transformation process is complex and prone to failure
  8. Beyond a certain point, no further splitting is possible

The general practice of vertical splitting

  1. Clarify the scope of the split and its impact
  2. Review and assess the affected services
  3. Prepare the new database cluster and replicate the data
  4. Modify the system configuration and release the new version

Note:

  1. Split the system first, or the database first?
  2. How much to split first?

Database horizontal splitting

What is horizontal splitting?

Horizontal splitting falls into three categories: splitting databases only, splitting tables only, and splitting both databases and tables.

Database horizontal splitting

Horizontal splitting (sharding by primary key): horizontal splitting shards the data directly. There are two concrete methods, splitting databases and splitting tables; both merely reduce the data volume per node without changing the structure of the data itself. The business system's code logic therefore needs little change, and middleware can even make the sharding transparent. For example, take an order table (orderdb.t_order) with 1 billion rows. Shard it into 32 databases by user ID mod 32 (orderdb_00..31), and within each database into 32 tables by order ID mod 32 (t_order_00..31). That yields 1024 child tables, each holding only about a million rows. A query that can be routed directly to a specific child table, such as orderdb_05.t_order_10, is far more efficient. (A routing sketch follows.)
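The routing arithmetic for this 32-database x 32-table layout, with made-up IDs:

```sql
SELECT 123456789 % 32 AS db_index,    -- 21 -> orderdb_21
       987654321 % 32 AS table_index; -- 17 -> t_order_17
```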

Database horizontal splitting

Horizontal splitting (by time): much of our data has a time attribute, so splitting along the time dimension is natural, e.g. current tables vs. history tables, or even separate tables per quarter, month, or day. When we query by time we can then go straight to the right child table. (A more detailed analysis appears in the next section.) Forced routing by condition: for example, certain users' data can be configured to go to a dedicated database or table, with all other data handled by the default rules. Custom mode: data matching specified conditions goes to specified libraries or tables.

Divided database or divided table, how to choose

In general, if the data carries heavy read/write pressure and disk I/O has already become the bottleneck, splitting databases beats splitting tables: distributing the data across database instances on different disks improves the whole cluster's parallel processing capacity. Conversely, you can first consider splitting tables as much as possible, reducing the data volume per table and thus the time of each single-table operation, while also gaining throughput by operating on multiple tables in parallel within one database.

Database horizontal splitting

What are the advantages and disadvantages of horizontal splitting: 1. it solves the capacity problem; 2. it has less impact on the system than vertical splitting; 3

Classification management of data

As we learn more about business systems and the data itself, we find that different data have different quality requirements. Order data, for example, demands the highest consistency and must never be lost, while logs and some intermediate computation results need no such consistency; they can be dropped, or recomputed from elsewhere. Likewise, we can apply different strategies to the order data within one table. If there are many invalid orders, we can periodically clean them out or move them away (in some trading systems, over 80% of orders are placed by machines and cancelled; nobody ever queries them, so they can be cleaned up). If there are no invalid orders, we can still consider:

  1. Orders placed in the last week but not yet paid are likely to be queried and paid; orders unpaid beyond a time limit can simply be cancelled.
  2. Orders from the last 3 months are the most likely targets of repeated online queries and system statistics.
  3. Data older than 3 months but within 3 years is queried very rarely, so online query need not be provided for it.
  4. For data older than 3 years, we can stop providing any means of query at all.

In this way, we can take certain measures to optimize the system:

  1. Define data ordered within the last week but not yet paid as hot data, kept both in the database and in memory;
  2. Define data within three months as warm data, kept in the database with normal query support;
  3. Define data between 3 months and 3 years old as cold data, deleted from the database and archived to cheap disks in compressed form (e.g. MySQL's TokuDB engine can compress by dozens of times); users query it by email or ticket, and we export and send the result (an archiving sketch follows this list);
  4. Define data older than 3 years as ice data, backed up to media such as tape, with no query support.
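A hedged sketch of the cold-data archiving step, assuming hypothetical orders / orders_archive tables with a created_at column:

```sql
INSERT INTO orders_archive
SELECT * FROM orders
WHERE created_at < NOW() - INTERVAL 3 MONTH;

DELETE FROM orders
WHERE created_at < NOW() - INTERVAL 3 MONTH;
-- In production, batch the DELETE (ORDER BY + LIMIT in a loop) to avoid long locks and huge undo
```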

As we can see, the above are for some specific scenarios to analyze and give solutions. By complementing existing technologies and tools in a variety of scenarios, we can end up with a complex system of technologies.

Separate library and table framework and middleware

Java Framework level:

  • TDDL
  • Apache ShardingSphere-JDBC

Middleware level:

  • DRDS (Commercial Closed Source)
  • Apache ShardingSphere-Proxy
  • MyCat/DBLE
  • Cobar
  • Vitess
  • KingShard

Database middleware ShardingSphere

Apache ShardingSphere is an ecosystem of open-source distributed database middleware solutions. It consists of JDBC, Proxy, and Sidecar (under planning), three products that are independent but can be deployed and used together. All of them provide standardized data sharding, distributed transactions, and database governance, and apply to diverse scenarios such as Java homogeneous environments, heterogeneous languages, and cloud native.

ShardingSphere-JDBC is a framework used directly in business code. It supports common databases and JDBC. Java only.

ShardingSphere-Proxy is deployed independently as middleware and is transparent to the business side. It currently supports MySQL and PostgreSQL. Systems on any language platform can connect to it, using the mysql command-line client or an IDE. It is minimally intrusive to business systems.

How do I migrate data

  • Designing new systems is easy; the hard part is old systems and historical data
  • How do we migrate old data to the new databases and systems smoothly?
  • Especially when the database structures are heterogeneous
  • Goals: accurate data, fast migration, minimal downtime, little business impact

Data migration mode: Full

  • Full data export and import
  1. Take the business system down,
  2. migrate the database and verify consistency,
  3. then upgrade the business system and cut over to the new database.

If the structures match, dump and then import the full data directly; if the data is heterogeneous, programs are needed to transform it.

Data migration mode: Full + Incremental

  • Relies on the data's own timestamps
  1. First synchronize data up to some recent timestamp,
  2. then take the system down for maintenance at release time,
  3. then synchronize the data changed during the final window (usually a day),
  4. finally upgrade the business system and cut over to the new database.

Data migration mode: Binlog + full + Incremental

  • Data is parsed and re-applied from the master's or a slave's binlog for replication.
  • Typically, middleware or other tools are required.

This approach supports multithreading, resumable transfer, and full-history plus incremental data synchronization. With it we can: 1. handle complex heterogeneous data structures with custom logic; 2. achieve automatic scale-out and scale-in, for example from a single table to sharded tables, or from 4 tables to 64 tables.

Database middleware ShardingSphere

The migration tool ShardingSphere-Scaling

  • Supports full and incremental data synchronization.
  • Support breakpoint continuation and multithreaded data synchronization.
  • Supports heterogeneous database replication and dynamic database expansion.
  • Has a UI for visual configuration.

This article is compiled from the Web