MySQL introduction: MySQL is one of the most popular relational database systems. It is small, fast, open source and free, easy to use, cheap to maintain, easy to scale out in cluster architectures, and highly available, which is why it is welcomed by developers and enterprises.

InnoDB

InnoDB has been the default storage engine since MySQL 5.5.

InnoDB architecture diagram

The following figure is the official InnoDB architecture diagram, which is divided mainly into a memory structure and a disk structure.

InnoDB memory structure

The memory structure mainly includes the Buffer Pool, the Change Buffer, the Adaptive Hash Index, and the Log Buffer.

  • Buffer Pool: the buffer pool, BP for short. BP is managed in units of pages, 16KB by default, and internally uses linked lists to manage those pages. When InnoDB accesses table records and indexes, the corresponding pages are cached in BP, which reduces disk I/O and improves performance.
    • Page management mechanism: based on status, pages fall into three types:
      • Free Page: a free page, not yet in use
      • Clean Page: a page in use whose data has not been modified
      • Dirty Page: a page in use whose data has been modified and is now inconsistent with the data on disk

      InnoDB maintains and manages these three page types through three linked lists:

      • Free List: the free buffer; manages Free Pages
      • Flush List: manages Dirty Pages, sorted internally by modification time. A dirty page sits in both the Flush List and the LRU List, but the two lists do not affect each other: the LRU List manages page availability and eviction, while the Flush List manages flushing dirty pages to disk.
      • LRU List: the in-use buffer; manages Clean and Dirty Pages. The list is split at a midpoint: the front part, the new sublist, holds frequently accessed pages and takes about 63% of the list; the rear part, the old sublist, holds less frequently used pages, about 37%.

      The improved LRU algorithm

      • Plain LRU: tail eviction; new pages are added at the head of the list, and pages are evicted from the tail when space must be freed
      • Modified LRU: the list is split into a new part and an old part. A newly read page is inserted at the midpoint rather than at the head. If the page is accessed again soon, it moves to the head of the new sublist; if it is not accessed, it drifts toward the tail of the old sublist, waiting to be evicted.
      • Each time a new page is read into the buffer pool, InnoDB checks whether there are enough free pages. If so, it takes a free page from the Free List and puts it into the LRU List. If not, it evicts a page from the tail of the LRU List according to the LRU algorithm and gives the freed memory to the new page.
    • Dirty page flush mechanism
      1. Dirty pages are flushed to disk when they are evicted from the LRU List
      2. When the percentage of dirty pages exceeds innodb_max_dirty_pages_pct_lwm, InnoDB starts flushing dirty pages to disk
      3. When the percentage of dirty pages exceeds innodb_max_dirty_pages_pct_lwm and also exceeds innodb_max_dirty_pages_pct, InnoDB enters aggressive-flushing mode and flushes dirty pages to disk much faster
      4. The last case is called a sharp checkpoint: when InnoDB reuses a redo log file, every buffer pool page covered by that file must be flushed to disk. This can cause a disk I/O storm that hurts performance and availability.

      Related parameters

      • innodb_max_dirty_pages_pct defaults to 75: aggressive-flushing mode is entered only when the percentage of dirty pages exceeds 75%.
      • innodb_max_dirty_pages_pct_lwm defaults to 0, a special value meaning the low-water-mark trigger is disabled. With the feature off, the percentage of dirty pages in the buffer pool hovers around 75%. The configuration sketch after this list shows how to inspect and adjust these settings.

      I/O storm mitigation:

      • For the sharp-checkpoint case (case 4), MySQL must avoid having to flush a mountain of buffer pool pages at the moment a redo log file is reused. Its solution is to flush buffer pool pages to disk continuously, at a rate derived from the rate at which redo log is generated, so that by the time a redo file is reused there are few pages left to flush and no I/O storm occurs. Because flushing adjusts itself automatically to the redo generation rate, this feature is called innodb_adaptive_flushing.
      • Since adaptive flushing already flushes continuously, the low-water-mark trigger (case 2) is largely redundant. That is why MySQL defaults innodb_max_dirty_pages_pct_lwm to 0, the special value meaning off.
  • Change Buffer: the write buffer, CB for short. During DML operations, if BP does not hold the corresponding page, the disk page is not loaded into the buffer pool immediately. Instead, CB records the change, and the change is merged into BP when the data is read later.
    • The Change Buffer occupies 25% of the Buffer Pool by default, 50% at most; the value can be tuned with innodb_change_buffer_max_size according to the read/write mix. When a record being updated exists in the Buffer Pool, it is modified there directly, a pure memory operation. If the record is not in the Buffer Pool (a miss), the change is recorded in the Change Buffer, also a memory operation, which avoids querying the disk and thus avoids disk I/O. The next time the record is queried, its page is read from disk, the buffered change is merged in from the Change Buffer, and the page is loaded into the Buffer Pool.
    • The write buffer applies only to non-unique secondary index pages. Why? If an index is unique, InnoDB must check uniqueness before applying the change, which means reading from disk anyway, an I/O operation. In that case the record is read straight into the Buffer Pool and modified there, not in the Change Buffer.
    • When does the write buffer actually merge its changes?
      • Actively: background threads in the InnoDB engine merge it periodically
      • Passively: when the affected page is read, the buffered changes are merged into it and the page is loaded into the Buffer Pool
  • Adaptive Hash Index: a hash index built automatically to optimize queries against Buffer Pool data.
    • The InnoDB storage engine monitors index lookups on its tables, and if it observes that building a hash index would speed them up, it builds one, hence "adaptive". InnoDB hashes certain hot pages automatically based on their access frequency and pattern.
  • Log Buffer: a buffer holding the log records (Redo/Undo) destined for the log files on disk. Its contents are flushed to the on-disk log files periodically, and automatically whenever the buffer fills up. For large transactions, such as those touching BLOBs or updating many rows, enlarging the log buffer saves disk I/O; its size is set by innodb_log_buffer_size, and increasing it reduces how often InnoDB must write to disk.
    • The innodb_flush_log_at_trx_commit parameter controls the flushing behavior; the default is 1
      • 0: write the log file and flush to disk once per second (write = Log Buffer → OS cache, flush = OS cache → disk). At most one second of data is lost
      • 1: on every transaction commit, write the log file and flush to disk immediately. No data is lost, but I/O operations are frequent
      • 2: on every transaction commit, write the log file (to the OS cache) immediately, and flush to disk once per second
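To make the knobs above concrete, here is a minimal sketch, assuming a reasonably recent MySQL server and sufficient privileges, of how the memory-related parameters discussed in this list can be inspected and adjusted (the SET value is illustrative, not a recommendation):

```sql
-- Inspect the buffer pool, change buffer, and log buffer settings discussed above.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';          -- total BP size in bytes
SHOW VARIABLES LIKE 'innodb_old_blocks_pct';            -- old sublist share (default 37)
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct';       -- default 75
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct_lwm';   -- default 0 = disabled
SHOW VARIABLES LIKE 'innodb_adaptive_flushing';         -- ON by default
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';    -- default 25 (% of BP)
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';   -- 0 / 1 / 2, default 1

-- Dynamic adjustment, e.g. giving the change buffer a larger share for a
-- write-heavy workload:
SET GLOBAL innodb_change_buffer_max_size = 30;
```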

Indexes

Indexes speed up queries; they affect WHERE filtering and ORDER BY sorting. MySQL index classifications:

  • By storage structure: B-tree index, Hash index, FULLTEXT full-text index, R-tree index
  • By application level: normal index, unique index, primary key index, composite index
  • By key value type: primary key index, secondary index
  • By data storage and index key-value layout: clustered index, non-clustered index

Normal index

This is the most basic index type: it is built on an ordinary field and carries no restrictions.

Unique index

Similar to a normal index, except that the values of the indexed field must be unique; NULL values are allowed.

Primary key index

A special kind of unique index that does not allow NULL values. A primary key constraint can be added when a table is created or altered. Each table can have only one primary key.

Composite index

  • A single-column index is an index on one column: the index-creation statement applies to only one column. An index created over multiple columns is called a composite index (combined index). One composite index can replace several single-column indexes at a lower cost.
  • Two related notions are the narrow index and the wide index: a narrow index covers 1-2 columns, a wide index more than 2. An important index design principle is to prefer narrow indexes over wide ones, because a narrow index is often more effective than a wide composite index.
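As an illustration, the index types above could be created as follows (the table and column names are made up for the example):

```sql
-- Hypothetical table, used only to illustrate the index types above.
CREATE TABLE t_user (
    id    INT NOT NULL AUTO_INCREMENT,
    name  VARCHAR(50),
    email VARCHAR(100),
    age   INT,
    PRIMARY KEY (id)                                 -- primary key index: unique, no NULLs
);

CREATE INDEX idx_name ON t_user (name);              -- normal index: no restrictions
CREATE UNIQUE INDEX uk_email ON t_user (email);      -- unique index: unique values, NULL allowed
CREATE INDEX idx_name_age ON t_user (name, age);     -- composite index over two columns
```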

B-tree and B+tree

B-tree

  • Index values and data are distributed throughout the tree structure
  • Each node can hold multiple index values and their corresponding data
  • The index values within a node are arranged in ascending order from left to right

B-tree search: starting from the root node, binary-search the sequence of index values within the node; on a hit, the search ends. On a miss, descend into the appropriate child node and repeat, until the corresponding child pointer is empty or a leaf node has been reached.

B+tree

  • Non-leaf nodes store only index values, not data, which lets each node hold more index values
  • Leaf nodes contain all the index values and all the data
  • Leaf nodes are connected with pointers, which improves range-access performance
    • Leaf nodes store the row data; within a page, the rows form a singly linked list
    • In MySQL the pages holding row data are themselves linked in a doubly linked list

Compared with a B-tree, a B+tree performing a range search only needs to locate the two boundary index values and then traverse the leaf-node pointers between them, whereas a B-tree must traverse every node and datum in the range. The B+tree is clearly more efficient, as the sketch below illustrates.
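A quick way to observe this is EXPLAIN on a range predicate over an indexed column; t_user is the illustrative table from the earlier sketch:

```sql
-- A range scan locates the two boundary keys in the B+tree and then walks the
-- linked leaf pages between them; EXPLAIN reports access type "range".
EXPLAIN SELECT * FROM t_user WHERE id BETWEEN 100 AND 200;
```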

Hash

Hash indexes are implemented with hash tables, which store data as <key, value> pairs. They are excellent for finding a value by key, i.e. single-key or equality queries, but a range query requires a full table scan. In MySQL, hash structures are used mainly in the Memory engine's native hash index and in InnoDB's adaptive hash index.

  • InnoDB's adaptive hash index exists to improve query efficiency. The InnoDB storage engine monitors the queries against each index page of a table, and when it notices that certain index values are accessed frequently, it builds a hash index in memory on top of the B+tree index. This gives the in-memory B+tree index hash-like behavior: fast, constant-cost access to frequently used index pages.
  • InnoDB adaptive hash index: with hash access, data can be located in a single lookup, so equality queries are more efficient than with a B+tree.
  • Adaptive hash indexing lets the InnoDB storage engine automatically create hash indexes for hot pages, based on the frequency and pattern of page access, to speed up access. Users can only switch the feature on or off; no manual intervention is possible, as shown below.
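A short sketch of that on/off switch, and of where to see the index at work (privileges permitting):

```sql
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';   -- ON by default
SET GLOBAL innodb_adaptive_hash_index = OFF;        -- disable if it does not help the workload

-- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of the engine status
-- reports hash searches/s versus non-hash searches/s.
SHOW ENGINE INNODB STATUS\G
```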

The four kinds of logs

Redo log (Engine layer)

The redo log records the data modifications made by transactions and is persisted to disk. When the database crashes unexpectedly, the redo log is read back to recover the latest data.

Redo Log generation and release:

  • Generation: redo records are produced as the transaction executes. They are written into the Log Buffer; they are not necessarily flushed to the disk files the instant the transaction commits. innodb_flush_log_at_trx_commit offers three persistence settings for the redo log:
    • 0: Redo Buffer → OS cache → flush to disk, performed once per second by the background Master thread; up to one second of transaction data may be lost.
    • 1 (default): Redo Buffer → OS cache → flush to disk on every commit; the safest and the slowest.
    • 2: Redo Buffer → OS cache on every commit; the Master thread then flushes the OS cache to disk once per second.
  • Release: once the dirty pages covered by a stretch of redo log have been written to disk, that redo log has served its purpose, and the space it occupied can be reused.

Redo log files are written sequentially and cyclically: when the last file is full, writing wraps around to the first file.

  • write pos is the current write position; it moves forward as records are written, wrapping back to the beginning of file 0 after reaching the end of the last file.
  • checkpoint is the current erase position; it also moves forward cyclically. Before a record is erased, the corresponding updates must be persisted to the data files.
  • The space between write pos and checkpoint is free and can hold new records. If write pos catches up with checkpoint, the log is full and no new updates can be performed; InnoDB must stop, flush some records, and push checkpoint forward. The sketch below shows where to observe these positions.
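A minimal sketch for inspecting the redo log configuration and positions, assuming the pre-8.0.30 variable names (newer versions replace the first two with innodb_redo_log_capacity):

```sql
SHOW VARIABLES LIKE 'innodb_log_file_size';       -- size of each redo file
SHOW VARIABLES LIKE 'innodb_log_files_in_group';  -- number of redo files in the cycle

-- In the LOG section of the engine status, "Log sequence number" corresponds
-- to write pos and "Last checkpoint at" to checkpoint.
SHOW ENGINE INNODB STATUS\G
```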

Undo log (Engine layer)

Undo logs are mainly used for rollback. There are two types of undo log:

  • insert undo

Generated by inserts during the transaction; it is needed only if the transaction rolls back, so it is recycled as soon as the transaction commits.

  • update undo

Generated by update and delete operations; it is used by snapshot reads as well as by rollback, so it is not deleted immediately when the transaction commits, but is cleaned up later by the Purge thread once no rollback or snapshot read still needs it.

  • Undo logging is logical
    • For an insert, only the primary key of the record needs to be logged; on rollback, the record is deleted by primary key.
    • For an update, it may be enough to log the old values of the modified fields; on rollback they are restored. The sketch below shows the idea.
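A conceptual sketch of what the undo records amount to logically; this is not the real undo format, just its effect (t_user is the illustrative table from earlier):

```sql
BEGIN;
INSERT INTO t_user (id, name) VALUES (10, 'x');
-- insert undo logs just the primary key: rollback = DELETE ... WHERE id = 10
UPDATE t_user SET name = 'y' WHERE id = 10;
-- update undo logs the old value: rollback = restore name to 'x'
ROLLBACK;  -- InnoDB applies the undo records in reverse order
```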

Binlog (Database layer)

The binlog is the MySQL Server's log, i.e. the binary log. A binlog file has three recording formats: STATEMENT, ROW, and MIXED.

  • ROW (row-based replication, RBR): the log records the modification made to each row of data, and the slave then applies the same modification to the same rows.
    • Advantages: every row's modification is recorded precisely, so master/slave data synchronization and data recovery are fully reliable.
    • Disadvantages: batch operations produce a large volume of log; an ALTER TABLE in particular makes the log balloon.
  • STATEMENT (statement-based replication, SBR): every SQL statement that modifies data is recorded in the master's binlog. During replication, the slave's SQL thread parses the same SQL the master executed and runs it again. SQL statement replication for short.
    • Advantages: a small log volume, which reduces disk I/O and improves storage and recovery speed
    • Disadvantages: master and slave data can diverge in some cases, for example with last_insert_id(), now(), etc.
  • MIXED (mixed-based replication, MBR): the two formats combined. STATEMENT format is used to record the binlog by default, and ROW format is used for operations that cannot be replicated correctly as statements; MySQL chooses the format per SQL statement. The sketch below shows how to check and switch the format.
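A brief sketch of checking and switching the format, assuming binary logging is enabled and the session has the required privileges (the binlog file name is illustrative):

```sql
SHOW VARIABLES LIKE 'binlog_format';            -- STATEMENT, ROW, or MIXED
SET GLOBAL binlog_format = 'ROW';               -- affects sessions started afterwards

SHOW BINARY LOGS;                               -- list the binlog files
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 5;  -- peek at the recorded events
```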

Relay Log (Database layer)

It is in fact what master/slave replication uses; readers who are interested can dig into it further.

Locks

Isolation level

Read uncommitted

Read Uncommitted: lost updates of the rollback-overwrite type are prevented, but dirty reads can occur, i.e. uncommitted changes made in other sessions can be read.

Read committed

Read Committed: only data committed by other sessions can be read, which prevents dirty reads. However, non-repeatable reads can occur: the same query run twice within one transaction may return different results.

Repeatable read

Repeatable Read: solves non-repeatable reads by ensuring that multiple reads in the same transaction see the same rows even under concurrency. In theory, however, phantom reads can still occur. Simply put, a phantom read happens when a transaction reads the rows in some range, another transaction inserts a new row into that range, and a later read of the range turns up the new phantom row.

Serializable

Serializable: all inserts, deletes, and updates execute serially. It resolves the phantom problem by forcing transactions to be ordered so they cannot conflict. This level causes many timeouts and heavy lock contention, so it is inefficient. The sketch below contrasts two of these levels.
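A minimal two-session sketch of a non-repeatable read under READ COMMITTED; t_user is the illustrative table from earlier:

```sql
-- Session A
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT name FROM t_user WHERE id = 1;   -- returns 'x'

-- Session B (autocommit)
UPDATE t_user SET name = 'y' WHERE id = 1;

-- Session A, same transaction
SELECT name FROM t_user WHERE id = 1;   -- now returns 'y': a non-repeatable read
COMMIT;
-- Under REPEATABLE READ, the second SELECT would still have returned 'x'.
```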

Record lock

A row lock: it locks only the index record. Used at the RC and RR isolation levels; it solves the non-repeatable read problem for UPDATE and DELETE.

  • For example,

Updating the row with id = 1 to name = 'wangwu' takes a record lock on the index entry id = 1; a concurrent transaction modifying the same row blocks until the first transaction ends.

Gap lock

A gap lock locks the range to the left and right of an index record; it applies at the RR isolation level.

Next-key lock

Record lock + gap lock: it locks the index record and the gaps on both sides; RR isolation level. It fixes the phantom problem caused by inserts.

  • For example,

Suppose the table holds (name = 'Chinese', gradeId = 1) and (name = 'Chinese', gradeId = 2). T1 executes UPDATE ... SET gradeId = 3 WHERE name = 'Chinese': the next-key locks cover the matching index entries and the gaps around them, so T2's INSERT of (name = 'Chinese', gradeId = 4) blocks until T1 ends, preventing the phantom. A two-session sketch follows.
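The same scenario as a sketch; the table, its secondary index on name, and the ids are illustrative:

```sql
-- Assumes RR isolation and:
-- CREATE TABLE course (id INT PRIMARY KEY, name VARCHAR(20), gradeId INT, KEY idx_name (name));
-- with rows (1, 'Chinese', 1) and (2, 'Chinese', 2).

-- Session T1
BEGIN;
UPDATE course SET gradeId = 3 WHERE name = 'Chinese';
-- next-key locks cover the matching idx_name entries plus the gaps around them

-- Session T2
INSERT INTO course (id, name, gradeId) VALUES (100, 'Chinese', 4);
-- blocks until T1 commits or rolls back: the gap is locked, so no phantom row appears
```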

MVCC

MVCC stands for Multi-Version Concurrency Control. For SELECT it works in the spirit of copy-on-write (COW): reads see a consistent version of the data without blocking writers.

  • How MVCC is realized

Data visibility is determined from the record's version chain and the current transaction's ReadView

  • The basis of MVCC
  • Every transaction has an ID: trx_id
  • A transaction is assigned its trx_id only when it actually modifies a record for the first time, and trx_ids are assigned in increasing order.
  • Every row has implicit fields
    • DB_TRX_ID: the ID of the transaction that created or last modified/inserted the record
    • DB_ROLL_PTR: a rollback pointer to the previous version of this record
    • DB_ROW_ID: an implicit auto-increment ID (a hidden primary key). If the table has no primary key, InnoDB automatically builds the clustered index on DB_ROW_ID
    • There is in fact also a hidden deleted flag: updating or deleting a record does not really remove it at once; the deleted flag changes instead

ReadView

A ReadView is generated when a transaction performs a snapshot read: at that moment a snapshot of the current database system is taken, recording and maintaining the IDs of the currently active transactions. Each time a transaction starts, it is assigned a new trx_id, which increases monotonically.

A ReadView actually consists of the following parts:

  • trx_ids: the IDs of the currently active transactions
  • min_trx_id: the smallest active transaction ID
  • max_trx_id: the next trx_id the system would assign after the ReadView was generated
  • creator_trx_id: the ID of the transaction that generated the ReadView

For example:

Suppose transactions 1, 2, 3, 5, 7, 9 exist, transactions 1 and 7 have committed, and transaction 9 generates a ReadView. What are the fields above?

  • trx_ids: 2, 3, 5
  • min_trx_id: 2
  • max_trx_id: 10
  • creator_trx_id: 9

MVCC principle

Continuing with the example above:

The current transaction has trx_id = 9. When it reads a record, the record's DB_TRX_ID is checked as follows (the sketch after this list walks through the same rules):

  1. Compare the record's trx_id with min_trx_id and creator_trx_id. If trx_id is smaller than min_trx_id, or equal to creator_trx_id, then the writing transaction had already committed when the ReadView was generated, or the record belongs to the current transaction, so the record is visible. Otherwise, continue comparing.
  2. Compare trx_id with max_trx_id. If trx_id is greater than or equal to max_trx_id, the writing transaction started only after the ReadView was generated, so the record is not visible; go to step 4. Otherwise, continue.
  3. Check whether trx_id is in trx_ids. If it is not, the transaction had committed when the ReadView was generated and the record is visible. If it is, the record is not visible; proceed to the next step.
  4. Follow the DB_ROLL_PTR pointer to the previous version of the record and start again at step 1, until a visible record is found or no versions remain.
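These rules can be expressed compactly. Here is a sketch written as a MySQL stored function, purely illustrative; InnoDB implements this in C++, not SQL, and the active set is passed as a comma-separated string only for convenience:

```sql
DELIMITER //
CREATE FUNCTION read_view_visible(
    row_trx_id     BIGINT,        -- DB_TRX_ID of the row version being tested
    min_trx_id     BIGINT,
    max_trx_id     BIGINT,
    creator_trx_id BIGINT,
    active_trx_ids VARCHAR(255)   -- trx_ids snapshot, e.g. '2,3,5'
) RETURNS BOOLEAN DETERMINISTIC
BEGIN
    -- Rule 1: committed before the ReadView, or written by this transaction
    IF row_trx_id < min_trx_id OR row_trx_id = creator_trx_id THEN
        RETURN TRUE;
    END IF;
    -- Rule 2: started after the ReadView was generated
    IF row_trx_id >= max_trx_id THEN
        RETURN FALSE;
    END IF;
    -- Rule 3: visible only if not in the active set at snapshot time
    RETURN FIND_IN_SET(row_trx_id, active_trx_ids) = 0;
END //
DELIMITER ;

-- The example above: active {2,3,5}, min 2, max 10, creator 9.
SELECT read_view_visible(1, 2, 10, 9, '2,3,5');  -- 1: trx 1 committed earlier
SELECT read_view_visible(3, 2, 10, 9, '2,3,5');  -- 0: trx 3 still active
SELECT read_view_visible(9, 2, 10, 9, '2,3,5');  -- 1: the reader's own change
```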

Why can phantom reads sometimes still occur?

An example:

  1. There are currently two rows:
    • id=1, name='A';
    • id=3, name='C';
  2. There are two transactions, T1 and T2:
    1. T1 runs SELECT * and finds the two rows, generating a ReadView
    2. T2 inserts id=2, name='B' and commits
    3. T1 runs SELECT * again; thanks to MVCC, it still reads only the two original rows
    4. T1 runs UPDATE ... SET name='b' WHERE id=2
    5. T1 runs SELECT * again, and the row with id=2 now appears
  3. The reason is that T1's update creates a new version of the row with trx_id = T1's ID. MVCC then finds that the latest version of the row with id=2 is visible to the current transaction, so it is read out: a phantom. The sketch below replays the sequence.
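The same sequence as a two-session sketch; the table name t is illustrative, and both sessions run at REPEATABLE READ:

```sql
-- Assumes: CREATE TABLE t (id INT PRIMARY KEY, name VARCHAR(10));
-- with rows (1, 'A') and (3, 'C').

-- Session T1
BEGIN;
SELECT * FROM t;                       -- (1,'A'), (3,'C'); the ReadView is created

-- Session T2
INSERT INTO t VALUES (2, 'B');
COMMIT;

-- Session T1
SELECT * FROM t;                       -- still (1,'A'), (3,'C'): snapshot read
UPDATE t SET name = 'b' WHERE id = 2;  -- a current read: it touches the new row
SELECT * FROM t;                       -- (1,'A'), (2,'b'), (3,'C'): the phantom appears
COMMIT;
```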