Hello, I’m Larry Wang. Creating and using indexes correctly is the foundation of high-performance queries. We have already looked at the data structures behind indexes and at the various index types with their relative advantages and disadvantages. Now let’s look at how to take full advantage of indexes and at some common index creation strategies.

Index creation strategies

Strategy 1: Keep index column types as small as possible

When we define a table structure, we must explicitly specify the type of each column. The integer types, for example, include TINYINT, MEDIUMINT, INT, and BIGINT. They occupy increasing amounts of storage, and the range of integers they can represent increases accordingly. (Type size here refers to the size of the value range the type can represent.) If we want to index an integer column, we should use the smallest type that the range of values allows: do not use BIGINT where INT is enough, and do not use INT where MEDIUMINT is enough. This is because:

  • The smaller the data type, the faster the comparison at query time (CPU level)
  • The smaller the data type, the less storage space the index occupies and the more records fit into a data page. This reduces the performance cost of disk I/O and lets more data pages be cached in memory, which speeds up both reads and writes.

This recommendation applies even more strongly to a table's primary key: not only does the clustered index store the primary key, but every secondary index entry also stores the primary key of its record. A smaller primary key type therefore means more storage savings and more efficient I/O.
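To make the space argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the default 16 KB InnoDB page and counts only the bytes of the indexed column plus the primary key stored in each secondary index entry. Record headers, page headers, and free space are deliberately ignored, so the numbers are upper bounds for comparison only, and the function name is made up for this illustration.

```python
# Storage sizes of MySQL integer types, in bytes.
INT_TYPE_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}

# Default InnoDB page size (16 KB).
PAGE_BYTES = 16 * 1024


def max_entries_per_page(key_type, pk_type, page_bytes=PAGE_BYTES):
    """Upper bound on secondary-index entries per page.

    Simplification (assumption): an entry is just the indexed column plus
    the primary key; all per-record and per-page overhead is ignored.
    """
    entry_bytes = INT_TYPE_BYTES[key_type] + INT_TYPE_BYTES[pk_type]
    return page_bytes // entry_bytes


# Compare a few combinations of indexed-column type and primary-key type.
for key_type, pk_type in [("BIGINT", "BIGINT"), ("INT", "INT"), ("MEDIUMINT", "INT")]:
    print(key_type, pk_type, max_entries_per_page(key_type, pk_type))
```

Under these simplified assumptions, halving the size of both the key and the primary key doubles the number of entries that fit in one page.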

Strategy 2: Index selectivity

When creating an index, choose columns with high selectivity (also called dispersion). Index selectivity is the ratio of the number of distinct index values (the cardinality) to the total number of records N in the table; it ranges from 1/N to 1. The more selective an index is, the more efficient the query, because a highly selective index lets MySQL filter out more rows during a lookup. A unique index has a selectivity of 1, which is the best possible selectivity and gives the best performance.

Selectivity is poor when the data in a column is highly repetitive. Take a gender field that holds only two values, male or female: even if a query uses an index on it, on average the index still matches about half of the rows.

Take this table for example:

Which column is the best candidate for an index? The name field, of course, because its values contain no duplicates. The gender field is the worst candidate because its values are highly repetitive.

How do we calculate the selectivity of an index? Take the table order_exp as an example:

SELECT COUNT(DISTINCT order_no)/COUNT(*) cnt FROM order_exp;

SELECT COUNT(DISTINCT order_status)/COUNT(*) cnt FROM order_exp;

Clearly, an index on the order_no column is more selective than one on the order_status column, simply because order_status holds only the values -1, 0, and 1.
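The ratio behind these two queries can also be sketched in a few lines of Python. This is only an illustrative model of the formula (distinct values divided by total rows); the sample values below are invented and are not taken from the real order_exp table.

```python
def selectivity(values):
    """Return distinct-count / total-count for one column's values."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)


# Invented sample data: order numbers are unique, statuses repeat.
order_no = ["N001", "N002", "N003", "N004", "N005", "N006"]
order_status = [-1, 0, 1, 0, 1, 0]

print(selectivity(order_no))      # every value distinct
print(selectivity(order_status))  # only three distinct values
```

A column of unique values reaches the maximum of 1, while a column with three possible values can never exceed 3/N.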

Strategy 3: Prefix indexes

Sometimes you need to index very long character columns, which makes the index large and slow. One strategy is to simulate a hash index.

Simulate hash indexes

The order_note field in the order_exp table is very long. To index it, we can add an order_not_hash field that stores the hash value of order_note, and then create an index on order_not_hash. Compared with indexing the entire order_note column, this is significantly faster: the index holds integer hash values, and numeric comparisons are much more efficient than string matching.

But the drawbacks are also obvious:

  1. The order_not_hash field requires additional maintenance.
  2. The choice of hash algorithm determines the probability of hash collisions. A poor hash algorithm produces many duplicate values.
  3. Range lookups are not supported.
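The idea and its collision drawback can be modeled in Python. The sketch below is an assumption-laden illustration, not the article's implementation: it uses CRC32 from the standard zlib module as the hash function (MySQL also provides a CRC32() function), and the class and method names are hypothetical. The key point is that a lookup matches on the hash first and then re-checks the full string, because two different notes may share a hash value.

```python
import zlib
from collections import defaultdict


def note_hash(note):
    """Integer hash of a note; CRC32 is an assumed choice for this sketch."""
    return zlib.crc32(note.encode("utf-8"))


class HashedNoteIndex:
    """Toy stand-in for an index on a hash column (hypothetical helper)."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def add(self, row_id, note):
        # Maintenance cost: the hash must be kept in sync with the note.
        self._by_hash[note_hash(note)].append((row_id, note))

    def find(self, note):
        # Cheap integer match first, then compare the full string to
        # discard rows that merely collide on the hash value.
        candidates = self._by_hash.get(note_hash(note), [])
        return [row_id for row_id, stored in candidates if stored == note]


index = HashedNoteIndex()
index.add(1, "please deliver before noon")
index.add(2, "leave the parcel at the front desk")
print(index.find("leave the parcel at the front desk"))
```

In SQL terms, the re-check corresponds to filtering on both the hash column and the original column in the WHERE clause.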

What could be improved? Instead, you can index only the leading characters of the column, which greatly reduces index size and improves index efficiency. However, it also reduces the selectivity of the index, so we need to make sure that the selectivity of the column prefix is high enough for good query performance. (For BLOB, TEXT, or very long VARCHAR columns in particular, you have to use a prefix index, because MySQL does not allow these columns to be indexed at their full length.)

The trick is to choose prefixes that are long enough for high selectivity, but not too long (to save space). The prefix should be long enough so that the selectivity of the prefix index is close to indexing the entire column. In other words, the cardinality of the prefix should be close to the cardinality of the full column. To determine the appropriate length for prefixes, find a list of the most common values and compare them to the list of the most common prefixes.

Start by finding a list of the most common values:

SELECT COUNT(*) AS cnt,order_note FROM order_exp GROUP BY order_note ORDER BY cnt DESC LIMIT 20;

Looking at the distribution of the data, it is a safe guess that the first nine characters are not very selective and that selectivity starts to improve at the tenth. Let’s check:

SELECT COUNT(DISTINCT LEFT(order_note, 3))/COUNT(*) AS sel3,
       COUNT(DISTINCT LEFT(order_note, 4))/COUNT(*) AS sel4,
       COUNT(DISTINCT LEFT(order_note, 5))/COUNT(*) AS sel5,
       COUNT(DISTINCT LEFT(order_note, 6))/COUNT(*) AS sel6,
       COUNT(DISTINCT LEFT(order_note, 7))/COUNT(*) AS sel7,
       COUNT(DISTINCT LEFT(order_note, 8))/COUNT(*) AS sel8,
       COUNT(DISTINCT LEFT(order_note, 9))/COUNT(*) AS sel9,
       COUNT(DISTINCT LEFT(order_note, 10))/COUNT(*) AS sel10,
       COUNT(DISTINCT LEFT(order_note, 11))/COUNT(*) AS sel11,
       COUNT(DISTINCT LEFT(order_note, 12))/COUNT(*) AS sel12,
       COUNT(DISTINCT LEFT(order_note, 13))/COUNT(*) AS sel13,
       COUNT(DISTINCT LEFT(order_note, 14))/COUNT(*) AS sel14,
       COUNT(DISTINCT LEFT(order_note, 15))/COUNT(*) AS sel15,
       COUNT(DISTINCT order_note)/COUNT(*) AS total
FROM order_exp;

As you can see, selectivity rises sharply at a prefix length of 10 and keeps increasing as the prefix gets longer. At 15 it is barely different from 14 and very close to the selectivity of the entire column.

So if you build a prefix index on this field, prefix lengths of 13 through 15 are good choices, and even 12 is not out of the question. In other words, instead of comparing lists of the most common values, you can calculate the selectivity of each prefix length directly.
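The search for a suitable prefix length can be sketched in Python as well. This is a simplified model with invented sample strings: the helper names and the 0.05 tolerance are assumptions chosen for illustration, and on real data you would still run the SQL above against the table itself.

```python
def prefix_selectivity(values, length):
    """Selectivity of the first `length` characters of each value."""
    return len({v[:length] for v in values}) / len(values)


def shortest_good_prefix(values, max_length, tolerance=0.05):
    """Smallest prefix length whose selectivity is within `tolerance`
    of the full-column selectivity, or None if none qualifies."""
    full = len(set(values)) / len(values)
    for length in range(1, max_length + 1):
        if full - prefix_selectivity(values, length) <= tolerance:
            return length
    return None


# Invented notes that share a common 7-character lead ("order: ").
notes = ["order: apple", "order: apricot", "order: banana", "order: cherry"]

for length in (7, 8, 10):
    print(length, prefix_selectivity(notes, length))
print(shortest_good_prefix(notes, max_length=15))
```

As in the article's example, the shared leading characters contribute nothing, and selectivity only reaches the full-column value once the prefix extends past them.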

Now that an appropriate prefix length has been found, here is how to create the prefix index:

ALTER TABLE order_exp ADD KEY (order_note(14));

The query statement does not need to change after the prefix index is created:

select * from order_exp where order_note = 'xxxx' ;

Prefix indexes are an effective way to make indexes smaller and faster, but they also have disadvantages: MySQL cannot use a prefix index for ORDER BY or GROUP BY, and cannot use it as a covering index.

Sometimes a suffix index is useful (for example, to find all e-mail addresses for a given domain name). MySQL does not natively support reversed indexes, but you can store the string reversed and build a prefix index on it. The reversed column can be maintained by triggers or by the application itself.
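Why reversing helps can be seen in a small Python model: once the strings are stored reversed and kept sorted, a suffix search becomes a prefix search over a sorted list, which is the kind of lookup an ordered index handles well. The function names and e-mail addresses below are invented for the illustration; in MySQL the reversed value could be produced with the REVERSE() function.

```python
import bisect


def build_reversed_index(emails):
    """Sorted list of reversed strings, standing in for an ordered index."""
    return sorted(email[::-1] for email in emails)


def find_by_suffix(reversed_index, suffix):
    """Find all original strings ending with `suffix` via a prefix scan."""
    prefix = suffix[::-1]
    # Jump to the first entry that is >= the reversed suffix ...
    start = bisect.bisect_left(reversed_index, prefix)
    matches = []
    # ... then scan forward while entries still share that prefix.
    for reversed_email in reversed_index[start:]:
        if not reversed_email.startswith(prefix):
            break
        matches.append(reversed_email[::-1])
    return matches


index = build_reversed_index(["ann@example.com", "bob@test.org", "cy@example.com"])
print(find_by_suffix(index, "@example.com"))
```

The scan stops at the first non-matching entry, so only the matching range of the sorted data is touched.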

Strategy 4: Create indexes only for columns used for searching, sorting, or grouping

That is, index only the columns that appear in WHERE clauses, the join columns that appear in join clauses, and the columns that appear in ORDER BY or GROUP BY clauses. Columns that appear only in the select list do not need to be indexed unless a covering index is required. For example:

SELECT * FROM order_exp ORDER BY insert_time, order_status,expire_time;

The result set must be sorted by insert_time first; rows with the same insert_time are sorted by order_status, and rows with the same order_status are sorted by expire_time. Recall the storage structure of a composite index: the u_idx_day_status index is itself sorted according to exactly these rules, so the rows can be read from the index in order, followed by a lookup back to the table to retrieve the columns that the index does not contain.

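Note that the columns in the ORDER BY clause must be listed in the same order as the index columns. A query with ORDER BY order_status, expire_time, insert_time cannot use this index for sorting. Grouping works in a similar way, for example: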

SELECT insert_time, order_status,expire_time,count(*) FROM order_exp GROUP BY insert_time, order_status,expire_time;

This query performs three grouping steps:

  • First, records are grouped by insert_time: all records with the same insert_time value form one group.
  • Within each insert_time group, records are grouped by order_status: records with the same order_status value form a smaller group.
  • Finally, each of those smaller groups is divided again by expire_time.

Without an index, this grouping has to be carried out in memory. With an index, the grouping order happens to match the column order of our u_idx_day_status index, and a B+ tree index is sorted by its index columns, so the B+ tree index can be used directly for grouping.

Just as with sorting by a B+ tree index, the order of the grouping columns must match the order of the index columns.
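Both points, the column-order requirement and grouping over presorted data, can be modeled in Python. The sketch assumes that u_idx_day_status covers (insert_time, order_status, expire_time) in that order, as the text describes, and it simplifies the order check to a plain leftmost-prefix comparison (it ignores sort direction and equality filters on leading columns). The helper names and sample rows are invented.

```python
from itertools import groupby

# Assumed column order of the u_idx_day_status index.
INDEX_COLUMNS = ("insert_time", "order_status", "expire_time")


def index_provides_order(index_columns, wanted_columns):
    """True if the wanted columns are a leftmost prefix of the index."""
    wanted = tuple(wanted_columns)
    return tuple(index_columns[:len(wanted)]) == wanted


def count_groups_presorted(index_entries):
    """Group entries that are already in index order, in a single pass.

    Because equal keys are adjacent, no extra sort or hash table is needed.
    """
    return [(key, sum(1 for _ in group)) for key, group in groupby(index_entries)]


# Invented index entries, sorted the way a B+ tree leaf level would be.
entries = sorted([
    ("2024-01-01", 0, "2024-02-01"),
    ("2024-01-02", 0, "2024-02-05"),
    ("2024-01-01", 1, "2024-02-01"),
    ("2024-01-01", 0, "2024-02-01"),
])

print(index_provides_order(INDEX_COLUMNS, ["insert_time", "order_status", "expire_time"]))
print(index_provides_order(INDEX_COLUMNS, ["order_status", "expire_time", "insert_time"]))
print(count_groups_presorted(entries))
```

The second check fails for the same reason the reordered ORDER BY cannot use the index: the data is simply not stored in that order.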

Strategy 5: Multi-column indexes

Many people have a poor understanding of multi-column indexes. A common mistake is to create a separate index for each column, or to create a multi-column index with the columns in the wrong order.

The most confusing question is the order of the index columns. The correct order depends on the queries that use the index, and you also need to consider how best to serve sorting and grouping. In a multi-column B-tree index, the column order means that the index is sorted first by the leftmost column, then by the second column, and so on. The index can therefore be scanned in ascending or descending order to satisfy ORDER BY, GROUP BY, and DISTINCT clauses that match the column order exactly.

So the column order of a multi-column index is critical. There is a rule of thumb for choosing it: put the most selective column first. When sorting and grouping are not a concern, placing the most selective columns first is usually a good choice, because the index then serves only to optimize lookups for WHERE conditions. An index designed this way filters out the desired rows fastest, and it is also more selective for queries whose WHERE clause uses only a leftmost prefix of the index columns.
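As a rough illustration of the rule of thumb, the following Python sketch ranks candidate columns by their selectivity over a set of sample rows. The rows and the helper name are invented, and the result is only a starting point: it says nothing about which conditions the real queries use.

```python
def rank_columns_by_selectivity(rows, columns):
    """Order columns from most to least selective over the given rows."""
    total = len(rows)
    selectivity = {
        column: len({row[column] for row in rows}) / total
        for column in columns
    }
    return sorted(columns, key=lambda column: selectivity[column], reverse=True)


# Invented sample rows: order_no is unique, order_status repeats.
rows = [
    {"order_no": "N001", "order_status": 0},
    {"order_no": "N002", "order_status": 0},
    {"order_no": "N003", "order_status": 1},
    {"order_no": "N004", "order_status": 1},
]

print(rank_columns_by_selectivity(rows, ["order_status", "order_no"]))
```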

However, performance depends not only on the selectivity of the index columns but also on the actual query conditions. You may need to adjust the order of the index columns based on the queries that run most frequently, so that the index is as selective as possible for those queries, while also taking their sorting and grouping needs into account.

At the same time, when optimizing performance, you may need to use indexes with the same columns but in a different order to meet different types of query requirements.

Strategy 6: Choose rarely changing columns as the primary key

As we know, rows are physically ordered by the clustered index. If the primary key changes frequently, the order of the rows changes too, and MySQL has to keep adjusting the B+ tree. Pages may be split and merged along the way, which can degrade performance dramatically.

Strategy 7: Avoid redundant and duplicate indexes

MySQL allows multiple indexes to be created on the same column, whether intentionally or not. MySQL has to maintain duplicate indexes separately, and the optimizer has to consider each of them when optimizing queries, which hurts performance. Duplicate indexes are indexes of the same type created on the same columns in the same order. Avoid creating them, and remove them as soon as you find them.

Sometimes duplicate indexes are inadvertently created, as in the following code:

CREATE TABLE test (
ID INT NOT NULL PRIMARY KEY,
A INT NOT NULL,
B INT NOT NULL,
UNIQUE(ID),
INDEX(ID)
) ENGINE=InnoDB;

Here a primary key is created, then a unique constraint is added, and then an index is added for queries to use. In fact, MySQL implements both unique constraints and primary key constraints through indexes, so this definition actually creates three duplicate indexes on the same column. There is usually no reason to do this unless you need different types of indexes on the same column to meet different query requirements.

Redundant indexes differ somewhat from duplicate indexes. If index (A, B) exists, creating index (A) is redundant because it is just a prefix of the existing index, so index (A, B) can also serve as index (A) (this kind of redundancy applies only to B-tree indexes). But creating index (B, A) is not redundant, and neither is index (B), because B is not the leftmost prefix column of index (A, B).

Another case is extending an existing index (A) to (A, ID), where ID is the primary key. For InnoDB, the primary key column is already included in every secondary index, so this is redundant as well.
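The leftmost-prefix rule and the InnoDB primary key case can be expressed as a small Python check. This is a simplified sketch with hypothetical helper names: it treats an index purely as an ordered list of column names and ignores index type, uniqueness, and prefix lengths.

```python
def is_leftmost_prefix(candidate, existing):
    """True if `candidate` columns are a leftmost prefix of `existing`."""
    candidate, existing = tuple(candidate), tuple(existing)
    return existing[:len(candidate)] == candidate


def effective_columns(index_columns, primary_key):
    """Columns an InnoDB secondary index effectively holds: its own
    columns followed by any primary key columns not already present."""
    columns = list(index_columns)
    columns += [column for column in primary_key if column not in columns]
    return tuple(columns)


def is_redundant(candidate, existing, primary_key=()):
    """True if `candidate` adds nothing over `existing` under this model."""
    return is_leftmost_prefix(candidate, effective_columns(existing, primary_key))


print(is_redundant(("A",), ("A", "B")))                         # (A) vs (A, B)
print(is_redundant(("B", "A"), ("A", "B")))                     # (B, A) vs (A, B)
print(is_redundant(("A", "ID"), ("A",), primary_key=("ID",)))   # (A, ID) vs (A)
```

A check like this could be run over index definitions read from the INFORMATION_SCHEMA tables, which is where the column lists would come from in practice.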

The solution to redundant and duplicate indexes is as simple as dropping them, but the first step is to find them. You can do this by writing complex queries that access the INFORMATION_SCHEMA tables.

Strategy 8: Delete unused indexes

In addition to redundant and duplicate indexes, there may be indexes that the server never uses. Such indexes are pure overhead, and it is recommended to remove them.