1, index,

An index is a data structure used to quickly find a specified item in a large amount of data. When we study index, we must contact the most example is the catalog of books. Basically, every article about index will use the catalog of books or dictionaries to introduce the concept of index. Index is indeed like catalog, in essence, index is a kind of data structure for quick retrieval of data.

We said that an index is a data structure, so what is an index structure in MySQL? Now MySQL default for the Innodb storage engine, and its use B + tree to implement the index (about MySQL why use B + tree index, I think this article summarizes the quite good cloud.tencent.com/developer/a…). Every time we create an index (primary key, unique index, secondary index), Innodb creates a B+ tree for it to implement the above “directory” function, and Innodb creates a B+ tree index for it to look like this.

2. The cost of indexing

We say that the index is a good thing, the data in the database is generally more, and we too many scenes of data from the database query, if direct scanning the entire table, in the case of few data in the table or can be, but when the table data quantity is big, it is too time-consuming to scan the full table directly, serious impact on the user experience, Index enables us to improve the speed of data retrieval in a large number of tables. In large Internet companies, the magnitude of table data is usually relatively large. In general, it is required that the query operation must match the index, otherwise a large number of slow query statements may be generated. In the “Alibaba Java Development Manual”, it is clear that the index should be optimized to at least Range level, as shown below:

If you are not familiar with the type level of an index, refer to the related article, which will not be expanded here.

But good things can not be used casually, with good efficiency, with bad drag. It is not that we build an index for each query. There are two main reasons.

1. Time cost

Once we create indexes, then later if you want to at the time of query by index with cool, so the usual data changes (add, delete, modify) must maintain good involved all indexes, we went to insert, delete, modify the data in the table when going to the corresponding index of clustering with all the secondary indexes. Therefore, if there are too many indexes, our DML operation statement will need to change more indexes, and the corresponding execution time will be longer. Therefore, indexes not only improve the query speed, but also increase the burden or cost of DML operation. From this point of view, indexes cannot be created randomly.

2. Space costs

Every time we create an index, MySQL creates a B+ tree for us, and this B+ tree does not exist in a vacuum, it actually takes up physical space. Therefore, the more indexes we create for a table, the larger the storage space occupied by the index. We should not underestimate the space occupied by the index. When the data volume of the table is very large, each B+ tree saves some or all of the table column data, so the space occupied by each B+ tree should not be underestimated. In this sense, indexes cannot be created randomly.

3. Use of indexes

As mentioned above, indexes can greatly improve the speed of data query on the one hand, and will cost time and space on the other hand, so we cannot create indexes casually. At the same time, even if we create index, if he can’t very good use of the index, may result in the index is created but not used, so in the daily work need to write SQL developer, you should learn how to better use the index, some time ago I was in the apes counselling interview the interviewer also asked me what do you think the use of the index? Combined with my answer at that time and the use of index summary, next we discuss some problems in the process of using index should pay attention to, help us in the daily development of a better design table index, write a more reasonable and effective SQL.

1. Create indexes only for columns that appear in search criteria, sorted, and grouped

There may be many columns in the data table in our daily development. For example, there may be id,name,sex,class and Score columns in the student table. We want to query students whose scores are above 90.

select id,name,sex,class,score from student where score>90;
Copy the code

In this case, we only need to index score, and do not need to index other columns in the SELECT.

Why create indexes for only the necessary columns?

First, queries are filtered not by select criteria, but by search, sort, group, and so on.

Second, we know that Innodb’s B+ tree is in page size. By default, a page size is 16KB. So if the page size is fixed, the fewer the index columns, the more the index entries in a page. The less storage the index takes up at this point. At the same time, as more index items are stored on each page, the B+ tree becomes shorter and fatter. Therefore, fewer I/O operations are required for index query, and the index query speed is also improved.

2. Primary keys tend to increase

Here, we’d better set the auto_increment attribute for the primary key to increment the primary key. In development, there may be some ways to generate primary keys using database middleware, and it’s also better to generate primary keys by increasing the primary key. Why is that?

First we know that Innodb automatically creates a B+ tree for the primary key. The index represented by the B+ tree is called a clustered index. If we create a primary key, Innodb will generate a B+ tree for that key. If there is no primary key, Innodb will select a non-empty Unique key. If there is no primary key, Innodb will implicitly generate a 6-byte RowId for us as the primary key, and then create a B+ tree index for it. Innodb creates a unique clustered index for each table. The B+ tree of clustered indexes is sorted by primary key, and all data is stored in the B+ leaf node, like this:

This graph is from how MySQL works: Understanding MySQL at its roots. I think it’s a good one, so I’ll refer to it here.

Only the primary key value is stored in the inner node, and the entire row of data is stored in the leaf node. All data in each page is sorted according to the primary key increment.

If we insert records in primary key incrementing mode, when a data page is used up, we just need to add a new data page at the end of the data page, put the new records into the new data page, and add a directory entry pointing to the newly created page in the upper page.

And if our primary key is not increasing, this time will be according to the location of the sort found should insert position to maintain the order of increasing the primary key of the B + tree, suppose you want to into the page 9, and happened to page 9 of the storage space is not enough to insert a new record, so this time will need to add a data page, the page 9 in the half of the data is transferred to a new page, At the same time, insert a directory entry into the upper directory page, and if the upper page is also full, it needs to continue the split, shift. Splitting pages and shifting records means performance loss, so from this point of view it’s best to keep the primary key trending up.

Similarly, the records of the secondary index are sorted by the primary key even if the index columns are the same. In this case, the operation of record shifting and page splitting will also be involved when the records are inserted.

3. Index differentiation should be high

Index distinctness refers to the proportion of a column that is not duplicated. Again, if we index sex, let’s take the student table above. If we specify sex=’ male ‘, we may hit a very large number of rows.

MySQL has a query optimizer where the index created on sex is a secondary index (except for clustered indexes), and there is a concept of a back table. Back table means that the secondary index stores the index columns and primary key values of the secondary index. The index first locates the corresponding primary key value of the record from the B+ tree corresponding to the secondary index, and then queries the complete record from the cluster index according to the primary key value. Therefore, the B+ tree needs to be queried twice. The previous improved query optimizer, in the case of secondary index hit a large number of records, need to return to the table query there are a lot of records, the query optimizer may think that this is not as fast as direct scan of the full table! So the query optimizer may choose to forgo indexes and go for a full table scan.

Therefore, we should try to build indexes on columns that are highly differentiated and can filter a large number of records through indexes. This will make the query optimizer more inclined to use the secondary index + back table approach.

4. Keep the index type as small as possible

It has to be a case of meeting business requirements first and then optimizing.

For example, if we used bigint for a column and we were sure it wasn’t too big, we could use int instead. In the same type, the larger the storage range, the larger the storage space, such as gender, we only need one bit to indicate, do not underestimate the storage space savings on a single column, when the data volume of the table is large, this space savings is amazing. The smaller the storage space of the index, the more index entries can be placed on a page, and the more efficient I/O operations of the query will be.

5. Index string prefix

It is known that long strings occupy storage space. When a string is used as a search criterion, you can select a prefix of a certain length to build an index. In this way, storage space occupied by index entries is greatly reduced and I/O operation efficiency is improved. For example, if the person_info table has a field description, the average length is about 200 characters, and 90% of the records can be distinguished by comparing the first 20 characters. In this case, the index can achieve roughly the same effect but greatly reduce the storage space occupied by the index.

create index idx_description on person_info(description(20))
Copy the code

6. The indexed column appears separately in the expression

When we use an index, we must ensure that the index appears alone in the expression, otherwise the query will not hit the index. For example, in the student table, we want to query the student whose score is 90.

select * from student where score=90;
Copy the code
select * from student where score+10=100;
Copy the code

The two methods look the same. You can use a secondary index for score, but you can only use a full table scan. Similarly, using an index function such as upper(name) will not allow the index to be used effectively, so use the index in a separate expression.

7. Use overwrite index queries whenever possible

As mentioned earlier, when querying a record through a secondary index, you need to find the primary key corresponding to the record, and then find the full record based on the primary key. Because the information stored in the secondary index is incomplete, its B+ tree stores the index column of the secondary index + primary key. However, if we do not need to find all columns of a record, but only need to query some columns, then we can use the covered index, the so-called covered index is the select column in the B+ tree corresponding to the secondary index, then there is no need to query back to the table.

Select * from student where score+ primary key (id); select * from student where score+ primary key (id); If our query statement looks like this:

select id,score from student where score=80
Copy the code

Select * from secondary index where all columns are in the secondary index. Select * from secondary index where all columns are in the secondary index. What if we want to find the name of the student, and we don’t want to go back to the table? For this use we can use federated indexes.

create index idx_score_name on student(score,name);
Copy the code

At this time, the records in the secondary index are sorted according to score first. The records with the same score are sorted according to name, and those with the same name are sorted according to the primary key ID. At this time we will query:

select id,score,name from student where score=80;
Copy the code

At this time, all the columns in the query statement are in the B+ tree corresponding to score, so we do not need to return to the table to continue the query, but only need to return all the records with score=80.

Since back table operations require two queries to the B+ tree, the efficiency of the query can be greatly improved if we can avoid back table operations by overwriting the index.

Delete duplicate indexes

For example, we created a joint index (score,name) just now, so if we query score separately, do we need to create a secondary index for score? The answer is no! Because the query for indexes follows the leftmost matching principle, we can directly use the joint index established for score and name. The secondary index is sorted by score first, then by name if score is the same, and then by primary key if name is the same. When a query condition is matched, the left column is matched first, and the right column is matched when the left column is the same. Therefore, if you create a joint index idx_score_name for (score,name) and a secondary index idx_score for score, then idx_score is actually redundant, because idx_score_name can meet your requirements. So delete duplicate indexes. Why? Go back and look at the time and space cost of indexing!

This article summarizes a few we need to pay attention to some small problems when using index, feel good please click a thumbs-up, your support and encouragement is the source power of my creation!