01 Preface

Hello, it has been a long time since my last update, because I have been interviewing recently. I spent two weeks preparing and received five offers within three days. In the end I accepted an offer from an Internet unicorn in Guangzhou and started the job just yesterday. Over the past few days I have sorted out the interesting questions I was asked in the interviews, and I would like to take this opportunity to share them with you.

The interviews at this company were interesting. The first round was with someone about my own age, and we chatted for two hours (until my mouth was dry). In the second round, an architect from Ali asked a scenario question:

The database has a string type field, and it stores a URL. How do you design the index?

At the time, I proposed splitting the field. The first half of a URL usually has low selectivity and the second half has high selectivity, so I would store the two parts in separate columns and build the index on the high-selectivity part. The general idea was to raise the selectivity of the indexed value as much as possible.

The interviewer agreed with my direction, but asked whether I had any other plans. I could not answer at the time, so afterwards I looked up some material, and here I share the concrete design options with you.

As usual, a mind map first:

02 Index the entire field

First show the table design:

CREATE TABLE IF NOT EXISTS `t`(
   `id` INT(11) NOT NULL AUTO_INCREMENT,
   `url` VARCHAR(100) NOT NULL,
   PRIMARY KEY ( `id` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Table data:

The question really boils down to: how do you design an index for a string column? You might ask, why not simply execute the following statement?

alter table t add index index_url(url);

This creates an index named index_url whose entries contain the entire url value.

Yes, that works. The following query only needs to scan one row.

select id,url from t where url='javafish/nhjj/mybatis';

However, it wastes storage space. It is only a good fit when the values are short and highly selective (high selectivity is a must anyway; otherwise we would not index the column at all). The longer the field, the more space the index on the whole field takes.

Is there a less space-hungry way? Naturally, MySQL's prefix index comes to mind.

03 Prefix Index

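For the table data above, it is not necessary to index the entire field; a prefix index is enough. You can create it like this (if you already created the full index_url above, drop it first, because index names must be unique within a table):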

alter table t add index index_url(url(8));

At this point, index_url holds only the first 8 characters of each url. Run the same query as before:

select id,url from t where url='javafish/nhjj/mybatis';

Its execution flow looks like this:

  • From the index_url index tree, find the first entry whose value is javafish; it points to ID1. Look up ID1 in the primary key index and fetch the whole row. Its url is not javafish/nhjj/mybatis, so the row is discarded;
  • Take the next entry after ID1 in index_url. Its value is still javafish and it points to ID2, so go to the primary key index again, fetch the whole row, compare the url, and find that it does not match either;
  • Repeat the previous step until the value retrieved from index_url is no longer javafish, and the loop ends. Along the way, the query goes back to the primary key index 6 times, which means it scans 6 rows. This comparison makes it easy to see that using a prefix index may cause a query to read more data.

If we increase the length of the URL prefix index to 10, the same query only needs to scan one row to get the target data.
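The lookup flow above can be imitated in a few lines of Python. This is only a toy model under stated assumptions: the rows in `ROWS` are invented for illustration (they are not the table data from this article), a sorted list stands in for the B+ tree of the prefix index, and `rows_scanned` counts how many times the query goes back to the primary key.

```python
from bisect import bisect_left

# Hypothetical rows (id -> url), invented for illustration only.
ROWS = {
    1: "javafish/aabb/mysql",
    2: "javafish/ccdd/redis",
    3: "javafish/nhjj/mybatis",
    4: "javafish/kkll/spring",
    5: "javafish/ppqq/dubbo",
    6: "javafish/xxyy/kafka",
    7: "javatea/nhjj/mybatis",
}


def build_prefix_index(rows, length):
    """Return a sorted list of (prefix, id) pairs, like a secondary index."""
    return sorted((url[:length], row_id) for row_id, url in rows.items())


def lookup(rows, index, length, target):
    """Equality lookup through the prefix index.

    Returns (matching ids, number of primary key lookups performed).
    """
    key = target[:length]
    pos = bisect_left(index, (key, 0))  # first entry with this prefix
    matches, rows_scanned = [], 0
    while pos < len(index) and index[pos][0] == key:
        row_id = index[pos][1]
        rows_scanned += 1              # go back to the primary key
        if rows[row_id] == target:     # compare the full value
            matches.append(row_id)
        pos += 1
    return matches, rows_scanned


if __name__ == "__main__":
    target = "javafish/nhjj/mybatis"
    for length in (8, 10):
        index = build_prefix_index(ROWS, length)
        print(length, lookup(ROWS, index, length, target))
```

With this invented data, a prefix length of 8 goes back to the primary key 6 times, while a length of 10 does so only once, which mirrors the behaviour described above.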

3.1 Prefix length selection

You have probably noticed it by now: a prefix index with a well-chosen length saves space without adding much extra query cost. Choosing the length is therefore very important. With a small amount of data you can pick the prefix length by eye, but how do you decide when there is a lot of data?

First, use count(distinct ...) to find out how many distinct values the column holds:

select count(distinct url) as L from t;

Then count the distinct values for several candidate prefix lengths in a single query:

SELECT
    count( DISTINCT LEFT ( url, 8 ) ) AS L8,
    count( DISTINCT LEFT ( url, 9 ) ) AS L9,
    count( DISTINCT LEFT ( url, 10 ) ) AS L10,
    count( DISTINCT LEFT ( url, 11 ) ) AS L11 
FROM
    t;

The result is this:

Our principle for choosing the prefix length is: high selectivity + little space. Taking both into account, I would choose 10 as the length of the prefix index.
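The same selection rule can be sketched outside the database. The snippet below is a minimal illustration, not a tuning tool: the `URLS` sample is invented, and the 0.95 threshold is an assumption I picked for the example (the article itself only says "high selectivity + little space").

```python
# Hypothetical sample of url values, invented for illustration only.
URLS = [
    "javafish/aabb/mysql",
    "javafish/ccdd/redis",
    "javafish/nhjj/mybatis",
    "javafish/kkll/spring",
    "javafish/ppqq/dubbo",
    "javafish/xxyy/kafka",
    "javatea/nhjj/mybatis",
]


def prefix_selectivity(urls, lengths):
    """Map each prefix length to distinct prefixes / distinct full values."""
    total = len(set(urls))
    if total == 0:
        raise ValueError("urls must not be empty")
    return {n: len({u[:n] for u in urls}) / total for n in lengths}


def shortest_prefix(urls, lengths, threshold=0.95):
    """Return the shortest length whose selectivity reaches the threshold.

    Returns None when no candidate length is selective enough.
    """
    ratios = prefix_selectivity(urls, lengths)
    for n in sorted(lengths):
        if ratios[n] >= threshold:
            return n
    return None


if __name__ == "__main__":
    print(prefix_selectivity(URLS, [8, 9, 10, 11]))
    print(shortest_prefix(URLS, [8, 9, 10, 11]))
```

On this sample the shortest acceptable length is 10. On real data you would read the counts from the SQL query above rather than load every URL into memory.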

3.2 Deficiencies in Prefix Indexes

The prefix index is useful, but it also has drawbacks. For example, as mentioned above, a poorly chosen length increases the number of rows scanned.

In addition, a prefix index cannot serve as a covering index, so you lose that optimization when tuning SQL. If you are not familiar with covering indexes, you can check out the article “MySQL Index Principles”.

For example, even if the prefix defined in index_url happens to be long enough to hold every URL currently stored, InnoDB still has to go back to the primary key index, because the system cannot tell whether the prefix definition truncated the full value.

This is also a point to consider when choosing a prefix index.

04 Other Methods

All of the URLs above are short and work well with a prefix index. Suppose the URLs suddenly get longer (don’t ask why) and look something like this:

Now the selectivity of the prefix is really poor; the prefix needs to be at least 20 characters long before the selectivity becomes acceptable. But the longer the index, the more disk space it uses, the fewer index values fit on one data page, and the less efficient the search becomes.

So is there another way that does not take up as much space?

Yes, for example storing the value in reverse order, or adding a hash field.

4.1 Storage in reverse order

Let’s start with the first one. When you store a URL, store it reversed. The prefix of the reversed string is highly selective, so you build the prefix index on the reversed value. When querying, apply the reverse function to the input:

Select url from t where url = reverse(' input url string ');
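To see why reversing helps, here is a small sketch. Only the first URL comes from this article; the other three are invented, and Python's slice reversal is used as a stand-in for MySQL's reverse function on plain ASCII strings.

```python
# The first url appears later in this article; the rest are hypothetical.
URLS = [
    "wwww.javafish.top/article/erwt/spring",
    "wwww.javafish.top/article/erwt/mybatis",
    "wwww.javafish.top/article/erwt/redis",
    "wwww.javafish.top/article/erwt/kafka",
]

# What would be stored in the url column under the reverse-order scheme.
REVERSED = [u[::-1] for u in URLS]
STORED = set(REVERSED)


def distinct_prefixes(values, length):
    """Count how many different prefixes of the given length exist."""
    return len({v[:length] for v in values})


def exists(url):
    """Equality lookup: reverse the input, then compare with stored values."""
    return url[::-1] in STORED


if __name__ == "__main__":
    print(distinct_prefixes(URLS, 6))      # prefixes of the original strings
    print(distinct_prefixes(REVERSED, 6))  # prefixes of the reversed strings
    print(exists("wwww.javafish.top/article/erwt/spring"))
```

On this sample, the first 6 characters of the original strings are all identical, while the first 6 characters of the reversed strings are all different, so a short prefix index on the reversed value can already tell the rows apart.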

4.2 Hash fields

Add an integer column to the table to hold a checksum of the URL, and index it.

alter table t add url_crc int unsigned, add index(url_crc);

When inserting, call MySQL’s CRC32 function to calculate the checksum and store it together with the row.

INSERT INTO t (id, url, url_crc) VALUES (7, 'wwww.javafish.top/article/erwt/spring', CRC32('wwww.javafish.top/article/erwt/spring'));

Once that is done, the inserted data looks something like this.

One thing to note, however: every time you insert a new record, you must also call the crc32() function to compute the checksum and fill in the new column. Checksums can also collide.

That is, two different URLs may produce the same crc32() value, so the where clause of the query must also check that the url matches exactly:

Select url from t where url_crc = crc32(' input URL string ') and url = 'input URL string';

In this way, the indexed value shrinks to 4 bytes, which saves storage space and improves query efficiency.
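The two-step check (narrow by checksum, then compare the full url) can be sketched as follows. This is a toy in-memory model, not how InnoDB stores anything: `HashIndexedTable` is a hypothetical class, and I am assuming that Python's `zlib.crc32` computes the same standard CRC-32 as MySQL's CRC32(), which you should verify before relying on it.

```python
import zlib
from collections import defaultdict


def url_crc(url):
    """CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


class HashIndexedTable:
    """Toy table: rows keyed by id, plus an 'index' on the checksum column."""

    def __init__(self):
        self.rows = {}                      # id -> url
        self.crc_index = defaultdict(list)  # url_crc -> [id, ...]

    def insert(self, row_id, url):
        # Every insert must also compute and store the checksum.
        self.rows[row_id] = url
        self.crc_index[url_crc(url)].append(row_id)

    def find(self, url):
        # Narrow down by checksum first, then compare the full url,
        # because two different urls can share one checksum.
        candidates = self.crc_index.get(url_crc(url), [])
        return [i for i in candidates if self.rows[i] == url]


if __name__ == "__main__":
    table = HashIndexedTable()
    table.insert(7, "wwww.javafish.top/article/erwt/spring")
    print(table.find("wwww.javafish.top/article/erwt/spring"))
```

The second comparison in `find` plays the same role as the `and url = ...` condition in the SQL above: it filters out rows that only share the checksum.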

4.3 Comparison between the two

Similarities: neither of them supports range queries.

An index created on a field stored in reverse order is sorted by the reversed string, so it cannot be used for range queries on the original URLs. Similarly, the hash field approach only supports equality queries.

Their differences are mainly reflected in the following three aspects:

  • In terms of extra space, the reverse-order method does not consume extra storage in the primary key index, whereas the hash field method requires an additional column. Of course, a 4-byte prefix is not enough for the reverse-order method, and with a longer prefix the cost is roughly the same as that of the extra hash field.
  • In terms of CPU consumption, the reverse-order method requires an additional call to the reverse function for each write and read, while the hash field method requires an additional call to the crc32() function. If only the computational complexity of these two functions is considered, the reverse function costs slightly less CPU.
  • In terms of query efficiency, queries using the hash field are relatively more stable. Values calculated by crc32() can collide, but the probability is very small, so the average number of rows scanned per query is close to 1. Reverse-order storage still uses a prefix index, which means it may scan more rows.

05 Summary

This article covers four solutions, each with its own advantages and disadvantages. There is no single best one, only the most appropriate one. In development you need to choose according to the business, and the general direction is: increase selectivity and minimize the space used.

  • Create an index on the full field directly, which may take up more space;
  • Create a prefix index, which saves space but increases the number of rows scanned by queries and cannot be used as a covering index;
  • Store the value in reverse order and then create a prefix index, which works around the problem that the prefix of the original string is not selective enough;
  • Create an index on a hash field, which gives stable query performance at the price of extra storage and computation; like the third method, it does not support range scans.

06 References

  • time.geekbang.org/column/article/71492
  • cnblogs.com/Mr-Echo/p/12730797.html
