01 Preface

Hello, it has been a while since my last update, because I was interviewing for jobs. I spent two weeks preparing and received five offers within three days. In the end I accepted an offer from an Internet unicorn in Guangzhou, and I joined the company yesterday. Over the past few days I have been sorting through the interesting questions I was asked, and I would like to take this opportunity to share them with you.

The first interviewer at this company was an interesting young man about my age, and we talked for two hours (until my mouth was dry). The second interview was with an architect from Alibaba, who asked a scenario question:

A database table has a string column that stores URLs. How would you design the index?

My answer at the time was to split the field: the first half of a URL has low selectivity, while the second half has high selectivity. I would store the low-selectivity part and the high-selectivity part in two separate columns and build the index on the high-selectivity column. The general idea I put forward was to make the indexed value as selective as possible.

The interviewer agreed with my direction but asked whether I had any other approaches. I could not answer at the time. After I got back, I looked into it myself, and here I will share the concrete designs with you.

02 Add an index to the entire field

First, the table design:

CREATE TABLE IF NOT EXISTS `t`(
   `id` INT(11) NOT NULL AUTO_INCREMENT,
   `url` VARCHAR(100) NOT NULL,
   PRIMARY KEY ( `id` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Table data:

In essence, the question is how to design an index on a string column. You might ask: why not simply run the following statement?

alter table t add index index_url(url);

I drew a rough diagram. In MySQL, index_url looks like this:

Indeed, that works. The following query needs only a single scan:

select id,url from t where url='javafish/nhjj/mybatis';

However, this approach wastes storage space. It is only suitable when the stored strings are short and selectivity is high enough (the latter is a given; otherwise we would not build an index on a low-selectivity column in the first place). When the field is long, the index inevitably takes up a lot of space.

Is there a less space-hungry way? MySQL's prefix index naturally comes to mind.

03 Prefix Index

MySQL supports prefix indexes: you can define only the leading part of a string column as the index. For example, to index the first 8 characters of url:

alter table t add index index_url(url(8));

At this point, the structure of index_url looks like this:

select id,url from t where url='javafish/nhjj/mybatis';

To execute the same SQL query, the flow looks like this:

Search the index_url tree for an entry whose value is 'javafish'. The first one found is ID1. Look up the row whose primary key is ID1 in the primary key index; its url is not 'javafish/nhjj/mybatis', so this row is discarded.
Take the entry that follows the position just found. Its value is still 'javafish', so take ID2, fetch the whole row from the primary key index, and compare; it still does not match.
Repeat the previous step until the value in index_url is no longer 'javafish', at which point the loop ends.

In this process, six rows are read by going back to the primary key index. The comparison makes it easy to see that a prefix index may increase the number of rows a query has to read.

If we increase the prefix length of the url index to 10, the same query needs to scan only 1 row to find the target data.
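The lookup flow above can be imitated with a small Python simulation. This is only a sketch, not how InnoDB is implemented: the sample rows are invented for illustration (only 'javafish/nhjj/mybatis' comes from the query above), and prefix_index_lookup is a hypothetical helper that counts how many times the primary key index would be visited.

```python
from bisect import bisect_left

# Hypothetical sample rows (id -> url), invented for illustration only.
ROWS = {
    1: "javafish/abcd/spring",
    2: "javafish/bcde/redis",
    3: "javafish/cdef/mysql",
    4: "javafish/nhjj/mybatis",
    5: "javafish/opqr/kafka",
    6: "javafish/stuv/docker",
    7: "otherfish/nhjj/mybatis",
}


def prefix_index_lookup(rows, prefix_len, target):
    """Return (matching ids, rows fetched from the primary key index)."""
    # The secondary index keeps (prefix, id) pairs in sorted order.
    index = sorted((url[:prefix_len], row_id) for row_id, url in rows.items())
    key = target[:prefix_len]
    pos = bisect_left(index, (key,))  # first entry whose prefix is >= key
    matches, fetched = [], 0
    while pos < len(index) and index[pos][0] == key:
        row_id = index[pos][1]
        fetched += 1  # one trip back to the primary key index
        if rows[row_id] == target:
            matches.append(row_id)
        pos += 1
    return matches, fetched


print(prefix_index_lookup(ROWS, 8, "javafish/nhjj/mybatis"))   # ([4], 6)
print(prefix_index_lookup(ROWS, 10, "javafish/nhjj/mybatis"))  # ([4], 1)
```

With these made-up rows, a prefix of 8 fetches six rows to find one match, while a prefix of 10 fetches only the matching row.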

3.1 Prefix length selection

By now you may have noticed it too: a prefix index with a well-chosen length saves space without adding much query cost. The choice of prefix length is therefore especially important. With a small amount of data you can pick the length by eye, but how do you decide when there is a large amount of data?

MySQL supports COUNT(DISTINCT ...), so we can first count how many distinct values the column contains and then compare that with the distinct counts of candidate prefixes:

select count(distinct url) as L from t;

Several prefix lengths can be checked in a single query:

SELECT
	COUNT(DISTINCT LEFT(url, 8)) AS L8,
	COUNT(DISTINCT LEFT(url, 9)) AS L9,
	COUNT(DISTINCT LEFT(url, 10)) AS L10,
	COUNT(DISTINCT LEFT(url, 11)) AS L11
FROM
	t;

The result is this:

We choose the prefix length on two criteria: high selectivity and low space usage. Weighing both, I would choose 10 as the length of the prefix index.
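The same selection can be sketched outside the database. The Python below mirrors the COUNT(DISTINCT LEFT(url, n)) idea under some assumptions: the URL list is made up, the helper names are hypothetical, and the 0.95 threshold is an arbitrary cutoff chosen for this example, not a value from the article.

```python
# Hypothetical sample data, invented for illustration only.
URLS = [
    "javafish/abcd/spring",
    "javafish/bcde/redis",
    "javafish/cdef/mysql",
    "javafish/nhjj/mybatis",
    "javafish/opqr/kafka",
    "javafish/stuv/docker",
    "otherfish/nhjj/mybatis",
]


def prefix_selectivity(urls, lengths):
    """Map each prefix length to (distinct prefixes / distinct urls)."""
    total = len(set(urls))
    return {n: len({u[:n] for u in urls}) / total for n in lengths}


def shortest_good_prefix(urls, lengths, threshold=0.95):
    """Smallest length whose selectivity reaches the threshold, else None."""
    ratios = prefix_selectivity(urls, lengths)
    for n in sorted(lengths):
        if ratios[n] >= threshold:
            return n
    return None


print(prefix_selectivity(URLS, [8, 9, 10, 11]))
print(shortest_good_prefix(URLS, [8, 9, 10, 11]))  # 10
```

Picking the shortest length that clears the threshold matches the two criteria above: enough selectivity, as little space as possible.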

3.2 Deficiency of prefix index

Prefix indexes are useful, but they have drawbacks. For example, as mentioned above, a badly chosen length increases the number of rows scanned.

In addition, with a prefix index you lose the covering-index optimization when tuning SQL. If you are not familiar with covering indexes, see this article: MySQL Index Principles

For example, even if you redefine index_url with a prefix long enough to contain every stored URL in full, InnoDB still has to go back to the primary key index, because it cannot tell whether the prefix index definition has truncated the value.

This is another factor to weigh when deciding whether to use a prefix index.

04 Other Methods

The URLs above are fairly short, so a prefix index works for them. Suppose the URLs suddenly become longer (don't ask why) and look something like this:

Here the prefix selectivity is really low: the prefix has to be longer than 20 characters before selectivity becomes acceptable. But the longer the indexed prefix, the more disk space it occupies, the fewer index entries fit on one data page, and the less efficient lookups become.
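To get a feel for how key length affects the number of entries per page, here is a rough back-of-the-envelope calculation in Python. It assumes InnoDB's default 16 KB page, one byte per character, and a 4-byte INT primary key stored with each entry, and it ignores all record and page overhead, so the numbers are loose upper bounds rather than real measurements. entries_per_page is a hypothetical helper.

```python
PAGE_BYTES = 16 * 1024  # InnoDB's default page size


def entries_per_page(key_bytes, pk_bytes=4, page_bytes=PAGE_BYTES):
    """Rough upper bound on index entries per page, ignoring all overhead."""
    return page_bytes // (key_bytes + pk_bytes)


for key_bytes in (4, 20, 100):
    print(key_bytes, entries_per_page(key_bytes))
# 4 2048
# 20 682
# 100 157
```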

Is there another way to get good selectivity without taking up so much space?

Yes, for example reverse-order storage and a hash field.

4.1 Storage in reverse order

First, store the URLs in reverse order. Once reversed, their prefixes are highly selective, so you can build a prefix index on the reversed value. Use the reverse function when querying:

select url from t where url = reverse('Input URL string');
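Why reversing helps can be seen in a few lines of Python. This is only an illustration: the first URL is the example used later in this article, the other three are made up, and distinct_prefixes is a hypothetical helper. String slicing with [::-1] stands in for MySQL's reverse().

```python
def distinct_prefixes(values, n):
    """Count how many different n-character prefixes the values have."""
    return len({v[:n] for v in values})


# First URL is from the article; the rest are invented for illustration.
URLS = [
    "wwww.javafish.top/article/erwt/spring",
    "wwww.javafish.top/article/erwt/mybatis",
    "wwww.javafish.top/article/qwer/redis",
    "wwww.javafish.top/article/qwer/mysql",
]
REVERSED_URLS = [u[::-1] for u in URLS]  # what would be stored in the column

print(distinct_prefixes(URLS, 8))           # 1: every URL starts the same way
print(distinct_prefixes(REVERSED_URLS, 8))  # 4: the tails differ
```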

4.2 Hash fields

Add an integer column to the table to hold a checksum of the URL, and build an index on it.

alter table t add url_crc int unsigned, add index(url_crc);

When inserting a row, call MySQL's crc32() function to compute the checksum and store it together with the URL.

INSERT INTO t VALUES (7, 'wwww.javafish.top/article/erwt/spring', CRC32('wwww.javafish.top/article/erwt/spring'));

Once the statement runs, the row is stored together with its checksum.

One thing to note, however: the checksum for each new row comes from crc32(), and checksums can collide.

That is, two different URLs may produce the same crc32() value, so the WHERE clause of the query must also compare the url value itself:

select url from t where url_crc = crc32('Input URL string') and url = 'Input URL string'

In this way, the indexed value shrinks to 4 bytes, which saves storage space and improves query efficiency.
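The hash field lookup, including the extra equality check that guards against collisions, can be sketched in Python. This is a simplified model rather than MySQL itself: HashFieldIndex is a hypothetical class, Python's zlib.crc32 stands in for MySQL's CRC32(), and the hash_fn parameter exists only so that a collision can be forced for demonstration.

```python
import zlib
from collections import defaultdict


class HashFieldIndex:
    """Toy model of an index on url_crc plus a check on the full url."""

    def __init__(self, hash_fn=None):
        # Default: 32-bit CRC of the UTF-8 bytes, an unsigned int in Python 3.
        self.hash_fn = hash_fn or (lambda s: zlib.crc32(s.encode("utf-8")))
        self.buckets = defaultdict(list)  # checksum -> [(id, url), ...]

    def insert(self, row_id, url):
        self.buckets[self.hash_fn(url)].append((row_id, url))

    def find(self, url):
        # Like: WHERE url_crc = crc32(?) AND url = ?
        candidates = self.buckets.get(self.hash_fn(url), [])
        return [row_id for row_id, stored in candidates if stored == url]


index = HashFieldIndex()
index.insert(7, "wwww.javafish.top/article/erwt/spring")
print(index.find("wwww.javafish.top/article/erwt/spring"))  # [7]

# Force every URL into one bucket to show why the url comparison is needed.
colliding = HashFieldIndex(hash_fn=lambda s: 0)
colliding.insert(1, "a")
colliding.insert(2, "b")
print(colliding.find("b"))  # [2]
```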

4.3 Comparison between the two

Similarity: neither approach supports range queries.

An index built on a column stored in reverse order is sorted by the reversed strings, so it cannot be used for range queries on the original URL. Similarly, the hash field approach supports only equality queries.
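A tiny Python illustration of the ordering problem, using three made-up URLs: sorting by the reversed string, as an index on the reversed column would, yields a different order from sorting the original URLs, so neighbours in a range of original URLs are not neighbours in the index. order_in_reversed_index is a hypothetical helper.

```python
def order_in_reversed_index(urls):
    """Original URLs listed in the order an index on reverse(url) keeps them."""
    return [r[::-1] for r in sorted(u[::-1] for u in urls)]


# Hypothetical sample data, invented for illustration only.
urls = ["a.com/z", "b.com/y", "c.com/x"]

print(sorted(urls))                   # ['a.com/z', 'b.com/y', 'c.com/x']
print(order_in_reversed_index(urls))  # ['c.com/x', 'b.com/y', 'a.com/z']
```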

Their differences are mainly reflected in the following three aspects:

Extra space: reverse-order storage needs no additional storage on the primary key index, whereas the hash field approach adds a column. That said, a 4-byte prefix is not enough for the reversed column, and a longer prefix costs about as much as the extra hash field.
CPU cost: the reverse-order approach calls reverse() on every write and read, while the hash approach calls crc32(). Judging by the computational complexity of the two functions alone, reverse() consumes slightly less CPU.
Query efficiency: the hash field approach performs more stably. Values computed by crc32() can collide, but the probability is very small, so the average number of rows scanned per query is close to 1. Reverse-order storage still relies on a prefix index, so it can still scan extra rows.

05 Summary

This article covered four approaches, each with its own pros and cons. None of them is best in absolute terms; there is only the most suitable one. In development you have to choose according to the business, and the general direction is to raise selectivity while minimizing space usage.

Create a full index on the column directly; this may take up more space.
Create a prefix index; this saves space but can increase the number of rows scanned, and covering indexes cannot be used.
Store the value in reverse order and create a prefix index on it; this works around insufficient selectivity in the string's prefix.
Create a hash field index; query performance is stable, at the cost of extra storage and computation. Like the third approach, it does not support range scans.

06 References

time.geekbang.org/column/article/71492
cnblogs.com/Mr-Echo/p/12730797.html

Interviewer: How do you index a million urls?
