In today’s article, we continue to explore search engines, and talk to you about the most important part of search engines – the inverted index.

Before introducing inverted indexes, let’s take a look at what an index is. A database index is a sorted data structure in a database management system that facilitates rapid query and updating of data in a database table. We can simply think of an index as a list of items in a dictionary. Let’s say we want to look up a word called “index.” We can quickly find where the letter “I” starts. Indexes are the same, except instead of looking for the first letter of a word, we’re looking for data.

In the previous article introducing search engines, we have said that after the crawler of search engines crawls the text information of web pages, it will segmentation words first and then store them. That is, instead of storing the complete document, you store the keyword information in the document. Obviously, search engines contain a huge number of web pages, and in order to be efficient, we must use indexes.

We call each web page a document, prepare a document Id for it, and then link the keywords in the document through a linked list. So the data structure should look something like this.

In this figure, we query the keyword information contained in the document by the ID of the document. We first look up the corresponding document and then look up the ID in it. This is a query that conforms to our daily thinking, so it is considered as a “forward query”. Therefore, this index structure is called a forward index.

But only forward indexing is not enough. For example, if a user searches “Peking University”, we can get two keywords: “Beijing” and “university”. What we want is to be able to retrieve the corresponding document through these two keywords, at this point, we do not know the document Id. In other words, this is a reverse lookup. To use a dictionary analogy, we want to find the word’s position in the dictionary.

To do this, if we only had forward indexing, we had to go through all the documents and pick out the ones that contained the keywords “Beijing” and “university.” It is also easy to see that this is not desirable. So to solve this problem, we have to create a reverse index that points to documents by keyword. In this way, we can quickly filter out the corresponding document information through keywords.

This inverted index is an inverted index.

With the inverted index, the rest is much easier. We can easily recall all the documents containing keywords according to the keywords, and then calculate the correlation between each document and keywords through the corresponding algorithm, and then we can carry out correlation screening. This is also the relevance filtering mentioned earlier in the introduction of search engines.

The whole inverted index technique should not be difficult to understand, but in practice it is a bit more complicated and involves a lot of optimization. Here’s a look at one of the most widely used optimization solutions.


Optimization in ElasticSearch


When it comes to inverted indexes, you can’t miss ElasticSearch. ElasticSearch is arguably the most widely used open source search engine in the world. ElasticSearch is available from Wikipedia and GitHub to Baidu and Tencent, as well as countless small and medium sized companies. It combines many functions such as search engine, full-text search and structured analysis through a set of systems, and has the advantages of simple configuration and superior performance.

ElasticSearch as a distributed search engine, there are many clever design patterns, plus the complexity of the distributed system itself, there is a lot to explore. Today this article mainly talks about inverted index, so briefly talk about the optimization of inverted index.

As mentioned above, we achieve the purpose of keyword search by setting an inverted index for keywords. Logically, of course, there is no problem, but in practice there is. The biggest problem is that there are too many keywords, and the keywords are not in order, if we want to find one of them, we can only go through the list of all the keywords. This is obviously unacceptable, and there has to be optimization.

One of the simplest optimizations is to sort these keywords, and then create a dictionary. With dictionary, we can quickly search through binary lookup. It looks perfect, but there’s still a problem.

There is no problem with complexity, O(logN) complexity is acceptable, unacceptable is the disk read. Because the dictionary is so large, each disk lookup is expensive, and each random disk read takes 10ms. This is also unacceptable for a high-performance system.

To optimize this answer, we must reduce the number of random reads on the disk.

The best way to reduce disk use is to store data in memory. But as we said, the dictionary is too big to fit in memory. So we must do a simple index to it again, such as:

Keywords beginning with A are stored on page X and keywords beginning with B are stored on page Y…

This is actually the way to look up a dictionary, but the problem is that if the keywords are in English, of course you can do that. However, search engines do not only support English, the keywords may be in various languages, and even a variety of symbols. And even in English, the number of indexes for each letter is not the same. For example, there are many words beginning with the letter S and very few with the letter Z. If this simple operation, in fact, does not necessarily improve the operation efficiency.

So to solve this problem, we need to introduce a data structure called a prefix tree (Trie tree).

A prefix tree would look something like this:

The principle of a prefix tree is not difficult. In fact, it maps strings with the same prefix to the same fork in the tree. The end of each fork records the position of the content corresponding to the prefix. It’s basically mapping the index, which was stored flat, onto the tree. However, the prefix tree does not store all the indexes, only some prefixes of keywords. Using the prefix, we can find a location in the dictionary and then start looking backwards from that location, saving time by avoiding too much random addressing on hard drives.

Here’s an example:

For example, the keyword we want to search is “Access”. Through the prefix tree, we can first find the position of dictionary corresponding to the prefix A, which is also the position of Ada in the figure. Then we go back from Ada until we find Access. By properly building the prefix tree, we can control the overhead of dictionary traversal for optimization purposes.

In addition to this, ElasticSearch has made some optimizations for querying federated indexes and index merging. Because there are other technologies and data structures involved, I won’t go into too much detail here. I will share with you in a future post.

If you feel that you have gained something, feel free to click on it