A few days ago, I was asked what to do when a project uses a MySQL database and the load on the database is too high. I answered that the first step is to add a cache, such as Redis, to reduce the pressure. The second step is to limit traffic: use Hystrix to control the flow of requests, and use RabbitMQ for peak clipping so that some requests are handled right away and the rest are queued and handled later. If none of that works, it is time to consider a NoSQL database.

For example, one reason a NoSQL store such as Elasticsearch (ES) is fast is that it uses an inverted index.

Think about how MySQL, a relational database, looks up data: it searches a table, the table's index is a B+ tree, and it finds the data we want inside that B+ tree. We call the data we want to look up a keyword. The B+ tree reduces the time complexity of a search to O(log n), which is a very good result for index-based search, but it can still fall short when the amount of data is very large. The inverted index data structure can satisfy our query. It searches through a mapping from keywords to files (a file here can be understood as a MySQL table). A hash map lookup takes O(1) time on average, so it is faster.

So how does it do that? What is its data structure?

Let’s look at a word-document matrix

This matrix shows the relationship between words and documents. An inverted index is a concrete way of implementing it.
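Since the figure is not reproduced here, a tiny word-document matrix can be sketched in code. The three documents below are made up for illustration and are not the ones from the original figure.

```python
# Hypothetical toy documents, keyed by document ID.
docs = {
    1: "google map founder leaves google",
    2: "google map founder joins facebook",
    3: "facebook hires map team",
}

# Every distinct word, sorted so the rows come out in a stable order.
vocab = sorted({word for text in docs.values() for word in text.split()})
doc_ids = sorted(docs)

# matrix[word][doc_id] is 1 when the word occurs in that document, else 0.
matrix = {
    word: {doc_id: int(word in docs[doc_id].split()) for doc_id in doc_ids}
    for word in vocab
}

# One row per word, one column per document.
for word in vocab:
    print(f"{word:10s}", [matrix[word][d] for d in doc_ids])
```

Reading a row tells you which documents contain a word; reading a column tells you which words a document contains.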

The following are the basic concepts of inverted indexes

Document: A typical search engine processes Internet web pages, but the concept of a Document is broader. It represents any object stored as text, so it covers more forms than web pages: files in formats such as Word, PDF, HTML and XML can all be called documents. An email, a text message or a tweet can also be called a document. In what follows, "document" is used to refer to textual information in general.

Document Collection: A collection made up of several documents is called a Document Collection. For example, the vast number of web pages on the Internet, or a large number of emails, are concrete examples of document collections.

Document ID (DocID): Inside the search engine, each document in the document collection is given a unique internal number. This number serves as the document's unique identifier and makes internal processing easier. It is called the "document number", and DocID is sometimes used below as shorthand for it.

Word ID: Similar to the document ID, a search engine internally uses a unique number to represent each word, which serves as that word's unique identifier.

Inverted Index: An inverted index is a concrete storage form that implements the "word-document matrix". With it, the list of documents containing a given word can be obtained quickly. The inverted index consists of two main parts: the "word dictionary" and the "inverted file".

Word dictionary (Lexicon): The usual unit of indexing for a search engine is the word. The word dictionary is the collection of strings made up of all the words that have appeared in the document collection. Each entry in the word dictionary holds some information about the word itself and a pointer to its inverted list.

Inverted list (PostingList): An inverted list records all the documents in which a word appears, together with the positions at which the word appears in each document. Each record is called a Posting. From the inverted list you can tell which documents contain a word.

Inverted file: The inverted lists of all words are usually stored sequentially in a file on disk, known as the inverted file. The inverted file is the physical file that stores the inverted index.
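To tie these concepts together, here is a minimal in-memory sketch. It assumes whitespace-separated English text, and the names `lexicon`, `postings` and `lookup` are hypothetical names chosen for this example. A real engine would write the inverted lists to an inverted file on disk instead of keeping them in a dict.

```python
from collections import defaultdict

# Hypothetical document collection: DocID -> text.
docs = {1: "the cat sat", 2: "the dog sat down", 3: "cat and dog"}

lexicon = {}                  # word dictionary: word -> word ID
postings = defaultdict(list)  # word ID -> inverted list of (DocID, positions)

for doc_id in sorted(docs):
    positions = defaultdict(list)
    for pos, word in enumerate(docs[doc_id].split()):
        # Assign the next free word ID the first time a word is seen.
        word_id = lexicon.setdefault(word, len(lexicon))
        positions[word_id].append(pos)
    # One posting per (word, document) pair, holding the word's positions.
    for word_id, pos_list in positions.items():
        postings[word_id].append((doc_id, pos_list))

def lookup(word):
    """Return the inverted list for a word, or an empty list if unknown."""
    word_id = lexicon.get(word)
    return postings[word_id] if word_id is not None else []

print(lookup("cat"))
print(lookup("dog"))
```

Because documents are processed in DocID order, each inverted list ends up sorted by DocID.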

Now that we understand the concept, let’s look at the inverted index in detail through the diagram

The figure above shows the document number and content

The figure above shows the words extracted from the document content. For each word, the inverted list stores the set of document numbers in which it appears. This is where word segmentation comes in: the words live inside the documents, so to pull them out the text must first be segmented into words. Which segmenter to use depends on the specific project, so I won't say more about it here.
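As a rough illustration of segmentation followed by indexing, the sketch below uses a deliberately naive tokenizer that lowercases the text and keeps runs of letters and digits. This is an assumption made for the example. Languages without spaces between words, such as Chinese, need a dedicated segmenter.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive segmentation: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical documents: DocID -> raw content.
docs = {
    1: "Google Maps founder leaves Google.",
    2: "Founder joins Facebook!",
}

# word -> set of DocIDs containing that word
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in tokenize(text):
        index[word].add(doc_id)

print(tokenize(docs[1]))
print(sorted(index["founder"]))
```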

In the figure above, each word ID is mapped to the documents that contain it, but that is not enough. We also need to record how often the word appears in each document.

The chart above shows how often words appear in a document, which allows search engines to rank content.
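A minimal sketch of ranking by frequency follows. It orders hits by raw term frequency only, which is a simplification made for this example. Real engines combine frequency with other signals, and the documents here are made up.

```python
from collections import Counter, defaultdict

# Hypothetical documents: DocID -> text.
docs = {
    1: "google map google search",
    2: "google map",
    3: "facebook search",
}

# word -> {DocID: number of times the word appears in that document}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word, tf in Counter(text.split()).items():
        index[word][doc_id] = tf

def search(word):
    """Return DocIDs containing the word, highest frequency first."""
    hits = index.get(word, {})
    # Ties are broken by DocID so the order is deterministic.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))

print(search("google"))
```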

There is one more problem to consider. If the words are simply stored in the form shown above, finding a given word is still slow, so what data structure should be used to index the words themselves? This is the word dictionary concept mentioned above.

Several data structures can be used for the word dictionary, including a hash table with linked lists and tree structures, and all of them speed up word lookup.
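The two families can be compared with a small sketch. A Python `dict` stands in for the hash form, and a sorted list searched with `bisect` stands in for the tree form. The terms and offsets are hypothetical, with each offset imagined as the position of that word's inverted list in the inverted file.

```python
import bisect

# Hypothetical dictionary entries, kept in sorted order.
terms = ["apple", "apply", "banana", "band", "cat"]
offsets = [0, 40, 96, 120, 200]

# Hash form: O(1) average lookup, but no ordering between words.
hash_dict = dict(zip(terms, offsets))

def sorted_lookup(word):
    """Ordered form: binary search, O(log n) comparisons."""
    i = bisect.bisect_left(terms, word)
    if i < len(terms) and terms[i] == word:
        return offsets[i]
    return None

def prefix_terms(prefix):
    """Ordered structures can also enumerate all words sharing a prefix."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(hash_dict["band"], sorted_lookup("band"))
print(prefix_terms("app"))
```

The hash form wins on exact lookups, while the ordered form keeps neighbouring words together, which helps with prefix and range queries.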

The above is the data structure of the inverted index used by ES. What sets ES apart is that it is built on distributed storage: its data can be spread across many shards. The number of shards can be chosen to suit the data and traffic volume, so no single shard has to hold so much data that queries become slow.
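The idea of spreading documents across shards can be sketched as a hash of the document ID taken modulo the shard count. `zlib.crc32` is a stand-in chosen for this example and is not the hash function ES itself uses. `NUM_SHARDS` and the documents are also hypothetical.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(doc_id: str) -> int:
    """Pick a shard deterministically from the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

# Hypothetical documents: ID -> body.
docs = {f"doc-{n}": f"body of document {n}" for n in range(10)}

# Each shard holds only the documents routed to it.
shards = [dict() for _ in range(NUM_SHARDS)]
for doc_id, body in docs.items():
    shards[shard_for(doc_id)][doc_id] = body

for n, shard in enumerate(shards):
    print(n, sorted(shard))
```

Because the shard is computed from the ID alone, any node can work out where a document lives without asking the others.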

Now let’s see how ES handles a query. Let’s look at a diagram.

Since ES is distributed, it runs on multiple nodes for high availability. If a node fails, replica shards, which hold copies of the data, are used to prevent data loss.

The query flow of ES is as follows:

1 A request arrives at node A, and node A forwards it to every shard. Each shard returns the IDs of its matching data; each ID is a string

2 Node A sorts and paginates the results based on the returned IDs, and then queries the relevant shards by ID

3 Each shard looks up the corresponding document data by ID and returns it to node A
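The three steps above can be simulated in a few lines. This is a sketch, not ES code. The shards are plain dicts, and the relevance score attached to each ID is an assumption added so that node A has something to sort by.

```python
import heapq

# Hypothetical shards: doc ID -> (score, document body).
shards = [
    {"a1": (0.9, "doc a1"), "a2": (0.4, "doc a2")},
    {"b1": (0.7, "doc b1"), "b2": (0.95, "doc b2")},
    {"c1": (0.1, "doc c1")},
]

def query_phase(shard_no, size):
    """Step 1: a shard returns only (score, ID, shard number) for its top hits."""
    shard = shards[shard_no]
    top = heapq.nlargest(size, shard, key=lambda doc_id: shard[doc_id][0])
    return [(shard[doc_id][0], doc_id, shard_no) for doc_id in top]

def fetch_phase(shard_no, doc_id):
    """Step 3: a shard returns the full document for an ID."""
    return shards[shard_no][doc_id][1]

def search(page, size):
    """Play the role of node A. Pages are numbered from 1."""
    wanted = page * size
    candidates = []
    for shard_no in range(len(shards)):
        candidates.extend(query_phase(shard_no, wanted))
    # Step 2: sort the merged IDs and keep only the requested page.
    candidates.sort(key=lambda hit: -hit[0])
    page_hits = candidates[wanted - size:wanted]
    return [fetch_phase(shard_no, doc_id) for _, doc_id, shard_no in page_hits]

print(search(page=1, size=2))
print(search(page=2, size=2))
```

Note that every shard must return its top `page * size` hits, because node A cannot know in advance which shard holds the best results.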

Mysql > select * from mysql;