“All problems in computer science can be solved by another level of indirection.”

– David J. Wheeler

“The world of computing is the art of the trade-off.”

One, Foreword

Recently, several projects I've worked with have used Elasticsearch (ES for short) to store, search, and analyze data, so I spent some time learning ES. This article is adapted from a technology share I gave.

This article will not focus on ES's distributed features or its APIs, but on the question of “how does ES search so quickly?”. This was also the part of ES I was most curious about before learning it.


This article covers the following topics:

  • On search

    • The difference between a traditional relational database and ES

    • Principles of search engines

  • A closer look at the inverted index

    • Posting list -> term dictionary -> term index

    • Compressing the postings list: FOR and Roaring Bitmaps

    • How are joint queries made fast?

Two, About Search

Imagine a search scenario: suppose we want to find an ancient poem whose text contains the word “before”.


What is the difference between a traditional relational database and ES here?

If we used an RDBMS like MySQL to store the ancient poems, we would query them with SQL like this:

select name from poems where content like "%before%";

This method, which we call sequential scanning, has to go through every record and match its content one by one.

Not only is this inefficient, it also falls short of expectations: when we search for a keyword like “ABCD”, we usually also want results that match “A”, “AB”, “CD”, and “ABC”.

As a result, dedicated search engines emerged, such as today's protagonist: ES.

Principles of Search Engines

How a search engine works can be briefly summarized in the following steps:


  • Content crawling and stop-word filtering

    For example, meaningless function words and connectives (stop words such as “the” or “of”) are filtered out

  • Tokenizing the content and extracting keywords

  • Building an inverted index from the keywords (a minimal sketch follows this list)

  • The user enters keywords to search
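
To make the index-building step concrete, here is a minimal, self-contained Java sketch (my own illustration, not Lucene's API; the tokenization is deliberately naive):

import java.util.*;

public class InvertedIndexDemo {
    // term -> sorted list of IDs of the documents containing that term (the postings list)
    private final Map<String, List<Integer>> index = new HashMap<>();

    // Naive whitespace tokenization; a real engine uses analyzers and stop-word filters.
    public void addDocument(int docId, String content) {
        for (String term : content.toLowerCase().split("\\s+")) {
            List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId); // IDs stay sorted because documents arrive in order
            }
        }
    }

    public List<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexDemo idx = new InvertedIndexDemo();
        idx.addDocument(1, "moonlight before my bed");
        idx.addDocument(2, "sunlight strikes the incense burner peak");
        System.out.println(idx.search("before")); // [1]
    }
}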

This brings us to the concept at the heart of today's analysis: the inverted index. It is also the core of ES.

If you know ES, you know it is a wrapper around Lucene: inverted indexes are implemented through the API provided by the Lucene jar, so what follows about inverted indexes is really about Lucene.

Three, The Inverted Index

First, recall our earlier search requirement. Let's see what that query looks like once an inverted index has been built.


This way, as soon as we enter “before”, the inverted index lets us locate the matching poems directly.

This is, of course, only a very simplified sketch of how an inverted index works. What the inverted index in ES actually is and how it is stored: that is where the essence lies.

1. A few concepts

Before we move on, let's define a few concepts we will need.

term

“Keyword” was my own informal word above; in ES, a keyword is called a term.

postings list

Continuing the example above, {Quiet Night Thoughts, Viewing the Waterfall at Mount Lu} is the list corresponding to the term “before”. In ES, this is described as the collection of the IDs of all documents containing a given term. Because integers can be compressed efficiently, integers make the best unique document identifiers in a postings list, so ES maps each stored document to a unique integer ID.

When storing data, ES divides each shard into segments, storage units smaller than a shard, which are merged periodically. Each segment holds at most 2^31 documents, and each document within a segment is assigned a unique ID from 0 to (2^31)-1.


The related terms are described in the official ES documentation; see the references at the end.

2. Index internal structure

The inverted index described above is only a rough model; it is still far from usable in real production.

In a real production scenario such as log analysis, ES's most common use case, how many terms will tokenizing the log content produce?

So how do you quickly find a given term among this mass of terms? Scanning through all of them is clearly impractical.

term dictionary

Hence the term dictionary. To find terms quickly, ES keeps all terms sorted and looks them up by binary search. If this sounds familiar, it is roughly analogous to MySQL indexing, where a B+ tree serves as an index dictionary pointing to the indexed data.
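
As a tiny illustration of the idea (my own sketch, not Lucene code), binary search over a sorted array of terms looks like this:

import java.util.Arrays;

public class TermDictionaryDemo {
    public static void main(String[] args) {
        // The term dictionary: all terms kept in sorted order.
        String[] terms = {"bed", "before", "moonlight", "mountain", "waterfall"};
        // Binary search finds any term in O(log n) comparisons.
        int pos = Arrays.binarySearch(terms, "before");
        System.out.println(pos >= 0 ? "found at " + pos : "not found"); // found at 1
    }
}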

term index

But then again, where should the term dictionary live? In memory, surely? Disk I/O is slow; MySQL indexes, for instance, are read from memory whenever possible.

But what happens if you put the entire term dictionary in memory?

Memory blows up…

Remember that ES indexes all text fields by default, which consumes a great deal of memory, so ES optimizes its index structure heavily: execution must stay efficient while memory use is kept to a minimum.

Hence the term index.

In data-structure terms, the term index is a trie, also known as a dictionary tree or prefix tree: a structure specialized for string matching, built to quickly find a string within a set of strings.

The tree does not contain every term; it contains only some prefixes of terms (shared prefixes are exactly what tries exploit). The term index lets you quickly locate an offset into the term dictionary and then search onward from there, which is what the figure on the right shows. (Think of an English dictionary: you first jump to the first word beginning with “s”, or with “sh”, and scan from there.)
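
Here is a minimal trie sketch in Java (my own illustration; as the next paragraph explains, Lucene's real term index is an FST). Each node maps a character to a child, and the payload is a hypothetical offset into the term dictionary where terms with that prefix begin:

import java.util.HashMap;
import java.util.Map;

public class TermIndexDemo {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int dictOffset = -1; // offset into the on-disk term dictionary, -1 if unset
    }

    private final Node root = new Node();

    // Record that terms starting with this prefix begin at `offset` in the dictionary.
    void addPrefix(String prefix, int offset) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.dictOffset = offset;
    }

    // Walk down the trie as far as the term allows and return the deepest
    // recorded offset: the place in the term dictionary to start scanning.
    int seek(String term) {
        Node n = root;
        int best = 0;
        for (char c : term.toCharArray()) {
            n = n.children.get(c);
            if (n == null) break;
            if (n.dictOffset >= 0) best = n.dictOffset;
        }
        return best;
    }

    public static void main(String[] args) {
        TermIndexDemo idx = new TermIndexDemo();
        idx.addPrefix("s", 1024);
        idx.addPrefix("sh", 2048);
        System.out.println(idx.seek("shore")); // 2048: scan onward from there
    }
}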

Lucene makes two further optimizations here. First, the term dictionary is stored on disk in blocks, and each block is compressed by its common prefix; for example, if every word in a block begins with “Ab”, the “Ab” can be omitted. Second, the term index is held in memory as an FST (Finite State Transducer).

FST has two advantages:

  • A small footprint: reusing the shared prefixes and suffixes of the dictionary's words compresses the storage space
  • Fast lookups: O(len(str)) query time complexity

FST theory is fairly involved and is not covered in detail in this article.

Read on: https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/

OK, now we can get a rough idea of what the Lucene inverted index looks like.


Four, Postings List

In practice, the postings list still has a couple of pain points to address:

  • Uncompressed, a postings list can take up a great deal of disk space
  • Joint queries need fast intersections and unions

Some might question the need for compression: “a postings list already stores nothing but document IDs; does it really need compressing?” But once a postings list holds millions of doc IDs (imagine looking up ancient poems by dynasty), compression becomes necessary. As for why intersections and unions matter: ES is built for search, and joint queries (AND, OR) are everywhere.

Following that order, let's tackle compression first.

1. Compression

Frame of Reference

Lucene requires postings lists to be ordered arrays of integers. This brings the nice benefit that they can be compressed with delta encoding.

For example, take the ID list [73, 300, 302, 332, 343, 372] and convert each ID into its delta from the previous ID (the “previous ID” of the first element defaults to 0, so its delta is itself): the list becomes [73, 227, 2, 30, 11, 29]. In this new list, every value is below 256, so each one needs only a single byte of storage.
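
A quick sketch of delta encoding and decoding (an illustration of the idea, not Lucene's actual implementation):

public class DeltaEncodeDemo {
    // Replace each ID with its difference from the previous one (the previous of the first is 0).
    static int[] encode(int[] ids) {
        int[] deltas = new int[ids.length];
        int prev = 0;
        for (int i = 0; i < ids.length; i++) {
            deltas[i] = ids[i] - prev;
            prev = ids[i];
        }
        return deltas;
    }

    // Prefix-sum the deltas to recover the original sorted IDs.
    static int[] decode(int[] deltas) {
        int[] ids = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] deltas = encode(new int[]{73, 300, 302, 332, 343, 372});
        System.out.println(java.util.Arrays.toString(deltas));         // [73, 227, 2, 30, 11, 29]
        System.out.println(java.util.Arrays.toString(decode(deltas))); // the original IDs
    }
}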

In fact, ES does this even more cleverly:


It splits the documents into blocks of exactly 256 documents each, delta-encodes each block separately, works out how many bits are needed to hold the largest delta in that block, and stores that bit width as a header in front of the block. This technique is called Frame of Reference (FOR).

The figure above is an example from the official ES blog (for illustration, each block holds 3 documents instead of 256).

The steps of FOR can be summarized as follows:


With this final layer of bit packing, integer storage is no longer limited to the four fixed sizes (8, 16, 32, or 64 bits); any width from 1 to 64 bits can be used.
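
Under the assumptions above (blocks of 256 delta-encoded IDs, one bit-width header per block), a simplified sketch of the FOR bookkeeping might look like this; real Lucene packs the bits far more elaborately:

public class FrameOfReferenceDemo {
    static final int BLOCK_SIZE = 256;

    // Bits needed to represent v (at least 1, so a block of zeros still has a width).
    static int bitsRequired(int v) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(v));
    }

    // For each block of deltas, compute the per-block bit width (the block header).
    static void planBlocks(int[] deltas) {
        for (int start = 0; start < deltas.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, deltas.length);
            int maxBits = 1;
            for (int i = start; i < end; i++) {
                maxBits = Math.max(maxBits, bitsRequired(deltas[i]));
            }
            System.out.printf("block %d: %d values x %d bits each + header%n",
                    start / BLOCK_SIZE, end - start, maxBits);
        }
    }

    public static void main(String[] args) {
        planBlocks(new int[]{73, 227, 2, 30, 11, 29}); // one block, 8 bits per value
    }
}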

These methods drastically reduce the space a postings list consumes and improve query performance. But ES does still more to speed up filter queries: caching.

Roaring Bitmaps (for filter cache)

In ES, filters can be used to optimize queries. A filter query only decides whether documents match, without scoring them, and its results can be cached.

For filter queries, ES provides a dedicated cache, the filter cache, which stores the result sets of filters. Cached filters need little memory, retaining only which documents matched the filter, and they can be reused across queries, which greatly improves query performance.
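
For instance, a query whose clauses sit in the filter context can be built with the ES Java client's QueryBuilders like this (a sketch; the field names are made up):

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class FilterQueryDemo {
    public static void main(String[] args) {
        // Clauses in a bool query's filter context match yes/no, are not scored,
        // and their result sets are candidates for the filter cache.
        BoolQueryBuilder query = QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("dynasty", "tang")) // hypothetical field
                .filter(QueryBuilders.rangeQuery("year").lt(800));  // hypothetical field
        System.out.println(query); // prints the query as JSON
    }
}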

The Frame of Reference compression described above works well for on-disk postings lists, but it is not suitable for a filter cache that has to live in memory.

The filter cache holds frequently reused results, and since its whole purpose is speed, it demands a compression scheme that is both compact in memory and fast to operate on.

For this kind of postings list, ES adopts a different compression method. Let's build it up step by step.

First, we know a postings list is an integer array with room for compression.

Given such an array, what is the first compression idea that comes to mind? Represent it with bits, one bit per document: the classic bitmap.

Bitmaps are widely used as indexes in databases, query engines, and search engines, because bitwise operations (AND for intersection, OR for union) are parallelizable and efficient.
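
A minimal bitmap sketch (my own illustration): each doc ID becomes one set bit, and intersection is a word-wise AND:

public class BitmapDemo {
    public static void main(String[] args) {
        long[] a = new long[4]; // one bitmap covers doc IDs 0..255 in 4 longs
        long[] b = new long[4];
        for (int id : new int[]{3, 13, 98, 200}) a[id >> 6] |= 1L << (id & 63);
        for (int id : new int[]{13, 57, 200})    b[id >> 6] |= 1L << (id & 63);

        long[] and = new long[4];
        for (int i = 0; i < 4; i++) and[i] = a[i] & b[i]; // one AND per 64 documents

        for (int id = 0; id < 256; id++) {
            if ((and[id >> 6] & (1L << (id & 63))) != 0) System.out.println(id); // 13, 200
        }
    }
}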

However, bitmaps have an obvious drawback: their memory footprint is constant no matter how many elements are actually present. In other words, they handle sparse data poorly. Mature compression schemes exist for sparse bitmaps, and the one Lucene uses is Roaring Bitmaps.

Let me describe the compression process in a very simplified way:


Split each doc ID into its high 16 bits and low 16 bits. Group by the high 16 bits (the high bits become the key; the value is the set of low 16-bit values sharing those high bits). Then, depending on how many values each group holds, store the low bits in one of two containers:

  • len < 4096: an ArrayContainer, which stores the values directly

  • len >= 4096: a BitmapContainer, which stores the values in a bitmap

The low 16 bits can take at most 2^16 = 65536 distinct values, so a bitmap always costs 65536 bits = 8KB. Storing values directly costs 2 bytes each, so 4096 values cost 2 bytes x 4096 = 8KB. Below 4096 values, then, storing them directly is the more compact choice.
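
A toy version of the container choice (the class and method names are my own; the real logic lives in the Roaring Bitmaps implementation):

import java.util.*;

public class RoaringSketch {
    // high 16 bits -> the sorted low 16-bit values of that chunk
    private final Map<Integer, TreeSet<Integer>> chunks = new HashMap<>();

    void add(int docId) {
        chunks.computeIfAbsent(docId >>> 16, h -> new TreeSet<>())
              .add(docId & 0xFFFF);
    }

    // The container rule: few values -> plain array, many values -> bitmap.
    String containerFor(int high) {
        int size = chunks.getOrDefault(high, new TreeSet<>()).size();
        return size < 4096 ? "ArrayContainer (2 bytes per value)"
                           : "BitmapContainer (fixed 8KB)";
    }

    public static void main(String[] args) {
        RoaringSketch r = new RoaringSketch();
        for (int i = 0; i < 5000; i++) r.add(i); // chunk 0 holds 5000 values
        r.add((1 << 16) + 42);                   // chunk 1 holds a single value
        System.out.println("chunk 0: " + r.containerFor(0)); // BitmapContainer
        System.out.println("chunk 1: " + r.containerFor(1)); // ArrayContainer
    }
}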


The space savings come mainly from:

  • Grouping by the high bits (if the data contains 1,000,000 values sharing the same high 16 bits, those high bits used to cost 1,000,000 x 2 bytes; now a single 2-byte key suffices)

  • Compressing the low bits (the container choice above)

The downside is that bit operations can be slower than on a plain, uncompressed bitmap.

That is the trade-off: the art of balance.

2. Joint queries

With compression covered, let's turn to joint queries.

If the queried filter is already in the filter cache, its cached bitmaps can be combined directly with bitwise operations.

If the queried filter is not cached, skip lists are used to traverse the postings lists on disk.


Above are three postings lists that we need to combine with AND to produce their intersection. First pick the shortest postings list, then probe the other two for each of its elements; the survivors form the intersection. Skip pointers let us jump over elements: once we reach the green 13, for example, we can skip straight past the blue 3, because 3 is smaller than 13.
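
A simplified sketch of this intersection (my advance method stands in for the skip-list jump; Lucene's real iterators expose a similar advance(target) operation):

public class IntersectDemo {
    // Return the index of the first element >= target, starting from `from`.
    // A real skip list makes this jump in far fewer steps than a linear scan.
    static int advance(int[] postings, int from, int target) {
        while (from < postings.length && postings[from] < target) from++;
        return from;
    }

    static void intersect(int[] shortest, int[] other) {
        int j = 0;
        for (int doc : shortest) {
            j = advance(other, j, doc); // leapfrog ahead instead of rescanning
            if (j < other.length && other[j] == doc) System.out.println(doc);
        }
    }

    public static void main(String[] args) {
        int[] a = {13, 98, 200};              // the shortest postings list
        int[] b = {3, 13, 57, 98, 150, 200};
        intersect(a, b);                      // 13, 98, 200
    }
}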

Skip lists bring another benefit. Remember that postings lists are stored on disk with FOR encoding:

Split the documents into blocks of exactly 256 each, delta-encode each block, work out the bits needed for the largest delta, and store that bit width as a header before the block.

Because FOR encoding carries a decompression cost, skip lists save CPU twice over: they skip the cost of traversal, and they let us avoid decompressing the compressed blocks we jump past.

Five, Summary

Let's finish with a technical summary (it feels a bit like Teacher Wang Gang's style 😂)

  • To locate target documents quickly, ES uses the inverted index, trading noticeably more space for significantly better search performance.
  • To locate a term quickly among huge numbers of terms while saving memory and reducing disk I/O, Lucene structures its inverted index as “term index -> term dictionary -> postings list”, with the term index compressed into memory as an FST.
  • To reduce the disk footprint of postings lists, Lucene compresses them with FOR (Frame of Reference), to striking effect.
  • ES filter queries cache their results with Roaring Bitmaps, keeping high-frequency filters fast while limiting storage consumption.
  • For joint queries, if the filter cache holds the results, bitmaps' native bitwise operations yield intersections and unions directly; otherwise skip lists are used to intersect multiple postings lists, skipping both traversal costs and the CPU cost of decompressing skipped blocks.

The overall idea behind Elasticsearch indexing:

Move as much as possible from disk into memory, reduce random disk reads (while taking advantage of sequential reads), combine a variety of compression algorithms, and treat memory with extreme frugality.

When indexing with Elasticsearch, note:

  • Fields that do not need to be indexed must be declared explicitly, since all fields are indexed by default
  • Likewise, string fields that do not need analysis must be declared explicitly, since analysis happens by default
  • Choose regular, ordered IDs; overly random IDs (such as Java UUIDs) hurt query performance

Finally, technology selection always follows the business scenario. Each database has its own problems to solve (or areas of expertise) and its own corresponding data structures, and different use cases and data structures call for different indexes to squeeze the most speed out of queries.

This whole article has been about how Lucene implements the inverted index: counting every bit of memory and disk space and using clever bit tricks to speed up processing. But step back and compare with MySQL, and you will find that although the two indexes are implemented very differently, the underlying trade-off rhymes: a B-tree index is a structure optimized for writes, whereas when fast updates are not required, pre-sorting can buy smaller storage and faster retrieval at the cost of slow updates, which is exactly the bargain ES makes.

I hope you found this article useful.

Reference documentation

  • https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps
  • https://www.elastic.co/cn/blog/found-elasticsearch-from-the-bottom-up
  • http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
  • https://www.infoq.cn/article/database-timestamp-02
  • https://zhuanlan.zhihu.com/p/137574234

– END –