What is Elasticsearch? What can it do?

Elasticsearch (ES) is a Lucene-based distributed full-text search engine that specializes in storing, analyzing and searching massive amounts of data. It is an excellent data storage and analysis middleware, widely used for log analysis and full-text search, and many major companies have adopted it as their storage middleware and data analysis platform.

Start with the core concepts

Lucene

Lucene is an open source full-text search toolkit from Apache. It is not a complete full-text search engine but a search engine library: it provides a complete query engine and indexing engine, and it is the foundation on which ES implements full-text search. The core work of indexing documents and searching indexes is done in Lucene.

Core data structure

Document

We often say that ES is document oriented. What does that mean? All data operations in ES, including search and indexing (indexing here means writing data), work on documents. The document is therefore the basic data structure of ES; it is serialized as JSON and stored in ES. So what exactly is a document? You are probably familiar with MySQL, so we can borrow its concepts for comparison: an ES index roughly corresponds to a database and table, and a document roughly corresponds to a row in a table. The analogy is not exact, but it helps in understanding these concepts. Note that the older concept of type has been phased out since the ES 6.x release.
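As a rough illustration, a document can be thought of as a JSON object. The sketch below uses a hypothetical article document (all field names and values are assumptions made up for this example) and serializes it the way a client would before sending it to ES.

```python
import json

# A hypothetical document: field names and values are made up for illustration.
article = {
    "title": "What is Elasticsearch?",
    "author": "Mufeng",
    "tags": ["search", "lucene"],
    "views": 42,
}

# Serialize the document to a JSON string, the form in which it is sent to ES.
source = json.dumps(article, sort_keys=True)
print(source)

# Deserializing the string gives back the same structure.
restored = json.loads(source)
print(restored == article)
```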

Index

In earlier versions of ES there was the concept of a type, analogous to a table in a database, and the documents described above were placed in a type. In later versions, type was gradually removed in order to improve the efficiency of data storage, so an index now plays the role of both database and table. An index is a container of documents, a collection of documents of the same kind. Note, however, that an index is a logical grouping; the actual data is stored in shards, which are the physical storage units.

Also note that "index" has different meanings in ES depending on context: it can be a noun or a verb. As a noun, an index is the collection of documents described above. As a verb, to index means to store document data in ES, that is, to write data.

To hide differences between client languages, ES exposes all of its external interaction through a REST API.
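To make "index" as a verb concrete, here is a minimal sketch that builds (but does not send) a REST request for indexing one document. It assumes a local ES node on the default HTTP port 9200; the index name `article`, the document ID and the helper `build_index_request` are hypothetical names chosen for this example.

```python
import json
import urllib.request

# Assumption: a local ES node listening on the default HTTP port 9200.
ES_URL = "http://localhost:9200"


def build_index_request(index: str, doc_id: str, document: dict) -> urllib.request.Request:
    """Build (but do not send) a request that indexes one document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request("article", "1", {"title": "What is Elasticsearch?"})
print(request.get_method(), request.full_url)

# To actually send it against a running node:
#     urllib.request.urlopen(request)
```

Because the interface is plain HTTP and JSON, the same request could be issued from any language or from a command-line HTTP client.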

Inverted index

We all know that indexes exist to speed up queries. In a relational database without an index, we have to compare rows one by one to find the data, and in the worst case we scan the entire table. MySQL, for example, uses B+ trees as indexes to speed up queries. Now imagine this scene: you are walking down the street on a weekend and suddenly hear a very nice song. You remember two lines of the lyrics and want to pull out your phone and look the song up in QQ Music. If you were a programmer at QQ Music, how would you implement searching for a song by its lyrics?

Why not use a B+ tree as the index here? A full-text index has to support indexing large text fields, and in terms of space a B+ tree is not suitable for that. In addition, every B+ tree search starts from the root node and follows the leftmost-prefix matching principle, whereas full-text searches usually do not match from the left, so the index may not be usable at all. This is where the inverted index comes in handy.

A forward index is like the table of contents of a book: it goes from the document (the page number) to its content. An inverted index is the opposite: it segments the content into terms and records the mapping from each term to the IDs of the documents that contain it. With this term dictionary, full-text retrieval can support both exact and fuzzy queries, which fits the requirements of full-text search well.

The structure of an inverted index consists of two parts: the Term Dictionary and the Posting Lists. The Term Dictionary records the terms that appear in the documents and the mapping from each term to its posting list. A posting list records which documents contain a term, along with other information: the document ID, the term frequency (the number of times the term appears in a document, used to calculate relevance scores), the position, and the offset (used for search highlighting).
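The leftmost-prefix limitation can be sketched with a sorted list standing in for an ordered index. This is only an analogy, not a B+ tree implementation, and the lyric lines are made up: a query that is a prefix of the key can use binary search, while a phrase from the middle of the text forces a scan of every key.

```python
import bisect

# Hypothetical lyric lines, kept sorted like the keys of an ordered index.
lines = sorted([
    "all the stars are shining tonight",
    "hold me close under the stars",
    "stars fall when you are near",
])


def prefix_search(sorted_keys: list[str], prefix: str) -> list[str]:
    """Binary-search the ordered keys; only works when the query is a prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


def contains_search(keys: list[str], phrase: str) -> list[str]:
    """A phrase in the middle of the text forces a scan over every key."""
    return [key for key in keys if phrase in key]


print(prefix_search(lines, "stars"))        # the ordered lookup can be used
print(prefix_search(lines, "the stars"))    # not a prefix of any key: no match
print(contains_search(lines, "the stars"))  # found only by scanning all keys
```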
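The two parts can be sketched with a toy inverted index in Python. This is a simplified illustration, not how Lucene stores its data: the documents are made up, whitespace splitting stands in for a real analyzer, and only document IDs, term frequencies and positions are recorded (no offsets).

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
docs = {
    1: "the stars are shining",
    2: "shining stars and shining moon",
}


def build_inverted_index(documents: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to its postings: doc ID -> positions of the term."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in documents.items():
        # Whitespace splitting stands in for a real analyzer.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


index = build_inverted_index(docs)

# Term dictionary: the sorted set of terms.
term_dictionary = sorted(index)
print(term_dictionary)

# Posting list for one term: doc ID, term frequency, and positions.
for doc_id, positions in index["shining"].items():
    print(doc_id, len(positions), positions)
```

Looking up a term is now a dictionary access followed by reading its posting list, instead of a scan over every document.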

FST

As mentioned above, during full-text retrieval the original data is found through the association between term and docId in the inverted index. There is a problem, however. ES relies on Lucene underneath to implement the inverted index, so when data is written, Lucene generates inverted index entries for every term in the original data, and the inverted index becomes very large. The posting list files of the inverted index are stored on disk. If every query read the inverted index directly from disk and then fetched the original data by docId, it would cause many disk I/Os and seriously hurt full-text retrieval performance. So we need a way to locate a term in the inverted index quickly. Which data structure is best for this? Candidates include HashMap, trie, binary search tree and ternary search tree. Lucene actually uses an FST (Finite State Transducer), which is a kind of finite state machine, to implement this secondary index.

Let’s first look at the structure of a trie. Lucene groups terms in the inverted index that share a common prefix into a block, as shown in the figure below. For example, cool and copy share the prefix "co", and they are organized into a trie (prefix tree), where the corresponding node carries the start address of the block. What are the advantages of a trie over a HashMap? A HashMap only supports exact lookup, while a trie supports exact lookup and, thanks to shared prefixes, prefix-based fuzzy lookup as well. So where can the trie be optimized further?

As shown above, the terms school and cool end with the same characters, so we can compress the structure further by merging the shared suffixes in the original trie. The optimized trie is the FST.

With the FST as a secondary index, a term can be located quickly in the inverted index without many disk I/Os, which greatly improves search efficiency. Note, however, that the FST is kept resident in heap memory and takes up about 50-70% of heap memory, so this is one place where heap usage can be optimized in production.
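The difference between exact lookup and prefix lookup can be sketched with a small trie. This is a generic textbook trie, not Lucene's implementation; the class and method names are chosen for this example, and the block addresses mentioned above are left out.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_term = False


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, term: str) -> None:
        node = self.root
        for char in term:
            node = node.children.setdefault(char, TrieNode())
        node.is_term = True

    def contains(self, term: str) -> bool:
        """Exact lookup, which a HashMap could also do."""
        node = self._walk(term)
        return node is not None and node.is_term

    def with_prefix(self, prefix: str) -> list[str]:
        """Prefix lookup: collect every term below the prefix node."""
        node = self._walk(prefix)
        if node is None:
            return []
        results: list[str] = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            for char, child in current.children.items():
                stack.append((child, text + char))
        return sorted(results)

    def _walk(self, text: str):
        """Follow the characters of text from the root; None if the path ends."""
        node = self.root
        for char in text:
            node = node.children.get(char)
            if node is None:
                return None
        return node


trie = Trie()
for term in ["cool", "copy", "school"]:
    trie.insert(term)

print(trie.contains("copy"))
print(trie.with_prefix("co"))
```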
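The space saving from suffix sharing can be estimated with a short sketch. It builds a plain trie and then counts how many states remain once nodes that accept the same set of suffixes are merged. This only illustrates the suffix-merging idea (a minimal acyclic automaton); a real FST additionally maps each term to an output value, and Lucene's construction algorithm is different. All function names here are hypothetical.

```python
def new_node() -> dict:
    return {"final": False, "children": {}}


def build_trie(terms: list[str]) -> dict:
    """Build a plain trie: one node per distinct prefix."""
    root = new_node()
    for term in terms:
        node = root
        for char in term:
            node = node["children"].setdefault(char, new_node())
        node["final"] = True
    return root


def count_trie_nodes(node: dict) -> int:
    return 1 + sum(count_trie_nodes(child) for child in node["children"].values())


def count_shared_states(root: dict) -> int:
    """Count states after merging nodes that accept the same set of suffixes."""
    registry: dict[tuple, int] = {}

    def state_id(node: dict) -> int:
        # Two nodes are equivalent when they agree on finality and on the
        # (character, target state) pairs of their outgoing edges.
        edges = tuple(sorted(
            (char, state_id(child)) for char, child in node["children"].items()
        ))
        signature = (node["final"], edges)
        return registry.setdefault(signature, len(registry))

    state_id(root)
    return len(registry)


root = build_trie(["cool", "school"])
print(count_trie_nodes(root))     # nodes in the plain trie
print(count_shared_states(root))  # states once shared suffixes are merged
```

For these two terms the shared suffix "ool" is stored once instead of twice, so the merged structure needs fewer states than the trie.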

Cluster-related concepts

To improve the reliability and availability of data storage, ES supports cluster deployment. Even if some nodes in the cluster fail, the cluster as a whole remains available. In addition, horizontal scaling increases the storage capacity of ES.

Node

A node is an instance of ES, which is essentially a Java process; we usually deploy one ES instance per server. Although they are all ES instances, different nodes play different roles in an ES cluster. Data nodes are mainly responsible for storing shard data and play an important role in horizontal scaling. Coordinating nodes forward user requests and merge the results of queries. Master nodes manage and maintain the state of the cluster.

Shard

The storage capacity of a single ES node is limited, so it cannot meet the requirements of massive data storage. How can we store massive data then? One of the core ideas is splitting. Suppose there are 1 billion records in total: if they were all placed on one node, queries and writes would be slow and the node would also be a single point of failure. In traditional relational databases, large amounts of data are spread across multiple database instances. ES adopts a similar design: since data is stored in ES instances, multiple instances are used to store it. The subset of an index's data that lives on each instance is called a shard. As shown in the figure below, the index is divided into three shards, which are stored on three ES instances. To improve data availability, each primary shard in the figure has two replica shards, which are copies of the primary shard's data.

PUT /article
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 3
  }
}

Note that the number of shards should not be chosen arbitrarily; storage capacity should be planned in advance according to the actual production environment. Too many or too few shards will both hurt the overall performance of the ES cluster. If there are too few shards, each shard may hold a large amount of data, which affects retrieval efficiency and horizontal scaling. If there are too many shards, it affects the relevance scoring of search results and the accuracy of retrieval.
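A back-of-the-envelope version of this planning can be sketched as follows. The data volume and target shard size are assumed numbers for illustration only, not recommendations, and the helper names are hypothetical. The one fact relied on is that `number_of_replicas` is counted per primary shard.

```python
import math


def estimate_primary_shards(total_data_gb: float, target_shard_gb: float) -> int:
    """Smallest number of primary shards that keeps each shard at or under the target size."""
    if total_data_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(total_data_gb / target_shard_gb)


def total_shard_copies(primary_shards: int, replicas_per_primary: int) -> int:
    """Every primary shard plus its replicas."""
    return primary_shards * (1 + replicas_per_primary)


def index_settings(primary_shards: int, replicas_per_primary: int) -> dict:
    """Request body for creating an index with the given shard layout."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


# Assumed numbers for illustration only: 600 GB of data, 30 GB per shard.
primaries = estimate_primary_shards(total_data_gb=600, target_shard_gb=30)
print(primaries)
print(total_shard_copies(primaries, replicas_per_primary=2))
print(index_settings(primaries, replicas_per_primary=2))
```

Running the estimate for a few plausible data volumes before creating the index is cheaper than changing the number of primary shards afterwards.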

In conclusion, this article walks through the core concepts of ES, and I hope you now have a preliminary understanding of it. In the next article, we will look at the principles behind the core workflows of ES. Some of the design ideas in ES are also worth learning from, and we can borrow them when designing our own software platforms.


Hi, I’m Mufeng. Thanks for your likes, favorites and comments. See you next time! Search WeChat for "Mufeng technical notes" for regularly updated articles. We also have a study group you can join to prepare for interviews at big tech companies together, along with plenty of learning and interview materials.

A true master always has the heart of an apprentice