I. Basic information

1. Introduction

This section introduces Lucene, Solr, and ElasticSearch. Apache Lucene is not a search engine but an open-source search toolkit that provides indexing and query capabilities. Its goal is to give developers a simple, easy-to-use kit with which to add full-text search to a target system, or on which to build a complete full-text search engine. Solr is a high-performance, enterprise-class search engine built on Apache Lucene. It is highly available, easy to scale, and fault-tolerant, and it provides distributed index storage, replication, load-balanced queries, and centralized configuration. ElasticSearch is a distributed search engine, also based on Apache Lucene, that doubles as a real-time document store in which every field of a document can be searched; it can handle petabytes of structured and unstructured data. Elasticsearch is the most popular enterprise search engine because it is distributed out of the box and easy to install, whereas Solr relies on third-party components for distribution, for example ZooKeeper for distributed coordination and management.

2. Terminology

To make the concepts easier to grasp, the introduction below draws an analogy between Index, Type, Document, and Field and their relational-database counterparts. Note that this is only an analogy to help readers new to search engines; in reality the concepts differ substantially.

1)Index

An index is a collection of documents with a similar structure, for example a product index or an order index. A product index might store thousands of product records, and an order index thousands of orders. In this respect an index is similar to a database in a relational database system.

2) Type

A type is a logical partition (category) within an index; each index can have one or more types. For example, in a product index, clothing and travel packages are both sellable goods, but because their attributes differ greatly they can be defined as two types; alternatively, you can define one type per category. For instance: a clothing type with fields name, price, and description; a travel-package type with fields name, price, description, and itinerary. In this respect a type is similar to a table in a relational database.

3) Document & Field

A document is a piece of instance data, such as a product or an order, usually represented as a JSON structure. A document contains multiple fields, each holding one data value. In this sense a document is like a row in a relational database, and a field is like a column. Here is a simple document example:

    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "_seq_no": 0,
      "_primary_term": 1,
      "found": true,
      "_source": {
        "user": "kimchy",
        "post_date": "2009-11-15T13:12:00",
        "message": "Trying out Elasticsearch, so far so good?"
      }
    }

4)Term

The smallest unit of storage and query in an index. For English text a term is a word; for Chinese it is usually a token produced by word segmentation.

5)Term Dictionary

Also called the dictionary, it is the collection of all terms. Search engines usually index at word granularity, so the dictionary is the set of strings formed by every word that appears in the document collection. Each entry in the dictionary records some information about the word itself plus a pointer to its posting list.

6) Forward data

A general search-engine term for the raw, document-ordered data; it can be understood as a list of docs keyed by doc id.

7)Inverted Index

The core structure of a Lucene index: it implements a mapping from term to doc list, that is, it establishes a mapping between terms and documents. In an inverted index, data is organized around terms rather than documents.

8) Posting List

A document is usually made up of many words, and a posting list records which documents a particular word appears in and where. Each record is called a posting. Posting lists record not only document ids but also information such as term frequency.

9) Inverted File

An inverted list of all words is usually stored sequentially in a file on disk called an inverted file, which is the physical file that stores the inverted index.

II. The indexing principle

1.Inverted index structure

Suppose there are three documents with the following contents: Doc1: "I love China." Doc2: "I love work." Doc3: "I love coding." To build an inverted index, the content of each document is first split by an analyzer into separate words (which we call terms), a sorted list of all distinct terms is created, and then for each term we record which documents it appears in. The result is as follows:

| Term   | Doc1 | Doc2 | Doc3 |
| ------ | ---- | ---- | ---- |
| I      | Y    | Y    | Y    |
| China  | Y    |      |      |
| coding |      |      | Y    |
| love   | Y    | Y    | Y    |
| work   |      | Y    |      |

As the table shows, this structure consists of a list of all distinct words in the document collection, each associated with the docs it appears in. A structure like this, which locates records by attribute value, is called an inverted index; the physical file that stores it is called an inverted file.

According to the above, the structure of the inverted file is as follows:



The term dictionary and the posting lists are the two key data structures in Lucene and the cornerstone of fast retrieval.
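The steps above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's actual implementation: the tokenizer here just lowercases, strips periods, and splits on whitespace, standing in for a real analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns a sorted term dictionary mapping
    each term to the sorted list of doc ids it appears in (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # stand-in analyzer: lowercase, strip periods, split on whitespace
        for term in text.lower().replace(".", "").split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {1: "I love China.", 2: "I love work.", 3: "I love coding."}
index = build_inverted_index(docs)
# index["love"] -> [1, 2, 3], index["china"] -> [1]
```

Note that the analyzer lowercases terms, so "China" is stored as "china"; real analyzers normalize case the same way.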

2. Search

1)Create an inverted index

After a brief introduction to the structure of an inverted index, an example is used to show how an inverted index can be searched efficiently. Suppose you have a table of student scores with the following data:

| id | name  | gender | score |
| -- | ----- | ------ | ----- |
| 1  | Fern  | female | 80    |
| 2  | Alice | female | 70    |
| 3  | Bunny | male   | 90    |
| 4  | Aimee | female | 70    |
| 5  | Amy   | female | 80    |
| 6  | tracy | female | 90    |

After data is stored in ES, the following inverted index will be established:

Name:

| Term  | Posting List |
| ----- | ------------ |
| Aimee | [4]          |
| Alice | [2]          |
| Amy   | [5]          |
| Bunny | [3]          |
| Fern  | [1]          |
| tracy | [6]          |

Gender:

| Term   | Posting List    |
| ------ | --------------- |
| female | [1, 2, 4, 5, 6] |
| male   | [3]             |

Score:

| Term | Posting List |
| ---- | ------------ |
| 70   | [2, 4]       |
| 80   | [1, 5]       |
| 90   | [3]          |

The tables above show the relationship between terms and the inverted index: when we search for records with score=70, the inverted index lets us find the matching doc ids [2, 4] directly and then return the corresponding records.
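That lookup can be sketched as follows. This is a hypothetical illustration (the dict layout and function names are mine, not ES internals): each field's inverted index maps a term to its posting list, and a compound query intersects the lists.

```python
# inverted indexes for the score and gender fields, mirroring the tables above
inverted = {
    "score": {70: [2, 4], 80: [1, 5], 90: [3]},
    "gender": {"female": [1, 2, 4, 5, 6], "male": [3]},
}

def search(field, term):
    """Return the posting list for a single term (empty if absent)."""
    return inverted[field].get(term, [])

def search_and(*postings):
    """Intersect posting lists, i.e. an AND query over several conditions."""
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    return sorted(result)
```

search("score", 70) returns [2, 4] with no row scan; search_and(search("score", 70), search("gender", "female")) narrows an AND query the same way.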

2) Term Dictionary

A dictionary is a collection of entries. How does ES ensure efficient lookup when there are many entries?

MySQL faces a similar problem: with many records, how do you find the one you need quickly? Its answer is the B+ tree index. ES takes a comparable approach for the dictionary: sort all the terms, then use binary search, reducing lookup time to O(log N).

Suppose the value of name is as follows:

Fern,Alice,Bunny,Aimee,Amy,tracy

Sorted:

Aimee,Alice,Amy,Bunny,Fern,tracy
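The sorted dictionary can then be searched by bisection, the O(log N) lookup described above. A minimal sketch using Python's stdlib bisect:

```python
import bisect

terms = ["Aimee", "Alice", "Amy", "Bunny", "Fern", "tracy"]  # sorted dictionary

def find_term(term):
    """Binary search: return the term's position in the dictionary, or -1."""
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1
```

Note that "tracy" sorts after "Fern" because in byte order lowercase letters come after uppercase ones.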

MySQL indexes are stored on disk, whereas the ES dictionary is kept in memory. That solves the lookup-efficiency problem but creates another: when there are very many terms, memory cannot hold them all. ES addresses this with a Term Index.

In the example above the terms are English strings, but in reality a term can be any byte array. Moreover, terms are often unevenly distributed; as above, no terms start with "c" while many start with "a". Lucene's internal term index uses a variant of the trie (prefix tree / dictionary tree / word search tree): the FST (finite-state transducer). A trie shares only prefixes, while an FST shares both prefixes and suffixes, saving even more space.

The workings of an FST are quite involved; here it is enough to understand its core role: finding terms efficiently.

The term index lets you quickly locate an offset in the term dictionary and then search from there. In addition, thanks to compression techniques (search for "Lucene Finite State Transducers"), the term index can be only a small fraction, on the order of a few dozenths, of the size of the full term set, making it feasible to cache the whole term index in memory. The overall effect looks like this:

As the figure shows, the term index is used to locate terms quickly, much like the thumb index of a dictionary or address book: the letter W, for example, takes you straight to all entries starting with W, from which you find the exact character or name. This term index is one reason ES retrieval can be faster than MySQL's. MySQL has only a term dictionary, stored on disk as a B+ tree, so retrieving a term costs several random disk accesses. Lucene adds a term index on top of the term dictionary and caches it in memory as a tree: after the term index yields the block position in the term dictionary, a single trip to disk fetches the term, greatly reducing random disk access.
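The two-step lookup can be illustrated with a deliberately simplified term index. A real Lucene term index is an FST that shares prefixes and suffixes; the toy version below indexes only the first letter of each term (all names here are mine), but it shows the same shape: a small in-memory index locates a block, then the (on-disk) dictionary is scanned from that offset.

```python
terms = ["Aimee", "Alice", "Amy", "Bunny", "Fern", "tracy"]  # on-disk dictionary

def build_term_index(sorted_terms):
    """Tiny in-memory index: first letter -> offset of its first term."""
    idx = {}
    for offset, term in enumerate(sorted_terms):
        idx.setdefault(term[0], offset)
    return idx

term_index = build_term_index(terms)  # {'A': 0, 'B': 3, 'F': 4, 't': 5}

def lookup(term):
    """Two-step lookup: the term index gives the block start, then scan the block."""
    start = term_index.get(term[0])
    if start is None:
        return None  # no term starts with this letter (e.g. 'C')
    for candidate in terms[start:]:
        if candidate == term:
            return candidate
        if candidate[0] != term[0]:
            break  # left the block without a match
    return None
```

The uneven distribution mentioned above is visible here: the "A" block holds three terms while "C" has no block at all, so a miss on "C" costs zero disk reads.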

III. The cluster

1. Index

An index in Elasticsearch is roughly analogous to a database in MySQL: it is the top-level logical container in which data is stored.

The shard is the smallest unit of Elasticsearch data storage: an index's capacity is the sum of its shards, and the cluster's capacity is the sum of all its indexes.

2. Node

ES is clustered natively: nodes join the same cluster simply by setting the same cluster.name, and they share the cluster's data and load. A node is a running instance of Elasticsearch whose name is set by node.name; if unset, a random universally unique identifier is assigned at startup. An ES cluster has three node roles: master node, data node, and coordinating node.

1)The master node

Responsible for all cluster-level changes, such as creating or deleting indexes; tracking which nodes are part of the cluster (adding or removing nodes); and deciding which shards are allocated to which nodes. Eligibility is controlled by the node.master property: node.master=true means the node can be elected master.

2)Data nodes

Data nodes store data and perform data operations: create, delete, update, query, and aggregate. They have the highest machine requirements, consuming large amounts of CPU, memory, and I/O. By default every node is a data node (including the master); this is controlled by the node.data property, where node.data=true means the node stores data.

3)Coordination node (or client node)

If both node.master and node.data are false, the node is a coordinating node. A coordinating node only handles request routing, search reduction, bulk-index distribution, and so on; its purpose is load balancing under heavy request volume. By default node.master=true and node.data=true, meaning a node is both a data node and master-eligible.

4)Best practices

Master node: an ordinary server (modest CPU and memory consumption). Data node: mainly consumes disk and memory. Client/ingest node: an ordinary server (if heavy grouping and aggregation is expected, give these nodes more memory).

3.Shard

Sharding is what gives the cluster horizontal scalability; it is how ES is distributed. Each shard is a container for data: documents are stored and indexed in shards, and shards are distributed across the nodes of the cluster, but shards are transparent to clients, who never see them. When the cluster grows or shrinks, ES automatically migrates shards between nodes so that data stays evenly distributed. There are two kinds of shard: primary shards and replicas. Every document in an index belongs to exactly one primary shard, so the number of primary shards determines the maximum amount of data the cluster can hold. A replica shard is simply a copy of a primary shard; replicas provide high availability and read concurrency, for example protecting against data loss on hardware failure while also serving reads.

1)The cluster model

The figure below shows an ES cluster of three nodes (Node1, Node2, Node3), where Node1 is the master and all three are data nodes. The cluster holds nine shards: three primaries (P0, P1, P2) and six replicas (R0, R1, R2), two replicas per primary.

2) Routing

All writes go to a primary shard first; only after the primary has processed the write successfully is it replicated to the corresponding replica shards, and by default the response returns to the client only after all replicas have completed and updated their indexes. With multiple primary shards in a cluster, which rule decides the shard a document is written to? ES routes data with the following formula:

shard = hash(routing) % number_of_primary_shards

routing is a variable that defaults to the document's _id and can be set to a custom value. Anyone with a little Java experience will recognize the approach: HashMap, Redis Cluster, and sharded databases and tables all use similar strategies. Note: the number of primary shards must be fixed at index creation time and never changed, because if number_of_primary_shards changes, the value computed above changes and the data is effectively scrambled until fully migrated. The same issue exists in everyday middleware such as Redis: if cluster capacity really must change (nodes added or removed), data must be migrated, which is why Redis clustering commonly uses consistent hashing to keep migration as small as possible. Because ES routes with shard = hash(routing) % number_of_primary_shards, any node can compute where a document lives in the cluster, so every node can handle read and write requests.
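The routing formula can be exercised directly. A small sketch: real ES hashes the routing value with Murmur3; zlib.crc32 below is only a deterministic stand-in so the example stays self-contained.

```python
import zlib

def route(routing_value, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (crc32 as a stand-in)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# The same _id always routes to the same shard, and any node can compute it.
# Changing number_of_primary_shards changes the result for almost every id,
# which is why the primary count is fixed at index creation.
```

This also makes the migration problem concrete: route("doc-1", 3) and route("doc-1", 5) generally differ, so resizing the primary count would strand existing documents on the wrong shard.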

3) Write operations

After a write request reaches a node, that node uses the routing formula to compute the target shard and forwards the request to that shard's primary. Once the primary has processed the request successfully, it forwards the write to the replica shards. The flow is shown below:

The specific process is as follows:

  • Suppose Node1 receives a request to create an index.
  • Node1 calculates which shard data falls to based on shard=hash(routing)% number_of_primary_shards. Let’s say the calculated shard is equal to 1.
  • Because shard=1’s primary shard P1 is on Node3, Node1 forwards write requests to Node3.
  • If Node3 succeeds, it forwards the write request to Node1 and Node2, because P1's two replicas R1 are on Node1 and Node2.
  • Once all R1 replicas report success, Node3 returns success to the requesting Node1, which finally reports to the client that the index operation succeeded.

4) Read operations

After a read request reaches a node, that node uses the routing formula to compute which shard holds the data, picks one copy of that shard in a load-balanced way, and forwards the read to it. The process is as follows:

The specific process is as follows:

  • Suppose Node1 receives a read request.
  • Node1 uses shard=hash(routing)% number_of_primary_shards to calculate which shard data falls to. Let’s say the calculated shard is equal to 1.
  • Shard 1 has three copies: primary P1 on Node1 and replicas R1 on Node2 and Node3. Node1 picks one copy by load balancing, say the R1 on Node2, and forwards the read request to Node2.
  • After Node2 finishes processing, the result is returned to Node1, which returns the data to the client.

5) Update operations

ES implements an update as a delete plus an add, as detailed in the Storage section below.

4.Cluster health

The number of primary shards is fixed at index creation, but the number of replicas can be changed at any time. Before version 7.0 an index defaulted to five primary shards; since 7.0 the default is one, and an index can have any number of replicas. The states of the primary and replica shards determine the cluster's health. A primary shard and its replica are never allocated to the same node. If the cluster has only one node, replicas cannot be allocated at all; the cluster health is then YELLOW and data loss is possible.
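The health rules above can be summarized in a small function. This is an illustrative model of the green/yellow/red logic, not the actual ES implementation:

```python
def cluster_health(shards):
    """shards: list of (is_primary, is_assigned) pairs for every shard copy.
    red: some primary unassigned; yellow: primaries ok but a replica is not;
    green: every copy assigned."""
    if any(primary and not assigned for primary, assigned in shards):
        return "red"
    if any(not primary and not assigned for primary, assigned in shards):
        return "yellow"
    return "green"

# single-node cluster: the primary is assigned, the replica cannot be -> YELLOW
cluster_health([(True, True), (False, False)])
```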

IV. Storage

1.Storage location

ES indexes live entirely on disk, so data survives a power failure or restart. The storage paths are configured in elasticsearch.yml:

    path.data: /path/to/data   (index data)
    path.logs: /path/to/logs   (log files)

By default both live in the data folder of the installation directory. Do not keep the default: an in-place ES upgrade could then wipe all data.

2. Segmented storage

1)Segment

Indexed documents are stored on disk in segments, each holding its own inverted index. Lucene uses segmented storage at the bottom layer, which almost entirely avoids locking on reads and writes and greatly improves performance. Data written to Lucene does not go straight to disk: it is first written to memory, and after the refresh interval Lucene flushes everything written in that period into a new segment. As segments accumulate, Lucene merges them into larger segments. A query visits every segment in turn to produce its result.

2)Invariant of segments

Segments are immutable: once index data is written to disk it cannot be modified. When a segment is written to disk, a commit point is generated; a segment with a commit point is read-only and has lost write permission. Conversely, data still in memory is write-only and cannot yet be retrieved. Under these rules, new documents are easy to handle: after the specified interval they are persisted to disk as a new segment. Since a segment cannot be modified, deletion does not remove the document from its old segment; instead a .del file lists the doc ids that have been deleted. A document marked as deleted can still match a query, but it is filtered out of the result set before the final result is returned (similar to Redis's passive expiration strategy). Likewise, because segments cannot be modified, an update is treated as a delete plus an add: the old document is marked in the .del file, and the new version is saved to a new segment.
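A toy model of these rules, with immutable segments plus deletion markers. The class and method names are mine, and the marker set is a simplification of Lucene's per-segment .del bitmaps, but the behavior matches the text: delete only marks, update is delete + add, and search filters marked copies out.

```python
class SegmentedIndex:
    """Immutable segments + ".del" markers: delete marks, update = delete + add."""
    def __init__(self):
        self.segments = []    # each segment is a dict {doc_id: doc}, never modified
        self.deleted = set()  # the ".del" markers: (segment_no, doc_id) pairs

    def add(self, doc_id, doc):
        self.segments.append({doc_id: doc})  # new docs always go to a new segment

    def delete(self, doc_id):
        for seg_no, seg in enumerate(self.segments):
            if doc_id in seg:
                self.deleted.add((seg_no, doc_id))  # mark; never rewrite the segment

    def update(self, doc_id, doc):
        self.delete(doc_id)    # old version marked in ".del"
        self.add(doc_id, doc)  # new version saved to a new segment

    def get(self, doc_id):
        # newest segment first; deleted copies are filtered from the result
        for seg_no in range(len(self.segments) - 1, -1, -1):
            seg = self.segments[seg_no]
            if doc_id in seg and (seg_no, doc_id) not in self.deleted:
                return seg[doc_id]
        return None
```

After an update, the old copy still physically exists in its original segment; only the marker hides it, which is exactly the space cost discussed below.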

3) Pros and cons of segment immutability

  • Advantages: no locks, a simple design, and high read/write performance. Because segment files never change, once an index file enters the file-system cache it keeps being hit as long as the cache has room, which is fast. Mutable data files must keep index content and data consistent whenever data changes; immutable segments never face that problem.
  • Disadvantages: documents are only marked in .del files, which can waste a lot of space when data volume is large. Each refresh (controlled by refresh_interval) produces a new segment, and too many segments consume significant server resources. Modified or deleted documents still appear in intermediate query results; they are not returned to the client, but filtering them adds query overhead. Finally, new data cannot be retrieved until it is refreshed into a segment, which causes a freshness lag.

3.Write delay strategy

To improve write performance, ES delays writes: new data is first written to memory (the ES JVM heap), and when the specified time elapses (refresh_interval, default 1 second) or enough data accumulates, a refresh is triggered. The refresh flushes the in-memory data into a new segment held in the file-system cache; the segment is written through to disk later, at which point a commit point is generated. Until data has been refreshed into a segment it is not stored as a segment and cannot be retrieved, which is why ES is a near-real-time search engine.

A refresh can also be triggered manually: /_refresh refreshes all indexes and /index-a/_refresh refreshes a specific one. Although a refresh is far lighter than a commit, it still has a performance cost; manual refreshes are fine for testing, but in production do not refresh after every document update, and a one-second refresh is not right for every workload either.

When setting refresh_interval, include a unit, for example "refresh_interval": "10s"; otherwise the value is read as milliseconds. Setting refresh_interval to -1 disables automatic refresh for the index.

The delayed-write strategy reduces how often data is written to disk and so raises overall write throughput. However, the file-system cache belongs to operating-system memory, so data there can be lost on power failure or other faults. To avoid losing data, Elasticsearch adds a transaction log (translog) that records all data not yet persisted to disk.
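The buffer / translog / segment interplay can be modeled in a few lines. This is a toy sketch under my own names: the time-based trigger is illustrative, and real ES also refreshes on buffer size.

```python
import time

class DelayedWriter:
    """Writes land in an in-memory buffer plus the translog; refresh() turns
    the buffer into a searchable segment without touching the translog."""
    def __init__(self, refresh_interval=1.0):
        self.buffer = []       # in-memory docs, not yet searchable
        self.translog = []     # survives refresh; cleared only by a flush
        self.segments = []     # searchable, immutable segments
        self.refresh_interval = refresh_interval
        self.last_refresh = time.monotonic()

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)  # crash-recovery copy
        if time.monotonic() - self.last_refresh >= self.refresh_interval:
            self.refresh()

    def refresh(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # new immutable segment
            self.buffer = []
        self.last_refresh = time.monotonic()

    def search(self, doc):
        # only refreshed data is searchable: hence "near real-time" search
        return any(doc in seg for seg in self.segments)
```

A document indexed but not yet refreshed is invisible to search while still being safe in the translog, which is precisely the trade-off described above.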

4.Index persistence process

The previous sections gave a rough picture of ES data persistence; this section walks through the process step by step with diagrams.

1)Write data

When data is written to ES, to improve the writing speed, data is first written to the memory rather than directly to disk. However, to prevent data loss, a copy of data is appended to the transaction log. As follows:

2)Refresh

A refresh is triggered when the default interval (1 second) elapses or the in-memory data reaches a certain volume. The refresh performs these steps: flush the in-memory data into a new segment (S3), noting that S3 now sits in the operating system's file cache, not yet on disk; open S3 so it becomes searchable; and clear the in-memory buffer ready for new data, while leaving the transaction log untouched.

Note: although S3 is not persisted to disk at this point, the data in S3 can be retrieved.

3)Flush

When the translog exceeds 512 MB in size or 30 minutes in age, a flush is triggered. The flush performs these steps: flushes the S3 data from the file-system cache to disk; generates a commit point; and deletes the old translog, creating an empty one.
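The trigger condition is simple enough to state as code. The thresholds are the defaults quoted above; both are configurable in real ES.

```python
FLUSH_SIZE_BYTES = 512 * 1024 * 1024   # translog size threshold (512 MB)
FLUSH_AGE_SECONDS = 30 * 60            # translog age threshold (30 minutes)

def should_flush(translog_bytes, translog_age_seconds):
    """Flush when the translog is too large or too old."""
    return (translog_bytes > FLUSH_SIZE_BYTES
            or translog_age_seconds > FLUSH_AGE_SECONDS)
```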

5.Segment merging strategy

1)strategy

As the segmented-storage section showed, segments accumulate in an index over time. A query must search every segment that matches its conditions and then merge the per-segment result sets. The problem is that each segment consumes file handles, memory, and CPU cycles and, more importantly, every search request must check each segment in turn, so the more segments there are, the slower search becomes. To keep the segment count under control, segments must be merged periodically.

Elasticsearch solves this by merging segments in the background: small segments are merged into larger ones, which are in turn merged into still larger ones. Merging also purges old deleted documents from the file system; deleted documents (and old versions of updated documents) are not copied into the new, larger segment. Starting a merge requires no action from you; it happens automatically while you index and search.

Merging all segments at once would waste enormous resources, especially merging very large segments. Lucene's approach is to group segments by size and merge only segments within the same group. Because merging extremely large segments is so expensive, Lucene also waits to merge until a group reaches a certain size or number of segments. In short, Lucene's merging focuses on small and medium segments, which avoids the cost of merging huge segments while still keeping the index's segment count in check.

2)Segment merge parameter

mergeFactor: the minimum number of segments that take part in one merge. A merge starts only when the number of segments in a group reaches this value; fewer than this and no merge happens, which reduces merge frequency. The default is 10. SegmentSize: the actual size of a segment, in bytes. minMergeSize: segments smaller than this value are placed in one group, which speeds up the merging of small segments. maxMergeSize: a segment larger than this value does not participate in merges, because merging large segments consumes too many resources.
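A rough sketch of size-tiered selection under these parameters. This is illustrative only: the function name and the order-of-magnitude grouping are mine, and Lucene's actual merge policy is considerably more elaborate.

```python
import math

def pick_merges(segment_sizes, merge_factor=10, max_merge_size=5 * 1024**3):
    """Group segments of similar size (same order of magnitude), skip segments
    above max_merge_size, and merge a group only once it has merge_factor members."""
    groups = {}
    for size in segment_sizes:
        if size > max_merge_size:
            continue  # very large segments never participate in merges
        tier = int(math.log10(max(size, 1)))  # similar size -> same group
        groups.setdefault(tier, []).append(size)
    return [group for group in groups.values() if len(group) >= merge_factor]
```

With the defaults, ten 100-byte segments form a merge candidate, nine do not, and a multi-gigabyte segment is simply skipped.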

3) Merge process

ES generates and merges segments using the following process: While indexing, the refresh operation creates new segments and opens them for search. The merge process selects a small number of similar-sized segments and merges them into a larger segment behind the scenes. There is no effect on retrieval at this time.

The new segment is flushed to disk and opened for search. The old segment is deleted and the merge ends.


6.Refresh strategy

The index, update, delete, and bulk APIs in ES all support a refresh parameter that controls when a change becomes searchable.

1)refresh=false

The default value of refresh: the system schedules refreshes itself, and data is not refreshed immediately after an update.

2) refresh=true or refresh= (empty string)

Refresh immediately: after the data is updated, the affected primary and replica shards are refreshed at once (note: not the whole index), so the change is searchable as soon as the call returns. Performing such operations frequently produces a great many small segment files, adding work to the later segment merges. The main drawbacks of immediate refresh:

  • It generates too many small segment files;
  • search efficiency suffers from the presence of many small segments;
  • the later merging of segment files takes longer.

3) refresh=wait_for

Do not refresh immediately; instead, wait for the next scheduled refresh before returning, so the request blocks until a refresh makes the change visible. Rather than issuing many refresh=wait_for requests separately, batch them into a single bulk request: Elasticsearch runs them in parallel and returns only when all of them have been executed and refreshed. The wait_for listener queue is bounded by the index.max_refresh_listeners setting (default 1000). When the queue is full, a new wait_for request cannot be enqueued, so ES instead triggers a refresh immediately before returning (wait_for only promises that a refresh has happened by the time the call returns) and adds "forced_refresh": true to the response. See the official ES 7.8 documentation for details:

  • The more changes being made to the index the more work wait_for saves compared to true. In the case that the index is only changed once every index.refresh_interval then it saves no work.
  • true creates less efficient indexes constructs (tiny segments) that must later be merged into more efficient index constructs (larger segments). Meaning that the cost of true is paid at index time to create the tiny segment, at search time to search the tiny segment, and at merge time to make the larger segments.
  • Never start multiple refresh=wait_for requests in a row. Instead batch them into a single bulk request with refresh=wait_for and Elasticsearch will start them all in parallel and return only when they have all finished.
  • If the refresh interval is set to -1, disabling the automatic refreshes, then requests with refresh=wait_for will wait indefinitely until some action causes a refresh. Conversely, setting index.refresh_interval to something shorter than the default like 200ms will make refresh=wait_for come back faster, but it’ll still generate inefficient segments.
  • refresh=wait_for only affects the request that it is on, but, by forcing a refresh immediately, refresh=true will affect other ongoing request. In general, if you have a running system you don’t wish to disturb then refresh=wait_for is a smaller modification.

7. Segment file structure

A segments_XXX file in the Lucene index directory records the segments that make up the index. Each segment then consists of the files listed below.

| Name | Extension | Brief Description |
| --- | --- | --- |
| Segment Info | .si | Metadata for the segment |
| Compound File | .cfs, .cfe | A segment contains each file in this table; to reduce the number of open files, all of the segment's contents can be stored in a single .cfs file, with the .cfe file recording each Lucene file's position inside the .cfs |
| Fields | .fnm | Information about the fields |
| Field Index | .fdx | Index (offsets) into the field data file |
| Field Data | .fdt | Forward (stored) data; the original document text lives here |
| Term Dictionary | .tim | The term dictionary, with metadata for the inverted index terms |
| Term Index | .tip | The term index used to locate blocks in the term dictionary |
| Frequencies | .doc | For each term, the list of doc ids and the term frequency per doc |
| Positions | .pos | For full-text fields, the positions of each term within the docs |
| Payloads | .pay | For full-text fields using advanced features such as payloads; holds per-term payload data |
| Norms | .nvd, .nvm | Index-field weighting (norms) data |
| Per-Document Values | .dvd, .dvm | Lucene DocValues, a columnar store used for aggregation and sorting |
| Term Vector Data | .tvx, .tvd, .tvf | Term vectors for index fields, used for highlighting and computing text relevance |
| Live Documents | .liv | Records which docs in the segment are live (not deleted) |

V. References

  • I can’t get Elasticsearch after watching this! mp.weixin.qq.com/s/y8DNnj4fj…
  • Secrets of Time Series Databases (2): Index. www.infoq.cn/article/dat…
  • ES index storage principle. blog.csdn.net/guoyuguang0…
  • Introduction to ElasticSearch, article 8: store. www.cnblogs.com/ljhdo/p/501…
  • docs-refresh. www.elastic.co/guide/en/el…
  • Elasticsearch. www.jianshu.com/p/28fb017be…
  • Elasticsearch: The Definitive Guide. www.elastic.co/guide/cn/el…