Preface

Elasticsearch has been covered here before, but this post is a summary of what Elasticsearch is, how it works, and how to use it.

Data in Life

A search engine searches data, so let's start with the data in our lives.

There are two kinds of data in our lives: structured data and unstructured data.

Structured data: also known as row data, is data that can be logically expressed and implemented with a two-dimensional table structure. It strictly follows format and length specifications and is mainly stored and managed in relational databases. It refers to data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also known as full-text data, has variable length or no fixed format and is not suited to the two-dimensional tables of a database. It includes office documents of all formats, XML, HTML, Word documents, emails, reports of all kinds, pictures, audio, video, and so on.

Note: if classified in more detail, XML and HTML can be called semi-structured data. Because they have their own specific tag formats, they can be processed as structured data when needed, or their plain text can be extracted and treated as unstructured data.

Corresponding to the two kinds of data, search is also divided into two kinds: structured data search and unstructured data search.

For structured data, because it has a well-defined structure, we can generally store and search it in the two-dimensional tables of a relational database (MySQL, Oracle, etc.) and build indexes on it.

For unstructured data, i.e. full-text data, there are two main search methods: sequential scanning and full-text retrieval.

Sequential scan: as the name suggests, specific keywords are found by scanning the text in order. For example, if I give you a newspaper and ask you to find everywhere the word "peace" appears in it, you have to scan the newspaper from cover to cover and mark each place the keyword appears.

This method is undoubtedly the most time-consuming and inefficient. If the typesetting is small and there are many sections, or even multiple newspapers, your eyes will be worn out by the time you finish scanning.

Full-text retrieval: sequential scanning of unstructured data is slow, so can we optimize it? Can we give our unstructured data some structure? We extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this structured data, thereby achieving relatively fast search.

This idea constitutes the basic principle of full-text retrieval. The information extracted from unstructured data and then reorganized is called an index. Most of the work in this approach goes into building the index up front, but later searches are fast and efficient.
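The contrast between the two methods can be sketched in a few lines of Python. This is an illustration only (the documents and keyword are made up); real engines do far more, but the shape of the trade-off is the same: a sequential scan touches every document on every query, while a prebuilt index answers a query with a single dictionary lookup.

```python
# Toy documents; a real corpus would be far larger.
docs = {
    1: "peace talks resume in the region",
    2: "sports results and weather",
    3: "a call for peace and cooperation",
}

# Sequential scan: visit every document and check for the keyword each time.
def sequential_scan(keyword):
    return [doc_id for doc_id, text in docs.items() if keyword in text]

# Full-text retrieval: build an index once, then answer lookups instantly.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def index_lookup(keyword):
    return sorted(index.get(keyword, set()))

print(sequential_scan("peace"))  # scans all documents every time
print(index_lookup("peace"))     # single dictionary lookup
```

Both calls return the documents containing "peace"; the difference is that the index pays its cost once, at build time.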

Let's Start with Lucene

After this brief look at the types of data in our lives, we know that the SQL retrieval of a relational database cannot handle such unstructured data. Processing it relies on full-text retrieval, and the best open source full-text search engine toolkit on the market today is Apache Lucene.

But Lucene is only a toolkit, not a complete full-text search engine. Lucene's goal is to provide software developers with an easy-to-use toolkit for implementing full-text search in a target system, or for building a complete full-text search engine on top of it.

Currently, Solr and Elasticsearch are the main open source full-text search engines based on Lucene.

Solr and Elasticsearch are both mature full-text search engines with broadly similar functionality and performance. However, ES is distributed out of the box and easy to install and use, while Solr's distribution relies on third-party components, for example ZooKeeper for distributed coordination.

Both Solr and Elasticsearch rely on Lucene underneath, and Lucene can implement full-text search mainly because it implements the inverted index query structure.

What is an inverted index? Suppose there are three data files with the following contents:

  1. Java is the best programming language.
  2. PHP is the best programming language.
  3. Javascript is the best programming language.

To create an inverted index, we split the content field of each document into separate words (which we call terms or tokens) with a tokenizer, create a sorted list of all the unique terms, and then list which documents each term appears in. The result looks like this:

Term          Doc_1   Doc_2   Doc_3
-------------------------------------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X
-------------------------------------

This structure consists of a list of all the unique words in the documents, each associated with the list of documents it appears in. A structure like this, which locates records by attribute value, is called an inverted index. A file containing an inverted index is called an inverted file.

We translate the above content into the form of a graph to illustrate the structure information of the inverted index, as shown below.

There are mainly the following core terms to understand:

  • Term: the smallest unit of storage and query in an index. In English it is a word; in Chinese it usually refers to a word produced by segmentation.
  • Term Dictionary, or dictionary: a collection of terms. The usual unit of indexing for a search engine is the word, and the dictionary is the set of strings formed by all the words that have appeared in the document collection. Each entry in the dictionary records some information about the word itself and a pointer to its inverted list.
  • Posting List: a document usually consists of many words, and the inverted list records in which documents, and at which positions, a word appears. Each such record is called a posting. Inverted lists record not only document numbers but also information such as term frequency.
  • Inverted File: the inverted lists of all words are usually stored sequentially in a file on disk called the inverted file, the physical file in which the inverted index is stored.

From the figure above we can see that the inverted index consists mainly of two parts: the term dictionary and the inverted file. The dictionary and the posting lists are two important data structures in Lucene and the cornerstone of fast retrieval. They are stored separately: the dictionary in memory and the inverted file on disk.
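The core terms above can be made concrete with a minimal sketch in Python, built from the three example documents. This is illustrative only: Lucene's real dictionary is a compressed FST and its posting lists are heavily encoded, but the relationship between the two structures is the same: a sorted term dictionary whose entries point to posting lists of (doc_id, positions).

```python
from collections import defaultdict

documents = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}

# term -> posting list of (doc_id, [positions in that doc])
postings = defaultdict(list)
for doc_id, text in documents.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):  # crude whitespace tokenizer
        positions[term].append(pos)
    for term, pos_list in sorted(positions.items()):
        postings[term].append((doc_id, pos_list))

# The term dictionary: the sorted set of all unique terms.
term_dictionary = sorted(postings)

print(term_dictionary)
print(postings["best"])  # "best" appears at position 3 in every document
```

A query for a term is then just a dictionary lookup followed by a walk of its posting list, which is why searches stay fast even for large collections.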

Core Concepts

ES is an open source search engine written in Java that uses Lucene internally for indexing and searching. By encapsulating Lucene, it hides Lucene's complexity and instead provides a simple, consistent RESTful API.

However, Elasticsearch is more than Lucene, and more than just a full-text search engine. It can be accurately described as follows:

  • A distributed real-time document store where each field can be indexed and searched.
  • A distributed real-time analysis search engine.
  • Capable of scaling to hundreds of nodes and supporting PB-scale structured or unstructured data.

Elasticsearch is a distributed, scalable, near-real-time search and analytics engine. Let’s take a look at some of the core concepts of how Elasticsearch can be distributed, scalable and near real time.

Cluster

Building an ES cluster is very simple: it needs no third-party coordination or management components, because cluster management is implemented internally. An ES cluster consists of one or more Elasticsearch nodes, and a node joins a cluster by setting the same cluster.name. Make sure you use different cluster names in different environments, otherwise nodes may end up joining the wrong cluster.

A running instance of the Elasticsearch service is a node. A node's name is set by node.name; if this parameter is not set, a random universally unique identifier is assigned to the node at startup.

Discovery mechanism

This raises a question: how does ES connect different nodes with the same cluster.name setting into one cluster? The answer is Zen Discovery.

Zen Discovery is a built-in default Discovery module for Elasticsearch (the Discovery module is responsible for discovering nodes in the cluster and electing master nodes). It provides unicast and file-based discovery, and can be extended to support cloud environments and other forms of discovery through plug-ins. Zen Discovery is integrated with other modules; for example, all communication between nodes is done using the Transport module. The node uses the discovery mechanism to Ping other nodes.

Elasticsearch is configured to use unicast discovery by default to prevent nodes from unintentionally joining a cluster. Only nodes running on the same machine automatically form a cluster.

If the cluster’s nodes are running on different machines, using unicast, you can provide Elasticsearch with a list of nodes it should try to connect to. When a node contacts a member of the unicast list, it gets the status of all nodes in the entire cluster, and then it contacts the master node and joins the cluster.

This means that the unicast list does not need to contain all the nodes in the cluster; it just needs enough nodes that a new node can contact one of them and talk to it. If you use the master-candidate nodes as the unicast list, you only need to list those three nodes. This configuration goes in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

When a node starts, it pings: if discovery.zen.ping.unicast.hosts is set, it pings the hosts in that list; otherwise it tries several ports on localhost, since Elasticsearch can run multiple nodes on the same host. The ping response contains basic information about the responding node and the node it considers to be the master. The master is then selected from the list of nodes that each node considers master candidates; the rule is simple: pick the first in alphabetical order. If no node is considered master, a node is selected from all known nodes by the same rule.

The constraint is discovery.zen.minimum_master_nodes: if this minimum number of candidate nodes has not been reached, the process repeats until enough nodes are present to start an election. An election always produces a master; if there is only one local node, it elects itself. If the current node becomes master, it waits until the number of nodes reaches discovery.zen.minimum_master_nodes before providing service. If the current node is not the master, it tries to join the master. Elasticsearch calls this process of service discovery and master selection ZenDiscovery.

Because it supports clusters of any size (1-n), ES cannot restrict the node count to be odd as ZooKeeper does, so it cannot use a majority-vote mechanism to choose the master. Instead it adopts a rule: as long as all nodes follow the same rule and the information they obtain is equal, the master they select must be consistent. The problem in distributed systems, however, is information asymmetry, which easily leads to split-brain. Most solutions set a quorum value and require the number of available nodes to exceed the quorum (usually more than half of the nodes) before the cluster provides service. In Elasticsearch, this quorum is configured as discovery.zen.minimum_master_nodes.
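The selection rule described above ("everyone applies the same rule to the same information, so everyone picks the same master") can be sketched as follows. This is a simplification with hypothetical node names: real ZenDiscovery also compares node IDs and cluster state versions, but the core idea is just deterministic ordering plus a quorum check.

```python
def elect_master(visible_candidates, minimum_master_nodes):
    """Return the elected master, or None if quorum is not reached.

    Every node runs this same function over the candidates it can see;
    if their views agree, their choices agree too.
    """
    if len(visible_candidates) < minimum_master_nodes:
        return None  # not enough candidates: keep pinging, no election
    return sorted(visible_candidates)[0]  # first in sorted order wins

# Three candidate master nodes with a quorum of 2 (3 // 2 + 1):
print(elect_master(["node-b", "node-a", "node-c"], 2))  # all agree on node-a
print(elect_master(["node-b"], 2))                      # quorum not met
```

Because `sorted` is deterministic, any two nodes that see the same candidate set must elect the same master, which is exactly the consistency property the text relies on.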

Roles of nodes

Each node can be either a candidate master node or a data node, set in config/elasticsearch.yml; the default for both is true.

node.master: true  # Whether the node is a candidate master node
node.data: true    # Whether the node is a data node

Data nodes store data and perform data-related operations such as adding, deleting, modifying, querying, and aggregating. Data nodes therefore have high machine-configuration requirements and consume large amounts of CPU, memory, and I/O. As a cluster grows, more data nodes typically need to be added to improve performance and availability.

A candidate master node can be elected master; only candidate master nodes have the right to vote and be elected, and other nodes do not participate in the election. The master node is responsible for creating and deleting indexes, tracking which nodes are part of the cluster, deciding which shards are allocated to which nodes, and tracking the status of nodes in the cluster. A stable master node is very important for cluster health.

A node can be both a candidate master node and a data node. However, because data nodes consume a lot of CPU, memory, and I/O, a node that is both data node and master node may impair the master and thus the status of the whole cluster.

To improve the health of an Elasticsearch cluster, we should therefore separate and isolate node roles, for example using several low-spec machines as the candidate-master group.

The master node and the other nodes check on each other via pings: the master pings all other nodes to detect whether any node is down, and the other nodes ping the master to check whether it is still available.

Although nodes are distinguished by role, a user request can be sent to any node, and that node is responsible for distributing the request, collecting the results, and responding, without forwarding everything through the master node. Such a node is called a coordinating node. Coordinating nodes do not need to be specified or configured; any node in the cluster can play this role.

The split-brain phenomenon

If multiple master nodes are elected in a cluster at the same time due to network problems or other reasons, data updates become inconsistent. This phenomenon is called split brain: different nodes in the cluster choose different masters, and multiple competing masters appear.

There are several possible causes of the “split brain” problem:

  1. Network problems: network delays within the cluster leave some nodes unable to reach the master. They conclude the master has died and elect a new one, marking the shards and replicas on the old master as red and allocating new primary shards.

  2. Node load: the master node acts as both master and data node. Under heavy traffic, ES may stop responding (a state of false death), causing large delays. Other nodes then get no response from the master, conclude it has died, and re-elect a master.

  3. Memory reclamation: the master node acts as both master and data node. When the ES process on a data node occupies a large amount of memory, large-scale JVM garbage collection can cause the ES process to become unresponsive.

To avoid split brain, we can address these causes with the following optimizations:

  1. Adjust the node-status response timeout with the discovery.zen.ping_timeout parameter. The default is 3s; increasing it appropriately (for example discovery.zen.ping_timeout: 6) reduces misjudgment.

  2. Election trigger: set discovery.zen.minimum_master_nodes in the configuration of the candidate master nodes. This parameter is the number of candidate master nodes that must participate for a master election to proceed. The official recommended value is master_eligible_nodes / 2 + 1, where master_eligible_nodes is the number of candidate master nodes. This prevents split brain while maximizing availability, because an election can proceed normally as long as at least discovery.zen.minimum_master_nodes candidates are alive. With fewer candidates than this alive, no election can be triggered and the cluster is unusable, but shard confusion is avoided.

  3. Role separation: separate candidate master nodes from data nodes, as described above. This relieves the burden on the master node, prevents it from entering a false-death state, and reduces misjudgments that the master has died.
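The recommended quorum formula above is simple integer arithmetic, and writing it out makes the "more than half" property visible:

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Official recommendation: master_eligible_nodes / 2 + 1 (integer division)."""
    return master_eligible_nodes // 2 + 1

for n in (1, 3, 5, 10):
    print(n, "candidates -> quorum", minimum_master_nodes(n))
```

For 3 candidate master nodes this gives a quorum of 2, so at most one partition of the cluster can ever hold a majority of candidates, which is what prevents two masters from being elected at once.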

Shards

ES supports PB-level full-text search. When the amount of data in an index is too large, ES splits the data horizontally into different blocks. Each split block is called a shard.

This is similar to MySQL's sharding of databases and tables, except that MySQL requires third-party components while ES implements the feature internally.

When data is written to a multi-shard index, routing determines which shard it is written to, so the number of shards must be specified at index creation and cannot be changed afterwards.

The number of shards, and the number of replicas described below, can be configured in the settings when creating an index. By default, ES creates 5 primary shards for an index and one replica for each shard.

PUT /my_index
{
   "settings" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
   }
}

Sharding improves ES indexes in both scale and performance. Each shard is an index file in Lucene, and every shard consists of a primary shard and zero or more replicas.

Replicas

A replica is a copy of a shard. Each primary shard has one or more replica shards, and when the primary shard fails, a replica can serve data queries. A primary shard and its replicas are never on the same node, so the maximum number of replica shards is n - 1 (where n is the number of nodes).

Document create, index, and delete requests are all write operations, which must complete on the primary shard before being copied to the relevant replica shards. To improve write throughput, ES writes to the replicas concurrently, and to resolve the data conflicts that concurrent writing can produce, ES uses optimistic concurrency control: each document has a _version number that increases when the document is modified. Once all replica shards report success, the primary's node reports success to the coordinating node, and the coordinating node reports success to the client.
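The optimistic locking idea above can be sketched with a tiny in-memory store. The store and the `update` helper are hypothetical (ES exposes this through version parameters on its REST API, not this interface), but the mechanism is the same: a write carries the version it read, and it is rejected if the document has moved on since.

```python
class VersionConflict(Exception):
    pass

# Hypothetical in-memory document store with per-document _version numbers.
store = {"doc1": {"_version": 1, "body": "original"}}

def update(doc_id, new_body, expected_version):
    doc = store[doc_id]
    if doc["_version"] != expected_version:
        # someone else modified the document since we read it
        raise VersionConflict(f"expected {expected_version}, found {doc['_version']}")
    doc["body"] = new_body
    doc["_version"] += 1  # every successful modification bumps the version
    return doc["_version"]

print(update("doc1", "first edit", expected_version=1))   # succeeds, version becomes 2
try:
    update("doc1", "stale edit", expected_version=1)      # stale reader loses
except VersionConflict as e:
    print("conflict:", e)
```

Unlike pessimistic locking, no writer ever blocks: a conflicting writer simply retries with the fresh version, which suits ES's concurrent replication of writes.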

In order to achieve high availability, the Master node avoids placing the Master shard and the replica shard on the same node.

If Node 1 goes down or becomes unreachable over the network, the primary shard S0 on it also becomes unavailable. Fortunately, the other two nodes still work normally: ES re-elects a new master node, and since those two nodes hold all the data S0 needs, a replica of S0 is promoted to primary. Promoting a primary shard happens instantly, and the cluster status becomes yellow.

Why is our cluster status yellow instead of green? Although both primary shards are present, we configured two replica shards for each primary and currently only one replica of each exists, so the cluster cannot be green. If we also shut down Node 2, our program can still run without losing any data, because Node 3 keeps a copy of every shard.

If we restart Node 1, the cluster can re-allocate the missing replica shards and return to its normal state. If Node 1 still has its previous shards, it tries to reuse them, though the shards on Node 1 are now replicas rather than primaries; if data has changed in the meantime, only the modified data files are copied over from the primary shard.

Summary:

  1. Data is sharded to increase the volume of data that can be handled and to ease horizontal scaling, and replicas are made for shards to improve cluster stability and concurrency.
  2. Replication is multiplication: the more replicas, the greater the cost, but also the more insurance. Sharding is division: the more shards, the smaller and more scattered each shard's data.
  3. More replicas mean higher cluster availability, but because each shard is equivalent to a Lucene index file, shards occupy a certain amount of file handles, memory, and CPU, and data synchronization between shards also consumes network bandwidth. So more shards and replicas are not always better.

Mapping

Mapping defines, for the fields in an ES index, the storage type, how they are tokenized, and whether they are stored. It is like the schema in a database: it describes the fields or attributes a document may have and the data type of each field. But while a relational database must specify field types when creating a table, ES can either leave field types unspecified and guess them dynamically, or have them specified when the index is created.

Mapping that automatically identifies field types from the data format is called dynamic mapping. Mapping in which we define field types explicitly when creating the index is called static mapping or explicit mapping.

Before explaining the use of dynamic and static mapping, let's first look at what field data types exist in ES. Later we will explain why static mappings, rather than dynamic ones, should be created when creating indexes.

The field data types in ES (V6.8) are as follows:

Category            Data types
Core types          text, keyword, long, integer, short, double, date, boolean, etc.
Complex types       object, nested
Geographic types    geo_point, geo_shape
Special types       ip, completion, token_count, join, etc.
…                   …

Text: a field type used to index full-text values, such as the body of an e-mail message or a product description. These fields are analyzed: they pass through a tokenizer that converts the string into a list of individual terms before indexing. The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are rarely used for aggregations.

Keyword: a field type for indexing structured content, such as an e-mail address, host name, status code, zip code, or tag. Keyword fields are commonly used for filtering, sorting, and aggregations, and they can only be searched by their exact value.

From this understanding of field types, we know that some fields need to be defined explicitly: whether a field is of type text or keyword makes a big difference, a date field may need its time format specified, and some fields need a specific tokenizer, and so on. Dynamic mapping cannot do this precisely, and automatic recognition often differs somewhat from what is expected.
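The practical difference between text and keyword can be sketched like this. The `analyze` function below is a crude stand-in assumption for Elasticsearch's real analyzers (which also handle stemming, stop words, etc.), but it shows why a text field matches individual words while a keyword field matches only the exact value:

```python
def analyze(value):
    """Crude stand-in for an ES analyzer: lowercase + whitespace split."""
    return value.lower().split()

email_body = "Quick brown fox"   # imagine this indexed as a text field
status_code = "NOT_FOUND"        # imagine this indexed as a keyword field

text_terms = set(analyze(email_body))  # text: analyzed into terms
keyword_value = status_code            # keyword: stored untokenized

print("fox" in text_terms)             # text field matches a single word
print(keyword_value == "NOT_FOUND")    # keyword matches the exact value
print(keyword_value == "not_found")    # even case differences fail to match
```

This is also why choosing the wrong type hurts: a status code mapped as text would be split and lowercased, and exact-value filters on it would behave unexpectedly.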

So a complete format for creating an index is to specify the number of shards and copies and the definition of the Mapping, as follows:

PUT my_index
{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title":    { "type": "text" },
        "name":     { "type": "text" },
        "age":      { "type": "integer" },
        "created":  {
          "type":   "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

Basic Usage

Elasticsearch's stable major versions to date (excluding 0.x and 1.x) are 2.x, 5.x, 6.x, and 7.x (current). You might notice there is no 3.x or 4.x: ES jumped straight from 2.4.6 to 5.0.0.

This was done to unify the versions of the ELK (Elasticsearch, Logstash, Kibana) stack so that users don't get confused. While Elasticsearch was at 2.x (its last release, 2.4.6, shipped on July 25, 2017), Kibana was at 4.x (Kibana 4.6.5 also shipped on July 25, 2017). The next major version of Kibana would inevitably be 5.x, so Elasticsearch released its own next major version as 5.0.0. After unification, you choose an Elasticsearch version and simply pick the same Kibana version, without worrying about version incompatibility.

Elasticsearch is built in Java, so besides the ELK stack versions we also need to pay attention to the JDK version when choosing Elasticsearch, because each major release depends on a different JDK version. Version 7.2 currently supports JDK 11.

Install and use

1. Download and decompress Elasticsearch; no installation step is required. The decompressed directory contains:

  • Bin: binary system instruction directory, including startup commands and plug-in installation commands.
  • Config: indicates the directory of the configuration file.
  • Data: indicates the data storage directory.
  • Lib: dependency package directory.
  • Logs: log file directory.
  • Modules: A library of modules, such as the modules of X-pack.
  • Plugins: Plugins directory.

2. Run bin/elasticsearch to start ES.

3. ES runs on port 9200 by default. Request curl http://localhost:9200/ or open http://localhost:9200 in a browser to get a JSON object containing the current node, cluster, version, and other information.

{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "1fad4e1",
    "build_date" : "2019-06-18T13:16:52.517138z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Cluster Health Status

To check cluster health, we can run the following command GET /_cluster/health from the Kibana console to GET the following information:

{
  "cluster_name" : "wujiajian",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 64.28571428571429
}

Cluster status is indicated by green, yellow, or red:

  • Green: the cluster is healthy and fully functional; all primary and replica shards are working normally.
  • Yellow: a warning state. All primary shards are working normally, but at least one replica is not. The cluster works, but high availability is affected to some extent.
  • Red: the cluster is unavailable; one or more primary shards and their replicas are abnormally unavailable. Queries can still be executed, but the results will be inaccurate, and write requests allocated to the broken shards will fail, causing data loss.

When the cluster status is red, it will continue to serve search requests from available shards, but you need to fix the unallocated shards as soon as possible.
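The green/yellow/red rule above boils down to the state of the shards, and can be sketched as a small function. This is a simplification (the real status is derived from the full shard allocation table), but it reproduces the rule the bullets describe:

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    if unassigned_primaries > 0:
        return "red"      # at least one primary shard is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # all primaries active, but some replicas are not
    return "green"        # everything assigned and working

print(cluster_status(0, 0))
print(cluster_status(0, 5))  # like the sample response: 5 unassigned replica shards
print(cluster_status(1, 0))
```

The single-node sample response above is yellow for exactly this reason: with only one node there is nowhere to place the 5 replica shards, so they stay unassigned while every primary is active.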

How It Works

After introducing the basic concepts and basic operations of ES, we may still have a lot of questions. How do they work inside? How are master and replica shards synchronized? What is the process for creating an index? How does ES allocate index data to different shards? And how is this index data stored? Why is ES a near-real-time search engine while the CRUD (create, read, update, delete) operations of documents are real-time? How does Elasticsearch ensure that updates are persisted without losing data in the event of a power failure? And why does deleting a document not immediately free up space? With these questions in mind we move on to the next section.

How Index Writing Works

The following figure describes a 3-node cluster with 12 shards: 4 primary shards (S0, S1, S2, S3) and 8 replica shards (R0, R1, R2, R3), with each primary shard corresponding to two replicas. Node 1 is the master node, responsible for the state of the whole cluster.

Writes can only go to the primary shard and are then synchronized to the replica shards. With four primary shards, by what rule does ES decide which shard a piece of data is written to? Why is this document written to S0 rather than S1 or S2, and that one to S3 rather than S0?

First of all, it certainly can't be random, otherwise we won't know where to look when we need to retrieve the document later. In fact, the destination is determined by the following formula:

1shard = hash(routing) % number_of_primary_shards


Routing is a variable value, which defaults to the document's _id and can be set to a custom value. The hash function generates a number from routing, which is then divided by number_of_primary_shards (the number of primary shards) to obtain a remainder. This remainder, between 0 and number_of_primary_shards - 1, is the shard where the document lives.
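The routing formula is easy to try out. Note an assumption: Elasticsearch actually hashes routing values with murmur3, so `zlib.crc32` below is only a stand-in hash for illustration; the point is that any deterministic hash makes the same `_id` always land on the same shard.

```python
import zlib

number_of_primary_shards = 4  # fixed when the index is created

def route(routing_value: str) -> int:
    """shard = hash(routing) % number_of_primary_shards"""
    return zlib.crc32(routing_value.encode()) % number_of_primary_shards

for doc_id in ("user-1", "user-2", "user-3"):
    print(doc_id, "-> shard", route(doc_id))
```

This also makes the "never change the shard count" rule concrete: if `number_of_primary_shards` changed from 4 to 5, `route` would return different shard numbers for the same `_id`, and previously written documents could no longer be found.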

This explains why the number of master shards is determined at index creation time and never changes: if the number changes, then all the values of previous routes are invalid and the document is never found again.

Because every node in an ES cluster can locate any document in the cluster via the formula above, every node can handle read and write requests. After a write request is sent to a node, that node becomes the coordinating node mentioned earlier: it computes which shard to write to using the routing formula, then forwards the request to the node holding that shard's primary.

Suppose the routing formula yields shard = hash(routing) % 4 = 0. The write then proceeds as follows:

  1. The client sends a write request to ES1 node (coordination node), and the value is 0 based on the route calculation formula, so the current data should be written to the master shard S0.
  2. ES1 forwards the request to ES3, where the main shard of S0 is located. ES3 receives the request and writes it to disk.
  3. Data is concurrently copied to two replica shards R0, where data conflicts are controlled through optimistic concurrency. Once all replica shards report success, the node ES3 reports success to the coordinating node, and the coordinating node reports success to the client.

Storage principle

The process above takes place in ES memory. After data has been allocated to specific shards and replicas, it is ultimately stored on disk so that it will not be lost in a power failure. The storage path is set in config/elasticsearch.yml and defaults to the data folder in the installation directory. Do not use the default value: if you upgrade ES, all data could be lost.

path.data: /path/to/data  # Index data
path.logs: /path/to/logs  # Log files

Segments

Indexed documents are stored on disk as segments. What are segments? If an index file is split into multiple subfiles, each subfile is called a segment, and each segment is an inverted index. The segment is immutable and cannot be modified once the data in the index is written to the hard disk. In the bottom layer, the segmented storage mode is adopted, so that the lock is almost completely avoided when reading and writing, which greatly improves the read and write performance.

After the segment is written to disk, a commit point is generated, which is a file that records all the information about the segment after the commit. Once a segment has a commit point, it has read permission but no write permission. Conversely, when a segment is in memory, it has write permission, not read permission, meaning it cannot be retrieved.

The concept of segments arose because early full-text retrieval built one large inverted index for the entire document collection and wrote it to disk. If the index changed, a complete new index had to be built to replace the old one. This is inefficient for large data volumes, and because rebuilding the index is expensive, data could not be updated frequently, so timeliness could not be guaranteed.

Since index files are segmented and immutable, how are additions, updates, and deletions handled?

  • Addition: easy to handle. Since the data is new, it simply needs to be written into a new segment.
  • Deletion: because segments cannot be modified, a delete does not remove the document from the old segment. Instead a new .del file is written, listing the documents that have been deleted. A document marked as deleted can still be matched by a query, but it is removed from the result set before the final results are returned.
  • Update: old segments cannot be modified to reflect an update, so an update is effectively a delete plus an add. The old version of the document is marked as removed in the .del file, and the new version is indexed into a new segment. Both versions may be matched by a query, but the deleted older version is filtered out before the result set is returned.
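A minimal sketch of this delete-by-marker scheme. The names here (Segment, the deleted set standing in for the .del file) are invented purely for illustration:

```python
class Segment:
    """An immutable batch of documents plus a side list of tombstones."""

    def __init__(self, docs):
        self._docs = dict(docs)   # doc_id -> body, never modified after creation
        self.deleted = set()      # stand-in for the .del file

    def search(self, word):
        # Matching still sees deleted docs; they are filtered afterwards,
        # just as ES removes .del-marked hits before returning results.
        hits = [i for i, body in self._docs.items() if word in body]
        return [i for i in hits if i not in self.deleted]

old = Segment({1: "hello world"})
old.deleted.add(1)                         # "delete" = mark in the .del file
new = Segment({1: "hello elasticsearch"})  # "update" = reindex into a new segment

assert old.search("hello") == []   # old version filtered out of results
assert new.search("hello") == [1]  # new version found in the new segment
```

Note that the segment's document map is never touched after creation; only the side list of tombstones grows, which is exactly what makes lock-free reads possible.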

Making segments immutable has both advantages and disadvantages. The main advantages are:

  • No locks required. If you never update indexes, you don’t need to worry about multiple processes modifying data at the same time.
  • Once the index is read into the kernel’s file system cache, it stays there because of its immutability. As long as there is enough space in the file system cache, most read requests go directly to memory and do not hit disk. This provides a significant performance boost.
  • Other caches, like the filter cache, remain in effect for the lifetime of the index. They do not need to be rebuilt every time the data changes, because the data does not change.
  • Writing a single large inverted index allows data to be compressed, reducing disk I/O and the amount of indexes that need to be cached into memory.

The disadvantages of segment invariance are as follows:

  • When data is deleted, it is not removed immediately; it is only marked as deleted in the .del file. The old data is physically removed only when segments are merged, which wastes a lot of space in the meantime.
  • If data is updated frequently, each update writes a new document and marks the old one, which also wastes a lot of space.
  • Each time data is added, a new segment is needed to store it. When the number of segments grows too large, the consumption of server resources such as file handles becomes very high.
  • Every query must include hits from all segments and then exclude documents marked as deleted, which adds to the query burden.

Delayed write strategy

Now that we have the storage format, how is the index written to disk? Is fsync called directly to physically write every change to disk?

The answer is obviously no. If data were written directly to disk, disk I/O consumption would seriously hurt performance: under heavy writes, ES would stall and queries could not respond quickly. If that were the case, ES could not be called a near-real-time full-text search engine.

To improve write performance, ES does not add a segment to disk for every new piece of data, but uses a write delay policy.

Whenever new data arrives, it is first written to memory. Between memory and disk sits the file system cache. When the default interval (1 second) elapses or the data in memory reaches a certain size, a refresh is triggered: the data in memory is turned into a new segment and cached in the file system cache; it is flushed to disk later, at which point a commit point is generated.

The memory used here is the ES JVM heap, while the file system cache uses operating system memory. New data keeps arriving in JVM memory, but data in memory is not yet in segment form and therefore cannot be searched. Flushing from memory to the file system cache generates a new segment and opens it for search immediately, without waiting for the flush to disk.

In Elasticsearch, this lightweight process of writing and opening a new segment is called refresh. By default, each shard is automatically refreshed once per second. This is why Elasticsearch is described as near real time: changes to a document are not immediately visible to search, but become visible within one second. Refresh can also be triggered manually: POST /_refresh refreshes all indexes, and POST /nba/_refresh refreshes the specified index.

Tips: Although refreshing is a much lighter operation than committing, it still has a performance cost. Manual refreshes are useful when writing tests, but do not refresh manually every time you index a document in a production environment. Nor does every use case need a refresh every second: if you are using Elasticsearch to index a large volume of log files, you may want to optimize indexing speed rather than near-real-time search. In that case you can set refresh_interval: "30s" in the settings when creating the index, reducing the refresh frequency for that index. Note that a time unit must be included; otherwise the value is interpreted as milliseconds. Setting refresh_interval to -1 disables automatic refresh for the index.
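As a sketch, the settings bodies described above look like this. The index name is hypothetical; refresh_interval itself is a real index setting:

```python
import json

# Settings for a log-heavy index: refresh every 30s instead of every 1s.
create_body = {"settings": {"refresh_interval": "30s"}}

# For a one-off bulk import: disable refresh entirely, restore it afterwards.
bulk_import_settings = {"refresh_interval": "-1"}
restore_settings = {"refresh_interval": "1s"}

# These dicts would be sent as the JSON bodies of the create-index and
# update-settings requests respectively.
print(json.dumps(create_body))
```

The value "-1" disables automatic refresh entirely, so remember to restore a normal interval once the import is done, or new documents will never become searchable on their own.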

Although the delayed write strategy reduces the number of disk writes and improves overall write throughput, the file system cache is still memory belonging to the operating system. As long as data lives only in memory, there is a risk of loss on power failure or other abnormal termination.

To avoid data loss, Elasticsearch added a transaction log, which records all data that has not yet been persisted to disk. After the transaction log is added, the entire index writing process is shown below.

  • After a new document is indexed, it is first written to memory, and to prevent data loss a copy is appended to the transaction log (translog). New documents keep being written to memory and to the translog alike. At this point the new data cannot yet be searched.
  • When the default refresh interval elapses or a certain amount of data accumulates in memory, a refresh is triggered: the data in memory becomes a new segment in the file system cache, and the memory buffer is cleared. Although the new segment has not been committed to disk, it is already searchable and cannot be modified.
  • As new documents keep being indexed, a flush is triggered when the translog grows beyond 512 MB or is older than 30 minutes. The data in memory is written to a new segment in the file system cache, the file system cache is flushed to disk via fsync, a commit point is generated, the old translog is deleted, and a new empty translog is created.

In this way, on power failure or restart, ES not only loads the persisted segments according to the commit point, but also replays the records in the translog to re-persist any data that had not yet reached disk, avoiding data loss.
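The three stages above can be sketched as a toy state machine. All names here are invented for illustration; the thresholds follow the defaults mentioned in the text (1 s refresh, translog flush at 512 MB / 30 min):

```python
class ToyShard:
    """Toy model of buffer -> refresh (searchable) -> flush (durable)."""

    def __init__(self):
        self.buffer = []    # in-JVM-memory docs, not yet searchable
        self.translog = []  # append-only log for crash recovery
        self.segments = []  # searchable segments (file system cache)
        self.committed = [] # segments fsync'ed to disk at the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)  # written alongside memory, on every index

    def refresh(self):
        # Every ~1s: the buffer becomes a new searchable segment; the
        # translog is NOT truncated here.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Translog too big or too old: fsync segments to disk, write a
        # commit point, and start a fresh empty translog.
        self.refresh()
        self.committed = [doc for seg in self.segments for doc in seg]
        self.translog.clear()

    def search(self):
        return [doc for seg in self.segments for doc in seg]

s = ToyShard()
s.index("doc1")
assert s.search() == []        # buffered: not yet visible to search
s.refresh()
assert s.search() == ["doc1"]  # visible after refresh, but not yet durable
s.flush()
assert s.translog == []        # durable: translog truncated after commit
```

The key point the model makes visible: between refresh and flush a document is searchable but would be lost on power failure were it not for its copy in the translog.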

Segment merging

Because the automatic refresh process creates a new segment every second, this can cause the number of segments to explode in a short period of time. Too many segments can cause major problems. Each segment consumes file handles, memory, and CPU cycles. More importantly, each search request must take turns checking each segment and then merging the query results, so the more segments, the slower the search.

Elasticsearch solves this problem by periodically merging segments in the background: small segments are merged into large segments, and those are merged into even larger ones. Segment merging is also when deleted documents are purged from the file system, since deleted documents are not copied into the new, larger segment. The merge process does not interrupt indexing and searching.

Segment merge occurs automatically when indexing and searching, and the merge process selects a small number of similarly-sized segments and merges them behind the scenes into larger segments that can be either uncommitted or committed. After the merge is complete, the old segment is deleted, the new segment is flushed to disk, and a new commit point containing the new segment is written, excluding the old and smaller segments, and the new segment is opened for search.
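A toy sketch of the merge step, using the same tombstone model as the deletion discussion above (all names invented): pick the smallest segments, copy only live documents into one new segment, and drop the old segments.

```python
def merge_segments(segments, batch=3):
    """Merge the `batch` smallest segments into one, dropping tombstoned docs.

    Each segment is a (docs: dict, deleted: set) pair; sorting by size is a
    crude stand-in for Lucene's selection of similarly sized segments.
    """
    segments = sorted(segments, key=lambda s: len(s[0]))
    to_merge, rest = segments[:batch], segments[batch:]
    merged_docs = {}
    for docs, deleted in to_merge:
        for doc_id, body in docs.items():
            if doc_id not in deleted:  # deleted docs are not copied over
                merged_docs[doc_id] = body
    return rest + [(merged_docs, set())]  # new segment starts with no tombstones

segs = [({1: "a"}, {1}), ({2: "b"}, set()), ({3: "c"}, set()), ({4: "d", 5: "e"}, set())]
segs = merge_segments(segs)
assert len(segs) == 2        # four segments merged down to two
assert 1 not in segs[-1][0]  # the tombstoned doc was purged during the merge
```

Note that the merged segment starts with an empty tombstone set: the deleted document is finally gone, which is the "space is reclaimed at merge time" behavior described earlier.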

Segment merges are computationally expensive and eat a lot of disk I/O, which can drag down write rates and, if left unchecked, affect search performance. Elasticsearch puts resource limits on the merge process by default, so the search still has enough resources to perform well.

Performance optimization

The storage device

Disks are often the bottleneck on modern servers. Elasticsearch uses disks heavily, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O:

  • Use SSDs. As mentioned elsewhere, they far outperform mechanical disks.
  • Use RAID 0. Striped RAID increases disk I/O, at the obvious cost of losing the array when a single disk fails. Do not use mirrored or parity RAID, because replicas already provide that redundancy.
  • Alternatively, use multiple hard drives and let Elasticsearch stripe data across them via multiple path.data directory entries.
  • Do not use remotely mounted storage such as NFS or SMB/CIFS. The latency it introduces is counterproductive to performance.
  • If you use EC2, beware of EBS. Even SSD-backed EBS is generally slower than local instance storage.

Internal index optimization

To locate a term quickly, ES sorts all terms and finds a given term by binary search, just like looking up a word in a dictionary; this sorted list is the term dictionary. So far this looks similar to how a traditional database uses a B-tree.

However, if there are too many terms, the term dictionary becomes too large to fit in memory, so there is also a term index, like the index page of a dictionary that tells you on which page the words starting with A begin. The term index can be understood as a tree that does not contain all terms, only prefixes of terms. Through the term index you can quickly locate an offset in the term dictionary and then scan from there.
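A toy sketch of this two-level lookup (term index → term dictionary), using Python's bisect for the dictionary's binary search. In reality the in-memory index is an FST rather than a plain dict, and the dictionary lives on disk:

```python
import bisect

# Term dictionary: all terms, sorted (stored on disk in Lucene).
term_dict = ["apple", "apply", "banana", "band", "cat", "catalog"]

# Term index: prefix -> offset into the dictionary (kept in memory;
# an FST in the real implementation).
term_index = {"a": 0, "b": 2, "c": 4}

def lookup(term):
    # Jump near the right region via the index, then binary-search from there,
    # instead of binary-searching the entire dictionary.
    start = term_index.get(term[0], 0)
    pos = bisect.bisect_left(term_dict, term, lo=start)
    return pos if pos < len(term_dict) and term_dict[pos] == term else -1

assert lookup("band") == 3   # found at dictionary offset 3
assert lookup("dog") == -1   # not present
```

The index only has to hold prefixes, so it stays small enough for memory while the full sorted dictionary can stay on disk.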

The term index is compressed with an FST (finite state transducer) in memory. The FST stores terms in bytes; this compression effectively reduces storage space so the term index fits in memory, though it requires more CPU during lookups. Demo address: Build Your Own FST

For the inverted lists (postings) stored on disk, compression is also used to reduce the storage footprint.

Adjusting Configuration Parameters

  • Give each document an ordered, well-compressed ID pattern; avoid random IDs such as UUID-4, which have a low compression ratio and can noticeably slow Lucene down.

  • Disable doc values for indexed fields that do not need aggregation or sorting. Doc values are an ordered list of document => field-value mappings.

  • Use the keyword type instead of the text type for fields that do not need to be fuzzy-searched to avoid the need to segment the text before indexing.

  • If your search results don’t require near-real-time accuracy, consider changing the index.refresh_interval to 30s for each index. If you are doing a bulk import, you can turn off the refresh by setting this value to -1 during the import, and you can turn off the replicas by setting index.number_of_replicas: 0. Don’t forget to turn it back on when you’re done.

  • Avoid deep paging; use scroll for paged queries instead. For a normal from/size query on an index with N shards, the coordinating node must collect and sort (from + size) × N documents, then pick out the page to return. When from is very large, this sorting process becomes heavy and consumes a lot of CPU.

  • Reduce mapping fields: map only the fields that need to be searched, aggregated, or sorted. Other fields can be stored in another system such as HBase; after getting results from ES, query those fields in HBase.

  • When creating indexes and querying information, specify a routing value. In this way, the information can be queried in a specific fragment precisely, improving query efficiency. When selecting routes, ensure that data distribution is balanced.
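The deep-paging cost mentioned in the list above is easy to quantify with a back-of-the-envelope sketch (the function name is invented for illustration):

```python
def docs_sorted_by_coordinator(from_, size, num_shards):
    """Each shard returns its top (from_ + size) docs; the coordinating node
    must merge-sort all of them before slicing out the requested page."""
    return (from_ + size) * num_shards

# Page 1 vs page 1000 (size 10) on a 5-shard index:
assert docs_sorted_by_coordinator(0, 10, 5) == 50
assert docs_sorted_by_coordinator(10_000, 10, 5) == 50_050
```

Fetching page 1000 forces the coordinating node to handle roughly a thousand times more documents than page 1, even though only 10 documents are returned, which is why scroll (a cursor that never revisits earlier results) is preferred for deep traversal.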

The JVM tuning

  • Set the minimum heap size (Xms) equal to the maximum heap size (Xmx) so the program does not resize the heap at runtime. Elasticsearch's default heap is 1 GB, configured via the config/jvm.options file. It is best not to exceed 50% of physical memory, and never 32 GB.

  • GC defaults to CMS, which is concurrent but has stop-the-world (STW) pauses; consider using the G1 collector instead.

  • ES relies heavily on the file system cache for fast searching. In general, at least half of the machine's physical memory should be left available for the file system cache.

