In the chapter on distributed clustering, we introduced the shard, describing it as the basic unit of work. But what exactly is a shard, and how does it work? In this chapter, we will answer these questions:

• Why is search near real-time rather than real-time?
• Why are CRUD operations on documents real-time?
• How does Elasticsearch ensure that updates are persisted and not lost, even after a power outage?
• Why does deleting a document not immediately free up space?
• What are the refresh, flush, and optimize APIs, and when should you use them?

The easiest way to understand how sharding works is to start with a history lesson. We’ll take a look at some of the issues that need to be addressed in order to provide a distributed, persistent search engine with near-real-time search and analysis capabilities.

Make text searchable

The first challenge that had to be solved was how to make text searchable. In traditional databases, each field holds one value, but this is insufficient for full-text search. Wanting every word in the text to be searchable means that the database needs to store multiple values.

The best data structure to support multiple values for a field is an inverted index. An inverted index contains an ordered list of the unique terms that appear across all documents, and, for each term, a list of the documents in which it appears.

 Term  | Doc 1 | Doc 2 | Doc 3 | ...
 brown |   X   |       |  X    | ...
 fox   |   X   |   X   |  X    | ...
 quick |   X   |   X   |       | ...
 the   |   X   |       |  X    | ...
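As a rough illustration (not Lucene's actual on-disk format), the table above can be built with a short Python sketch that maps each term to the documents containing it:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "the quick brown fox",
    2: "quick fox",
    3: "the brown fox",
}
index = build_inverted_index(docs)
# index["fox"] → [1, 2, 3]; index["quick"] → [1, 2]
```

Searching for a term is then just a dictionary lookup, which is exactly what makes the inverted index so well suited to full-text search.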

An inverted index stores much more than a list of the documents containing a particular term. It might store the number of documents containing each term, how often a term appears in a given document, the order of terms in each document, the length of each document, the average length of all documents, and so on. These statistics let Elasticsearch determine which terms are more important and which documents are more relevant, that is, they drive relevance scoring.

Note that for an inverted index to work as expected, it must be built over every document in the collection.

In the early days of full-text retrieval, a single large index was built for the entire document collection and written to disk. Only when the new index was ready did it replace the old one, at which point the most recent changes became searchable.


The inverted index written to disk is immutable and has the following benefits:

1. No locks required. If you never need to update an index, you don't have to worry about multiple processes trying to change it at the same time.
2. Once an index is read into the file system cache, it stays there because it never changes. As long as the cache has enough space, most reads come from memory rather than disk, which helps performance.
3. All other caches remain valid for the lifetime of the index. They don't need to be rebuilt every time the data changes, because the data never changes.
4. Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of memory needed to cache the index.

Of course, immutable indexes have their drawbacks, starting with the obvious: they are immutable! You can't change them. If you want new documents to be searchable, you must rebuild the entire index. This places severe limits both on how much data an index can hold and on how often an index can be updated.

Dynamic index

The next problem to solve is how to update the inverted index while retaining the benefits of immutability. The answer is, use multiple indexes.

Rather than rewriting the whole inverted index, additional indexes are added to reflect recent changes. Each inverted index is queried in turn, starting with the oldest, and the results are then aggregated.

Elasticsearch is built on Lucene, which introduced the concept of per-segment search. A segment is a fully functional inverted index in its own right, and an index in Lucene refers to a collection of segments plus a commit point (a file that lists all known segments), as shown in Figure 1. New documents are first written to an in-memory indexing buffer before being written out to an on-disk segment, as shown in Figures 2 and 3.

Figure 1: A Lucene index with one commit point and three segments

To avoid confusion: a Lucene index is called a shard in Elasticsearch, and an index in Elasticsearch is a collection of shards. When Elasticsearch searches an index, it sends the query to every shard in that index and then merges the per-shard results into a global result set.

A per-segment search works as follows:

1. New documents are first written to the in-memory indexing buffer.
2. From time to time, the buffer is committed: a new segment (a supplementary inverted index) is written to disk, and a new commit point listing the new segment is written to disk as well. The disk is fsync'ed: all writes wait for the file system cache to be synchronized to disk, ensuring they are physically written.
3. The new segment is opened, making the documents it contains searchable.
4. The in-memory buffer is cleared, ready to accept new documents.

Figure 2: The in-memory cache has the Lucene index for the document to be submitted

Figure 3: After the commit, a new segment is added to the commit point and the cache is cleared

When a search request comes in, all segments are queried in turn. Term statistics are aggregated across all segments to ensure that the relevance of each term to each document is computed correctly. In this way, new documents can be added to the index at relatively low cost.
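To make the idea concrete, here is a minimal Python sketch (not real Lucene code) in which each segment is just a dict-based inverted index, and a search queries every segment in order, oldest first:

```python
# Each "segment" is an immutable inverted index; new documents go into new
# segments rather than modifying old ones.
segments = [
    {"fox": [1], "quick": [1]},   # oldest segment
    {"fox": [2], "brown": [2]},   # newer segment holding recent documents
]

def search(term):
    hits = []
    for segment in segments:      # query segments in order, oldest first
        hits.extend(segment.get(term, []))
    return hits

# search("fox") → [1, 2]
```

Adding a new document only means creating another small segment and appending it to the list, which is exactly why per-segment indexing is so cheap.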

Delete and update

Segments are immutable, so a document can neither be removed from an old segment nor updated in place to reflect a newer version. Instead, each commit point includes a .del file listing the documents that have been deleted from each segment.

When a document is deleted, it is really just marked as deleted in the .del file. The document can still match a query, but it is removed from the results before they are returned.

Updates work similarly: when a document is updated, the old version is marked as deleted and the new version is indexed into a new segment. Both versions may match a query, but the older version is removed from the results.
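A toy Python sketch of this tombstone behavior (the `deleted` set stands in for the .del file; none of these names come from Lucene):

```python
# Segments are immutable, so deletes and updates only *mark* old documents.
segments = [
    {"fox": [1, 2]},   # old segment: doc 1 will be "deleted" below
    {"fox": [3]},      # new segment: doc 3 is the updated version of doc 1
]
deleted = {1}          # stand-in for the per-commit-point .del file

def search(term):
    hits = []
    for segment in segments:
        hits.extend(segment.get(term, []))
    # tombstoned documents still match, but are filtered before returning
    return [doc for doc in hits if doc not in deleted]

# search("fox") → [2, 3]
```

Note that doc 1 still physically exists in the old segment; it simply never reaches the caller. The space is only reclaimed later, during segment merging.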

Near real time search

Because of per-segment search, there is a delay between indexing a document and it becoming searchable. With full commits, a new document can be made searchable within minutes, but that is still not fast enough.

The disk is the bottleneck. Committing a new segment to disk requires an fsync to ensure that the segment is physically written and that data will not be lost on power failure. But an fsync is expensive; it cannot be performed every time a document is indexed.

So we need a lighter-weight way to make new documents searchable, which means taking fsync out of the indexing path.

Sitting between Elasticsearch and the disk is the file system cache. As before, documents in the in-memory indexing buffer (Figure 1) are written to a new segment (Figure 2), but the new segment is written to the file system cache first, which is cheap; only later is it synced to disk, which is expensive. Once a file is in the cache, it can be opened and read just like any other file.

Figure 1: Lucene index of the new document in the memory cache

Lucene allows a new segment to be written and opened, making the documents it contains searchable, without performing a full commit. This is a much lighter process than a commit and can be done frequently without degrading performance.

Figure 2: The cache content has been written to a segment, but not yet committed

refresh API

In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once per second. This is why we say Elasticsearch has near-real-time search: changes to a document are not searchable immediately, but become visible within one second.

This can be confusing for new users: they index a document, try to search for it, and find nothing. The solution is to perform a manual refresh via the API:

POST /_refresh <1>
POST /blogs/_refresh <2>

<1> refresh all indexes

<2> Only refresh index blogs
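For intuition, here is a toy Python model of the refresh cycle (purely illustrative, not Elasticsearch internals): documents sit in a buffer, invisible to search, until a refresh turns the buffer into a new searchable segment:

```python
buffer = []        # in-memory indexing buffer: not searchable
segments = []      # each segment: {term: [doc_ids]}, searchable

def index_doc(doc_id, text):
    buffer.append((doc_id, text))   # buffered only; a search won't see it yet

def refresh():
    # write the buffer out as a new searchable segment (no fsync, no commit)
    segment = {}
    for doc_id, text in buffer:
        for term in text.lower().split():
            segment.setdefault(term, []).append(doc_id)
    if segment:
        segments.append(segment)
    buffer.clear()

def search(term):
    return [d for seg in segments for d in seg.get(term, [])]

index_doc(1, "quick fox")
assert search("fox") == []    # not visible before refresh
refresh()
assert search("fox") == [1]   # visible after refresh
```

This is exactly the confusion new users hit: between `index_doc` and `refresh`, the document exists but no search can find it.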

Not every use case needs a refresh every second. Perhaps you are using Elasticsearch to index millions of log files and care more about indexing throughput than near-real-time search. You can reduce the refresh frequency with the refresh_interval setting:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s" <1>
  }
}

<1> Refresh my_logs every 30s

refresh_interval can be updated dynamically on an existing index. You can turn off automatic refresh while building a large index, then turn it back on when you start using the index.

PUT /my_logs/_settings
{ "refresh_interval": -1 } <1>

PUT /my_logs/_settings
{ "refresh_interval": "1s" } <2>

<1> Disable all automatic refresh

<2> Automatic refresh per second

Persistent changes

Without an fsync to synchronize the file system cache to disk, we cannot guarantee that data survives a power failure, or even a normal application exit. For Elasticsearch to be reliable, changes must be persisted to disk.

We have described how a full commit flushes segments to disk and writes a commit point, which lists all known segments. Elasticsearch uses this commit point to decide which segments belong to the current shard when restarting or reopening an index.

While a refresh every second gives us near-real-time search, we still need to perform full commits periodically to make sure we can recover from failure. But what about the documents changed between commits? We don't want to lose them either.

Elasticsearch added a translog, or transaction log, which records every operation. With the translog, the process now works as follows:

1. When a document is indexed, it is added to the in-memory buffer and also appended to the translog.

Figure 1: The new document is added to the memory cache while the transaction log is written

2. A refresh leaves the shard in the state shown in the next figure. Shards are refreshed once every second:

  • The documents in the in-memory buffer are written to a new segment, without an fsync.
  • The segment is opened, making its new documents searchable.
  • The buffer is cleared.

Figure 2: After a Refresh, the cache is cleared, but the transaction log is not

3. The process continues: more documents are added to the buffer and appended to the translog.

Figure 3: The transaction log records the growing document

4. From time to time, such as when the translog gets too big, the index is flushed: a new translog is created and a full commit is performed:

  • All documents in the in-memory buffer are written to a new segment.
  • The buffer is cleared.
  • A commit point is written to disk.
  • The file system cache is flushed to disk with an fsync.
  • The old translog is deleted.

The translog records all operations that have not yet been flushed to disk. After a crash and restart, Elasticsearch uses the last commit point to recover all known segments from disk, and then replays every operation recorded in the translog since that commit.

The translog is also what makes CRUD operations real-time. When you retrieve, update, or delete a document by ID, Elasticsearch first checks the translog for recent changes before looking in the segments. This means it always has access to the latest version of a document, in real time.

Figure 4: After flush, the segment is fully committed and the transaction log is cleared
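The interplay between the translog and a flush can be sketched in Python. Everything here, including the `TranslogStore` class name, is a hypothetical illustration of the idea, not Elasticsearch's implementation:

```python
class TranslogStore:
    """Toy model: committed state on 'disk' plus a log of operations
    performed since the last commit."""
    def __init__(self):
        self.committed = {}   # what the last full commit persisted
        self.translog = []    # ("index" or "delete", doc_id, doc) since then

    def index(self, doc_id, doc):
        self.translog.append(("index", doc_id, doc))

    def get(self, doc_id):
        # real-time GET: check the translog for the newest change first
        for op, tid, doc in reversed(self.translog):
            if tid == doc_id:
                return doc if op == "index" else None
        return self.committed.get(doc_id)

    def flush(self):
        # full commit: apply the log to durable state, then clear it
        for op, doc_id, doc in self.translog:
            if op == "index":
                self.committed[doc_id] = doc
            else:
                self.committed.pop(doc_id, None)
        self.translog = []

store = TranslogStore()
store.index(1, {"title": "hello"})
# the doc is visible by ID immediately, even before any flush
assert store.get(1) == {"title": "hello"}
store.flush()
assert store.translog == [] and store.get(1) == {"title": "hello"}
```

Recovery after a crash is the same `flush` logic run at startup: restore the committed state, then replay whatever the translog still holds.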

flush API

In Elasticsearch, the operation of performing a commit and truncating the translog is called a flush. Shards are flushed automatically every 30 minutes, or when the translog grows too large.

The Flush API can be used to perform a manual flush:

POST /blogs/_flush <1>

POST /_flush?wait_for_ongoing <2>

<1> flush index blogs

<2> Flush all indexes, waiting until the operation is complete

You will rarely need to flush manually; it usually happens automatically.

Flushing an index is useful before restarting a node or closing an index. When Elasticsearch restores or reopens an index, it must replay all the operations in the translog, so the smaller the translog, the faster the recovery.

Segment merging

With a new segment created by every automatic refresh each second, it doesn't take long for the number of segments to explode. Having too many segments is a problem. Each segment consumes file handles, memory, and CPU. More importantly, every search request must check each segment in turn, so the more segments there are, the slower the search.

Elasticsearch solves this by merging segments in the background. Small segments are merged into larger segments, which in turn are merged into even larger segments.

This is also when old deleted documents are actually purged from the file system: deleted documents (and old versions of updated documents) are simply not copied into the new, larger segment.

You don't have to do anything about this. Elasticsearch handles it automatically as you index and search. The process works like this:

1. During indexing, Refresh creates a new segment and opens it.

2. The merge process selects some small segments in the background and merges them into one large segment, without interrupting indexing or search.

Figure 1: Two committed segments and one uncommitted segment are merged into a larger segment

3. When the merge completes, as shown in the next figure:

  • The new segment is flushed to disk.
  • A new commit point is written that includes the new segment and excludes the old, smaller segments.
  • The new segment is opened for search.
  • The old segments are deleted.

Figure 2: After the segment is merged, the old segment is deleted
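The key property of a merge, that deleted documents are simply not copied into the new segment, can be sketched as follows (illustrative Python, not Lucene's merge code):

```python
def merge_segments(segments, deleted):
    """Merge several immutable segments into one new segment, leaving out
    documents marked as deleted: they are simply not copied over."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            kept = [d for d in doc_ids if d not in deleted]
            merged.setdefault(term, []).extend(kept)
    return merged

old_segments = [{"fox": [1, 2]}, {"fox": [3]}, {"brown": [2]}]
new_segment = merge_segments(old_segments, deleted={1})
# new_segment → {"fox": [2, 3], "brown": [2]}
```

After the merge, the tombstoned doc 1 is gone for good, and a search only has one segment to consult instead of three.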

Merging large segments consumes a lot of I/O and CPU, and left unchecked it would hurt search performance. By default, Elasticsearch throttles the merge process so that searches still have enough resources to run well.

optimize API

The optimize API is best described as the forced merge API. It forces a shard to merge its segments down to the number specified by the max_num_segments parameter. The goal is to improve search performance by reducing the number of segments, typically to one.

Be careful about using the optimize API on a dynamic index that is being actively updated. The background merge process already does a good job; forcing a merge just gets in the way. Don't interfere!

The optimize API is useful in certain scenarios. A typical one is logging, where logs are indexed into a new index per day, week, or month. Old indexes are essentially read-only; they are very unlikely to change. In this case, it makes sense to merge each shard of an old index down to a single segment: searches will use fewer resources and perform better:

POST /logstash-2014-10/_optimize?max_num_segments=1 <1>

<1> Merge each shard in the index into a segment

Reference: Elasticsearch: The Definitive Guide