Preface

To most users, ES is a black box: if you don't know its inner workings, all you can do is call the ES API for basic reads and writes, and when something goes wrong you have nothing to go on.

Therefore, to understand the internal structure of ES more deeply and to troubleshoot the problems that come up in practice, it is important to know how ES works under the hood. Below, I walk through those underlying mechanisms in detail.

How ES writes data

  • The client sends a request to some node, which acts as the coordinating node for that request.
  • The coordinating node routes the document by its id and forwards the request to the node holding the corresponding primary shard.
  • The primary shard on that node processes the request and then synchronizes the data to its replica shards.
  • Once the coordinating node sees that the primary shard and all replica shards have finished, it returns the response to the client (a minimal client-side sketch follows below).
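As a rough illustration of a single write, here is a minimal sketch using the Elasticsearch Java High Level REST Client (7.x); the host, the "articles" index, the document id, and the "title" field are all made up for the example.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // The node we connect to acts as the coordinating node for this request.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // ES hashes the doc id to pick the primary shard, writes there,
            // and then synchronizes the document to the replica shards.
            IndexRequest request = new IndexRequest("articles")
                    .id("1")
                    .source("{\"title\":\"Java is really fun\"}", XContentType.JSON);

            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            System.out.println(response.getResult()); // CREATED or UPDATED
        }
    }
}
```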

How ES reads data

A document can be retrieved by its doc id: the doc id is hashed to determine which shard the document was written to, and the query is directed to that shard.

  • The client sends a request to any node, which becomes the coordinating node.
  • The coordinating node hashes the doc id to find the right shard, then uses round-robin polling to pick one copy among that shard's primary and replicas, balancing the read load across them.
  • The node that receives the forwarded request returns the document to the coordinating node.
  • The coordinating node returns the document to the client (see the sketch below).
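A corresponding read by doc id might look like the following minimal sketch, under the same assumed client, host, and "articles" index as above.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // GET by doc id: the coordinating node hashes "1" to locate the shard,
            // then round-robins between that shard's primary and replica copies.
            GetRequest request = new GetRequest("articles", "1");
            GetResponse response = client.get(request, RequestOptions.DEFAULT);
            if (response.isExists()) {
                System.out.println(response.getSourceAsString());
            }
        }
    }
}
```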

How ES searches data

The most powerful feature of ES is full-text search. Suppose you have three documents:

  • Java is really fun
  • Java is very hard to learn
  • J2EE is especially awesome

If you search for the keyword Java, ES returns the documents that contain Java: "Java is really fun" and "Java is very hard to learn".

  • The client sends a request to a coordinating node.
  • The coordinating node forwards the search request to all shards of the index; for each shard, either the primary or a replica copy can serve it.
  • Query phase: each shard returns its own matches (really just doc ids) to the coordinating node, which merges, sorts, and paginates them to produce the final result set.
  • Fetch phase: the coordinating node then uses those doc ids to pull the actual document data from the shards, and finally returns the documents to the client (a sketch of such a search follows below).
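A minimal sketch of such a full-text search with the same assumed client is shown below; the "title" field and "articles" index are, again, invented for illustration.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Full-text search: the coordinating node fans the query out to every
            // shard (query phase), then fetches the matching docs (fetch phase).
            SearchRequest request = new SearchRequest("articles")
                    .source(new SearchSourceBuilder()
                            .query(QueryBuilders.matchQuery("title", "Java")));
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : response.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}
```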

Write requests always go to the primary shard first and are then synchronized to all replica shards. Read requests can be served by either the primary shard or a replica shard, chosen by round-robin polling.

Underlying principles of writing data

Data is first written to an in-memory buffer; while it sits in the buffer it cannot be searched. At the same time, the operation is appended to a translog log file.

When the buffer is nearly full, or a certain amount of time has passed, the buffer's data is refreshed into a new segment file. The data does not go straight to the segment file on disk, though; it first lands in the OS cache. This process is called refresh.

By default, ES writes the buffer's data into a new segment file every second, so a new segment file is generated each second; that segment file holds the data written to the buffer during the preceding second.

Of course, if there is no data in the buffer, no refresh takes place; if there is data in the buffer, a refresh is executed once per second by default.

Operating systems keep a so-called OS cache in front of disk files: before data reaches a disk file, it first enters this operating-system-level memory cache. As soon as the refresh operation moves the buffer's data into the OS cache, that data can be searched.

Why is ES only near real-time? NRT stands for near real-time: because the default refresh interval is one second, data you write only becomes visible to search roughly one second later. You can also trigger a refresh manually through the ES RESTful API or the Java API, flushing the buffer's data into the OS cache so it becomes searchable immediately. Once the data is in the OS cache, the buffer is cleared; because the operations have already been recorded in the translog, there is no need to keep them in the buffer.
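A sketch of such a manual refresh with the Java High Level REST Client (same assumed host and "articles" index as earlier) might look like this:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ManualRefresh {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Push the in-memory buffer into the OS cache as a new segment,
            // so recently written documents become searchable immediately.
            client.indices().refresh(new RefreshRequest("articles"), RequestOptions.DEFAULT);
        }
    }
}
```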

This cycle repeats: new data keeps flowing into the buffer and the translog, and the buffer's contents keep being written out into one new segment file after another. After each refresh the buffer is cleared, while the translog is retained, so the translog keeps growing. Once the translog reaches a certain size, a commit operation is triggered.

The first step of the commit is to refresh whatever is currently in the buffer into the OS cache, emptying the buffer. Then a commit point is written to a disk file; it identifies all the segment files that exist as of this commit. At the same time, all data currently sitting in the OS cache is forced onto disk with fsync. Finally, the existing translog file is emptied and a new translog is started, which completes the commit.

This commit operation is called flush. By default a flush runs automatically every 30 minutes, but it is also triggered when the translog grows too large. The flush operation corresponds to the entire commit process described above. You can also trigger a flush manually through the ES API, fsyncing the OS-cache data to disk.
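A manual flush can likewise be triggered from the Java client; here is a minimal sketch under the same assumptions as the earlier examples.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.flush.FlushRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ManualFlush {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Commit: fsync OS-cache data into segment files on disk,
            // write a commit point, and start a fresh translog.
            client.indices().flush(new FlushRequest("articles"), RequestOptions.DEFAULT);
        }
    }
}
```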

What is the translog file for? Before a commit completes, data sits either in the buffer or in the OS cache; both are memory, and if the machine dies, everything in memory is lost. That is why each operation is also recorded in a dedicated log file, the translog. If the machine goes down, ES automatically reads the translog on restart and replays it to restore the data into the in-memory buffer and the OS cache.

The translog itself is also written to the OS cache first and is only fsynced to disk every 5 seconds by default. So, by default, up to 5 seconds of data may live only in memory (in the buffer, or in the OS cache of the segment files or the translog); if the machine crashes at that moment, those 5 seconds of data are lost. The upside is better performance, with at most 5 seconds of data at risk. You can instead configure the translog so that every write is fsynced directly to disk, but performance suffers considerably.

  • index.translog.sync_interval controls how often the translog is fsynced to disk; the minimum allowed value is 100ms.
  • index.translog.durability controls whether the translog is fsynced on that interval or on every request. It takes two values: request (fsync on every request; ES does not return success until the translog has been fsynced to disk) and async (the default; the translog is fsynced every 5 seconds). An index-creation sketch applying these settings follows below.
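The sketch below creates a hypothetical "articles" index with the async/5-second translog behavior described above; both setting names are real index settings, but the index name and the chosen values are example choices, not recommendations.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class TranslogSettingsExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Trade durability for throughput: fsync the translog asynchronously
            // every 5 seconds instead of on every request.
            CreateIndexRequest request = new CreateIndexRequest("articles")
                    .settings(Settings.builder()
                            .put("index.translog.durability", "async")
                            .put("index.translog.sync_interval", "5s"));
            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }
}
```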

So, on the question of latency and data loss: written data becomes searchable after about 1 second, and up to 5 seconds of data that exists only in the buffer, the translog OS cache, or the segment-file OS cache, and not yet on disk, can be lost if the machine goes down.

To summarize: data is written to the buffer and then refreshed into the OS cache every second; once it is in the OS cache it can be searched. The translog is fsynced to disk every 5 seconds (so if the machine goes down, at most the last 5 seconds of data, which exist only in memory, are lost). When the translog grows large enough, or every 30 minutes by default, a commit (flush) is triggered, fsyncing all OS-cache data into segment files on disk.

After the data is written to the segment file, an inverted index is created.

Underlying principles of deleting/updating data

If a doc is deleted, a .del file is generated at commit time, in which the doc is marked as deleted. Searches then consult the .del file to determine whether a doc has been deleted.

If a doc is updated, the original doc is marked as deleted and a new doc is written with the updated data. (A sketch of both operations follows below.)
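The following minimal sketch shows an update and a delete issued from the Java client under the same assumptions as the earlier examples; internally, the update becomes a mark-as-deleted plus a new write.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class DeleteUpdateExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Update: internally the old doc is marked deleted and a new one is written.
            UpdateRequest update = new UpdateRequest("articles", "1")
                    .doc("{\"title\":\"Java is very hard to learn\"}", XContentType.JSON);
            client.update(update, RequestOptions.DEFAULT);

            // Delete: the doc is only marked as deleted (via the .del mechanism)
            // until a later segment merge physically removes it.
            client.delete(new DeleteRequest("articles", "1"), RequestOptions.DEFAULT);
        }
    }
}
```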

A segment file is generated on every refresh of the buffer, so by default a new segment file appears every second and the number of segment files keeps growing. ES therefore periodically merges segments: when multiple segment files are merged into one, the docs marked as deleted are physically removed, the new segment file is written to disk, a commit point is written identifying all the new segment files, the new segment file is opened for search, and the old segment files are deleted.
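Segment merging happens automatically in the background, but it can also be requested explicitly with the force merge API; a sketch with the Java client follows. The single-segment target is just an example and is usually only sensible for indices that no longer receive writes.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ForceMergeExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Merge the index's segments down to one; docs marked as deleted
            // are physically removed during the merge.
            ForceMergeRequest request = new ForceMergeRequest("articles")
                    .maxNumSegments(1);
            client.indices().forcemerge(request, RequestOptions.DEFAULT);
        }
    }
}
```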

The underlying Lucene

In a nutshell, Lucene is a JAR package containing the code for the various algorithms used to build inverted indexes. We develop in Java, pull in the Lucene JAR, and then build on top of the Lucene API.

With Lucene, we can index our existing data, and Lucene organizes the index data structures on the local disk for us.
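As a minimal sketch of what developing on top of the Lucene API looks like (assuming the lucene-core and lucene-queryparser modules are on the classpath; the index path "/tmp/lucene-index" and the "title" field are invented for the example):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Lucene stores the inverted index as files inside this directory.
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));

        // Index one document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Java is really fun", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the inverted index for documents containing "java".
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new QueryParser("title", analyzer).parse("java"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```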

Inverted index

In a search engine, each document has a corresponding document ID, and the document's content is represented as a set of keywords. For example, suppose document 1 yields 20 keywords after tokenization; for each keyword, the index records how often and where it appears in that document.

An inverted index, then, is a mapping from keywords to document IDs, where each keyword corresponds to the series of documents in which it appears.

Here’s an example.

There are the following documents:

After the documents are tokenized, the following inverted index is obtained.

In addition, a practical inverted index can record extra information, such as document frequency, i.e. how many documents in the collection contain a given word.

With an inverted index in place, the search engine can respond to user queries easily. For example, when a user searches for Facebook, the system looks up the inverted index, reads out the documents containing that word, and returns them to the user as the search results.

Note two important details about inverted indexes:

  • Every term in an inverted index corresponds to one or more documents;
  • Terms in an inverted index are sorted in ascending lexicographical order.

The above is just a simple example, and the terms are not shown in strict ascending dictionary order.
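To make the structure concrete, here is a toy, purely illustrative inverted index in plain Java: a TreeMap keeps the terms in ascending lexicographical order, and each term maps to the sorted set of doc ids that contain it.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        List<String> docs = List.of(
                "Java is really fun",         // doc id 0
                "Java is very hard to learn", // doc id 1
                "J2EE is especially awesome"  // doc id 2
        );

        // term -> sorted set of doc ids; TreeMap keeps terms in
        // ascending lexicographical order, as described above.
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        // "Searching" is just a lookup of the term's posting list.
        System.out.println(index.get("java")); // [0, 1]
        System.out.println(index);             // full term -> doc id mapping
    }
}
```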