“This is where Elasticsearch stands out: It encourages you to explore and exploit your data, rather than letting it rot in a data warehouse because it’s too hard to query. Elasticsearch will be your best friend.”

Immutability

An inverted index is immutable once written to disk: it never changes. Immutability has important benefits:

  • No locks required. If you never update indexes, you don’t need to worry about multiple processes modifying data at the same time.
  • Once the index is read into the kernel’s file system cache, it stays there because of its immutability. As long as there is enough space in the file system cache, most read requests go directly to memory and do not hit disk. This provides a significant performance boost.
  • Other caches, like the filter cache, remain valid for the lifetime of the index. They do not need to be rebuilt every time the data changes, because the data does not change.
  • Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of index data that needs to be cached in memory.

Of course, an immutable index has downsides too, chiefly that it is immutable: you can't modify it. To make a new document searchable, you have to rebuild the entire index. This places a significant limit either on the amount of data an index can contain or on how often the index can be updated.
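
To make the trade-off concrete, here is a minimal sketch of a write-once inverted index. It is only an illustration: `build_inverted_index` is a hypothetical helper, the tokenizer is a plain whitespace split, and nothing here reflects how Lucene actually stores its postings.

```python
from collections import defaultdict
from types import MappingProxyType


def build_inverted_index(docs):
    """Build a read-only inverted index: term -> sorted tuple of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[term].add(doc_id)
    # Freeze the result: tuples cannot be mutated and the proxy rejects writes.
    return MappingProxyType({t: tuple(sorted(ids)) for t, ids in postings.items()})


docs = {1: "quick brown fox", 2: "lazy brown dog"}
index = build_inverted_index(docs)
print(index["brown"])  # (1, 2)

# The index cannot be changed in place, so making a new document
# searchable means rebuilding the whole thing from scratch.
docs[3] = "quick red fox"
index = build_inverted_index(docs)
print(index["quick"])  # (1, 3)
```

Because readers only ever see a frozen mapping, no locking is needed, but the full rebuild on every change is exactly the cost described above.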

Dynamically updatable indexes

The next problem to be solved is how to make an inverted index updatable without losing the benefits of immutability. The answer is: use more than one index.

Add new supplementary indexes to reflect recent changes, rather than rewriting the entire inverted index. Each inverted index is queried in turn – starting with the earliest – and then the results are merged.

Elasticsearch is built on Lucene, a Java library that introduced the concept of per-segment search. Each segment is itself an inverted index, but in Lucene the word "index" came to mean a collection of segments plus a commit point, a file that lists all known segments.

Per-segment search works as follows:

  • New documents are collected in an in-memory indexing buffer.

  • From time to time, the buffer is committed:

      • A new segment, a supplementary inverted index, is written to disk.

      • A new commit point that includes the name of the new segment is written to disk.

      • The disk is synced: all writes waiting in the file system cache are flushed to disk to ensure they have been physically written.

  • The new segment is opened, making the documents it contains visible to search.

  • The in-memory buffer is cleared and is ready to accept new documents.
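
The commit cycle above can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the `Shard` class, its method names, and the `segment_N` naming are hypothetical, and the disk writes and sync are reduced to a comment.

```python
from collections import defaultdict


class Shard:
    """Toy model of per-segment indexing (hypothetical names, not Lucene's API)."""

    def __init__(self):
        self.buffer = []        # in-memory indexing buffer of (doc_id, text) pairs
        self.segments = {}      # opened segments: name -> immutable inverted index
        self.commit_point = ()  # names of all known segments

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))

    def commit(self):
        if not self.buffer:
            return
        postings = defaultdict(set)
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                postings[term].add(doc_id)
        name = f"segment_{len(self.commit_point)}"
        segment = {term: frozenset(ids) for term, ids in postings.items()}
        # A real system would write the segment and the new commit point
        # to disk and sync them here; this sketch keeps both in memory.
        self.commit_point = self.commit_point + (name,)
        self.segments[name] = segment  # open the segment for searching
        self.buffer.clear()            # ready to accept new documents

    def search(self, term):
        hits = set()
        for name in self.commit_point:  # query each segment in turn, oldest first
            hits |= self.segments[name].get(term, frozenset())
        return sorted(hits)


shard = Shard()
shard.add(1, "quick brown fox")
print(shard.search("fox"))  # [] because the document is still in the buffer
shard.commit()
shard.add(2, "lazy red fox")
shard.commit()
print(shard.search("fox"))  # [1, 2]
print(shard.commit_point)   # ('segment_0', 'segment_1')
```

Note that existing segments are never touched by a commit; each one only adds a new segment and replaces the commit point.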

When a query is issued, all known segments are queried in turn. Term statistics are aggregated across all segments to ensure that the relevance of each term and each document is calculated accurately. This way, new documents can be added to the index at a relatively low cost.
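
As a rough illustration of why statistics must be aggregated, the sketch below sums document counts and document frequencies over all segments before scoring. The segment layout and the simple TF-IDF weight are assumptions chosen for brevity; this is not Lucene's actual scoring formula.

```python
import math

# Hypothetical segment layout: a document count plus
# postings of the form term -> {doc_id: term_frequency}.
segments = [
    {"doc_count": 2, "postings": {"fox": {1: 1}, "brown": {1: 1, 2: 1}}},
    {"doc_count": 1, "postings": {"fox": {3: 2}}},
]


def search(term, segments):
    """Score matches for one term using statistics summed over every segment."""
    total_docs = sum(s["doc_count"] for s in segments)
    doc_freq = sum(len(s["postings"].get(term, {})) for s in segments)
    if doc_freq == 0:
        return []
    # One global inverse document frequency, shared by all segments.
    idf = math.log(1 + total_docs / doc_freq)
    hits = []
    for s in segments:  # query every segment in turn
        for doc_id, tf in s["postings"].get(term, {}).items():
            hits.append((doc_id, tf * idf))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for doc_id, score in search("fox", segments):
    print(doc_id, round(score, 3))
```

If each segment computed its own frequency figures in isolation, the same term would be weighted differently depending on which segment a document happened to land in.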

Delete and update

Segments are immutable, so a document can neither be removed from an older segment, nor can an older segment be modified to reflect a newer version of the document. Instead, every commit point includes a .del file that lists which documents in which segments have been deleted.

When a document is “deleted”, it is really just marked as deleted in the .del file. A document marked as deleted can still match a query, but it is removed from the result set before the final results are returned.

Document updates work similarly: when a document is updated, the old version of the document is marked for deletion and the new version of the document is indexed into a new segment. It is possible that both versions of a document will be matched by a query, but the deleted old version will be removed before the result set is returned.
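
A minimal sketch of this mark-then-filter behavior follows. The `Shard` class and its methods are hypothetical, the `deleted` set merely stands in for the .del file, and writing one segment per document is a simplification to keep the example short.

```python
class Shard:
    """Toy model (hypothetical names): immutable segments plus deletion marks."""

    def __init__(self):
        self.segments = []    # each segment is a tuple of (doc_id, text); never modified
        self.deleted = set()  # stands in for the .del file: (segment_no, position)

    def index(self, doc_id, text):
        self.delete(doc_id)                      # an update marks the old version first
        self.segments.append(((doc_id, text),))  # the new version goes into a new segment

    def delete(self, doc_id):
        # Nothing is removed from any segment; the document is only marked.
        for seg_no, segment in enumerate(self.segments):
            for pos, (existing_id, _) in enumerate(segment):
                if existing_id == doc_id:
                    self.deleted.add((seg_no, pos))

    def search(self, term):
        matches = [
            (seg_no, pos, doc_id, text)
            for seg_no, segment in enumerate(self.segments)
            for pos, (doc_id, text) in enumerate(segment)
            if term in text.lower().split()
        ]
        # Both old and new versions may match; drop marked ones before returning.
        return [
            (doc_id, text)
            for seg_no, pos, doc_id, text in matches
            if (seg_no, pos) not in self.deleted
        ]


shard = Shard()
shard.index(1, "quick brown fox")
shard.index(1, "quick red fox")  # update: old version marked, new segment written
print(shard.search("fox"))       # [(1, 'quick red fox')]
print(len(shard.segments))       # 2, the old segment is still there
```

The old version stays in its segment untouched; only the deletion mark keeps it out of the results.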

Reference: ElasticSearch Docs