I’ve been using Elasticsearch for a while now, but I had never really looked at Lucene underneath it. Today I’m writing up a systematic summary.

This article covers the basic architecture of Lucene, the inverted index, text analysis (word segmentation), Lucene’s query syntax, and the difference between Lucene and Elasticsearch.


Basic architecture

  • Document: The primary data carrier for indexing and searching, containing multiple fields.
  • Field: A document is made up of multiple fields. Each field holds one specific piece of information.
  • Term: The unit of search. It can be simply understood as a word extracted from the text of a field.
  • Token: One occurrence of a term in a piece of text. It contains not only the text of the term but also its start position, end position, and other information.
  • Segment: Each segment is written only once and cannot be modified after it has been created. As a result, there is a segment merging process that reduces the number of segments and improves search performance. Deleted data is only physically removed from a segment during a segment merge.

What is the difference between term and token?

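For example, take the following text: My brother love ElasticSearch. I love ElasticSearch, too.

Tokens: My, brother, love, ElasticSearch, I, love, ElasticSearch, too

Terms: My, brother, love, ElasticSearch, I, too

Every occurrence of a word is a token, so love and ElasticSearch each appear twice in the token list. A term is the distinct word, so each appears only once in the term list.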
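To make the distinction concrete, here is a minimal Python sketch. It is not Lucene code: `tokenize` and `terms` are hypothetical helper names, and splitting on runs of word characters is a simplifying assumption that stands in for a real tokenizer.

```python
import re

def tokenize(text):
    """Return one (word, start, end) tuple per occurrence of a word in text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def terms(tokens):
    """Collapse tokens into distinct terms, keeping first-seen order."""
    seen = []
    for word, _start, _end in tokens:
        if word not in seen:
            seen.append(word)
    return seen

text = "My brother love ElasticSearch.I love ElasticSearch,too."
tokens = tokenize(text)

# Tokens keep every occurrence together with its position in the text.
print([word for word, _, _ in tokens])
# Terms are the distinct words only.
print(terms(tokens))
```

The token list has eight entries and the term list has six, because the two repeated words collapse into one term each.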

Inverted index

In short, one of the main features of inverted indexing is that it is term oriented rather than document oriented. For example, suppose you have the following documents.

doc1:I love ElasticSearch
doc2:I love Java
doc3:I hate sleeping

If you build the index in the traditional document-oriented way, then a search for love has to traverse all the fields in doc1, then all the fields in doc2, then all the fields in doc3, and so on to the last document. Only then can you determine that the documents containing the word “love” are doc1 and doc2.

What if I use an inverted index?

The following inverted index is then formed.

Term           Doc
i              doc1,doc2,doc3
love           doc1,doc2
hate           doc3
elasticsearch  doc1
java           doc2
sleeping       doc3

So when you search for love, you only need to look the term up once, and doc1 and doc2 are returned directly. The advantage is that as soon as the term is matched, its document list is available. There is no need to traverse all the documents as with a document-oriented index.
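The idea can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: `build_inverted_index` and `search` are hypothetical names, and splitting on whitespace plus lowercasing is an assumed, very simplified analysis step.

```python
from collections import defaultdict

docs = {
    "doc1": "I love ElasticSearch",
    "doc2": "I love Java",
    "doc3": "I hate sleeping",
}

def build_inverted_index(docs):
    """Map each lowercased term to the list of documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # set() so a term repeated inside one document is recorded once
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

def search(index, word):
    """A single dictionary lookup replaces scanning every document."""
    return index.get(word.lower(), [])

index = build_inverted_index(docs)
print(search(index, "Love"))  # ['doc1', 'doc2']
```

The cost of a lookup depends on finding the term, not on how many documents exist, which is the whole point of indexing by term.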

Word segmentation

The process by which a piece of text is turned into searchable terms in Lucene is called analysis. It is sometimes also called word segmentation. In Lucene, this work is done by an analyzer.

For example:

I love ElasticSearch  ->  [i,love,elasticsearch]

Here the original text is converted by the analyzer into multiple terms.

Text analysis is performed by an analyzer, which in turn consists of character filters, a tokenizer, and token filters.

So what do these three components do?

Let me give you an example. Take the following text.

I love! ElasticSearch.

After going through the character filter:

I love! ElasticSearch.   -> I love ElasticSearch

Character filters remove unwanted characters, such as "!", from the original text. They turn the raw character stream into a cleaned-up character stream.

Next, it goes through the tokenizer.

I love ElasticSearch -> [I,love,ElasticSearch]

The tokenizer turns the character stream into a sequence of tokens. These are what you will eventually search by.

[I,love,ElasticSearch] -> [i,love,elasticsearch]

Finally, the token filters normalize the raw tokens, for example by converting them all to lowercase. The resulting terms are what Lucene stores in the index.

This, of course, is a simplified picture modeled on the default behavior of the standard analyzer. Different analyzers can combine different character filters, tokenizers, and token filters to perform the analysis.
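The three stages above can be sketched as three small Python functions chained together. This is a toy model, not Lucene's API: in Lucene the corresponding building blocks are the CharFilter, Tokenizer, and TokenFilter classes, while the function names below are hypothetical and the rules (strip punctuation, split on whitespace, lowercase) are assumptions chosen to match the example.

```python
import re

def char_filter(text):
    """Character filter stage: replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text)

def tokenizer(text):
    """Tokenizer stage: split the cleaned character stream on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter stage: normalize every token to lowercase."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the three stages in order, as an analyzer would."""
    return lowercase_filter(tokenizer(char_filter(text)))

print(analyze("I love! ElasticSearch."))  # ['i', 'love', 'elasticsearch']
```

Swapping any one stage for a different function gives a different analyzer, which is how different analyzers produce different terms from the same text.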

The query syntax

A Lucene query is composed of the query content (the terms to look for) and operators that say how those terms are combined.
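As a rough illustration of terms plus operators, the sketch below handles just two operators from Lucene's classic query syntax: a leading + marks a term as required and a leading - marks it as prohibited. Everything else here is an assumption for the sake of the example: `parse` and `evaluate` are hypothetical names, the index is a hand-written dictionary of sets, and the real query parser supports far more (fields, phrases, AND/OR/NOT, wildcards, ranges).

```python
index = {
    "i": {"doc1", "doc2", "doc3"},
    "love": {"doc1", "doc2"},
    "hate": {"doc3"},
    "elasticsearch": {"doc1"},
    "java": {"doc2"},
    "sleeping": {"doc3"},
}

def parse(query):
    """Split a query into required (+), prohibited (-), and optional terms."""
    required, prohibited, optional = [], [], []
    for part in query.lower().split():
        if part.startswith("+") and len(part) > 1:
            required.append(part[1:])
        elif part.startswith("-") and len(part) > 1:
            prohibited.append(part[1:])
        else:
            optional.append(part)
    return required, prohibited, optional

def evaluate(index, query):
    """Return the sorted ids of the documents that satisfy the query."""
    required, prohibited, optional = parse(query)
    if required:
        # every required term must be present
        hits = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # with no required terms, any optional term is enough
        hits = set().union(*(index.get(t, set()) for t in optional))
    for term in prohibited:
        hits -= index.get(term, set())
    return sorted(hits)

print(evaluate(index, "+love -java"))    # ['doc1']
print(evaluate(index, "java sleeping"))  # ['doc2', 'doc3']
```

A query made only of prohibited terms returns nothing in this sketch, since there is no positive term to select documents in the first place.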

Lucene’s relationship to Elasticsearch

First of all, Elasticsearch is built on top of Lucene. Lucene is a lightweight, standalone search library. Elasticsearch adds distribution, scalability, and high availability on top of it.

About writing

From now on I will write an article here every day, with no limit on subject, content, or word count. I will try to put my own thinking into each one.

If this article helped you, feel free to give it a thumbs-up, and a follow would be even better.

If not, why not write down what you thought after reading it? Useful feedback and your encouragement are the biggest help to me.

In addition, I plan to pick up my blog again. You are welcome to drop by and have a look.

I’m shane. Today is September 6, 2019. The forty-fourth day of the 100-day writing project, 44/100.