Preface

I’ve been using ElasticSearch for a while now, but I had never looked into Lucene itself. Today I’m writing up a systematic summary.

This post covers how to query Lucene with its query syntax and how Lucene differs from ElasticSearch.

Lucene

Basic architecture

  • Document: The primary data carrier for indexing and searching, containing multiple fields.
  • Field: A document is made up of multiple fields, each holding a specific piece of information.
  • Term: The unit of search. It can be understood as a single word split out of a piece of text.
  • Token: One occurrence of a term in the text. It carries not only the content of the term but also its start offset, end offset, and other information.
  • Segment: Each segment is written only once and is never modified after it is created. Segments are therefore merged over time, which reduces their number and improves search performance. Deleted data is only physically removed from a segment during a segment merge.
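The write-once behaviour of segments can be sketched roughly as follows. This is only a toy model: the Segment class and the merge function are hypothetical names invented for illustration and are not Lucene's API. The idea it shows is that a delete only records a marker, and the document physically disappears when segments are merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical model: an immutable batch of documents."""
    docs: tuple  # (doc_id, text) pairs, fixed at creation time


def merge(segments, deleted_ids):
    """Combine segments into one new segment, dropping deleted documents."""
    kept = tuple(
        (doc_id, text)
        for seg in segments
        for doc_id, text in seg.docs
        if doc_id not in deleted_ids
    )
    return Segment(docs=kept)


seg_a = Segment(docs=((1, "I love ElasticSearch"), (2, "I love Java")))
seg_b = Segment(docs=((3, "I hate sleeping"),))

# Deleting doc 2 only records a marker; seg_a itself is left untouched.
deleted = {2}

merged = merge([seg_a, seg_b], deleted)
print(len(seg_a.docs))                    # the old segment still holds both docs
print([doc_id for doc_id, _ in merged.docs])  # the merged segment no longer has doc 2
```

Fewer, larger segments mean fewer places to look during a search, which is why merging helps performance.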

What is the difference between a term and a token? Take the following text as an example: "My brother love ElasticSearch, I love ElasticSearch, too." The tokens are My, brother, love, ElasticSearch, I, love, ElasticSearch, too. The terms are My, brother, love, ElasticSearch, I, too. Every occurrence is a token, while repeated words collapse into a single term.
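The distinction can be made concrete with a small sketch. It assumes the simplest possible splitting rule (runs of word characters), which is an assumption for illustration and not how any particular Lucene tokenizer is implemented.

```python
import re

text = "My brother love ElasticSearch, I love ElasticSearch, too."

# A token is one occurrence of a word, together with its start and end offsets.
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

# A term is the distinct word itself; repeated occurrences collapse into one.
terms = list(dict.fromkeys(word for word, _, _ in tokens))

print(tokens[0])    # ('My', 0, 2)
print(len(tokens))  # 8 occurrences
print(terms)        # 6 distinct terms, in first-seen order
```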

Inverted index

Simply put, the most important property of an inverted index is that it is term-oriented rather than document-oriented. For example, consider the following documents.

doc1:I love ElasticSearch
doc2:I love Java
doc3:I hate sleeping

With a traditional, document-oriented index, a search for “love” has to scan every field of doc1, then every field of doc2, and so on through the last document, before it can conclude that the documents containing “love” are doc1 and doc2.

What if you use an inverted index?

This will form the following inverted index.

Term           Docs
i              doc1, doc2, doc3
love           doc1, doc2
elasticsearch  doc1
java           doc2
hate           doc3
sleeping       doc3

So a search for love needs at most one lookup in the index and simply returns doc1 and doc2. The advantage is that the result is available as soon as the term is matched; there is no need to scan every document as in the document-oriented approach.
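As a minimal sketch of the idea, the three example documents can be indexed with a plain dictionary. It assumes whitespace splitting and lowercasing as the only processing, and the search helper is a hypothetical name used for illustration; Lucene's real index structures are far more elaborate.

```python
from collections import defaultdict

docs = {
    "doc1": "I love ElasticSearch",
    "doc2": "I love Java",
    "doc3": "I hate sleeping",
}

# Map each lowercased term to the set of documents that contain it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)


def search(term):
    """Return the matching document ids with a single dictionary lookup."""
    return sorted(inverted.get(term.lower(), set()))


print(search("love"))  # ['doc1', 'doc2']
```

The lookup cost depends on finding the term, not on how many documents exist, which is the point of the term-oriented layout.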

Analysis (word segmentation)

The process of converting a piece of text into searchable terms is called analysis in Lucene; it is also often called word segmentation. In Lucene, this work is done by an analyzer.

For example

I love ElasticSearch  ->  [i,love,elasticsearch]

Here the original text is converted into multiple terms by an analyzer.

Text analysis is performed by an analyzer, which in turn consists of character filters, a tokenizer, and token filters.

So what does each of these three components do?

Let me give you an example. Take the following text.

I love! ElasticSearch.

After passing through the character filter:

I love! ElasticSearch.   -> I love ElasticSearch

A character filter removes or replaces unwanted characters in the original text, such as "!" and ".". In other words, it turns the raw character stream into a cleaned-up character stream before tokenization.

Then the text goes through the tokenizer:

I love ElasticSearch -> [I,love,ElasticSearch]

The tokenizer splits the character stream into a sequence of tokens. These are what searches are eventually matched against.

[I,love,ElasticSearch] -> [i,love,elasticsearch]

Finally, the token filter normalizes the tokens, for example by converting them all to lowercase. The resulting terms are what Lucene stores in the index.

Of course, this walk-through mirrors the output of the standard analyzer (in the standard analyzer itself, it is the tokenizer that drops the punctuation). Different analyzers combine different character filters, tokenizers, and token filters to complete the analysis.
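The three stages can be sketched as three small functions chained together. This is a simplified illustration and not Lucene's implementation: the function names are hypothetical, the character filter is assumed to strip all punctuation, and the tokenizer is assumed to split on whitespace only.

```python
import re


def char_filter(text):
    """Character filter: drop punctuation before tokenization."""
    return re.sub(r"[^\w\s]", "", text)


def tokenizer(text):
    """Tokenizer: split the filtered character stream on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the three stages in order, as an analyzer would."""
    return lowercase_filter(tokenizer(char_filter(text)))


print(char_filter("I love! ElasticSearch."))  # I love ElasticSearch
print(analyze("I love! ElasticSearch."))      # ['i', 'love', 'elasticsearch']
```

Swapping any one of the three functions for a different one gives a different analyzer, which is how the built-in analyzers differ from each other.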

Query syntax

A Lucene query is a combination of query content (the terms to search for) and operators such as AND, OR, and NOT that connect them.
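To illustrate how terms and operators combine, here is a toy evaluator that runs queries such as love AND java against a small hand-written index. It is only a sketch under strong assumptions: the evaluate function is hypothetical, it handles a single chain of AND, OR, and NOT strictly left to right, and it ignores fields, phrases, grouping, and everything else Lucene's real query parser supports.

```python
# Hand-written term -> documents mapping for the three example documents.
index = {
    "i": {"doc1", "doc2", "doc3"},
    "love": {"doc1", "doc2"},
    "elasticsearch": {"doc1"},
    "java": {"doc2"},
    "hate": {"doc3"},
    "sleeping": {"doc3"},
}


def evaluate(query):
    """Evaluate 'term (AND|OR|NOT term)*' strictly left to right."""
    parts = query.split()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(parts[0].lower(), set()))
    # Walk the remaining (operator, term) pairs.
    for op, term in zip(parts[1::2], parts[2::2]):
        postings = index.get(term.lower(), set())
        if op == "AND":
            result &= postings
        elif op == "OR":
            result |= postings
        elif op == "NOT":
            result -= postings
        else:
            raise ValueError(f"unsupported operator: {op}")
    return sorted(result)


print(evaluate("love AND java"))           # ['doc2']
print(evaluate("love NOT elasticsearch"))  # ['doc2']
print(evaluate("java OR sleeping"))        # ['doc2', 'doc3']
```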

The relationship between ElasticSearch and Lucene

First of all, ElasticSearch is built on top of Lucene. Lucene is a lightweight, standalone search library, while ElasticSearch builds on it to provide search that is distributed, scalable, and highly available.

About writing

From now on, I will write an article here every day, with no limit on subject matter, content, or word count. I will try to put my daily thoughts into it.

If this article has helped you, give it a thumbs up, or even better, follow me.

If you would rather not do either, how about writing down what you want to say once you finish reading? Useful feedback and your encouragement are the biggest help to me.

I’m also going to pick my blog back up. You are welcome to drop by and look around.

I’m shane. Today is September 6, 2019, day 44 of my hundred-day writing project (44/100).