1. Introduction and applications of Lucene

Apache Lucene is currently the most popular open-source full-text search toolkit, written in Java.

Several open-source search engines are built on this toolkit, such as Solr and Elasticsearch. Since 2010, the Lucene and Solr projects have been developed by the same Apache Software Foundation team, so their version numbers are usually in sync. The difference is that Lucene is a toolkit, while Solr is an enterprise-class search application built on top of Lucene. In addition, the help-system search in Eclipse, which we use all the time, is also based on Lucene.

2. Lucene’s two tasks

A full-text index is very similar to the Chinese dictionaries we use every day. Take the pinyin lookup method as an example: first we find the page number of the word by its pinyin, then we turn to that page and read the word's detailed explanation.

The example above involves two elements: one is the dictionary, and the other is the process of looking up a word. These correspond to Lucene's two functions: building the dictionary, a process called indexing, and looking up documents in that index by a query term, a process called searching.

2.1 Indexing

1) Document preparation

A document is the original text that we want to make searchable.

2) Tokenizer

The document is first segmented into words; punctuation is removed, and so are stop words such as "is", "the", and so on. Commonly used open-source Chinese tokenizers include MMSEG4J and IKAnalyzer. The resulting units are called tokens.

3) Linguistic Processor

The tokens obtained in the previous step are then linguistically processed: for English, this includes converting uppercase to lowercase, reducing plurals to singular, restoring past-tense words to their base form, and so on. The results are called Terms.

4) Index component

The index component turns the Terms from the previous step into an index and a dictionary and stores them on disk. It first collects the Terms into a dictionary and sorts it; after sorting, identical Terms are merged to form posting lists (the inverted index). For each Term, the index records its Document Frequency (how many documents contain it), the IDs of those documents, and the Term Frequency (how many times the Term appears in each document).
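
To make these four steps concrete, here is a minimal indexing sketch against the classic Lucene API (Lucene 5.x or later; the index path, class name, and field name are illustrative assumptions, not from the original article):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexDemo {
    public static void main(String[] args) throws Exception {
        // The analyzer covers steps 2 and 3: it tokenizes the text,
        // lowercases it, and removes English stop words.
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // Step 1: the document to be made searchable.
            Document doc = new Document();
            doc.add(new TextField("content",
                    "Apache Lucene is a full text search toolkit", Field.Store.YES));
            // Step 4: the index component builds the inverted index.
            writer.addDocument(doc);
            writer.commit(); // flush segment files to disk
        }
    }
}
```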

2.2 Searching

1) Enter the query word

2) Morphological analysis and language processing

The input is split into words, and keywords such as AND and NOT are recognized. The split tokens are processed in the same way as when the dictionary was built. A syntax tree is then generated from the keywords and the processed Terms.

3) Search the index to get documents that conform to the syntax tree

For example, for the query "A AND B NOT C", the posting lists for A, B, and C are retrieved; the intersection of the lists for A and B is taken, and the documents in C's list are subtracted from that result. What remains is the list of documents that satisfies the query.

4) Sort search results according to relevance

The relevance of the results is computed with the vector space model algorithm. A relatively simple description: when building the index, we recorded each Term's Document Frequency and Term Frequency. The higher a Term's frequency within a document, the more relevant that document is; the higher the Term's Document Frequency (the more documents it appears in), the less that Term contributes to relevance. This algorithm can be implemented on your own, as the sketch below shows.
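
For illustration, here is a minimal TF-IDF weight in the spirit of that description. It resembles Lucene's classic similarity, but real Lucene scoring adds normalization and boosts; treat it as a sketch, not Lucene's actual formula:

```java
public class TfIdfSketch {
    /**
     * Classic TF-IDF term weight: grows with the term's frequency in a
     * document and shrinks as more documents contain the term.
     *
     * @param termFreq times the term occurs in this document (Term Frequency)
     * @param docFreq  number of documents containing the term (Document Frequency)
     * @param numDocs  total number of documents in the index
     */
    static double weight(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq); // dampen the raw term frequency
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1)); // rare terms weigh more
        return tf * idf;
    }
}
```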

5) Return the documents in the order given by the ranking above.
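
Putting the five steps together, a minimal search sketch against the index built in section 2.1 might look like this (again Lucene 5.x or later; the path and field name match the indexing sketch above and are assumptions):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Steps 1 and 2: analyze the input and build the syntax tree,
            // recognizing keywords such as AND and NOT.
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());
            Query query = parser.parse("lucene AND search NOT solr");
            // Steps 3 and 4: match documents and rank them by relevance.
            TopDocs top = searcher.search(query, 10);
            for (ScoreDoc sd : top.scoreDocs) {
                Document doc = searcher.doc(sd.doc); // step 5: return the documents
                System.out.println(sd.score + "  " + doc.get("content"));
            }
        }
    }
}
```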

3. Index structure

Lucene's index structure is hierarchical; the layers are described below.

3.1 Index

If you think of a database as an analogy, an index is like a table in a database.

In Lucene, an index is stored in a single directory, so the index can be understood as the entire contents of that directory.

3.2 Segment

If you think of a database as an analogy, a segment is like a partition of a table.

The concept of a Segment sits one level below the index; a single index can contain multiple segments. Segment files are generated on flush or commit. In a typical index directory, you might see files for segments such as _0 and _1, together with segments.gen and segments_N (for example, segments_5), which are metadata files holding information about the segments' properties. The remaining files belong to individual segments and are explained in more detail later.

Index files are written sequentially and can only be appended, never modified in place. When a document is deleted, its docId is written to a .del file, and queries filter out those docIds. Modifying a document is implemented as deleting the old Document and then adding the new one. This design ensures high write throughput.
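
A sketch of that delete-then-add design with the standard IndexWriter API; the "id" field used as a unique key here is an assumption:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteUpdateSketch {
    // Delete: the matching docIds are only marked as deleted (the .del
    // mechanism) and are filtered out of subsequent queries.
    static void delete(IndexWriter writer, String id) throws IOException {
        writer.deleteDocuments(new Term("id", id));
    }

    // Update: implemented by Lucene as a delete of the old Document
    // followed by an append of the new one.
    static void update(IndexWriter writer, String id, Document newDoc) throws IOException {
        writer.updateDocument(new Term("id", id), newDoc);
    }
}
```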

The segmented design keeps queries efficient. If segments are too large, a query consumes a lot of I/O; if they are too small, there are too many segments to query. Lucene therefore merges segments, filtering out deleted documents during the merge. Before version 4.0, the default merge policy was LogMergePolicy, which merges adjacent segments smaller than a specified size; with two adjacent segments of 1 GB and 1 KB, it would rewrite roughly 1 GB of data, wasting a lot of resources. Since 4.0, the default is TieredMergePolicy, which first sorts segments by size, computes each segment's deletion ratio, and preferentially merges smaller segments; large segments are merged only when the system is idle.
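
When the defaults don't fit, the merge policy can be tuned through IndexWriterConfig. A sketch with illustrative values (the numbers are assumptions, not recommendations from the original article):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {
    static IndexWriterConfig configWithTieredMerge() {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Approximate cap on the size of segments produced by normal merging.
        mergePolicy.setMaxMergedSegmentMB(5 * 1024);
        // How many similar-sized segments may accumulate per tier
        // before a merge is triggered.
        mergePolicy.setSegmentsPerTier(10.0);
        return new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(mergePolicy);
    }
}
```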

3.3 Document

If you think of a database as an analogy, a document is like a row of data.

A Document is the basic unit of indexing. A segment can contain multiple Documents.

3.4 Field

If you think of a database as an analogy, a Field is like a column of a table.

A Document can have multiple Fields. Lucene provides several different Field types, such as StringField, TextField, LongField, and NumericDocValuesField.
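
A sketch of a Document carrying several Field types (classes available since Lucene 4.x; the field names and values are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldSketch {
    static Document buildDoc() {
        Document doc = new Document();
        // StringField: indexed as a single token, not analyzed (good for IDs).
        doc.add(new StringField("id", "doc-42", Field.Store.YES));
        // TextField: analyzed into Terms for full-text search.
        doc.add(new TextField("content", "Lucene stores documents as fields",
                Field.Store.NO));
        // NumericDocValuesField: a per-document value for sorting and faceting.
        doc.add(new NumericDocValuesField("timestamp", 1577836800L));
        return doc;
    }
}
```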

3.5 Term

A Term is the smallest unit of an index. Terms are produced from a Field by the Analyzer.

4. Segment file descriptions

Section 3 described the segment design and merge policy in detail. The contents of some of the segment files are explained below.

segments_N records how many segments this index contains and how many documents each segment contains.

*.fnm

Saves how many Fields this segment contains, each Field's name, and how it is indexed.

*.fdx, *.fdt

Saves all the documents contained in this segment: how many Fields each document contains and what information each Field holds.

*.tvx, *.tvd, *.tvf

Saves the term vectors: how many documents this segment contains, how many Fields each document contains, how many Terms each Field contains, and each Term's string, position, and other information.

*.tis, *.tii

Saves the Term Dictionary, that is, all the Terms contained in this segment, sorted lexicographically.

*.frq

Saves the posting lists (inverted lists): for each Term, the list of IDs of the documents containing it.

*.prx

Saves the positions at which each Term in the posting lists occurs within the documents that contain it.

*.del

As mentioned in section 3.2, this file stores the docIds of deleted documents.

Author: Tian Liang

Source: CreditEase Institute of Technology