I. Introduction to Lucene

1.1 What is Lucene?

  • Lucene is a sub-project of the Apache Foundation Jakarta project group;
  • Lucene is an open source full text search engine toolkit, provides a complete query engine and indexing engine, part of the language text analysis engine;
  • Lucene is not a complete full-text search engine, it only provides the architecture of full-text search engine, but it can still be used as a toolkit combined with various plug-ins to provide some high-performance full-text search functions for projects.
  • Now commonly used Elasticsearch, Solr and other full-text search engines are based on Lucene.

1.2 Usage scenarios of Lucene

It is suitable for the scene where the amount of data index is not large. When the amount of data index is too large, it is necessary to use ES, Solr and other full-text search servers to realize the search function.

1.3 What can you learn from this article?

  • How do Lucene generate and write such a complex index, and what role do the files in the index play?
  • How does Lucene Full-Text Index search efficiently?
  • How does Lucene optimize search results so that users find what they want based on their keywords?

This article aims to share the experience of reading the source code and developing features for the Lucene search engine, which is in version 7.3.1.

II. Basic workflow of Lucene

Index generation is divided into two parts:

1. Creation stage:

  • In the addDocument stage, the addDocument method is called through IndexWriter to generate the forward index file;
  • After the document is added, the inverted index file is generated by the FLUSH or MERGE operation.

2. Search phase:

  • The user sends query requests to Lucene through query statements;
  • IndexReader under IndexSearch is used to read the contents of the index library and obtain the document index.
  • After the search results are obtained, the results are sorted based on the search algorithm and returned.

The process of index creation and search is shown in the following figure:

Three, Lucene index composition

3.1 Forward Index

Lucene’s basic hierarchy consists of five parts: index, segment, document, domain and word. The generation of forward index is the process of processing documents level by level and decomposing domain storage words based on Lucene’s basic hierarchy.

The hierarchy of index files is shown in Figure 1:

  • Indexes: The Lucene index library contains all the contents of the search text and can be stored as files or streams in different databases or file directories.
  • Segments: An index contains several segments, independent of each other. Because Lucene needs to load index segments for the next step of keyword retrieval, if there are more index segments, it will increase the I/O overhead and slow down the retrieval speed. Therefore, different segments will be merged through the segment merging strategy when writing.
  • Documentation: Lucene writes documents to segments that contain multiple documents.
  • Domain: A document contains many different fields, and different fields are stored in different fields.
  • Words: Lucene will split the strings in the domain into words through lexical analysis and language processing through the cross lexer, and Lucene will carry out full-text retrieval through these keywords.

3.2 Inverted indexes

The core of Lucene full-text index is a fast indexing mechanism based on inverted indexing.

The principle of inverted indexing is shown in Figure 2. In simple terms, inverted indexing is to record the article in which each word appears after the text content is segmented by the analyzer, so as to query the article containing the word through the search term input by the user.

Problem: When the above inverted index is used, the index words need to be loaded into the memory each time. When the number of articles is large and the length is long, the index words may take up a large amount of storage space, and the memory loss is large after loading into the memory.

Solution: Starting with Lucene4, Lucene has adopted FST to reduce the space consumption caused by index words.

FST(Finite StateTransducers). Its main characteristics lie in the following four points:

  • The time complexity of the search word is O(Len (STR));
  • By storing prefixes and suffixes separately, the space needed to store words is reduced.
  • When loading, only the prefix is put into the memory index, and the suffix is stored in the disk, which reduces the loss of the memory index space.
  • FST structure has high query efficiency when it queries query conditions such as PrefixQuery, FuzzyQuery and RegExQuery.

The specific storage mode is shown in Figure 3:

The files related to the inverted index include.tip,.tim, and.doc, where:

  • Tip: Used to store the prefix of the inverted index Term to quickly locate the position of the Term belonging to this Field in the TIM file, i.e. AAB, ABD, BDC in the figure above.
  • TIM: It saves the corresponding terms corresponding to different prefixes and the corresponding inversion list information. The inversion list can be quickly searched by jumping list, and the set operation such as intersection, union and difference set can be searched by skipping some elements through jumping list, which also improves the performance.
  • Doc: contains the document number and word frequency information. Returns the text information saved in the file according to the contents of the inverted list.

3.3 Index query and document search process

Lucene uses the inverted index to locate the document number that needs to be searched. After searching the document by the document number, Lucene uses the word weight and other information to sort the document and return it.

  • Load the TIP file in memory and match it to the position of the suffix block in the TIM file according to FST;
  • Inquire relevant information of suffix and inversion list according to the position of the suffix word block;
  • Locate the document number and word frequency information from the DOC file according to the inverted list information found in TIM to complete the search;
  • After locating the file, Lucene will go to the.fdx file directory index and.fdt to find the target file based on the forward index.

The file format is shown in Figure 4:

The above article mainly explains the working principle of Lucene, and the following article will explain the relevant code of Lucene in Java to perform indexes, queries and other operations.

Four, Lucene add, delete and change the operation

Lucene project text parsing, storage and other operations are implemented by IndexWriter class, IndexWriter file is mainly composed of Directory and IndexWriterConfig two classes, in which:

Directory: Used to specify the type of Directory in which to store the index files. Since you want to search the text content, you naturally need to write the text content and index information into the directory. Directory is an abstract class that allows many different implementations for the storage of indexes. Common storage methods generally include local storage (FSDirectory), memory (RAMDirectory) and so on.

IndexWriterConfig: It is used to specify relevant configuration of IndexWriter when writing file contents, including OpenMode index construction mode and Similarity correlation algorithm, etc.

How does IndexWriter work with indexes? Let’s take a quick look at the source code for the IndexWriter operation.

4.1. Additions to documents

A. Lucene creates a ThreadState object for each document that holds a DocumentWriterPerThread for adding, deleting, and deleting files;

ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter) { ThreadState threadState = null; Synchronized (this) {if (Freelist.isEmpty ()) {return newThreadState() if no free ThreadState exists; ThreadState = Freelist.remove (Freelist.size ()-1);} else {// Freelist.remove ()-1); DocumentWriterPerThread (); // Preplace the documentWriterPerThread () with the current // threadState (); If (threadState. DWPT == null) {for(int I =0; i<freeList.size(); i++) { ThreadState ts = freeList.get(i); if (ts.dwpt ! = null) { freeList.set(i, threadState); threadState = ts; break; } } } } } threadState.lock(); return threadState; }

B. Insertion of index files: DocumentWriterPerThread calls the ProcessField under DefaultIndexChain to process each field in the document. The ProcessField method is the core execution logic of the index chain. The different FieldTypes set by each field are indexed, segmented and stored by the user. What is important in FieldType is indexOptions:

  • None: Domain information is not written to the inversion list, and the indexing phase cannot be searched by the domain name;
  • Docs: the document is written in the inversion list, but since the word frequency information is not recorded, the occurrence of multiple times is only processed once;
  • DOCS\_AND\_FREQS: document and word frequency write inversion list;
  • DOCS\_AND\_FREQS\_AND\_POSITIONS: document, word frequency and position write inverted list;
  • DOCS\_AND\_FREQS\_AND\_POSITIONS\_AND\_OFFSETS: Documents, word frequency, positions, and offsets are written to an inversion list.
If (fieldType.indexOptions()! = IndexOptions.NONE) { fp = getOrAddField(fieldName, fieldType, true); boolean first = fp.fieldGen ! = fieldGen; Fp.invert (field, first); fp.invert(field, first); if (first) { fields[fieldCount++] = fp; fp.fieldGen = fieldGen; } } else { verifyUnIndexedFieldType(fieldName, fieldType); } // Stored Field if (fieldType. Stored ()) {if (fp == null) {fp = GetOrdAddField (fieldName, fieldType, false); } if (fieldType.stored()) { String value = field.stringValue(); if (value ! = null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) { throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store"); } try { storedFieldsConsumer.writeField(fp.fieldInfo, field); } catch (Throwable th) { throw AbortingException.wrap(th); }}} // create docValue (select the words under the document from the document) docValueType dvType = FieldType. if (dvType == null) { throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")"); } if (dvType ! = DocValuesType.NONE) { if (fp == null) { fp = getOrAddField(fieldName, fieldType, false); } indexDocValue(fp, dvType, field); } if (fieldType.pointDimensionCount() ! = 0) { if (fp == null) { fp = getOrAddField(fieldName, fieldType, false); } indexPoint(fp, field); }

TokenStream has two important derived classes, the TokenStream, and the TokenFilter. The TokenStream class uses the java.io.Reader class to read the characters, and the TokenStream class uses the TokenStream class to read the characters. Token streams are generated and processed through any number of TokenFilters. The source code is as follows:

/ / invert: Word segmentation processing was carried out on the Field, you first need to convert the Field into TokenStream try (TokenStream stream = TokenStream = Field. The TokenStream (docState analyzer, TogenStream)) // TogenStream returns the corresponding tokenStream if (TogenStream! = null) { return tokenStream; } else if (readerValue() ! = null) { return analyzer.tokenStream(name(), readerValue()); } else if (stringValue() ! = null) { return analyzer.tokenStream(name(), stringValue()); } public final tokenStream tokenStream(final String fieldName, final Reader Reader) {// Reuse if a Component already exists in TokenStreamComponents. TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName); final Reader r = initReader(fieldName, reader); // If the Component does not exist, a corresponding Component is created based on the word splitter. if (components == null) { components = createComponents(fieldName); reuseStrategy.setReusableComponents(this, fieldName, components); } // Pass the java.io.Reader input to the Component. components.setReader(r); return components.getTokenStream(); }

D. According to the word splitter configured in IndexWriterConfig, the corresponding word splitter is returned through the policy mode. There are many different implementations of word splitter for different languages and different word segmentation requirements.

  • StopAnalyzer: A stop-word splitter that filters out specific strings or words in a vocabulary.
  • StandardAnalyzer: a standard word segmentation analyzer, which can segment words according to numbers, letters, etc., supports word table filtering instead of StopAnalyzer function, and supports simple Chinese word segmentation.
  • CJKAnalyzer: It provides good support for Chinese word segmentation according to Chinese language habits.

Take StandardAnalyzer as an example:

// The standard word splitter creates the Component. Protected tokenStreamComponents createComponents(final String fieldName) {final StandardTokenizer src = new StandardTokenizer(); src.setMaxTokenLength(maxTokenLength); TokenStream tok = new StandardFilter(src); tok = new LowerCaseFilter(tok); tok = new StopFilter(tok, stopwords); return new TokenStreamComponents(src, tok) { @Override protected void setReader(final Reader reader) { src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength); super.setReader(reader); }}; }

E. After obtaining TokenStream, the attribute is analyzed and obtained through the incrementToken method in TokenStream, and then the inversion list is constructed through the Add method under TermShashperField. Finally, the related data of the Field is stored in a FreqProxPostingsArray of type FreqProxPostingsArray and a TermVectorSpostingsArray of type TermVectorSpostingsArray. Constitute an inversion list;

// Take LowerCasefilter as an example, Convert characters in Token to lowercase public final Boolean incrementToken() throws IOException {if (input.incrementToken()) { CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length()); return true; } else return false; }
try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) { // reset TokenStream stream.reset(); invertState.setAttributeSource(stream); termsHashPerField.start(field, first); While (stream.incrementToken()) {... Try {// build the inversion list termShashperField.add (); } the catch (MaxBytesLengthExceededException e) {... } catch (Throwable th) { throw AbortingException.wrap(th); }}... }

4.2 Deletion of documents

A. To delete documents under Lucene, first of all, the Term or Query to be deleted is added to the delete queue;

synchronized long deleteTerms(final Term... terms) throws IOException { // TODO why is this synchronized? final DocumentsWriterDeleteQueue deleteQueue = this.deleteQueue; Long seqNo = deleteQueue.addDelete(terms); long seqNo = deleteQueue.addDelete(terms); flushControl.doOnDelete(); lastSeqNo = Math.max(lastSeqNo, seqNo); if (applyAllDeletes(deleteQueue)) { seqNo = -seqNo; } return seqNo; }

B. Trigger the delete operation according to the FLUSH policy;

Private Boolean applyAllDeletes (DocumentsWriterDeleteQueue deleteQueue) throws IOException {/ / determine whether delete conditions - > onDelete if (flushControl.getAndResetApplyAllDeletes()) { if (deleteQueue ! = null) { ticketQueue.addDeletes(deleteQueue); } / / delete operation on specified event putEvent (ApplyDeletesEvent. INSTANCE); // apply deletes event forces a purge return true; } return false; }
public void onDelete(DocumentsWriterFlushControl control, ThreadState state) {/ / whether and set if delete condition ((flushOnRAM () && control. GetDeleteBytesUsed () > 1024*1024*indexWriterConfig.getRAMBufferSizeMB())) { control.setApplyAllDeletes(); if (infoStream.isEnabled("FP")) { infoStream.message("FP", "force apply deletes bytesUsed=" + control.getDeleteBytesUsed() + " vs ramBufferMB=" + indexWriterConfig.getRAMBufferSizeMB()); }}}

4.3 Update of documents

Document updating is a process of deleting and then inserting, so I won’t go into that in this article.

4.4 the index Flush

When a certain number of documents have been written, a thread triggers an IndexWriter Flush operation to generate segments and write the Document information from memory to the hard disk. There is currently only one policy for Flush operations: flushbyramorcountPolicy. FlushbyRamorCountPolicy automatically executes Flush operations based on two policies:

  • MaxBufferedDocs: The Flush operation is triggered when a certain number of documents are collected.
  • RamBufferSizeMb: Flush when the document content reaches a limit.

Where activeBytes() is the amount of memory for the index collected by DWPT, and deleteByteUsed is the number of indexes deleted.

@Override public void onInsert(DocumentsWriterFlushControl control, ThreadState state) {/ / according to the number of documents to Flush the if (flushOnDocCount () && state. DWPT. GetNumDocsInRAM () > = indexWriterConfig .getMaxBufferedDocs()) { // Flush this state by num docs control.setFlushPending(state); } else if (flushonRAM ()) {// Flush by RAM final long limit = (long) (indexWriterConfig.getRAMBufferSizeMB() * 1024.d * 1024.d); final long totalRam = control.activeBytes() + control.getDeleteBytesUsed(); if (totalRam >= limit) { if (infoStream.isEnabled("FP")) { infoStream.message("FP", "trigger flush: activeBytes=" + control.activeBytes() + " deleteBytes=" + control.getDeleteBytesUsed() + " vs limit=" + limit); } markLargestWriterPending(control, state, totalRam); }}}

Writes memory information to the index library.

There are two types of index Flush: active Flush and automatic Flush. The active Flush operation is called automatic Flush. The active Flush operation is different from automatic Flush. You can jump to the link if you need to know.

Index segment Merge

When indexing Flush, each DWPT segment will generate a separate segment. Full-text retrieval when there are too many segments may span multiple segments, resulting in multiple load conditions, and therefore merging too many segments.

The execution of segment merges is managed through Mergescheduler. Mergescheduler also includes a variety of management policies, including Nomergescheduler, SerialMergescheduler, and ConcurrentMergescheduler.

1) The merge operation first needs to use the updatePendingMerges method to query the segments to be merged according to the merge policy of the segments. There are many types of segment merge strategies. This article covers only two of the segment merge strategies that Lucene uses by default: tieredMergePolicy and LogmergePolicy.

  • TieredMergePolicy: Sort the segments provided by IndexWriter using the OneMerge scoring mechanism, then select some (possibly discontinuous) segments from the ordered Segment set to generate a non-adjacent Segment to be merged.
  • LogmergePolicy: A fixed length merge method that divides consecutive segments into different levels using maxLevel, LEVEL\_LOG\_SPAN, and levelBottom parameters, and then merges segments from each LEVEL using mergeFactor.
private synchronized boolean updatePendingMerges(MergePolicy mergePolicy, MergeTrigger trigger, int maxNumSegments) throws IOException { final MergePolicy.MergeSpecification spec; // query the segment to be merged if (maxNumSegments! = UNBOUNDED_MAX_MERGE_SEGMENTS) { assert trigger == MergeTrigger.EXPLICIT || trigger == MergeTrigger.MERGE_FINISHED : "Expected EXPLICT or MERGE_FINISHED as trigger even with maxNumSegments set but was: " + trigger.name(); spec = mergePolicy.findForcedMerges(segmentInfos, maxNumSegments, Collections.unmodifiableMap(segmentsToMerge), this); newMergesFound = spec ! = null; if (newMergesFound) { final int numMerges = spec.merges.size(); for(int i=0; i<numMerges; i++) { final MergePolicy.OneMerge merge = spec.merges.get(i); merge.maxNumSegments = maxNumSegments; } } } else { spec = mergePolicy.findMerges(trigger, segmentInfos, this); } // Register all segments to be merged newMergesFound = spec! = null; if (newMergesFound) { final int numMerges = spec.merges.size(); for(int i=0; i<numMerges; i++) { registerMerge(spec.merges.get(i)); } } return newMergesFound; }

2) Create MergeThread by Merge method in ConcurrentMergescheduler class and launch it.

@Override public synchronized void merge(IndexWriter writer, MergeTrigger trigger, Boolean newmergesFound) throws IOException {... While (true) {... OnMerge = Writer.getNextMerge (); onMerge = Writer.getNextMerge (); boolean success = false; Final MergeThread newMergeThread = getMergeThread(Writer, Merge); mergeThreads.add(newMergeThread); updateIOThrottle(newMergeThread.merge, newMergeThread.rateLimiter); if (verbose()) { message(" launch new thread [" + newMergeThread.getName() + "]"); } // Enable thread newMerGeThread.start (); updateMergeThreads(); success = true; } finally { if (! success) { writer.mergeFinish(merge); }}}}

3) Perform the merge operation through the doMerge method;

Public void merge(mergePolicy.oneMerge) throws IOException {...... Try {// MergeInit (merge); // MergeInit (merge); // Execute the Merge operation Mergemiddle (Merge, MergePolicy) between segments; mergeSuccess(merge); success = true; } catch (Throwable t) { handleMergeException(t, merge); } finally {// MergeFinish (merge)}... }

Five, Lucene search function realization

5.1 Load the index library

In order to perform a search, Lucene first needs to load the index segment into memory. Since the operation of loading the index library is very time consuming, the index library needs to be reloaded only when the index library has changed.

Loading index library is divided into two parts: loading segment information and loading document information:

1) Loading segment information:

  • Get the largest generation in the segment through the mark.gen file, and get the overall information of the segment;
  • Read the.si file, construct the SegmentInfo object, and finally aggregate the SegmentInfo object.

2) Load document information:

  • Read the segment information and get the corresponding FieldInfo from the.fnm file to construct FieldinFos.
  • Open inverted list related files and dictionary files;
  • To read statistical information and relevant norms in the index;
  • Read the document file.

5.2 packaging

IndexSearch needs to be encapsulated into IndexSearch after the index library is loaded. IndexSearch will return the results required by the user through the user-constructed Query statement and the designated Similarity algorithm of text (the default BM25). Search function is realized by IndexSearch.search method.

Search: Query contains a variety of implementations, including BooleanQuery, PhraseQuery, TermQuery, PrefixQuery and other Query methods. Users can construct Query statements according to project requirements

Sorting: Besides calculating the sorting of document relevance scores through Similarity, IndexSearch also provides users with BoostQuery to specify keyword scores and customize the sorting. The Similarity correlation algorithm also includes many kinds of calculation and realization of correlation score, which will not be repeated here. Readers can refer to it online if they need.

Six, summarized

As a full-text indexing toolkit, Lucene provides powerful full-text retrieval support for small and medium-sized projects, but there are many problems in the process of using Lucene:

  • Since Lucene needs to retrieve the index library by reading the index information through IndexReader and loading it into the memory to realize its retrieval capability, excessive index volume will consume too much memory of the service deployment machine.
  • The search implementation is relatively complex. It needs to set the index, word segmentation, storage and other information of each Field one by one, and the use is complicated.
  • Lucene does not support clustering.

There are many limitations when Lucene is used, and it is not so convenient to use. When the amount of data increases, distributed search server such as Elasticsearch should be chosen as the implementation scheme of search function.

Authors: Vivo Internet Server Team -Qian Yulun