This is the 18th day of my participation in the Gwen Challenge.

1 Introduction to Lucene

1.1 What is Lucene

Lucene is a full-text search framework, not a finished application. It is not an end-user product like Google Desktop; it is a toolkit you can use to build such products.

1.2 What can Lucene do

To answer this question, it helps to understand what Lucene really is. Lucene's function is actually very simple: you hand it some strings, and it provides a full-text search service that tells you where the words you search for appear. Knowing this, you can apply your imagination to anything that fits the pattern. You can index all the news on your site to build a searchable archive; you can index a few fields of a database table and stop worrying about "%like%" queries locking the table; you could even write your own search engine.

1.3 Should you choose Lucene

Here is some test data; if you find it acceptable, Lucene may be the right choice. Test 1: 2.5 million records, about 300 MB of text, producing an index of about 380 MB, with an average processing time of 300 ms under 800 threads. Test 2: 37,000 records, indexing two VARCHAR fields of a database table, index file 2.6 MB, average processing time 1.5 ms under 800 threads.

2. How Lucene works

The service Lucene provides actually consists of two parts: one in and one out. "In" is writing: you provide a source (essentially a string) to be added to or removed from the index. "Out" is reading: users are given a full-text search service that lets them locate sources by keyword.

2.1 Write Process

The source string is first processed by the Analyzer, which segments it into words and (optionally) removes stop words. The information from the source is then added to the various Fields of a Document; fields that need to be searchable are indexed, and fields that need to be retrievable are stored. Finally, the index is written to storage, which can be memory or disk.

2.2 Readout process

The user's search terms are processed by the Analyzer. The processed keywords are then looked up in the index to find the matching Documents. Finally, the user extracts the required fields from the Documents found, as needed.

3 Some concepts to know

Lucene uses a number of concepts, and knowing what they mean will be helpful.

3.1 Analyzer

An Analyzer breaks a string into individual words according to certain rules and removes invalid words. Invalid words here means words such as "of" and "the" in English, or "的" and "地" in Chinese: they appear throughout a text but carry no key information, and removing them shrinks the index files, improves efficiency, and raises the hit ratio. Segmentation rules vary, but they all serve one purpose: dividing text along semantic boundaries. This is easy in English, since English text is written as words separated by spaces. Chinese, on the other hand, must somehow be divided from a sentence into words. The specific segmentation methods are described in detail below; here it is enough to understand the concept of the analyzer.
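As a rough illustration of what an analyzer does, here is a minimal sketch in plain Java (not Lucene's Analyzer API; the stop word list is purely illustrative) that splits on whitespace, lower-cases, and drops English stop words:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleAnalyzer {
    // A tiny illustrative stop word list; a real analyzer ships a much larger one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "the", "a", "an"));

    // Segment on whitespace, normalize case, and remove stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+"))
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) terms.add(w);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The history of Lucene")); // [history, lucene]
    }
}
```

Only the surviving terms would be indexed, which is why stop word removal shrinks the index.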

3.2 Document

User-supplied sources are records, which can be text files, strings, rows of a database table, and so on. Once a record is indexed, it is stored in the index file as a Document. Search results are likewise returned as a list of Documents.

3.3 Field

A Document can contain multiple fields of information; an article, for example, can contain "title", "body", and "last modified time", each held in the Document as a Field. A Field has two attributes: store and index. The store attribute controls whether the Field's content is stored in the index; the index attribute controls whether it is indexed for search. This may sound like a triviality, but choosing the right combination of the two attributes matters, as the article example shows. We want full-text search over the title and body, so both get their index attribute set to true. We also want to show the title directly in search results, so the title field's store attribute is set to true. The body field is too large, however, so to keep the index file small we set its store attribute to false and read the original file directly when the content is needed. We only want to display the last modified time, not search on it, so that field gets store set to true and index set to false. These three fields cover three of the four combinations of the two attributes; the all-false combination is unused, and in fact Field does not allow it, since a field that is neither stored nor indexed is meaningless.

3.4 Term

Term is the smallest unit of search and represents a word in a document. Term consists of two parts: the word it represents and the field in which the word appears.

3.5 Token

A Token is an occurrence of a Term and contains the term text together with its start and end offsets, as well as a type string. The same word can appear multiple times in a sentence; all occurrences are denoted by the same Term but by different Tokens, each Token marking the position where the word appears.
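To illustrate the idea outside Lucene's API, here is a toy tokenizer that records, for every occurrence of a word, the term text plus its start and end character offsets:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // A token: one occurrence of a term, with [start, end) character offsets.
    static class Token {
        final String text;
        final int start;
        final int end;
        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
        public String toString() { return text + "[" + start + "," + end + ")"; }
    }

    // Split on whitespace, keeping the offsets of every occurrence.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "to" appears twice: same term, two distinct tokens with different offsets.
        for (Token t : tokenize("to be or not to be")) System.out.println(t);
    }
}
```

Running this on "to be or not to be" produces two tokens for the term "to", at offsets [0,2) and [13,15), which is exactly the term/token distinction described above.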

3.6 Segment

When documents are added to the index, they are not all immediately appended to the same index file. They are first written to separate small files, which are later merged into one large index file; each such small file is a segment.

4. Structure of Lucene

Lucene consists of core and sandbox. Core is the stable heart of Lucene, while the sandbox provides additional features such as Highlighter and various analyzers. Lucene core has seven packages: analysis, document, index, queryParser, search, store, and util.

4.1 analysis

Analysis contains the built-in analyzers, such as WhitespaceAnalyzer, which splits on whitespace characters, StopAnalyzer, which adds stop-word filtering, and the most commonly used StandardAnalyzer.

4.2 document

Document contains the data structures of a document: the Document class defines the structure that stores a document, and the Field class defines one field of a Document.

4.3 index

Index contains an IndexWriter class that writes, merges, and optimizes the segments of an index file, and an IndexReader class that reads entries from, and deletes entries in, the index. IndexWriter is concerned only with how indexes are written into segments and merged for optimization; IndexReader focuses on how documents are organized inside an index file.

4.4 queryParser

QueryParser contains classes for parsing query statements. Lucene’s query statements are somewhat similar to SQL statements, with various reserved words, and can be composed of various queries according to certain syntax. Lucene has many Query classes, all of which inherit from Query and execute special queries. QueryParser parses queries and calls various Query classes in sequence to find results.

4.5 search

Search contains classes that Search results from indexes, such as the Query classes described earlier, including TermQuery, BooleanQuery, and so on.

4.6 store

Store contains the storage classes for indexes. Directory defines the storage structure of index files; FSDirectory is an index stored in files, RAMDirectory an index held in memory, and MMapDirectory an index accessed through memory mapping.

4.7 util

Util contains some common utility classes, such as a utility for converting between time and strings.

5 How to Create an Index

5.1 The simplest code snippet to complete indexing

IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

Let's examine this code. First we create a writer, specifying "/data/index" as the directory to store the index and StandardAnalyzer as the analyzer; the third parameter means that if index files already exist in the directory, they will be overwritten. Then we create a document and add to it a field named "title" with content "lucene introduction", both stored and indexed, and a field named "content" with content "lucene works well", also stored and indexed. We then add this document to the index; for multiple documents, repeat the steps of creating a document and adding it. After adding all documents, we optimize the index, which mainly merges multiple segments into one and helps search speed. Finally, it is important to close the writer.

Yes, creating an index is as simple as that! Of course you may modify the above code to get a more personalized service.

5.2 Writing Indexes Directly to Memory

You need to create a RAMDirectory and pass it to Writer as follows:

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close();

5.3 Indexing text Files

If you want to index plain text files instead of reading them into strings to create fields yourself, you can create fields with the following code:

Field field = new Field("content", new FileReader(file));

Here file is the text file. This constructor actually reads the contents of the file and indexes it, but does not store it.

6 How do I Maintain indexes

Index maintenance is provided by the IndexReader class.

6.1 How Can I Delete an Index

Lucene provides two ways to remove a document from the index. One is

void deleteDocument(int docNum)

This method deletes by document number. Every document added to the index receives a unique number, so deletion by number is precise; but the numbering is internal to the index, and we generally do not know what number a given document has, so this method is of limited use. The other is:

void deleteDocuments(Term term)

This method essentially performs a search for the given term and then removes all matching documents in bulk. By supplying a sufficiently strict query condition, we can delete exactly the specified documents. Here is an example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader =;
Term term = new Term(field, key);
reader.deleteDocuments(term);
reader.close();

6.2 How Do I Update indexes

Lucene does not provide a dedicated index update method; we need to delete the corresponding document first and then add the new document to the index. For example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader =;
Term term = new Term("title", "lucene introduction");
reader.deleteDocuments(term);
reader.close();

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene is funny", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close();

7 How Do I Search?

Lucene's search is quite powerful. There are many helper Query classes, each derived from Query and performing a particular kind of query, and you can combine them like building blocks to perform complex operations. Lucene also provides a Sort class for sorting results and a Filter class for restricting query conditions. You might instinctively compare it to SQL: "Can Lucene do AND, OR, ORDER BY, WHERE, LIKE '%xx%'?" The answer is: "Absolutely!"

7.1 Various Queries

Let’s take a look at what lucene allows us to do:

7.1.1 TermQuery

To start with the most basic query: if you want to execute a query such as "documents whose content field contains 'lucene'", you can use TermQuery:

Term t = new Term("content", "lucene");
Query query = new TermQuery(t);

7.1.2 BooleanQuery

If you want to query “document containing Java or Perl in the content field”, you can create two TermQueries and concatenate them using BooleanQuery:

TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);

7.1.3 WildcardQuery

If you want to run a wildcard query on a word, you can use WildcardQuery. The wildcard '?' matches one arbitrary character, and '*' matches zero or more arbitrary characters. For example, searching for 'use*' might find 'useful' or 'useless':

Query query = new WildcardQuery(new Term("content", "use*"));

7.1.4 PhraseQuery

You may be interested in Sino-Japanese relations and want to find articles in which the terms "中" (China) and "日" (Japan) appear close to each other, within a distance of 5 characters:

PhraseQuery query = new PhraseQuery();
query.setSlop(5);
query.add(new Term("content", "中"));
query.add(new Term("content", "日"));

It might match "Sino-Japanese cooperation…" and "China and Japan…", but it would not find a sentence such as "a senior Chinese official said that Japan…", where the two terms are more than five characters apart.

7.1.5 PrefixQuery

If you want to search for words that start with '中', you can use PrefixQuery:

PrefixQuery query = new PrefixQuery(new Term("content", "中"));

7.1.6 FuzzyQuery

FuzzyQuery is used to search for similar terms, using the Levenshtein algorithm. If you wanted to search for words similar to 'wuzza', you could write:

Query query = new FuzzyQuery(new Term("content", "wuzza"));

You might get ‘fuzzy’ and ‘wuzzy’.

7.1.7 RangeQuery

Another commonly used Query is RangeQuery. You might want to search for documents between 20060101 and 20060130. You can use RangeQuery:

RangeQuery query = new RangeQuery(new Term("time", "20060101"), new Term("time", "20060130"), true);

The final true means to use the closed interval.

7.2 QueryParser

Having seen so many Query classes, you might worry: "Do I have to combine them all by hand? That's too much trouble!" Of course not. Lucene provides a query language somewhat similar to SQL statements; let's call them Lucene statements. One statement can combine several queries, and Lucene automatically parses it into small pieces executed by the corresponding Query classes. Here is each Query in statement form:

TermQuery is written "field:key", for example "content:lucene".

BooleanQuery expresses 'and' with '+' and 'or' with a space, for example "content:java content:perl".

WildcardQuery still uses '?' and '*', for example "content:use*".

PhraseQuery uses '~', for example "content:\"中日\"~5".

PrefixQuery uses '*', for example "中*".

FuzzyQuery uses '~', for example "content:wuzza~".

RangeQuery uses '[]' or '{}', where '[]' denotes a closed interval and '{}' an open one, for example "time:[20060101 TO 20060130]". Note that TO is case-sensitive.

You can combine query strings to perform complex operations. For example, "articles with lucene in the title or body, dated between 20060101 and 20060130" is expressed as: "+(title:lucene content:lucene) +time:[20060101 TO 20060130]"

The code is as follows:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("+(title:lucene content:lucene) +time:[20060101 TO 20060130]");
Hits hits =;
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // display doc
}

First we create an IndexSearcher on the specified index directory. Then we create a QueryParser that uses StandardAnalyzer as its analyzer and searches the content field by default. We then parse the query string with the QueryParser to produce a Query, and use that Query to search; the result comes back as a Hits object, a list of matches that we can display one by one.

7.3 Filter

Filter restricts a query to a subset of the index. It resembles WHERE in a SQL statement, but works differently: it is not part of the regular query. Instead it pre-processes the data source and hands the result to the query. Note that it pre-processes rather than filtering the query results, so using a filter can be costly; it can increase the time of a query a hundredfold. The most common filters are RangeFilter and QueryFilter. RangeFilter restricts the search to a specified range of the index; QueryFilter searches within the results of a previous query. Using a filter is very simple: create a Filter instance and pass it to the searcher. Continuing the example above, to query "articles between 20060101 and 20060130" you can write the limit into a RangeFilter instead of the query string:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
Hits hits =, filter);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // display doc
}

7.4 Sort

Sometimes you want an ordered result set, like SQL's "order by", and Lucene can do that through Sort. Sort sort = new Sort("time") corresponds to "order by time"; Sort sort = new Sort("time", true) corresponds to "order by time desc". The complete example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
Sort sort = new Sort("time");
Hits hits =, filter, sort);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // display doc
}

8 analyzer

In the concept introduction we already saw the analyzer's job: break sentences into semantically meaningful words. For English segmentation there is a mature analyzer, StandardAnalyzer, and in many cases it is a good choice. StandardAnalyzer can even segment Chinese. But our focus here is Chinese segmentation: does StandardAnalyzer handle it well? Practice shows that it works, but the results are poor; a search for "如果" ("if") will also return "牛奶不如果汁好喝" ("milk is not as tasty as juice"), and the index file is very large. So what else is at hand? There is nothing in core, but we can find two analyzers in the sandbox: ChineseAnalyzer and CJKAnalyzer. Both also suffer from incorrect segmentation. In comparison, StandardAnalyzer and ChineseAnalyzer take about the same time to build an index, with index files of about the same size; CJKAnalyzer performs worse, with a large index file and long indexing time.

To understand the problem, look at how the three analyzers segment text. StandardAnalyzer and ChineseAnalyzer cut sentences into single characters, so "牛奶不如果汁好喝" becomes "牛 奶 不 如 果 汁 好 喝", while CJKAnalyzer cuts it into overlapping two-character pairs: "牛奶 奶不 不如 如果 果汁 汁好 好喝". That explains why the search for "如果" matches the sentence. This kind of segmentation has at least two drawbacks: inaccurate matching and large index files. Our goal is to segment the sentence as "牛奶 不如 果汁 好喝". The key is semantic recognition: how do we recognize that "牛奶" is a word and "奶不" is not? The natural answer is dictionary-based segmentation: obtain a dictionary listing most words, cut the sentence in some fashion, and when the pieces match entries in the dictionary, consider the segmentation correct.

In this way, segmentation becomes a matching process, and the simplest matching strategies are forward maximum matching and reverse maximum matching: to put it bluntly, one matches from the beginning of the sentence backward, the other from the end of the sentence forward. For dictionary-based segmentation the dictionary is crucial; its coverage directly affects search results. Given the same dictionary, reverse maximum matching is said to outperform forward maximum matching. There are of course other segmentation approaches, a research subject in itself, which I do not go into here.

Back in practice, the goal is to find a mature, off-the-shelf segmentation tool and avoid reinventing the wheel. Searching online turns up ICTCLAS from the Chinese Academy of Sciences, and JE-Analysis, which is not open source but free. The problem with ICTCLAS is that it is a native dynamic library; calling it from Java requires native method calls, which is inconvenient and potentially risky, and its reputation is indeed not great. JE-Analysis works reasonably well; it can of course still segment incorrectly at times, but it is more convenient and reassuring to use.
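As a sketch of the forward maximum matching strategy just described, here is a minimal dictionary-based segmenter in plain Java; the tiny dictionary is purely illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Greedily take the longest dictionary word starting at each position;
    // fall back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxWordLen, text.length() - i);
            while (len > 1 && !dict.contains(text.substring(i, i + len))) len--;
            words.add(text.substring(i, i + len));
            i += len;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("牛奶", "不如", "果汁", "好喝"));
        // The sentence from the text segments into proper words, not characters.
        System.out.println(segment("牛奶不如果汁好喝", dict, 4)); // [牛奶, 不如, 果汁, 好喝]
    }
}
```

Note how "不如" is matched before "如果" can start, so this sentence avoids the "如果" mismatch; reverse maximum matching differs only in scanning from the end of the sentence.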

9 Performance Optimization

Up to this point, we have been talking about how to get Lucene running and doing its assigned tasks, and you can already do most of what was described. However, tests show that Lucene's out-of-the-box performance is not great; with large data volumes and high concurrency it can even take half a minute to return, and building the initial index for a large data set is also time-consuming. So how do you improve Lucene's performance? The following covers optimizing index creation and optimizing search.

9.1 Optimizing index Creation Performance

There are a few ways to optimize this. IndexWriter provides parameters that control index creation. We can also write the index to a RAMDirectory first and flush it to an FSDirectory in batches, since the biggest bottleneck in index creation is disk IO. Finally, choosing a good analyzer can also improve performance.

9.1.1 Optimizing index creation through IndexWriter parameters

setMaxBufferedDocs(int maxBufferedDocs)

Controls the number of documents buffered in memory before a segment is written.

setMaxMergeDocs(int maxMergeDocs)

Controls the maximum number of documents a segment can hold. A smaller value makes appending to the index faster. The default is Integer.MAX_VALUE.

setMergeFactor(int mergeFactor)

Controls how often multiple segments are merged. A larger value makes index creation faster. The default is 10; setting it to 100 can speed up index creation.

9.1.2 RAMDirectory Caching improves Performance

We can first write indexes into RAMDirectory, and then batch write indexes into FSDirectory when the number reaches a certain level to reduce disk I/O times.

FSDirectory fsDir = FSDirectory.getDirectory("/data/index", true);
RAMDirectory ramDir = new RAMDirectory();
IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), true);
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
while (there are documents to index) {
    ... create Document doc ...
    ramWriter.addDocument(doc);
    if (condition for flushing memory to disk has been met) {
        ramWriter.close();
        fsWriter.addIndexes(new Directory[] { ramDir });
        ramDir = new RAMDirectory();
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    }
}
ramWriter.close();
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();

9.1.3 Selecting a good analyzer

This optimization mainly saves disk space: it can cut the index file size nearly in half, from 600 MB to 380 MB for the same test data. It does not help with time, and may even cost more, because a better analyzer must match against a dictionary and consumes more CPU. Indexing the test data took 133 minutes with StandardAnalyzer and 150 minutes with MMAnalyzer.

9.2 Optimizing search Performance

Indexing is a time-consuming operation, but it is only necessary for initial creation, usually a small amount of maintenance, and can be put into a background process without affecting user search. We create indexes for users to search, so search performance is our primary concern. Here’s how to improve search performance.

9.2.1 Placing Indexes into Memory

This is the most intuitive idea, because memory is much faster than disk. Lucene provides RAMDirectory to hold indexes in memory:

Directory fsDir = FSDirectory.getDirectory("/data/index/", false);
Directory ramDir = new RAMDirectory(fsDir);
Searcher searcher = new IndexSearcher(ramDir);

However, practice shows that RAMDirectory and FSDirectory run at about the same speed: both are very fast for small data volumes, and with larger volumes (a 400 MB index file) RAMDirectory is even slightly slower, which is genuinely surprising. In addition, Lucene's search is very memory-hungry; even with the 400 MB index loaded into memory, it ran out of memory after a while. So personally I think loading the index into memory buys little.

9.2.2 Optimizing the time range limit

Since loading memory does not improve efficiency, there must be other bottlenecks. After testing, the biggest bottleneck turns out to be the time range limit. So how can we minimize the cost of the time range limit? To search for results within a specified time range, you can:

  1. Set the range with RangeQuery. However, RangeQuery is implemented by expanding the time points within the range and adding them one by one as BooleanClauses to a BooleanQuery, so the range cannot be too large. A range of more than about a month throws BooleanQuery.TooManyClauses. You can raise the limit with BooleanQuery.setMaxClauseCount(int maxClauseCount), but only so far, and memory use grows as maxClauseCount grows.
  2. Use RangeFilter instead of RangeQuery. Tests show it is no slower than RangeQuery, but it is still a performance bottleneck: more than 90% of the query time is spent in RangeFilter. Studying its source code reveals that RangeFilter first traverses the entire index, generating a BitSet that marks each document true if it lies in the time range and false if not, and then passes that result to the Searcher; the traversal is what costs the time.
  3. To further improve performance, there are two ideas:

A. Cache filter results. RangeFilter runs before the search proper, and its input is always the IndexReader, which is determined by the Directory. So the result of a RangeFilter is determined by the range's upper and lower bounds, i.e. by the specific RangeFilter object, and we can simply cache the filter's BitSet result keyed by the RangeFilter object. Lucene's API already provides a CachingWrapperFilter class that wraps a Filter together with its results, but do not be misled by its name and description: CachingWrapperFilter appears to have caching capability, but its cache serves one and the same filter. If you use the same filter to filter different IndexReaders, it caches the result per IndexReader; our requirement is the opposite, filtering the same IndexReader with different filters, so we can use it only as a wrapper class.
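To make idea A concrete, here is a self-contained sketch in plain Java (not Lucene's API): the "index" is just an array of per-document timestamps, and the filter's BitSet result is cached under a key built from the range bounds, so a repeated range skips the full scan:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class RangeFilterCache {
    private final long[] docTimes;                 // time field of each document
    private final Map<String, BitSet> cache = new HashMap<>();

    RangeFilterCache(long[] docTimes) { this.docTimes = docTimes; }

    // The BitSet marks document numbers whose time lies in [lower, upper].
    BitSet bits(long lower, long upper) {
        String key = lower + "-" + upper;          // the range bounds determine the result
        BitSet cached = cache.get(key);
        if (cached != null) return cached;         // cache hit: no index scan
        BitSet bits = new BitSet(docTimes.length);
        for (int doc = 0; doc < docTimes.length; doc++)   // the costly full traversal
            if (docTimes[doc] >= lower && docTimes[doc] <= upper) bits.set(doc);
        cache.put(key, bits);
        return bits;
    }

    public static void main(String[] args) {
        RangeFilterCache f = new RangeFilterCache(
                new long[] { 20060105L, 20060201L, 20060120L });
        System.out.println(f.bits(20060101L, 20060130L)); // {0, 2}
    }
}
```

A production version would also bound the cache size and invalidate it whenever the index (the IndexReader) changes, since a stale BitSet would mark the wrong documents.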

B. Reduce the time precision. From the filter's working principle it is clear that it traverses the whole index every time, so the coarser the time granularity, the faster the comparison and the shorter the search. Lowering the time precision is harmless as long as the function is unaffected, and sometimes sacrificing a little precision is worth it; the best case, of course, is no time limit at all.

The following shows the optimization results for the two ideas above (both measured with 800 threads, random keywords, and random time ranges).

Group 1, time precision in seconds, average time per thread: plain RangeFilter 10 s; with cache 1 s; no filter 300 ms.

Group 2, time precision in days, average time per thread: plain RangeFilter 900 ms; with cache 360 ms; no filter 300 ms.

It can be concluded from the above data that:

  1. Reduce the time precision as much as possible; switching from seconds to days gives better performance than even using the cache, and best of all is no filter.
  2. Without compromising on time precision, using the cache provides roughly a tenfold performance improvement.

9.2.3 Use a better analyzer

This is the same as in the index-creation optimization: a smaller index file makes searching faster. The improvement is limited, though; the best analyzer gives less than a 20% gain over the worst.

10 Some Lessons

10.1 Keywords are case-sensitive

Keywords such as OR, AND, and TO are case-sensitive; Lucene recognizes only the uppercase forms and treats the lowercase ones as ordinary words.

10.2 Read/write Mutual Exclusion

Only one write operation may run on an index at a time; searches can proceed concurrently with writing.

10.3 file lock

If the process is forced to exit while writing the index, a lock file is left in the tmp directory, blocking subsequent write operations. It can be deleted manually.

10.4 Time Format

Lucene supports only one time format, yyMMddHHmmss; if you give Lucene a yyyy-MM-dd HH:mm:ss string, it will not be treated as a time.
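So a time value should be converted to the yyMMddHHmmss form before being added as a field. A sketch of the conversion with SimpleDateFormat (the class and method names here are my own, for illustration):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class LuceneTimeFormat {
    private static final SimpleDateFormat LUCENE_FMT = new SimpleDateFormat("yyMMddHHmmss");
    private static final SimpleDateFormat HUMAN_FMT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // Convert "2006-01-01 08:30:00" into the form Lucene expects, "060101083000".
    static String toLucene(String human) {
        try {
            return LUCENE_FMT.format(HUMAN_FMT.parse(human));
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a yyyy-MM-dd HH:mm:ss time: " + human, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("2006-01-01 08:30:00")); // 060101083000
    }
}
```

A convenient side effect: yyMMddHHmmss strings sort lexicographically in time order, which is exactly what range queries on a string field rely on.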

10.5 Setting the boost

Sometimes a field should carry more weight in search. For example, you may consider a keyword appearing in the title more valuable than the same keyword appearing in the body, so that articles matching in the title rank first in the results (when no Sort is used). To do this, set the title field's boost higher with setBoost(float boost). The default value is 1.0, so to increase a field's weight set a value greater than 1.