Today we're going to talk about Lucene, so grab a seat.

(What does Lucene do?)

First let’s look at a mind map:

The mind map above shows the full-text search engine frameworks commonly used in Java; the search features of many projects are built on these four frameworks.

So what does Lucene really do?

Lucene is an open-source library for full-text indexing and search: a simple but powerful core library and API that makes it easy to add search functionality to an application.

Lucene is currently the most popular Java full-text search framework. Hibernate Search, Solr, and Elasticsearch are all search engines built on top of Lucene.

Hibernate Search is a full-text retrieval tool based on Apache Lucene, mainly used to add search to Hibernate persistence models.

Elasticsearch is also developed in Java and uses Lucene as its core for all indexing and searching, but it aims to hide the complexity of Lucene with a simple RESTful API to make full-text searching easy.

Solr is an open-source, Lucene-based Java search server that is easy to add to web applications. It provides faceted search (that is, search with statistics), hit highlighting, and support for a variety of output formats, including XML/XSLT and JSON.

So isn't Lucene awesome?!

Next, we'll break things down into the following parts to uncover the true face of Lucene.

  • Relevant concepts

  • The process of building and querying indexes

  • Inverted index

  • Visualization tool

  • Project Application Guide

Relevant concepts

Lucene official website:

Since it is a full-text search tool, there must be some sorting structure and rules behind it. Lucene maintains an internal hierarchy so that when we type in keywords, it can quickly retrieve what we need. Several layers and concepts are involved.

Index library (Index)

An index library corresponds to a directory: all the files in the same folder together make up one Lucene index library. It is similar to the concept of a table in a database.

(Lucene index instance)

Segment (Segment)

A Lucene index may consist of multiple sub-indexes, which are called segments. Each segment is a complete, independent index that can be searched on its own.

Document (Document)

An index can contain multiple segments that are independent of each other; adding new documents generates new segments, and different segments can be merged. A document is the basic unit of index data storage, similar to a row in a database or a document in a document database.

Domain (Field)

A document contains different types of information that can be indexed separately, such as title, date, body, author, etc. A field is similar to a column in a database table.

Term (Term)

A term is the smallest unit of indexing: a string produced after lexical analysis and language processing. A Field consists of one or more terms. For example, if the title is “Hello Lucene”, word segmentation yields the two terms “hello” and “lucene”; searching for the keyword “hello” or “lucene” will then match this title.
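As a sketch of what segmentation produces, here is a plain-Java illustration (this is not Lucene's Analyzer API; the lowercasing-and-splitting below is a deliberately crude stand-in for real analysis):

```java
import java.util.Arrays;

public class TermDemo {
    // Crude stand-in for an analyzer: lowercase the text and split on whitespace.
    static String[] toTerms(String text) {
        return text.toLowerCase().split("\\s+");
    }

    public static void main(String[] args) {
        // The title "Hello Lucene" yields the two terms "hello" and "lucene"
        System.out.println(Arrays.toString(toTerms("Hello Lucene"))); // [hello, lucene]
    }
}
```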

Analyzer (Analyzer)

A piece of meaningful text must be split into words by an Analyzer before it can be searched by keyword. StandardAnalyzer is the most common analyzer in Lucene. Chinese analyzers include CJKAnalyzer, SmartChineseAnalyzer, and so on.

(Lucene index storage structure concept diagram)

When a new document is added to the index, a new segment is generated; segments can later be merged. Each document has multiple fields that can be indexed, and each field can be given a different type (StringField, TextField, etc.).

So, as you can see from the figure, Lucene has the following hierarchy: **Index → Segment → Document → Field → Term**.

Now that we understand some basic Lucene concepts, let's move on to how it works.

(Why are Lucene search engine queries so fast?)

Inverted index

We all know that indexing is the key to fast retrieval; Lucene uses an inverted index (also known as a reverse index) structure.

Where there is an inverted (reverse) index, there is naturally also a forward index.

  • A forward index goes from documents to words: in a normal lookup, we scan through a document to find whether it contains the keyword.

  • An inverted index goes from words to documents; the concept is exactly the inverse of the forward index. Keywords are extracted from the documents in advance, and at query time the keyword is matched directly to obtain the corresponding documents.

It is called an inverted index because it is not the records that determine the attribute values, but the attribute values that determine the positions of the records.

(How exactly?)

Let’s take an example (from the Internet) :

Suppose you now have two documents with the following contents:

  • Document 1: Home Sales Rise in July.

  • Document 2: Increase in home sales in July.

As the figure above shows, the documents are first split by the Analyzer to obtain terms, each mapped to its document ID. These terms are then sorted, identical terms are merged, and for each term the frequency of occurrence and the IDs of the documents it appears in are recorded.
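To make the idea concrete, here is a minimal inverted index built in plain Java over the two example documents. This is an illustration only; Lucene's real on-disk structures are far more elaborate:

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build term -> sorted set of document IDs: a minimal inverted index sketch,
    // not Lucene's actual implementation.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            // crude "analyzer": lowercase and split on non-letters
            for (String term : e.getValue().toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Home Sales Rise in July");
        docs.put(2, "Increase in home sales in July");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(index.get("sales")); // [1, 2] -- both documents
        System.out.println(index.get("rise"));  // [1]    -- only document 1
    }
}
```

At query time, looking up "sales" returns both document IDs directly, with no need to scan the document text.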


In the implementation, Lucene stores the three columns above as a term dictionary file, a frequency file, and a position file, respectively. The dictionary file stores not only each keyword but also pointers into the frequency and position files, through which the frequency and position information of the keyword can be found.

When searching, if you query the word “sales”, Lucene binary-searches the dictionary, finds the term, follows the pointer into the frequency file to read out all the document numbers, and returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
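The dictionary lookup can be sketched like this (a toy model assuming an in-memory sorted array with a parallel postings table; real Lucene dictionaries use much more compact structures):

```java
import java.util.Arrays;

public class DictionaryLookupDemo {
    // Sorted term dictionary and, in parallel, the postings (document IDs) each
    // entry points at -- a toy model of the dictionary/frequency files.
    static final String[] DICTIONARY = {"home", "in", "increase", "july", "rise", "sales"};
    static final int[][] POSTINGS = {{1, 2}, {1, 2}, {2}, {1, 2}, {1}, {1, 2}};

    // Binary-search the dictionary, then follow the "pointer" to the postings.
    static int[] lookup(String term) {
        int pos = Arrays.binarySearch(DICTIONARY, term);
        return pos >= 0 ? POSTINGS[pos] : new int[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lookup("sales"))); // [1, 2]
    }
}
```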

(So that’s it!)

Lucene visualization tool: Luke


The process of building and querying indexes

Now that we know how Lucene builds indexes, let’s use Lucene at the code level.

Let’s start with a picture:

Files must be indexed before they can be retrieved, so the figure above should be read starting from the “files to be retrieved” node.

Index building process:

1. Construct a Document object for each file to be indexed, and treat each part of the file as a Field object.

2. Use the Analyzer class to segment the natural-language text in the documents, and use the IndexWriter class to build the index.

3. Use the FSDirectory class to set how and where the index is stored, and write the index out.

Index query process:

4. Use the IndexReader class to read the index.

5. Use the Term class to represent the keyword the user searches for and the field it resides in, and the QueryParser class to represent the user's query conditions.

6. Use IndexSearcher to search the index and return the Document objects that match the query.

The dotted lines point to the package (Package) each class belongs to. For example, Analyzer is in the org.apache.lucene.analysis package.

Build index code:

```java
import java.nio.file.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class CreateTest {
    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        // MMapDirectory: Linux, macOS, Solaris
        // NIOFSDirectory: other non-Windows JREs
        // SimpleFSDirectory: other JREs on Windows
        // FSDirectory.open() picks the best implementation for the platform
        Directory dir = FSDirectory.open(indexPath);
        Analyzer analyzer = new StandardAnalyzer();

        boolean create = true;
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        if (create) {
            // Lucene does not support in-place updates: CREATE simply removes
            // the old index and creates a new one
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig);

        Document doc = new Document();
        // StringField is indexed but not tokenized, i.e. treated as one token.
        // Store.YES makes the field value available in query results; content
        // that is too large to display (a whole article) should not be stored.
        doc.add(new StringField("Title", "sean", Field.Store.YES));
        long time = new Date().getTime();
        // LongPoint does not store the field value
        doc.add(new LongPoint("LastModified", time));
//        doc.add(new NumericDocValuesField("LastModified", time));
        // TextField is indexed and tokenized automatically
        doc.add(new TextField("Content", "this is a test of sean", Field.Store.NO));

        List<Document> docs = new LinkedList<>();
        docs.add(doc);
        indexWriter.addDocuments(docs);
        // close() commits by default before closing
        indexWriter.close();
    }
}
```

Corresponding sequence diagram:

Query index code:

```java
import java.nio.file.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class QueryTest {
    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        Directory dir = FSDirectory.open(indexPath);
        Analyzer analyzer = new StandardAnalyzer();
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

//        String[] queryFields = {"Title", "Content", "LastModified"};
//        QueryParser parser = new MultiFieldQueryParser(queryFields, analyzer);
//        Query query = parser.parse("sean");

//        Term term = new Term("Title", "test");
//        Query query = new TermQuery(term);

//        Term term = new Term("Title", "se*");
//        WildcardQuery query = new WildcardQuery(term);

        Query query1 = LongPoint.newRangeQuery("LastModified", 1L, 1637069693000L);

        PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder();
        phraseQueryBuilder.add(new Term("Content", "test"));
        phraseQueryBuilder.add(new Term("Content", "sean"));
        phraseQueryBuilder.setSlop(10);
        PhraseQuery query2 = phraseQueryBuilder.build();

        BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
        booleanQueryBuilder.add(query1, BooleanClause.Occur.MUST);
        booleanQueryBuilder.add(query2, BooleanClause.Occur.MUST);
        BooleanQuery query = booleanQueryBuilder.build();

        Sort sort = new Sort();
        sort.setSort(new SortField("Title", SortField.Type.SCORE));

        TopDocs topDocs = searcher.search(query, 10, sort);
        if (topDocs.totalHits > 0) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docNum = scoreDoc.doc;
                Document doc = searcher.doc(docNum);
                System.out.println(doc.toString());
            }
        }
        reader.close();
    }
}
```

Corresponding sequence diagram:

Lucene version information:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>
```

Project Application Guide

In actual development, Lucene is rarely used directly. The current mainstream search frameworks, Solr and Elasticsearch, are both based on Lucene and provide us with more convenient APIs. In distributed environments especially, Elasticsearch handles single-point-of-failure, backup, and cluster-sharding problems, which is more in line with current trends.

With that, the article is over.

Follow the public account [MarkerHub] and reply “mind map” to get the source file of the mind map!