The development of full-text search technology has gone through the following five stages, from the initial stage of FTP file retrieval to the current stage of web link analysis and user intent identification, and has by now matured considerably.

As can be seen from the figure above, the five stages developed in the following order:

  1. FTP file retrieval stage: Users store files on an FTP server. A user searches by entering the exact file name, obtains the FTP address of the file, and can then download the file from that address.

  2. Category navigation stage: Web pages are saved and displayed to users by category. A user finds the URL of a page under the relevant category as needed, and then navigates to the page via that URL.

  3. File relevance stage: As the content of the Internet grew richer, locating a web page by category alone was no longer accurate enough. To solve this problem, search engines introduced full-text search technology to ensure that the search query is strongly correlated with the page content.

  4. Page link analysis stage: This stage mainly uses external links to measure the importance and popularity of each website. When a user searches, results are filtered by combining each site's importance and popularity, improving the quality of the returned information.

  5. User intent identification stage: This stage is user-centered, striving to return the page content that best matches the user's search intent from the shortest possible search keywords, so that each user sees results tailored to them.
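The link analysis in stage 4 is essentially the idea behind PageRank: a page linked to by many important pages is itself important. A minimal power-iteration sketch follows; the three-page link graph, damping factor, and iteration count are illustrative assumptions, not any real search engine's parameters:

```java
import java.util.Arrays;

public class PageRankSketch {
    // Compute PageRank by power iteration over a toy 3-page link graph.
    static double[] compute() {
        int[][] links = { {1, 2}, {2}, {0} }; // links[i] = pages that page i links to
        int n = links.length;
        double damping = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);          // random-jump share
            for (int page = 0; page < n; page++) {
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share;                 // pass rank along each link
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 2 is linked to by both other pages, so it ends up ranked highest.
        System.out.println(Arrays.toString(compute()));
    }
}
```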

Among them, the representative work of the page link analysis stage is Google search, and the representative of the user intent identification stage is Baidu search. However, these belong to off-site (web-wide) search. Besides off-site search, there is also on-site search, such as full-text search in an admin backend or product search on an e-commerce site.

Today's mature full-text search engines, Elasticsearch and Solr, are both built on top of the Lucene search library, so it is important to understand Lucene's basic principles.

Basic usage

Create indexes

To create an index, perform the following operations:

  • Create an index directory: used to store the generated index files.
  • Choose an analyzer: performs word segmentation on the content.
  • Create an IndexWriter: used to write the index.
  • Create a Document: a Document holds the content you want to add to the index.

The demo code is as follows:

class LuceneServiceImplTest {

    void createIndex() throws IOException {

        final String indexPath = "lucene/indexDir/";

        // Create the index directory
        Path path = Paths.get(indexPath);
        File file = path.toFile();
        if (!file.exists()) { // If the folder does not exist, create it
            file.mkdirs();
        }
        FSDirectory directory = FSDirectory.open(path);

        // Use the standard analyzer for word segmentation
        StandardAnalyzer standardAnalyzer = new StandardAnalyzer();

        // Create the IndexWriter
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(standardAnalyzer);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // Create the document
        Document document = new Document();
        // Add fields to the document
        document.add(new TextField("goodsName", "goods", Field.Store.YES));
        document.add(new TextField("goodsPrice", "100", Field.Store.YES));

        // Create the index and commit
        indexWriter.addDocument(document);
        indexWriter.commit();
        indexWriter.close();
    }
}

All this code does is add goodsName and goodsPrice as the two fields of a document and write that document to the index. After running the above, you will find the following files under lucene/indexDir:

Basic concept

To understand the Lucene principle, we need to understand the following concepts.

  • Document: similar in concept to a row in a database table; an Index can contain multiple Documents. A Document written to an Index is assigned a unique ID, known as a Sequence Number (better known as a DocId).
  • Field: A Document may consist of one or more Fields (as shown in the following figure from the source code). Field is the minimum unit of a data index in Lucene. Lucene provides many different types of Fields, such as StringField, TextField, and LongField.

  • Term: The smallest unit of indexing and searching in Lucene. A Field may consist of one or more Terms, which are produced when the Field's value passes through an Analyzer.
  • Term Dictionary: the term dictionary, the basic index structure used to look up Terms by a given condition.
  • Segment: An index is split into multiple segments, which can later be merged, to prevent individual index files from growing too large.
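The relationship between Document, Field, Term, and Term Dictionary can be illustrated with a toy in-memory model (plain Java, not Lucene's actual classes): each document's field values are split into terms, and the term dictionary maps each term to the sorted list of DocIds that contain it:

```java
import java.util.*;

public class TermDictionarySketch {
    // Term dictionary: term -> list of DocIds containing it (a toy posting list).
    static final SortedMap<String, List<Integer>> dictionary = new TreeMap<>();

    // Index one document: split each field value into terms and record the DocId.
    static void addDocument(int docId, Map<String, String> fields) {
        for (String value : fields.values()) {
            for (String term : value.toLowerCase().split("\\s+")) {
                dictionary.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
    }

    public static void main(String[] args) {
        addDocument(0, Map.of("goodsName", "red apple"));
        addDocument(1, Map.of("goodsName", "green apple"));
        // Both documents contain the term "apple", so its posting list is [0, 1].
        System.out.println(dictionary.get("apple"));
    }
}
```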

According to the above concepts, we sorted out the relationship diagram between them:

Principles of Index Creation

Before we dive into the process of creating an index, let’s look at the overall process:

Lucene's module responsibility diagram:

As you can see from the above two figures, Lucene is not responsible for creating the Document and Field objects for us; it only provides the related functionality, and we still need to read the source content and create the Document and Field ourselves. From the flow chart, we can see that Lucene uses an analyzer to segment the document's Field values until Terms are produced. Several important stages of word segmentation are explained as follows:

  1. Removing stop words: Stop words are among the most common words in a language. Because they carry no special meaning, they are usually not useful search targets, so they are removed when creating the index to reduce its size. English stop words include "the", "a", "this", and so on.
  2. Stemming: reducing words to their root form, e.g., "cars" to "car".
  3. Lemmatization: reducing words to their dictionary form, e.g., "drove" to "drive".
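A simplified analysis chain combining the three steps above might look like the following sketch. The stop-word set, the single suffix rule, and the lemma table are tiny illustrative samples, not Lucene's actual analyzer logic:

```java
import java.util.*;

public class AnalyzerSketch {
    static final Set<String> STOP_WORDS = Set.of("the", "a", "this");
    static final Map<String, String> LEMMAS = Map.of("drove", "drive");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) continue;   // 1. remove stop words
            if (token.endsWith("s")) {                  // 2. naive stemming rule
                token = token.substring(0, token.length() - 1);
            }
            terms.add(LEMMAS.getOrDefault(token, token)); // 3. lemmatization lookup
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("the man drove cars")); // → [man, drive, car]
    }
}
```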

After word segmentation produces Terms, they are handed to the indexing component, which builds the index that stores them. Lucene uses an FST (finite state transducer) data structure to store the term dictionary, and skip lists to store the inverted (posting) lists. The diagram is as follows:
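Leaving the FST aside, the benefit of keeping posting lists sorted by DocId (which skip lists then accelerate) can be seen in a two-pointer intersection, a conceptual sketch of how an AND query over two Terms walks their posting lists; this is an illustration, not Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsIntersection {
    // Intersect two DocId lists sorted ascending, as an AND query would.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {            // DocId present in both lists
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;                       // skip pointers let real engines jump ahead here
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // → [3, 9]
        System.out.println(intersect(new int[]{1, 3, 5, 9}, new int[]{3, 4, 9}));
    }
}
```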

Finally

How Lucene keeps threads safe will be explained in the next article.

References

  • Elasticsearch: How it works
  • Lucene Parsing – Basic Concepts
  • Lucene full text search principle and process