
What is full-text search

Data classification

There are two kinds of data in our lives: structured data and unstructured data. Structured data has a fixed format or limited length, such as database records and metadata. Unstructured data has variable length or no fixed format, such as emails, Word documents, and other files on disk.

Structured data search

Common structured data is data in a database.

Searching a database is easy to implement, usually with SQL statements, and query results come back quickly.

Why is database search easy? Because data in a database is stored regularly: there are rows and columns, and the data format and length are fixed.

How to query unstructured data

Sequential scanning (serial scanning): to find files whose content contains a given string, you open each document in turn and read it from beginning to end. If a document contains the string, it is one of the files you are looking for; you then move on to the next file, and so on until every file has been scanned. Windows search can also search file content this way, but it is quite slow.
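As a minimal sketch (plain Java, with an in-memory map standing in for files on disk; the names and sample data are hypothetical), sequential scanning looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SequentialScan {
    // Return the names of all "files" whose content contains the query string,
    // reading each document from beginning to end.
    static List<String> scan(Map<String, String> files, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            if (e.getValue().contains(query)) {   // scans the whole document
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> files = Map.of(
                "a.txt", "lucene is a full-text search toolkit",
                "b.txt", "structured data lives in databases");
        System.out.println(scan(files, "lucene"));  // [a.txt]
    }
}
```

Every query re-reads every document, which is why this approach degrades badly as the number and size of files grow.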

Full-text search means that an indexing program scans every word in an article and builds an index entry for each word, recording how many times and where the word appears. When a user runs a query, the retrieval program searches against this pre-built index and returns the matching results to the user. The process is similar to looking up a word through a dictionary's lookup tables.

Part of the information in unstructured data is extracted and reorganized so that it has some structure; this structured data is then what gets searched, which makes the overall search relatively fast. The information extracted from unstructured data and reorganized in this way is called an index.

For example, a dictionary: its pinyin table and radical table are the dictionary's indexes, while the explanation of each character is unstructured. If a dictionary had no pinyin or radical tables, you could only scan character by character through a vast sea of entries. But some information about a character can be extracted and structured, such as its pronunciation: pronunciations are fairly structured, split into initials and finals, each of which has only a small, enumerable set of values. So the pronunciations are laid out in a fixed order, and each one points to the page number of the character's detailed explanation. We search the structured pinyin table for a sound, then follow the page number it points to and find our unstructured data: the explanation of the character.

This process of first building an index and then searching the index is called full-text search. Although building the index is time-consuming, once built it can be used many times, and full-text search mainly serves queries, so the time spent building the index is worthwhile.

How to implement full-text retrieval

You can use Lucene for full-text retrieval. Lucene is an open-source full-text search engine toolkit from Apache. It provides a complete query engine and indexing engine, plus partial text-analysis engines (for English and German). Lucene's goal is to give software developers an easy-to-use toolkit for adding full-text retrieval to their target systems.

Lucene applicable scenarios:

  • In an application, providing full-text retrieval of the data in the database
  • Developing standalone search engine services and systems

Lucene features:

1. Stability and high index performance

  • It can index more than 150GB of data per hour
  • Memory requirements are small, requiring only 1MB of heap memory
  • Incremental indexing is just as fast as bulk indexing
  • The size of the index is roughly 20% to 30% of the size of the indexed text

2. Efficient, accurate and high-performance search algorithm

  • Ranked searching: the best results are returned first
  • Powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.
  • Fielded searching (e.g. title, author, content)
  • Sorting by any field
  • Merging of results from multiple indexes
  • Simultaneous update and searching
  • Support for highlighting, joins, and result grouping
  • Fast search speed
  • Pluggable ranking models, including the built-in Vector Space Model, with BM25 as an option
  • Configurable storage engine

3. Cross-platform

  • Written entirely in Java
  • Released as open source under the Apache License, so it can be used in commercial as well as open-source projects
  • Lucene has ports in many languages (C, C++, Python, etc.), not just Java

Lucene architecture:

Application scenarios of full-text retrieval

Full-text search suits data that is large in volume and has no fixed structure.

  • Standalone software search: Word, Markdown editors
  • Site search: JD.com, Taobao, Lagou; the index source is the database
  • Search engines: Baidu, Google; the index source is data captured by crawlers

Lucene full text retrieval process description

Index and search flow charts

1. Green represents the indexing process: building an index library from the original content to be searched. The indexing process includes: determine the original content to search –> collect documents –> create documents –> analyze documents –> index documents

2. Red represents the search process: searching for content in the index library. The search process includes: the user enters a query through the search interface –> create the query –> execute the search against the index library –> render the search results

Create indexes

Key Concepts:

Document: a user-supplied source is a record, which can be a text file, a string, a record from a database table, and so on. Once a record is indexed, it is stored in the index file in the form of a Document, and search results are likewise returned as a list of Documents.

Field: a Document can contain multiple pieces of information; for example, an article can contain "title", "body", "last modified time", and so on, each stored in the Document as a Field. A Field has two notable attributes: store and index. The store attribute controls whether the Field's value is stored; the index attribute controls whether the Field is indexed. If we want full-text search over the title and body, we set the index attribute of both fields to true. We also want to show the title directly in search results, so we set the title field's store attribute to true. The body field is very large, so to keep the index file small we set its store attribute to false and read the file directly when the body is needed. We only want to show the last modified time in search results, not search on it, so we set its store attribute to true and its index attribute to false. These three fields cover three of the four combinations of the two attributes; the fourth, where both are false, goes unused. In fact, Field does not allow that configuration, because a field that is neither stored nor indexed is meaningless.

Term: a Term is the smallest unit of search and represents a word in a document. It consists of two parts: the word itself and the name of the Field in which the word appears.

Take the search feature of a recruitment website as an example. The content shown in search results does not come directly from the database but from the index library, so the site's index data must be created in advance. Here is how it works:

Step 1: obtain the original documents: the data to be indexed is read from the MySQL database via SQL statements.

Step 2: create Document objects: build the queried content into Document objects that Lucene can recognize. The point of obtaining the original content is to index it, and before indexing, the original content must be turned into Documents. A Document contains Fields, and here each Field corresponds to a column in the table.

Note: each Document can have multiple Fields; different Documents can have different Fields; and the same Document can contain duplicate Fields (same field name and field value). Each Document has a unique number, the document ID.

Step 3: analyze the documents. Once the original content has been built into Documents containing Fields, the content of each Field is analyzed. Analysis extracts words from the original document, converts letters to lowercase, removes punctuation, removes stop words, and so on, producing the final lexical units. A lexical unit can be understood as an individual word.

The resulting words form the smallest units in the index library: terms. A term consists of a field name and a word.
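The analysis steps just described (extract words, lowercase, strip punctuation, drop stop words) can be sketched in plain Java; the stop-word list here is an illustrative assumption, not Lucene's actual list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Analyze {
    // A tiny illustrative stop-word list (assumption, not Lucene's real list)
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of");

    // Turn raw text into the final lexical units (tokens).
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            String t = word.toLowerCase()             // convert to lowercase
                    .replaceAll("[^a-z0-9]", "");     // remove punctuation
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) { // remove stop words
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a full-text Search toolkit."));
        // [lucene, fulltext, search, toolkit]
    }
}
```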

Step 4: Create an index

Index all the lexical units obtained from analyzing the documents. The purpose of the index is search: ultimately, searching the indexed lexical units locates the Documents that contain them.

Note: what is created is an index of lexical units, which finds documents by words. This index structure is called an inverted index structure.

The inverted index structure contains two parts, the index and the documents: the index is the vocabulary, which is relatively small, while the document collection is much larger.

Inverted index

The inverted index records which documents each term appears in and the term's positions within them, so the documents containing a term, and the positions where it occurs, can be located quickly.

Document: each piece of raw data in the index library, such as one product record or one job posting

Term: a word obtained by segmenting the raw data with a word-segmentation algorithm

Create an inverted index with the following steps:

1) Create the document list: Lucene first numbers the original documents (DocID) to form a list, the document list

2) Create the inverted index list: segment the data in the documents to obtain terms, one by one; number the terms and build an index keyed by term; then record the numbers (and other information) of all documents that contain each term.

Search process: when the user enters a query, the input is first segmented to obtain all the terms being searched for; these terms are then matched against the inverted index list. Matching them yields the numbers of all documents that contain them, and following those numbers into the document list locates the documents.
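The document list, the inverted index list, and the search process above can be modeled with plain Java collections; this is an illustrative model, not Lucene's actual on-disk format:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndex {
    // Document list: DocID -> original text
    final List<String> docs = new ArrayList<>();
    // Inverted index list: term -> DocIDs of all documents containing it
    final Map<String, TreeSet<Integer>> index = new HashMap<>();

    void add(String text) {
        int docId = docs.size();                 // 1) number the document
        docs.add(text);
        for (String term : text.toLowerCase().split("\\s+")) {  // 2) segment into terms
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Segment the query, match each term against the inverted index list,
    // then follow the DocIDs back into the document list.
    List<String> search(String query) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            hits.addAll(index.getOrDefault(term, new TreeSet<>()));
        }
        List<String> result = new ArrayList<>();
        for (int id : hits) result.add(docs.get(id));
        return result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("java backend engineer");
        idx.add("frontend engineer");
        System.out.println(idx.search("engineer"));  // [java backend engineer, frontend engineer]
    }
}
```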

Querying the index

Querying the index is also a search process: the user enters keywords, the keywords are searched against the index, and the corresponding documents are found through the index.

Step 1: create a user interface where the user enters keywords
Step 2: create the query, specifying the field name and the keywords
Step 3: execute the query
Step 4: render the results (the keywords should be highlighted when the results are displayed on the page)

A hands-on Lucene example


Goal: generate a job-information index library, then retrieve data from the index library.

Create a database named es and run the provided SQL script against it.

Preparing the development environment

Step 1: create a Maven project. Since we have already learned Spring Boot, we will create a Spring Boot project.

Step 2: Import dependencies

<!-- Spring Boot parent starter dependency -->
    <version>2.1.6.RELEASE</version>

    <!-- web dependency -->
    <!-- test dependencies -->
    <!-- lombok tools -->
    <!-- hot deployment -->
    <!-- mybatis-plus -->
        <version>3.3.2</version>
    <!-- POJO persistence -->
    <!-- mysql driver -->
    <!-- Lucene core package and word-segmentation package -->

        <!-- build plugins -->
        <!-- packaging plugin -->

Step 3: Create the bootstrap class

@SpringBootApplication
public class FullTextSearchDempApplication {
    public static void main(String[] args) {
        SpringApplication.run(FullTextSearchDempApplication.class, args);
    }
}

Step 4: Configure the Properties file

server:
  port: 9000
spring:
  application:
    name: ran-lucene
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://localhost:3306/es?useUnicode=true&characterEncoding=utf8&serverTimezone=UTC
    username: root
    password: Ran@123456

# Enable camel-case name mapping
mybatis-plus:
  configuration:
    map-underscore-to-camel-case: true

Step 5: create the entity class, Mapper, and Service

Create indexes

import java.io.File;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import org.wltea.analyzer.lucene.IKAnalyzer;

@RunWith(SpringRunner.class)
@SpringBootTest
public class LuceneIndexTest {

    @Autowired
    private JobInfoService jobInfoService;

    /** Create the index */
    @Test
    public void create() throws Exception {
        //1. Specify where the index files are stored; the index is a set of regular files
        Directory directory = FSDirectory.open(new File("/Users/RG/Documents/class/index"));
        //2. Configure the version and its tokenizer
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
        // Overwrite (delete) any existing index library
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        //3. Create an IndexWriter object
        IndexWriter indexWriter = new IndexWriter(directory, config);
        //4. Obtain the index source / raw data
        List<JobInfo> jobInfoList = jobInfoService.selectAll();
        //5. Iterate over jobInfoList, creating a Document object for each record
        for (JobInfo jobInfo : jobInfoList) {
            // Create the Document object
            Document document = new Document();
            // Create Field objects and add them to the document
            document.add(new LongField("id", jobInfo.getId(), Field.Store.YES));
            // Tokenized, indexed, and stored fields
            document.add(new TextField("companyName", jobInfo.getCompanyName(), Field.Store.YES));
            document.add(new TextField("companyAddr", jobInfo.getCompanyAddr(), Field.Store.YES));
            document.add(new TextField("companyInfo", jobInfo.getCompanyInfo(), Field.Store.YES));
            document.add(new TextField("jobName", jobInfo.getJobName(), Field.Store.YES));
            document.add(new TextField("jobAddr", jobInfo.getJobAddr(), Field.Store.YES));
            document.add(new TextField("jobInfo", jobInfo.getJobInfo(), Field.Store.YES));
            document.add(new IntField("salaryMin", jobInfo.getSalaryMin(), Field.Store.YES));
            document.add(new IntField("salaryMax", jobInfo.getSalaryMax(), Field.Store.YES));
            document.add(new StringField("url", jobInfo.getUrl(), Field.Store.YES));
            // Append the document to the index library
            indexWriter.addDocument(document);
        }
        // Close the resource
        indexWriter.close();
        System.out.println("create index success!");
    }
}

Querying the index

@Test
public void query() throws Exception {
    //1. Specify where the index files are stored; the index is a set of regular files
    Directory directory = FSDirectory.open(new File("/Users/RG/Documents/class/index"));
    //2. Create an IndexReader object
    IndexReader indexReader = DirectoryReader.open(directory);
    //3. Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Use a term query to find all documents whose companyName contains "Beijing"
    Query query = new TermQuery(new Term("companyName", "Beijing"));
    TopDocs topDocs = indexSearcher.search(query, 100);
    // Get the number of documents that match the query
    int totalHits = topDocs.totalHits;
    System.out.println("Number of matching documents: " + totalHits);
    // ScoreDoc encapsulates the document ID information
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        // Document id
        int docId = scoreDoc.doc;
        // Get the document object by its document ID
        Document doc = indexSearcher.doc(docId);
        System.out.println("companyName: " + doc.get("companyName"));
        System.out.println("***********************************");
    }
    // Release the resource
    indexReader.close();
}

Using a Chinese word segmenter

Step 1: add the dependency

<!-- IK Chinese word segmenter -->

Step 2: add the configuration file (optional)
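For reference, IK reads its configuration from a file conventionally named `IKAnalyzer.cfg.xml` on the classpath. A minimal sketch, in which the dictionary file names `ext.dic` and `stopword.dic` are illustrative assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: custom words, one per line -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```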

Step 3: use IKAnalyzer when creating the index; in the index-creation code above, `new IKAnalyzer()` is passed to the IndexWriterConfig in place of the default analyzer.