I like to explain technical concepts in simple language.

I regularly share original Java articles, along with advanced architect study notes and learning materials.

If you like this, fellow underdogs, follow along so we can learn and improve together.

P.S. My salary has not been paid, which has affected my mood, so I have not written an article for several days…

What is Lucene?

0) Keywords: Lucene, search, full-text retrieval

1) Essence: a JAR package, a framework for full-text retrieval

2) Function: Lucene is not a complete full-text search application but a full-text indexing engine toolkit written in Java. It can easily be embedded in applications to give them full-text indexing and retrieval. Why not simply use a database fuzzy query? Because a fuzzy query requires a full table scan or an index scan, which consumes a lot of I/O. If fuzzy queries are frequent, database performance deteriorates.

3) Benefits:

1. The index file format is platform independent. Lucene defines an index file format based on 8-bit bytes, so that compatible systems and applications on different platforms can share the index files they create.

2. Building on the inverted index of traditional full-text search engines, Lucene implements segmented indexing: it builds a small index for newly added files, which speeds up indexing, and then merges it with the existing index to optimize the result.

3. A well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new features.

4. The text analysis interface is independent of language and file format. The indexer builds index files from a token stream, so supporting a new language or file format only requires implementing the text analysis interface.

5. A powerful query engine is implemented by default, so users get strong query capabilities without writing their own code. Lucene's default query implementation supports Boolean operations, fuzzy search, grouped queries, and more.

6. It is open source, highly extensible, available in ports for various languages, and suitable for various platforms.
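To make benefit 2 more concrete, here is a minimal sketch of segmented indexing in plain Java. It is not Lucene's implementation: the class `SegmentSketch` and its methods are hypothetical names, and a segment is simply modeled as a map from terms to document ids.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of segmented indexing; not Lucene's real classes.
class SegmentSketch {
    // Builds a small segment: each term maps to the sorted ids of the documents containing it.
    static Map<String, TreeSet<Integer>> buildSegment(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> segment = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                segment.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return segment;
    }

    // Merges two segments by taking the union of their posting lists, term by term.
    static Map<String, TreeSet<Integer>> merge(Map<String, TreeSet<Integer>> a,
                                               Map<String, TreeSet<Integer>> b) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> segment : List.of(a, b)) {
            segment.forEach((term, ids) ->
                merged.computeIfAbsent(term, t -> new TreeSet<>()).addAll(ids));
        }
        return merged;
    }
}
```

New documents only touch the small segment, so adding them is cheap. The cost of combining everything is deferred to the merge step.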

4) Scenario: Many search engines are built on the Lucene framework, such as Solr, Elasticsearch, and Nutch.

5) Chinese: For Chinese users, the main concern is whether Lucene supports full-text retrieval of Chinese text. As Lucene’s architecture shows, supporting Chinese retrieval only requires extending its lexical analysis interface.
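As a rough illustration of what such a lexical analysis extension might do, the sketch below cuts runs of Chinese characters into overlapping two-character tokens, which is one simple segmentation strategy. The class `BigramTokenizerSketch` is a hypothetical stand-in written in plain Java. It does not implement Lucene's analyzer interface and does not handle characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexical analyzer; not Lucene's Analyzer API.
class BigramTokenizerSketch {
    // Emits overlapping two-character tokens for each run of Han characters,
    // and one lower-cased token for each run of other letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isHan(c)) {
                int start = i;
                while (i < text.length() && isHan(text.charAt(i))) i++;
                if (i - start == 1) {
                    tokens.add(text.substring(start, i)); // a lone character stays as is
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length()
                        && Character.isLetterOrDigit(text.charAt(i))
                        && !isHan(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase());
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    static boolean isHan(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }
}
```

A dictionary-based segmenter would produce better tokens than bigrams. The point here is only that the change stays inside the analysis step and the rest of the indexing pipeline is untouched.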

6) Main packages (since Lucene is only a framework, the main functions of some of its packages should be introduced):

7) Architectural design (Lucene mainly does two things):

As shown in the figure, full-text retrieval is generally divided into two processes (indexing and search).

I. Index creation and storage: after word segmentation, the text content is indexed and stored (that is, an IndexWriter creates indexes for the various files and saves them where the index files are kept). The storage logic is roughly as follows:

1. The person storing the data defines the structure of the documents in the library. For example, website content needs to be loaded into the full-text retrieval library so that users can find relevant pages through “site search”. The structure of a document is similar to that of a table in a relational database: each document is composed of multiple fields. Assume that the website content to be stored includes the following fields: article title, author, publication time, original link, and body content (generally kept as a snapshot of the web page).

2. A document containing N fields must be indexed through word segmentation before it is actually stored. The segmentation rules are implemented by the language analyzer (ANALYZER).

3. The segmented words are registered in the index tree for later queries, and content that does not need to be indexed is stored as is. All of these operations are performed by the storage component (STORAGE).

4. Lucene’s index tree structure is very well designed, which is one of Lucene’s distinguishing features.
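The storage steps above can be sketched roughly as follows. This is a toy model in plain Java under simplifying assumptions, not Lucene's API: `IndexingSketch`, `add`, and `analyze` are hypothetical names, the “index tree” is reduced to a sorted map, and the analyzer just lower-cases and splits on non-letter characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy store; class and method names are illustrative, not Lucene's API.
class IndexingSketch {
    // term -> ids of the documents whose indexed fields contain it
    final Map<String, SortedSet<Integer>> index = new TreeMap<>();
    // document id -> all field values, kept verbatim for display
    final List<Map<String, String>> stored = new ArrayList<>();

    // Stores every field, but tokenizes only the fields named in indexedFields.
    int add(Map<String, String> fields, Set<String> indexedFields) {
        int docId = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (String name : indexedFields) {
            for (String term : analyze(fields.getOrDefault(name, ""))) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Stand-in analyzer: lower-case the text and split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) terms.add(term);
        }
        return terms;
    }
}
```

For the website example, the title and body would be passed as indexed fields, while the original link would only be stored so that it can be shown next to a hit.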

II. Index search: take the user’s query request, search the created index, and return the results. The general logic is as follows:

1. The user enters the query conditions, which can be combined with specific operators. For example, to find records related to “China” and “Beijing” but exclude results containing “Zhongguancun, Haidian District”, the input condition is “China + Beijing – Zhongguancun, Haidian District”.

2. The query condition is passed to the query analyzer, which analyzes “China + Beijing – Zhongguancun, Haidian District”. The analyzer first parses the connectors in the string, here the plus and minus signs, and then segments the words. In Chinese the smallest word unit is generally two characters, so “China” and “Beijing” need no further splitting, but “Zhongguancun, Haidian District” does. Assuming the segmentation algorithm splits it into “Haidian District” and “Zhongguancun”, the final query condition can be expressed as: “China” AND “Beijing” AND NOT (“Haidian District” AND “Zhongguancun”).

3. The query traverses the index tree according to this condition, obtains the query results, and returns a result set (similar to the ResultSet in JDBC).

4. The returned result set is displayed on the query result page.

5. Note that Lucene only supports English by default. The query process above uses a Chinese example for illustration; once Lucene is extended to support Chinese, queries follow exactly this process. (Many useful Chinese dictionary-based segmenters are available; you can search for them.)
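The search steps above can be sketched in plain Java as well. This is a simplified illustration, not Lucene's query parser: `QuerySketch` is a hypothetical name, and the sketch assumes that the first clause and every “+” clause are required, while a “-” clause excludes documents that contain all of its terms.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical query sketch; the syntax handling is simplified and is not Lucene's QueryParser.
class QuerySketch {
    // Returns the ids of documents that match every required clause and no prohibited clause.
    static SortedSet<Integer> search(String query,
                                     Map<String, Set<Integer>> index,
                                     Set<Integer> allDocs) {
        SortedSet<Integer> result = new TreeSet<>(allDocs);
        // Split just before each operator so that every clause keeps its own sign.
        for (String clause : query.split("(?=[+-])")) {
            boolean prohibited = clause.startsWith("-");
            SortedSet<Integer> matches = new TreeSet<>(allDocs);
            boolean hasTerm = false;
            // Word segmentation stand-in: lower-case and split on non-letter characters.
            for (String term : clause.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (term.isEmpty()) continue;
                hasTerm = true;
                matches.retainAll(index.getOrDefault(term, Set.of()));
            }
            if (!hasTerm) continue; // ignore a clause that contains only an operator
            if (prohibited) {
                result.removeAll(matches);
            } else {
                result.retainAll(matches);
            }
        }
        return result;
    }
}
```

With an index in which documents 1 and 2 contain both “china” and “beijing”, and only document 2 also contains “haidian” and “zhongguancun”, the query `China + Beijing - Haidian Zhongguancun` keeps document 1 alone. The sketch only recognizes the ASCII hyphen as the minus operator, so hyphenated words would be split incorrectly.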

8) Principle: Lucene uses the “inverted index” technique to map each term to the documents that contain it.
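A minimal sketch of the idea, assuming a toy structure rather than Lucene's actual index format: instead of scanning every document for a word, each word points directly at the documents containing it, together with an occurrence count. The class `InvertedIndexSketch` and the frequency-based ordering are illustrative assumptions; Lucene's real postings and scoring are far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal inverted index; not Lucene's index format or scoring.
class InvertedIndexSketch {
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Splits on non-word characters (ASCII only) and counts each term per document.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Looks the term up directly, then orders the matching documents
    // by occurrence count (highest first), breaking ties by document id.
    List<Integer> lookup(String term) {
        Map<Integer, Integer> hits = postings.getOrDefault(term.toLowerCase(), Map.of());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byFrequency = Integer.compare(hits.get(b), hits.get(a));
            return byFrequency != 0 ? byFrequency : Integer.compare(a, b);
        });
        return ids;
    }
}
```

The lookup cost depends on the number of documents that contain the term, not on the total size of the collection. That is what a database fuzzy query with a full table scan cannot offer.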

That is all for now, just because I feel like it. National Day is almost here and my wages still have not been paid, so I am broke… Can’t you see there is a big loser in here?