What is Lucene?

Lucene is an open-source full-text search engine toolkit published by the Apache Software Foundation. It was written by Doug Cutting, a senior expert in full-text search. Lucene provides a complete architecture for building and querying indexes, along with part of a text-analysis engine. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene is the classic ancestor of the full-text retrieval field; many of today's search engines are built on it, and they all share the same underlying ideas.

Lucene is a keyword-based text search tool. On its own it can only search text within a single site, not across sites.

Since we are talking about in-site search, let's also ask what kind of search our familiar engines such as Baidu and Google are based on…

As can be seen clearly from the figure, Baidu, Google, and other search engines actually gather their data through web crawler programs…


Why do we use Lucene?

When introducing Lucene, we said that Lucene is not a search engine; it only searches text within a site. So why should we learn it?

When we wrote the tax service system earlier, we actually already used SQL to implement in-site search.

If SQL can already do the job, why do we still need to learn Lucene?

Let’s take a look at the disadvantages of using SQL to search:

  • (1) SQL can only search database tables; it cannot directly search text files on the hard disk
  • (2) SQL has no relevance ranking
  • (3) SQL search results cannot highlight the matched keywords
  • (4) SQL requires a database, and the database itself has a large memory overhead (for example, Oracle)
  • (5) SQL searches are sometimes slow, especially when the database is not local (for example, a remote Oracle instance)
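For reference, here is a minimal sketch of the kind of SQL LIKE search these points describe (the connection string, the table name article, and the column content are hypothetical, only for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlLikeSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle connection; swap in your own database details
        Connection connection = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
        PreparedStatement ps = connection.prepareStatement(
                "SELECT id, content FROM article WHERE content LIKE ?");
        // Leading % wildcard: the database scans every row, with no relevance ranking
        ps.setString(1, "%lucene%");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("id") + " : " + rs.getString("content"));
        }
        rs.close();
        ps.close();
        connection.close();
    }
}

The leading wildcard defeats ordinary database indexes, so every row is scanned, and the rows come back in no particular relevance order.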

Now let's look at what a Baidu search for the keyword "Lucene" returns:

SQL can do none of the above. That is why we learn Lucene: so we can search our site's data by text keywords!


If your site needs keyword search, you can use either SQL or Lucene… In that sense Lucene and SQL play the same role: both are code in the persistence layer.

1. A quick start

Next, we will explain how to use Lucene… Before we get into Lucene's API, let's first talk about what Lucene stores… With SQL, the data lives in the database, stored on the hard disk as DBF files… So what does Lucene keep inside?

Lucene stores a series of binary compressed files and some control files on the computer's hard drive. Together these are called the index library. The index library consists of two parts:

  • (1) The original record table
    • The original text saved into the index library, for example: "I am Zhong Fucheng"
  • (2) The vocabulary
    • Every word in the original records, broken out according to some splitting strategy (i.e. an analyzer, or word segmenter) and stored in a table for later search

In other words, Lucene stores its data in what is called an index library, which is divided into two parts: the original record table and the vocabulary…

1.1 Original records and vocabulary

When we want to save data into the index library, we first save it into the original record table…

Because the system is for users, and users query specific records by keyword, we need to split the data we originally saved and store the split terms in the vocabulary.

The vocabulary is similar to an index table in Oracle: when the data is split, each term is given a reference back to the record it came from.

When the user searches by keyword, the program first checks whether the keyword exists in the vocabulary. If it does, it locates the matching entries in the original record table and returns the qualifying original records to the user.

Let’s look at the following figure for easy understanding:

Here some people may ask: is the original record split one Chinese character at a time? Wouldn't the vocabulary then contain an enormous number of keywords?

In fact, when saving to the original record table, we can specify which algorithm is used to split the data into the vocabulary… The figure shows Lucene's standard segmentation algorithm, which splits Chinese one character at a time. We can use a different segmentation algorithm instead, such as one that splits two characters at a time, or something else.
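To make the two tables concrete, here is a minimal toy sketch (my own illustration, not Lucene's actual data structures or on-disk format; the class and method names are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MiniIndex {

    // Original record table: record number -> original text
    private final Map<Integer, String> originalRecords = new HashMap<Integer, String>();

    // Vocabulary: term -> numbers of the records containing that term
    private final Map<String, Set<Integer>> vocabulary = new HashMap<String, Set<Integer>>();

    public void add(int id, String text) {
        originalRecords.put(id, text);
        // Toy "analyzer": split one character at a time, like the standard analyzer does with Chinese
        for (char c : text.toCharArray()) {
            String term = String.valueOf(c);
            Set<Integer> ids = vocabulary.get(term);
            if (ids == null) {
                ids = new HashSet<Integer>();
                vocabulary.put(term, ids);
            }
            ids.add(id);
        }
    }

    public List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        Set<Integer> ids = vocabulary.get(term);
        if (ids != null) {
            // The vocabulary points us at the original records, which are returned to the user
            for (int id : ids) {
                hits.add(originalRecords.get(id));
            }
        }
        return hits;
    }
}

A search first hits the small vocabulary table and then jumps straight to the matching records; that is why a keyword search does not need to scan every original record the way a SQL LIKE query does.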

1.2 Write the first Lucene program

First, let’s import the necessary development kit for Lucene:

  • lucene-core-3.0.2.jar
  • lucene-analyzers-3.0.2.jar
  • lucene-highlighter-3.0.2.jar [Lucene uses it to highlight the matched terms for the user]
  • lucene-memory-3.0.2.jar

Create the User object, which encapsulates the data….


/**
 * Created by ozc on 2017/7/12.
 */
public class User {

    private String id;
    private String userName;
    private String sal;

    public User() {
    }

    public User(String id, String userName, String sal) {
        this.id = id;
        this.userName = userName;
        this.sal = sal;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public String getSal() {
        return sal;
    }

    public void setSal(String sal) {
        this.sal = sal;
    }

    // Added so that System.out.println(user) later prints readable output
    @Override
    public String toString() {
        return "User{id='" + id + "', userName='" + userName + "', sal='" + sal + "'}";
    }
}

We want to use Lucene to query data, so first we need an index library. Let's create the index library and store our data in it.

Steps to create an index library:

  • 1) Create JavaBean objects
  • 2) Create a Document object
  • 3) Put all the attribute values of the JavaBean into the Document object (the field names may be the same as or different from the JavaBean's attribute names)
  • 4) Create IndexWriter
  • 5) Write the Document object to the index library through the IndexWriter object
  • 6) Close the IndexWriter object

    @Test
    public void createIndexDB() throws Exception {

        // Populate the data into a JavaBean object
        User user = new User("1", "Zhong Fucheng", "Future programmer");

        // Create a Lucene Document object
        Document document = new Document();

        // Put all the property values of the JavaBean into the Document object.
        // The field names may be the same as or different from the JavaBean's attribute names.

        /**
         * Add a field to the Document object.
         * Parameter 1: the key of the field
         * Parameter 2: the value of the field
         * Parameter 3: whether to store the value in the original record table
         *     YES = store, NO = do not store
         * Parameter 4: whether the stored data should be split into the vocabulary
         *     ANALYZED = split, NOT_ANALYZED = do not split
         */
        document.add(new Field("id", user.getId(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("userName", user.getUserName(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sal", user.getSal(), Field.Store.YES, Field.Index.ANALYZED));

        // Specify E:/createIndexDB as the index directory
        Directory directory = FSDirectory.open(new File("E:/createIndexDB"));

        // Use the standard analyzer to split the original record table
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Maximum number of terms indexed per field; LIMITED defaults to 10,000
        IndexWriter.MaxFieldLength maxFieldLength = IndexWriter.MaxFieldLength.LIMITED;

        /**
         * IndexWriter writes our Document object to the hard disk.
         * Parameter 1: Directory d, where the index library lives
         * Parameter 2: Analyzer a, which algorithm splits the original record data into the vocabulary
         * Parameter 3: MaxFieldLength mfl, how many terms of the text may be indexed
         */
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, maxFieldLength);

        // Write the Document object into the index library via the IndexWriter object
        indexWriter.addDocument(document);

        // Close the IndexWriter object
        indexWriter.close();

    }

When the program is finished, we will see our index library on our hard disk.

At this point we don't know whether the record was really stored in the index, because we can't see it directly. The index keeps its data in CFS files, which we cannot open by hand.

So let's now use a keyword to read data back out of the index library and see whether the data can be retrieved successfully.

Query the contents of the index library by keyword:

  • 1) Create an IndexSearcher object
  • 2) Create a QueryParser object
  • 3) Create a Query object to encapsulate the keyword
  • 4) Use the IndexSearcher object to search the index library for the top 100 records; if fewer than 100 records match, the actual number is returned
  • 5) Obtain the number of records that meet the conditions
  • 6) Use the indexSearcher object to look up the Document objects in the index library
  • 7) Take all the attributes out of each Document object, encapsulate them back into a JavaBean object, and add it to a collection for later use

    @Test
    public void findIndexDB() throws Exception {

        // The IndexSearcher constructor takes the index library directory: IndexSearcher(Directory path)
        Directory directory = FSDirectory.open(new File("E:/createIndexDB"));
        // Create an IndexSearcher object
        IndexSearcher indexSearcher = new IndexSearcher(directory);

        // Create a QueryParser object
        /**
         * Parameter 1: Version matchVersion, the version number
         * Parameter 2: String f, the field to query
         * Parameter 3: Analyzer a, the analyzer
         */
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        QueryParser queryParser = new QueryParser(Version.LUCENE_30, "userName", analyzer);

        // The keyword to search for (part of the stored user name)
        String keyWords = "Zhong";

        // Create a Query object that encapsulates the keyword
        Query query = queryParser.parse(keyWords);

        // Use the IndexSearcher object to fetch the top 100 hits from the index library
        TopDocs topDocs = indexSearcher.search(query, 100);

        // Walk through the hits
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {

            ScoreDoc scoreDoc = topDocs.scoreDocs[i];
            int no = scoreDoc.doc;
            // Use indexSearcher to fetch the Document object from the index library
            Document document = indexSearcher.doc(no);

            // Take all the attributes out of the Document object and encapsulate them back into the JavaBean object
            String id = document.get("id");
            String userName = document.get("userName");
            String sal = document.get("sal");

            User user = new User(id, userName, sal);
            System.out.println(user);

        }
    }

Effect:


1.3 Further description of the Lucene code

Our Lucene programs all follow the same routine: encapsulate a JavaBean into a Document object, then write the Document into the index library via an IndexWriter. When the user needs to query, use an IndexSearcher to read from the index library, find the corresponding Document object, parse its contents, and encapsulate them back into a JavaBean for use.

2. Lucene code optimization

Let's go back to the code we wrote in the quick start and pick out a few representative snippets:

The following code appears both when populating the index library and when querying it. Duplicate code!


        Directory directory = FSDirectory.open(new File("E:/createIndexDB"));

        // Use the standard word segmentation algorithm to split the original record table
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

The following code encapsulates JavaBean data into a Document object. We can do this generically with reflection… Without such encapsulation, we would write a lot of near-identical code whenever there are many JavaBeans to add to Document objects.


  		document.add(new Field("id", user.getId(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("userName", user.getUserName(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sal", user.getSal(), Field.Store.YES, Field.Index.ANALYZED));

The following code takes the data out of the Document object and encapsulates it into a JavaBean. If the JavaBean has many attributes, we would again have to write similar code many times…



            // Take all the attributes out of the Document object and encapsulate them back into the JavaBean object
            String id = document.get("id");
            String userName = document.get("userName");
            String sal = document.get("sal");
     		User user = new User(id, userName, sal);

2.1 Writing Lucene tool classes

Some things to note when writing utility classes:

  • To read an object's property, we can wrap a call to the property's get method
  • Once we have the get method, we can invoke it to obtain the corresponding value
  • To write an object's properties, we use brute-force (setAccessible) access
  • If you have the property name, the value, and the object, remember that the BeanUtils component can do the assignment

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

/**
 * Created by ozc on 2017/7/12.
 */

/**
 * Uses the singleton pattern: one shared Directory, Analyzer, and MaxFieldLength.
 */
public class LuceneUtils {

    private static Directory directory;
    private static Analyzer analyzer;
    private static IndexWriter.MaxFieldLength maxFieldLength;

    private LuceneUtils() {
    }

    static {
        try {
            directory = FSDirectory.open(new File("E:/createIndexDB"));
            analyzer = new StandardAnalyzer(Version.LUCENE_30);
            maxFieldLength = IndexWriter.MaxFieldLength.LIMITED;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static Directory getDirectory() {
        return directory;
    }

    public static Analyzer getAnalyzer() {
        return analyzer;
    }

    public static IndexWriter.MaxFieldLength getMaxFieldLength() {
        return maxFieldLength;
    }

    /**
     * @param object the JavaBean passed in
     * @return the corresponding Document object
     */
    public static Document javaBean2Document(Object object) {
        try {
            Document document = new Document();
            // Get the JavaBean's bytecode (Class) object
            Class<?> aClass = object.getClass();
            // Get the declared fields from the Class object
            Field[] fields = aClass.getDeclaredFields();

            // For each field, derive the getter name and invoke it
            for (Field field : fields) {
                String name = field.getName();
                // Build the getter name, e.g. "id" -> "getId"
                String method = "get" + name.substring(0, 1).toUpperCase() + name.substring(1);
                // Invoke the getter to obtain the field's value
                Method aClassMethod = aClass.getDeclaredMethod(method);
                String value = aClassMethod.invoke(object).toString();
                System.out.println(value);

                // Encapsulate the data into the Document object
                document.add(new org.apache.lucene.document.Field(name, value,
                        org.apache.lucene.document.Field.Store.YES,
                        org.apache.lucene.document.Field.Index.ANALYZED));
            }
            return document;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * @param document the Document object passed in
     * @param aClass   the JavaBean type the caller wants back
     * @return the populated JavaBean
     */
    public static Object document2JavaBean(Document document, Class<?> aClass) {
        try {
            // Create the JavaBean object
            Object obj = aClass.newInstance();
            // Get all the member variables of the JavaBean
            Field[] fields = aClass.getDeclaredFields();
            for (Field field : fields) {
                // Allow brute-force access to private fields
                field.setAccessible(true);
                String name = field.getName();
                String value = document.get(name);
                // Use BeanUtils to copy the value into the bean
                BeanUtils.setProperty(obj, name, value);
            }
            return obj;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    @Test
    public void test() {
        // Use a populated bean; a null field would cause a NullPointerException in javaBean2Document
        User user = new User("1", "Zhong Fucheng", "Future programmer");
        LuceneUtils.javaBean2Document(user);
    }
}

2.2 Use LuceneUtils to modify the program


    @Test
    public void createIndexDB() throws Exception {
        // Populate the data into a JavaBean object
        User user = new User("2", "Zhong Fucheng 2", "Future programmer 2");
        Document document = LuceneUtils.javaBean2Document(user);
        /**
         * IndexWriter writes our Document object to the hard disk.
         * Parameter 1: Directory d, where the index library lives
         * Parameter 2: Analyzer a, which algorithm splits the original record data into the vocabulary
         * Parameter 3: MaxFieldLength mfl, how many terms of the text may be indexed
         */
        IndexWriter indexWriter = new IndexWriter(LuceneUtils.getDirectory(), LuceneUtils.getAnalyzer(), LuceneUtils.getMaxFieldLength());

        // Write the Document object into the index library via the IndexWriter object
        indexWriter.addDocument(document);
        // Close the IndexWriter object
        indexWriter.close();
    }


    @Test
    public void findIndexDB() throws Exception {

        // Create an IndexSearcher object
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.getDirectory());
        // Create a QueryParser object
        QueryParser queryParser = new QueryParser(Version.LUCENE_30, "userName", LuceneUtils.getAnalyzer());
        // The keyword to search for (part of the stored user name)
        String keyWords = "Zhong";
        // Create a Query object that encapsulates the keyword
        Query query = queryParser.parse(keyWords);
        // Use the IndexSearcher object to fetch the top 100 hits from the index library
        TopDocs topDocs = indexSearcher.search(query, 100);
        // Walk through the hits
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            ScoreDoc scoreDoc = topDocs.scoreDocs[i];
            int no = scoreDoc.doc;
            // Use indexSearcher to fetch the Document object from the index library
            Document document = indexSearcher.doc(no);
            // Take all the attributes out of the Document object and encapsulate them back into the JavaBean object
            User user = (User) LuceneUtils.document2JavaBean(document, User.class);
            System.out.println(user);
        }
    }

3. Index library optimization

We can now create an index library and read object data back from it. In fact, there are still places where the index library can be optimized…

3.1 Merging Files

Every time we add data to the index library, it automatically creates a CFS file for us…

This is not good: with a large data volume our hard disk would fill up with many, many CFS files….. In fact, the index library merges files for us automatically; the default merge factor is 10.

If we want to change the default value, we can do so with the following code:


// Optimize the index library (merge the segment files)
indexWriter.optimize();

// Set the merge factor to 3: whenever there are three CFS files, merge them
indexWriter.setMergeFactor(3);

3.2 Configuring a Memory Index Library

Our current program operates directly on files, so the IO overhead is relatively large and the speed relatively slow…. We can use an in-memory index library to improve read and write efficiency…

An in-memory index library is fast because we operate directly on memory… However, we still want to persist the in-memory index into the hard-disk index library. And when reading data, we first synchronize the data from the hard-disk index library into the in-memory one.


		// Prepare the data
		Article article = new Article(1, "Training", "Transwise is a Java training institution");
		Document document = LuceneUtil.javabean2document(article);

		// The hard-disk index library, and an in-memory index library initialized from it
		Directory fsDirectory = FSDirectory.open(new File("E:/indexDBDBDBDBDBDBDBDB"));
		Directory ramDirectory = new RAMDirectory(fsDirectory);

		// Writer for the disk library (true = recreate it) and writer for the memory library
		IndexWriter fsIndexWriter = new IndexWriter(fsDirectory, LuceneUtil.getAnalyzer(), true, LuceneUtil.getMaxFieldLength());
		IndexWriter ramIndexWriter = new IndexWriter(ramDirectory, LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());

		// Write to the fast in-memory index first
		ramIndexWriter.addDocument(document);
		ramIndexWriter.close();

		// Then flush the in-memory index into the hard-disk index
		fsIndexWriter.addIndexesNoOptimize(ramDirectory);
		fsIndexWriter.close();

4. Word segmenters

As mentioned earlier, when storing data into the index library we use some algorithm to split the original record data into the vocabulary….. The collection of such algorithms is what we call a word segmenter (analyzer).

Word segmenter: uses an algorithm to split Chinese and English text into individual words, ready to be matched against the keywords users type when searching.

As for why segmentation is used, we already said it explicitly: users cannot remember our original record data exactly, so they query the original record table through keywords…. Segmentation maximizes the chance of matching the relevant data.

4.1 Word segmentation process

  • Step 1: Split the text into words using the segmenter
  • Step 2: Remove stop words (common words such as "is" or "the" that carry no search value)
  • Step 3: For English, convert the letters to lowercase, i.e. the search is case-insensitive

For example, with the standard analyzer the sentence "My name IS Zhong" would typically become the terms [my] [name] [zhong]: the text is split on word boundaries, "IS" is dropped as a stop word, and the remaining letters are lowercased.

4.2 Word segmentation API

When choosing a segmentation algorithm, we will find that there are very, very many analyzer APIs. We can use the following code to see how a given analyzer splits the data:


	private static void testAnalyzer(Analyzer analyzer, String text) throws Exception {
		System.out.println("Analyzer in use: " + analyzer.getClass());
		TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
		tokenStream.addAttribute(TermAttribute.class);
		while (tokenStream.incrementToken()) {
			TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
			System.out.println(termAttribute.term());
		}
	}
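For example, a minimal usage sketch (assuming the Lucene 3.0 jars imported earlier; in this version SimpleAnalyzer ships with lucene-core and takes no constructor arguments) that compares how two analyzers split the same sentence:

		// Compare two built-in analyzers on the same input
		testAnalyzer(new StandardAnalyzer(Version.LUCENE_30), "My name IS Zhong");
		testAnalyzer(new SimpleAnalyzer(), "My name IS Zhong");

SimpleAnalyzer only splits on non-letters and lowercases, so it keeps "is", while the standard analyzer also removes it as a stop word.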

After the experiment, we can choose the appropriate segmentation algorithm….

4.3 The IKAnalyzer segmenter

This is a third-party segmenter. If we want to use it, we need to import the corresponding JAR package and configuration:

  • Step 1: Import IKAnalyzer3.2.0Stable.jar
  • Step 2: Copy IKAnalyzer.cfg.xml, stopword.dic, and the extension dictionary (xxx.dic) into the src directory of MyEclipse and configure them. Note that the first line of each dictionary file must be an empty line.

What is so good about this third-party segmenter? It is the segmenter of choice for Chinese… That is to say: it splits text according to actual Chinese words!
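A quick sketch of using it with the testAnalyzer helper above (an assumption on my part: in IKAnalyzer 3.2 the analyzer class is org.wltea.analyzer.lucene.IKAnalyzer with a no-argument constructor):

		// IKAnalyzer splits Chinese into dictionary words instead of single characters
		testAnalyzer(new IKAnalyzer(), "我是钟福成");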


5. Processing search results

5.1 Search Results highlighting

When we use SQL, the matched keywords in the results are not highlighted… With Lucene, we can configure the keywords to be highlighted… which makes the user experience much better!


		String keywords = "Zhong Fucheng";
		List<Article> articleList = new ArrayList<Article>();
		QueryParser queryParser = new QueryParser(LuceneUtil.getVersion(), "content", LuceneUtil.getAnalyzer());
		Query query = queryParser.parse(keywords);
		IndexSearcher indexSearcher = new IndexSearcher(LuceneUtil.getDirectory());
		TopDocs topDocs = indexSearcher.search(query, 1000000);

		// Set up keyword highlighting: wrap matches in a red <font> tag
		Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
		Scorer scorer = new QueryScorer(query);
		Highlighter highlighter = new Highlighter(formatter, scorer);

		for (int i = 0; i < topDocs.scoreDocs.length; i++) {
			ScoreDoc scoreDoc = topDocs.scoreDocs[i];
			int no = scoreDoc.doc;
			Document document = indexSearcher.doc(no);

			// Replace the content with its highlighted version
			String highlighterContent = highlighter.getBestFragment(LuceneUtil.getAnalyzer(), "content", document.get("content"));
			document.getField("content").setValue(highlighterContent);

			Article article = (Article) LuceneUtil.document2javabean(document, Article.class);
			articleList.add(article);
		}
		for (Article article : articleList) {
			System.out.println(article);
		}

5.2 Summary of Search Results

If the matched articles are very large and we only want to display part of them, we can generate a summary…

It is worth noting that summarizing search results must be used in conjunction with the highlighting setup above.


		String keywords = "Zhong Fucheng";
		List<Article> articleList = new ArrayList<Article>();
		QueryParser queryParser = new QueryParser(LuceneUtil.getVersion(), "content", LuceneUtil.getAnalyzer());
		Query query = queryParser.parse(keywords);
		IndexSearcher indexSearcher = new IndexSearcher(LuceneUtil.getDirectory());
		TopDocs topDocs = indexSearcher.search(query, 1000000);

		Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
		Scorer scorer = new QueryScorer(query);
		Highlighter highlighter = new Highlighter(formatter, scorer);

		// Set the summary: each returned fragment is at most 4 characters long
		Fragmenter fragmenter = new SimpleFragmenter(4);
		highlighter.setTextFragmenter(fragmenter);

		for (int i = 0; i < topDocs.scoreDocs.length; i++) {
			ScoreDoc scoreDoc = topDocs.scoreDocs[i];
			int no = scoreDoc.doc;
			Document document = indexSearcher.doc(no);

			String highlighterContent = highlighter.getBestFragment(LuceneUtil.getAnalyzer(), "content", document.get("content"));
			document.getField("content").setValue(highlighterContent);

			Article article = (Article) LuceneUtil.document2javabean(document, Article.class);
			articleList.add(article);
		}
		for (Article article : articleList) {
			System.out.println(article);
		}

5.3 Sorting Search Results

We have all used different search engines to search for the same content, and their first result pages come in different orders… That is because each engine sorts the search results internally in its own way….

There are many ways to affect the sorting of web pages:

  • The head/meta keywords of the page
  • How tidy the page markup is
  • Page execution speed
  • Whether div + CSS is used
  • Etc.

In Lucene, we can boost a document's relevance score so that different results rank differently:


		IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(), LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());
		// Boost this document's relevance score (must be set before adding the document)
		document.setBoost(20F);
		indexWriter.addDocument(document);
		indexWriter.close();

Of course, we can also sort by a single field:

	// true means descending order
	Sort sort = new Sort(new SortField("id", SortField.INT, true));
	TopDocs topDocs = indexSearcher.search(query, null, 1000000, sort);

It is also possible to sort by more than one field. In multi-field sorting, the second field only takes effect when records have the same value for the first field:


		Sort sort = new Sort(new SortField("count", SortField.INT, true), new SortField("id", SortField.INT, true));
		TopDocs topDocs = indexSearcher.search(query, null, 1000000, sort);

5.4 Conditional Search

In our examples so far we searched the content of a single field by keyword. The syntax looks like this:

	QueryParser queryParser = new QueryParser(LuceneUtil.getVersion(), "content", LuceneUtil.getAnalyzer());

In fact, we can also search multiple fields with one keyword, i.e. multi-condition search. In practice, multi-condition search is what we use most often, because it maximizes the matching of relevant data!


	QueryParser queryParser = new MultiFieldQueryParser(LuceneUtil.getVersion(), new String[]{"content", "title"}, LuceneUtil.getAnalyzer());

6. Summary

  • Lucene is the ancestor of full-text search engines; both Solr and Elasticsearch are based on it.
  • What Lucene stores is a set of binary compressed files and some control files, collectively called the index library. The index library is divided into two parts:
    • The original record table
    • The vocabulary
  • Know the index library optimizations: (1) merging files, (2) using an in-memory index library
  • Lucene offers a variety of segmenters; choose the one that suits your text
  • Query results can be highlighted, summarized, and sorted

This is only the tip of the Lucene iceberg; nowadays most people use Solr or Elasticsearch, but for more information on Lucene itself, check out other sources.

If there are mistakes in this article, corrections are welcome, and let's learn from each other. Readers who like reading technical articles on WeChat and want more Java resources can follow the WeChat public account: Java3y.