In daily life and work, search sites such as Baidu and Google have become the teachers we turn to with our questions; as the saying goes, when in doubt, ask Baidu. So how does Baidu find the data users need in a sea of data, and why is its search so fast? We all know the answer is Baidu's search engine. Some programmers may immediately think of ES (Elasticsearch), but ES does not by itself constitute a search engine; it is just one tool for building one, albeit a genuinely useful and efficient one.

This article covers the basics of search engines and some methods of Chinese word segmentation, and then builds a small demo to try out data retrieval, giving us a preliminary understanding of how search engines are implemented.

Search engine introduction

1. What is a search engine

A search engine is a system that collects information from the Internet according to certain strategies using specific computer programs, organizes and processes that information, provides search services for users, and displays the relevant results to them.

2. Search engine classification

Search engines can be classified into full-text indexes, directory indexes, metasearch engines, vertical search engines, integrated search engines, portal search engines, free link lists, and so on.

Here we mainly introduce the full-text index, which is the category Baidu's search engine falls into.

Full-text index:

The first step is collecting data into the database. Search engines gather information automatically in two ways. One is periodic crawling: at a fixed interval (for Google, typically every 28 days), the search engine dispatches its "spider" program to crawl websites within a certain range of IP addresses; whenever it discovers a new site, it automatically extracts the site's information and URL and adds them to its database. The other is site submission: the site owner actively submits the site's URL to the search engine, which within a certain period (from two days to several months) sends its "spider" to the site, scans it, and stores the relevant information in the database for users to query.

When a user searches with keywords, the engine looks them up in its database. If it finds sites whose content matches the user's request, it applies a special algorithm — usually based on how well each page matches the keywords, plus their location, frequency, and link quality — to compute a relevance score and rank for each page, and then returns the page links to the user ordered by relevance from high to low. The characteristic of this kind of engine is its comparatively high retrieval speed.

3. What problems can search engines solve

1> Efficient data query: a variety of algorithms keep query latency at the millisecond level, whether over tens of thousands of records or hundreds of millions.

2> Relatively easy migration: switching from an ordinary database to a search engine is fairly straightforward.

3> Handles large data volumes, tight timeliness requirements, high concurrency, and so on.

4. Application scenarios of search engines:

1> When the database reaches millions of records
2> When retrieval timeliness and performance requirements are high, demanding millisecond-level response

Let’s take a look at the application of search engines in our ordinary Internet:


Today we are going to talk about the search engine Solr, so what is Solr? What are its advantages and disadvantages compared with ES? Let’s start with a brief introduction to Solr:

Solr is a Lucene-based full-text search server. It extends Lucene with a richer, more usage-oriented query language, adds configurability, scalability, and query-performance optimizations, and provides a complete administration interface. It supports the XML/HTTP protocol and a JSON API.

It has the following characteristics:

  • Scalability: Solr can distribute indexing and query processing across multiple servers in a cluster.
  • Quick deployment: Solr is open-source software that is easy to install and configure. You can use it directly with the sample configuration shipped in the installation package, and it can run in single-node or cluster mode.
  • Massive data: Solr is designed for massive data processing of more than 100 million levels, and can handle massive data retrieval well.
  • Optimized search: Solr searches fast enough that even complex queries are handled at the millisecond level, typically within a few dozen milliseconds.

2. Word segmentation introduction

Next, let's look at how word segmentation is achieved. Why do we need segmentation at all, and what does it have to do with search engines? How do the few words or paragraphs we type into a search box get broken down into multiple keywords?

Have you ever heard of a Chinese word segmenter? Examples include smartCN, the Chinese segmenter that ships with Lucene, and the most commonly used one, the IK segmenter. Today we mainly talk about IK.

The IK segmenter

The IK segmenter ships with three dictionaries: the main dictionary (main2012.dic), the quantifier dictionary (quantifier.dic), and the stopword dictionary (stopword.dic).

Dictionary is the dictionary-management class that loads these files into an in-memory structure. The concrete implementation, org.wltea.analyzer.dic.DictSegment, realizes one of the segmenter's core data structures: the Trie tree.

The Trie tree (dictionary tree) is a structurally simple tree used to build dictionaries; it finds words quickly by comparing prefix characters one by one, which is why it is also called a prefix tree. A concrete example follows.

Take, for example, the sentence: I am from Zhongguancun, Haidian District, Beijing.

Suppose our dictionary contains: Beijing, Haidian, Zhongguancun, China, Chinese people. The dictionary tree built from it looks like this:
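To make the structure concrete, here is a minimal dictionary-tree sketch in Java. This is a simplified illustration only, loosely analogous to IK's DictSegment but not IK's actual code; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal dictionary tree (trie): each node maps one character to a child
// node, and a flag marks the end of a complete dictionary word.
public class DictTrie {
    private final Map<Character, DictTrie> children = new HashMap<>();
    private boolean isWordEnd;

    // Insert a word character by character, creating nodes as needed.
    public void insert(String word) {
        DictTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new DictTrie());
        }
        node.isWordEnd = true;
    }

    // Returns true only if the exact word is in the dictionary
    // (a mere prefix of a longer word is not a hit).
    public boolean contains(String word) {
        DictTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWordEnd;
    }
}
```

With the dictionary above loaded, "China" and "Chinese people" share one prefix branch, which is what makes matching a sentence against the dictionary character by character cheap.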

Then we use the dictionary tree to segment the sentence. IK segmentation has two basic modes, smart and non-smart, which can be selected at initialization in code.

Rather than explain the literal meaning of the two modes, let's just print the results of both:

Original sentence:

I am a Chinese from Zhongguancun, Haidian District, Beijing

Smart mode:

Beijing, Haidian District, Zhongguancun, the Chinese people

Non-smart mode:

Beijing, Haidian District, Zhongguancun, China, the Chinese people

Obviously, non-smart mode outputs every word it can distinguish, while smart mode applies built-in rules to produce one reasonable segmentation, which requires resolving ambiguity.
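The difference between the two modes can be sketched in a few lines of self-contained Java. This is a simplified illustration, not IK's actual implementation: "non-smart" here emits every dictionary word found anywhere in the text, while the "smart-like" pass uses greedy forward maximum matching to pick one non-overlapping word per position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SegmentModes {
    // Non-smart style: collect every dictionary hit, overlaps included.
    public static List<String> allMatches(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++)
            for (int j = i + 1; j <= text.length(); j++)
                if (dict.contains(text.substring(i, j))) out.add(text.substring(i, j));
        return out;
    }

    // Smart-like style: greedy forward maximum matching, no overlaps.
    public static List<String> longestMatches(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int best = i + 1; // fall back to skipping a single character
            for (int j = i + 1; j <= text.length(); j++)
                if (dict.contains(text.substring(i, j))) best = j;
            if (dict.contains(text.substring(i, best))) out.add(text.substring(i, best));
            i = best;
        }
        return out;
    }
}
```

Real smart mode is more sophisticated than greedy matching — it weighs competing lexeme chains, as described below — but the contrast between "all candidates" and "one chosen path" is the same.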

Take a more typical example, the classic sentence "What Zhang San said is indeed reasonable" (张三说的确实在理).

Forward matching against the dictionary produces overlapping candidates, which IK groups into lexeme chains. (The machine translation of the original garbled the chain contents; the candidates include 张三 "Zhang San", 说的 "said", 的确 "indeed", 确实 "really", 实在 "truly", and 在理 "reasonable", and the last four all overlap with one another, which is exactly where the ambiguity lies.)

Let's start with one of the most basic element classes, Lexeme:

public class Lexeme implements Comparable<Lexeme> {
    ......
    // position offset of the segmentation buffer
    private int offset;
    // start position of the lexeme, relative to offset
    private int begin;
    // length of the lexeme
    private int length;
    // text of the lexeme
    private String lexemeText;
    // type of the lexeme
    private int lexemeType;
    ......
}

/**
 * Comparison algorithm for lexemes in a sorted set
 * @see java.lang.Comparable#compareTo(java.lang.Object)
 */
public int compareTo(Lexeme other) {
    if (this.begin < other.getBegin()) {
        return -1;
    } else if (this.begin == other.getBegin()) {
        // for the same start position, the longer lexeme sorts first
        if (this.length > other.getLength()) {
            return -1;
        } else if (this.length == other.getLength()) {
            return 0;
        } else { // this.length < other.getLength()
            return 1;
        }
    } else { // this.begin > other.getBegin()
        return 1;
    }
}
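To see the ordering rule in action, here is a self-contained toy version of the comparator. The field names follow the snippet above, but the wrapper class and helper method are invented for illustration: lexemes that start earlier sort first, and for the same start position the longer lexeme wins.

```java
import java.util.Arrays;

public class LexemeDemo {
    static final class Lexeme implements Comparable<Lexeme> {
        final int begin, length;
        final String text;
        Lexeme(int begin, int length, String text) {
            this.begin = begin;
            this.length = length;
            this.text = text;
        }
        // Same rule as IK's Lexeme.compareTo: earlier begin first,
        // then longer length first for equal begins.
        public int compareTo(Lexeme other) {
            if (this.begin < other.begin) return -1;
            if (this.begin == other.begin) {
                if (this.length > other.length) return -1;
                return this.length == other.length ? 0 : 1;
            }
            return 1;
        }
    }

    // Sort lexemes and return just their texts, for easy inspection.
    public static String[] sortedTexts(Lexeme... lexemes) {
        Arrays.sort(lexemes);
        return Arrays.stream(lexemes).map(l -> l.text).toArray(String[]::new);
    }
}
```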

Smart mode's ambiguity elimination uses, among other rules, positional weight comparison (the sum of each lexeme's position multiplied by its length), which amounts to preferring the candidate set whose longer lexemes sit later in the sentence.

For our example sentence, smart mode therefore settles on: 张三 / 说的 / 确实 / 在理 (Zhang San / said / indeed / reasonable).

That's as much as we'll say about word segmentation here; to go deeper, you can read the IK segmenter source code at: …

Inverted index algorithm

1. Introduction

We can think of the inverted index as the index of a dictionary: instead of scanning page by page for a word, we look it up in the index and jump straight to where it is. In professional terms:

Inverted indexing stems from the practical need to find records by the value of one of their attributes. Each entry in such an index table contains an attribute value and the addresses of the records that have that attribute value. It is called an inverted index because it is not the records that determine the attribute values, but the attribute values that determine the positions of the records; the resulting structure is called an inverted file for short.

An inverted file (inverted index) takes as its index objects the words in a document or document collection, and stores the positions at which those words occur in the document(s). It is the most common indexing mechanism for documents and document collections.

The key step in a search engine is building the inverted index, usually expressed as: keyword → frequency (number of occurrences) and positions (which article or page it appears in, plus related data such as date and author). It effectively builds an index over the hundreds of billions of pages on the Internet, like a book's table of contents or its labels: a reader who wants the chapter on some topic can jump straight there via the contents, instead of searching page by page from the beginning.

2. Lucene inverted index principle

Lucene is an open-source, high-performance, Java-based full-text retrieval toolkit. It is not a complete full-text search engine but a full-text search engine architecture: it provides a complete query engine and indexing engine, and part of a text-analysis engine. Its purpose is to offer software developers a simple toolkit for easily adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

There are two articles 1 and 2:

The content of article 1 is: "Tom lives in BeiJing, I live in BeiJing." The content of article 2 is: "He lived in Taiyuan."

1> First we segment the text just as before. Because this is English, we also filter out words with no real search value such as "in", "on", and "of"; reduce inflected forms to their base form (third-person "lives" becomes "live", past-tense "lived" becomes "live"); and strip unnecessary punctuation. After this processing, the remaining keywords are:

[Tom] [live] [BeiJing] [I] [live] [BeiJing]

[he] [live] [Taiyuan]

2> Create the inverted index

keyword   article[frequency]   position
BeiJing   1[2]                 3, 6
he        2[1]                 1
I         1[1]                 4
live      1[2], 2[1]           2, 5; 2
Taiyuan   2[1]                 3
Tom       1[1]                 1

This is the core of Lucene's index structure. Notice that the keywords are kept in sorted character order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search.
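The example above can be sketched as a toy inverted index in Java. This is an illustration of the structure only, not Lucene's actual implementation, and the class and method names are invented: a sorted map from keyword to per-article position lists mirrors the sorted keyword column above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndex {
    // keyword -> (articleId -> 1-based positions where it occurs).
    // TreeMap keeps keywords sorted, standing in for Lucene's sorted
    // term dictionary.
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    // Index one article given its already-segmented, filtered tokens.
    public void addArticle(int articleId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], k -> new TreeMap<>())
                    .computeIfAbsent(articleId, k -> new ArrayList<>())
                    .add(pos + 1);
        }
    }

    // Which articles contain this word?
    public Set<Integer> search(String word) {
        TreeMap<Integer, List<Integer>> entry = postings.get(word);
        return entry == null ? Set.of() : entry.keySet();
    }
}
```

Indexing the two example articles and searching "live" returns both article numbers, exactly as reading the `live` row of the table above would.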

3> Implementation

In the implementation, Lucene stores the three columns above as a term dictionary file, a frequencies file, and a positions file, respectively. The dictionary file stores not only each keyword but also pointers into the frequency and position files, through which the keyword's frequency and position information can be found.

4> Compression algorithm

To reduce the size of index files, Lucene also uses compression techniques for indexes.

First, the keywords in the dictionary file are compressed: since the keywords are sorted, each one can be stored as the prefix it shares with the previous keyword plus its own suffix.

The next big win is compressing the numbers: only the difference from the previous value is stored, which shrinks the numbers and therefore the bytes needed to store them. For example, if the current article number is 16389 (3 bytes uncompressed) and the previous article number is 16382, only the difference, 7, is stored (1 byte).
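The arithmetic behind that example can be sketched directly. Here "VInt" means a variable-length integer encoding with 7 payload bits per byte, as Lucene's index format uses; the helper class and method names are invented for illustration.

```java
public class DeltaEncoding {
    // Turn a sorted list of article numbers into deltas from the previous
    // value (the first delta is taken from 0).
    public static int[] deltas(int[] sorted) {
        int[] out = new int[sorted.length];
        int prev = 0;
        for (int i = 0; i < sorted.length; i++) {
            out[i] = sorted[i] - prev;
            prev = sorted[i];
        }
        return out;
    }

    // Bytes needed for a VInt-style encoding: 7 bits of payload per byte,
    // so small deltas need fewer bytes than raw article numbers.
    public static int vIntBytes(int value) {
        int bytes = 1;
        while ((value >>>= 7) != 0) bytes++;
        return bytes;
    }
}
```

For the numbers in the text: 16389 needs 3 VInt bytes, but its delta from 16382 is just 7, which fits in a single byte.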

5> Reasons for use

Suppose you query the word "live". Lucene binary-searches the dictionary, finds the word, follows its pointer into the frequency file, reads out all the article numbers, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds.

A naive sequential-matching algorithm, by contrast, builds no index and instead string-matches the content of every article. That process is quite slow, and when the number of articles is large, the time is often intolerable.

Basic configuration and use of Solr

Here we install Solr on a Windows system

Download address…

After the decompression:

In CMD, go to Solr's bin directory and run the solr start command. (For convenience, you can add Solr to the environment variables so the solr command works anywhere in CMD.)

When you see this interface, the Solr service has started successfully on port 8983. Visiting http://localhost:8983 automatically redirects to http://localhost:8983/solr/#/

This page displays basic Solr information, Lucene information, Java information, and so on.

Now a Solr command: solr -h shows basic usage information for Solr.

Configuring Solr

Configure a core

solr create -c test -d basic_configs: the -c parameter specifies the name of the core to create, and the -d parameter specifies the configuration directory.

After this command executes, a new test directory appears.


Visit the Solr Admin page at http://localhost:8983/ and you can see the core named test.

In the \solr-6.3.0\server\solr\test directory there are conf and data directories, holding the configuration and the data respectively.

Add data to core

Open the directory \solr-6.3.0\server\solr\test\conf and add a field to the schema file:

<field name="name" type="string" indexed="true" stored="true" required="true" multiValued="false" />

Then restart solr: solr restart -p 8983

Go to Solr Admin, select the test core, open Documents, and fill in Document(s):

{"id":"1","name":"BMW"}

Click Submit; when it returns Status: success, the data has been added successfully.

After adding a few more rows, click Query to query the data:

On the query interface, q is the query condition. For example, enter name:"BMW" and run the query again.

You can also query directly via URL: http://localhost:8983/solr/test/select?q=name:BMW

Source: Creditease Technology Institute Yang Heng