Takeaway

A search engine combines natural language processing, information retrieval, web architecture, and distributed data processing into one technology that helps users accurately interpret and obtain the information they need.

At present, the market share of mainstream portal search on the web and on mobile is essentially divided among the various giants (Figure 1.1) (SEO, 2020). Of course, as times change, search is increasingly moving toward refined, vertical portals built around specific businesses. For example, you might search Zhihu columns for knowledge, Dewu for trending products, Meituan-Dianping for food and drink, and so on.

THE NO.1

Information retrieval

Search engine

When we talk about search engines, we cannot avoid talking about information retrieval.

First, let's give a clear definition of information retrieval: finding material in unstructured natural form (usually a text corpus) that satisfies an information need, from within a large collection (generally documents) (Manning, 2008).

When discussing retrieval performance, two metrics cannot be avoided: recall and precision. Interestingly, these two are like a pair of twins: when one rises, the other tends to fall. How to strike a balance between them is therefore a problem every search algorithm has to face.
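As a quick illustration (a minimal sketch with made-up document IDs, not data from any real system), the two metrics can be computed directly from the set of retrieved documents and the set of relevant documents:

```python
# Minimal sketch: precision and recall over made-up document IDs.
retrieved = {"d1", "d2", "d3", "d5"}   # what the engine returned
relevant = {"d2", "d3", "d4"}          # what the user actually needed

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # how much of what we returned is useful
recall = len(hits) / len(relevant)      # how much of what is useful we returned

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.50 and 0.67
```

Pushing precision up usually means returning less, which drags recall down, and vice versa.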

As an introductory article on search engine technology, this post explains the basic workflow of a search engine along four dimensions: document structure, the inverted index, query operators, and ranking algorithms.

THE NO.2

Document structure

The document

First, let's talk about how search engines interpret their documents. Before we get into this topic, we need a clear distinction: there are two kinds of search engines, web search engines and enterprise search engines.

Whatever the kind, the first problem a search engine has to solve is understanding the corpus; the next is storing it.

So how is a document understood? Let's look at the basic structure of a document.

People often think of a document as a single bag of words, but that is not accurate. Each document is actually made up of different components.

For example, general web pages are essentially constructed in XML/HTML form. Broadly, we divide them into three layers. The first layer is metadata, which typically contains the URL, keywords, author, date, and so on. The second layer is the body, which contains information such as the title and main content. The third layer is external information, such as external links and internal links.
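As a rough sketch of that layering (using only Python's standard library and a made-up page; real pipelines use far more robust parsers), a page can be split into metadata, body, and link information like this:

```python
from html.parser import HTMLParser

class LayeredPage(HTMLParser):
    """Split a page into the three layers described above:
    metadata, body (title + main content), and link information."""
    def __init__(self):
        super().__init__()
        self.metadata = {}                         # layer 1: keywords, author, date ...
        self.body = {"title": "", "content": []}   # layer 2: title + main content
        self.links = []                            # layer 3: internal / external links
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if text and self._in_title:
            self.body["title"] = text
        elif text:
            self.body["content"].append(text)

page = LayeredPage()
page.feed('<html><head><title>AJ1 laser pink</title>'
          '<meta name="keywords" content="sneakers,AJ1"></head>'
          '<body><p>New summer colorway.</p><a href="/aj1">details</a></body></html>')
print(page.metadata, page.body, page.links)
```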

Partitioning helps us distinguish how much information each block carries, because the information entropy of each block is different.

In short, different blocks contain different amounts of information. For example, a keyword in the title of an article generally carries more information than the same keyword in a body paragraph.

Content to be searched

On our platform, there are two main search scenarios: product search and community search. These can also be layered; although the specific field layering differs, the underlying design logic stays the same.

According to Shannon's information theory, these fields provide different information value and have different information entropy. If we mix them together, the precision of our search is bound to drop to some extent.

In general, the title of a document carries slightly more information per term than the body. By layering the internal structure of the document, we can later adjust the weight of each field in the ranking algorithm, improving the fit between the content we recall and the user's query and better satisfying the user's search intent.
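As a minimal sketch of that idea (the field weights below are illustrative numbers I made up, not tuned values), a document score can simply weight term matches by the field they occur in:

```python
# Rough sketch: weight a query term differently depending on which field it hits.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def field_weighted_score(query_terms, doc_fields):
    """doc_fields maps field name -> list of tokens in that field."""
    score = 0.0
    for field, tokens in doc_fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in query_terms:
            score += weight * tokens.count(term)
    return score

doc = {"title": ["aj1", "laser", "pink"],
       "keywords": ["sneakers", "aj1"],
       "body": ["new", "summer", "colorway", "of", "the", "aj1"]}
print(field_weighted_score(["aj1", "summer"], doc))  # a title hit counts 3x a body hit
```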

THE NO.3

Inverted index

Storage

Having covered the structure of documents, let's discuss how to store that structure as a data structure. As you probably know, search engines generally store information in an inverted index, and the inverted index comes in two main structural variants.
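Before looking at the two variants, here is a minimal sketch of the inverted index itself: for every term we keep a postings list of (document id, positions), which both structures below build on.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> list of tokens.
    Returns term -> list of (doc_id, [positions of the term in that doc])."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, locs in positions.items():
            index[term].append((doc_id, locs))
    return index

docs = {
    "d1": ["bush", "signs", "the", "bill"],
    "d2": ["the", "bush", "grows", "in", "the", "garden"],
}
index = build_inverted_index(docs)
print(index["bush"])   # [('d1', [0]), ('d2', [1])]
```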

The first approach is for a tiled page layout, which, as mentioned earlier, treats every section of the page independently. To put it bluntly, every radish has its own pit: each piece of information is independent and does not interact with the others.

In the second approach, we use a vertical structure, meaning the layouts on the page have a hierarchical relationship.

Take the body as an example: body -> block -> paragraph -> sentence.

A tiled page layout

First, let's look at the merged inverted index and how it is constructed.

We have a dictionary for both the recommendation field and the blog field, so what we need to do here is combine each dictionary with its corresponding field.

One thing to note is that a term can have multiple inverted lists. For the word Bush, for example, we build one inverted list in the recommendation field and another in the blog field.

In this way we avoid mixing information from different fields. Many terms are as ambiguous as Bush; by building independent inverted lists per field, the independence of the information is effectively preserved.
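A minimal sketch of that separation (field names and documents are made up): key the index by field first, so the Bush postings in the recommendation field never mix with those in the blog field.

```python
from collections import defaultdict

# Sketch: one inverted list per (field, term), so "bush" in the
# recommendation field stays separate from "bush" in the blog field.
field_index = defaultdict(lambda: defaultdict(set))

def add_doc(doc_id, fields):
    for field, text in fields.items():
        for term in text.lower().split():
            field_index[field][term].add(doc_id)

add_doc("d1", {"recommendation": "Bush the shrub", "blog": "President Bush speech"})
add_doc("d2", {"blog": "gardening tips for a bush hedge"})

print(sorted(field_index["blog"]["bush"]))            # ['d1', 'd2']
print(sorted(field_index["recommendation"]["bush"]))  # ['d1']
```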

Vertical structure

Sometimes when we search, we don't need to search all of the information; we may only want to search for specific content in a specific location.

If we want to search for "cancer" at the sentence level, under the merged structure the search process becomes very complicated: we have to drill down from the article to the chapter to the sentence. Clearly, with this structure our search speed suffers to some extent.

Therefore, we need a more sophisticated independent inverted index structure, which is generally intended for relatively complex document structures.

Let's look at the tree structure, in which we record the locations of words in the document and their parent-child relationships.

If we search for information at locations 6 and 27, we first trace each location back to its enclosing context in the tree, and can then retrieve the information relatively quickly (in logarithmic time).

If you search within the section location, the occurrence with loc 6 is discarded, because in the tree structure the information at loc 6 belongs to the author block.

This approach greatly improves search efficiency at the cost of only a small amount of extra memory; in the era of cloud storage, that trade-off is very cost-effective.

Completely inverted

In addition, we can construct the inverted index in a fully inverted way, that is, in an independent inverted form.

We store the boundary values of each field in one inverted index and build another inverted index for terms. By merging the two, the independent inversion can be constructed.

For example, in this case we assume this is the inverted index of a sentence field: the begin and end positions are marked, and extentFreq is the number of field extents.

Once we know the specific loc of a term occurrence, we only need to check whether it falls inside the corresponding [begin, end] extent to determine whether that loc contains the information we need.
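A small sketch of that check (the extents and locations are made-up numbers): given the sorted [begin, end) extents of a field and the loc of a term occurrence, we keep only the occurrences that fall inside an extent.

```python
import bisect

# Sketch: field extents are sorted, non-overlapping [begin, end) ranges.
# A term occurrence at position `loc` belongs to the field only if it
# falls inside one of those ranges; bisect gives the log-time lookup
# mentioned above.
def in_field(loc, extents):
    i = bisect.bisect_right([b for b, _ in extents], loc) - 1
    return i >= 0 and extents[i][0] <= loc < extents[i][1]

section_extents = [(10, 40), (50, 90)]   # made-up section boundaries
print(in_field(27, section_extents))     # True  -> keep this occurrence
print(in_field(6, section_extents))      # False -> loc 6 sits outside (author block)
```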

In addition, the part on the right explains the meaning of TF-IDF:

TF-IDF is mainly used to describe how informative a word is within a document. TF (term frequency) is the number of times the word appears in the document; DF (document frequency) is the number of documents containing the word; IDF (inverse document frequency) describes the scarcity of the word across the collection; and N is the total number of documents in the collection.

Through TF-IDF we can make a relatively rough judgment of how informative a document is. It is fair to call TF-IDF the ancestor of information retrieval ranking, with a long series of later algorithms serving as supplements to and optimizations of it.
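As a quick sketch of the classic weighting (using the common textbook form idf = log(N / df) over a toy three-document collection):

```python
import math
from collections import Counter

# Sketch of the classic TF-IDF weighting: tf counts occurrences of the
# term in one document, df counts documents containing the term, and
# idf = log(N / df) rewards terms that are rare across the collection.
docs = {
    "d1": "aj1 laser pink summer colorway".split(),
    "d2": "summer food guide".split(),
    "d3": "aj1 resale price report".split(),
}

def tf_idf(term, doc_id):
    tf = Counter(docs[doc_id])[term]
    df = sum(1 for tokens in docs.values() if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

print(tf_idf("aj1", "d1"), tf_idf("summer", "d1"))
```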

THE NO.4

Query operators

Operator classification

Operators generally fall into three categories. One category builds new inverted lists, including #SYN, #NEAR, #WINDOW; one category generates score lists, with the operator #SCORE; and one category combines score lists that have already been produced, with #AND, #OR, #WSUM.

(Callan, 2020)

Application scenarios

These operators help us build complex queries in different contexts.

The NEAR or WINDOW operator

We covered inverted indexes earlier, and we can certainly build lists for Nike, AJ, Adidas, and so on, but that infrastructure alone will not meet our needs. We also need other, structured inverted lists, such as "Barack Obama" and "AJ1".

Such a query is a phrase formed by combining two or more words. To preserve word order and proximity, we use the NEAR or WINDOW operators to link the words.
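Roughly how a #NEAR-style match can work over positional postings (a simplified sketch, not the exact syntax or semantics of any particular engine): the second term must appear within n positions after the first, in order.

```python
# Simplified sketch of a #NEAR/n match over positional postings:
# term_b must occur at most n positions after term_a, in order.
def near_match(positions_a, positions_b, n=1):
    return any(0 < pb - pa <= n for pa in positions_a for pb in positions_b)

doc_positions = {"barack": [4], "obama": [5], "aj1": [12]}
print(near_match(doc_positions["barack"], doc_positions["obama"], n=1))  # True
print(near_match(doc_positions["barack"], doc_positions["aj1"], n=1))    # False
```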

The SYN operator

The SYN operator can be used to build inverted lists for conceptual groups, such as a collection of all the colour terms.

Score operator

The SCORE operator is easy to understand: it applies one of the scoring algorithms we have built to turn an inverted list into a score list.

#AND and #OR

The #AND and #OR operators combine score lists that have already been built.

THE NO.5

Search algorithm

With document structure and operators covered, we have dissected the tip of the search engine iceberg.

If building a search engine is compared to martial-arts training, we have now laid the foundation. Next comes the core part: the ranking algorithms. Once you understand them, you can basically start trying to build a small search engine of your own.

First, let's discuss some of the coarse ranking algorithms that can be used to go from an inverted index to a score list.

The search process is really a matter of interpreting the information in various ways: algorithms compress the recall pool step by step, and ranking finally produces the information we want (that is, the sorted documents).

Coarse ranking

The simplest coarse ranking algorithms are UnrankedBoolean and RankedBoolean, both of which are exact-match retrieval procedures. In the past, a document was thought of as just a bag of words, so all you had to do was look terms up exactly.

UnrankedBoolean

UnrankedBoolean, as the name implies, returns results without ordering. The method is simple and direct: documents that match score 1 and documents that do not match score 0.

It was the dominant approach until the 1990s, but it has largely fallen out of favor because it is hard to guarantee that the query itself is precise.

RankedBoolean

RankedBoolean assigns each document a score, and the method is simple: the score is based purely on term frequency (TF).
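A minimal sketch of the two (the postings are made up; taking the minimum term frequency is one common way to score an AND query under RankedBoolean):

```python
# Sketch: UnrankedBoolean gives 1/0 for match/no-match; RankedBoolean
# scores a matching document from its term frequencies (min over the
# query terms is one common choice for an AND query).
doc_tf = {"d1": {"aj1": 3, "pink": 1}, "d2": {"summer": 2}}

def unranked_boolean(query, doc):
    return 1 if all(t in doc_tf[doc] for t in query) else 0

def ranked_boolean_and(query, doc):
    return min(doc_tf[doc].get(t, 0) for t in query)

query = ["aj1", "pink"]
print(unranked_boolean(query, "d1"), ranked_boolean_and(query, "d1"))  # 1 1
print(unranked_boolean(query, "d2"), ranked_boolean_and(query, "d2"))  # 0 0
```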

That covers exact match. Next, let's look at the best-match retrieval models:

  1. Vector space retrieval model (VSM)
  2. Probabilistic retrieval model (BM25)
  3. Statistical language model (query likelihood)
  4. Inference networks (Indri)

Since the vector space model has largely been abandoned by industry, we won't expand on it here.

BM25

Let’s talk mainly about BM25 first.

The expanded BM25 formula is shown below.
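The original figure is not reproduced here, so below is a commonly used written form of the formula. It has the three parts discussed next (RSJ weight, TF weight, and user weight), with notation: N is the collection size, df the document frequency, tf the term frequency in the document, qtf the term frequency in the query, and |d| the document length.

```latex
\mathrm{score}(q,d) = \sum_{t \in q}
\underbrace{\log\frac{N - df_t + 0.5}{df_t + 0.5}}_{\text{RSJ weight}}
\cdot
\underbrace{\frac{tf_{t,d}}{tf_{t,d} + k_1\bigl((1-b) + b\,\frac{|d|}{\mathrm{avg\_doclen}}\bigr)}}_{\text{TF weight}}
\cdot
\underbrace{\frac{(k_3+1)\,qtf_t}{k_3 + qtf_t}}_{\text{user weight}}
```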

The BM25 algorithm was created by Stephen Robertson. Earlier versions were called BM10 and then BM15, and through a series of experiments and parameter adjustments it evolved into its current form.

The first part is called the RSJ weight: Robertson developed the algorithm together with Karen Spärck Jones, so the weight is jointly named after the two of them.

It’s actually a lot like IDF, with further optimization on top of it.

The TF weight here mainly normalizes for document length. A word that appears once in a 20-word text and once in a 2,000-word article should carry different confidence, so by normalizing for length we can, to some extent, eliminate the unfair competitive advantage of long documents.

The remaining part is the user weight, which mainly adjusts the weight of the query terms supplied by the user. We can largely ignore it; in practical business tuning, the user weight is generally set to 1.

Of the three tunable parameters, k1 controls how strongly term frequency saturates, b controls the degree of document-length normalization, and k3 controls the influence of the user weight.
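Putting the three parts together, here is a minimal per-term scoring sketch (the default parameter values are typical illustrative choices, not values tuned for any particular collection):

```python
import math

def bm25_term_score(tf, df, qtf, N, doc_len, avg_doc_len,
                    k1=1.2, b=0.75, k3=0.0):
    """Score of one query term for one document, mirroring the three
    parts above: RSJ weight * TF weight * user weight."""
    rsj = math.log((N - df + 0.5) / (df + 0.5))
    tf_weight = tf / (tf + k1 * ((1 - b) + b * doc_len / avg_doc_len))
    user_weight = ((k3 + 1) * qtf) / (k3 + qtf) if k3 > 0 else 1.0
    return rsj * tf_weight * user_weight

# e.g. a term occurring 3 times in a 120-word doc, found in 50 of 10,000 docs
print(bm25_term_score(tf=3, df=50, qtf=1, N=10_000, doc_len=120, avg_doc_len=200))
```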

That's it for BM25; now let's talk about the two-stage smoothing algorithm.

Two-stage smoothing algorithm

To estimate the likelihood of a term appearing in a document, many people may first think of the maximum likelihood estimate (MLE).

However, it is flawed for search. For example, if we search for "AJ laser pink summer style", the collection may contain "AJ laser pink" but not "summer style".

The MLE score for "summer style" will then be 0, and since we multiply the score of "AJ laser pink" by the score of "summer style", the final result is 0: nothing is retrieved at all.

This is where smoothing functions come in handy. First, we need to handle very rare terms, and second, we need to smooth out very short documents.

Let's start with the Jelinek-Mercer smoothing function, also known as the mixture model. In this model we combine two maximum likelihood estimates: the probability of the query term given the document, and the probability of the query term given the whole collection.

Lambda controls the degree of smoothing: as lambda approaches 0, the amount of smoothing approaches 0. So how do we choose lambda? The usual suggestion is that a small lambda works well for short queries and a large lambda works well for long queries.

Another smoothing method is Dirichlet prior smoothing, which adjusts for the occurrence frequency of rare words and high-frequency words in the collection. The value of mu generally falls in the range of 1,000 to 10,000.

The formulas for the smoothing algorithms above and for MLE are as follows:
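The original figure is not reproduced here, but the standard written forms are roughly as follows, where tf is the term count in document d, |d| the document length, ctf the term count in the whole collection C, and |C| the collection length:

```latex
\text{MLE:}\qquad p(t \mid d) = \frac{tf_{t,d}}{|d|}

\text{Jelinek-Mercer:}\qquad p(t \mid d) = (1-\lambda)\,\frac{tf_{t,d}}{|d|} + \lambda\,\frac{ctf_t}{|C|}

\text{Dirichlet prior:}\qquad p(t \mid d) = \frac{tf_{t,d} + \mu\,\frac{ctf_t}{|C|}}{|d| + \mu}
```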

Indri

Having covered BM25 and the smoothing algorithms, let's now talk about INDRI.

INDRI is built on statistical language models and Bayesian inference networks.

It may sound intimidating, but if you take a closer look, the key points are the following:

  1. It is a probabilistic retrieval model
  2. Query is structured
  3. Documents are represented in a variety of forms

This looks complicated, but it’s not. Here’s how it works:

(Callan, 2020)

First we have a web page; each page has its own structure, such as title, body, URL, and so on. So naturally, we can break it up into different blocks.

At the same time, recall the smoothing models above. By combining the parameters of the smoothing model with the document structure, we obtain a language model for each block, that is, the Bayesian probability value of that block.

In addition, each document block can be divided into child nodes of different representations, such as words, phrases, and so on, each with its own corresponding Bayesian probability. In this way, on the document side we have completed the transition from language model to Bayesian probability.

Now we can bring back the #AND, #OR, and #NEAR operators discussed earlier.

For each query, we first break it into terms with NLP tokenization, and then combine and structure them with operators, so that each query concept also carries its own Bayesian probability. This builds the query network.

After that, we connect the query network to the document network. By matching the likelihood of the query network against that of the document network, we can find highly relevant documents as accurately as possible. That is the overall workflow of the INDRI network.
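As a very rough sketch of how the two sides meet (the numbers are made up, and taking the geometric mean of smoothed term probabilities is a common way for an Indri-style #AND node to combine beliefs):

```python
import math

# Rough sketch: smoothed per-term probabilities from the document network
# are combined by the query network's #AND node as a geometric mean.
def dirichlet_p(tf, doc_len, ctf, coll_len, mu=2500):
    return (tf + mu * ctf / coll_len) / (doc_len + mu)

def and_belief(probs):
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

p_aj1    = dirichlet_p(tf=3, doc_len=150, ctf=900,  coll_len=1_000_000)
p_summer = dirichlet_p(tf=0, doc_len=150, ctf=4000, coll_len=1_000_000)  # smoothing keeps this > 0
print(and_belief([p_aj1, p_summer]))
```

Because each term probability is smoothed against the whole collection, a query term missing from a document no longer zeroes out the entire score.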

References

SEO, 2020. SEO Search Engine – SEO Rank Services. [online] SEO Rank Services. Available at: https://seorankservices.com/search-engines-optimize-site/seo-search-engine/ [Accessed 3 February 2020].

Manning, C.D., Raghavan, P. and Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.

Callan, J., 2020. Search Engines (11-642). Carnegie Mellon University.

Article | Ray
