This time brings the introduction of e-commerce search business, e-commerce search series is divided into two articles:

  • E-commerce search service introduction
  • From shallow to deep, entry search principle

This article uses the open-source search engine ES(Elasticsearch) as an example.

First of all, this article will involve a lot of concepts for students who are in touch with it for the first time, mainly as follows:

Search for noun concepts describe
Document (Doc) ?
Entry (Term) ?
Inverted Index ?
Keyword (Query) ?
Recall (Recall) ?
Word Frequency (tf: Term Frequency) ?
Inverse Document Frequency (IDF :Inverse Document Frequency) ?
Thick row ?
Fine line ?

This article introduces the principles from the simple to the complex, and gradually unveils the above concepts. The structure of this paper is as follows:

  • The birth of the search engine ES
  • Easy version of the search process
    • Indexing process
    • The query process
  • Advanced search process
    • Indexing process
      • What is a Doc?
      • What is a Term?
      • What is an Inverted Index?
      • Document (Doc) analysis
        • Character filter
        • Word segmentation is
        • Word segmentation filter
      • Create an inverted index
    • The query process
      • Keyword analysis
        • Character filter
        • Word segmentation is
        • Word segmentation filter
      • Recall (Recall)
        • What is a Recall?
      • The sorting
        • What is TF :Term Frequency?
        • What is the Inverse Document Frequency?
        • Coarse/fine discharge
    • Summary of Search process
  • Search engine ES advanced
    • The basic structure of an index (noun)
    • Logical structure of search engine ES

The birth of the search engine ES

ES was born from Lucene, an open source JAVA library. According to the description on Lucene’s official website, Lucene has the following capabilities:

  • LuceneIs a JAVA library
  • LuceneSpell checking is implemented
  • LuceneHit character highlighting is implemented
  • LuceneThe function of analysis and word segmentation is realized

Lucene does not have the following capabilities:

  • distributed
  • High availability
  • Out of the box
  • , etc.

So years ago a developer named Shay Banon implemented a highly available open source distributed search engine called Elasticsearch through Lucene.

Common search functions are based on the “search engine” implementation, then let’s look at the simplified search process.

Easy version of the search process

The simple search process is as follows:

  • Step 1: Indexing process, which requires the source data being searched to be indexed (verb) into the search engine and indexed (noun).
  • Step 2: The Query process, the user input the keyword (Query), search engine analysis Query and return the Query results.

Advanced search process

Indexing process

What is a Doc?

For chestnuts, such as “electricity design manual | SkrShop” web content need to be searched, and that the entire content of the web page is called a document Doc.

Document Doc reads as follows:

<! DOCTYPEhtml>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Electricity design manual | SkrShop</title>
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
  <meta name="description" content="It should be the most complete, detailed and practical e-commerce system design manual">
  <! - to omit... -->
  <p>Seckill is a marketing tool of e-commerce</p>
  <! - to omit... -->
Copy the code
Search for noun concepts describe
Document (doc) The text content that needs to be searched can be the detailed information of a product, the information of a web page and so on.

What is a Term?

Continue with the example of the document Doc above. In order to simplify the understanding of Term, it is a marketing method of e-commerce to simplify the content of the above document Doc into one sentence.

Term is the result set of Term segmentation in a Doc. For example, seckill is a marketing method of e-commerce.

Seckill/is/e-commerce/a kind of/marketing/meansCopy the code

Seckill, yes, e-commerce, one, marketing and means are called Terms respectively.

Search for noun concepts describe
Entry (Term) The text Doc being searched is broken down by the word splitter into N standard statements.

What is an Inverted Index?

An inverted index is a concrete implementation of the index (noun) created when indexing (verb) the source data.

We use the following document (Doc) as an example to explain the inverted index:

Document ID Document Content (Doc)
1 E-commerce design manual SkrShop
2 Seckill is a marketing tool of e-commerce
3 Shopping cart is the most important step in the e-commerce purchasing process

Word divider: Document (Doc) broken down into separate Terms (Doc -> Terms).

Open Source Chinese word segmentation:

  • IK Analyzer
  • jieba
  • Etc.

Take jieba word segmentation online demo as an example: app.gumble.pw/jiebademo/

Document ID Document Content (Doc) Chinese word segmentation results (Terms)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop
2 Seckill is a marketing tool of e-commerce Seckill/is/e-commerce/a kind of/marketing/means
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/is/e-commerce/buying/process/most/important/step

The document ID for each entry is as follows:

entry The document IDs
electricity One, two, three
design 1
manual 1
SkrShop 1
Seconds kill 2
is 2, 3,
the 2, 3,
A kind of 2
marketing 2
methods 2
The shopping cart 3
buy 3
process 3
The most 3
important 3
step 3

This is the basic process for building an inverted index.

After the establishment of inverted index is completed, the search process is simulated, assuming:

  • searchelectricity, can quickly find documents 1, 2, 3
  • searchmarketingCan quickly find document 2

(This process is called a recall.)

That’s the concept of an inverted index.

Search for noun concepts describe
Inverted Index The concrete implementation of the index (noun) created when indexing (verb) source data.

Document (Doc) analysis

Analysis is the process of standardizing the text of a document (Doc) and converting a document (Doc) into a standardized Term. The implementation of search engine ES analysis process depends on analyzers.

Basic components of analyzer:

  • Character filter
  • Word segmentation is
  • Word segmentation filter

Character filter

A parser corresponds to a character filter.

Format to standard text (text -> standard text), for example, remove HTML tags from the text.

For example, < P > E-commerce design manual SkrShop

— > E-commerce design manual SkrShop

Word segmentation is

One parser corresponds to one word divider.

The process of breaking a document into separate terms (Doc -> terms). For example: e-commerce design manual SkrShop– > E-commerce/Design/Manual/SkrShop

Another thing to mention here is the custom thesaurus: words that the original thesaurus didn’t have, such as recently created web words.

Word segmentation filter

One analyzer corresponds to N segmentation filters.

  • Unified lowercase
  • Synonym conversion
  • Stop words
  • To extract the stem
  • Error correction
  • Automatic completion
  • And so on…

Word segmentation filter The sample
Unified lowercase Suitable for English, etc. For example, the unified conversion of English letters to lowercase, for exampleWord -> word
Synonym conversion Applicable to all languages. caseSpacious -> spacious
Stop words Applicable to all languages. Words with broad meaning and unrepresentative meanings and other artificially designated words shall be removed, for examplethe,is. Chinese word library:Github.com/goto456/sto…
To extract the stem Suitable for English, etc. casewords -> word
Error correction Applicable to all languages. caseWide bowel -> spacious
Automatic completion Applicable to all languages.

Summary of indexing process

The query process

Search for noun concepts describe
Keyword (Query) Initiate a search is a keyword entered by the user

Keyword analysis

Keywords (Query) also need to be parsed, and the document indexing process is the same parsers.

Same analyzer:

  • Same character filter
  • Same word divider
  • Same segmentation filter

Word segmentation:

Keyword (Query) Chinese word segmentation results (Terms)
Design of seckill system Seckill/system/design
Terms (Terms)
Seconds kill
system
the
design

Word segmentation filter:

Here to stop word segmentation filter as an example to explain the process of word segmentation filter, this article uses the stop word library example: github.com/goto456/sto…

Get the set of Terms with the stop word removed:

Terms (Terms)
Seconds kill
system
design

Recall (Recall)

What is a Recall?

Using the above document content and document segmentation results:

Document ID Document Content (Doc) Chinese word segmentation results (Terms)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop
2 Seckill is a marketing tool of e-commerce Seckill/is/e-commerce/a kind of/marketing/means
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/is/e-commerce/buying/process/most/important/step

A word segmentation filter is further used to filter word segmentation results, using the same stop word segmentation filter as an example. Examples of stop word libraries used in this article: github.com/goto456/sto…

For example, if you hit a stop, the words are:

The result after passing the stop word segmentation filter:

Document ID Document Content (Doc) Chinese word segmentation results (Terms)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop
2 Seckill is a marketing tool of e-commerce Second kill/e-commerce/a kind of/marketing/means
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/e-commerce/purchase/process/important/step

The inverted index structure is further obtained:

entry The document IDs
electricity One, two, three
design 1
manual 1
SkrShop 1
Seconds kill 2
A kind of 2
marketing 2
methods 2
The shopping cart 3
buy 3
process 3
important 3
step 3

Then simulate the search process and assume the design of user search seckill system:

Keyword (Query) Chinese word segmentation results (Terms)
Design of seckill system Seckill/system/design
Terms (Terms)
Seconds kill
system
the
design

The term filter is used as an example of the stop term filter in the same process as above to obtain the set of Terms after the stop term is removed, which is called the set of Query:

Terms (Terms)
Seconds kill
system
design
  • Query entry for the keywordSeconds kill, which can be easily found by the inverted index aboveDocument 2
  • Query entry for the keywordsystemNo documents were found by the above inverted index
  • Query entry for the keyworddesign, which can be easily found by the inverted index aboveDocument 1

The user searches for the seckill system design and finds the following document:

  • Document 2: Seckill is a marketing tool of e-commerce
  • Document 1: E-commerce design manual SkrShop

That process is a recall.

Search for noun concepts describe
Recall (Recall) The process by which search engines use inverted indexing to retrieve relevant documents by term.

During the recall process mentioned above, users found documents 1 and 2 by searching the design of the seckill system.

Supplement: the above text recall method based on inverted index. In addition, there are recall methods based on the same category and other similar attributes, as well as vector recall based on deep learning.Copy the code

Then came the question:

How to determine the order of who comes first and who comes last in the recalled documents 1 and 2?

Then the following is the implementation of search engine sorting.

The sorting

Introduce the above question:

How do you decide who comes first and who comes last in documents 1 and 2?

A: The relevance of the document is determined by the search engine’s score. There are two main indicators that usually determine this score:

  • Document rate: TF (Term Frequency)
  • Inverse Document rate: IDF (Inverse Document Frequency)

It can be simply understood as relevance score = document rate * reverse document rate. The higher the relevance score is, the higher the rank is. Then, we look at the meanings of relevant concepts respectively.

What is TF :Term Frequency?

Use the document above again:

Document ID Document Content (Doc) Chinese word segmentation results (Terms)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop
2 Seckill is a marketing tool of e-commerce Second kill/e-commerce/a kind of/marketing/means
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/e-commerce/purchase/process/important/step

Let’s take e-commerce/seckill as an example.

Simple algorithm of word frequency: word frequency = the number of entries in a single document/the total number of entries in the document, the larger the value of word frequency is, the more irrelevant it is.

For example, the frequency of the word “kill” in document 1, in terms of all terms in a single document, we can simply find the frequency of the word “kill” in each document:

Document ID Document Content (Doc) Chinese word segmentation results (Terms) The number of times an entry appears in a single document Word frequency (SEC kill)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop 0 0/4 = 0
2 Seckill is a marketing tool of e-commerce Second kill/e-commerce/a kind of/marketing/means 1 1/5 = 0.2
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/e-commerce/purchase/process/important/step 0 0/6 = 0

Similarly, we can simply see the word frequency of the word “e-commerce” in each document:

Document ID Document Content (Doc) Chinese word segmentation results (Terms) The number of times an entry appears in a single document Word frequency (e-commerce)
1 E-commerce design manual SkrShop E-commerce/Design/Manual/SkrShop 1 A quarter of a = 0.25
2 Seckill is a marketing tool of e-commerce Second kill/e-commerce/a kind of/marketing/means 1 1/5 = 0.2
3 Shopping cart is the most important step in the e-commerce purchasing process Shopping cart/e-commerce/purchase/process/important/step 1 1/6 = 0.167
Search for noun concepts describe
Word Frequency (tf: Term Frequency) Number of entries in a single document/total entries in a document
What is the Inverse Document Frequency?

For individual documents, the more frequent the word, the more relevant the value.

Consider this question: If a term appears in all documents, is it more relevant or less relevant?

A: No, right.Copy the code

This is the document rate: the document rate = the number of documents containing a certain term/the number of all documents, the greater the document rate value is irrelevant, and vice versa.

Because the larger the word frequency value, the more relevant, vice versa. In order to ensure logical consistency with word frequency, and the higher the final correlation score, the more relevant, the algorithm of document rate is adjusted, and the numerator and denominator are switched: all documents/(the number of documents containing a certain term + 1) plus 1 to ensure that the denominator is not zero, this is the inverse document rate.

Reverse document rate = number of documents/(number of documents containing a term + 1).

However, since the number of documents is often very large, the value of the inverse document rate above will be huge. Therefore, adjust the formula, introduce logarithms, reduce the size of the value, and make the value smooth:

Inverse document rate = log(number of documents/(number of documents containing an entry + 1))

Entry (Term) Inverse document rate
electricity log(3/(3+1))
Seconds kill log(3/(1+1))

Finally, the correlation score(TF/IDF) of each Query term corresponding to each document is calculated: correlation score = document rate * reverse document rate.

Coarse/fine discharge

The sorting result using TF/IDF score (correlation score = document rate * reverse document rate) above is only a preliminary sorting of recalled documents, which is called rough sorting.

After obtaining rough sorting results, documents are usually sorted more accurately according to actual business requirements. For example, manual intervention is used to increase the weight of some documents and make them sorted more advanced. This process is called precision sorting.

Search for noun concepts describe
Thick row The process of recalling documents using TF/IDF scores
Fine line Rough out the results and sort them more accurately according to the actual business requirements

Summary of Search process

  1. Indexing process: Document (Doc) -> Analysis -> inverted index.

  1. Query process: keywords (Query) -> analysis -> recall -> coarse -> fine.

Search engine ES advanced

The basic structure of an index (noun)

  • The index index
    • Type type: Distinguishes between different document data structure types
      • Mapping: Manages attributes of indexes, such as parsers used, and so on
      • Document doc: Specific document to be searched

Further refine the search process by adding a more detailed index (noun) structure

Logical structure of search engine ES

conclusion

Feel free to leave a comment on our Github project at github.com/skr-shop/ma… .


[Skr Shop] Project address long press to enter: github.com/skr-shop/ma…


SkrShop series more articles:

  • SkrShop | electric business search business is introduced
  • SkrShop | an article about the electricity order settlement page design?
  • SkrShop | electric maker also confuse one yuan, conventional seconds kill, kill time to buy?
  • SkrShop | do you want to know the coupon business, SkrShop tell you
  • What SkrShop |, seconds kill system has so many!
  • SkrShop | shopping cart design of requirement analysis
  • The architectural design of SkrShop | a shopping cart
  • SkrShop | system design of general drawing tools
  • SkrShop | general demand analysis of drawing tools
  • SkrShop | coder, you can design the trading system (work)?
  • SkrShop | coder, you can design the trading system (concept)?
  • SkrShop | third-party payment process analysis and summary
  • SkrShop | electrical design manual of basic goods information
  • SkrShop | electric business users of the system design manual