
An Overview of Search Engines

What is a search?

Search: in any scenario where you want to find information, you type in a keyword or phrase and expect to get back results related to it.

Search roughly falls into two categories: general web search and vertical search.

Web search

Web search is the territory of general-purpose engines such as Google, Baidu, Sogou, and Bing.

Vertical search (site search)

Vertical search is mainly divided into Internet search and IT-system search.

Internet search: e-commerce sites, recruitment sites, news sites, and all kinds of apps. For example, searching an e-commerce site for "toothpaste" or "children's clothing".

IT-system search: OA (office automation) software, meeting management, schedule management, project management, staff management, and so on. For example, searching a staff management system for employees: "Zhang San", "Zhang Xiaosan", and so on.

What if you use a database to do a search?

We all know that data lives in databases: the product information of an e-commerce site, the job postings of a recruitment site, the articles of a news site, and so on.

From a technical point of view, then, how would you implement the search feature of an e-commerce site? One option is to search the database directly.

The text in the searched field of each record can be long: a "product description" field may hold thousands or even tens of thousands of characters, and every search must scan the full text of every record to check whether it contains the keyword (say, "electric toothbrush"). Worse, the search term cannot be split apart, so you get far fewer results than you want: a search for "electric brush" will never bring back "electric toothbrush".

Using a database for search is therefore not very practical, and generally speaking the performance will be poor.
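For contrast, a full-text engine splits both the indexed text and the query into terms. Here is a minimal sketch of the same search as an Elasticsearch query (the index products and field name are illustrative; ES itself is introduced later in this article). Because "electric brush" is analyzed into the terms [electric, brush], a product named "electric toothbrush" can still match on the shared term "electric":

GET /products/_search
{
  "query": {
    "match": { "name": "electric brush" }
  }
}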

Forward and inverted indexes

Forward index

A forward index is keyed by document ID: for each document, the table records the position of every word it contains. At search time, the word information of each document in the table is scanned until every document containing the query keyword has been found.

Inverted index

An inverted index is keyed by term: the entry for each keyword records all documents in which that word appears. Each entry is a posting list recording the IDs of the documents and the positions of the term within them.

In a posting-list entry such as (1:1:<1>), the colon-separated parts are the ID of the document containing the keyword, the keyword's frequency within that document, and the keyword's position(s).
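Elasticsearch lets you inspect the postings it stores for a single document through the term vectors API. A minimal sketch, assuming the order_detail index and document 1 that are created later in this article (ES 5.x-style URL with a type in the path):

# Show the terms of the name field, with their frequencies and positions
GET /order_detail/default/1/_termvectors
{
  "fields": [ "name" ],
  "positions": true,
  "offsets": true,
  "term_statistics": true
}

Each term in the response carries its document frequency plus a list of positions and offsets, which is exactly the document : frequency : <position> triple described above.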

Lucene

Overview

Lucene is an open-source full-text search engine toolkit (a JAR package) containing ready-made code, algorithms included, for building inverted indexes and running searches. Developing directly on top of Lucene is very complex, though: the API is intricate (even simple features take a lot of Java code) and demands a deep understanding of the underlying principles (the various index structures).

TF-IDF

TF – term frequency

The frequency of a term within a document: (number of occurrences of term t in the document) / (total number of terms in the document).

IDF – inverse document frequency

Inverse document frequency = log(total number of documents / number of documents in which term t appears).

So TF-IDF grows with the number of times a word appears in a given document, and shrinks with how often the word appears across the whole corpus.
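Written out, with $n_{t,d}$ the number of occurrences of term $t$ in document $d$, $N$ the total number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$ (real engines such as Lucene smooth these quantities slightly differently):

$$
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\operatorname{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$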

Characteristics of TF-IDF

The strengths of the TF-IDF algorithm are that it is simple and fast, and its results match intuition reasonably well.

The weakness is that measuring a word's importance by frequency alone is not comprehensive: important words may not appear often. The algorithm also ignores position, treating words near the beginning and words near the end as equally important, which is incorrect. (One remedy is to give extra weight to the first paragraph and to the first sentence of each paragraph.)

Stop words, synonyms, and opposite-meaning sentences also need consideration. For example:

Stop words: the most frequent words, such as "the", "is", and "in", contribute nothing to finding results and must be filtered out.

Synonyms: words that express the same meaning, such as 西红柿 and 番茄 (both "tomato"), or 土豆 and 马铃薯 (both "potato").

Opposite-meaning sentences: their TF-IDF profiles may be similar even though they express opposite opinions.

Sentence A: I prefer watching TV to watching movies.

Sentence B: I don’t like watching TV, and I don’t like watching movies either.
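In Elasticsearch, synonyms are usually handled at the analysis layer with a synonym token filter, so both spellings index to the same terms. A minimal sketch; the index name my_synonym_index and the synonym pairs are illustrative:

PUT /my_synonym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "tomato, love apple",
            "potato, spud"
          ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  }
}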

Problems with Lucene

  • Complex API
  • Single-machine bottleneck (no built-in distribution)
  • No built-in high availability

ElasticSearch

Overview

ElasticSearch is a highly scalable, distributed full-text search and analysis engine. Built on Lucene, it hides Lucene's complexity and exposes easy-to-use RESTful APIs and Java APIs (plus clients for other languages).

ElasticSearch first appeared in 2010, and Elastic went public in 2018. ElasticSearch iterates quickly and is thoroughly documented.

Version features: see "Mandatory reading for Elasticsearch upgrades"

Recommended reading: the official documentation

Applicable scenarios

  • Wikipedia: like Baidu Baike; search "toothpaste" and get its encyclopedia entry, with full-text search, highlighting, and search suggestions.
  • The Guardian (a foreign news site): similar to Sohu News; combines user behavior logs (clicks, views, favorites, comments) with social data (what readers think of related news) for analysis, so every author can see the public's reaction to their articles (good, bad, popular, trash, despised, admired).
  • Stack Overflow (a foreign programming Q&A forum): post IT problems and program errors for others to discuss and answer; full-text search over related questions and answers, so you can paste an error message in and find the corresponding fix.
  • GitHub: searches hundreds of billions of lines of code.
  • E-commerce sites: product search.
  • Log analysis: Logstash collects the logs and ES runs complex analysis over them (the ELK stack: ElasticSearch + Logstash + Kibana).
  • Price-monitoring sites: a user sets a price threshold for a product and is notified when the price drops below it; for example, "subscribe to toothpaste monitoring: if the Colgate family pack drops below 50 yuan, tell me and I'll buy it."
  • BI (Business Intelligence) systems: a shopping-mall group, for instance, analyzes the trend of consumer spending and the make-up of the user base in a district over the past three years and produces reports: annual spending there has grown 100% and 85% of the customers are senior white-collar workers, so open a new mall in the district. ES does the data analysis and mining; Kibana does the visualization.

Common application scenarios in China:

  • Site search (e-commerce, recruitment, portal, etc.)
  • IT system search (OA, CRM, ERP, etc.)
  • Data analysis (a popular use scenario for ES)

Characteristics

  • It works as a large distributed cluster (up to hundreds of servers) processing petabytes of data for big companies, and it also runs on a single machine for small ones.
  • Elasticsearch is not a new technology: it combines full-text search (Lucene), data analysis (as in commercial tools such as Umeng+ or Baidu Analytics), and distributed technology (as in TiDB/MyCat), and that combination is what makes ES unique.
  • For users it is out of the box and very simple: a small or medium application can deploy ES in about 3 minutes and use it in production directly, as long as the data volume is modest and the operations are not too complex.
  • Databases serve transactions and other online transactional operations, but fall short in many areas: special functions such as full-text search, synonym handling, relevance ranking, complex data analysis, and near-real-time processing of massive data. Elasticsearch complements the traditional database and provides a lot of functionality a database cannot.

Elasticsearch currently has two core application areas: vertical search engine and real-time data analysis

Core concepts

  • Near Realtime (NRT): "near real time" means two things: there is only a small delay (about one second) between the time data is written and the time it becomes searchable, and searches and analyses over ES complete in seconds.
  • Cluster: a cluster contains multiple nodes. Which cluster a node belongs to is determined by a configured cluster name (default "elasticsearch"; changing it is recommended). For small and medium applications it is normal to start with a single-node cluster.
  • Node: a node in a cluster. A node also has a name (randomly assigned by default), and that name matters during operations and maintenance. A node joins the cluster named "elasticsearch" by default, so if you simply start a bunch of nodes, they automatically form an "elasticsearch" cluster; a single node also forms a one-node cluster on its own.
  • Document & field: the document is the smallest unit of data in ES. A document can be one customer record, one product-category record, or one order record, and is usually represented as JSON. A type under an index stores many documents; each document contains multiple fields, and each field is one data column.
  • Index: a collection of documents with similar structure (a customer index, a product-category index, an order index), identified by a name. An index holds many documents and represents one class of similar or identical documents; a product index, say, holds all the product documents.
  • Type: documents under one type share the same fields. A blog system, for example, could define a user type, a blog type, and a comment type in one index. ES 6 announced that multiple types per index would go away, and ES 7 officially dropped support for them: each index has exactly one type, _doc by default, and the official plan is to remove types entirely in 8.x. API requests change accordingly, e.g. GET index/_doc/id, where index and id are concrete values.
  • Shard: a single machine cannot hold arbitrarily much data, so ES can split the data of one index into multiple shards stored across servers. Shards enable horizontal scaling: more data, search and analysis distributed over several servers, and higher throughput and performance. Each shard is a Lucene index.
  • Replica: any server can fail or go down at any moment, taking shards with it, so every shard can have replica copies. A replica takes over when its shard fails, so no data is lost, and replicas also serve searches, raising throughput and performance. By default each index gets 10 shards: 5 primaries and 5 replicas; the minimal high-availability setup is two servers (see the sketch after this list).
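Both numbers are set per index at creation time; the number of primary shards cannot be changed afterwards, while the replica count can. A minimal sketch with illustrative values (the index name test_index is hypothetical):

# 3 primary shards (fixed once created), 1 replica per primary (adjustable later)
PUT /test_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}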

Elasticsearch vs relational database

Elasticsearch    Relational database
Document         Row
Type             Table
Index            Database

ElasticSearch installation

Stand-alone mode

Zero configuration, out of the box.

Download elasticsearch:

wget artifacts.elastic.co/downloads/e…

Decompress the installation package:

tar -zxvf elasticsearch-5.5.2.tar.gz

Start ElasticSearch:

cd /usr/local/elasticsearch-5.5.2

bin/elasticsearch -d    # start as a daemon

Viewing cluster information:

curl http://localhost:9200/

{
  "name" : "4onsTYV",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "nKZ9VK_vQdSQ1J0Dx9gx1Q",
  "version" : {
    "number" : "5.2.0",
    "build_hash" : "24e05b9",
    "build_date" : "2017-01-24T19:52:35.800Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.0"
  },
  "tagline" : "You Know, for Search"
}

Cluster mode

For a distributed installation, you need to modify the elasticsearch.yml configuration file:

# elasticsearch.yml

cluster.name: es-cluster        # cluster name; must be identical on every node
node.name: node-data-104        # node name; unique per node

node.master: true               # may be elected master
node.data: true                 # stores data
node.ingest: true               # ingest node; can pre-process data inside ES

path.data: /home/lgd/es/data    # data directory
path.logs: /home/lgd/es/log     # log directory

bootstrap.memory_lock: true     # lock the heap in memory
network.host: 10.10.10.104      # bind address
http.enabled: true              # expose the HTTP service
http.port: 9200                 # HTTP port
transport.tcp.port: 9300        # TCP port for inter-node communication
discovery.zen.ping.unicast.hosts: ["10.10.10.104", "10.10.10.105", "10.10.10.106"]   # hosts for cluster discovery
bootstrap.system_call_filter: false   # disable the system-call filter check
transport.tcp.compress: true          # compress inter-node traffic
thread_pool.index.queue_size: 800     # index thread-pool queue size
thread_pool.bulk.queue_size: 800      # bulk thread-pool queue size

ElasticSearch Cluster monitoring

The _cat APIs provide a family of endpoints for inspecting the status of an ElasticSearch cluster.

Older versions of ES shipped a very useful plugin called head, but it was dropped after 5.x in favor of the (paid) X-Pack.

GET /_cat

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes # Node statistics
/_cat/tasks
/_cat/indices # index statistics
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health # cluster health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins # Cluster plugin
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates

Cluster Health Status

GET /_cat/health?v

cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
smart-es green           2         2    233 121    0    0        0             0                  -                100.0%

Status description:

  • Green: The Primary shard and Replica Shard in each index are active.
  • Yellow: The primary shard in each index is active, but some replica shards are not active and are unavailable.
  • Red: Not all primary shards of the index are active, some index data is missing.
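The same information is also available as JSON from the cluster health API; a minimal sketch, with output that is illustrative for the two-node cluster above:

GET /_cluster/health

{
  "cluster_name" : "smart-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 121,
  "active_shards" : 233,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "active_shards_percent_as_number" : 100.0
}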

Viewing cluster nodes

View the master node:

GET /_cat/master?v

id                     host           ip             node
T2bwdF_5TWqWfA1C0bmKLg 10.xxx.150.231 10.xxx.150.231 node-231

View all nodes:

GET /_cat/nodes?v

ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.xxx.xxx.231           64          94   5    2.33    2.66     2.90 mdi       *      node-231
10.xxx.xxx.208           64          99   8    1.10    1.13     1.14 mdi       -      node-208

View all indexes in the cluster

GET /_cat/indices?v

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   news          EylgtpJ7TlWwJF4EIMYj3g   5   1          3            0     30.8kb         15.4kb
green  open   student       6Kydo3Y0TkyP_SmeR136xA   5   1          1            0       12kb            6kb
green  open   test          4PMOT7xvS_SYJIbKXGwtyw   5   1          2            0     14.9kb          7.4kb
green  open   workorder_tet SbrdY8HLQPOQFIoxdNL7Gg   5   1          0            0      1.5kb           810b

View an index

GET /_cat/indices/news

View index document count

GET /_cat/count/news?v

epoch      timestamp count
1550218453 16:14:13  3

Or use the following method to view the number of indexed documents:

GET /news/news/_count?pretty

GET /news/_count?pretty

{
  "count": 3,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  }
}

Viewing plugins

GET /_cat/plugins?v&s=component&h=name,component,version,description

name     component   version description
node-231 analysis-ik 5.5.2   IK Analyzer for Elasticsearch
node-208 analysis-ik 5.5.2   IK Analyzer for Elasticsearch

Viewing the resources allocated to each node

GET /_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
   117      427.3mb    42.7gb    153.9gb    196.7gb           21 10.250.xxx.208 10.250.xxx.208 node-208
   116      435.8mb    39.1gb      9.9gb       49gb           79 10.250.xxx.231 10.250.xxx.231 node-231

A case-based walkthrough of ES document CRUD and search

Background

An e-commerce website needs a back-end system built on ES that provides the following capabilities:

  1. CRUD (create, read, update, delete) operations on product information
  2. Simple structured queries
  3. Simple full-text search, as well as complex phrase search
  4. Highlighting of full-text search results

Index structure

# Create the order index
PUT /order_detail
{
  "aliases": {},
  "mappings": {
    "default": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "desc": { "type": "keyword" },
        "price": { "type": "long" },
        "producer": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        },
        "tags": { "type": "keyword" }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  }
}

# Inspect the index
GET /order_detail

# Delete the index
DELETE /order_detail

# Modifying an existing mapping in place is not recommended

Document CRUD


# Create a product
PUT /order_detail/default/1
{
    "name": "Colgate Toothpaste",
    "desc": "Colgate whitens and prevents cavities",
    "price": 30,
    "producer": "Colgate",
    "tags": ["White", "Inside"]
}

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 2, "failed": 0 },
  "created": true
}

###############################

PUT /order_detail/default/2
{
    "name": "Crest Toothpaste",
    "desc": "Crest effectively prevents tooth decay",
    "price": 25,
    "producer": "Crest",
    "tags": ["Inside"]
}

PUT /order_detail/default/3
{
    "name": "Chinese Toothpaste",
    "desc": "Chinese Toothpaste Herb",
    "price": 40,
    "producer": "Chinese",
    "tags": ["Fresh"]
}

# Retrieve a product
GET /order_detail/default/1

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "Colgate Toothpaste",
    "desc": "Colgate whitens and prevents cavities",
    "price": 30,
    "producer": "Colgate",
    "tags": ["White", "Inside"]
  }
}

###############################
# Replace a product (full update)
PUT /order_detail/default/1
{
    "name": "Colgate Enhanced Toothpaste",
    "desc": "Colgate whitens and prevents cavities",
    "price": 30,
    "producer": "Colgate",
    "tags": ["White", "Inside"]
}

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": { "total": 2, "successful": 2, "failed": 0 },
  "created": false
}

PUT /order_detail/default/1
{
    "name": "Colgate Ultimate Toothpaste"
}

GET /order_detail/default/1

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "name": "Colgate Enhanced Toothpaste"
  }
}
# Drawback of the full update: you must send every field, even to change a single value

Partial document updates:

# Update only the given fields
POST /order_detail/default/1/_update
{
  "doc": {
    "name": "Colgate Abnormal Toothpaste"
  }
}

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 5,
  "result": "updated",
  "_shards": { "total": 2, "successful": 2, "failed": 0 }
}

GET /order_detail/default/1

{
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 5,
  "found": true,
  "_source": {
    "name": "Colgate Abnormal Toothpaste",
    "desc": "Colgate whitens and prevents cavities",
    "price": 30,
    "producer": "Colgate",
    "tags": ["White", "Inside"]
  }
}

###############################
# Delete a product
DELETE /order_detail/default/1

{
  "found": true,
  "_index": "order_detail",
  "_type": "default",
  "_id": "1",
  "_version": 3,
  "result": "deleted",
  "_shards": { "total": 2, "successful": 2, "failed": 0 }
}

Analyzer

An analyzer can be one of the built-in analyzers or a custom analyzer defined per index.

An analyzer performs four jobs (see the sketch after this list):

  • Splitting text into terms: The quick brown foxes → [The, quick, brown, foxes]
  • Lowercasing: The → the
  • Removing common stop words: [The, quick, brown, foxes] → [quick, brown, foxes]
  • Reducing variants (plurals, past tense) to their root form: foxes → fox
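All four steps can be observed at once by running a language analyzer through the _analyze API. A minimal sketch using the built-in english analyzer (output abridged to the token texts):

POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes"
}

# resulting tokens: [ quick, brown, fox ]
# "The" is dropped as a stop word, "QUICK" is lowercased, and "foxes" is stemmed to "fox"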

Built-in analyzers – the standard analyzer

The built-in analyzers are: Standard Analyzer (the default), Simple Analyzer, Whitespace Analyzer, Stop Analyzer, Keyword Analyzer, Pattern Analyzer, Language Analyzers, and Fingerprint Analyzer. See "Analyzers" in the official documentation for details.

The standard analyzer is composed of:

Tokenizer: Standard Tokenizer

Token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (disabled by default)

# Tokenize with the standard analyzer
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

{
  "tokens": [
    { "token": "the",    "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "2",      "start_offset": 4,  "end_offset": 5,  "type": "<NUM>",      "position": 1 },
    { "token": "quick",  "start_offset": 6,  "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "brown",  "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "foxes",  "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 4 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "<ALPHANUM>", "position": 5 },
    { "token": "over",   "start_offset": 31, "end_offset": 35, "type": "<ALPHANUM>", "position": 6 },
    { "token": "the",    "start_offset": 36, "end_offset": 39, "type": "<ALPHANUM>", "position": 7 },
    { "token": "lazy",   "start_offset": 40, "end_offset": 44, "type": "<ALPHANUM>", "position": 8 },
    { "token": "dog's",  "start_offset": 45, "end_offset": 50, "type": "<ALPHANUM>", "position": 9 },
    { "token": "bone",   "start_offset": 51, "end_offset": 55, "type": "<ALPHANUM>", "position": 10 }
  ]
}

###############################
# Customize the standard analyzer: limit token length and enable English stop words
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard"."max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer"."text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}


[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]


Chinese analyzer

The most widely used Chinese analyzer is the IK analyzer.

IK analyzer includes two word segmentation methods:

  • ik_max_word: splits the text at the finest granularity. For example, 中华人民共和国国歌 ("National Anthem of the People's Republic of China") is split into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌, exhausting every possible combination (see the output below).
  • ik_smart: does the coarsest split, e.g. 中华人民共和国国歌 becomes 中华人民共和国 and 国歌.

ik_max_word is recommended for indexing, and ik_smart for searching.


POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "中华人民",       "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "中华",           "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 },
    { "token": "华人",           "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 },
    { "token": "人民共和国",     "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "人民",           "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 },
    { "token": "共和国",         "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 },
    { "token": "共和",           "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 },
    { "token": "国",             "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 },
    { "token": "国歌",           "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 }
  ]
}

There are other analyzers too, such as ANSJ, which offers several segmentation strategies:

index_ansj (recommended for indexing)

query_ansj (recommended for searching)

dic_ansj (user-dictionary-first segmentation)

Custom analyzer

Of course, we can also define our own analyzers, each consisting of three parts:

  • Zero or more character filters
  • A tokenizer
  • Zero or more token filters

# A custom analyzer: character filter + tokenizer + token filters
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => _happy_", ":( => _sad_" ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer"."text":     "I'm a :) person, and you?"
}


{
  "tokens": [
    { "token": "i'm",     "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "_happy_", "start_offset": 6,  "end_offset": 8,  "type": "word", "position": 2 },
    { "token": "person",  "start_offset": 9,  "end_offset": 15, "type": "word", "position": 3 },
    { "token": "you",     "start_offset": 21, "end_offset": 24, "type": "word", "position": 5 }
  ]
}

Search


# Search all documents
GET /order_detail/default/_search?pretty

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "2", "_score": 1,
        "_source": { "name": "Crest Toothpaste", "desc": "Crest effectively prevents tooth decay", "price": 25, "producer": "Crest", "tags": ["Inside"] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "1", "_score": 1,
        "_source": { "name": "Colgate Abnormal Toothpaste", "desc": "Colgate whitens and prevents cavities", "price": 30, "producer": "Colgate", "tags": ["White", "Inside"] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "3", "_score": 1,
        "_source": { "name": "Chinese Toothpaste", "desc": "Chinese Toothpaste Herb", "price": 40, "producer": "Chinese", "tags": ["Fresh"] }
      }
    ]
  }
}

Reading the response:

  • took: how long the search took, in milliseconds
  • _shards: the index has 5 shards, so the request fans out to all 5, each answered by its primary shard or one of its replicas
  • hits.total: the number of matching documents (3 here)
  • hits.hits: the detailed data of the matching documents

# Build the query in JSON (the query DSL)
GET /order_detail/default/_search?pretty
{
  "query": { "match_all": {} }
}

# Products whose name contains "Toothpaste", sorted by price descending
GET /order_detail/default/_search?pretty
{
  "query": {
    "match": { "name": "Toothpaste" }
  },
  "sort": [
    { "price": "desc" }
  ]
}

# Pagination: 1 item per page, fetch page 2
GET /order_detail/default/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}

# Return only each product's name and price
# (restricting _source suits production use and composes with complex queries)
GET /order_detail/default/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "price"]
}

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "order_detail", "_type": "default", "_id": "2", "_score": 1, "_source": { "price": 25, "name": "Crest Toothpaste" } },
      { "_index": "order_detail", "_type": "default", "_id": "1", "_score": 1, "_source": { "price": 30, "name": "Colgate Abnormal Toothpaste" } },
      { "_index": "order_detail", "_type": "default", "_id": "3", "_score": 1, "_source": { "price": 40, "name": "Chinese Toothpaste" } }
    ]
  }
}

# Products whose name contains "Toothpaste" and whose price is above 25 yuan
# filter only screens documents against the criteria: it computes no relevance score and has no effect on ranking
# query computes each document's relevance to the search criteria and sorts the results by that relevance

GET /order_detail/default/_search
{
    "query": {
        "bool": {
            "must": {
                "match": { "name": "Toothpaste" }
            },
            "filter": {
                "range": {
                    "price": { "gt": 25 }
                }
            }
        }
    }
}

# Phrase search
GET /order_detail/_search
{
  "query": {
    "match_phrase": { "name": "Chinese Toothpaste Herb" }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.1899844,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "4", "_score": 1.1899844,
        "_source": { "name": "Herbal Essence of Chinese Toothpaste", "desc": "Chinese Toothpaste Herb", "price": 15, "producer": "Chinese", "tags": ["Fresh"] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.8574782,
        "_source": { "name": "Chinese toothpaste herbal Essence fresh", "desc": "Chinese Toothpaste Herb", "price": 11, "producer": "Chinese", "tags": ["Fresh"] }
      }
    ]
  }
}

# slop lets the words of the phrase be separated or reordered; with a large enough slop, word order becomes arbitrary
GET /order_detail/_search
{
    "query": {
        "match_phrase": {
            "name": {
                "query": "Refreshing Chinese toothpaste",
                "slop": 50
            }
        }
    }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.26850328,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.26850328,
        "_source": { "name": "Chinese toothpaste herbal Essence fresh", "desc": "Chinese Toothpaste Herb", "price": 11, "producer": "Chinese", "tags": ["Fresh"] }
      }
    ]
  }
}



# Highlight matches in the name field
GET /order_detail/_search
{
    "query": {
        "match": { "name": "Chinese Toothpaste" }
    },
    "highlight": {
        "fields": { "name": {} }
    }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 5,
    "max_score": 0.6641487,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "4", "_score": 0.6641487,
        "_source": { "name": "Herbal Essence of Chinese Toothpaste", "desc": "Chinese Toothpaste Herb", "price": 15, "producer": "Chinese", "tags": ["Fresh"] },
        "highlight": { "name": [ "<em>Chinese</em> <em>Toothpaste</em> Herbal Essence" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.5716521,
        "_source": { "name": "Chinese toothpaste herbal Essence fresh", "desc": "Chinese Toothpaste Herb", "price": 11, "producer": "Chinese", "tags": ["Fresh"] },
        "highlight": { "name": [ "<em>Chinese</em> <em>Toothpaste</em> Herbal Essence fresh" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "3", "_score": 0.51623213,
        "_source": { "name": "Chinese Toothpaste", "desc": "Chinese Toothpaste Herb", "price": 40, "producer": "Chinese", "tags": ["Fresh"] },
        "highlight": { "name": [ "<em>Chinese</em> <em>Toothpaste</em>" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "1", "_score": 0.2824934,
        "_source": { "name": "Colgate Abnormal Toothpaste", "desc": "Colgate whitens and prevents cavities", "price": 30, "producer": "Colgate", "tags": ["White", "Inside"] },
        "highlight": { "name": [ "Colgate Abnormal <em>Toothpaste</em>" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "2", "_score": 0.21380994,
        "_source": { "name": "Crest Toothpaste", "desc": "Crest effectively prevents tooth decay", "price": 25, "producer": "Crest", "tags": ["Inside"] },
        "highlight": { "name": [ "Crest <em>Toothpaste</em>" ] }
      }
    ]
  }
}

Behind every Elasticsearch search there is a relevance score.

Elasticsearch relevance score

Elasticsearch (or rather Lucene underneath) uses a Boolean model to find the matching documents, then a formula called the practical scoring function to compute relevance. The formula borrows from term frequency/inverse document frequency and the vector space model, and adds modern features such as a coordination factor, field-length normalization, and term/query boosting.

The Boolean model

The Boolean model is also called the exact-match model: retrieved documents must match the query exactly, documents that do not satisfy it are never returned, and all matching documents are equally relevant, so no scoring is needed and the results come back unordered.

The Boolean model simply applies AND, OR, and NOT conditions from the query to find matching documents:

For example, full AND text AND search AND (elasticsearch OR lucene) takes as the result set every document that contains all of full, text, and search, together with at least one of elasticsearch or lucene.

This process is simple and fast; it cheaply excludes all documents that cannot possibly match.
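That filter maps directly onto a bool query. A sketch of full AND text AND search AND (elasticsearch OR lucene); the field name content is illustrative:

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "full" }},
        { "match": { "content": "text" }},
        { "match": { "content": "search" }}
      ],
      "should": [
        { "match": { "content": "elasticsearch" }},
        { "match": { "content": "lucene" }}
      ],
      "minimum_should_match": 1
    }
  }
}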

Vector space model

In the vector space model, documents and query strings are represented as vectors in a high-dimensional space, with one dimension per term. The relevance between a document and a query is computed from the distance between the two vectors, usually with the cosine similarity measure.

Imagine querying for "happy hippopotamus". The common word happy gets a low weight and the rare word hippopotamus a high one: say happy has weight 2 and hippopotamus weight 5. The query then becomes the two-dimensional vector [2, 5], which we can draw as a line in the plane from (0, 0) to (2, 5).

Now, imagine we have three documents:

  • I am happy in summer.
  • After Christmas I’m a hippopotamus.
  • The happy hippopotamus helped Harry

You can build a vector for each document in the same way, from the weights of the query words happy and hippopotamus, and place all the vectors in one coordinate system.

Each document's relevance follows from the angle between its vector and the query vector: document 1 makes the largest angle with the query, so its relevance is lowest; document 2's angle is smaller, so it is more relevant; document 3's direction coincides with the query's, a perfect match.

The practical scoring function

score(q,d) =                  # 1
        queryNorm(q)          # 2
      · coord(q,d)            # 3
      · ∑ (                   # 4
            tf(t in d)        # 5
          · idf(t)²           # 6
          · t.getBoost()      # 7
          · norm(t,d)         # 8
        ) (t in q)            # 9

The formula is explained as follows:

  • #1 score(q,d) is the relevance score of document d for query q
  • #2 queryNorm(q) is the query normalization factor
  • #3 coord(q,d) is the coordination factor
  • #4 and #9: the weights of every term t in query q are summed for document d
  • #5 tf(t in d) is the term frequency of t in document d
  • #6 idf(t) is the inverse document frequency of t
  • #7 t.getBoost() is the boost applied to the term in the query
  • #8 norm(t,d) is the field-length norm, combined with any index-time field-level boost
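To see these factors computed for a live query, ES can attach its scoring breakdown to every hit. A minimal sketch against the order_detail index from earlier (note that recent ES versions score with BM25 by default, so the tree shows BM25-style factors):

GET /order_detail/_search
{
  "explain": true,
  "query": {
    "match": { "name": "Toothpaste" }
  }
}

# each hit gains an _explanation tree breaking the _score into its tf/idf-style components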

Conclusion

  1. Covered: what search is, ES's basic concepts, index structure, document create/read/update/delete, analysis, full-text search, phrase search, highlighting, and more
  2. Not covered: parent-child documents, aggregation analysis, data modeling, real-time data analysis, document auto-completion, advanced search (multi_match, boost, etc.), the distributed internals, cluster maintenance and upgrades, Elasticsearch SQL (REST, JDBC, and command-line support), and so on

Reference documentation

  • Controlling relevance
  • ElasticSearch relevance scoring mechanism
  • Learning notes on the ElasticSearch relevance scoring mechanism