Part 4: In-depth search

1. Term-based and full-text search

1.1 `Term`-based queries

Why terms matter
- A term is the smallest unit that carries meaning. Both search and natural language processing with statistical language models operate on terms
Characteristics
- Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
- In ES, a term query does not analyze (tokenize) its input. The input is treated as a single term and looked up exactly in the inverted index, and every document containing that term is scored with the relevance formula. For example, "Apple Store" is looked up as one term
- A query can be converted to a filter with Constant Score, which skips scoring and lets ES use caching to improve performance

1.2 `Term` query examples

1.2.1 Inserting Data

# Term query examples to think through
POST /products/_bulk
{"index":{"_id":1}}
{"productID":"XHDK-A-1293-#fJ3","desc":"iPhone"}
{"index":{"_id":2}}
{"productID":"KDKE-B-9947-#kL5","desc":"iPad"}
{"index":{"_id":3}}
{"productID":"JODL-X-1937-#pV7","desc":"MBP"}

GET /products

1.2.2 Example 1

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iPhone"
      }
    }
  }
}

This returns nothing. Why? A term query does not process its input in any way, so the search condition is "iPhone" with an uppercase letter. At index time, however, ES analyzes text fields with the default analyzer, which tokenizes the value and lowercases it. The inverted index therefore holds "iphone", and the query finds no match.

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iphone"
      }
    }
  }
}

With the lowercase value, the query returns the document.

1.2.3 Example 2

POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

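This returns nothing either. Why? The term query looks up the exact term XHDK-A-1293-#fJ3, but the productID text field was tokenized at index time, so that exact term does not exist in the inverted index. We can check how the value was tokenized with the _analyze API: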

POST /_analyze
{
  "analyzer": "standard",
  "text": ["XHDK-A-1293-#fJ3"]
}

The request above returns:

{
  "tokens" : [
    {
      "token" : "xhdk",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "1293",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "fj3",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

So a query on one of those tokens does match:

POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "xhdk" 
      }
    }
  }
}

The lowercase xhdk matches one of the tokens produced by analysis, so the query returns a result. How, then, do we match the whole value exactly?

To match XHDK-A-1293-#fJ3 exactly, query the productID.keyword sub-field instead:

POST /products/_search
{
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

For an exact match, use the multi-fields feature in ES: with the default dynamic mapping, a string is indexed as a text field with a keyword sub-field. The keyword sub-field is not analyzed, so it supports exact matching.
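The behavior in examples 1 and 2 can be reproduced with a toy inverted index. This is only a sketch: `analyze` is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and `build_index` and `term_query` are hypothetical helpers, not ES APIs.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs, field, analyzed):
    """Build a toy inverted index {term: set(doc_id)} for one field."""
    index = {}
    for doc_id, doc in docs.items():
        # text fields are analyzed; keyword fields keep the whole value as one term
        terms = analyze(doc[field]) if analyzed else [doc[field]]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def term_query(index, value):
    """Look the input up as-is, the way a term query skips analysis."""
    return sorted(index.get(value, set()))

docs = {
    1: {"productID": "XHDK-A-1293-#fJ3", "desc": "iPhone"},
    2: {"productID": "KDKE-B-9947-#kL5", "desc": "iPad"},
    3: {"productID": "JODL-X-1937-#pV7", "desc": "MBP"},
}
desc_text = build_index(docs, "desc", analyzed=True)
product_text = build_index(docs, "productID", analyzed=True)
product_keyword = build_index(docs, "productID", analyzed=False)

print(term_query(desc_text, "iPhone"))                   # [] (index holds "iphone")
print(term_query(desc_text, "iphone"))                   # [1]
print(term_query(product_text, "XHDK-A-1293-#fJ3"))      # [] (value was split into tokens)
print(term_query(product_text, "xhdk"))                  # [1]
print(term_query(product_keyword, "XHDK-A-1293-#fJ3"))   # [1] (keyword keeps the raw value)
```

The only difference between the last two indexes is whether the value is analyzed before indexing, which is exactly the text vs keyword distinction.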

1.3 Compound query: converting to a `Filter` with `Constant Score`

A term query still computes a relevance score. What if we want to skip scoring?

Convert the Query to a Filter to skip the TF-IDF calculation and avoid the cost of relevance scoring
Filters can make effective use of caching

1.4 Full-text queries

Full-text search
- Match Query / Match Phrase Query / Query String Query
Characteristics
- Both indexing and searching analyze the text: the query string is passed to an appropriate analyzer, which produces the list of terms to look up
- At query time the input is first split into terms, each term is looked up individually at the lower level, and the results are merged, with a score computed for each document. For example, a search for "Matrix reloaded" returns every document that contains matrix or reloaded
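A minimal sketch of that behavior, assuming a whitespace-and-lowercase analyzer and made-up movie titles (neither comes from the course data). `match_query` is a hypothetical helper; only the `operator` idea (`or` vs `and`) mirrors the real match query.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def match_query(docs, field, query, operator="or"):
    """Analyze the query, look each term up, and merge the per-term results."""
    terms = analyze(query)
    hits = []
    for doc_id, doc in docs.items():
        doc_terms = set(analyze(doc[field]))
        found = [term in doc_terms for term in terms]
        if (operator == "or" and any(found)) or (operator == "and" and all(found)):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical documents, for illustration only
movies = {
    1: {"title": "The Matrix"},
    2: {"title": "The Matrix Reloaded"},
    3: {"title": "Reloaded Again"},
    4: {"title": "Toy Story"},
}

print(match_query(movies, "title", "Matrix reloaded"))                  # [1, 2, 3]
print(match_query(movies, "title", "Matrix reloaded", operator="and"))  # [2]
```

With the default `or`, recall is high but precision is lower; switching to `and` trades recall for precision.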

1.5 The `Match Query` process

1.6 Summary

Term-based lookup vs full-text lookup
The field Mapping controls whether a field is analyzed
- Text vs Keyword
Query parameters control Precision & Recall
Compound query: the Constant Score query
- Even a Term query on a Keyword field computes a score
- A query can be converted to a filter, which removes the relevance calculation and improves performance

2. Structured search

2.1 Structured Data

2.2 Structured search in `ES`

Structured data such as booleans, times, dates, and numbers have precise formats that we can operate on logically, for example comparing ranges of numbers or times, or determining which of two values is larger
Structured text can be matched exactly or partially
- Term query / Prefix query
The result of a structured match is only yes or no
- Depending on the scenario, you can decide whether a structured search needs scoring

2.3 Examples

2.3.1 Inserting Data

# Structured search
DELETE products
POST /products/_bulk
{"index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2018-01-01","productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":2}}
{"price":20,"avaliable":true,"date":"2019-01-01","productID":"KDKE-B-9947-#kL5"}
{"index":{"_id":3}}
{"price":30,"avaliable":true,"productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":4}}
{"price":10,"avaliable":false,"productID":"XHDK-A-1293-#fJ3"}

GET products/_mapping

2.3.2 A term query on a Boolean field computes a score

POST /products/_search
{
  "profile": "true",
  "query": {
    "term": {
      "avaliable": true
    }
  }
}

2.3.3 Converting the Boolean term query to a filter with constant score, without scoring

POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

2.3.4 Numeric range query

GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}

2.3.4 Date range query

GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}

now-1y means the current time minus one year (now = the current time, y = year, 1y = one year)

Unit	Meaning
y	years
M	months
w	weeks
d	days
H/h	hours
m	minutes
s	seconds
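To make the unit table concrete, here is a small sketch that resolves a single `now±N<unit>` expression against a fixed clock. It is an illustration only: the `resolve` and `shift_months` helpers are hypothetical, the fixed `now` value is an assumption, clamping the day at month end is a simplification chosen for this sketch, and real ES date math also supports chained terms and rounding, which are not handled here.

```python
import calendar
import re
from datetime import datetime, timedelta

# Units with a fixed length map directly onto timedelta arguments
FIXED_UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
               "m": "minutes", "s": "seconds"}

def shift_months(moment, months):
    """Move by calendar months, clamping the day to the end of the target month."""
    total = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)

def resolve(expression, now):
    """Resolve 'now', 'now-1y', 'now+2w', ... relative to the given clock."""
    parsed = re.fullmatch(r"now(?:([+-])(\d+)([yMwdhHms]))?", expression)
    if parsed is None:
        raise ValueError(f"unsupported expression: {expression}")
    if parsed.group(1) is None:
        return now
    amount = int(parsed.group(2)) * (1 if parsed.group(1) == "+" else -1)
    unit = parsed.group(3)
    if unit == "y":
        return shift_months(now, 12 * amount)
    if unit == "M":
        return shift_months(now, amount)
    return now + timedelta(**{FIXED_UNITS[unit]: amount})

now = datetime(2019, 6, 15, 12, 0, 0)   # assumed clock, for a reproducible result
cutoff = resolve("now-1y", now)          # 2018-06-15 12:00:00

# The two dated products from the bulk request above
product_dates = {1: "2018-01-01", 2: "2019-01-01"}
recent = [pid for pid, day in product_dates.items()
          if datetime.strptime(day, "%Y-%m-%d") >= cutoff]
print(recent)  # [2]
```

Note that `M` (months) and `m` (minutes) differ only by case, which is why the sketch checks the calendar units first.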

2.3.5 Exists query: checking whether a document contains a field

GET /products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

2.3.6 Querying multi-valued fields

POST /movies/_bulk
{"index":{"_id":1}}
{"title":"Father of the Bridge Part II","year":1995,"gener":"Comedy"}
{"index":{"_id":2}}
{"title":"Dave","year":1993,"gener":["Comedy","Romance"]}

2.3.6.1 On multi-valued fields, a term query means contains, not equals

# Term query on a multi-valued field
POST /movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "gener.keyword": "Comedy"
        }
      }
    }
  }
}

Every document whose field contains "Comedy" is returned. What if we want an exact match on a multi-valued field? One solution is to add a genre_count field that stores the number of values. The query itself is given with the bool compound query.

2.3.6.2 Exact matching on a multi-valued field with term queries (if this is unclear, skip it for now)

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

GET /my_index/my_type/_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tag_count" : 1 } }
                    ]
                }
            }
        }
    }
}
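The contains-versus-equals distinction and the count-field workaround can be sketched in a few lines. `term_matches` is a hypothetical helper that mimics how a term query treats a multi-valued field; the two documents are the ones shown above.

```python
def term_matches(doc, field, value):
    """A term query on a multi-valued field matches if ANY of the values equals the input."""
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]
    return value in values

docs = [
    {"tags": ["search"], "tag_count": 1},
    {"tags": ["search", "open_source"], "tag_count": 2},
]

# A plain term query: both documents contain "search"
contains = [doc for doc in docs if term_matches(doc, "tags", "search")]

# bool/must with the count field: only the document whose tags are exactly ["search"]
exact = [doc for doc in docs
         if term_matches(doc, "tags", "search") and term_matches(doc, "tag_count", 1)]

print(len(contains))  # 2
print(exact)          # [{'tags': ['search'], 'tag_count': 1}]
```

The count field has to be maintained by the application at index time; ES does not compute it automatically.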

2.4 Summary

Structured data & structured search
- If scoring is not needed, use Constant Score to convert the query to a filter
Range queries and Date Math
Use the Exists query to handle non-null / NULL values
Exact values & exact lookups on multi-valued fields
- Term queries mean contains, not equals. Be especially careful when querying multi-valued fields

3. Search relevance scoring

3.1 Relevance and relevance scoring

3.2 Term frequency (TF)

3.3 Inverse document frequency (IDF)

3.4 Concept of TF-IDF

3.5 TF-IDF scoring formula in Lucene

3.6 BM25

3.7 Customized Similarity

3.8 Viewing TF-IDF through the Explain API

3.9 Boosting Relevance

3.10 Summary

What relevance is & an introduction to relevance scoring
- TF-IDF / BM25
Customizing the parameters of the relevance algorithm in Elasticsearch
ES lets you set Boosting parameters on indexes and on fields separately

4. `Query` & `Filtering` and multi-string multi-field queries

4.1 `Query Context` & `Filter Context`

Many systems support queries across multiple fields, and search engines usually also offer filters on conditions such as time and price. ES supports this as well. This section introduces advanced queries in ES.

Advanced search in ES: supports multiple text inputs, searching across multiple fields
In ES there are two different contexts, Query and Filter (contexts are covered in more detail later)
- Query Context: the search results are scored for relevance
- Filter Context: the results are not scored, so caching can be used for better performance

4.2 Combination of Conditions

Suppose we need to run the following query:

Find movies whose reviews mention Guitar, whose user rating is higher than 3, and whose release date is between 1993 and 2000.

This search contains three pieces of logic, each on a different field: the review contains Guitar, the user rating is greater than 3, and the release date falls in the given range. All three must hold, and the query should still perform well. How do we do this?

This calls for a compound query in ES: the bool query

4.3 `Boolean query`

A bool query is a combination of one or more query clauses
- There are four kinds of clauses in total. Two of them affect the score and two do not
Relevance is not just the preserve of full-text search. It also applies to yes | no clauses: the more clauses a document matches, the higher its relevance score. When multiple query clauses are merged into one compound query, such as a bool query, the score computed by each clause is combined into the overall relevance score.

Clause	Description
must	Must match. Contributes to the score
should	Optional match. Contributes to the score
must_not	`Filter Context` clause. Must not match
filter	`Filter Context` clause. Must match, but does not contribute to the score

4.3.1 `bool` query syntax

Sub-queries can appear in any order
Multiple queries can be nested
If a bool query has no must (or filter) clause, at least one of the should clauses must match
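As a sketch of the syntax, the Guitar search from section 4.2 could be assembled like this. The field names `review`, `rating`, and `year` are assumptions made for illustration (the course's actual index is not shown here); the clause names `bool`, `must`, `filter`, `match`, and `range` are the real query DSL keys.

```python
import json

def guitar_query():
    """Build a bool query: one scored full-text clause plus two unscored range filters."""
    return {
        "query": {
            "bool": {
                # Query Context: contributes to the relevance score
                "must": [
                    {"match": {"review": "Guitar"}},          # hypothetical field name
                ],
                # Filter Context: yes/no only, no scoring, cacheable
                "filter": [
                    {"range": {"rating": {"gt": 3}}},         # hypothetical field name
                    {"range": {"year": {"gte": 1993, "lte": 2000}}},
                ],
            }
        }
    }

print(json.dumps(guitar_query(), indent=2))
```

Putting the two range conditions under `filter` rather than `must` is the design choice that keeps them out of the score calculation.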

With this in mind, the query in section 2.3.6.2 should be easy to follow

4.3.2 Nested `bool` queries

Nesting is one way to express should_not logic: there is no should_not clause, but a must_not wrapped in a bool query inside should achieves the same effect.
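The should_not idea can be checked with a toy evaluator. This is a sketch under stated assumptions: `matches` is a hypothetical helper that understands only `term` and `bool`, it ignores scoring entirely, and the example query (price is 30, or the product is not unavailable) is an illustration built on the products data from section 2.3.1, not necessarily the course's original example.

```python
def matches(query, doc):
    """Evaluate a tiny subset of the query DSL (term and bool) against one document."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return doc.get(field) == value
    if kind == "bool":
        must = body.get("must", []) + body.get("filter", [])
        should = body.get("should", [])
        must_not = body.get("must_not", [])
        if any(matches(clause, doc) for clause in must_not):
            return False
        if not all(matches(clause, doc) for clause in must):
            return False
        # with no must/filter clause, at least one should clause has to match
        if should and not must:
            return any(matches(clause, doc) for clause in should)
        return True
    raise ValueError(f"unsupported query type: {kind}")

# "should" with a nested bool/must_not acts like a should_not clause
query = {
    "bool": {
        "should": [
            {"term": {"price": 30}},
            {"bool": {"must_not": [{"term": {"avaliable": False}}]}},
        ]
    }
}

products = {
    1: {"price": 10, "avaliable": True},
    2: {"price": 20, "avaliable": True},
    3: {"price": 30, "avaliable": True},
    4: {"price": 10, "avaliable": False},
}
hits = [pid for pid, doc in products.items() if matches(query, doc)]
print(hits)  # [1, 2, 3]
```

Document 4 is the only one excluded: it fails the price clause and also fails the nested must_not.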

4.3.3 The structure of a `bool` query affects the relevance score

Clauses competing at the same level have the same weight
Nesting bool queries changes how much each clause influences the score

4.3.3.1 Controlling field `Boosting`

Boosting is a way to control relevance
- Boosting can be applied to indexes, fields, or query sub-clauses
Meaning of the boost parameter
- When boost > 1, the relative weight of the score increases
- When 0 < boost < 1, the relative weight of the score decreases
- When boost < 0, the clause contributes a negative score (negative boosts were deprecated in ES 6.x and are rejected since 7.0)
Insert data

# Boosting
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}

Query 1 (the title field has the higher boost)

POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple, ipad",
              "boost": 1.1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple, ipad",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

Because the title field has the higher boost, it carries more weight, so document 2 is ranked first: its title contains "Apple iPad" twice.

Query 2 (the content field has the higher boost)

POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple, ipad",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple, ipad",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}

4.4 Summary

Query Context vs Filter Context
Bool Query: combines multiple conditions (similar to multiple conditions after WHERE in SQL)
Query structure and relevance scoring
How to control the precision of a query
- Boosting & Boosting Query (the example is not recorded here; it is available in the video)

5. Single string multi-field query: `Dis Max Query`

5.1 Example of a single-string query

We run a single-string multi-field query on the documents below, matching the single string Brown fox against several fields. In this example, we match the title and content fields.

Let's analyze the documents:

title
- Brown appears only in document 1
content
- Brown appears in document 1
- Brown fox appears in full in document 2, in the same order as the query, so by eye document 2 is the most relevant

# Insert data
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keppping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}

# Query
POST /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"content": "Brown fox"}}
      ]
    }
  }
}

Strangely, document 1 comes first and scores higher than document 2, even though we know document 2 should be more relevant. Why?

5.2 How `should` clauses in a `bool` query are scored

Run both queries in the should clause
Add up the scores of the two queries
Multiply by the number of matching clauses
Divide by the total number of clauses

Analysis: in document 1, both title and content contain a keyword from the query, so both should sub-queries match. Document 2 contains the full query phrase, but only in content and not in title, so only one should sub-query matches. That is why document 1 scores higher than document 2.
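The four steps above can be written out as a short function. This is a sketch of the procedure as described in this section, not of Lucene's internals, and the per-clause scores are made-up numbers chosen only to show the effect.

```python
def bool_should_score(clause_scores):
    """Score = sum of matching clause scores * matching clauses / total clauses.

    clause_scores holds one entry per should clause; None means the clause did not match.
    """
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)

# Illustrative scores in the order [title clause, content clause]
doc1 = [0.5, 0.25]    # weak match in both fields
doc2 = [None, 0.75]   # strong match, but only in content

print(bool_should_score(doc1))  # 0.75
print(bool_should_score(doc2))  # 0.375
```

Even though document 2 has the best single-field match, matching only one of two clauses halves its score, so document 1 wins.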

5.3 The `Disjunction Max Query`

In this example, title and content compete with each other
- Instead of simply adding the scores together, we should find the score of the single best-matching field
Disjunction Max Query
- Any document that matches any of the queries is returned. The score of the best-matching field is used as the final score

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick fox"}},
        {"match": {"content": "Quick fox"}}
      ]
    }
  }
}


Because a Disjunction Max Query uses only the best-matching field's score as the final score, two documents that each match the query only partially can end up with the same score. What happens in that case?

POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keppping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick pets"}},
        {"match": {"content": "Quick pets"}}
      ]
    }
  }
}


Neither document contains an exact match for Quick pets, so their scores should be the same. Let's see how to tell them apart.

5.3.1 The `Tie Breaker` parameter

Tie Breaker is a floating-point number between 0 and 1. 0 means only the best match is used; 1 means all clauses are equally important.
The Disjunction Max Query takes the _score of the best-matching clause
It multiplies the scores of the other matching clauses by the Tie Breaker
It sums and normalizes these scores

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick pets"}},
        {"match": {"content": "Quick pets"}}
      ],
      "tie_breaker": 0.1
    }
  }
}
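A minimal sketch of that calculation: take the best clause score and add the other matching scores scaled by the tie breaker. The normalization step mentioned above is left out, and the per-clause scores are made-up values for illustration.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other matching scores.

    clause_scores holds one entry per sub-query; None means the sub-query did not match.
    """
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)

# Illustrative scores in the order [title clause, content clause]
doc_a = [0.5, None]    # matches in one field only
doc_b = [0.5, 0.25]    # same best score, plus a weaker match in the other field

print(dis_max_score(doc_a), dis_max_score(doc_b))        # 0.5 0.5  (a tie)
print(dis_max_score(doc_b, tie_breaker=0.1) >
      dis_max_score(doc_a, tie_breaker=0.1))              # True
print(dis_max_score(doc_b, tie_breaker=1.0))              # 0.75 (plain sum)
```

With `tie_breaker` at 0 the two documents tie; any positive value lets the extra field break the tie, and 1.0 degenerates into simply adding the clause scores.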

6. Single string multi-field query: `Multi Match`

6.1 Three scenarios for single-string multi-field queries

Best fields (Best Fields)
- Used when fields compete with each other yet are related, such as title and body (covered in the previous section). The score comes from the best-matching field
Most fields (Most Fields)
- When handling English content, a common approach is to use the English analyzer on the main field, extracting stems and adding synonyms to match more documents. The same text is also indexed in a sub-field with the Standard analyzer to provide more precise matches. The other fields act as signals that raise the relevance of matching documents: the more fields that match, the better
Cross fields (Cross Fields)
- For certain entities, such as person names, addresses, or book information, the information has to be identified across multiple fields, and a single field holds only part of the whole. The goal is to find as many of the words as possible in any of the listed fields

6.2 `Multi Match Query` syntax

Best Fields is the default type and may be omitted
Parameters such as minimum_should_match can be passed through to the generated query
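A sketch of what such a request body looks like, built with a small helper. The `multi_match` helper function and the parameter values (`0.2`, `"20%"`) are illustrative assumptions; `multi_match`, `type`, `fields`, `tie_breaker`, and `minimum_should_match` are the actual query DSL keys.

```python
import json

def multi_match(query, fields, match_type="best_fields", **params):
    """Build a multi_match request body; extra keyword arguments are passed through as-is."""
    body = {"query": query, "type": match_type, "fields": list(fields)}
    body.update(params)
    return {"query": {"multi_match": body}}

request = multi_match(
    "Quick pets",
    ["title", "content"],
    tie_breaker=0.2,               # illustrative value
    minimum_should_match="20%",    # illustrative value
)
print(json.dumps(request, indent=2))
```

Because `best_fields` is the default, leaving out `type` in the real request has the same effect as the default argument here.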

6.3 A `most_fields` example with `Multi Match`

6.3.1 Defining the index and inserting data

DELETE titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}

6.3.2 Using a plain `match` query

GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

Reading the documents, the second one is clearly more relevant, yet with a plain match query the first document is ranked first. Why? The mapping uses the english analyzer, which reduces barking dogs and dog barks to the same stems, so both documents match on the same terms, and the first document is shorter, so it scores higher. This case needs some tuning.

6.3.3 Redefining the `Mapping` and inserting data

DELETE titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "english", 
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }  
        }
      }
    }
  }
}

POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}

Looking at the mapping definition:
A sub-field std is added; its type is text and it uses the standard analyzer
The english analyzer tokenizes according to English grammar (stemming), while the standard analyzer does not, which preserves the exact word forms

6.3.4 Querying with `Multi Match`

GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields", 
      "fields": ["title","title.std"]
    }
  }
}

6.3.5 Field weights in `Multi Match`

Use the broad-matching field title to include as many documents as possible, which improves recall, and use the field title.std as a signal to push the more relevant documents to the top of the results
The contribution of each field to the final score can be controlled with a custom boost. For example, boosting title makes that field more important and also reduces the influence of the other signal fields

GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields", 
      "fields": ["title^10","title.std"]
    }
  }
}

6.4 A `cross_fields` (cross-field search) example with `Multi Match`

When we want to query across multiple fields, we might think of using most_fields
most_fields does meet the need to some extent, but not in some special cases. For example, when every query term must appear somewhere in the document but the terms are spread across different fields, most_fields cannot guarantee that, and adding "operator":"and" does not help either, because it then requires all terms to appear within a single field (I am not entirely clear on this point myself; it takes a concrete scenario to understand, so see the example below or search for further reading). We could use copy_to (mentioned earlier), but that requires extra storage space.

In that case we can use cross_fields

6.4.1 Inserting Data

{"street": "5 Poland Street", "city" : "London", "country": "United Kingdom", "postcode": "W1V 3DG" }

6.4.2 Querying with `most_fields`

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": ["street","city","country","postcode"]
    }
  }
}

This can meet our needs.

To require every term, we can try adding "operator": "and" to most_fields. Note that the operator is applied per field, so all terms must then appear in the same field

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "operator": "and", 
      "fields": ["street","city","country","postcode"]
    }
  }
}

However, if we expect all the words in the query text to appear in the document and do not mind which fields they appear in, we can use cross_fields with and

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "operator": "and", 
      "fields": ["street","city","country","postcode"]
    }
  }
}

6.5 Field-centric vs term-centric queries

www.cnblogs.com/jiangtao121…

best_fields
- Suitable for querying the same text across multiple fields
- The score is the highest score among the fields
- tie_breaker (0 ~ 1) adds the scores of the lower-scoring fields to the final score
- best_fields is interchangeable with a dis_max query; ES converts it to a dis_max query internally. Use operator with caution in this query
- minimum_should_match applies within the sub-query of each field

For example:
"query": "Complete Conan Doyle"
"fields": ["title","author","characters"]
"type": "best_fields"
"operator": "and"
is equivalent to:
(+title:complete +title:conan +title:doyle) | (+author:complete +author:conan +author:doyle) | (+characters:complete +characters:conan +characters:doyle)

cross_fields
- Suitable when you expect all the words in the query text to appear in the document, regardless of which fields they appear in
- operator applies to the join between sub-queries
- Use case: information split across fields, such as an address, or a last name and first name. operator is usually set to and

With type cross_fields, the query above is equivalent to:
+(title:complete author:complete characters:complete) +(title:conan author:conan characters:conan) +(title:doyle author:doyle characters:doyle)
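The difference between the two expansions is only the nesting order: best_fields groups by field (field-centric), cross_fields groups by term (term-centric). A small sketch that generates both strings, assuming a lowercase-and-split analyzer; the helper names are hypothetical and the output is a readable Lucene-style rendering, not what ES literally executes.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def best_fields_and(query, fields):
    """Field-centric: every term must appear within the SAME field; fields are alternatives."""
    terms = analyze(query)
    groups = ["(" + " ".join(f"+{field}:{term}" for term in terms) + ")"
              for field in fields]
    return " | ".join(groups)

def cross_fields_and(query, fields):
    """Term-centric: every term must appear in AT LEAST ONE of the fields."""
    terms = analyze(query)
    groups = ["+(" + " ".join(f"{field}:{term}" for field in fields) + ")"
              for term in terms]
    return " ".join(groups)

fields = ["title", "author", "characters"]
print(best_fields_and("Complete Conan Doyle", fields))
print(cross_fields_and("Complete Conan Doyle", fields))
```

Swapping the two loops is all it takes to move from one form to the other, which is why the same `operator` setting behaves so differently between the two types.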

most_fields
- Useful for retrieving documents that contain the same text in several places, analyzed differently in each
- operator is usually set to or. ES converts it to a bool query internally
- Use case: multi-language processing

7. `Search Template` and `Index Alias` queries

7.1 `Search Template`: decoupling the program from the search `DSL`

Parameterize the query so that everyone can focus on their own job: you write the business logic, and I optimize the DSL
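The idea of a parameterized query can be sketched locally. ES search templates are written in Mustache and rendered on the server; the code below is only a toy stand-in that handles plain `{{name}}` placeholders, and the template text, the `render` helper, and the parameter names `q` and `size` are all assumptions made for illustration.

```python
import json
import re

# The DSL owner maintains the template; the application only supplies parameters
TEMPLATE = '{"query": {"match": {"title": "{{q}}"}}, "size": {{size}}}'

def render(template, params):
    """Fill {{name}} placeholders and parse the result as JSON."""
    def substitute(found):
        value = params[found.group(1)]
        if isinstance(value, str):
            # escape the string for JSON but drop the surrounding quotes,
            # because the template already supplies them
            return json.dumps(value)[1:-1]
        return json.dumps(value)
    return json.loads(re.sub(r"\{\{(\w+)\}\}", substitute, template))

print(render(TEMPLATE, {"q": "barking dogs", "size": 5}))
# {'query': {'match': {'title': 'barking dogs'}}, 'size': 5}
```

The application code never sees the query structure, so the template can be tuned (for example, switched to a multi_match) without touching the business logic.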

7.2 `Index Alias`: zero-downtime operations

We can create aliases for indexes
For example, we create a new index every day, but when reading or writing we want to address a single index name. An alias makes this possible
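As a sketch, switching an alias from yesterday's index to today's can be expressed as one request body for the `_aliases` endpoint, with a `remove` and an `add` action sent together. The helper function, the alias name, and the index names below are hypothetical; only the `actions` / `add` / `remove` structure comes from the aliases API.

```python
import json

def swap_alias_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names, for illustration only
body = swap_alias_actions("logs-current", "logs-2019-01-01", "logs-2019-01-02")
print(json.dumps(body, indent=2))
```

Clients keep reading from `logs-current` throughout, so the daily index rollover needs no application change or downtime.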
