Part 4: In-depth search

1. Term-based and full-text search

1.1 Term-based queries

  • Why terms matter
    • A term is the smallest unit that carries meaning. Both search and natural language processing with statistical language models operate on terms
  • Characteristics
    • Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
    • In ES, a term query does not analyze its input. The input is treated as a single term and looked up exactly in the inverted index, and every document containing that term is scored with the relevance formula. For example, "Apple Store" is looked up as one term
    • A query can be converted to a filter with Constant Score, which skips scoring and allows caching to improve performance

1.2 Term query examples

1.2.1 Inserting data

# Term query examples; think about what each one returns
POST /products/_bulk
{"index":{"_id":1}}
{"productID":"XHDK-A-1293-#fJ3","desc":"iPhone"}
{"index":{"_id":2}}
{"productID":"KDKE-B-9947-#kL5","desc":"iPad"}
{"index":{"_id":3}}
{"productID":"JODL-X-1937-#pV7","desc":"MBP"}

GET /products

1.2.2 Example 1

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iPhone"
      }
    }
  }
}

This returns nothing. Why? With a term query, ES does not process the search input at all, so we are searching for "iPhone" with an uppercase P. When a text field is indexed, however, ES analyzes it by default and lowercases the tokens, so the index contains "iphone" rather than "iPhone". That is why no document is found.

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iphone"
      }
    }
  }
}

Now the document is returned.

1.2.3 Example 2

POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

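This returns nothing either. Why? The term query looks up XHDK-A-1293-#fJ3 as a single term, but productID is a text field, so its value was split into tokens and lowercased at index time. The _analyze API shows how: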

POST /_analyze
{
  "analyzer": "standard",
  "text": ["XHDK-A-1293-#fJ3"]
}

The result is:

{
  "tokens" : [
    {
      "token" : "xhdk",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "1293",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "fj3",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

We can find the document like this:

POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "xhdk" 
      }
    }
  }
}

The lowercase xhdk matches one of the tokens produced by analysis, so the document is returned. But how do we match the whole value exactly?

To match XHDK-A-1293-#fJ3 as a whole, query the productID.keyword sub-field instead:

POST /products/_search
{
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

For an exact match, use the multi-field feature of ES: with dynamic mapping, a text field gets a keyword sub-field by default, and the keyword sub-field holds the original value unanalyzed, so it supports exact matching.
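
The three results above (no hit for "iPhone", a hit for "xhdk", a hit on productID.keyword) can be reproduced with a toy inverted index. This is only a sketch: `standard_like_analyze` is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, which happens to agree with the _analyze output shown above but is not the real standard analyzer.

```python
import re
from collections import defaultdict


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, field, analyzed):
    """Map every indexed term of `field` to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # A text field stores analyzed tokens; a keyword field stores the raw value.
        terms = standard_like_analyze(doc[field]) if analyzed else [doc[field]]
        for term in terms:
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """A term query looks the input up as-is: no lowercasing, no splitting."""
    return sorted(index.get(value, set()))


docs = {
    1: {"productID": "XHDK-A-1293-#fJ3", "desc": "iPhone"},
    2: {"productID": "KDKE-B-9947-#kL5", "desc": "iPad"},
    3: {"productID": "JODL-X-1937-#pV7", "desc": "MBP"},
}

desc_text = build_index(docs, "desc", analyzed=True)
product_text = build_index(docs, "productID", analyzed=True)
product_keyword = build_index(docs, "productID", analyzed=False)

print(term_query(desc_text, "iPhone"))                   # []
print(term_query(desc_text, "iphone"))                   # [1]
print(term_query(product_text, "XHDK-A-1293-#fJ3"))      # []
print(term_query(product_text, "xhdk"))                  # [1]
print(term_query(product_keyword, "XHDK-A-1293-#fJ3"))   # [1]
```

The only difference between the text index and the keyword index is whether the value passes through the analyzer before it is stored, which is the distinction the mapping controls.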

1.3 Compound query: converting a query to a filter with Constant Score

A term query still returns a score. What if we want to skip scoring?

  • Convert the query to a filter to skip the TF-IDF calculation and avoid the cost of relevance scoring
  • Filters can make effective use of caching

1.4 Full-text Query

  • Full-text queries
    • Match Query / Match Phrase Query / Query String Query
  • Characteristics
    • Text is analyzed both at index time and at search time: the query string is passed to an appropriate analyzer, which produces the list of terms to search for
    • At query time the input is first split into terms, each term is looked up in the underlying index, and the results are merged, with a score computed for each document. For example, a search for "Matrix reloaded" returns every document that contains matrix or reloaded
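
As a rough illustration of the analyze-then-merge behaviour described above, the sketch below splits the query into terms and keeps every document that contains any of them (or all of them when the operator is "and"). The `analyze` helper and the movie titles are hypothetical, and scoring is left out entirely.

```python
import re


def analyze(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_query(docs, field, query, operator="or"):
    """Return ids of documents whose analyzed field contains any (or all) query terms."""
    query_terms = set(analyze(query))
    combine = all if operator == "and" else any
    hits = []
    for doc_id, doc in docs.items():
        doc_terms = set(analyze(doc[field]))
        if combine(term in doc_terms for term in query_terms):
            hits.append(doc_id)
    return hits


movies = {
    1: {"title": "The Matrix"},
    2: {"title": "The Matrix Reloaded"},
    3: {"title": "Reloaded and Ready"},
    4: {"title": "Toy Story"},
}

print(match_query(movies, "title", "Matrix reloaded"))                  # [1, 2, 3]
print(match_query(movies, "title", "Matrix reloaded", operator="and"))  # [2]
```

Unlike the term query, the input here goes through the same analyzer as the indexed text, which is why the capital M in "Matrix" does not prevent a match.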

1.5 The Match Query process

1.6 Summary

  • Term-based search vs full-text search
  • The field mapping controls how a field is analyzed
    • Text vs Keyword
  • Query parameters control Precision & Recall
  • Compound query: the Constant Score query
    • Even a term query on a keyword field computes a score
    • A query can be converted to a filter to skip relevance scoring and improve performance

2. Structured search

2.1 Structured Data

2.2 Structured search in ES

  • Structured data such as booleans, times, dates, and numbers has a precise format that we can operate on logically, for example comparing ranges of numbers or times, or determining which of two values is larger
  • Structured text can be matched exactly or partially
    • Term query / Prefix query
  • A structured search gives a yes-or-no answer for each document
    • Depending on the scenario, you can decide whether a structured search needs scoring

2.3 Examples

2.3.1 Inserting data

# Structured search
DELETE products

POST /products/_bulk
{"index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2018-01-01","productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":2}}
{"price":20,"avaliable":true,"date":"2019-01-01","productID":"KDKE-B-9947-#kL5"}
{"index":{"_id":3}}
{"price":30,"avaliable":true,"productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":4}}
{"price":10,"avaliable":false,"productID":"XHDK-A-1293-#fJ3"}

GET products/_mapping

2.3.2 A term query on a boolean field computes a score

POST /products/_search
{
  "profile": "true",
  "query": {
    "term": {
      "avaliable": true
    }
  }
}

2.3.3 Converting the boolean term query to a filter with constant score, which skips scoring

POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

2.3.4 Numeric range query

GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}

2.3.4 Date range query

GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}

now-1y means the current time minus one year (now = the current time, y = year, 1y = one year). The available units are:

Unit    Meaning
y       years
M       months
w       weeks
d       days
H/h     hours
m       minutes
s       seconds
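
To make the unit table concrete, here is a small sketch that resolves expressions such as now-1y against a given clock. It is an illustration, not ES's parser: the `resolve` and `shift_months` helpers are hypothetical, only a single offset is handled (no rounding and no chained offsets), and clamping to the last day of the month when the target month is shorter is an assumption of this sketch.

```python
import calendar
import re
from datetime import datetime, timedelta

# Units with a fixed length map directly onto timedelta keyword arguments.
FIXED_UNITS = {"w": "weeks", "d": "days", "h": "hours",
               "H": "hours", "m": "minutes", "s": "seconds"}


def shift_months(moment, months):
    """Move a datetime by whole months, clamping the day to the month's length."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def resolve(expression, now):
    """Resolve 'now', 'now-1y', 'now+2M', ... relative to the given `now`."""
    match = re.fullmatch(r"now(?:([+-])(\d+)([yMwdhHms]))?", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    sign, amount, unit = match.groups()
    if sign is None:
        return now
    delta = int(amount) * (1 if sign == "+" else -1)
    if unit == "y":
        return shift_months(now, 12 * delta)
    if unit == "M":
        return shift_months(now, delta)
    return now + timedelta(**{FIXED_UNITS[unit]: delta})


clock = datetime(2019, 6, 15, 12, 0, 0)
print(resolve("now-1y", clock))   # 2018-06-15 12:00:00
print(resolve("now-1d", clock))   # 2019-06-14 12:00:00
```

Note that M (months) and m (minutes) differ only by case, which is why the unit is matched case-sensitively.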

2.3.5 Exists query: handling documents with a missing field

GET /products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

2.3.6 Multi-valued field queries

POST /movies/_bulk
{"index":{"_id":1}}
{"title":"Father of the Bridge Part II","year":1995,"gener":"Comedy"}
{"index":{"_id":2}}
{"title":"Dave","year":1993,"gener":["Comedy","Romance"]}

2.3.6.1 On multi-valued fields, a term query means "contains", not "equals"

# Term query
POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "gener.keyword": "Comedy"
        }
      }
    }
  }
}

All documents that contain "Comedy" are returned. What if we want an exact match on a multi-valued field? One solution is to add a genre_count field that stores the number of values and to combine the two conditions in a bool query, which is covered in the section on bool queries.

2.3.6.2 Exact matching on a multi-valued field with term queries (if this is unclear, skip it for now)
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

GET /my_index/my_type/_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tag_count" : 1 } }
                    ]
                }
            }
        }
    }
}
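
The "contains, not equals" behaviour and the count-field workaround can be imitated in a few lines. This is a sketch over plain dictionaries rather than a real index; `term_matches` is a hypothetical helper, and the two documents are the ones shown above.

```python
def term_matches(doc, field, value):
    """Term semantics on a possibly multi-valued field: true if ANY stored value equals `value`."""
    stored = doc.get(field)
    values = stored if isinstance(stored, list) else [stored]
    return value in values


docs = [
    {"tags": ["search"], "tag_count": 1},
    {"tags": ["search", "open_source"], "tag_count": 2},
]

# A single term clause behaves like "contains": both documents match.
contains = [d for d in docs if term_matches(d, "tags", "search")]

# Adding the count clause (the two "must" conditions) leaves only the exact match.
exact = [d for d in docs
         if term_matches(d, "tags", "search") and term_matches(d, "tag_count", 1)]

print(len(contains))  # 2
print(exact)          # [{'tags': ['search'], 'tag_count': 1}]
```

The count field has to be maintained by whatever writes the documents, since the index itself does not track how many values a field holds.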

2.4 Summary

  • Structured data & structured search
    • If scoring is not needed, convert the query to a filter with Constant Score
  • Range queries and Date Math
  • Use the Exists query to handle missing (NULL) values
  • Exact values & exact matching on multi-valued fields
    • Term queries mean "contains", not "equals". Take special care when querying multi-valued fields

3. Search relevance scoring

3.1 Relevance and the relevance score

3.2 Term frequency (TF)

3.3 Inverse document frequency (IDF)

3.4 Concept of TF-IDF

3.5 TF-IDF scoring formula in Lucene

3.6 BM25

3.7 Customized Similarity

3.8 Viewing TF-IDF through the Explain API

3.9 Boosting Relevance

3.10 Summary

  • What relevance and the relevance score are
    • TF-IDF / BM25
  • Customizing the parameters of the relevance algorithm in Elasticsearch
  • ES lets you set boosting at the index level and at the field level
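
For reference, here is one common textbook formulation of the BM25 score for a single term, written as a sketch. The function name and argument names are hypothetical, k1 = 1.2 and b = 0.75 are the usual default parameters, and the constant factors in Lucene's actual implementation may differ from this form.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score of one term in one document under a textbook BM25 formulation."""
    # IDF: rarer terms (small docs_with_term) get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating term frequency: extra occurrences help less and less.
    tf = freq * (k1 + 1) / (freq + norm)
    return idf * tf


once = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, doc_count=100, docs_with_term=5)
many = bm25_term_score(freq=50, doc_len=10, avg_doc_len=10, doc_count=100, docs_with_term=5)
print(once < many)  # True: more occurrences score higher, but the gain saturates
```

Compared with plain TF-IDF, the main practical difference is the saturation: the term-frequency part can never exceed k1 + 1, no matter how often the term repeats.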

4. Query & Filtering and multi-string multi-field queries

4.1 Query Context & Filter Context

Many systems support searching across multiple fields, and search engines generally offer filters on things like time and price as well. ES supports this too; the following introduces advanced queries in ES.

  • Advanced search in ES: supports multiple text inputs searched against multiple fields
  • ES has two different contexts, Query and Filter (contexts are covered in more detail later)
    • Query Context: the search results are scored for relevance
    • Filter Context: the results are not scored, so caching can be used for better performance

4.2 Combination of Conditions

Suppose we need to run the following search:

  • Find movies whose reviews mention Guitar, with a user rating higher than 3 and a release date between 1993 and 2000.

This search contains three pieces of logic, each on a different field: reviews that mention Guitar, a user rating greater than 3, and a release date in a given range. All three must be satisfied, and the search should still perform well. How do we do this?

This calls for a compound query in ES: the bool query.

4.3 Boolean query

  • A bool query is a combination of one or more query clauses
    • There are four kinds of clause in total. Two of them affect the score and two do not;
  • Relevance is not just the preserve of full-text search. It applies equally to yes/no clauses: the more clauses that match, the higher the relevance score. If multiple query clauses are merged into one compound query, such as a bool query, the scores computed by each clause are combined into the overall relevance score.

Clause      Description
must        Must match. Contributes to the score
should      Optional match. Contributes to the score
must_not    Filter Context. Must not match; does not contribute to the score
filter      Filter Context. Must match, but does not contribute to the score

4.3.1 bool query syntax

  • The sub-queries of a bool query can appear in any order
  • Multiple queries can be nested
  • If a bool query has no must (or filter) clause, at least one of its should clauses must match
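
As a sketch of how the movie search from section 4.2 could be phrased with these clauses, the request body below is built as a Python dictionary and printed as JSON. The field names review, rating, and release_year are assumptions made for illustration; the original does not show the mapping.

```python
import json

# Hypothetical field names: "review", "rating", "release_year".
bool_query = {
    "query": {
        "bool": {
            # Query context: the full-text condition should influence the score.
            "must": [
                {"match": {"review": "Guitar"}},
            ],
            # Filter context: yes/no conditions that need no scoring and can be cached.
            "filter": [
                {"range": {"rating": {"gt": 3}}},
                {"range": {"release_year": {"gte": 1993, "lte": 2000}}},
            ],
        }
    }
}

print(json.dumps(bool_query, indent=2))
```

Putting the rating and date conditions under filter rather than must is the design choice that addresses the "good performance" requirement: they still restrict the results but skip the relevance calculation.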

With this in mind, section 2.3.6.2 should now be easy to follow.

4.3.2 Nested bool queries

  • Nesting is one way to express a "should_not": there is no should_not clause, but placing a bool query that contains must_not inside a should clause achieves the same logic.
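
A possible shape for such a nested query is sketched below as a Python dictionary. The fields price, productID.keyword, and avaliable come from the products example in section 2.3; the particular values chosen here are only illustrative.

```python
import json

nested_bool_query = {
    "query": {
        "bool": {
            "must": {"term": {"price": 30}},
            "should": [
                {"term": {"productID.keyword": "JODL-X-1937-#pV7"}},
                # "should_not": an inner bool whose only clause is must_not.
                # Documents that are NOT unavailable satisfy this should clause.
                {"bool": {"must_not": {"term": {"avaliable": False}}}},
            ],
        }
    }
}

print(json.dumps(nested_bool_query, indent=2))
```

The inner bool matches every document except those with avaliable set to false, so wrapping it in should turns "not unavailable" into an optional condition instead of a hard exclusion.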

4.3.3 The structure of a bool query affects the relevance score

  • Clauses at the same level carry the same weight;
  • Nesting bool queries changes how much each clause contributes to the score;
4.3.3.1 Controlling field boosting
  • Boosting is a way of controlling relevance
    • Boosting can be applied to an index, a field, or a sub-clause of a query
  • Meaning of the boost parameter
    • When boost > 1, the clause's relative weight in the score increases;
    • When 0 < boost < 1, the clause's relative weight in the score decreases;
    • When boost < 0, the clause contributes a negative score (recent ES versions no longer accept negative boost values);
  • Insert data
# Boosting
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}
  • Query 1 (the title field has the higher boost)
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple, ipad",
              "boost": 1.1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple, ipad",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

Because the title field has the higher boost, it carries more weight, so document 2 is ranked first: its title contains "Apple iPad" twice.

  • Query 2 (the content field has the higher boost)
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple, ipad",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple, ipad",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}

4.4 Summary

  • Query Context vs Filter Context
  • Bool query: combining multiple conditions (similar to a SQL WHERE followed by several conditions)
  • Query structure and the relevance score
  • How to control the precision of a query
    • Boosting & Boosting Query (the example is not written up here; see the video)

5. Single-string multi-field queries: Dis Max Query

5.1 Example of a single-string query

We run a single-string multi-field query against the documents below: the single string "Brown fox" is matched against multiple fields, in this example title and content.

Let's analyze the documents:

  • title
    • Brown appears only in document 1
  • content
    • Brown appears in document 1
    • "Brown fox" appears in full in document 2, in the same order as the query, so document 2 looks like the most relevant

# Insert data
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keeping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}

# Query
POST /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"content": "Brown fox"}}
      ]
    }
  }
}

Strangely, document 1 comes first and scores higher than document 2, even though document 2 should clearly be more relevant. Why?

5.2 How a bool query scores its should clauses

  • Run the two queries in the should clause
  • Add up the scores of the two queries
  • Multiply by the number of matching clauses
  • Divide by the total number of clauses

Analysis: in document 1 both title and content contain a keyword from the query, so both sub-queries of should match. Document 2 contains the query keywords exactly, but only in content and not in title, so only one sub-query of should matches. That is why document 1 scores higher than document 2.
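
The four steps can be written out directly. The per-clause scores below are made-up numbers chosen only to mirror the analysis, and `bool_should_score` is a hypothetical helper. The multiply-and-divide steps correspond to a coordination factor from the classic similarity; newer Lucene versions may simply sum the clause scores, which here would still leave document 1 ahead.

```python
def bool_should_score(clause_scores):
    """Combine should-clause scores following the four steps above.

    `clause_scores` holds one entry per should clause; None means "did not match".
    """
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    total = sum(matched)                                 # add up the matching scores
    return total * len(matched) / len(clause_scores)     # matched count / clause count


# Hypothetical clause scores in the order [title, content].
doc1 = bool_should_score([0.6, 0.3])    # "brown" found in both fields
doc2 = bool_should_score([None, 0.8])   # "brown fox" found only in content

print(doc1 > doc2)  # True: two weak matches outrank one strong match
```

Even though document 2 has the single best field score, the sum over two matching clauses wins, which is exactly the ranking problem the next section addresses.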

5.3 Disjunction Max Query

  • In this example, title and content compete with each other
    • Instead of simply adding the scores together, we should use the score of the single best-matching field
  • Disjunction Max Query
    • Any document that matches any of the queries is returned. The score of the best-matching field is used as the final score
POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick fox"}},
        {"match": {"content": "Quick fox"}}
      ]
    }
  }
}


Because a Disjunction Max Query uses only the best-matching field's score as the final score, two documents whose best fields match equally well end up with the same score, even when neither matches the query exactly. What happens in that case?

POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keeping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick pets"}},
        {"match": {"content": "Quick pets"}}
      ]
    }
  }
}


Neither document contains an exact match for "Quick pets", so their scores should come out the same. Let's take a look.

5.3.1 The tie_breaker parameter

  • tie_breaker is a floating-point number between 0 and 1. 0 means only the best match is used; 1 means all clauses are equally important
  • Disjunction Max Query takes the _score of the best-matching clause
  • The scores of the other matching clauses are multiplied by tie_breaker
  • The resulting scores are summed and normalized

POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick pets"}},
        {"match": {"content": "Quick pets"}}
      ],
      "tie_breaker": 0.1
    }
  }
}
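
The effect of tie_breaker can be sketched with a few lines of arithmetic: take the best clause score and add the other matching scores scaled by tie_breaker. The clause scores are made-up values, `dis_max_score` is a hypothetical helper, and the normalization step mentioned above is left out of this sketch.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the remaining matching scores.

    `clause_scores` holds one entry per sub-query; None means "did not match".
    """
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Hypothetical scores for the "Quick pets" query, in the order [title, content].
doc1 = [0.7, None]   # "quick" in title only
doc2 = [0.7, 0.4]    # "pets" in title, "quick" in content

print(dis_max_score(doc1) == dis_max_score(doc2))            # True: a tie without tie_breaker
print(dis_max_score(doc2, 0.1) > dis_max_score(doc1, 0.1))   # True: the second field breaks the tie
```

With tie_breaker at 1 the formula degenerates into a plain sum of the matching scores, which is why 1 is described as treating all clauses as equally important.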

6. Single-string multi-field queries: Multi Match

6.1 Three scenarios for single-string multi-field queries

  • Best fields (Best Fields)
    • Used when fields compete with each other yet are related, such as title and body (see the previous section). The score comes from the best-matching field
  • Most fields (Most Fields)
    • When processing English content, a common approach is to use the English analyzer on the main field, which extracts stems and adds synonyms to match more documents, and to add a sub-field with the same text that uses the Standard analyzer for more precise matching. The other fields act as signals that raise the relevance of matching documents: the more fields that match, the better
  • Cross fields (Cross Fields)
    • For certain entities, such as person names, addresses, or book information, the information has to be determined from multiple fields, and a single field is only one part of the whole. The aim is to find as many of the query words as possible across all the listed fields

6.2 Multi Match Query syntax

  • best_fields is the default type and need not be specified
  • Parameters such as minimum_should_match are passed down to the generated queries

6.3 A most_fields example with Multi Match

6.3.1 Defining the index and inserting data

DELETE titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}

6.3.2 Using an ordinary match query

GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

Looking at the document contents, the second document is clearly more relevant, yet with an ordinary match query the first document comes first. Why? The mapping uses the english analyzer, which stems "barks" and "barking" to bark and "dogs" to dog, so both documents match the same terms, and because the first document is shorter it scores higher. We need to optimize for this situation.

6.3.3 Redefining the mapping and inserting data

DELETE titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "english", 
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }  
        }
      }
    }
  }
}

POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}

  1. Let's analyze our mapping definition
  2. A sub-field std is added; its type is text and it uses the standard analyzer
  3. The english analyzer tokenizes and stems according to English grammar, while the standard analyzer does no stemming, so the std sub-field keeps the exact word forms

6.3.4 Querying with Multi Match

GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields", 
      "fields": ["title","title.std"]
    }
  }
}

6.3.5 Field weights in Multi Match

  • Use the broadly matching field title to include as many documents as possible, which improves recall, while using the field title.std as a signal to push the more relevant documents to the top of the results
  • The contribution of each field to the final score can be controlled with a custom boost. For example, boosting title makes it more important, which also reduces the influence of the other signal fields
GET /titles/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "type": "most_fields", 
      "fields": ["title^10","title.std"]
    }
  }
}

6.4 A cross_fields (cross-field search) example with Multi Match

  1. When we want to search across multiple fields, we might think of using most_fields
  2. Indeed, most_fields meets the need to some extent, but not in some special cases. For example, when the terms we are searching for are spread across several fields, most_fields cannot guarantee that all of them are matched, and adding "operator": "and" does not help either (I am not entirely clear on this point myself; see the example below. It really needs a concrete scenario to understand, and you can search online for more). We could use copy_to (mentioned earlier), but that requires extra storage space;

At this point we can use cross_fields

6.4.1 Inserting data

{"street": "5 Poland Street", "city": "London", "country": "United Kingdom", "postcode": "W1V 3DG"}

6.4.2 Querying with most_fields

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": ["street","city","country","postcode"]
    }
  }
}

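This meets our needs.

If we want every query term to be required, we might try adding "operator": "and" to most_fields. With most_fields, however, the operator is applied within each field, so all the terms would have to appear in one and the same field: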

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "operator": "and", 
      "fields": ["street","city","country","postcode"]
    }
  }
}

If instead we expect all the words in the query text to appear in the document and do not mind which fields they appear in, we can use cross_fields with and:

POST address/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "operator": "and", 
      "fields": ["street","city","country","postcode"]
    }
  }
}
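
The difference between the last two queries can be sketched as two ways of nesting "all" and "any". This is a simplification with hypothetical helper names and a toy analyzer; it ignores scoring and any grouping of fields that the real cross_fields type performs.

```python
import re


def analyze(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def field_centric_and(doc, fields, query):
    """most_fields/best_fields with 'and': ALL terms must appear in ONE field."""
    terms = analyze(query)
    return any(all(term in analyze(doc[field]) for term in terms) for field in fields)


def term_centric_and(doc, fields, query):
    """cross_fields with 'and': EACH term must appear in SOME field."""
    terms = analyze(query)
    return all(any(term in analyze(doc[field]) for field in fields) for term in terms)


address = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG",
}
fields = ["street", "city", "country", "postcode"]

print(field_centric_and(address, fields, "Poland Street W1V"))  # False
print(term_centric_and(address, fields, "Poland Street W1V"))   # True
```

No single field contains poland, street, and w1v together, so the field-centric check fails, while the term-centric check finds each word somewhere in the address.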

6.5 Field-centric vs term-centric queries

www.cnblogs.com/jiangtao121…

  1. best_fields
    • Suited to searching the same text across multiple fields;
    • The score is the highest score among the fields;
    • tie_breaker (0 ~ 1) adds the scores of the lower-scoring fields to the final score;
    • best_fields is interchangeable with a dis_max query; ES converts it to a dis_max query internally. Use operator with caution in this query;
    • minimum_should_match applies inside the sub-query of each field.
For example:
  "query": "Complete Conan Doyle"
  "fields": ["title","author","characters"]
  "type": "best_fields"
  "operator": "and"
is equivalent to:
  (+title:complete +title:conan +title:doyle) | (+author:complete +author:conan +author:doyle) | (+characters:complete +characters:conan +characters:doyle)
  2. cross_fields
    • Suited to cases where you expect all the words in the query text to appear in the document, regardless of which fields they appear in;
    • operator applies to the join between the per-term sub-queries;
    • Typical scenario: information is split across fields, such as address, last name, and first name. Most of the time operator is and.
With type cross_fields, the query above is equivalent to:
  +(title:complete author:complete characters:complete) +(title:conan author:conan characters:conan) +(title:doyle author:doyle characters:doyle)
  3. most_fields
    • Useful for retrieving documents that contain the same text in several places, where each place is processed differently by the underlying analysis;
    • Most of the time operator is or. ES converts it to a bool query internally;
    • Typical scenario: multi-language processing.

7. Search Template and Index Alias queries

7.1 Search Template: decoupling the application from the search DSL

  • Parameterize the query so that everyone can focus on their own job: you write the business logic and I optimize the DSL
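
The idea of parameterizing a query can be imitated with a toy substitution step. Real search templates are written in Mustache and stored and rendered by ES itself; the `render` function below is a hypothetical stand-in that only replaces {{name}} placeholders and does no escaping.

```python
import json
import re

# The DSL lives in one place and can be tuned without touching application code.
TEMPLATE = '{"query": {"match": {"{{field}}": "{{text}}"}}, "size": {{size}}}'


def render(template, params):
    """Replace each {{name}} placeholder with the matching parameter value."""
    def substitute(match):
        return str(params[match.group(1)])
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# The application only supplies parameters.
params = {"field": "title", "text": "barking dogs", "size": 5}
rendered = json.loads(render(TEMPLATE, params))
print(rendered)
# {'query': {'match': {'title': 'barking dogs'}}, 'size': 5}
```

The application code never sees the query structure, so the person tuning relevance can change the template (for example, switching match to multi_match) without a code release.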

7.2 Index Alias: zero-downtime operations

  • We can create aliases for indexes
  • For example, we may create a new index every day but want reads and writes to go through a single name; an alias makes this possible
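
The daily-index scenario can be sketched as a small alias table that is updated by a list of actions. The action shape (add / remove with index and alias) mirrors the body of the aliases API, but `apply_actions` and the index names are hypothetical and nothing here talks to a cluster.

```python
def apply_actions(aliases, actions):
    """Apply add/remove actions to a copy of the alias table and return the copy.

    Working on a copy means a failing action leaves the original table untouched,
    imitating an all-or-nothing alias swap.
    """
    updated = {name: set(indices) for name, indices in aliases.items()}
    for action in actions:
        (kind, body), = action.items()
        targets = updated.setdefault(body["alias"], set())
        if kind == "add":
            targets.add(body["index"])
        elif kind == "remove":
            targets.discard(body["index"])
        else:
            raise ValueError(f"unsupported action: {kind}")
    return updated


# Hypothetical names: readers always query "logs_today".
aliases = {"logs_today": {"logs-2019-01-01"}}
actions = [
    {"remove": {"index": "logs-2019-01-01", "alias": "logs_today"}},
    {"add": {"index": "logs-2019-01-02", "alias": "logs_today"}},
]

aliases = apply_actions(aliases, actions)
print(aliases)  # {'logs_today': {'logs-2019-01-02'}}
```

Because clients only ever use the alias name, switching it to the new index requires no change or restart on their side.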