Similarity Module of Elasticsearch 7.10

A similarity (scoring/ranking model) defines how matching documents are scored. Similarity is set per field, so the mapping can define a different similarity for each field.

Configuring a custom similarity is considered an expert feature. The built-in similarities are most likely sufficient, as described in the similarity mapping parameter documentation.

Configuring a similarity

Most existing or custom similarities have configuration options that can be set through the index settings, as shown below. These options can be provided when creating an index or when updating index settings.

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}

Here we configure the DFR similarity so that it can be referenced as my_similarity in mappings, as shown in the following example:

PUT /index/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}

Available similarities

BM25 similarity (default)

A TF/IDF-based similarity that has built-in tf normalization and is expected to work better for short fields (for example, names). For more details, see Okapi_BM25. Type name: BM25. This similarity has the following options:

k1: Controls non-linear term frequency normalization (saturation). The default value is 1.2.
b: Controls the degree to which document length normalizes tf values. The default value is 0.75.
discount_overlaps: Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing norms. The default is true, which means that overlap tokens do not count when computing norms.
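To make the roles of k1 and b concrete, here is a minimal sketch of the textbook Okapi BM25 formula in Python. It is an illustration only: the function name and the sample statistics are made up for this example, and Lucene's actual implementation differs in details (for instance, how document lengths are encoded), so the numbers will not match Elasticsearch scores exactly.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one term in one document (illustrative)."""
    # Rarer terms (small doc_freq relative to doc_count) get a larger idf.
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b blends between no length normalization (b=0) and full normalization (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score.
    tf_part = freq * (k1 + 1.0) / (freq + k1 * length_norm)
    return idf * tf_part


# Saturation: every additional occurrence adds less than the previous one.
scores = [bm25_term_score(f, doc_len=10, avg_doc_len=10, doc_count=100, doc_freq=5)
          for f in (1, 2, 3, 4)]
gains = [hi - lo for lo, hi in zip(scores, scores[1:])]
print(gains)
```

With b greater than 0, the same term frequency scores lower in a longer document; with b set to 0, document length has no effect in this sketch.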

DFR similarity

A similarity that implements the divergence from randomness framework. Type name: DFR. This similarity has the following options:

basic_model: Possible values: g, if, in, and ine.
after_effect: Possible values: b and l.
normalization: Possible values: no, h1, h2, h3, and z.

All normalization options except the first (no) require a normalization value, such as normalization.h2.c in the configuration example above.

DFI similarity

A similarity that implements the divergence from independence model. Type name: DFI. This similarity has the following options:

independence_measure: Possible values: standardized, saturated, and chisquared.

When using this similarity, it is strongly recommended not to remove stop words in order to get good relevance. Also note that terms whose frequency is lower than the expected frequency receive a score of 0.

IB similarity

Information-based model. The algorithm is based on the concept that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements. For written texts, this challenge corresponds to comparing the writing styles of different authors. Type name: IB. This similarity has the following options:

distribution: Possible values: ll and spl.
lambda: Possible values: df and ttf.
normalization: Same as in the DFR similarity.

LM Dirichlet similarity

LM Dirichlet similarity. Type name: LMDirichlet. This similarity has the following options:

mu: The default value is 2000.

The scoring formula in the paper assigns negative scores to terms that occur less often than predicted by the language model. Negative scores are illegal in Lucene, so such terms get a score of 0.
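The clamping to 0 can be sketched in Python as follows. The formula used here is an assumption about the general shape of Dirichlet-smoothed scoring, not something stated on this page, and the function name and sample statistics are hypothetical.

```python
import math


def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2000.0):
    """Dirichlet-smoothed term score, clamped at 0 (assumed formula)."""
    # collection_prob: probability of the term in the whole collection (assumed input).
    raw = (math.log(1.0 + freq / (mu * collection_prob))
           + math.log(mu / (doc_len + mu)))
    # A negative raw score means the term occurs less often than the
    # language model predicts; it is replaced by 0.
    return max(raw, 0.0)


# One occurrence of a fairly common term in a very long document: clamped to 0.
print(lm_dirichlet_score(freq=1, doc_len=100_000, collection_prob=0.01))
# Three occurrences of a rare term in a short document: positive score.
print(lm_dirichlet_score(freq=3, doc_len=10, collection_prob=0.001))
```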

LM Jelinek Mercer similarity

LM Jelinek Mercer similarity. The algorithm attempts to capture important patterns in the text while leaving out noise. Type name: LMJelinekMercer. This similarity has the following options:

lambda: The optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries. The default value is 0.1. As the value approaches 0, documents that match more query terms rank higher than documents that match fewer terms.
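The effect of lambda on ranking can be illustrated with a small Python sketch. The per-term formula below is an assumption about how Jelinek-Mercer smoothing is typically scored, and the two documents, their term frequencies, and the collection probability are invented for the example.

```python
import math


def lm_jm_term_score(freq, doc_len, collection_prob, lam=0.1):
    """Jelinek-Mercer smoothed score of one query term (assumed formula)."""
    if freq == 0:
        return 0.0  # terms that do not match contribute nothing
    return math.log(1.0 + ((1.0 - lam) * freq / doc_len) / (lam * collection_prob))


def lm_jm_doc_score(term_freqs, doc_len, collection_prob, lam):
    """Sum the per-term scores for all query terms."""
    return sum(lm_jm_term_score(f, doc_len, collection_prob, lam) for f in term_freqs)


# Two query terms. doc_a repeats one term five times; doc_b contains both once.
doc_a = [5, 0]
doc_b = [1, 1]

for lam in (0.1, 0.9):
    a = lm_jm_doc_score(doc_a, doc_len=10, collection_prob=0.01, lam=lam)
    b = lm_jm_doc_score(doc_b, doc_len=10, collection_prob=0.01, lam=lam)
    print(lam, "doc_b first" if b > a else "doc_a first")
```

In this sketch the document matching both terms wins at lambda 0.1, while the document repeating a single term wins at lambda 0.9.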

Scripted similarity

A similarity that allows you to use a script to specify how scores should be computed. Type name: scripted. For example, the following shows how to reimplement TF-IDF:

PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}

Which yields:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                { "value": 1.0, "description": "weight", "details": [] },
                { "value": 1.7, "description": "query.boost", "details": [] },
                { "value": 2, "description": "field.docCount", "details": [] },
                { "value": 4, "description": "field.sumDocFreq", "details": [] },
                { "value": 5, "description": "field.sumTotalTermFreq", "details": [] },
                { "value": 1, "description": "term.docFreq", "details": [] },
                { "value": 2, "description": "term.totalTermFreq", "details": [] },
                { "value": 2.0, "description": "doc.freq", "details": [] },
                { "value": 3, "description": "doc.length", "details": [] }
              ]
            }
          ]
        }
      }
    ]
  }
}
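The score in the explanation above can be checked by hand. The following Python sketch mirrors the arithmetic of the Painless script using the statistics listed in the explanation (query.boost 1.7, doc.freq 2, doc.length 3, field.docCount 2, term.docFreq 1). The function and parameter names are chosen for this example only.

```python
import math


def scripted_tfidf(boost, freq, length, doc_count, doc_freq):
    """Same arithmetic as the Painless script, written in Python."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(length)
    return boost * tf * idf * norm


# Statistics for the term "foo" in document 1, taken from the explanation.
score = scripted_tfidf(boost=1.7, freq=2, length=3, doc_count=2, doc_freq=1)
print(score)
```

The result agrees with the reported _score of 1.9508477 up to floating-point rounding.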

WARNING: Although scripted similarities provide a lot of flexibility, they need to satisfy a set of rules. Failing to do so could cause Elasticsearch to silently return the wrong top hits, or to fail with internal errors at search time:

Returned scores must be positive.
All other variables remaining equal, scores must not decrease when doc.freq increases.
All other variables remaining equal, scores must not increase when doc.length increases.
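Before deploying a scripted similarity, it can help to prototype the formula offline and check it against these three rules. The sketch below is one possible way to do that in Python. It only samples a small grid of inputs, so passing it does not prove that a formula is valid. The helper name find_rule_violations and both example formulas are hypothetical.

```python
import math


def tfidf(freq, length, weight=1.0):
    """Document-dependent part of the TF-IDF example, with a fixed weight."""
    return weight * math.sqrt(freq) / math.sqrt(length)


def find_rule_violations(score_fn, freqs=range(1, 20), lengths=range(1, 50)):
    """Check the three rules on a grid of (freq, length) pairs."""
    problems = []
    for length in lengths:
        for freq in freqs:
            score = score_fn(freq, length)
            if score <= 0:
                problems.append(("not positive", freq, length))
            if score_fn(freq + 1, length) < score:
                problems.append(("decreases as freq increases", freq, length))
            if score_fn(freq, length + 1) > score:
                problems.append(("increases as length increases", freq, length))
    return problems


print(find_rule_violations(tfidf))                    # no violations found
print(len(find_rule_violations(lambda f, n: f * n)))  # grows with length: violations
```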

You may have noticed that a significant part of the above script depends on statistics that are the same for every document. The script can be made slightly more efficient by providing a weight_script, which computes the document-independent part of the score and makes it available under the weight variable. If no weight_script is provided, weight is equal to 1. The weight_script has access to the same variables as the script, except doc, because it is supposed to compute a document-independent contribution to the score.

The following configuration gives the same TF-IDF scores but is slightly more efficient:

PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
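The split between weight_script and script can be mirrored in Python to see that it yields the same score as the single-script version. This is only a sketch of the arithmetic: the two function names follow the setting names for readability, and the input values are the statistics for the term foo in document 1 of the earlier example.

```python
import math


def weight_script(boost, doc_count, doc_freq):
    """Document-independent part: computed once per term, not per document."""
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    return boost * idf


def script(weight, freq, length):
    """Document-dependent part: runs for every matching document."""
    tf = math.sqrt(freq)
    norm = 1 / math.sqrt(length)
    return weight * tf * norm


weight = weight_script(boost=1.7, doc_count=2, doc_freq=1)
score_doc1 = script(weight, freq=2, length=3)
print(score_doc1)
```

The value matches the 1.9508477 returned by the single-script configuration, up to floating-point rounding.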

Default Similarity

By default, Elasticsearch uses whichever similarity is configured under the name default.

When creating an index, you can change the default similarity of all fields in the index:

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}

If you want to change the default similarity after the index is created, you must close the index, send the following request, and then open it again:

POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open

See the website: www.elastic.co/guide/en/el…

If the translation is inaccurate, corrections are welcome. Translating takes effort, so please do not copy it without credit; if you reuse it, please indicate the source.
