Overview

This article walks through the use cases and simple examples of the best_fields, most_fields, and cross_fields types of the multi_match query.

The best field

The bool query takes a “more matches is better” approach: the scores of the individual match clauses are added together to produce the final _score of each document. Documents that match both clauses score higher than documents that match only one, but this sometimes produces results that do not match expectations. Here is an example.

Using English nursery rhymes as the sample data, we search as follows:

GET /music/children/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "brush mouth" }},
        { "match": { "content": "you sunshine" }}
      ]
    }
  }
}

Response (abridged):

{
  "hits": {
    "total": 2,
    "max_score": 1.7672573,
    "hits": [
      {
        "_id": "4",
        "_score": 1.7672573,
        "_source": {
          "name": "brush your teeth",
          "content": "When you wake up in the morning it's a quarter to one, and you want to have a little fun You brush your teeth"
        }
      },
      {
        "_id": "3",
        "_score": 0.7911257,
        "_source": {
          "name": "you are my sunshine",
          "content": "you are my sunshine, my only sunshine, you make me happy, when skies are gray"
        }
      }
    ]
  }
}

We expected “you are my sunshine” to rank ahead of “brush your teeth”, but the result is the opposite. Why?

Let us reconstruct how the bool query arrives at _score: it runs each match clause in the should list and adds up the scores of the clauses that match. (Older Elasticsearch versions also multiplied this sum by the number of matching clauses divided by the total number of clauses.)

Now look at the matches. The name field of document 4 contains brush and its content field contains you, so both match clauses contribute a score. The name field of document 3 does not match, but its content field contains both you and sunshine, so only one clause contributes. As a result, document 4 scores higher.

On closer inspection, although document 4 matches both clauses, each clause matches only one of its keywords. Document 3 matches only one clause, but that clause matches both keywords at once. We would expect two keywords matching in the same field to be more relevant. Simply adding up the scores of several clauses gives a higher score, but not the ranking we want.

What we are looking for is best-field matching: a document in which one field matches as many keywords as possible should rank first, rather than a document in which more fields each match a single keyword.

We use the dis_max query, which returns the score of the best-matching clause as the score of the whole query. The request is as follows:

GET /music/children/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "brush mouth" }},
        { "match": { "content": "you sunshine" }}
      ]
    }
  }
}

Response (abridged):

{
  "hits": {
    "total": 2,
    "max_score": 1.0310873,
    "hits": [
      {
        "_id": "4",
        "_score": 1.0310873,
        "_source": {
          "name": "brush your teeth",
          "content": "When you wake up in the morning it's a quarter to one, and you want to have a little fun You brush your teeth"
        }
      },
      {
        "_id": "3",
        "_score": 0.7911257,
        "_source": {
          "name": "you are my sunshine",
          "content": "you are my sunshine, my only sunshine, you make me happy, when skies are gray"
        }
      }
    ]
  }
}

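With dis_max, the score of document 4 drops from 1.7672573 to 1.0310873, because only its best-matching clause now counts, while document 3 keeps its score of 0.7911257. On this small data set document 4 still ranks first, but the gap between the two documents is much smaller.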
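The difference between the two responses can be checked with a little arithmetic. The following Python sketch is only an illustration, not how Elasticsearch computes scores internally: it uses the scores reported above, and the score of document 4's weaker clause is an assumption inferred by subtracting its dis_max score from its bool score.

```python
# Scores reported in the two responses above.
DOC4_BOOL = 1.7672573     # bool/should score of document 4
DOC4_DIS_MAX = 1.0310873  # dis_max score of document 4 (its best clause)
DOC3 = 0.7911257          # score of document 3, identical in both responses


def bool_should_score(clause_scores):
    """Add up the scores of all matching clauses (more matches is better)."""
    return sum(clause_scores)


def dis_max_score(clause_scores):
    """Keep only the score of the best-matching clause (no tie_breaker)."""
    return max(clause_scores)


# Assumption: the weaker clause score of document 4 is inferred by subtraction.
doc4_clauses = [DOC4_DIS_MAX, DOC4_BOOL - DOC4_DIS_MAX]
doc3_clauses = [DOC3]  # only the content clause matches document 3

for name, clauses in (("doc 4", doc4_clauses), ("doc 3", doc3_clauses)):
    print(name,
          "bool:", round(bool_should_score(clauses), 7),
          "dis_max:", round(dis_max_score(clauses), 7))
```

Document 3 has a single matching clause, so its sum and its maximum are the same number, which is why its score does not change between the two queries.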

Best field query tuning

The dis_max query in the previous section uses only the single best-matching clause and ignores all the others, which is still not ideal for precise search. We want the other matching clauses to take part in the final score with a weight that we can set ourselves.

We can add the tie_breaker parameter to take the other matching clauses into account. It works as follows:

  1. tie_breaker is a decimal between 0 and 1; the recommended range is 0.1 to 0.4.
  2. dis_max takes the _score of the best-matching clause and multiplies the _score of every other matching clause by tie_breaker.
  3. The resulting scores are added together to give the final score.

So with tie_breaker, all matching clauses are considered, but the best-matching clause still dominates.
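The effect of tie_breaker can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual dis_max formula (best score plus tie_breaker times the sum of the other matching scores). The clause scores for document 4 are the ones reported or inferred from the earlier responses, with the weaker one obtained by subtraction.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Assumption: 0.73617 is inferred as 1.7672573 - 1.0310873 (bool minus dis_max).
doc4_clauses = [1.0310873, 0.73617]
doc3_clauses = [0.7911257]

for tie_breaker in (0.0, 0.3, 1.0):
    print(tie_breaker,
          round(dis_max_score(doc4_clauses, tie_breaker), 7),
          round(dis_max_score(doc3_clauses, tie_breaker), 7))
```

With tie_breaker set to 0 the result is the plain dis_max score, and with 1 it equals the sum that the bool query produced; values in between let the other clauses contribute without overtaking the best one.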

Example request:

GET /music/children/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "brush mouth" }},
        { "match": { "content": "you sunshine" }}
      ],
      "tie_breaker": 0.3
    }
  }
}

multi_match query

best_fields

best_fields strategy: documents in which a single field matches as many of the keywords as possible are returned first.

If we search several fields with the same search string using dis_max, the request becomes rather long:

GET /music/children/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "name": {
              "query": "you sunshine",
              "boost": 2,
              "minimum_should_match": "50%"
            }
          }
        },
        { "match": { "content": "you sunshine" }}
      ],
      "tie_breaker": 0.3
    }
  }
}

Search requests can be simplified with multi_match, which supports boost, minimum_should_match, and tie_breaker parameters:

GET /music/children/_search
{
  "query": {
    "multi_match": {
      "query": "you sunshine",
      "type": "best_fields",
      "fields": ["name^2", "content"],
      "minimum_should_match": "50%",
      "tie_breaker": 0.3
    }
  }
}

These parameters help tune relevance, and minimum_should_match in particular is useful for removing the long tail. For example, if we search for 4 keywords and many documents match only 1 of them, those documents are probably not what we want. Raising the minimum_should_match threshold filters out this long tail.
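To make the long-tail filtering concrete, here is a small Python sketch. It is a simplification: it only handles a positive integer or a positive percentage (which is rounded down), and the document names and matched-term counts are hypothetical.

```python
def required_matches(term_count, minimum_should_match):
    """Number of query terms that must match.

    Simplified: supports a positive integer or a positive percentage
    such as "75%"; the percentage result is rounded down.
    """
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return term_count * percent // 100
    return int(minimum_should_match)


# Hypothetical data: how many of the 4 query terms each document matched.
matched_terms = {"doc_a": 4, "doc_b": 3, "doc_c": 1, "doc_d": 1}

needed = required_matches(4, "75%")
kept = [doc for doc, count in matched_terms.items() if count >= needed]
print(needed, kept)
```

Documents that match only one of the four terms fall below the threshold and drop out of the result.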

most_fields

most_fields strategy: documents in which more fields match the keywords are returned first.

A common approach is to index the same text in several ways, for example one version that goes through stemming and analysis and one that keeps the original text, in order to improve matching accuracy.

Let us take the music index as an example (only a fragment of the mapping is shown), with a small change:

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "content": {
          "type": "text",
          "analyzer": "english",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Here the name and content fields each have not only a text field but also a subfield of type keyword. The text field is tokenized and stemmed by the english analyzer, while the keyword subfield keeps the original value unchanged. When searching, we can query name and name.keyword at the same time. Example:

GET /music/children/_search
{
  "query": {
    "multi_match": {
      "query": "brushed",
      "type": "most_fields",
      "fields": ["name", "name.keyword"]
    }
  }
}

After stemming, brushed becomes brush, so the name field matches, while name.keyword does not; the document is still returned. If you search only the name.keyword field, no results are returned.

This is the most_fields strategy: index the same text in several ways and let the result from each of those fields take part in the search, so that as many results as possible are returned.
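The difference between the two subfields can be imitated with a toy model in Python. This is only a sketch: toy_english_analyze is a hypothetical, very crude stand-in for the english analyzer (it lowercases, splits on whitespace, and strips a trailing "ed"), not the real stemmer.

```python
def toy_english_analyze(text):
    """Crude stand-in for the english analyzer: lowercase, split, strip 'ed'."""
    tokens = []
    for word in text.lower().split():
        if word.endswith("ed") and len(word) > 4:
            word = word[:-2]
        tokens.append(word)
    return tokens


def text_field_matches(query, value):
    """A text field matches if the query and the value share an analyzed token."""
    return bool(set(toy_english_analyze(query)) & set(toy_english_analyze(value)))


def keyword_field_matches(query, value):
    """A keyword field matches only if the whole value equals the query."""
    return query == value


name = "brush your teeth"
print(text_field_matches("brushed", name))     # name field
print(keyword_field_matches("brushed", name))  # name.keyword field
```

The analyzed field matches because both sides reduce to the token brush, while the keyword subfield compares the untouched strings and finds no match.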

Differences from best_fields

  1. best_fields searches several fields and takes the score of the single best-matching field; with tie_breaker, the scores of the other fields are also considered to some extent. Simply put, when you search several fields, you want to find a field that contains as many of the keywords as possible.
  • Advantages: with the best_fields strategy, combined with the other fields and minimum_should_match, the most precise matches are pushed to the front.
  • Disadvantages: apart from the precise matches, the remaining results score very similarly, so their ordering is not well differentiated.

Practical example: search engines such as Baidu put the best matches first, but the remaining results are not well differentiated.

  2. most_fields searches several fields and lets as many of the fields' queries as possible take part in the total score. The result is a mix similar to the first bool result in the best-field section, and it may not be precise: a document in which one field contains more of the keywords can be outranked by documents in which more fields match. Therefore, additional fields such as name.keyword and name.std should be set up so that one field matches the query string as precisely as possible, contributes a higher score, and moves the more precise matches to the front.
  • Advantages: results that match as many fields as possible are pushed to the front, and the overall ordering is fairly uniform.
  • Disadvantages: the most precise matches may not be pushed to the front.

Practical example: a wiki search shows an obvious most_fields strategy: the results are fairly uniform, but it can take several pages to find the best match.

cross_fields

When modelling some entities, a single piece of information may be spread over several fields. An address is a typical example: in common storage schemes it is stored as four separate fields (province, city, district, and street) that together make up the full address. A person's first name and last name are another example.

What should we watch out for when searching this kind of data, known as cross-field entity search?

Recall that the author information in the music index is split into author_first_name and author_last_name. Let us use it to demonstrate cross-field entity search.

Use most_fields to query

GET /music/children/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Raffi",
      "type": "most_fields",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}

Response:

{
  "hits": {
    "total": 2,
    "max_score": 1.3862944,
    "hits": [
      {
        "_id": "4",
        "_score": 1.3862944,
        "_source": {
          "id": "55fa74f7-35f3-4313-a678-18c19c918a78",
          "author_first_name": "Peter",
          "author_last_name": "Raffi",
          "author": "Peter Raffi",
          "name": "brush your teeth",
          "content": "When you wake up in the morning it's a quarter to one, and you want to have a little fun You brush your teeth"
        }
      },
      {
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "author_first_name": "Peter",
          "author_last_name": "Gymbo",
          "author": "Peter Gymbo",
          "name": "gymbo",
          "content": "I hava a friend who loves smile, gymbo is his name"
        }
      }
    ]
  }
}

“Peter Raffi” is ranked first, which looks correct, but Peter Gymbo is returned as well, which is not what we want. The data set is small, so the long tail does not show up here. Querying with most_fields leads to the following three problems:

  1. It finds documents in which as many fields as possible match, rather than the document whose fields match the whole query most precisely.
  2. most_fields cannot use minimum_should_match to remove the long tail, that is, results that match only a few terms.
  3. TF/IDF scoring can be skewed. Take Peter Raffi and Peter Gymbo: when we search for Peter Raffi, Raffi is rare in the first_name field, so its document frequency in that field is very low and its score is very high, which may produce an unexpected ordering.
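The third problem can be illustrated with a short Python sketch. It assumes the BM25 form of IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), and the document frequencies below are hypothetical numbers chosen only to show the effect.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: the rarer the term in a field, the higher the value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-field document frequencies in an index of 1000 documents.
DOC_COUNT = 1000
doc_freq = {
    "author_first_name": {"peter": 120, "raffi": 1},
    "author_last_name": {"peter": 2, "raffi": 40},
}

for field, freqs in doc_freq.items():
    for term, df in freqs.items():
        print(f"{field:18} {term:6} idf={bm25_idf(df, DOC_COUNT):.3f}")
```

Because each field keeps its own statistics, a term that is rare in the "wrong" field gets a much larger IDF there than in the field where it normally appears, and that can push an unexpected document to the top.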

Merge fields using copy_to

The copy_to parameter merges several fields into one, which solves the cross-field entity problem at the cost of extra storage. An example of copy_to:

PUT /music/_mapping/children
{
  "properties": {
    "author_first_name": {
      "type": "text",
      "copy_to": "author_full_name"
    },
    "author_last_name": {
      "type": "text",
      "copy_to": "author_full_name"
    },
    "author_full_name": {
      "type": "text"
    }
  }
}

Note that this mapping has to be applied when the index is created, which is quite restrictive. This is why the sample data was designed with an author field that stores the full name.

GET /music/children/_search
{
  "query": {
    "match": {
      "author_full_name": {
        "query": "Peter Raffi",
        "operator": "and"
      }
    }
  }
}

For single-field queries, you can specify operator or minimum_should_match to control precision as you like.
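The idea behind the merged field can be imitated in plain Python. This is only a toy model: apply_copy_to and match_and are hypothetical helpers, and joining the values into one string is a simplification of what copy_to does at index time (the real parameter leaves _source untouched).

```python
def apply_copy_to(doc, source_fields, target_field):
    """Toy copy_to: collect the source field values into one searchable field."""
    merged = dict(doc)
    merged[target_field] = " ".join(doc[f] for f in source_fields if f in doc)
    return merged


def match_and(query, value):
    """Toy match query with operator 'and': every query term must be present."""
    tokens = set(value.lower().split())
    return all(term in tokens for term in query.lower().split())


docs = {
    "4": {"author_first_name": "Peter", "author_last_name": "Raffi"},
    "1": {"author_first_name": "Peter", "author_last_name": "Gymbo"},
}
fields = ["author_first_name", "author_last_name"]

hits = [
    doc_id
    for doc_id, doc in docs.items()
    if match_and("Peter Raffi", apply_copy_to(doc, fields, "author_full_name")["author_full_name"])
]
print(hits)
```

Only the document whose combined name contains both terms is returned, so Peter Gymbo no longer appears.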

Let us see whether the three problems mentioned above are solved:

  1. The matching problem

Solved. The documents that match the most terms are returned first.

  2. The long-tail problem

Solved. You can specify operator or minimum_should_match to control precision.

  3. The inaccurate scoring problem

Solved. All the information is in one field, so the IDF is computed consistently and there are no extreme outliers.

Disadvantages: the redundant field has to be designed up front, and it takes more storage. Concatenating fields with copy_to also runs into ordering problems: in English the given name comes before the surname, and address order is not fixed (some go from province down to street, others the reverse). This is another limitation.

Native cross_fields syntax

multi_match has a native cross_fields type that solves the cross-field entity search problem. The request is as follows:

GET /music/children/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Raffi",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}

With operator set to and, cross_fields requires that:

  • Peter must appear in author_first_name or author_last_name
  • Raffi must appear in author_first_name or author_last_name
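The two requirements above amount to a simple rule: every query term must be found in at least one of the listed fields. A minimal Python sketch of that rule follows; cross_fields_and is a hypothetical helper that ignores analysis and scoring.

```python
def cross_fields_and(query, doc, fields):
    """True if every query term appears in at least one of the given fields."""
    field_tokens = [set(doc.get(field, "").lower().split()) for field in fields]
    return all(
        any(term in tokens for tokens in field_tokens)
        for term in query.lower().split()
    )


fields = ["author_first_name", "author_last_name"]
doc4 = {"author_first_name": "Peter", "author_last_name": "Raffi"}
doc1 = {"author_first_name": "Peter", "author_last_name": "Gymbo"}

print(cross_fields_and("Peter Raffi", doc4, fields))
print(cross_fields_and("Peter Raffi", doc1, fields))
```

Document 1 fails the check because Raffi appears in neither of its fields, which is exactly the result that most_fields could not exclude.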

Let us check the three problems mentioned above again:

  1. The matching problem

Solved. cross_fields requires every term to appear in at least one of the fields.

  2. The long-tail problem

Solved. As in the previous point, every term must match, so the long tail disappears naturally.

  3. The inaccurate scoring problem

Solved. cross_fields blends the inverse document frequency of each term across the fields. Peter is frequent in first_name and rare in last_name, so it has two different IDF values, and the smaller of the two is used for both fields. Raffi is handled in the same way, so the resulting IDF values are reasonable and not inflated.
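The blending idea can be sketched in Python as follows. This is a simplified model of the behaviour described above, not the actual Lucene implementation: it assumes the BM25 form of IDF and uses hypothetical document frequencies, and it takes the largest document frequency across the fields, which yields the smallest IDF.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_idf(term, field_doc_freqs, doc_count):
    """Use the largest document frequency across fields, i.e. the smallest IDF."""
    max_df = max(freqs.get(term, 0) for freqs in field_doc_freqs.values())
    return bm25_idf(max_df, doc_count)


# Hypothetical per-field document frequencies in an index of 1000 documents.
DOC_COUNT = 1000
doc_freq = {
    "author_first_name": {"peter": 120, "raffi": 1},
    "author_last_name": {"peter": 2, "raffi": 40},
}

for term in ("peter", "raffi"):
    per_field = [round(bm25_idf(f[term], DOC_COUNT), 3) for f in doc_freq.values()]
    print(term, per_field, round(blended_idf(term, doc_freq, DOC_COUNT), 3))
```

In this model the blended value always equals the lower of the per-field values, so a term that happens to be rare in one field can no longer dominate the score.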

Summary

It is worth spending a little time to understand the multi-field search scenarios and the details to watch out for. Precise search is a very large topic with no upper limit on how far it can be tuned; you can start from the most basic scenarios and adjust the query syntax from there.
