Elasticsearch master (15)

Drawbacks of using the most_fields strategy for cross-fields search

Cross-fields search

  • A cross-fields search looks for a single identifying value that spans multiple fields. For example, a person is identified by a name and a building by an address. A name can be split across first_name and last_name, and an address across country, province, and city.

  • Searching for such an identifier across multiple fields, such as a person’s name or an address, is a cross-fields search.

As a first attempt, most_fields looks like the more appropriate strategy for this. best_fields gives priority to documents in which a single field matches best, whereas a cross-fields search is by nature not a single-field problem.

 POST /waws/article/_bulk
 { "update": { "_id": "1" } }
 { "doc": { "author_first_name": "Peter", "author_last_name": "Smith" } }
 { "update": { "_id": "2" } }
 { "doc": { "author_first_name": "Smith", "author_last_name": "Williams" } }
 { "update": { "_id": "3" } }
 { "doc": { "author_first_name": "Jack", "author_last_name": "Ma" } }
 { "update": { "_id": "4" } }
 { "doc": { "author_first_name": "Robbin", "author_last_name": "Li" } }
 { "update": { "_id": "5" } }
 { "doc": { "author_first_name": "Tonny", "author_last_name": "Peter Smith" } }
  • Search data
 GET /waws/article/_search
 {
   "query": {
     "multi_match": {
       "query": "Peter Smith",
       "type": "most_fields",
       "fields": ["author_first_name", "author_last_name"]
     }
   }
 }

 # Doc 2 ranks first: "Peter Smith" matches its author_first_name through the term Smith.
 # Smith occurs in author_first_name in only one document, so its IDF in that field is high.
 # Doc 1 (Peter Smith) has Smith in author_last_name, where the term occurs in two documents (docs 1 and 5), so the IDF score of doc 1 is lower.

 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 3,
     "max_score": 0.6931472,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.6931472,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": ["java"],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 0.5753642,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01",
           "tag": ["java", "hadoop"],
           "tag_cnt": 2,
           "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article",
           "sub_title": "learning more courses",
           "author_first_name": "Peter",
           "author_last_name": "Smith"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.51623213,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": ["elasticsearch"],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith"
         }
       }
     ]
   }
 }

Problem 1: most_fields only finds documents in which as many fields as possible match, not the document in which a field matches perfectly.

Problem 2: most_fields cannot use minimum_should_match to remove long-tail data, that is, results that match very few terms.

Problem 3: TF/IDF scoring is skewed. Take the documents Peter Smith and Smith Williams and a search for Peter Smith. Smith appears very rarely in first_name, so its document frequency in that field is very low and its score is very high. As a result, Smith Williams may rank ahead of Peter Smith.
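To make Problem 3 concrete, here is a minimal Python sketch that counts how often the term smith occurs in each field of the five sample documents and derives an IDF from those counts. The names DOCS, tokenize, doc_freq and idf are hypothetical helpers written for this illustration, the whitespace tokenizer is only a stand-in for the real analyzer, and the BM25-style IDF formula is computed over the whole toy index as if it had a single shard. The numbers therefore will not match the _score values in the response above, but the direction of the skew is the same.

```python
import math

# Author names from the bulk request above, keyed by document _id.
DOCS = {
    "1": {"author_first_name": "Peter", "author_last_name": "Smith"},
    "2": {"author_first_name": "Smith", "author_last_name": "Williams"},
    "3": {"author_first_name": "Jack", "author_last_name": "Ma"},
    "4": {"author_first_name": "Robbin", "author_last_name": "Li"},
    "5": {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return text.lower().split()


def doc_freq(term, field):
    """Number of documents whose `field` contains `term`."""
    return sum(1 for doc in DOCS.values() if term in tokenize(doc[field]))


def idf(term, field):
    """BM25-style IDF over the whole toy index (assumption: one shard)."""
    n = doc_freq(term, field)
    return math.log(1 + (len(DOCS) - n + 0.5) / (n + 0.5))


# Print field name, document frequency of "smith", and its IDF in that field.
for field in ("author_first_name", "author_last_name"):
    print(field, doc_freq("smith", field), round(idf("smith", field), 3))
```

Because smith is rarer in author_first_name than in author_last_name, its IDF there is higher, which is why a match on Smith Williams can outweigh a match on Peter Smith when each field is scored separately.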

Elasticsearch (16)

In-depth search techniques: using copy_to to build a custom combined field and fix the drawbacks of cross-fields search

Last time we saw that implementing cross-fields search with the most_fields strategy has three drawbacks, and the search results demonstrated all three.

  • The first method: use copy_to to combine multiple fields into a single field

The root of the problem is that there are multiple fields, which makes things awkward. All we need is a way to merge an identifier that spans several fields into a single field. For example, a person’s name that is split into first_name and last_name can be merged into one full_name field.

 PUT /waws/_mapping/article
 {
   "properties": {
     "new_author_first_name": {
       "type": "string",
       "copy_to": "new_author_full_name"
     },
     "new_author_last_name": {
       "type": "string",
       "copy_to": "new_author_full_name"
     },
     "new_author_full_name": {
       "type": "string"
     }
   }
 }

With this copy_to syntax, the values of multiple fields are copied into a single field, and an inverted index is built on that combined field.
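The effect of copy_to on the index can be sketched in a few lines of Python. This is only a toy model under stated assumptions: COPY_TO mirrors the mapping above, while tokenize and build_inverted_index are hypothetical helpers written for this illustration, and whitespace splitting stands in for the real analyzer.

```python
from collections import defaultdict

# Values that the bulk request in this section writes, keyed by document _id.
DOCS = {
    "1": {"new_author_first_name": "Peter", "new_author_last_name": "Smith"},
    "2": {"new_author_first_name": "Smith", "new_author_last_name": "Williams"},
    "3": {"new_author_first_name": "Jack", "new_author_last_name": "Ma"},
    "4": {"new_author_first_name": "Robbin", "new_author_last_name": "Li"},
    "5": {"new_author_first_name": "Tonny", "new_author_last_name": "Peter Smith"},
}

# Mirrors the mapping: both source fields copy into new_author_full_name.
COPY_TO = {
    "new_author_first_name": "new_author_full_name",
    "new_author_last_name": "new_author_full_name",
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return text.lower().split()


def build_inverted_index(docs, copy_to):
    """Return {field: {term: set of doc ids}}, indexing copy_to targets too."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, source in docs.items():
        for field, value in source.items():
            targets = [field]
            if field in copy_to:
                targets.append(copy_to[field])
            for target in targets:
                for term in tokenize(value):
                    index[target][term].add(doc_id)
    return index


index = build_inverted_index(DOCS, COPY_TO)
# Documents containing each query term in the combined field.
print(sorted(index["new_author_full_name"]["smith"]))
print(sorted(index["new_author_full_name"]["peter"]))
```

In the combined field, both query terms are looked up in the same posting lists, regardless of which source field they came from. Note that copy_to only affects the index: the copied value is not added to _source.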

 POST /waws/article/_bulk
 { "update": { "_id": "1" } }
 { "doc": { "new_author_first_name": "Peter", "new_author_last_name": "Smith" } }
 { "update": { "_id": "2" } }
 { "doc": { "new_author_first_name": "Smith", "new_author_last_name": "Williams" } }
 { "update": { "_id": "3" } }
 { "doc": { "new_author_first_name": "Jack", "new_author_last_name": "Ma" } }
 { "update": { "_id": "4" } }
 { "doc": { "new_author_first_name": "Robbin", "new_author_last_name": "Li" } }
 { "update": { "_id": "5" } }
 { "doc": { "new_author_first_name": "Tonny", "new_author_last_name": "Peter Smith" } }
  • Search data
 GET /waws/article/_search
 {
   "query": {
     "match": {
       "new_author_full_name": "Peter Smith"
     }
   }
 }

 {
   "took": 2,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 3,
     "max_score": 0.62191015,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.62191015,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": ["java"],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 0.51623213,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01",
           "tag": ["java", "hadoop"],
           "tag_cnt": 2,
           "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article",
           "sub_title": "learning more courses",
           "author_first_name": "Peter",
           "author_last_name": "Smith",
           "new_author_last_name": "Smith",
           "new_author_first_name": "Peter"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.5063205,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": ["elasticsearch"],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

The expected effect is hard to reproduce here. The official documentation also gives examples that say which text to index, how to search, and what result to expect. But ES versions keep iterating, and so does the scoring algorithm, so it is hard for lessons like these on best_fields, most_fields, and cross_fields to fully reproduce the scenarios and results they are meant to show.

What I hope you take away is this: when you build your own search application and meet a scenario that calls for best_fields, you know how to use it, understand its principle, and know what effect it can achieve. The same goes for a most_fields scenario and for a cross-fields scenario: you know how to do it and how it works.

  • Problem solving

Problem 1: most_fields only finds documents in which as many fields as possible match, not the document in which a field matches perfectly --> solved: the best-matching document is returned first.

Problem 2: most_fields cannot use minimum_should_match to remove long-tail data --> solved: with a single combined field, minimum_should_match can be used to remove the long tail.

Problem 3: TF/IDF scoring is skewed. With the documents Peter Smith and Smith Williams and a search for Peter Smith, Smith is very rare in first_name, so its document frequency is very low and its score is very high, and Smith Williams may rank ahead of Peter Smith --> solved: Smith and Peter are now in the same field, so their occurrences across all documents are counted evenly, without extreme skew.
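The last two points can be sketched in Python on the combined values. This is a simplified model, not how Elasticsearch actually scores. FULL_NAMES holds what copy_to would place in new_author_full_name, and tokenize, idf and search are hypothetical helpers written for this illustration. The score is just the sum of a BM25-style IDF over the matched terms, computed as if the index had one shard, and it ignores term frequency and field-length normalization. The min_should_match argument imitates an absolute minimum_should_match value.

```python
import math

# Combined values that copy_to would index into new_author_full_name.
FULL_NAMES = {
    "1": "Peter Smith",
    "2": "Smith Williams",
    "3": "Jack Ma",
    "4": "Robbin Li",
    "5": "Tonny Peter Smith",
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return text.lower().split()


def idf(term):
    """BM25-style IDF of `term` in the combined field (assumption: one shard)."""
    n = sum(1 for name in FULL_NAMES.values() if term in tokenize(name))
    return math.log(1 + (len(FULL_NAMES) - n + 0.5) / (n + 0.5))


def search(query, min_should_match=1):
    """Rank docs by the summed IDF of the query terms they contain.

    Docs that contain fewer than `min_should_match` query terms are dropped.
    """
    terms = tokenize(query)
    hits = []
    for doc_id, name in FULL_NAMES.items():
        matched = [t for t in terms if t in tokenize(name)]
        if matched and len(matched) >= min_should_match:
            hits.append((doc_id, sum(idf(t) for t in matched)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(search("Peter Smith"))                      # all docs with any term
print(search("Peter Smith", min_should_match=2))  # long tail removed
```

In this toy model the documents that contain both Peter and Smith outrank Smith Williams, and requiring both terms removes Smith Williams from the results entirely. As the response above shows, a real multi-shard index with very few documents may still rank differently.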