Elasticsearch master episode 73

Term Vector: using term vectors in Elasticsearch

Introduction to term vectors

The term vectors API returns statistics for each term within a field of a document:

  • term information: term frequency in the field; term positions; start and end offsets; term payloads
  • term statistics (set term_statistics = true): total term frequency, how often the term occurs across all documents; document frequency, how many documents contain the term
  • field statistics: document count, how many documents contain the field; sum of document frequency, the sum of df over all terms in the field; sum of total term frequency, the sum of ttf over all terms in the field
 GET /twitter/tweet/1/_termvectors
 GET /twitter/tweet/1/_termvectors?fields=text

Term statistics and field statistics are not fully accurate: deleted documents are not taken into account.

Frankly, this API is rarely used; when it is, it is usually for probing data. For example, you may want to know how many documents a particular term, say "Journey to the West", appears in; or, for a field like film_desc (a movie's description), how many docs contain a given term.
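The definitions above can be reproduced offline. This is not Elasticsearch code, just a minimal Python sketch of what the statistics mean, assuming whitespace tokenization plus lowercasing over the two example documents used later in this section:

```python
from collections import Counter

# Toy corpus matching the two example docs indexed below
docs = {
    1: "hello test test test",
    2: "other hello test ...",
}
tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

# term_freq: occurrences of a term within one document's field
term_freq = {doc_id: Counter(tokens) for doc_id, tokens in tokenized.items()}

# doc_freq: how many documents contain the term
# ttf: total occurrences of the term across all documents
doc_freq = Counter()
ttf = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))
    for term, freq in Counter(tokens).items():
        ttf[term] += freq

# field statistics
doc_count = len(docs)                  # documents containing the field
sum_doc_freq = sum(doc_freq.values())  # sum of df over all terms
sum_ttf = sum(ttf.values())            # sum of ttf over all terms

print(doc_freq["test"], ttf["test"], term_freq[1]["test"])  # -> 2 4 3
print(doc_count, sum_doc_freq, sum_ttf)                     # -> 2 6 8
```

The printed numbers match the field_statistics and term statistics that Elasticsearch returns for the same two documents below.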

Index-time term vector experiment

Term vectors involve a lot of term- and field-level statistics, which can be collected in two ways:

  • Index-time: configure term_vector in the mapping, and the term and field statistics are generated for you when documents are indexed
  • Query-time: no term vector information is generated ahead of time; when you ask for the term vectors, all the statistics are computed on the fly and returned
  • Goal: master how to collect term vector information
  • Goal: learn how to use term vectors for data exploration
  • Create the index
 PUT /waws_index
 {
   "mappings": {
     "waws_type": {
       "properties": {
         "text": {
           "type": "text",
           "term_vector": "with_positions_offsets_payloads",
           "store": true,
           "analyzer": "fulltext_analyzer"
         },
         "fullname": {
           "type": "text",
           "analyzer": "fulltext_analyzer"
         }
       }
     }
   },
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 0
     },
     "analysis": {
       "analyzer": {
         "fulltext_analyzer": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["lowercase", "type_as_payload"]
         }
       }
     }
   }
 }
  • Insert the data
 PUT /waws_index/waws_type/1
 {
   "fullname": "Leo Li",
   "text": "hello test test test "
 }

 PUT /waws_index/waws_type/2
 {
   "fullname": "Leo Li",
   "text": "other hello test ..."
 }
  • Get the term vectors
 GET /waws_index/waws_type/1/_termvectors
 {
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 19,
   "term_vectors": {
     "text": {
       "field_statistics": {
         "sum_doc_freq": 6,       # sum of df over all terms in the field
         "doc_count": 2,          # how many docs contain the field
         "sum_ttf": 8             # sum of ttf over all terms in the field
       },
       "terms": {
         "hello": {               # the term "hello"
           "doc_freq": 2,         # how many docs contain this term
           "ttf": 2,              # how often the term occurs across all docs
           "term_freq": 1,        # how often "hello" occurs in doc 1
           "tokens": [
             {
               "position": 0,     # position
               "start_offset": 0, # start offset
               "end_offset": 5,   # end offset
               "payload": "d29yZA=="
             }
           ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
             { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
             { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
           ]
         }
       }
     }
   }
 }
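Incidentally, the payload values in the response are base64-encoded. Because the analyzer uses the type_as_payload token filter, each payload stores the token's Lucene type. A quick decode with the standard library shows what d29yZA== holds:

```python
import base64

# "d29yZA==" is the payload produced by the type_as_payload filter
decoded = base64.b64decode("d29yZA==").decode("utf-8")
print(decoded)  # -> word (the default Lucene token type for whitespace tokens)
```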

Query-time term vector experiment

 GET /waws_index/waws_type/1/_termvectors
 {
   "fields": ["fullname"],
   "offsets": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 39,
   "term_vectors": {
     "fullname": {
       "field_statistics": {
         "sum_doc_freq": 4,
         "doc_count": 2,
         "sum_ttf": 4
       },
       "terms": {
         "leo": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
         },
         "li": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
         }
       }
     }
   }
 }

In general, if conditions permit, use query-time term vectors: whatever data you want to probe, the statistics are computed on the fly, with no extra indexing or storage cost.

Manually specify a doc to get term vectors for

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 1,
   "term_vectors": {
     "text": {
       "field_statistics": {
         "sum_doc_freq": 6,
         "doc_count": 2,
         "sum_ttf": 8
       },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ]
         }
       }
     }
   }
 }

When you specify a doc by hand, you are not pointing at an existing document; you supply the content you want analyzed (here, "hello test test test") as if it belonged to the field.

Elasticsearch splits that content into terms and, for each term, computes its statistics across all the documents that already exist in the index.

This is very useful: it lets you hand in arbitrary content and probe how its terms are distributed. For example, you could probe the statistics of the term "Journey to the West".
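For example, probing the term "Journey to the West" against a movie index could look roughly like this (movie_index, movie_type, and film_desc are illustrative names, not from the examples above):

```
 GET /movie_index/movie_type/_termvectors
 {
   "doc": {
     "film_desc": "Journey to the West"
   },
   "fields": ["film_desc"],
   "term_statistics": true
 }
```

The doc_freq values in the response then tell you how many indexed documents contain each of the resulting terms.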

Manually specify an analyzer to generate the term vector

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true,
   "per_field_analyzer": {
     "text": "standard"
   }
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 0,
   "term_vectors": {
     "text": {
       "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ]
         }
       }
     }
   }
 }

Terms filter

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true,
   "filter": {
     "max_num_terms": 3,
     "min_term_freq": 1,
     "min_doc_freq": 1
   }
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 1,
   "term_vectors": {
     "text": {
       "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ],
           "score": 1
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ],
           "score": 3
         }
       }
     }
   }
 }

In other words, it is also useful to filter the returned term vectors based on the term statistics themselves. For example, when probing data, you can filter out terms that appear too infrequently.
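To get a feel for what the filter does, here is a rough Python sketch of the idea. Note this is only an approximation: the real terms filter ranks terms with a tf-idf-like score, while this sketch simply uses term_freq as a stand-in score:

```python
def filter_terms(stats, max_num_terms=3, min_term_freq=1, min_doc_freq=1):
    """Keep terms meeting the frequency thresholds, ranked by term_freq
    (a stand-in for Elasticsearch's tf-idf-like score)."""
    kept = {
        term: s for term, s in stats.items()
        if s["term_freq"] >= min_term_freq and s["doc_freq"] >= min_doc_freq
    }
    ranked = sorted(kept.items(), key=lambda kv: kv[1]["term_freq"], reverse=True)
    return dict(ranked[:max_num_terms])

# Statistics taken from the response above
stats = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
}
print(filter_terms(stats, max_num_terms=1))  # -> {'test': {'term_freq': 3, 'doc_freq': 2}}
```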

Multi term vectors

 GET _mtermvectors
 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "term_statistics": true
     },
     {
       "_index": "my_index",
       "_type": "my_type",
       "_id": "1",
       "fields": ["text"]
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
             },
             "hello": {
               "doc_freq": 2,
               "ttf": 2,
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
             },
             "other": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "doc_freq": 2,
               "ttf": 4,
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
             }
           }
         }
       }
     },
     {
       "_index": "my_index",
       "_type": "my_type",
       "_id": "1",
       "error": {
         "root_cause": [
           {
             "type": "index_not_found_exception",
             "reason": "no such index",
             "index_uuid": "_na_",
             "index": "my_index"
           }
         ],
         "type": "index_not_found_exception",
         "reason": "no such index",
         "index_uuid": "_na_",
         "index": "my_index"
       }
     }
   ]
 }
  • The second form: index in the URL
 GET /waws_index/_mtermvectors
 {
   "docs": [
     {
       "_type": "test",
       "_id": "2",
       "fields": ["text"],
       "term_statistics": true
     },
     {
       "_type": "test",
       "_id": "1"
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "test",
       "_id": "2",
       "_version": 0,
       "found": false,
       "took": 0
     },
     {
       "_index": "waws_index",
       "_type": "test",
       "_id": "1",
       "_version": 0,
       "found": false,
       "took": 0
     }
   ]
 }
  • The third form: index and type in the URL
 GET /waws_index/waws_type/_mtermvectors
 {
   "docs": [
     {
       "_id": "2",
       "fields": ["text"],
       "term_statistics": true
     },
     {
       "_id": "1"
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
             },
             "hello": {
               "doc_freq": 2,
               "ttf": 2,
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
             },
             "other": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "doc_freq": 2,
               "ttf": 4,
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
             }
           }
         }
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "1",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "term_freq": 3,
               "tokens": [
                 { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
                 { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
                 { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
               ]
             }
           }
         }
       }
     }
   ]
 }
  • The fourth form: artificial docs
 GET /_mtermvectors
 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "doc": {
         "fullname": "Leo Li",
         "text": "hello test test test"
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "doc": {
         "fullname": "Leo Li",
         "text": "other hello test ..."
       }
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_version": 0,
       "found": true,
       "took": 0,
       "term_vectors": {
         "fullname": {
           "field_statistics": { "sum_doc_freq": 4, "doc_count": 2, "sum_ttf": 4 },
           "terms": {
             "leo": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
             },
             "li": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
             }
           }
         },
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
             },
             "test": {
               "term_freq": 3,
               "tokens": [
                 { "position": 1, "start_offset": 6, "end_offset": 10 },
                 { "position": 2, "start_offset": 11, "end_offset": 15 },
                 { "position": 3, "start_offset": 16, "end_offset": 20 }
               ]
             }
           }
         }
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_version": 0,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20 } ]
             },
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11 } ]
             },
             "other": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
             },
             "test": {
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16 } ]
             }
           }
         },
         "fullname": {
           "field_statistics": { "sum_doc_freq": 4, "doc_count": 2, "sum_ttf": 4 },
           "terms": {
             "leo": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
             },
             "li": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
             }
           }
         }
       }
     }
   ]
 }