Elasticsearch master episode 73

Term Vector: using term vectors in Elasticsearch

Introduction to term vectors

The term vectors API returns statistics for each term within a field of a document:

  • term information: term frequency in the field; term positions; start and end offsets; term payloads
  • term statistics (set term_statistics = true): total term frequency, how often the term occurs across all documents; document frequency, how many documents contain the term
  • field statistics: document count, how many documents contain the field; sum of document frequency, the sum of df over all terms in the field; sum of total term frequency, the sum of ttf over all terms in the field
 GET /twitter/tweet/1/_termvectors
 GET /twitter/tweet/1/_termvectors?fields=text

Term statistics and field statistics are not fully accurate: deleted documents are not taken into account.

Frankly, this API is rarely used; when it is, it is usually for probing data. For example, you may want to know how many documents a particular term, say "Journey to the West", appears in; or, for a field like film_desc (a movie's description), how many docs contain a given term.
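The definitions above can be reproduced offline. This is not Elasticsearch code, just a minimal Python sketch of what the statistics mean, assuming whitespace tokenization plus lowercasing over the two example documents used later in this section:

```python
from collections import Counter

# Toy corpus matching the two example docs indexed below
docs = {
    1: "hello test test test",
    2: "other hello test ...",
}
tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

# term_freq: occurrences of a term within one document's field
term_freq = {doc_id: Counter(tokens) for doc_id, tokens in tokenized.items()}

# doc_freq: how many documents contain the term
# ttf: total occurrences of the term across all documents
doc_freq = Counter()
ttf = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))
    for term, freq in Counter(tokens).items():
        ttf[term] += freq

# field statistics
doc_count = len(docs)                  # documents containing the field
sum_doc_freq = sum(doc_freq.values())  # sum of df over all terms
sum_ttf = sum(ttf.values())            # sum of ttf over all terms

print(doc_freq["test"], ttf["test"], term_freq[1]["test"])  # -> 2 4 3
print(doc_count, sum_doc_freq, sum_ttf)                     # -> 2 6 8
```

The printed numbers match the field_statistics and term statistics that Elasticsearch returns for the same two documents below.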

Index-time term vector experiment

Term vectors involve a lot of term- and field-level statistics, which can be collected in two ways:

  • Index-time: configure term_vector in the mapping, and the term and field statistics are generated for you when documents are indexed
  • Query-time: no term vector information is generated ahead of time; when you ask for the term vectors, all the statistics are computed on the fly and returned
  • Goal: master how to collect term vector information
  • Goal: learn how to use term vectors for data exploration
  • Create the index
 PUT /waws_index
 {
   "mappings": {
     "waws_type": {
       "properties": {
         "text": {
           "type": "text",
           "term_vector": "with_positions_offsets_payloads",
           "store": true,
           "analyzer": "fulltext_analyzer"
         },
         "fullname": {
           "type": "text",
           "analyzer": "fulltext_analyzer"
         }
       }
     }
   },
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 0
     },
     "analysis": {
       "analyzer": {
         "fulltext_analyzer": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["lowercase", "type_as_payload"]
         }
       }
     }
   }
 }
  • Insert the data
 PUT /waws_index/waws_type/1
 {
   "fullname": "Leo Li",
   "text": "hello test test test "
 }

 PUT /waws_index/waws_type/2
 {
   "fullname": "Leo Li",
   "text": "other hello test ..."
 }
  • Get the term vectors
 GET /waws_index/waws_type/1/_termvectors
 {
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 19,
   "term_vectors": {
     "text": {
       "field_statistics": {
         "sum_doc_freq": 6,       # sum of df over all terms in the field
         "doc_count": 2,          # how many docs contain the field
         "sum_ttf": 8             # sum of ttf over all terms in the field
       },
       "terms": {
         "hello": {               # the term "hello"
           "doc_freq": 2,         # how many docs contain this term
           "ttf": 2,              # how often the term occurs across all docs
           "term_freq": 1,        # how often "hello" occurs in doc 1
           "tokens": [
             {
               "position": 0,     # position
               "start_offset": 0, # start offset
               "end_offset": 5,   # end offset
               "payload": "d29yZA=="
             }
           ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
             { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
             { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
           ]
         }
       }
     }
   }
 }
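Incidentally, the payload values in the response are base64-encoded. Because the analyzer uses the type_as_payload token filter, each payload stores the token's Lucene type. A quick decode with the standard library shows what d29yZA== holds:

```python
import base64

# "d29yZA==" is the payload produced by the type_as_payload filter
decoded = base64.b64decode("d29yZA==").decode("utf-8")
print(decoded)  # -> word (the default Lucene token type for whitespace tokens)
```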

Query-time term vector experiment

 GET /waws_index/waws_type/1/_termvectors
 {
   "fields": ["fullname"],
   "offsets": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 39,
   "term_vectors": {
     "fullname": {
       "field_statistics": {
         "sum_doc_freq": 4,
         "doc_count": 2,
         "sum_ttf": 4
       },
       "terms": {
         "leo": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
         },
         "li": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
         }
       }
     }
   }
 }

In general, if conditions permit, use query-time term vectors: whatever data you want to probe, the statistics are computed on the fly, with no extra indexing or storage cost.

Manually specify a doc to get term vectors for

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 1,
   "term_vectors": {
     "text": {
       "field_statistics": {
         "sum_doc_freq": 6,
         "doc_count": 2,
         "sum_ttf": 8
       },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ]
         }
       }
     }
   }
 }

When you specify a doc by hand, you are not pointing at an existing document; you supply the content you want analyzed (here, "hello test test test") as if it belonged to the field.

Elasticsearch splits that content into terms and, for each term, computes its statistics across all the documents that already exist in the index.

This is very useful: it lets you hand in arbitrary content and probe how its terms are distributed. For example, you could probe the statistics of the term "Journey to the West".
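For example, probing the term "Journey to the West" against a movie index could look roughly like this (movie_index, movie_type, and film_desc are illustrative names, not from the examples above):

```
 GET /movie_index/movie_type/_termvectors
 {
   "doc": {
     "film_desc": "Journey to the West"
   },
   "fields": ["film_desc"],
   "term_statistics": true
 }
```

The doc_freq values in the response then tell you how many indexed documents contain each of the resulting terms.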

Manually specify an analyzer to generate the term vector

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true,
   "per_field_analyzer": {
     "text": "standard"
   }
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 0,
   "term_vectors": {
     "text": {
       "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ]
         }
       }
     }
   }
 }

Terms filter

 GET /waws_index/waws_type/_termvectors
 {
   "doc": {
     "fullname": "Leo Li",
     "text": "hello test test test"
   },
   "fields": ["text"],
   "offsets": true,
   "payloads": true,
   "positions": true,
   "term_statistics": true,
   "field_statistics": true,
   "filter": {
     "max_num_terms": 3,
     "min_term_freq": 1,
     "min_doc_freq": 1
   }
 }

 {
   "_index": "waws_index",
   "_type": "waws_type",
   "_version": 0,
   "found": true,
   "took": 1,
   "term_vectors": {
     "text": {
       "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
       "terms": {
         "hello": {
           "doc_freq": 2,
           "ttf": 2,
           "term_freq": 1,
           "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ],
           "score": 1
         },
         "test": {
           "doc_freq": 2,
           "ttf": 4,
           "term_freq": 3,
           "tokens": [
             { "position": 1, "start_offset": 6, "end_offset": 10 },
             { "position": 2, "start_offset": 11, "end_offset": 15 },
             { "position": 3, "start_offset": 16, "end_offset": 20 }
           ],
           "score": 3
         }
       }
     }
   }
 }

In other words, it is also useful to filter the returned term vectors based on the term statistics themselves. For example, when probing data, you can filter out terms that appear too infrequently.
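To get a feel for what the filter does, here is a rough Python sketch of the idea. Note this is only an approximation: the real terms filter ranks terms with a tf-idf-like score, while this sketch simply uses term_freq as a stand-in score:

```python
def filter_terms(stats, max_num_terms=3, min_term_freq=1, min_doc_freq=1):
    """Keep terms meeting the frequency thresholds, ranked by term_freq
    (a stand-in for Elasticsearch's tf-idf-like score)."""
    kept = {
        term: s for term, s in stats.items()
        if s["term_freq"] >= min_term_freq and s["doc_freq"] >= min_doc_freq
    }
    ranked = sorted(kept.items(), key=lambda kv: kv[1]["term_freq"], reverse=True)
    return dict(ranked[:max_num_terms])

# Statistics taken from the response above
stats = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
}
print(filter_terms(stats, max_num_terms=1))  # -> {'test': {'term_freq': 3, 'doc_freq': 2}}
```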

Multi term vectors

 GET _mtermvectors
 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "term_statistics": true
     },
     {
       "_index": "my_index",
       "_type": "my_type",
       "_id": "1",
       "fields": ["text"]
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
             },
             "hello": {
               "doc_freq": 2,
               "ttf": 2,
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
             },
             "other": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "doc_freq": 2,
               "ttf": 4,
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
             }
           }
         }
       }
     },
     {
       "_index": "my_index",
       "_type": "my_type",
       "_id": "1",
       "error": {
         "root_cause": [
           {
             "type": "index_not_found_exception",
             "reason": "no such index",
             "index_uuid": "_na_",
             "index": "my_index"
           }
         ],
         "type": "index_not_found_exception",
         "reason": "no such index",
         "index_uuid": "_na_",
         "index": "my_index"
       }
     }
   ]
 }
  • The second form: index in the URL
 GET /waws_index/_mtermvectors
 {
   "docs": [
     {
       "_type": "test",
       "_id": "2",
       "fields": ["text"],
       "term_statistics": true
     },
     {
       "_type": "test",
       "_id": "1"
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "test",
       "_id": "2",
       "_version": 0,
       "found": false,
       "took": 0
     },
     {
       "_index": "waws_index",
       "_type": "test",
       "_id": "1",
       "_version": 0,
       "found": false,
       "took": 0
     }
   ]
 }
  • The third form: index and type in the URL
 GET /waws_index/waws_type/_mtermvectors
 {
   "docs": [
     {
       "_id": "2",
       "fields": ["text"],
       "term_statistics": true
     },
     {
       "_id": "1"
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "2",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20, "payload": "d29yZA==" } ]
             },
             "hello": {
               "doc_freq": 2,
               "ttf": 2,
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11, "payload": "d29yZA==" } ]
             },
             "other": {
               "doc_freq": 1,
               "ttf": 1,
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "doc_freq": 2,
               "ttf": 4,
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16, "payload": "d29yZA==" } ]
             }
           }
         }
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_id": "1",
       "_version": 1,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ]
             },
             "test": {
               "term_freq": 3,
               "tokens": [
                 { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" },
                 { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" },
                 { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" }
               ]
             }
           }
         }
       }
     }
   ]
 }
  • The fourth form: artificial docs
 GET /_mtermvectors
 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "doc": {
         "fullname": "Leo Li",
         "text": "hello test test test"
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "doc": {
         "fullname": "Leo Li",
         "text": "other hello test ..."
       }
     }
   ]
 }

 {
   "docs": [
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_version": 0,
       "found": true,
       "took": 0,
       "term_vectors": {
         "fullname": {
           "field_statistics": { "sum_doc_freq": 4, "doc_count": 2, "sum_ttf": 4 },
           "terms": {
             "leo": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
             },
             "li": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
             }
           }
         },
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
             },
             "test": {
               "term_freq": 3,
               "tokens": [
                 { "position": 1, "start_offset": 6, "end_offset": 10 },
                 { "position": 2, "start_offset": 11, "end_offset": 15 },
                 { "position": 3, "start_offset": 16, "end_offset": 20 }
               ]
             }
           }
         }
       }
     },
     {
       "_index": "waws_index",
       "_type": "waws_type",
       "_version": 0,
       "found": true,
       "took": 0,
       "term_vectors": {
         "text": {
           "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 },
           "terms": {
             "...": {
               "term_freq": 1,
               "tokens": [ { "position": 3, "start_offset": 17, "end_offset": 20 } ]
             },
             "hello": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 6, "end_offset": 11 } ]
             },
             "other": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5 } ]
             },
             "test": {
               "term_freq": 1,
               "tokens": [ { "position": 2, "start_offset": 12, "end_offset": 16 } ]
             }
           }
         },
         "fullname": {
           "field_statistics": { "sum_doc_freq": 4, "doc_count": 2, "sum_ttf": 4 },
           "terms": {
             "leo": {
               "term_freq": 1,
               "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 3 } ]
             },
             "li": {
               "term_freq": 1,
               "tokens": [ { "position": 1, "start_offset": 4, "end_offset": 6 } ]
             }
           }
         }
       }
     }
   ]
 }