23 Elasticsearch mapping parameters

Songge has recorded a video tutorial covering these 23 common mapping parameters:

Video link: https://pan.baidu.com/s/1J23m… Extract code: 6K2A

This is a brief note from the video tutorial that Songge recorded. For the complete content, you can refer to the video.

1. Elasticsearch mapping parameters

1.1 analyzer

Defines the analyzer for a text field. By default it applies to both indexing and querying.

Suppose we do not configure an analyzer. Let's first look at how a document gets indexed by creating an index and adding a document whose title is Chinese text:

PUT blog

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

View the term vectors:

GET blog/_termvectors/1
{
  "fields": ["title"]
}

The result is as follows:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "title" : {
      "field_statistics" : { "sum_doc_freq" : 22, "doc_count" : 1, "sum_ttf" : 23 },
      "terms" : {
        "义" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 1, "end_offset" : 2 } ] },
        "分" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 7, "end_offset" : 8 } ] },
        "和" : { "term_freq" : 1, "tokens" : [ { "position" : 15, "start_offset" : 16, "end_offset" : 17 } ] },
        "器" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 9, "end_offset" : 10 } ] },
        "字" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 4, "end_offset" : 5 } ] },
        "定" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 1 } ] },
        "对" : { "term_freq" : 1, "tokens" : [ { "position" : 12, "start_offset" : 13, "end_offset" : 14 } ] },
        "引" : { "term_freq" : 1, "tokens" : [ { "position" : 14, "start_offset" : 15, "end_offset" : 16 } ] },
        "效" : { "term_freq" : 1, "tokens" : [ { "position" : 21, "start_offset" : 22, "end_offset" : 23 } ] },
        "文" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 2, "end_offset" : 3 } ] },
        "是" : { "term_freq" : 1, "tokens" : [ { "position" : 19, "start_offset" : 20, "end_offset" : 21 } ] },
        "有" : { "term_freq" : 1, "tokens" : [ { "position" : 20, "start_offset" : 21, "end_offset" : 22 } ] },
        "本" : { "term_freq" : 1, "tokens" : [ { "position" : 3, "start_offset" : 3, "end_offset" : 4 } ] },
        "查" : { "term_freq" : 1, "tokens" : [ { "position" : 16, "start_offset" : 17, "end_offset" : 18 } ] },
        "段" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 5, "end_offset" : 6 } ] },
        "的" : { "term_freq" : 2, "tokens" : [ { "position" : 6, "start_offset" : 6, "end_offset" : 7 }, { "position" : 22, "start_offset" : 23, "end_offset" : 24 } ] },
        "索" : { "term_freq" : 1, "tokens" : [ { "position" : 13, "start_offset" : 14, "end_offset" : 15 } ] },
        "认" : { "term_freq" : 1, "tokens" : [ { "position" : 11, "start_offset" : 12, "end_offset" : 13 } ] },
        "词" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 8, "end_offset" : 9 } ] },
        "询" : { "term_freq" : 1, "tokens" : [ { "position" : 17, "start_offset" : 18, "end_offset" : 19 } ] },
        "都" : { "term_freq" : 1, "tokens" : [ { "position" : 18, "start_offset" : 19, "end_offset" : 20 } ] },
        "默" : { "term_freq" : 1, "tokens" : [ { "position" : 10, "start_offset" : 11, "end_offset" : 12 } ] }
      }
    }
  }
}

As you can see, by default Chinese text is split character by character, which is not meaningful. With an index like this, a query can only match one character at a time, as follows:

GET blog/_search
{
  "query": {
    "term": {
      "title": "定"
    }
  }
}

Searching one character at a time like this is of little use.

Therefore, we should configure an analyzer that suits the actual data.

Set an analyzer for the field (ik_smart comes from the IK analysis plugin, which must be installed first):

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type":"text",
        "analyzer": "ik_smart"
      }
    }
  }
}

Index the document:

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

View the term vectors:

GET blog/_termvectors/1
{
  "fields": ["title"]
}

The result is as follows:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "title" : {
      "field_statistics" : { "sum_doc_freq" : 12, "doc_count" : 1, "sum_ttf" : 13 },
      "terms" : {
        "分词器" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 7, "end_offset" : 10 } ] },
        "和" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 16, "end_offset" : 17 } ] },
        "字段" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 4, "end_offset" : 6 } ] },
        "定义" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 2 } ] },
        "对" : { "term_freq" : 1, "tokens" : [ { "position" : 6, "start_offset" : 13, "end_offset" : 14 } ] },
        "文本" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 2, "end_offset" : 4 } ] },
        "有效" : { "term_freq" : 1, "tokens" : [ { "position" : 11, "start_offset" : 21, "end_offset" : 23 } ] },
        "查询" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 17, "end_offset" : 19 } ] },
        "的" : { "term_freq" : 2, "tokens" : [ { "position" : 3, "start_offset" : 6, "end_offset" : 7 }, { "position" : 12, "start_offset" : 23, "end_offset" : 24 } ] },
        "索引" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 14, "end_offset" : 16 } ] },
        "都是" : { "term_freq" : 1, "tokens" : [ { "position" : 10, "start_offset" : 19, "end_offset" : 21 } ] },
        "默认" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 11, "end_offset" : 13 } ] }
      }
    }
  }
}

Now you can search by whole words:

GET blog/_search
{
  "query": {
    "term": {
      "title": "索引"
    }
  }
}

1.2 search_analyzer

The analyzer used at query time. When a query runs, ES first checks whether search_analyzer is configured for the field and, if so, uses it. If not, it checks whether analyzer is configured and uses that. If neither is set, the ES default analyzer is used.
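
The lookup order described above can be sketched in a few lines of Python. This is only a model of the simplified chain in this note, not Elasticsearch code: the function name `resolve_search_analyzer` is made up for illustration, and the real lookup also considers an analyzer passed in the query itself and index-level default analyzers, which are left out here.

```python
from typing import Optional

DEFAULT_ANALYZER = "standard"  # the built-in analyzer ES falls back to


def resolve_search_analyzer(field_mapping: dict) -> str:
    """Return the analyzer name used at query time for one field mapping."""
    search_analyzer: Optional[str] = field_mapping.get("search_analyzer")
    if search_analyzer is not None:
        return search_analyzer  # 1. an explicit search_analyzer wins
    analyzer: Optional[str] = field_mapping.get("analyzer")
    if analyzer is not None:
        return analyzer  # 2. otherwise reuse the index-time analyzer
    return DEFAULT_ANALYZER  # 3. otherwise the default


print(resolve_search_analyzer({"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"}))
print(resolve_search_analyzer({"type": "text", "analyzer": "ik_smart"}))
print(resolve_search_analyzer({"type": "text"}))
```

A common reason to set the two differently is to index with a fine-grained analyzer and query with a coarser one.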

1.3 normalizer

The normalizer parameter normalizes a keyword value before it is indexed or queried.

For example, in ES, strings that we do not want tokenized are usually mapped as keyword and searched as a whole. If the data was not cleaned before indexing and the case is inconsistent, such as javaboy and JAVABOY, a normalizer can normalize the values both before indexing and before querying.

As a counterexample, create an index named blog and set the type of the author field to keyword:

PUT blog
{
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword"
      }
    }
  }
}

Add two documents:

PUT blog/_doc/1
{
  "author":"javaboy"
}

PUT blog/_doc/2
{
  "author":"JAVABOY"
}

Then do a search:

GET blog/_search
{
  "query": {
    "term": {
      "author": "JAVABOY"
    }
  }
}

An uppercase keyword only finds the uppercase document, and a lowercase keyword only finds the lowercase document.

With a normalizer, the value is preprocessed in the same way at both index time and query time.

A normalizer is defined as follows:

PUT blog
{
  "settings": {
    "analysis": {
      "normalizer":{
        "my_normalizer":{
          "type":"custom",
          "filter":["lowercase"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword",
        "normalizer":"my_normalizer"
      }
    }
  }
}

The normalizer is defined in settings and referenced in mappings.

The test is the same as before. Now an uppercase keyword also finds the lowercase document, because both indexing and querying convert uppercase to lowercase.
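
To make the effect concrete, here is a small Python model of a keyword index with and without a lowercase normalizer. It only mimics the behavior described above; `build_keyword_index` and `term_query` are hypothetical helper names, not Elasticsearch APIs.

```python
def build_keyword_index(docs, normalize=None):
    """Map each indexed term to the set of document ids that contain it."""
    index = {}
    for doc_id, value in docs.items():
        term = normalize(value) if normalize else value
        index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, keyword, normalize=None):
    """Look up a keyword, applying the same normalization as at index time."""
    term = normalize(keyword) if normalize else keyword
    return sorted(index.get(term, set()))


docs = {1: "javaboy", 2: "JAVABOY"}

# Without a normalizer the two spellings are different terms.
plain = build_keyword_index(docs)
print(term_query(plain, "JAVABOY"))  # [2]

# With a lowercase normalizer both documents end up under the same term.
lowered = build_keyword_index(docs, normalize=str.lower)
print(term_query(lowered, "JAVABOY", normalize=str.lower))  # [1, 2]
```

The key point is that the same function runs on both sides: normalizing only at index time or only at query time would not line the terms up.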

1.4 boost

The boost parameter sets the weight of a field.

boost can be used in two places: in the mapping, when specifying the field type, or at query time.

The latter is recommended in practice. The former has a drawback: the weight cannot be changed without reindexing the documents.

Using boost in the mapping (not recommended):

PUT blog
{
  "mappings": {
    "properties": {
      "content":{
        "type": "text",
        "boost": 2
      }
    }
  }
}

The other way is to specify boost at query time:

GET blog/_search
{
  "query": {
    "match": {
      "content": {
        "query": "hello",
        "boost": 2
      }
    }
  }
}

1.5 coerce

coerce is used to clean up dirty data. It defaults to true.

For example, a number in JSON might be written incorrectly by the user, as a string:

{"age":"99"}

Or:

{"age":"99.0"}

Neither of these is a proper JSON number.

coerce takes care of this.

By default the following works, because coerce converts the string into an integer:

PUT blog
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer"
      }
    }
  }
}

POST blog/_doc
{
  "age":"99.0"
}

To turn coerce off, configure the field as follows:

PUT blog
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "coerce": false
      }
    }
  }
}

POST blog/_doc
{
  "age":99
}

With coerce set to false, the value must be a JSON number, not a string. Passing a string to this field returns an error.
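
As a rough illustration, the Python sketch below models how an integer field treats the values from this section. It is an assumption-laden model, not the Elasticsearch parser: `parse_integer_field` is a hypothetical name, and only JSON integers and strings are covered.

```python
def parse_integer_field(value, coerce=True):
    """Accept or reject an incoming JSON value for an integer field (model only)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value  # a real JSON integer is always fine
    if isinstance(value, str):
        if not coerce:
            raise ValueError(f"string {value!r} rejected because coerce is false")
        # "99" -> 99 and "99.0" -> 99: the string is parsed, any fraction dropped
        return int(float(value))
    raise TypeError(f"value type not covered by this sketch: {type(value).__name__}")


print(parse_integer_field("99"))              # 99
print(parse_integer_field("99.0"))            # 99
print(parse_integer_field(99, coerce=False))  # 99

try:
    parse_integer_field("99", coerce=False)
except ValueError as error:
    print("rejected:", error)
```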

1.6 copy_to

This parameter copies the values of multiple fields into a single field.

It is defined as follows:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "copy_to": "full_content"
      },
      "content":{
        "type": "text",
        "copy_to": "full_content"
      },
      "full_content":{
        "type": "text"
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"You're a little bit better",
  "content":"When coerce changes to false, the number must be a number, not a string; passing a string to this field returns an error."
}

GET blog/_search
{
  "query": {
    "term": {
      "full_content": "when"
    }
  }
}
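
What copy_to does at index time can be pictured with a short Python model. The helper name `apply_copy_to` and the sample values are made up for illustration; the sketch only shows which values each field ends up indexing, and the original document itself is left untouched.

```python
def apply_copy_to(doc, mapping):
    """Return the values each field would index, following copy_to in the mapping."""
    indexed = {field: [value] for field, value in doc.items()}
    for field, value in doc.items():
        target = mapping.get(field, {}).get("copy_to")
        if target:
            # the value is indexed again under the target field
            indexed.setdefault(target, []).append(value)
    return indexed


mapping = {
    "title": {"type": "text", "copy_to": "full_content"},
    "content": {"type": "text", "copy_to": "full_content"},
    "full_content": {"type": "text"},
}
doc = {"title": "hello javaboy", "content": "notes about copy_to"}

indexed = apply_copy_to(doc, mapping)
print(indexed["full_content"])  # ['hello javaboy', 'notes about copy_to']
print(doc)                      # the source document is unchanged
```

Searching full_content therefore matches text that came from either title or content.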

1.7 doc_values and fielddata

ES searches using the inverted index. The doc_values parameter speeds up sorting and aggregations: when the inverted index is built, an additional column-oriented mapping from document to value is stored alongside it.

doc_values is enabled by default. If you are sure a field will never be sorted or aggregated on, you can turn it off.

Most field types build doc_values at index time; text is the exception. For text fields, a data structure called fielddata is built at query time instead, the first time the field is used for aggregation or sorting.

doc_values	fielddata
Created at index time	Created dynamically on first use
Stored on disk	Stored in memory
Saves memory	Saves disk space
Indexing is slightly slower	With many documents, dynamic creation is slow and uses a lot of memory

doc_values is enabled by default and fielddata is disabled by default.

doc_values demo:
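
The difference between the two structures is easier to see side by side. The Python sketch below is a toy model, assuming four documents with an age field: `inverted` plays the role of the inverted index and `column` plays the role of doc_values. Neither reflects the actual on-disk format.

```python
from collections import defaultdict

docs = {1: {"age": 100}, 2: {"age": 99}, 3: {"age": 98}, 4: {"age": 101}}

# Inverted index: value -> document ids. Good for "which documents contain X?"
inverted = defaultdict(set)
# Column store: document id -> value. Good for "what is the value of document N?"
column = {}
for doc_id, source in docs.items():
    inverted[source["age"]].add(doc_id)
    column[doc_id] = source["age"]


def search_term(value):
    """Answer a term query from the inverted index."""
    return sorted(inverted.get(value, set()))


def sort_desc(doc_ids):
    """Sort documents by value using the column store."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id], reverse=True)


print(search_term(99))   # [2]
print(sort_desc(docs))   # [4, 1, 2, 3]
```

Sorting from the inverted index alone would mean walking every term to find each document's value, which is why the extra column structure exists.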

PUT users

PUT users/_doc/1
{
  "age":100
}

PUT users/_doc/2
{
  "age":99
}

PUT users/_doc/3
{
  "age":98
}

PUT users/_doc/4
{
  "age":101
}

GET users/_search
{
  "query": {
    "match_all": {}
  },
  "sort":[
    {
      "age":{
        "order": "desc"
      }
    }
    ]
}

Since doc_values is enabled by default, you can use this field to sort. If you want to turn doc_values off, you can do the following:

PUT users
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "doc_values": false
      }
    }
  }
}

PUT users/_doc/1
{
  "age":100
}

PUT users/_doc/2
{
  "age":99
}

PUT users/_doc/3
{
  "age":98
}

PUT users/_doc/4
{
  "age":101
}

GET users/_search
{
  "query": {
    "match_all": {}
  },
  "sort":[
    {
      "age":{
        "order": "desc"
      }
    }
    ]
}

With doc_values disabled on age, this sort request returns an error instead of sorted results.

1.8 dynamic

1.9 enabled

By default, ES indexes all fields, but some fields may only need to be stored, not indexed. This is controlled by the enabled parameter:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "enabled": false
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title": "javaboy"
    }
  }
}

With enabled set to false, you can no longer search on that field.

1.10 format

Date format. The format parameter specifies the date format, and more than one format can be defined at once.

PUT users
{
  "mappings": {
    "properties": {
      "birthday":{
        "type": "date",
        "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

PUT users/_doc/1
{
  "birthday":"2020-11-11"
}

PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11"
}

Multiple date formats are joined with ||, with no spaces around it.
If no date format is specified, the default is strict_date_optional_time||epoch_millis

All supported date formats are listed at https://www.elastic.co/guide/…
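
The "try each format in turn" idea behind || can be sketched in Python. This is only an analogy: the two strptime patterns are my own translation of yyyy-MM-dd and yyyy-MM-dd HH:mm:ss, and `parse_birthday` is a hypothetical helper, not how ES parses dates internally.

```python
from datetime import datetime

# Assumed Python equivalents of "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss"
PATTERNS = ["%Y-%m-%d", "%Y-%m-%d %H:%M:%S"]


def parse_birthday(text):
    """Return the first successful parse, trying the patterns in order."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            continue  # this pattern does not fit, try the next one
    raise ValueError(f"{text!r} matches none of the configured formats")


print(parse_birthday("2020-11-11"))
print(parse_birthday("2020-11-11 11:11:11"))
```

A value that fits none of the formats is rejected, which is what happens to a malformed date in a date field.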

1.11 ignore_above

ignore_above specifies the maximum length of a string that will be indexed. A string longer than this value is not indexed. This parameter applies only to keyword fields.

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "keyword",
        "ignore_above": 10
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

PUT blog/_doc/2
{
  "title":"javaboyjavaboyjavaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title": "javaboyjavaboyjavaboy"
    }
  }
}
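
In the example above, the second title is 21 characters long, which exceeds the limit of 10, so the term query finds nothing. A minimal Python model of that rule, using the hypothetical helper `index_keyword`:

```python
def index_keyword(value, ignore_above):
    """Return the term to index, or None when the value is too long to index."""
    if len(value) > ignore_above:
        return None
    return value


docs = {1: "javaboy", 2: "javaboyjavaboyjavaboy"}

index = {}
for doc_id, title in docs.items():
    term = index_keyword(title, ignore_above=10)
    if term is not None:
        index.setdefault(term, []).append(doc_id)

print(index.get("javaboy", []))                # [1]
print(index.get("javaboyjavaboyjavaboy", []))  # []
```

Document 2 is still saved and its title is still present in _source; only the index entry for that field is skipped.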

1.12 ignore_malformed

ignore_malformed tells ES to ignore malformed data in a field instead of rejecting the document. It defaults to false.

PUT users
{
  "mappings": {
    "properties": {
      "birthday":{
        "type": "date",
        "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss"
      },
      "age":{
        "type": "integer",
        "ignore_malformed": true
      }
    }
  }
}

PUT users/_doc/1
{
  "birthday":"2020-11-11",
  "age":99
}

PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11",
  "age":"abc"
}


PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11aaa",
  "age":"abc"
}
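
The three requests above behave differently because only age has ignore_malformed turned on. The Python sketch below models that per-field rule. `MAPPING`, `parse_date`, and `index_document` are illustrative names, and the parsers are simple stand-ins for the real field types.

```python
from datetime import datetime


def parse_date(text):
    """Parse the two formats used in the mapping; raise ValueError otherwise."""
    for pattern in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            pass
    raise ValueError(f"malformed date: {text!r}")


# Hypothetical stand-in for the users mapping: a parser per field plus the flag.
MAPPING = {
    "birthday": {"parse": parse_date, "ignore_malformed": False},
    "age": {"parse": int, "ignore_malformed": True},
}


def index_document(doc):
    """Return the fields that get indexed, or raise if the document is rejected."""
    indexed = {}
    for field, value in doc.items():
        rule = MAPPING[field]
        try:
            indexed[field] = rule["parse"](value)
        except ValueError:
            if not rule["ignore_malformed"]:
                raise  # the whole document is rejected
            # otherwise skip just this field and keep the rest of the document
    return indexed


print(sorted(index_document({"birthday": "2020-11-11", "age": 99})))
print(sorted(index_document({"birthday": "2020-11-11 11:11:11", "age": "abc"})))
```

The bad age is dropped while the document is kept, whereas a malformed birthday fails the whole request because that field does not set ignore_malformed.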

1.13 include_in_all

This parameter applies to the _all field, which was deprecated in ES 6 and removed in ES 7.

1.14 index

The index parameter specifies whether a field is indexed: true (the default) means the field is indexed, and false means it is not.

PUT users
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "index": false
      }
    }
  }
}

PUT users/_doc/1
{
  "age":99
}

GET users/_search
{
  "query": {
    "term": {
      "age": 99
    }
  }
}

If index is false, you cannot search by the corresponding field.

1.15 index_options

index_options controls which information is stored in the inverted index (it is intended for text fields). It accepts four values:

index_options	note
docs	Only the document number is stored (the default for non-text fields)
freqs	In addition to docs, term frequencies are stored
positions	In addition to freqs, term positions are stored (the default for text fields)
offsets	In addition to positions, the start and end character offsets of each term are stored

1.16 norms

norms are useful for scoring a field. text fields have norms enabled by default; if you do not particularly need scoring on a field, turn them off.

1.17 null_value

In ES, a null value cannot be indexed or searched. null_value replaces an explicit null with the given value so that it becomes indexable and searchable:

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword",
        "null_value": "javaboy_null"
      }
    }
  }
}

PUT users/_doc/1
{
  "name":null,
  "age":99
}

GET users/_search
{
  "query": {
    "term": {
      "name": "javaboy_null"
    }
  }
}

1.18 position_increment_gap

To support proximity and phrase queries, when a text field with multiple values is indexed, an imaginary gap is inserted between the values to keep them apart, which avoids meaningless phrase matches across values. The size of the gap is controlled by position_increment_gap, which defaults to 100.

PUT users

PUT users/_doc/1
{
  "name":["zhang san","li si"]
}

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li"
      }
    }
  }
}

The phrase san li cannot be found, because there is an imaginary gap of 100 between the two values.

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li",
        "slop": 101
      }
    }
  }
}

A slop larger than the gap lets the phrase query match across the two values.

You can also specify a gap when defining an index:

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

PUT users/_doc/1
{
  "name":["zhang san","li si"]
}

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li"
      }
    }
  }
}
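
To see why a slop of 101 is enough and why a gap of 0 makes the plain phrase match, here is a small Python model of token positions. It is a simplification under stated assumptions: tokens are split on whitespace, only two-term in-order phrases are checked, and `token_positions` and `phrase_matches` are hypothetical helpers rather than Lucene's actual sloppy phrase matching.

```python
def token_positions(values, gap):
    """Assign a position to every token of a multi-valued text field."""
    positions = []
    next_position = 0
    for value in values:
        for token in value.split():
            positions.append((token, next_position))
            next_position += 1
        next_position += gap  # imaginary gap before the next value
    return positions


def phrase_matches(positions, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` positions in between."""
    firsts = [pos for token, pos in positions if token == first]
    seconds = [pos for token, pos in positions if token == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


names = ["zhang san", "li si"]

default_gap = token_positions(names, gap=100)
print(default_gap)  # [('zhang', 0), ('san', 1), ('li', 102), ('si', 103)]
print(phrase_matches(default_gap, "san", "li"))            # False
print(phrase_matches(default_gap, "san", "li", slop=101))  # True

no_gap = token_positions(names, gap=0)
print(phrase_matches(no_gap, "san", "li"))                 # True
```

With the default gap, san and li sit about 100 positions apart even though they look adjacent in the JSON array, which is exactly what keeps unrelated values from forming a phrase.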

1.19 properties

1.20 similarity

similarity specifies the scoring model used for documents. There are three built-in types:

similarity	note
BM25	The default scoring model in ES and Lucene
classic	TF/IDF scoring
boolean	Boolean model scoring

1.21 store

By default, fields are indexed and searchable, but not stored separately. That is fine, because the original field values are still kept in _source. If you do want a field stored on its own, set store to true.

1.22 term_vector

Term vectors hold the information produced by the analyzer, including:

A set of terms
The position of each term
The start and end character offsets of each term in the original string

term_vector values:

value	note
no	No term vectors are stored (the default)
yes	Only the terms are stored
with_positions	Terms and positions are stored
with_offsets	Terms and character offsets are stored
with_positions_offsets	Terms, positions, and offsets are all stored

1.23 fields

The fields parameter lets the same field be indexed in several different ways. For example:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "fields": {
          "raw":{
            "type":"keyword"
          }
        }
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title.raw": "javaboy"
    }
  }
}

https://www.elastic.co/guide/…

Finally, Songge has also collected more than 50 project requirements documents. If you are looking for a project to practice on, take a look.

The requirements document address: https://github.com/lenve/javadoc