In the fifth part of this tutorial, we look at 23 common mapping parameters in Elasticsearch (ES).

Songo has recorded a video tutorial on these 23 common mapping parameters:

Video link: pan.baidu.com/s/1J23m6oST… Extraction code: 6K2A

These are notes taken from Songo's video. They are brief and to the point; refer to the video for the full content.

1. Elasticsearch mapping parameters

1.1 analyzer

Specifies the analyzer for a text field. By default, it applies to both indexing and querying.

First, let's look at how a document is indexed when no analyzer is configured. Create an index and add a document whose title is a Chinese sentence (it means "Defines the analyzer for a text field. By default, it applies to both indexing and querying."):

PUT blog

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

View the term vectors:

GET blog/_termvectors/1
{
  "fields": ["title"]
}

The result is as follows:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "title" : {
      "field_statistics" : { "sum_doc_freq" : 22, "doc_count" : 1, "sum_ttf" : 23 },
      "terms" : {
        "义" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 1, "end_offset" : 2 } ] },
        "分" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 7, "end_offset" : 8 } ] },
        "和" : { "term_freq" : 1, "tokens" : [ { "position" : 15, "start_offset" : 16, "end_offset" : 17 } ] },
        "器" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 9, "end_offset" : 10 } ] },
        "字" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 4, "end_offset" : 5 } ] },
        "定" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 1 } ] },
        "对" : { "term_freq" : 1, "tokens" : [ { "position" : 12, "start_offset" : 13, "end_offset" : 14 } ] },
        "引" : { "term_freq" : 1, "tokens" : [ { "position" : 14, "start_offset" : 15, "end_offset" : 16 } ] },
        "效" : { "term_freq" : 1, "tokens" : [ { "position" : 21, "start_offset" : 22, "end_offset" : 23 } ] },
        "文" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 2, "end_offset" : 3 } ] },
        "是" : { "term_freq" : 1, "tokens" : [ { "position" : 19, "start_offset" : 20, "end_offset" : 21 } ] },
        "有" : { "term_freq" : 1, "tokens" : [ { "position" : 20, "start_offset" : 21, "end_offset" : 22 } ] },
        "本" : { "term_freq" : 1, "tokens" : [ { "position" : 3, "start_offset" : 3, "end_offset" : 4 } ] },
        "查" : { "term_freq" : 1, "tokens" : [ { "position" : 16, "start_offset" : 17, "end_offset" : 18 } ] },
        "段" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 5, "end_offset" : 6 } ] },
        "的" : { "term_freq" : 2, "tokens" : [ { "position" : 6, "start_offset" : 6, "end_offset" : 7 }, { "position" : 22, "start_offset" : 23, "end_offset" : 24 } ] },
        "索" : { "term_freq" : 1, "tokens" : [ { "position" : 13, "start_offset" : 14, "end_offset" : 15 } ] },
        "认" : { "term_freq" : 1, "tokens" : [ { "position" : 11, "start_offset" : 12, "end_offset" : 13 } ] },
        "词" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 8, "end_offset" : 9 } ] },
        "询" : { "term_freq" : 1, "tokens" : [ { "position" : 17, "start_offset" : 18, "end_offset" : 19 } ] },
        "都" : { "term_freq" : 1, "tokens" : [ { "position" : 18, "start_offset" : 19, "end_offset" : 20 } ] },
        "默" : { "term_freq" : 1, "tokens" : [ { "position" : 10, "start_offset" : 11, "end_offset" : 12 } ] }
      }
    }
  }
}

As you can see, by default Chinese text is split character by character, which is meaningless. With this tokenization, a query can only match one character at a time, like this:

GET blog/_search
{
  "query": {
    "term": {
      "title": "定"
    }
  }
}

Such a query is of little use.

Therefore, we should configure an appropriate analyzer for the actual situation.

To set an analyzer for a field (ik_smart comes from the IK analysis plugin, which must be installed):

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type":"text",
        "analyzer": "ik_smart"
      }
    }
  }
}

Store the document, with a Chinese title:

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

View the term vectors:

GET blog/_termvectors/1
{
  "fields": ["title"]
}

The result is as follows:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "title" : {
      "field_statistics" : { "sum_doc_freq" : 12, "doc_count" : 1, "sum_ttf" : 13 },
      "terms" : {
        "分词器" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 7, "end_offset" : 10 } ] },
        "和" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 16, "end_offset" : 17 } ] },
        "字段" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 4, "end_offset" : 6 } ] },
        "定义" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 2 } ] },
        "对" : { "term_freq" : 1, "tokens" : [ { "position" : 6, "start_offset" : 13, "end_offset" : 14 } ] },
        "文本" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 2, "end_offset" : 4 } ] },
        "有效" : { "term_freq" : 1, "tokens" : [ { "position" : 11, "start_offset" : 21, "end_offset" : 23 } ] },
        "查询" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 17, "end_offset" : 19 } ] },
        "的" : { "term_freq" : 2, "tokens" : [ { "position" : 3, "start_offset" : 6, "end_offset" : 7 }, { "position" : 12, "start_offset" : 23, "end_offset" : 24 } ] },
        "索引" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 14, "end_offset" : 16 } ] },
        "都是" : { "term_freq" : 1, "tokens" : [ { "position" : 10, "start_offset" : 19, "end_offset" : 21 } ] },
        "默认" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 11, "end_offset" : 13 } ] }
      }
    }
  }
}

Now you can search by word, for example 索引 ("index"):

GET blog/_search
{
  "query": {
    "term": {
      "title": "索引"
    }
  }
}

1.2 search_analyzer

The analyzer used at query time. By default, if search_analyzer is not configured, the field's analyzer is used for queries as well. At query time, ES first checks whether search_analyzer is configured and uses it if so; if not, it checks whether analyzer is configured and uses that; otherwise it uses the ES default analyzer.
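
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the precedence, not ES code: the function name resolve_query_analyzer and the plain-dict field mapping are assumptions made for the example, and "standard" stands in for the ES default analyzer.

```python
def resolve_query_analyzer(field_mapping, default="standard"):
    """Pick the analyzer used at query time, in the order described above."""
    # 1. An explicit search_analyzer wins.
    if "search_analyzer" in field_mapping:
        return field_mapping["search_analyzer"]
    # 2. Otherwise fall back to the analyzer configured on the field.
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    # 3. Otherwise use the default analyzer.
    return default


# Hypothetical mapping: one analyzer for indexing, another for searching.
title = {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"}
print(resolve_query_analyzer(title))  # ik_smart
```

A common reason to configure both is to index with a fine-grained analyzer and search with a coarser one.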

1.3 normalizer

The normalizer parameter configures normalization that is applied to a value before it is indexed or queried.

For example, in ES, strings that we don't want to tokenize are usually mapped as keyword and searched by their whole value. If the data was not cleaned before indexing and its case is inconsistent, such as javaboy and Javaboy, a normalizer can be used to normalize the documents at index time and the terms at query time.

As a counter example, create an index named blog and set the author field type to keyword:

PUT blog
{
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword"
      }
    }
  }
}

Add two documents:

PUT blog/_doc/1
{
  "author":"javaboy"
}

PUT blog/_doc/2
{
  "author":"JAVABOY"
}

Then do a search:

GET blog/_search
{
  "query": {
    "term": {
      "author": "JAVABOY"
    }
  }
}

An uppercase term finds only the uppercase document, and a lowercase term finds only the lowercase document.

With a normalizer, values can be pre-processed at both index time and query time.

A normalizer is defined as follows:

PUT blog
{
  "settings": {
    "analysis": {
      "normalizer":{
        "my_normalizer":{
          "type":"custom",
          "filter":["lowercase"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword",
        "normalizer":"my_normalizer"
      }
    }
  }
}

Define the normalizer in settings, then reference it in mappings.

The test is the same as before. This time, an uppercase term also finds the lowercase document, because both indexing and querying convert uppercase to lowercase.
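
The effect of a lowercase normalizer can be imitated with a tiny in-memory index. This is a simplified model rather than how ES is implemented: build_index and term_query are hypothetical helpers, and Python's str.lower stands in for the lowercase filter.

```python
def lowercase_normalizer(value):
    """Stand-in for a custom normalizer with a lowercase filter."""
    return value.lower()


def build_index(docs, normalizer=None):
    """Map each (optionally normalized) keyword value to the ids that contain it."""
    index = {}
    for doc_id, author in docs.items():
        key = normalizer(author) if normalizer else author
        index.setdefault(key, set()).add(doc_id)
    return index


def term_query(index, term, normalizer=None):
    """Look up a term, applying the same normalizer to the query term."""
    key = normalizer(term) if normalizer else term
    return sorted(index.get(key, set()))


docs = {1: "javaboy", 2: "JAVABOY"}
plain = build_index(docs)
normalized = build_index(docs, lowercase_normalizer)

print(term_query(plain, "JAVABOY"))                             # [2]
print(term_query(normalized, "JAVABOY", lowercase_normalizer))  # [1, 2]
```

The key point is that the same normalization runs on both sides, so the indexed value and the query term end up identical.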

1.4 boost

The boost parameter sets the weight of a field.

boost can be used in two ways: in the mappings, when the field type is specified, or at query time.

The latter is recommended in practice, because the former has a problem: the weight cannot be changed without reindexing the documents.

Using boost in the mappings (not recommended):

PUT blog
{
  "mappings": {
    "properties": {
      "content":{
        "type": "text",
        "boost": 2
      }
    }
  }
}

The other way is to specify boost at query time:

GET blog/_search
{
  "query": {
    "match": {
      "content": {
        "query": "hello",
        "boost": 2
      }
    }
  }
}

1.5 coerce

coerce cleans up dirty data. The default value is true.

For example, a number in JSON may be written incorrectly by the user:

{"age":"99"}

Or:

{"age":"99.0"}

Neither of these is a correctly formatted integer.

coerce resolves this problem.

By default, the following requests succeed, because coerce is in effect:

PUT blog
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer"
      }
    }
  }
}

POST blog/_doc
{
  "age":"99.0"
}

To disable coerce, configure the field as follows:

PUT blog
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "coerce": false
      }
    }
  }
}

POST blog/_doc
{
  "age":99
}

Once coerce is set to false, the value must be a number and cannot be a string; passing a string to this field causes an error.
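
The behaviour can be approximated in Python as follows. This is a rough model under stated assumptions: index_integer is a hypothetical function, it only handles integers and strings, and dropping the fractional part of "99.0" mirrors the coercion described above rather than reproducing ES internals.

```python
def index_integer(value, coerce=True):
    """Return the integer that would be indexed, or raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise ValueError(f"not an integer: {value!r}")
    if isinstance(value, int):
        return value
    if not coerce:
        raise ValueError(f"coerce is disabled, rejecting {value!r}")
    # With coerce on, numeric strings are parsed and the fraction is dropped.
    return int(float(value))


print(index_integer("99"))             # 99
print(index_integer("99.0"))           # 99
print(index_integer(99, coerce=False)) # 99
```

With coerce=False, calling index_integer("99", coerce=False) raises ValueError, which corresponds to the indexing error mentioned above.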

1.6 copy_to

This property copies the values of multiple fields into the same field.

The definition is as follows:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "copy_to": "full_content"
      },
      "content":{
        "type": "text",
        "copy_to": "full_content"
      },
      "full_content":{
        "type": "text"
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"coerce",
  "content":"When coerce is changed to false, the number will be a number and not a string."
}

GET blog/_search
{
  "query": {
    "term": {
      "full_content": "when"
    }
  }
}
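
A minimal Python model of copy_to may help make the idea concrete. The helpers analyze and index_document are assumptions for illustration only, and the regular-expression tokenizer is a very rough stand-in for a real analyzer.

```python
import re


def analyze(text):
    """Very rough stand-in for an analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def index_document(doc, copy_to):
    """Build per-field term lists, appending copied terms to the target field."""
    terms = {field: analyze(value) for field, value in doc.items()}
    for source, target in copy_to.items():
        terms.setdefault(target, []).extend(terms[source])
    return terms


doc = {
    "title": "coerce",
    "content": "When coerce is changed to false, the number will be a number and not a string.",
}
terms = index_document(doc, {"title": "full_content", "content": "full_content"})

print("when" in terms["full_content"])  # True
print("full_content" in doc)            # False: the original document is unchanged
```

The terms of both source fields end up searchable under full_content, while the document that was submitted keeps only title and content.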

1.7 doc_values and fielddata

ES searches rely mainly on the inverted index. The doc_values parameter exists to speed up sorting and aggregations: when the inverted index is built, an additional column-oriented mapping is stored alongside it.

doc_values is enabled by default. If you are sure that a field needs neither sorting nor aggregation, you can turn doc_values off.

Most field types generate doc_values at index time; text is the exception. A text field builds a fielddata structure at query time instead, the first time the field is aggregated or sorted on.

| doc_values | fielddata |
| --- | --- |
| Created at index time | Created dynamically on first use |
| Stored on disk | Stored in memory |
| Does not use memory | Does not use disk |
| Indexing is slightly slower | With many documents, dynamic creation is slow and uses a lot of memory |

doc_values is on by default; fielddata is off by default.

doc_values demo:

PUT users

PUT users/_doc/1
{
  "age":100
}

PUT users/_doc/2
{
  "age":99
}

PUT users/_doc/3
{
  "age":98
}

PUT users/_doc/4
{
  "age":101
}

GET users/_search
{
  "query": {
    "match_all": {}
  },
  "sort":[
    {
      "age":{
        "order": "desc"
      }
    }
    ]
}

doc_values is enabled by default, so the sort above works directly. To turn doc_values off, configure the field as follows; the same sort request then fails, because there are no doc_values to sort on:

PUT users
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "doc_values": false
      }
    }
  }
}

PUT users/_doc/1
{
  "age":100
}

PUT users/_doc/2
{
  "age":99
}

PUT users/_doc/3
{
  "age":98
}

PUT users/_doc/4
{
  "age":101
}

GET users/_search
{
  "query": {
    "match_all": {}
  },
  "sort":[
    {
      "age":{
        "order": "desc"
      }
    }
    ]
}

1.8 dynamic

1.9 enabled

ES indexes all fields by default, but some fields only need to be stored, not indexed. This is controlled by the enabled parameter:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "enabled": false
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title": "javaboy"
    }
  }
}

Once enabled is set to false, the field can no longer be searched.

1.10 format

Date format. format specifies the date formats accepted by a field, and several formats can be defined at once.

PUT users
{
  "mappings": {
    "properties": {
      "birthday":{
        "type": "date",
        "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

PUT users/_doc/1
{
  "birthday":"2020-11-11"
}

PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11"
}

  • Separate multiple date formats with ||; note that there are no spaces around it.
  • If no format is specified for a date field, the default is strict_date_optional_time||epoch_millis.

All of the supported date formats are listed at www.elastic.co/guide/en/el…
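
The "try each format in turn" idea behind a ||-separated format list can be sketched in Python. This is an analogy, not ES code: parse_birthday is a hypothetical helper, and the strptime patterns are Python equivalents assumed for yyyy-MM-dd and yyyy-MM-dd HH:mm:ss.

```python
from datetime import datetime

# Assumed Python equivalents of "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss".
FORMATS = ["%Y-%m-%d", "%Y-%m-%d %H:%M:%S"]


def parse_birthday(value):
    """Try each configured format in order; fail if none of them matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"{value!r} matches none of {FORMATS}")


print(parse_birthday("2020-11-11"))           # 2020-11-11 00:00:00
print(parse_birthday("2020-11-11 11:11:11"))  # 2020-11-11 11:11:11
```

A value that fits none of the formats, such as "2020-11-11 11:11:11aaa", raises ValueError in this sketch.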

1.11 ignore_above

ignore_above specifies the maximum string length that will be indexed; a value longer than this is not indexed. This parameter applies only to the keyword type.

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "keyword",
        "ignore_above": 10
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

PUT blog/_doc/2
{
  "title":"javaboyjavaboyjavaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title": "javaboyjavaboyjavaboy"
    }
  }
}
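
In the example above, the 21-character title is longer than ignore_above, so the final search finds nothing, while a search for javaboy would find document 1. A small Python model of this, where build_keyword_index is a hypothetical helper and a dict stands in for the index:

```python
def build_keyword_index(docs, ignore_above):
    """Index keyword values, skipping any value longer than ignore_above."""
    index = {}
    for doc_id, title in docs.items():
        if len(title) > ignore_above:
            continue  # too long: the value is not indexed, the document is still kept
        index.setdefault(title, []).append(doc_id)
    return index


docs = {1: "javaboy", 2: "javaboyjavaboyjavaboy"}
index = build_keyword_index(docs, ignore_above=10)

print(index.get("javaboy", []))                # [1]
print(index.get("javaboyjavaboyjavaboy", []))  # []
```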

1.12 ignore_malformed

ignore_malformed makes ES ignore malformed data in a field instead of rejecting the document. The default value is false.

PUT users
{
  "mappings": {
    "properties": {
      "birthday":{
        "type": "date",
        "format": "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss"
      },
      "age":{
        "type": "integer",
        "ignore_malformed": true
      }
    }
  }
}

PUT users/_doc/1
{
  "birthday":"2020-11-11",
  "age":99
}

PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11",
  "age":"abc"
}


PUT users/_doc/2
{
  "birthday":"2020-11-11 11:11:11aaa",
  "age":"abc"
}
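
In the requests above, the first two succeed: the malformed age "abc" is simply skipped because age sets ignore_malformed to true. The third fails, because birthday does not set ignore_malformed and its value matches neither date format. The difference for a single integer field can be modelled like this (index_age is a hypothetical helper, not an ES API):

```python
def index_age(value, ignore_malformed=False):
    """Return (accepted, indexed_value) for an integer field."""
    try:
        return True, int(value)
    except ValueError:
        if ignore_malformed:
            # The document is accepted, but this field is left unindexed.
            return True, None
        raise  # without ignore_malformed the whole document is rejected


print(index_age(99, ignore_malformed=True))     # (True, 99)
print(index_age("abc", ignore_malformed=True))  # (True, None)
```

Calling index_age("abc") with the default ignore_malformed=False raises ValueError, which corresponds to the document being rejected.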

1.13 include_in_all

This applies to the _all field, which was deprecated in ES 6 and removed in ES 7.

1.14 index

The index attribute specifies whether a field is indexed. True indicates that the field is indexed, false indicates that the field is not indexed.

PUT users
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer",
        "index": false
      }
    }
  }
}

PUT users/_doc/1
{
  "age":99
}

GET users/_search
{
  "query": {
    "term": {
      "age": 99
    }
  }
}

  • If index is false, the field cannot be searched.

1.15 index_options

index_options controls what information is stored in the inverted index (used for text fields). There are four values:

| index_options | Note |
| --- | --- |
| docs | Only the document number is stored; this is the default for fields other than text |
| freqs | In addition to docs, stores term frequencies |
| positions | In addition to freqs, stores term positions; this is the default for text fields |
| offsets | In addition to positions, stores the start and end character offsets of each term |

1.16 norms

Norms are useful for scoring. text fields enable norms by default; if a field does not need scoring, norms should be disabled.

1.17 null_value

null_value replaces an explicit null with the given value, so that null values can be indexed and searched:

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword",
        "null_value": "javaboy_null"
      }
    }
  }
}

PUT users/_doc/1
{
  "name":null,
  "age":99
}

GET users/_search
{
  "query": {
    "term": {
      "name": "javaboy_null"
    }
  }
}
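
The substitution can be pictured with a one-function Python model. index_name is a hypothetical helper; it only shows which term ends up in the index, and the document that was submitted is not modified.

```python
def index_name(value, null_value=None):
    """Return the term indexed for a keyword field, or None if nothing is indexed."""
    if value is None:
        # An explicit null is replaced by the configured placeholder, if any.
        return null_value
    return value


print(index_name(None, null_value="javaboy_null"))  # javaboy_null
print(index_name("zhangsan", null_value="javaboy_null"))  # zhangsan
print(index_name(None))  # None
```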

1.18 position_increment_gap

Analyzed text fields record term positions in order to support proximity and phrase queries. When a text field with multiple values is indexed, an artificial gap is added between the values to keep them apart, which avoids meaningless phrase matches across values. The size of the gap is controlled by position_increment_gap, which defaults to 100.

PUT users

PUT users/_doc/1
{
  "name":["zhang san","li si"]
}

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li"
      }
    }
  }
}

  • "san li" cannot be found, because there is an artificial gap of 100 between the two values.

The same query with slop added:

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li",
        "slop": 101
      }
    }
  }
}

A slop larger than the gap lets the phrase match across the two values.

You can also specify a gap when defining an index:

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

PUT users/_doc/1
{
  "name":["zhang san","li si"]
}

GET users/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "san li"
      }
    }
  }
}
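
The interaction between the gap and slop can be illustrated with a simplified position model in Python. Both helpers are hypothetical, and two things are assumptions of this sketch: the position arithmetic (the first token of the next value lands at the previous position plus the gap plus one) and the slop check, which only handles a two-term phrase in its original order.

```python
def positions(values, gap=100):
    """Assign a position to every token of a multi-valued text field."""
    result = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the gap is inserted between consecutive values
        for token in value.lower().split():
            pos += 1
            result.append((token, pos))
    return result


def phrase_matches(tokens, first, second, slop=0):
    """Two-term phrase check: count the positions lying between first and second."""
    firsts = [p for t, p in tokens if t == first]
    seconds = [p for t, p in tokens if t == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


names = ["zhang san", "li si"]
default_gap = positions(names)       # zhang=0, san=1, li=102, si=103
no_gap = positions(names, gap=0)     # zhang=0, san=1, li=2, si=3

print(phrase_matches(default_gap, "san", "li"))            # False
print(phrase_matches(default_gap, "san", "li", slop=101))  # True
print(phrase_matches(no_gap, "san", "li"))                 # True
```

With a gap of 0, the last token of one value and the first token of the next become adjacent, which is why the phrase query in the last example matches without any slop.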

1.19 properties

1.20 similarity

similarity specifies the scoring model for a field. Three are available by default:

| similarity | Note |
| --- | --- |
| BM25 | The default scoring model in ES and Lucene |
| classic | TF/IDF scoring |
| boolean | Boolean model scoring |

1.21 store

By default, field values are indexed and searchable, but not stored. Although a field is not stored separately, its value is still kept as part of _source. To store a field on its own, set store to true.

1.22 term_vector

Term vectors are the information produced by the analyzer, including:

  • A set of terms
  • The position of each term
  • The offsets of the first and last characters of each term from the start of the original string

term_vector values:

| Value | Note |
| --- | --- |
| no | No term vectors are stored; this is the default |
| yes | Terms are stored |
| with_positions | In addition to yes, positions are stored |
| with_offsets | In addition to yes, offsets are stored |
| with_positions_offsets | Terms, positions, and offsets are stored |

1.23 fields

The fields parameter lets the same field be indexed in several different ways. For example:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "fields": {
          "raw":{
            "type":"keyword"
          }
        }
      }
    }
  }
}

PUT blog/_doc/1
{
  "title":"javaboy"
}

GET blog/_search
{
  "query": {
    "term": {
      "title.raw": "javaboy"
    }
  }
}
  • www.elastic.co/guide/en/el…

Finally, Songo has also collected more than 50 project requirement documents; if you want a project to practice on, take a look.

Requirements documentation address: github.com/lenve/javad…