Elasticsearch Mapping parameters

Elasticsearch supports the following parameters when defining the mapping of a type in an index:

analyzer Specifies the analyzer. Elasticsearch is a distributed storage system that supports full-text search. For fields of type text, it uses an analyzer to split the text into terms and stores the terms one by one in an inverted index. Subsequent searches operate mainly on these terms.

The analyzer can be specified per query, per field, and per index. At index time, the analyzer is looked up in the following order of priority: 1. The analyzer defined in the field mapping. 2. The analyzer named default in the index settings. 3. The standard analyzer (standard).

At query time, the analyzer is looked up in the following order of priority: 1. The analyzer specified in the full-text query. 2. The search_analyzer defined in the field mapping. 3. The analyzer defined in the field mapping. 4. The analyzer named default_search in the index settings. 5. The analyzer named default in the index settings. 6. The standard analyzer (standard).
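The query-time lookup order can be sketched as a small resolver. This is only an illustration of the priority list: the function `resolve_search_analyzer` and its arguments are hypothetical names, not part of any Elasticsearch client.

```python
def resolve_search_analyzer(query_analyzer=None, field_mapping=None, index_analyzers=None):
    """Return the analyzer name used at query time, following the priority order."""
    field_mapping = field_mapping or {}
    index_analyzers = index_analyzers or set()
    candidates = [
        query_analyzer,                                                     # 1. set in the query
        field_mapping.get("search_analyzer"),                               # 2. search_analyzer on the field
        field_mapping.get("analyzer"),                                      # 3. analyzer on the field
        "default_search" if "default_search" in index_analyzers else None,  # 4. index setting
        "default" if "default" in index_analyzers else None,                # 5. index setting
    ]
    for name in candidates:
        if name:
            return name
    return "standard"                                                       # 6. built-in fallback


print(resolve_search_analyzer(field_mapping={"analyzer": "whitespace"}))
```

The first non-empty candidate wins, so a field-level search_analyzer always overrides the index-level defaults.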

normalizer The normalizer parameter applies to fields of type keyword. Before the field is indexed or queried, the original value can be put through some simple processing, and the processed result is stored as a single term in the inverted index. For example:

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {                                    // @1
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]             // @2
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"                      // @3
        }
      }
    }
  }
}

Code @1: First define the normalizer under the analysis property in settings. Code @2: Sets the filters of the normalizer; in this example they are lowercase and asciifolding. Code @3: When defining the mapping, a field of type keyword can reference the defined normalizer through the normalizer parameter.
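To get a feel for what such a normalizer does to a value, here is a rough Python approximation. It is a sketch under assumptions: the real asciifolding filter covers many more characters than the accent stripping used here, and the function name `my_normalizer` simply mirrors the name in the mapping.

```python
import unicodedata


def my_normalizer(value: str) -> str:
    """Approximate the lowercase and asciifolding filters on a keyword value."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop combining marks (accents); the real asciifolding filter handles more cases.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(my_normalizer("BÀR"))
```

Because the same processing is applied at index time and at query time, values that differ only in case or accents end up as the same term.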

boost The weight of the field, which is applied at query time and directly affects relevance scoring. The default is 1.0. It applies to term queries and does not apply to prefix, range, or fuzzy queries.

Note: It is not recommended to set boost in the mapping when creating an index. Specify it through the boost parameter at query time instead. The main reasons are as follows: 1. A boost value defined on a field cannot be changed dynamically, unless the index is rebuilt with the reindex command. 2. Conversely, if boost is specified at query time, each query can use a different boost value, which is more flexible. 3. A boost value specified in the mapping is stored in the index together with the record, which reduces the quality of the score calculation.
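A query-time boost is just one more key in the query body. The sketch below builds such a body as a Python dict; the field name `title`, the query text, and the helper `match_with_boost` are assumptions made for illustration.

```python
import json


def match_with_boost(field: str, text: str, boost: float) -> dict:
    """Build a match query body that carries its own boost value."""
    return {"query": {"match": {field: {"query": text, "boost": boost}}}}


body = match_with_boost("title", "quick brown fox", 2)
print(json.dumps(body, indent=2))
```

Since the boost lives in the request rather than in the mapping, two queries against the same index can weight the same field differently.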

coerce Controls whether implicit type conversion is performed. ES ultimately stores the document as a string. For example, suppose the following field is defined:

"number_one": {
   "type": "integer"
}

The number_one field is declared as an integer. Is it allowed to receive the string "6"? By default coerce is true, and ES converts the string "6" to the integer 6. When coerce is set to false, ES accepts only unquoted values, and assigning "6" to number_one throws a type mismatch exception. A default coerce value can be specified at the index level when the index is created, as shown in the following example:

PUT my_index
{
  "settings": {
    "index.mapping.coerce": false
  },
  "mappings": {
    "_doc": {
      "properties": {
        "number_one": {
          "type": "integer"
        }
      }
    }
  }
}
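The effect of coerce on an integer field can be mimicked in a few lines. This is a simplified model that only covers integers and numeric strings; the function `index_integer` is a hypothetical name and the error type is not what Elasticsearch itself returns.

```python
def index_integer(value, coerce=True):
    """Mimic how an integer field treats an incoming JSON value (simplified)."""
    if isinstance(value, bool):
        raise ValueError("booleans are not integers")
    if isinstance(value, int):
        return value                      # unquoted integer: always accepted
    if isinstance(value, str) and coerce:
        return int(value)                 # quoted number: converted, e.g. "6" -> 6
    raise ValueError(f"type mismatch for {value!r}")


print(index_integer("6"))                 # accepted because coerce defaults to true
print(index_integer(6, coerce=False))     # unquoted values pass either way
```

With coerce set to false, the call `index_integer("6", coerce=False)` raises, which corresponds to the rejected document described in the text.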

copy_to The copy_to parameter allows you to create custom _all fields. In other words, the values of multiple fields can be copied into a single field. For example, the first_name and last_name fields can be copied to the full_name field as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

This means that the value of the full_name field comes from first_name and last_name. Key points about copy_to: 1. What is copied is the original field value, not the analyzed terms. 2. The copied values are not included in the _source field, but they can be queried. 3. The same field can be copied to multiple fields: "copy_to": ["field_1", "field_2"]
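The following sketch models what copy_to does at index time: the target field receives the raw values of the source fields, while the original document stays untouched. The helper `apply_copy_to` is a hypothetical name, and the model ignores analysis entirely.

```python
def apply_copy_to(doc: dict, properties: dict) -> dict:
    """Return the values that would be indexed per field; the source doc is not modified."""
    indexed = {name: [value] for name, value in doc.items()}
    for name, value in doc.items():
        targets = properties.get(name, {}).get("copy_to", [])
        if isinstance(targets, str):      # copy_to accepts a single name or a list
            targets = [targets]
        for target in targets:
            indexed.setdefault(target, []).append(value)
    return indexed


properties = {
    "first_name": {"type": "text", "copy_to": "full_name"},
    "last_name": {"type": "text", "copy_to": "full_name"},
    "full_name": {"type": "text"},
}
doc = {"first_name": "John", "last_name": "Smith"}
indexed = apply_copy_to(doc, properties)
print(indexed["full_name"])
print(doc)
```

The document that would be returned as _source still has only first_name and last_name, which matches point 2 in the text.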

doc_values When a field needs to be sorted on, ES must extract the values of the sort field for all documents in the matching result set and sort them. The inverted index is very efficient for retrieval, but not well suited to sorting.

doc_values is a column-oriented storage structure that is enabled by default for most data types in Elasticsearch. When a document is indexed, the field value is added to the inverted index and also to doc_values, so that the values of the field for all documents in the index are stored together in a column. An example of using doc_values is as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": {
          "type": "keyword"          // by default, "doc_values": true
        },
        "session_id": {
          "type": "keyword",
          "doc_values": false
        }
      }
    }
  }
}

dynamic Controls whether new fields can be added dynamically (implicitly). When a document is indexed or updated and its _source contains fields that are not yet defined in the mapping, the behavior depends on the value of dynamic:

true: The default. New fields are added to the mapping.
false: New fields are ignored. They are not indexed and cannot be queried, but they still appear in the _source field of returned hits.
strict: An exception is thrown and the document is rejected. New fields must be added to the mapping explicitly through the PUT mapping API.

Even when dynamic is set to false, fields can still be added through the PUT mapping API, and the same API can also update the value of dynamic. The following example shows the default behavior:

PUT my_index/_doc/1 
{
  "username": "johnsmith",
  "name": {
    "first": "John",
    "last": "Smith"
  }
}
PUT my_index/_doc/2              // @1
{
  "username": "marywhite",
  "email": "[email protected]",
  "name": {
    "first": "Mary",
    "middle": "Alice",
    "last": "White"
  }
}
GET my_index/_mapping  // @2

Code @1 adds two new fields, email and name.middle, to the original mapping. The mapping returned by code @2 shows that ES has automatically added mapping definitions for the fields that did not exist before. Note: Inner objects inherit the dynamic setting of their parent object unless they set their own value, for example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic": false,         // @1
      "properties": {
        "user": {                    // @2
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": {    // @3
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

Code @1: Dynamic mapping is disabled at the top level of the _doc type, so new top-level fields are not added implicitly. Code @2: The user object inherits the type-level setting, so new fields are not added to it implicitly either. Code @3: The user.social_networks object sets dynamic to true, so new fields can be added to this inner object implicitly.
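The three values of dynamic can be compared with a small model. It is a flat simplification (no nested objects, no type detection), and `handle_new_fields` is a hypothetical helper, not an Elasticsearch API.

```python
def handle_new_fields(mapped_fields: set, doc: dict, dynamic: str = "true"):
    """Return (mapped fields after indexing, fields of the doc that are searchable)."""
    unknown = set(doc) - mapped_fields
    if dynamic == "strict" and unknown:
        raise ValueError(f"strict mapping rejects unknown fields: {sorted(unknown)}")
    if dynamic == "true":
        mapped_fields = mapped_fields | unknown    # the mapping grows implicitly
    searchable = set(doc) & mapped_fields          # ignored fields remain in _source only
    return mapped_fields, searchable


doc = {"username": "marywhite", "email": "mary"}
print(handle_new_fields({"username"}, doc, "true"))
print(handle_new_fields({"username"}, doc, "false"))
```

With "false" the email value is still part of the stored document; it simply never reaches the mapping or the index.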

enabled By default, ES tries to index all fields for you. However, some fields do not need to be queried and are only used to store data. The enabled setting can be applied only to the mapping type and to fields of type object. The following is an example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/_doc/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

In the example above, ES stores the data of the session_data object in _source, but the properties inside session_data cannot be queried through the query API. Similarly, the enabled setting can be updated through the PUT mapping API.

eager_global_ordinals Global ordinals assign an incrementing number to each unique term in lexicographic order. Global ordinals are supported only for string fields (keyword and text). For keyword fields they are available by default, but for text fields they are available only when fielddata is set to true. doc_values (and fielddata) also use ordinals, which are unique numbers for all terms in a particular segment and field. Global ordinals are built on top of these segment ordinals and provide a mapping between segment ordinals and global ordinals, which are unique across the entire shard. Because the global ordinals of a field span all segments of a shard, they must be rebuilt completely whenever a new segment becomes visible.

The terms aggregation relies on global ordinals: it first aggregates at the shard level, then reduces the results of all shards, converts the global ordinals into the real terms (strings), and returns the merged result. By default, global ordinals are loaded at search time, which favors indexing speed. If you care more about search performance, setting eager_global_ordinals on the fields you plan to aggregate on helps improve query efficiency, because eager_global_ordinals means that global ordinals are loaded in advance rather than at search time. An example is as follows:

PUT my_index/_mapping/_doc
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
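The relationship between segment ordinals and global ordinals can be illustrated with plain lists. This is a conceptual sketch only: Lucene stores these structures in a compact binary form, and `build_global_ordinals` is a hypothetical function.

```python
def build_global_ordinals(segments):
    """segments: one list of terms per segment. Returns (global terms, per-segment maps)."""
    # Segment ordinals: position of each unique term in the segment's sorted term list.
    segment_terms = [sorted(set(terms)) for terms in segments]
    # Global ordinals: position in the sorted term list of the whole shard.
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    # For each segment: list index = segment ordinal, list value = global ordinal.
    mappings = [[global_ord[term] for term in terms] for terms in segment_terms]
    return global_terms, mappings


global_terms, mappings = build_global_ordinals([["red", "blue"], ["green", "blue"]])
print(global_terms)
print(mappings)
```

Adding a third segment changes `global_terms` and therefore every mapping, which is why global ordinals have to be rebuilt when a new segment becomes visible.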

fielddata To support sorting and aggregation, Elasticsearch provides doc_values, a column-oriented store, but doc_values does not support fields of type text. Because text fields must be analyzed (tokenized) first, the performance of doc_values column storage would suffer. To support efficient sorting and aggregation on text fields, ES introduced a separate data structure, fielddata, which is kept in memory. By default it is built when the first aggregation or sort is performed on the field. It holds the mapping between the terms in the inverted index and the documents, and aggregation and sorting are then carried out in memory. fielddata therefore consumes a large amount of JVM heap memory. Once fielddata has been loaded into memory, it remains there for the lifetime of the segment. Loading fielddata is usually an expensive operation, so it is disabled on text fields by default. Think carefully about why you need fielddata before enabling it. Generally, text fields are used for full-text search, and doc_values is recommended for fields that are aggregated or sorted on.

To save memory, ES provides another mechanism, fielddata_frequency_filter, which loads into memory only the terms whose frequency falls within a specified range (minimum, maximum). The minimum and maximum can be given as absolute numbers or as percentages. The percentage is computed per segment, and the denominator is not the number of all documents in the segment but the number of documents in the segment that have a value for the field. Small segments can be excluded by specifying, with the min_segment_size parameter, the minimum number of documents a segment must contain. In other words, fielddata_frequency_filter applies only to segments that contain more documents than min_segment_size.
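As a sketch of how this fits together, the dict below shows a text field mapping with fielddata_frequency_filter, followed by a simplified model of the percentage-based filter. The field name `tag`, the threshold values, and the function `terms_loaded_into_fielddata` are assumptions for illustration; the model also assumes inclusive bounds and does not cover absolute thresholds.

```python
# Mapping fragment (as a Python dict) for a text field with a frequency filter.
tag_mapping = {
    "tag": {
        "type": "text",
        "fielddata": True,
        "fielddata_frequency_filter": {"min": 0.01, "max": 0.1, "min_segment_size": 500},
    }
}


def terms_loaded_into_fielddata(term_doc_counts, docs_with_field, min_freq, max_freq,
                                min_segment_size=0):
    """term_doc_counts: {term: number of docs in the segment that contain the term}."""
    if docs_with_field < min_segment_size:
        return set(term_doc_counts)               # small segment: filter not applied
    loaded = set()
    for term, count in term_doc_counts.items():
        frequency = count / docs_with_field       # relative to docs that have the field
        if min_freq <= frequency <= max_freq:
            loaded.add(term)
    return loaded


counts = {"rare": 1, "mid": 50, "common": 900}
flt = tag_mapping["tag"]["fielddata_frequency_filter"]
print(terms_loaded_into_fielddata(counts, 1000, flt["min"], flt["max"], flt["min_segment_size"]))
```

Very rare and very common terms are dropped, so only the terms most useful for aggregation occupy heap memory.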

format In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize these strings and parse them into values of type long (milliseconds since the epoch). There are three kinds of date format: 1) custom formats, 2) date math (described in the DSL query API article), and 3) built-in formats. 1. Custom format. A custom format is defined with Java date pattern syntax, for example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}
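To make the "string to long milliseconds" step concrete, the sketch below parses a date in the same layout with Python's standard library. The strptime pattern is the Python counterpart of "yyyy-MM-dd HH:mm:ss"; the function `to_epoch_millis` is a hypothetical name, and UTC is assumed because the string carries no time zone.

```python
from datetime import datetime, timezone

# Python equivalent of the "yyyy-MM-dd HH:mm:ss" pattern used in the mapping.
PY_PATTERN = "%Y-%m-%d %H:%M:%S"


def to_epoch_millis(text: str) -> int:
    """Parse a date string and return milliseconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(text, PY_PATTERN).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


print(to_epoch_millis("2015-01-01 12:10:30"))  # 1420114230000
```

A string that does not match the pattern raises an error here, just as a document with a non-matching date value is rejected by the mapping.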

2. Date math. Date math has been described in detail in the DSL query API article and is not repeated here.

3. Built-in formats. Elasticsearch provides a number of built-in formats, as follows:

epoch_millis A timestamp in milliseconds since the epoch.
epoch_second A timestamp in seconds since the epoch.
date_optional_time A generic ISO date-time format in which the date is mandatory and the time is optional.
basic_date The format is yyyyMMdd
basic_date_time The format is yyyyMMdd'T'HHmmss.SSSZ
basic_date_time_no_millis The format is yyyyMMdd'T'HHmmssZ
basic_ordinal_date A 4-digit year followed by a 3-digit day of year. The format is yyyyDDD
basic_ordinal_date_time The format is yyyyDDD'T'HHmmss.SSSZ
basic_ordinal_date_time_no_millis The format is yyyyDDD'T'HHmmssZ
basic_time The format is HHmmss.SSSZ
basic_time_no_millis The format is HHmmssZ
basic_t_time The format is 'T'HHmmss.SSSZ
basic_t_time_no_millis The format is 'T'HHmmssZ
basic_week_date The format is xxxx'W'wwe, where xxxx is the 4-digit weekyear, followed by 'W', a 2-digit week of year, and a 1-digit day of week.
basic_week_date_time The format is xxxx'W'wwe'T'HHmmss.SSSZ
basic_week_date_time_no_millis The format is xxxx'W'wwe'T'HHmmssZ
date The format is yyyy-MM-dd
date_hour The format is yyyy-MM-dd'T'HH
date_hour_minute The format is yyyy-MM-dd'T'HH:mm
date_hour_minute_second The format is yyyy-MM-dd'T'HH:mm:ss
date_hour_minute_second_fraction The format is yyyy-MM-dd'T'HH:mm:ss.SSS
date_hour_minute_second_millis The format is yyyy-MM-dd'T'HH:mm:ss.SSS
date_time The format is yyyy-MM-dd'T'HH:mm:ss.SSSZZ
date_time_no_millis The format is yyyy-MM-dd'T'HH:mm:ssZZ
hour The format is HH
hour_minute The format is HH:mm
hour_minute_second The format is HH:mm:ss
hour_minute_second_fraction The format is HH:mm:ss.SSS
hour_minute_second_millis The format is HH:mm:ss.SSS
ordinal_date The format is yyyy-DDD, where DDD is the day of year.
ordinal_date_time The format is yyyy-DDD'T'HH:mm:ss.SSSZZ, where DDD is the day of year.
ordinal_date_time_no_millis The format is yyyy-DDD'T'HH:mm:ssZZ
time The format is HH:mm:ss.SSSZZ
time_no_millis The format is HH:mm:ssZZ
t_time The format is 'T'HH:mm:ss.SSSZZ
t_time_no_millis The format is 'T'HH:mm:ssZZ
week_date The format is xxxx-'W'ww-e, where xxxx is the 4-digit weekyear, ww is the week of year, and e is the day of week.
week_date_time The format is xxxx-'W'ww-e'T'HH:mm:ss.SSSZZ
week_date_time_no_millis The format is xxxx-'W'ww-e'T'HH:mm:ssZZ
weekyear The format is xxxx
weekyear_week The format is xxxx-'W'ww, where ww is the week of year.
weekyear_week_day The format is xxxx-'W'ww-e, where ww is the week of year and e is the day of week.
year The format is yyyy
year_month The format is yyyy-MM
year_month_day The format is yyyy-MM-dd

Tips: When specifying a date format, ES recommends using the strict variant, that is, adding the strict_ prefix to the format names above.

ignore_above Strings longer than the ignore_above setting are not indexed or stored. For arrays of strings, ignore_above is applied to each element separately, and elements longer than ignore_above are not indexed or stored. In the author's test, however, strings longer than ignore_above were still stored but not indexed, that is, they could not be queried by value. The test code is as follows:

public static void create_mapping_ignore_above() {
    // Create the mapping
    RestHighLevelClient client = EsClient.getClient();
    try {
        CreateIndexRequest request = new CreateIndexRequest("mapping_test_ignore_above2");
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .startObject("properties")
                    .startObject("lies")
                        .field("type", "keyword")      // assumed: the type line is missing from the extracted code
                        .field("ignore_above", 10)     // the length cannot exceed 10
                    .endObject()
                .endObject()
            .endObject();
        // request.mapping("user", mapping_user);
        request.mapping("_doc", mapping);
        System.out.println(client.indices().create(request, RequestOptions.DEFAULT));
    } catch (Throwable e) {
        e.printStackTrace();
    } finally {
        EsClient.close(client);
    }
}

public static void index_mapping_ignore_above() {
    // Index a document
    RestHighLevelClient client = EsClient.getClient();
    try {
        IndexRequest request = new IndexRequest("mapping_test_ignore_above2", "_doc");
        Map<String, Object> data = new HashMap<>();
        data.put("lies", new String[] {"dingabcdwei", "huangsw", "wuyanfengamdule"});
        request.source(data);
        System.out.println(client.index(request, RequestOptions.DEFAULT));
    } catch (Throwable e) {
        e.printStackTrace();
    } finally {
        EsClient.close(client);
    }
}

public static void search_ignore_above() {
    // Query the data
    RestHighLevelClient client = EsClient.getClient();
    try {
        SearchRequest searchRequest = new SearchRequest();
        searchRequest.indices("mapping_test_ignore_above2");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        // Enable exactly one of the three queries per run.
        sourceBuilder.query(
            QueryBuilders.matchAllQuery()                         // @1
            // QueryBuilders.termQuery("lies", "dingabcdwei")     // @2
            // QueryBuilders.termQuery("lies", "huangsw")         // @3
        );
        searchRequest.source(sourceBuilder);
        SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
        System.out.println(result);
    } catch (Throwable e) {
        e.printStackTrace();
    } finally {
        EsClient.close(client);
    }
}

Code @1: First query all the data. The _source value is "_source":{"lies":["dingabcdwei","huangsw","wuyanfengamdule"]}, which shows that the values are stored regardless of whether they are longer than the value specified in ignore_above. Code @2: Searching with a value longer than ignore_above finds no match, which shows that the value was not added to the inverted index. Code @3: Searching with a value no longer than ignore_above finds a match. Note: In ES, ignore_above counts characters, whereas the underlying Lucene implementation counts bytes, so keep this difference in mind when relating the setting to Lucene's limits.
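The observed behavior can be restated as a small model: every value stays in the stored document, but only values within the limit reach the inverted index. The helper `split_by_ignore_above` is hypothetical; the last line illustrates the character versus byte difference mentioned in the note.

```python
def split_by_ignore_above(values, ignore_above):
    """Return (values kept in _source, values added to the inverted index)."""
    indexed = [v for v in values if len(v) <= ignore_above]   # length counted in characters
    return list(values), indexed


stored, indexed = split_by_ignore_above(["dingabcdwei", "huangsw", "wuyanfengamdule"], 10)
print(stored)
print(indexed)

# Characters versus UTF-8 bytes: two CJK characters occupy six bytes.
print(len("搜索"), len("搜索".encode("utf-8")))
```

Only "huangsw" is short enough to be indexed, which matches the results of queries @2 and @3.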

ignore_malformed Controls what happens when a value of the wrong data type is indexed into a field. By default, an exception is thrown and the whole document is rejected. If ignore_malformed is set to true, the error is ignored: the malformed field is not indexed, but the other fields in the document are processed normally. When creating an index, you can set index.mapping.ignore_malformed to define the index-level default. The field-level setting takes priority over the index-level setting.

index Specifies whether the field is indexed. true: the field is indexed. false: the field is not indexed and cannot be queried.

index_options Controls what additional information is added to the inverted index. The possible values are:

docs: Only the document number is added to the inverted index.
freqs: The document number and term frequency.
positions: The document number, term frequency, and term position (order). Proximity and phrase queries require this mode.
offsets: The document number, term frequency, term position (order), and term offsets (start and end character positions). Highlighting makes use of this mode.

By default, analyzed string fields use positions and all other fields use docs.

fields The fields parameter (multi-fields) allows the same field to be indexed in different ways, with different settings, within the same index. For example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",        // @1
          "fields": {              // @2
            "raw": { 
              "type":  "keyword"   // @3
            }
          }
        }
      }
    }
  }
}

Code @1: The mapping above defines the city field with type text, which is used for full-text search. Code @2: Defines a multi-field for city. Code @3: The multi-field city.raw has type keyword. You can use city for full-text matching, and city.raw for aggregation, sorting, and similar operations. Another common use is to analyze the same field with a different analyzer.

norms Norms store normalization factors for a field so that scores can be computed efficiently at query time. Although norms are useful for scoring, they also require a lot of disk space (usually about one byte per document for each field in the index, even for documents that do not have this particular field). It follows that norms are not needed on fields that are used only for filtering or aggregation. Note that norms can be changed from true to false through the PUT mapping API, but not from false to true.

null_value Replaces explicit null values with the specified value so that they can be indexed and searched. The following example illustrates this:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL"             // @1
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "status_code": null                     // @2
}

PUT my_index/_doc/2
{
  "status_code": []                       // @3
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL"               // @4
    }
  }
}

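Code @1: Explicit null values of the status_code field are replaced with the string "NULL". Code @2: Document 1 contains an explicit null, so it is indexed as "NULL". Code @3: An empty array does not contain an explicit null and is therefore not replaced by null_value. Code @4: This query matches document 1 but not document 2. The query result is as follows: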

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "mapping_test_null_value",
        "_type": "_doc",
        "_id": "RyjGEmcB-TTORxhqI2Zn",
        "_score": 0.2876821,
        "_source": {
          "status_code": null
        }
      }
    ]
  }
}

null_value has the following characteristics: null_value must have the same data type as the field. For example, a field of type long cannot have a string null_value. null_value only affects the value that is indexed; it does not change the value in the _source field.
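These rules can be captured in a short model of what reaches the index for a given source value. The function `index_value` is a hypothetical name, and the model assumes that explicit nulls inside arrays are replaced in the same way as a top-level null.

```python
def index_value(source_value, null_value=None):
    """Return the terms indexed for a field value; the source value itself is not changed."""
    values = source_value if isinstance(source_value, list) else [source_value]
    terms = []
    for value in values:
        if value is None:
            if null_value is not None:
                terms.append(null_value)   # explicit null is replaced
        else:
            terms.append(value)
    return terms


print(index_value(None, "NULL"))   # document 1: explicit null
print(index_value([], "NULL"))     # document 2: empty array, nothing to replace
```

Document 1 gets the term "NULL" and is found by the term query, while document 2 contributes no terms at all.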

position_increment_gap For multi-valued text fields, the gap in term positions that is inserted between the individual values. For example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0      // @1
		  // "position_increment_gap": 10  // @2
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

The names field is an array, and ES analyzes it with the standard analyzer by default. When position_increment_gap is 0, the terms are: position 0: John; position 1: Abraham; position 2: Lincoln; position 3: Smith. When position_increment_gap is 10, the terms are: position 0: John; position 1: Abraham; position 12: Lincoln (the first term of the second value, shifted by position_increment_gap on top of the normal increment of 1 from the previous term); position 13: Smith.

For the following query:

GET my_index/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln" 
        }
    }
}

With position_increment_gap=0, the document is matched. With position_increment_gap=10, the document is not matched, because the gap places Lincoln 10 positions further away from Abraham. To match the document, slop must be set to at least 10 at query time, as described in detail in the earlier DSL query article.
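The position arithmetic can be checked with a short simulation. It assumes Lucene's behavior of adding the gap on top of the normal increment of 1, splits on whitespace instead of running the standard analyzer, and uses hypothetical helper names.

```python
def token_positions(values, position_increment_gap):
    """Assign positions to the whitespace-separated tokens of a multi-valued text field."""
    positions, position = [], -1
    for i, value in enumerate(values):
        if i > 0:
            position += position_increment_gap    # gap inserted between values
        for token in value.split():
            position += 1                          # normal increment per token
            positions.append((token, position))
    return positions


def min_slop(positions, first, second):
    """Smallest slop for the phrase 'first second', assuming second follows first."""
    lookup = dict(positions)
    return lookup[second] - lookup[first] - 1


names = ["John Abraham", "Lincoln Smith"]
print(token_positions(names, 0))
print(token_positions(names, 10))
print(min_slop(token_positions(names, 10), "Abraham", "Lincoln"))
```

With a gap of 0 the required slop is 0, so the phrase matches across the two values; with a gap of 10 the required slop is 10.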

properties Defines the fields of a mapping type.

search_analyzer In general, the same analyzer is applied at index time and at search time, to ensure that the terms in the query have the same form as the terms in the inverted index. If you want to use a different analyzer at search time than at index time, specify it with the search_analyzer parameter. This is commonly used in ES to implement search-as-you-type with edge_ngram.

similarity The similarity (scoring) algorithm. The possible values are:

BM25 The default in the current version. Uses the BM25 algorithm.
classic Uses the TF/IDF algorithm, which used to be the default similarity in ES and Lucene.
boolean A simple boolean similarity, for use when full-text ranking is not required and the score should depend only on whether the query terms match. Boolean similarity gives terms a score equal to their query boost.

store By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved on its own. Usually this does not matter, because the field value is already part of the _source field, which is stored by default. If you only want to retrieve the values of one or a few fields rather than the whole _source, you can use source filtering. In some cases, though, it makes sense to store a field. For example, if a document contains a title, a date, and a very large content field, you may want to retrieve just the title and the date without extracting them from the large _source field. ES also provides another way to fetch selected fields, stored_fields, which only supports fields whose store is true. This was covered in the _source filtering section of the Elasticsearch Doc API article.

term_vector Term vectors contain information about the terms produced by the analysis process, including:

A list of terms.
The position (or order) of each term.
The start and end character offsets of each term.

term_vector accepts the following values:

no No term vector information is stored. This is the default.
yes Only the terms in the field are stored.
with_positions The terms and their positions are stored.
with_offsets The terms and their character offsets are stored.
with_positions_offsets The terms, their positions, and their character offsets are stored.

This article has described in detail the parameters that Elasticsearch supports when creating type mappings.

About the author: Weige focuses on systematic analysis of mainstream Java middleware and runs the public account "middleware interest circle".
