Concept

A mapping is the equivalent of a table-creation statement in a relational database: it defines a document's fields, their types, how they are indexed, and how they are stored. It usually covers the following aspects:

  • Which fields in a document are full-text fields.
  • Which fields hold exact values, such as dates, numbers, and geographic locations.
  • Which fields need to be indexed (so that documents can be queried by the value of that field).
  • The format of date values.
  • Rules for dynamically added fields, and so on.

Viewing the mapping

GET /phone/_mapping

Dynamic mapping

Dynamic mapping is an important Elasticsearch feature: you do not need to create the index, define its mapping, or declare a type in advance. When you insert a document directly into ES, ES automatically configures mapping information for each new field according to the data type it detects. This process is called dynamic mapping.

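JSON data type   | Elasticsearch data type
-----------------------------------------------------------------
null             | No field is added
true or false    | boolean
1234             | long
123.4            | float
"2018-10-10"     | date
"hello world"    | text
-----------------------------------------------------------------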

Why is 1234 mapped to long and not integer? Dynamic mapping infers the field type from what the JSON parser reports, and JSON does not distinguish integer from long (or float from double). ES cannot know how large later values will be, so for whole numbers dynamic mapping picks the wide type, long.
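The inference rules in the table above can be sketched in a few lines of Python. This is only a simplified model for illustration, not how Elasticsearch is implemented: the function names `guess_field_type` and `guess_mapping` are hypothetical, and date detection is reduced to a single yyyy-MM-dd pattern as an assumption.

```python
import json
import re

# Assumption: date detection is reduced to the plain yyyy-MM-dd shape.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def guess_field_type(value):
    """Return the type the table above suggests for a JSON value, or None."""
    if value is None:
        return None                  # null: no field is added
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"                # JSON cannot say "integer", so pick the wide type
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    if isinstance(value, list):      # arrays take the type of their first element
        return guess_field_type(value[0]) if value else None
    if isinstance(value, dict):
        return "object"
    return None


def guess_mapping(document_json):
    """Build a field -> type dictionary for one JSON document."""
    mapping = {}
    for field, value in json.loads(document_json).items():
        field_type = guess_field_type(value)
        if field_type is not None:
            mapping[field] = field_type
    return mapping


print(guess_mapping('{"blog_id": 10001, "post_date": "2018-01-01", "title": "my first blog"}'))
```

Running it on the first blog document used below gives long for blog_id, date for post_date, and text for title, which you can compare with the output of GET blog/_mapping.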

PUT blog/_doc/1
{
  "blog_id": 10001,
  "author_id": 5520,
  "post_date": "2018-01-01",
  "title": "my first blog",
  "content": "my first blog in the website"
}
PUT blog/_doc/2
{
  "blog_id": 10002,
  "author_id": 5520,
  "post_date": "2018-01-02",
  "title": "my second blog",
  "content": "my second blog in the website"
}
GET blog/_mapping
GET blog/_search?q=2018
GET blog/_search?q=2018-01-01
GET blog/_search?q=post_date:2018
GET blog/_search?q=post_date:2018-01-01

_all

Since version 6.0, the _all field has been deprecated. Use copy_to for similar functionality.

# The values of the first_name and last_name fields are copied to the full_name field.
PUT /my_index
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "full_name": {
        "type": "text"
      }
    }
  }
}

Exact values vs. Full text

Exact value: when the inverted index is built, the field value is not split into words; it is indexed as a whole, as a single term.

Exact values are exactly what the name suggests. Examples are a date or a user ID, but they can also be exact strings such as a username or an email address. The exact value "Foo" is not the same as the exact value "foo". The exact value 2014 is not the same as the exact value 2014-09-15.

Full text: matching involves tokenization, synonyms, easily confused words, case, part of speech, filtering, tense conversion, and so on.

For full text we do not ask "Does this document match the query?" but "How well does this document match the query?" In other words, how relevant is this document to the query? We rarely match a whole full-text field exactly; we want to find text that contains the query terms. Beyond that, we expect the search engine to understand our intent: a query for "UK" should also return documents that mention "United Kingdom". A query for "jump" should also match "jumped", "jumps", or even "leap". "johnny walker" should match "Johnnie Walker", and "johnnie depp" should match "Johnny Depp". "fox news hunting" should return stories about hunting on Fox News, while "fox hunting news" should return news stories about fox hunting.
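The difference between the two kinds of matching can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: `analyze` is a deliberately naive analyzer (lowercase, split on non-alphanumeric characters), and `full_text_score` is a made-up score (the fraction of query tokens found), far simpler than the relevance scoring a real search engine uses.

```python
import re


def exact_match(field_value, query):
    """Exact values either match completely or not at all."""
    return field_value == query


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def full_text_score(field_value, query):
    """Toy relevance: the fraction of query tokens that occur in the field."""
    field_tokens = set(analyze(field_value))
    query_tokens = analyze(query)
    if not query_tokens:
        return 0.0
    hits = sum(1 for token in query_tokens if token in field_tokens)
    return hits / len(query_tokens)


print(exact_match("Foo", "foo"))                            # exact values are case-sensitive
print(full_text_score("my first blog", "second blog"))      # partial match, partial score
```

An exact-value comparison answers yes or no, while the full-text function returns a degree of match that can be used to rank documents.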

ES data types

Core types

Numeric types:

  • long, integer, short, byte, double, float, half_float, scaled_float
  • Choose the narrowest type that covers the range of values you need.

String types:

  • keyword: for structured content that is indexed as-is and can be used for filtering, sorting, and aggregation. Keyword fields can be searched only by their exact value. IDs should be keyword fields.
  • text: for fields that need full-text search, such as an email body or a product description. The content of a text field is analyzed: the string is split into terms before the inverted index is built. Text fields are not used for sorting and are rarely used for aggregation.
  • It can be useful to index the same field as both text and keyword: one version for full-text search, the other for aggregation and sorting.

Date: date

Boolean: boolean

Binary: binary

Range: integer_range, float_range, long_range, double_range, and date_range

Complex types:

  • object: used for a single JSON object
  • nested: used for arrays of JSON objects

Geo types:

  • geo_point: latitude/longitude points
  • geo_shape: for complex shapes such as polygons

Specialized types:

  • ip: for IPv4 and IPv6 addresses
  • completion: provides auto-complete suggestions
  • token_count: counts the number of tokens in a string
  • .
  • www.elastic.co/guide/en/el…

Mapping parameters

  • analyzer

Specifies the analyzer (character filters, tokenizer, token filters) used for the field.

  • boost

The relevance-score weight of the field; the default is 1.

  • coerce

Whether type coercion is allowed. true: "1" => 1. false: "1" is not converted to 1 and the document is rejected.

  • copy_to

Copies the values of multiple fields into a group field, which can then be queried as a single field.

  • doc_values

Improves sorting and aggregation efficiency; the default is true. If you are sure you do not need to sort or aggregate on a field, or access its values from scripts, you can disable doc values to save disk space.

  • dynamic

Controls whether new fields can be added dynamically.

true: newly detected fields are added to the mapping. (Default)

false: newly detected fields are ignored. They are not indexed and therefore not searchable, but they still appear in the _source of returned hits. They are not added to the mapping; new fields must be added explicitly.

strict: if a new field is detected, an exception is thrown and the document is rejected. New fields must be added to the mapping explicitly.

  • eager_global_ordinals: used on fields that are aggregated on, to optimize aggregation performance
  • enabled: whether an inverted index is created for the field. If it is disabled, the field cannot be searched but can still be retrieved from _source. Use with caution: this setting cannot be changed afterwards
  • fielddata
  • fields
  • format: the format of date values
  • ignore_above: strings longer than this length are ignored
  • ignore_malformed: ignores values of the wrong type instead of rejecting the document
  • index_options: controls what information is added to the inverted index for searching and highlighting. Only for text fields
  • index_phrases: speeds up exact phrase queries, but consumes more disk space
  • index_prefixes
  • index: whether the field is indexed. The default is true. If it is false, the field cannot be searched, but it still appears in _source
  • meta
  • normalizer
  • norms
  • null_value
  • position_increment_gap
  • properties
  • search_analyzer
  • similarity
  • store
  • term_vector
  • www.elastic.co/guide/en/el…

Demo

# copy_to
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name" 
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name" 
      },
      "full_name": {
        "type": "text"
      }
    }
  }
}
PUT my-index-000001/_doc/1
{
  "first_name": "John",
  "last_name": "Smith"
}
GET /my-index-000001/_search
{
  "query": {
    "match_all": {}
  }
}
GET my-index-000001/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith"
      }
    }
  }
}

# coerce
DELETE  my-index-000001
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer"
      },
      "number_two": {
        "type": "integer",
        "coerce": false
      }
    }
  }
}
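# number_one uses the default (coerce: true), so the string "10" is converted to the integer 10
PUT my-index-000001/_doc/1
{
  "number_one": "10"
}
# number_two has coerce disabled, so the string "10" is rejected
PUT my-index-000001/_doc/2
{
  "number_two": "10"
}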
GET /my-index-000001/_search
{
  "query": {
    "match_all": {}
  }
}

Inverted index and forward index

Elasticsearch uses a structure called an inverted index, which is designed for fast full-text search. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents that contain it. Take these two documents:

Doc_1: The quick brown fox jumped over the lazy dog
Doc_2: Quick brown foxes leap over lazy dogs in summer

# Inverted index
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------

# Find all documents that contain the word brown, then aggregate on their terms
GET /my_index/_search
{
  "query": {
    "match": {
      "body": "brown"
    }
  },
  "aggs": {
    "popular_terms": {
      "terms": {
        "field": "body"
      }
    }
  }
}

The query section is simple and efficient. For the aggregate part, we need to find all the unique terms in Doc_1 and Doc_2. Doing this with an inverted index is expensive: we iterate over each term in the index and collect tokens in the Doc_1 and Doc_2 columns. This is slow and difficult to scale: as the number of terms and documents increases, so does the execution time.
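A small Python sketch makes the asymmetry concrete. It is a toy model under stated assumptions, not Elasticsearch internals: the two example sentences are lowercased and split on whitespace, and the helper names `build_inverted_index`, `search`, and `terms_for_docs` are hypothetical.

```python
from collections import defaultdict

DOCS = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Query side: a single lookup returns the matching documents."""
    return sorted(index.get(term, set()))


def terms_for_docs(index, doc_ids):
    """Aggregation side: every term in the index has to be visited."""
    wanted = set(doc_ids)
    return sorted(term for term, postings in index.items() if postings & wanted)


index = build_inverted_index(DOCS)
print(search(index, "brown"))
print(terms_for_docs(index, ["Doc_1"]))
```

`search` touches one entry of the index, while `terms_for_docs` walks the whole term dictionary, so its cost grows with the number of distinct terms.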

Doc Values solves this problem by transposing the relationship between the two. Inverted indexes map terms to the documents that contain them, and doc values map documents to terms they contain:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
-----------------------------------------------------------------

With the data laid out this way, collecting the unique terms is straightforward: go to each document's row, take all of its terms, and compute the union of the two sets.
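The document-to-terms layout can be sketched in Python as a plain dictionary. This only illustrates the idea of the transposed structure (real doc values are an on-disk columnar format); the names `DOC_VALUES`, `unique_terms`, and `terms_aggregation` are hypothetical.

```python
from collections import Counter

# Document -> terms, copied from the table above.
DOC_VALUES = {
    "Doc_1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc_2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
}


def unique_terms(doc_values, doc_ids):
    """Union of the term sets of the given documents."""
    result = set()
    for doc_id in doc_ids:
        result.update(doc_values[doc_id])
    return result


def terms_aggregation(doc_values, doc_ids):
    """Count, for each term, how many of the given documents contain it."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(set(doc_values[doc_id]))
    return counts


matching_docs = ["Doc_1", "Doc_2"]   # for example, the hits of the query for brown
print(sorted(unique_terms(DOC_VALUES, matching_docs)))
print(terms_aggregation(DOC_VALUES, matching_docs).most_common(4))
```

Only the rows of the matching documents are read, so the work depends on the number of hits rather than on the size of the term dictionary.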

Doc values

Doc values are fast, efficient, and memory friendly. They are generated at index time, together with the inverted index. Like the inverted index, doc values are built per segment and are immutable, and they are serialized to disk, which greatly improves performance and scalability.

Because doc values are persisted to disk, ES can rely on the operating system's memory (the file-system cache) instead of the JVM heap.

Doc values are enabled by default for all fields except analyzed strings. This means that numeric, geo-point, date, IP, and not_analyzed string fields (keyword fields in current versions) all have doc values by default. Analyzed strings (text fields) cannot use doc values: analysis produces many terms per value, which doc values cannot handle efficiently. If a field will never be aggregated on or sorted on, you can disable its doc values to save disk space and speed up indexing.

Fielddata

Unlike doc values, fielddata is built at query time and lives in the JVM heap.

Dimension     | doc_values                   | fielddata
-----------------------------------------------------------------
Creation time | When the index is built      | Dynamically, when first used
Location      | Disk                         | Heap memory (JVM)
Advantages    | Does not occupy heap memory  | Does not occupy disk space
Disadvantages | Indexing is slightly slower  | With many documents, dynamic creation is expensive and takes up heap memory
-----------------------------------------------------------------

As the number of documents grows, fielddata can cause an out-of-memory (OOM) error. ES therefore has a circuit breaker for fielddata: it estimates the amount of memory a query will need, and if the estimate exceeds the limit, the breaker trips, the query is aborted, and an exception is returned. All of this happens before the data is loaded, so no OutOfMemoryException is raised.
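The "estimate first, load later" idea can be sketched as follows. This is a toy model, not the Elasticsearch implementation: the class `FielddataBreaker`, its methods, and the exception name are hypothetical, and the 60% default simply mirrors the fielddata limit quoted in this article.

```python
class CircuitBreakingError(Exception):
    """Raised instead of letting the process run out of memory."""


class FielddataBreaker:
    def __init__(self, heap_bytes, limit_fraction=0.60):
        # Assumption: the limit is a fixed fraction of the heap.
        self.limit_bytes = int(heap_bytes * limit_fraction)
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        """Check the estimate BEFORE loading; abort the query if it would not fit."""
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {self.used_bytes + estimated_bytes} bytes, "
                f"limit is {self.limit_bytes} bytes"
            )
        self.used_bytes += estimated_bytes

    def release(self, freed_bytes):
        """Give memory back when fielddata is evicted."""
        self.used_bytes = max(0, self.used_bytes - freed_bytes)


breaker = FielddataBreaker(heap_bytes=1000)
breaker.reserve(500)          # fits under the limit
try:
    breaker.reserve(200)      # estimate exceeds the limit: rejected before loading
except CircuitBreakingError as error:
    print("query aborted:", error)
```

The rejected request never allocates anything, which is why the failure surfaces as a query exception rather than as an out-of-memory crash.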

Available circuit breakers

Elasticsearch has a family of circuit breakers, all of which ensure that memory use does not exceed its limits:

  • indices.breaker.fielddata.limit: the fielddata circuit breaker limits fielddata to 60% of the heap by default.
  • indices.breaker.request.limit: the request circuit breaker estimates the size of the structures needed to complete other parts of a request, such as creating aggregation buckets, and limits them to 40% of the heap by default.
  • indices.breaker.total.limit: the total circuit breaker wraps the request and fielddata circuit breakers to ensure that the two combined do not use more than 70% of the heap.

As ES versions have advanced, doc_values have been optimized further and further. Their indexing speed is now close to that of fielddata, and disks keep getting faster (SSDs, for example). doc_values can therefore cover most scenarios and are the focus of official ES maintenance. Doc values have many advantages over fielddata, so since ES 2.x, fields that support aggregation use doc_values by default instead of fielddata.

# Example: index some documents
PUT /product2/_doc/1
{
  "name": "xiaomi phone",
  "desc": "shouji zhong de zhandouji",
  "price": 3999,
  "tags": ["xingjiabi", "fashao", "buka"]
}
PUT /product2/_doc/2
{
  "name": "xiaomi nfc phone",
  "desc": "zhichi quangongneng nfc,shouji zhong de jianjiji",
  "price": 4999,
  "tags": ["xingjiabi", "fashao", "gongjiaoka"]
}
PUT /product2/_doc/3
{
  "name": "nfc phone",
  "desc": "shouji zhong de hongzhaji",
  "price": 2999,
  "tags": ["xingjiabi", "fashao", "menjinka"]
}
PUT /product2/_doc/4
{
  "name": "xiaomi erji",
  "desc": "erji zhong de huangmenji",
  "price": 999,
  "tags": ["low", "bufangshui", "yinzhicha"]
}
PUT /product2/_doc/5
{
  "name": "hongmi erji",
  "desc": "erji zhong de kendeji",
  "price": 399,
  "tags": ["lowbee", "xuhangduan", "zhiliangx"]
}

# Search and aggregate. With the default dynamic mapping, tags is a text field,
# so a terms aggregation on it fails unless fielddata is enabled.
GET /product2/_search
{
  "query": {
    "match": {
      "name": "xiaomi"
    }
  },
  "aggs": {
    "tags_aggs": {
      "terms": {
        "field": "tags"
      }
    }
  }
}

# Modify the mapping: enable fielddata on tags.
# Creating an index with PUT /product2 only works if the index does not exist yet,
# so delete the index first and re-index the documents afterwards.
PUT /product2
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "desc": { "type": "text" },
      "price": { "type": "long" },
      "tags": { "type": "text", "fielddata": true }
    }
  }
}

Batch operations

mget: batch retrieval

# First way
GET /_mget
{
  "docs": [
    { "_index": "product2", "_type": "_doc", "_id": "1" },
    { "_index": "phone", "_type": "_doc", "_id": "1" }
  ]
}
# Second way
GET /product2/_mget
{
  "docs": [
    { "_id": "1" },
    { "_id": "2" }
  ]
}
# Third way
GET /product2/_mget
{
  "ids": ["2", "1"]
}

bulk: create, update, and delete documents in batches

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...


The action/metadata line specifies what to do to which document. The action must be one of the following:

  • create: creates the document only if it does not already exist.
  • index: creates a new document or replaces an existing one.
  • update: partially updates a document.
  • delete: deletes a document.

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.
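The line-oriented format described above can be assembled with a short Python helper. This is a sketch for illustration only: `build_bulk_body` is a hypothetical helper, not part of any Elasticsearch client, and it only builds the request body as a string (it does not send anything).

```python
import json

VALID_ACTIONS = {"create", "index", "update", "delete"}


def build_bulk_body(operations):
    """Turn (action, metadata, body) tuples into a newline-delimited bulk body.

    Every line, including the last one, is terminated by a newline.
    A delete has no request-body line; all other actions need one.
    """
    lines = []
    for action, metadata, body in operations:
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown bulk action: {action}")
        lines.append(json.dumps({action: metadata}))
        if action == "delete":
            if body is not None:
                raise ValueError("delete takes no request body")
        else:
            if body is None:
                raise ValueError(f"{action} needs a request body")
            lines.append(json.dumps(body))
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    ("delete", {"_index": "product2", "_id": "1"}, None),
    ("create", {"_index": "product2", "_id": "6"}, {"name": "hongmi2 erji", "price": 99}),
    ("update", {"_index": "product2", "_id": "2"}, {"doc": {"price": 999}}),
])
print(body, end="")
```

Each JSON object must stay on a single line, which is why the helper serializes with `json.dumps` and never pretty-prints.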

POST /_bulk
{"delete":{"_index":"product2","_type":"_doc","_id":"1"}}
{"create":{"_index":"product2","_type":"_doc","_id":"6"}}
{"name":"hongmi2 erji","desc":"erji zhong de kendeji","price":99,"tags":["lowbee","xuhangduan","zhiliangx"]}
{"update":{"_index":"product2","_type":"_doc","_id":"2"}}
{"doc":{"price":"999"}} 

ES optimistic locking

(1) Pessimistic locking: always take a lock before operating (read/write locks, row locks, table locks).

(2) Optimistic locking: supports high concurrency but is more work to use, because every non-query (write) operation must compare the version.

GET /noble_test/_doc/1?refresh

# Response:
{
  "_index": "noble_test",
  "_type": "_doc",
  "_id": "1",
  "_version": 12,
  "_seq_no": 11,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "first_name": "yanlk"
  }
}

# Write only if the document is still at version 12.
# On Elasticsearch 7.x and later, use ?if_seq_no=11&if_primary_term=1 instead of ?version.
PUT /noble_test/_doc/1?version=12
{
  "first_name": "yanlk"
}

# Response:
{
  "_index": "noble_test",
  "_type": "_doc",
  "_id": "1",
  "_version": 13,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 12,
  "_primary_term": 1
}
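The version check behind this exchange can be modelled in a few lines of Python. This is a single-process toy under stated assumptions, not how Elasticsearch stores documents: `VersionedStore` and `VersionConflictError` are hypothetical names, and versions simply start at 1 and increase by 1 on every write.

```python
class VersionConflictError(Exception):
    """The document changed since the caller last read it."""


class VersionedStore:
    def __init__(self):
        self._docs = {}   # doc_id -> (version, source)

    def get(self, doc_id):
        """Return (version, source) for a stored document."""
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        """Write the document; if expected_version is given, compare it first."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflictError(
                f"current version {current_version} differs from expected {expected_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
store.put("1", {"first_name": "yanlk"})
version, _ = store.get("1")

store.put("1", {"first_name": "yanlk2"}, expected_version=version)      # succeeds
try:
    store.put("1", {"first_name": "stale"}, expected_version=version)   # stale version
except VersionConflictError as error:
    print("conflict:", error)
```

No lock is held between the read and the write; a writer that loses the race gets a conflict and has to re-read the document and retry.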
