When learning about Elasticsearch, we often encounter the following concepts:

  • Inverted index

  • doc_values

  • _source

What do these concepts mean? What are they for, and how are they configured? We can only use them correctly once we understand them well.

Inverted index

The inverted index is the core data structure of Elasticsearch and of any other system that supports full-text search. It is similar to the index at the back of a book: it maps the terms that appear in documents to the documents that contain them.

For example, an inverted index built from three short documents looks like this:

Inverted Index

Term       Frequency   Documents (postings)
choice     1           3
day        1           2
is         3           1, 2, 3
it         1           1
last       1           2
of         1           2
sunday     2           1, 2
the        3           2, 3
tomorrow   1           1
week       1           2
yours      1           3

“Inverted” here means that we look up document IDs by term, which is the opposite of looking up terms by document ID.

Please note the following points:

  • Documents are broken down into terms after punctuation has been removed and the text has been lowercased.

  • Terms are sorted alphabetically.

  • The Frequency column records how many times the term occurs across the whole document set.

  • The third column records the documents in which the term is found. It may also record the exact position of the term (its offset within the document).
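To make the points above concrete, here is a minimal Python sketch that builds the same kind of table. It is only an illustration, not how Lucene is implemented. The three sample sentences are an assumption, chosen so that the output matches the table, and the names analyze and build_inverted_index are made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, chosen so the result matches the table above.
DOCS = {
    1: "It is Sunday tomorrow.",
    2: "Sunday is the last day of the week.",
    3: "The choice is yours.",
}


def analyze(text):
    """Lowercase the text and split it into terms, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to (total frequency, sorted list of document IDs)."""
    frequency = defaultdict(int)
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            frequency[term] += 1
            postings[term].add(doc_id)
    # Keep the terms in alphabetical order, as in the table.
    return {
        term: (frequency[term], sorted(postings[term]))
        for term in sorted(postings)
    }


index = build_inverted_index(DOCS)
for term, (freq, doc_ids) in index.items():
    print(f"{term:<10}{freq:<4}{doc_ids}")
```

Each row printed corresponds to one row of the table: the term, its total frequency, and its postings list.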

When searching, it is very quick to find the documents in which a given term appears. If a user searches for the term “Sunday”, finding it in the Term column is fast because the terms in the index are sorted. Even with millions of terms, a sorted term list can be searched quickly.
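Why sorting helps can be sketched with a binary search over the term list. This is a simplified illustration: Lucene's real term dictionary is more elaborate, and the lookup function below is a hypothetical helper, not an Elasticsearch API. The terms and postings are copied from the table.

```python
from bisect import bisect_left

# Sorted term dictionary and the matching postings lists, taken from the table.
TERMS = ["choice", "day", "is", "it", "last", "of",
         "sunday", "the", "tomorrow", "week", "yours"]
POSTINGS = [[3], [2], [1, 2, 3], [1], [2], [2],
            [1, 2], [2, 3], [1], [2], [3]]


def lookup(term):
    """Binary-search the sorted term list and return the postings (or [])."""
    term = term.lower()
    pos = bisect_left(TERMS, term)
    if pos < len(TERMS) and TERMS[pos] == term:
        return POSTINGS[pos]
    return []


print(lookup("Sunday"))  # [1, 2]
print(lookup("monday"))  # []
```

A binary search needs only about log2(n) comparisons, so even a list of millions of terms is searched in a couple of dozen steps.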

Next, consider a user who searches for two words, such as “last Sunday”. The inverted index can be used to look up “last” and “Sunday” separately. Document 2 contains both terms and is therefore a better match than document 1, which contains only one of them.
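The two-term lookup can be sketched as follows. This is a deliberately simplified ranking that just counts how many query terms each document contains; Elasticsearch's actual relevance scoring is more sophisticated. The search function and the POSTINGS_BY_TERM name are assumptions made for this illustration.

```python
from collections import Counter

# Postings for the two query terms, taken from the table above.
POSTINGS_BY_TERM = {"last": [2], "sunday": [1, 2]}


def search(query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    matches = Counter()
    for term in query.lower().split():
        for doc_id in POSTINGS_BY_TERM.get(term, []):
            matches[doc_id] += 1
    # More matching terms first; ties are broken by document ID.
    return sorted(matches.items(), key=lambda item: (-item[1], item[0]))


print(search("last Sunday"))  # [(2, 2), (1, 1)]
```

Document 2 comes first because it matches both terms, while document 1 matches only “sunday”.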

The inverted index is the basis for fast search. It also makes it easy to find out how many times a term appears in the index, since that is a simple sum of counts. Of course, Elasticsearch adds many refinements on top of the simple inverted index described here, which serves both search and analytics.

By default, Elasticsearch builds an inverted index for every field in a document, pointing to the documents in which the field’s terms occur. These inverted indexes are stored in the Lucene index that underlies each Elasticsearch shard.

In Kibana, we create a document like this:

PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

Once the document is indexed, Elasticsearch has built the corresponding inverted index, so we can search it, for example:

GET twitter/_search
{
  "query": {
    "match": {
      "user": "Elms"
    }
  }
}

We can get the corresponding search results:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "user" : "Double Elms.",
          "message" : "It's a nice day. Let's go out.",
          "uid" : 2,
          "age" : 20,
          "city" : "Beijing",
          "province" : "Beijing",
          "country" : "China",
          "name" : {
            "firstname" : "Three",
            "surname" : "Zhang"
          },
          "address" : [
            "Haidian District, Beijing, China",
            "No.29 Zhongguancun"
          ],
          "location" : {
            "lat" : "39.970718",
            "lon" : "116.325747"
          }
        }
      }
    ]
  }
}

If we don’t want an inverted index to be built for a field that is never searched, we can do this:

DELETE twitter
PUT twitter
{
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword",
        "ignore_above": 256
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "age": {
        "type": "long"
      },
      "country": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "location": {
        "properties": {
          "lat": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "lon": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      },
      "message": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "name": {
        "properties": {
          "firstname": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "surname": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      },
      "province": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
 
PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

Above, we changed the mapping of the user field:

      "user": {
        "type": "object",
        "enabled": false
      }

That is, this field is not indexed, so searching on it returns no results:

GET twitter/_search
{
  "query": {
    "match": {
      "user": "Elms"
    }
  }
}

The search results are:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

As expected, nothing is returned. But if we retrieve the document by ID:

GET twitter/_doc/1

The results are as follows:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "Double Elms.",
    "message" : "It's a nice day. Let's go out.",
    "uid" : 2,
    "age" : 20,
    "city" : "Beijing",
    "province" : "Beijing",
    "country" : "China",
    "name" : {
      "firstname" : "Three",
      "surname" : "Zhang"
    },
    "address" : [
      "Haidian District, Beijing, China",
      "No.29 Zhongguancun"
    ],
    "location" : {
      "lat" : "39.970718",
      "lon" : "116.325747"
    }
  }
}

Clearly, the user value is still stored in _source; it just cannot be searched.

If we don’t want any field of our documents to be searchable, we can even do the following:

DELETE twitter

PUT twitter
{
  "mappings": {
    "enabled": false
  }
}

Now no inverted index is built for the twitter index at all. We can run the following commands:

PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

GET twitter/_search
{
  "query": {
    "match": {
      "city": "Beijing"
    }
  }
}

The search above returns no results. For more information, see “Mapping Parameters: Enabled”.

Source

In Elasticsearch, the original JSON of every document, including all of its fields, is normally stored in the _source field, on the shard that holds the document. For example:

PUT twitter/_doc/2
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

Here, we create a document with id 2. We can obtain all of its stored information by using the following command.

GET twitter/_doc/2

It will return:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "Double Elms.",
    "message" : "It's a nice day. Let's go out.",
    "uid" : 2,
    "age" : 20,
    "city" : "Beijing",
    "province" : "Beijing",
    "country" : "China",
    "name" : {
      "firstname" : "Three",
      "surname" : "Zhang"
    },
    "address" : [
      "Haidian District, Beijing, China",
      "No.29 Zhongguancun"
    ],
    "location" : {
      "lat" : "39.970718",
      "lon" : "116.325747"
    }
  }
}

Under _source above you can see all the fields stored by Elasticsearch. If we don’t want to store the source at all, we can do the following:

DELETE twitter

PUT twitter
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}

Then we use the following command to create a document with id 1:

PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

Again, let’s retrieve this document:

GET twitter/_doc/1

The result displayed is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true
}

The document is found, but no _source is returned. Can we still search this document? Try the following command:

GET twitter/_search
{
  "query": {
    "match": {
      "city": "Beijing"
    }
  }
}

The result displayed is:


{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642
      }
    ]
  }
}

The document with id 1 is found correctly. In other words, it has a complete inverted index for us to query, even though it has no _source.

So how do we store only the fields we want? This is useful when we want to save storage space and keep only the fields we need in _source. We can do the following:

DELETE twitter

PUT twitter
{
  "mappings": {
    "_source": {
      "includes": [
        "*.lat",
        "address",
        "name.*"
      ],
      "excludes": [
        "name.surname"
      ]
    }
  }
}

Above, we use includes to list the fields we want and excludes to drop the fields we don’t need. Let’s index the following document:

PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

We then retrieve it with the following command:

GET twitter/_doc/1

The result is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "address" : [
      "Haidian District, Beijing, China",
      "No.29 Zhongguancun"
    ],
    "name" : {
      "firstname" : "Three"
    },
    "location" : {
      "lat" : "39.970718"
    }
  }
}

Obviously, we only have a few fields stored. In this way, we can selectively store the fields we want.
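The effect of includes and excludes can be approximated outside Elasticsearch with a short Python sketch. It is a simplified model, assuming that a pattern applies to a field's dotted path or to any of its parent objects; the helpers flatten, matches and filter_source are made up for this illustration and are not part of any Elasticsearch client.

```python
from fnmatch import fnmatchcase

DOC = {
    "user": "Double Elms.",
    "message": "It's a nice day. Let's go out.",
    "uid": 2,
    "age": 20,
    "city": "Beijing",
    "province": "Beijing",
    "country": "China",
    "name": {"firstname": "Three", "surname": "Zhang"},
    "address": ["Haidian District, Beijing, China", "No.29 Zhongguancun"],
    "location": {"lat": "39.970718", "lon": "116.325747"},
}
INCLUDES = ["*.lat", "address", "name.*"]
EXCLUDES = ["name.surname"]


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf; arrays count as leaf values."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def matches(path, patterns):
    """True if the path, or any parent object of it, matches a pattern."""
    parts = path.split(".")
    prefixes = [".".join(parts[: i + 1]) for i in range(len(parts))]
    return any(fnmatchcase(p, pat) for p in prefixes for pat in patterns)


def filter_source(doc, includes, excludes):
    """Keep leaves that match an include pattern and no exclude pattern."""
    kept = {}
    for path, value in flatten(doc):
        if matches(path, includes) and not matches(path, excludes):
            node = kept
            *parents, leaf = path.split(".")
            for parent in parents:
                node = node.setdefault(parent, {})
            node[leaf] = value
    return kept


stored = filter_source(DOC, INCLUDES, EXCLUDES)
print(stored)
```

With the patterns from the mapping above, only name.firstname, address and location.lat survive, which matches the _source returned by Elasticsearch.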

In practice, even when many fields are stored in _source, we can choose which of them to return when retrieving a document:

GET twitter/_doc/1?_source=name,location

In this case, we only want to display the fields associated with name and location.

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : {
      "firstname" : "Three"
    },
    "location" : {
      "lat" : "39.970718"
    }
  }
}

For more information, see the document “Mapping Meta-field: _source”.

Doc_values

By default, most fields are indexed, which makes them searchable. An inverted index lets a query look up a search term in a unique, sorted list of terms and from there get immediate access to the list of documents that contain the term.

Sorting, aggregations, and access to field values in scripts require a different data access pattern. Instead of looking up a term and finding documents, we need to be able to look up a document and find the terms it has in a field.

Doc values are on-disk data structures, built when a document is indexed, that make this data access pattern possible. They store the same values as _source, but in a column-oriented fashion that is much more efficient for sorting and aggregations. Doc values are supported on almost all field types, with the notable exception of text fields.
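The column-oriented idea can be sketched in a few lines of Python. This is only a conceptual model, not the on-disk format Lucene uses. The three documents and the helper names build_doc_values, sort_by and terms_aggregation are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical documents; only the fields used below are shown.
DOCS = {
    1: {"city": "Beijing", "age": 20},
    2: {"city": "Shanghai", "age": 35},
    3: {"city": "Beijing", "age": 28},
}


def build_doc_values(docs, field):
    """Build one column for a field: document ID -> value."""
    return {doc_id: doc[field] for doc_id, doc in docs.items() if field in doc}


# One column per field, built once when the documents are indexed.
city_column = build_doc_values(DOCS, "city")
age_column = build_doc_values(DOCS, "age")


def sort_by(column, doc_ids):
    """Sort matching documents by reading each value straight from the column."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])


def terms_aggregation(column, doc_ids):
    """Count how many matching documents carry each value."""
    return Counter(column[doc_id] for doc_id in doc_ids).most_common()


hits = [1, 2, 3]
print(sort_by(age_column, hits))             # [1, 3, 2]
print(terms_aggregation(city_column, hits))  # [('Beijing', 2), ('Shanghai', 1)]
```

Both operations go from a document ID to its value, which is the direction an inverted index does not support efficiently.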

By default, doc values are enabled on all fields that support them. If you are sure that you do not need to sort or aggregate on a field, or access its values from a script, you can disable doc values for that field to save disk space.

For example, we can make the city field unavailable for sorting and aggregations as follows:

DELETE twitter
PUT twitter
{
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword",
        "doc_values": false,
        "ignore_above": 256
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "age": {
        "type": "long"
      },
      "country": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "location": {
        "properties": {
          "lat": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "lon": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      },
      "message": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "name": {
        "properties": {
          "firstname": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "surname": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      },
      "province": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

Above, we set the doc_values of the city field to false.

      "city": {
        "type": "keyword",
        "doc_values": false,
        "ignore_above": 256
      },

We create a document as follows:

PUT twitter/_doc/1
{
  "user" : "Double Elms.",
  "message" : "It's a nice day. Let's go out.",
  "uid" : 2,
  "age" : 20,
  "city" : "Beijing",
  "province" : "Beijing",
  "country" : "China",
  "name" : {
    "firstname" : "Three",
    "surname" : "Zhang"
  },
  "address" : [
    "Haidian District, Beijing, China",
    "No.29 Zhongguancun"
  ],
  "location" : {
    "lat" : "39.970718",
    "lon" : "116.325747"
  }
}

Now, when we run an aggregation on the city field like this:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "city_bucket": {
      "terms": {
        "field": "city",
        "size": 10
      }
    }
  }
}

In Kibana we see:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "twitter",
        "node": "IyyZ30-hRi2rnOpfx4n1-A",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
      }
    }
  },
  "status": 400
}

Clearly, the operation failed. Although we cannot aggregate or sort on this field, we can still retrieve its value from _source with the following command:

GET twitter/_doc/1

The command output is as follows:


{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "Double Elms.",
    "message" : "It's a nice day. Let's go out.",
    "uid" : 2,
    "age" : 20,
    "city" : "Beijing",
    "province" : "Beijing",
    "country" : "China",
    "name" : {
      "firstname" : "Three",
      "surname" : "Zhang"
    },
    "address" : [
      "Haidian District, Beijing, China",
      "No.29 Zhongguancun"
    ],
    "location" : {
      "lat" : "39.970718",
      "lon" : "116.325747"
    }
  }
}

See “Mapping Parameters: doc_values” for more information.