This article was written by the Blog Park (cnblogs) author One inch HUI. Personal blog address: www.cnblogs.com/zsql/

I have been working with Elasticsearch (ES) recently, so I wrote down some of my ideas. They may not be comprehensive in many respects, but they have basically all been verified. This article targets Elasticsearch 7.8.1.

1. The importance of index design

First, once an index has been created, its number of primary shards can only be multiplied or reduced by a factor through the _split and _shrink APIs. This is mainly because ES assigns each document to a shard according to its _routing value. Changing the shard count is therefore not recommended, because it forces the data to be redistributed. In addition, fields can only be added to an existing mapping; they cannot be modified or deleted. The index is inflexible in this respect, and any such change requires rebuilding it with _reindex. Finally, the size of each shard and the number of shards have a significant impact on query and write performance. A good index design therefore reduces later operation and maintenance work and improves performance considerably, which is why index design is so important.
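To see why the shard count is effectively fixed, here is a minimal Python sketch of hash-based routing. It is a simplification made for illustration: Elasticsearch hashes the _routing value (the document ID by default) with its own hash function and extra routing parameters, whereas `zlib.crc32` and the helper name `shard_for` below are stand-ins that are not part of Elasticsearch.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Simplified routing: hash the routing value, then take it modulo the shard count."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Going from 5 to 6 shards changes the target shard of many documents,
# so the data would have to be moved.
moved_5_to_6 = sum(shard_for(d, 5) != shard_for(d, 6) for d in doc_ids)

# Doubling from 5 to 10 is gentler: a document on shard s either stays on s
# or lands on s + 5, so each old shard splits cleanly into two new ones.
split_is_clean = all(shard_for(d, 10) % 5 == shard_for(d, 5) for d in doc_ids)

print(moved_5_to_6, split_is_clean)
```

Under this model an arbitrary change of the shard count reshuffles documents across all shards, while changing it by a whole factor keeps every document within a predictable group of shards.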

  • Good index design plays an important role in overall cluster planning; it directly affects the quality and complexity of the cluster design.
  • A good index design should take full account of the time and space dimensions of the business scenario, and of every kind of operation: creating, deleting, updating and searching.
  • Good index design follows the principle of “design first, code later”: spend more time in the early stage so that later work goes smoothly and unnecessary rework is avoided.

2. How to design an index

Before designing an index, you should understand what an index is made of: its basic settings, its mapping, and the important concepts of shards, replicas, templates and the index lifecycle. Knowing these lets you tailor the design. You should also understand the company’s business scenario: the data volume, the daily increment, the characteristics of the data, whether historical data will be updated, how long data is kept (permanently or for a fixed period), and whether the data must be searchable in near real time. Combining knowledge of how an index is composed with knowledge of the business scenario leads to a better design.

2.1. Consider the common base configuration of indexes

Elasticsearch 7.x does not allow index-level settings to be configured in elasticsearch.yml, so they have to be set on each index. An index template avoids repeating this work: the settings in the template are applied automatically to every new index whose name matches. For details on how to create an index template for Elasticsearch, see

Let’s look at some common index-level settings:

"number_of_replicas": 1,        # the recommended number of replicas is 1
"max_result_window": 100000,
"refresh_interval": "30s",      # if near-real-time search is not required, increase this value to improve write performance
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s",
"index.search.slowlog.threshold.query.debug": "2s",
"index.search.slowlog.threshold.query.trace": "500ms",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.search.slowlog.threshold.fetch.info": "800ms",
"index.search.slowlog.threshold.fetch.debug": "500ms",
"index.search.slowlog.threshold.fetch.trace": "200ms",
"index.indexing.slowlog.threshold.index.warn": "10s",
"index.indexing.slowlog.threshold.index.info": "5s",
"index.indexing.slowlog.threshold.index.debug": "2s",
"index.indexing.slowlog.threshold.index.trace": "500ms",
"dynamic": false                # a mapping parameter: turns off dynamic field mapping (default is true); whether to disable it is a personal choice

There are many other index settings, which can be adjusted to the actual situation. The common settings can then be collected into an index template:

PUT _index_template/template_index
{
    "index_patterns": [
        "index-*"
    ],
    "template": {
        "settings": {
            "number_of_replicas": 1,
            "max_result_window": 100000,
            "refresh_interval": "30s",
            "index.search.slowlog.threshold.query.warn": "10s",
            "index.search.slowlog.threshold.query.info": "5s",
            "index.search.slowlog.threshold.query.debug": "2s",
            "index.search.slowlog.threshold.query.trace": "500ms",
            "index.search.slowlog.threshold.fetch.warn": "1s",
            "index.search.slowlog.threshold.fetch.info": "800ms",
            "index.search.slowlog.threshold.fetch.debug": "500ms",
            "index.search.slowlog.threshold.fetch.trace": "200ms",
            "index.indexing.slowlog.threshold.index.warn": "10s",
            "index.indexing.slowlog.threshold.index.info": "5s",
            "index.indexing.slowlog.threshold.index.debug": "2s",
            "index.indexing.slowlog.threshold.index.trace": "500ms"
        },
        "mappings": {
            "dynamic": false
        }
    },
    "priority": 10
}

With this template in place, any newly created index whose name starts with index- gets the settings above by default. That covers the common base settings.
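How a new index name is matched against templates can be sketched in a few lines of Python. This only approximates the behaviour: `fnmatchcase` understands more wildcard forms than the `*` used in index patterns, and the second template (`template_index_logs`) is a hypothetical one added purely to show that the template with the highest priority wins.

```python
from fnmatch import fnmatchcase

templates = [
    {"name": "template_index", "index_patterns": ["index-*"], "priority": 10},
    # Hypothetical second template, present only to show how priority decides.
    {"name": "template_index_logs", "index_patterns": ["index-logs-*"], "priority": 20},
]


def pick_template(index_name, candidates):
    """Return the name of the highest-priority template that matches, or None."""
    matching = [
        t for t in candidates
        if any(fnmatchcase(index_name, pattern) for pattern in t["index_patterns"])
    ]
    if not matching:
        return None
    return max(matching, key=lambda t: t["priority"])["name"]


print(pick_template("index-orders-v1", templates))
print(pick_template("index-logs-2020", templates))
print(pick_template("orders", templates))
```

An index whose name matches no pattern, such as `orders` here, receives none of the template settings.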

2.2. Index naming conventions

This part covers naming conventions for indexes and their aliases. Aliases make working with indexes more flexible: an index can have several aliases, and an alias can point to several indexes. Index names start with a fixed prefix, which is what permission control relies on. For details about permission control, see elasticsearch7.8 permission control and planning

Names must strictly follow the format below (otherwise the index cannot be used, because permissions are configured against these names):

  • Index naming convention: index-{industry}-{business}-{version}
  • Alias naming convention: index-{industry}-{business}

If an index is split into several indexes, we need a global read alias that covers all of the split indexes, and a write alias for the index that accepts writes. This is not described in detail here; please refer to section 2.5, Design of large indexes.

  • Read alias: index-{industry}-{business}-read
  • Write alias: index-{industry}-{business}-insert
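The naming rules above are easy to encode in a small helper so that names are never typed by hand. This is only a sketch: the function name `index_names` and the sample values `retail`, `orders` and `v1` are made up for illustration. The lowercase check reflects the fact that Elasticsearch index names must be lowercase.

```python
def index_names(industry: str, business: str, version: str) -> dict:
    """Build the index name and its aliases from the naming convention."""
    for part in (industry, business, version):
        if not part or part != part.lower():
            raise ValueError(f"name parts must be non-empty and lowercase: {part!r}")
    base = f"index-{industry}-{business}"
    return {
        "index": f"{base}-{version}",
        "alias": base,
        "read_alias": f"{base}-read",
        "write_alias": f"{base}-insert",
    }


names = index_names("retail", "orders", "v1")
print(names["index"])        # index-retail-orders-v1
print(names["write_alias"])  # index-retail-orders-insert
```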

2.3. Design of mapping

Mapping design is basically about choosing data types, analyzers (word segmentation), and so on.

Chinese word segmentation: "analyzer": "ik_max_word" (from the IK analysis plugin) is recommended, because it segments Chinese text at a finer granularity.

When setting up a field, be sure to follow the process shown below. Depending on actual business needs, the main concerns are:

  • Data type selection;
  • Whether retrieval is required;
  • Whether sorting + aggregation analysis is required;
  • Whether it needs to be stored separately.

The meanings of core parameters are summarized as follows

2.4. Design of sharding

This is very important and directly affects later management and performance.

Data in Elasticsearch is organized into indexes. Each index consists of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes a portion of the data and processes queries for it within the Elasticsearch cluster.

Shard design principles

  • The recommended size of each shard is 20-40 GB, preferably no more than 30 GB. There are special cases, however, where documents have few or small fields but the document count is very large; in that case the number of shards can be increased.
  • Keep the number of shards per node below 20 to 25 for each 1 GB of heap memory. A node with 30 GB of heap can therefore hold at most 600-750 shards.
  • The number of shards per index is generally 1-3 times the number of data nodes. With 15 data nodes, that gives at most 15 * 3 * 40 GB = 1.8 TB per index. An index of that size is already very large; anything larger should follow the design of large indexes (section 2.5).
  • The number of shards should be a multiple of the number of data nodes wherever possible, so that the index's data is balanced across the nodes. When the amount of data is very small, set the number of shards case by case.
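The arithmetic behind these rules can be sketched as follows. The function names and the default target of 30 GB per shard are illustrative choices taken from the recommendations above; none of them are Elasticsearch settings.

```python
import math


def max_shards_per_node(heap_gb: int, shards_per_heap_gb: int = 20) -> int:
    """Upper bound on shards per node: 20 to 25 shards for each GB of heap."""
    return heap_gb * shards_per_heap_gb


def max_index_size_gb(data_nodes: int, max_multiple: int = 3, max_shard_gb: int = 40) -> int:
    """Largest single index under the '1-3 times the node count' rule."""
    return data_nodes * max_multiple * max_shard_gb


def shard_count(index_size_gb: float, data_nodes: int, target_shard_gb: int = 30) -> int:
    """Shards needed at the target size, rounded up to a multiple of the node count."""
    needed = math.ceil(index_size_gb / target_shard_gb)
    multiple = max(1, math.ceil(needed / data_nodes))
    return multiple * data_nodes


print(max_shards_per_node(30), max_shards_per_node(30, 25))  # 600 750
print(max_index_size_gb(15))                                 # 1800 (GB), i.e. 1.8 TB
print(shard_count(600, 15))                                  # 30
```

Rounding up to a multiple of the node count over-allocates shards for small indexes, which is why the last rule says to decide those case by case.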

Here is a simple reference table (all can be adjusted to suit the situation, just personal suggestions) :

Index size      Number of shards
0-20G           2
20-100G         8
100-400G        15
400-900G        30
900G-1.6T       45

The values above assume 15 data nodes and leave some headroom for growth; it is best to set them according to the actual situation. If an index is so large that the configuration above no longer suffices, the index needs to be split, using an index template + Rollover + index lifecycle to roll over to new indexes automatically. See section 2.5.
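The reference table can also be expressed as a simple lookup. This sketch makes two assumptions of its own: a size that falls exactly on a boundary is placed in the smaller row, and 1.6T is treated as 1600 GB. The names `UPPER_BOUNDS_GB`, `SHARD_COUNTS` and `suggested_shards` are hypothetical.

```python
from bisect import bisect_left

# Upper bound of each size band in GB, and the shard count suggested for that band
# (valid for the 15-data-node cluster the table is based on).
UPPER_BOUNDS_GB = [20, 100, 400, 900, 1600]
SHARD_COUNTS = [2, 8, 15, 30, 45]


def suggested_shards(index_size_gb: float):
    """Return the suggested shard count, or None when the index should be split instead."""
    position = bisect_left(UPPER_BOUNDS_GB, index_size_gb)
    if position >= len(SHARD_COUNTS):
        return None
    return SHARD_COUNTS[position]


print(suggested_shards(50))    # 8
print(suggested_shards(1200))  # 45
print(suggested_shards(2000))  # None: larger than the table, see section 2.5
```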

2.5. Design of large indexes

An index that grows too large carries several risks. First, performance suffers: with a fixed number of shards, each shard grows as data accumulates and eventually violates the design principles above. Second, when something goes wrong with one huge index, it is hard to recover and the impact is wide. So how should a large index be designed? Use an index template + Rollover + index lifecycle to roll over to new indexes automatically. All of the indexes are read through one alias, and exactly one index is designated for writes, which makes splitting the index easy. Take a look at the schematic of this design.

index_latest ensures that data is written only to the latest index. Whenever the current index meets any one of three conditions (number of documents, age, or index size), a new index is rolled over automatically. Now let’s work through a real exercise to make this easier to understand.
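The "any one of three conditions" rule can be illustrated with a short Python sketch. It only models the decision; the dictionary keys and units used here (`max_docs`, `max_age_minutes`, `max_size_gb`) are simplified, hypothetical names rather than the exact parameters of the rollover API.

```python
def should_roll_over(stats: dict, conditions: dict) -> bool:
    """Return True as soon as any one configured condition is met."""
    current = {
        "max_docs": stats["docs"],
        "max_age_minutes": stats["age_minutes"],
        "max_size_gb": stats["size_gb"],
    }
    return any(current[name] >= limit for name, limit in conditions.items())


conditions = {"max_size_gb": 50, "max_age_minutes": 5}

# Small but old enough: rolls over because the age condition alone is met.
print(should_roll_over({"docs": 10, "age_minutes": 6, "size_gb": 1}, conditions))
# Young and small: no condition is met, so writes stay on the current index.
print(should_roll_over({"docs": 10, "age_minutes": 1, "size_gb": 1}, conditions))
```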

The official website comes first; if you are interested, refer to: www.elastic.co/guide/en/el…

It is mainly divided into four steps:

  1. Create rules for the index life cycle
  2. Create the index template and apply the lifecycle
  3. Initialize an index
  4. Verify

If data is kept only for a fixed period, such as logs kept for the last 30 days, the index lifecycle can also clean it up automatically. Let’s start by creating a policy named policy_index. This is only a test, so max_age is set to 5 minutes.

PUT _ilm/policy/policy_index
{
  "policy": {
    "phases": {
      "hot": {                      
        "actions": {
          "rollover": {
            "max_size": "50GB",     
            "max_age": "5m"
          }
        }
      }
    }
  }
}
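The test policy above only rolls indexes over. For the 30-day retention case mentioned earlier, a delete phase would be added. The sketch below builds such a policy body as a Python dictionary and prints it as JSON; the helper name `retention_policy` and the one-day rollover age are illustrative assumptions, not values from this article.

```python
import json


def retention_policy(max_size: str = "50GB", max_age: str = "1d", delete_after: str = "30d") -> dict:
    """Build an ILM policy body that rolls over in the hot phase and deletes old indexes."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    }
                },
                # When rollover is used, min_age is counted from the time the index rolled over.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# The printed JSON could be sent as the body of PUT _ilm/policy/<policy name>.
print(json.dumps(retention_policy(), indent=2))
```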

Next, design the index template and apply this policy to it.

# The index-test-read alias is used for reading only, not writing; otherwise it would conflict with the write alias
PUT _index_template/policy_index_template
{
  "index_patterns": [
    "index-test-*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "policy_index",
      "index.lifecycle.rollover_alias": "index-test-insert"
    },
    "aliases": {
      "index-test-read": {
        "is_write_index": false
      }
    }
  }
}

This template only illustrates the content of this section; in practice the base settings and mapping should be configured in it as well.

The next step is to create an index

PUT index-test-000001

After the index is created, the write alias (index-test-insert) must also be created for it, so that writes through the alias go to this index.
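For rollover to work, the rollover alias has to point at the first index as its write index. One way to do this is to declare the alias in the body of the create-index request. The sketch below only builds that request as Python data; the helper `bootstrap_request` is hypothetical, and sending the request is left to whichever client is in use.

```python
import json


def bootstrap_request(first_index: str, write_alias: str):
    """Return (method, path, body) for creating the first index of a rollover series."""
    body = {"aliases": {write_alias: {"is_write_index": True}}}
    return "PUT", f"/{first_index}", body


method, path, body = bootstrap_request("index-test-000001", "index-test-insert")
print(method, path)
print(json.dumps(body, indent=2))
```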

Next, just validate: GET index-*/_ilm/explain

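When a rollover condition is met, a new index is created automatically and the index-test-insert alias is switched to it. ILM checks the conditions periodically (every 10 minutes by default, controlled by indices.lifecycle.poll_interval), so the rollover may happen somewhat later than the 5 minutes configured above. That’s it.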
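The name of the new index follows from the old one: when an index name ends with a hyphen and a number, rollover increments that number and pads it with zeros to six digits. A small Python sketch of that rule, using a hypothetical helper named `next_rollover_name`:

```python
import re


def next_rollover_name(index_name: str) -> str:
    """Increment the trailing counter of an index name and zero-pad it to six digits."""
    match = re.fullmatch(r"(.*-)(\d+)", index_name)
    if match is None:
        raise ValueError(f"index name must end with '-<number>': {index_name!r}")
    prefix, counter = match.groups()
    return f"{prefix}{int(counter) + 1:06d}"


print(next_rollover_name("index-test-000001"))  # index-test-000002
```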

Large indexes are meant to be split, and many of them are split by time. If I remember correctly, the 000001 suffix above can also be configured as a date.
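Elasticsearch supports date math in index names, written as `<prefix-{now/d}-000001>`, and such a name has to be percent-encoded when it is used in a request path. The sketch below shows the encoding step only; the helper name `encoded_date_math_name` is hypothetical, and whether date-based names suit a given index is a separate design decision.

```python
from urllib.parse import quote


def encoded_date_math_name(prefix: str, rounding: str = "now/d") -> str:
    """Build a date-math index name and percent-encode it for use in a request path."""
    name = f"<{prefix}-{{{rounding}}}-000001>"
    return quote(name, safe="")


# Prints the encoded form of <index-test-{now/d}-000001>
print(encoded_date_math_name("index-test"))
```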

Refer to blog post:

mp.weixin.qq.com/s/KQQJfKCOu…

mp.weixin.qq.com/s?__biz=M…

Author: Yichun HUI Source: www.cnblogs.com/zsql/