
Elasticsearch is a search engine used by top companies like Netflix, Microsoft, eBay, and Facebook. It is easy to get started with, but relatively difficult to master in the long run. In this article, we share six less obvious but well worth knowing features of Elasticsearch.

1. Elastic Stack

Elasticsearch was originally developed as a standalone product. Its core role is to provide a scalable search engine: it ships client libraries for many languages, is built on a distributed model, and exposes a REST API. As the Elastic ecosystem grew, a set of companion tools appeared alongside Elasticsearch, from the earliest Kibana (visualization and data analysis) and Logstash (log collection) to the many tools Elastic developed later:

  • Beats – lightweight data shippers,
  • Elastic Cloud – hosted Elasticsearch clusters,
  • Machine Learning – discovering patterns in data,
  • APM – application performance monitoring,
  • Swiftype – one-click site search.

The number of tools grows every year, enabling companies to achieve new goals and create new opportunities.

Elastic is no longer just Elasticsearch; it is an integrated big data tool set.

2. Two kinds of data sets

2.1 Data set classification

Basically, you can index (i.e. store) any data you want in Elasticsearch, but in practice it falls into two types: static data and time series data. The distinction has a significant impact on how the cluster is configured and managed.

  • Static data is a data set that grows or changes slowly: a catalog or a list of things, such as blog posts, library books, or orders. You can think of it as the kind of data stored in a regular database. You may want to index such data in Elasticsearch to enable fast searches that are difficult in a regular database.
  • Time series data is typically fast-growing, event-related data such as log files or metrics. You index it in Elasticsearch for data analysis, pattern discovery, and system monitoring.

2.2 Data set modeling method

Depending on the type of data you store, you should model the cluster differently.

  • For static data: choose a fixed number of indexes and shards. The data does not grow rapidly, and you always want to search all documents in the data set.
  • For time series data: choose rolling, time-based indexes. You query recent data far more often, and you can eventually delete or at least archive outdated documents to save on physical storage.

Ming Yi: The two kinds of data sets determine two different ways of modeling our data.

3. Search score

For each search query, Elasticsearch calculates a relevance score. The score is based on the TF-IDF (term frequency-inverse document frequency) algorithm, which computes two values:

  • Term frequency (TF) – how often a given term occurs in a document.
  • Inverse document frequency (IDF) – how unique a given term is across all documents.

3.1 TF calculation

For example, if we have two documents:

Document 1: To be or not to be, that is the question.
Document 2: To be. I am. You are. He, she is.

The TF of the term "question" is calculated as follows:

  • Document 1: 1/10 (1 occurrence in 10 terms),
  • Document 2: 0/9 (0 occurrences in 9 terms).

3.2 IDF calculation

IDF is a single value for the entire data set: the log of the ratio of all documents to those containing the search term. In our example it is log(2/1) = 0.301, where:

  • 2 – the number of all documents,
  • 1 – the number of documents that contain the term "question".

3.3 Correlation score results

Finally, the TF-IDF score of each document is the product of the two values:

  • Document 1: 1/10 x 0.301 = 0.1 x 0.301 = 0.03,
  • Document 2: 0/9 x 0.301 = 0 x 0.301 = 0.00.

Document 1 scores 0.03 and document 2 scores 0.00, so document 1 is ranked first in the result list.
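The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of classic TF-IDF as described here, not Elasticsearch's actual scoring code, and the naive `tokenize` helper is a stand-in for a real analyzer:

```python
import math

docs = {
    1: "To be or not to be, that is the question.",
    2: "To be. I am. You are. He, she is.",
}

def tokenize(text):
    # Strip punctuation and lowercase, roughly as a standard analyzer would.
    return [t.strip(".,").lower() for t in text.split()]

term = "question"
tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}

# TF: how often the term occurs in each document, normalized by length.
tf = {doc_id: tokens.count(term) / len(tokens) for doc_id, tokens in tokenized.items()}

# IDF: log of (total docs / docs containing the term), one value for the data set.
docs_with_term = sum(1 for tokens in tokenized.values() if term in tokens)
idf = math.log10(len(docs) / docs_with_term)

scores = {doc_id: tf[doc_id] * idf for doc_id in docs}
print(scores)  # document 1 scores ~0.03, document 2 scores 0.0
```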

Ming Yi: Real scoring is more complicated than this (since Elasticsearch 5.0 the default similarity is BM25, not plain TF-IDF); you can verify the details with explain: true, as follows:

PUT my_index3
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": { 
          "type": "text"
        }
      }
    }
  }
}

POST my_index3/_doc/1
{
  "title":"To be or not to be, that is the question."
}

POST my_index3/_doc/2
{
  "title":"To be. I am. You are. He, she is."
}

POST my_index3/_search
{
  "explain": true, 
  "query": {
    "match": {
        "title":"question"
    }
  }
}

4. Data Model

Elasticsearch has two major performance advantages: it scales horizontally, and it is very fast. The speed mainly comes from how the data is stored.

4.1 Index stage data model

When a document is indexed, it goes through three steps: character filters, a tokenizer, and token filters. They are used to normalize the document. For example, the document

To be or not to be, that is the question.

1) may actually be stored as the following, if punctuation is removed and all terms are lowercased:

to be or not to be that is the question

2) or it may be stored as the following, if a stop-word filter is also applied, removing common terms such as to, be, or, not, that, is, the, and leaving only:

question

That is the indexing phase.
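The three normalization steps can be imitated in plain Python. This is a toy sketch of the analysis chain described above, not Elasticsearch's implementation, and the tiny stop-word list is only illustrative:

```python
import re

STOP_WORDS = {"to", "be", "or", "not", "that", "is", "the"}

def char_filter(text):
    # Character filter: strip punctuation from the raw input.
    return re.sub(r"[^\w\s]", "", text)

def tokenizer(text):
    # Tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def token_filter(tokens):
    # Token filter: drop common stop words.
    return [t for t in tokens if t not in STOP_WORDS]

doc = "To be or not to be, that is the question."
tokens = token_filter(tokenizer(char_filter(doc)))
print(tokens)  # ['question']
```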

4.2 Search phase data model

The same steps apply when searching. The query goes through the same character filters, tokenizer, and token filters, and Elasticsearch then searches for documents containing the normalized terms. Fields in Elasticsearch are stored in an inverted index structure, which allows matching documents to be retrieved quickly.

You can define specific filters for each field. The definitions are implemented through analyzers, and multiple analyzers can analyze the same field for different goals. For example:

You can use the standard analyzer for general tokenization, ik_max_word for fine-grained Chinese segmentation, and ik_smart for coarse-grained segmentation.

Then, in the search phase, you can define which fields to scan to get the results you want. This is how Elasticsearch can return results faster than a regular database.
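The inverted index mentioned above is easy to picture as a dictionary from term to document ids: lookup cost does not grow with the number of documents scanned, only with the posting list retrieved. A minimal sketch (not Lucene's actual data structures, which add compression, skip lists, and more):

```python
from collections import defaultdict

# Already-analyzed documents: id -> normalized terms.
documents = {
    1: "to be or not to be that is the question",
    2: "to be i am you are",
}

# Build a toy inverted index: term -> set of document ids.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# A query term lookup is a single dictionary access.
print(sorted(inverted_index["question"]))  # [1]
print(sorted(inverted_index["be"]))        # [1, 2]
```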

Ming Yi: A good data model not only improves retrieval efficiency but also saves storage space.

5. Sharding Plan

5.1 How many shards and indexes should I have?

This is the most common question beginners ask about Elasticsearch. Why does it matter so much? Because the number of shards can only be set when an index is created.

So the answer really depends on your data set. As a rule of thumb, a single shard should hold at most 20-40 GB of data. A shard is, under the hood, an Apache Lucene index.

Given all the structures and overhead Apache Lucene maintains for its inverted index and fast searches, it does not make sense to create tiny shards (say 100 MB or 1 GB).

Elastic consultants recommend 20-40 GB. Remember that a shard cannot be split further and always resides on a single node. Shards of this size can also be moved to other nodes easily, or replicated within the cluster if needed. This shard capacity gives you a good trade-off between speed and memory consumption.

Of course, performance metrics may show different things in your particular situation, so keep in mind that this is just a suggestion and you may want to achieve other performance goals in conjunction with your actual business scenarios.
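As a back-of-the-envelope illustration of the guideline, the shard count for a known data volume is simple division. The 1.5 TB figure below is an assumed example, not from the original article:

```python
import math

# Assumed figures for illustration: a 1.5 TB data set and the
# 20-40 GB per-shard guideline discussed above.
data_set_gb = 1500
target_shard_gb = 30  # middle of the 20-40 GB range

# Round up so no shard exceeds the target size.
primary_shards = math.ceil(data_set_gb / target_shard_gb)
print(primary_shards)  # 50 primary shards
```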

5.2 Precautions for Actual Sharding

1) To estimate how many shards each index should have, index some documents into a temporary index to measure how much space they consume, and estimate how many documents you expect over a period of time. "Time" means either a slice of time (for a time series data set) or the total lifetime (for a static data set).

2) Don't forget that even if you misconfigure the number of shards or indexes, you can always reindex the data into a correctly configured index to complete the migration. Last but not least, you can always query multiple indexes at once. For example, with date-based rolling indexes you can cover all of the previous month in a single query simply by listing the indexes (or an alias over them):

logstash_20190201_000001
logstash_20190202_000002
....
logstash_20190228_000028

Querying 30 single-shard indexes has roughly the same performance as querying one large index with 30 shards.

Ming Yi: Sizing shards against the actual business data volume is the foundation of a sharding plan.

6. Node type

An Elasticsearch node can take on multiple roles:

  • Master: the master(-eligible) node,
  • Data: the data node,
  • Ingest: the ingest node,
  • Coordinating-only: the coordinating-only node.

Each role has a purpose.

6.1 Master node

Role: responsible for cluster-wide settings and changes, such as creating or deleting indexes, adding or removing nodes, and assigning shards to nodes. A large cluster should contain at least three master-eligible nodes; the system elects one of them as the active master, which performs the cluster-wide operations, while the other two exist purely for high availability. Hardware requirements: the master node has low demands on CPU, RAM, and disk storage.

6.2 Data Nodes

Purpose: storing and searching data. Hardware requirements: data nodes have high demands on all resources: CPU, RAM, and disk. The more data you have, the more hardware resources are required.

6.3 Ingest node

Purpose: the ingest node preprocesses documents before the actual indexing happens. It intercepts bulk and index requests, applies its transformations, and then passes the documents on to the index or bulk API. Hardware requirements: low disk, medium RAM, high CPU.
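A minimal sketch of such preprocessing, in the same console style as the earlier examples: define an ingest pipeline with a single processor, then index a document through it. The pipeline name lowercase_title is made up for illustration:

```
PUT _ingest/pipeline/lowercase_title
{
  "description": "Lowercase the title field before indexing",
  "processors": [
    { "lowercase": { "field": "title" } }
  ]
}

POST my_index3/_doc/3?pipeline=lowercase_title
{
  "title": "TO BE OR NOT TO BE"
}
```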

6.4 Coordinating-only node

Function: a load balancer for client requests. It knows where specific documents reside and routes search requests to the corresponding nodes. Warning: adding too many coordinating-only nodes increases the burden on the whole cluster, because the elected master must wait for cluster-state acknowledgements from every node! The benefit of coordinating-only nodes should not be overstated; data nodes can happily serve the same purpose.

Hardware requirements: low disk, medium-to-high RAM, and medium-to-high CPU.

6.5 What is the preferred configuration for a large cluster?

Here are some suggestions:

  1. Three master nodes – maintaining cluster state and cluster settings,
  2. Two coordinating-only nodes – listening for external requests and acting as smart load balancers for the whole cluster,
  3. Many data nodes – depending on the needs of the data set,
  4. Several ingest nodes (optional) – if you run ingest pipelines and want to offload document preprocessing from the other nodes.

The exact numbers depend on your specific use case and actual business scenario, and must be adjusted based on performance testing.
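The roles above are assigned per node in elasticsearch.yml. A sketch using the legacy boolean flags (pre-7.9 style; newer versions use a single node.roles list instead):

```yaml
# Dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# Data node
node.master: false
node.data: true
node.ingest: false

# Coordinating-only node: every role disabled
node.master: false
node.data: false
node.ingest: false
```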

Ming Yi: Node roles need to be allocated according to the actual business scenario and scale.

Summary

Every company's business scenarios are different, so treat the six features above as a menu to pick from. In practice, further optimization should be based on your business scenarios, the official documentation, and the source code. In translating this article, I combined my own practice to make some fine tuning and add interpretation.

Dariusz Mydlarz is an Elastic Certified Engineer. Original article: blog.softwaremill.com/6-not-so-ob…



Ming Yi world – a leading public account for Elasticsearch basics, advanced topics, and hands-on practice.