Elasticsearch (part 1)

Preface

Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. It is written in Java and released as open source under the Apache license, and it is a popular enterprise search engine. Designed with cloud deployments in mind, it aims to be stable, reliable, fast, and easy to install and use. Official clients are available for Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.

Elasticsearch interview questions

1. Describe your company's ES cluster architecture, index data size, number of shards, and the tuning you have done.

2. What is the inverted index in Elasticsearch?

3. What do you do when Elasticsearch index data grows very large? How do you tune and deploy?

4. How does Elasticsearch implement master election?

5. Describe in detail how Elasticsearch indexes a document.

6. Describe the Elasticsearch search process in detail.

7. How do you optimize Linux settings when deploying Elasticsearch?

8. What is the internal structure of Lucene?

9. How does Elasticsearch implement master election?

10. In an Elasticsearch cluster of, say, 20 nodes, 10 nodes elect one master and the other 10 elect a different one. What do you do?

11. How do clients select specific nodes to execute requests when connecting to the cluster?

12. Describe in detail how Elasticsearch indexes a document.

1. Describe your company's ES cluster architecture, index data size, number of shards, and the tuning you have done.

Interviewer: I want to understand the scenarios and scale of ES the candidate has worked with, and whether they have done large-scale index design, planning, and tuning.

Answer: Answer truthfully, based on your own practical experience.

For example: the ES cluster has 13 nodes. Indices are split by channel into 20+ indices and created by date, so 20+ new indices are added every day. Each index has 10 shards, and 100 million+ documents are added daily. The daily index of each channel is kept under 150 GB.

Tuning methods at the index level alone:

1.1. Optimization in the design stage

(1) Create indices from date-based templates and roll them with the Rollover API, according to how the business data grows;

(2) Manage indices through aliases;

(3) Run force_merge on indices in the early morning every day to release space;

(4) Separate hot and cold data: store hot data on SSD to improve retrieval efficiency, and periodically shrink cold data to reduce storage;

(5) Use index lifecycle management;

(6) Configure analyzers sensibly, and only for the fields that need tokenization;

(7) At the mapping stage, consider each field's attributes fully: whether it needs to be searchable, whether it needs to be stored, and so on.

1.2. Write tuning

(1) Set the number of replicas to 0 before writing;

(2) Before writing, set refresh_interval to -1 to disable the refresh mechanism;

(3) Use bulk writes during the import;

(4) Restore the number of copies and refresh interval after writing;

(5) Use automatically generated ids whenever possible.
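
The write-tuning steps above can be sketched as the request bodies a client would send. This is a minimal sketch that is not tied to any client library: the helper names (`pre_load_settings`, `post_load_settings`, `bulk_body`) and the restored values (1 replica, 1s refresh) are assumptions for illustration. The `_settings` and `_bulk` endpoints and the newline-delimited bulk format belong to the Elasticsearch REST API.

```python
import json

def pre_load_settings():
    """Body for PUT /<index>/_settings before a large import:
    no replicas, refresh disabled."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}

def post_load_settings(replicas=1, refresh_interval="1s"):
    """Body for PUT /<index>/_settings once the import is done.
    The default values are assumptions, not taken from the article."""
    return {"index": {"number_of_replicas": replicas,
                      "refresh_interval": refresh_interval}}

def bulk_body(index, docs):
    """Build the NDJSON payload for POST /_bulk.
    No _id is supplied, so Elasticsearch generates the ids itself."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
```

Each document becomes two lines in the bulk payload, an action line followed by the source line, so a batch of n documents produces 2n lines.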

1.3. Query tuning

(1) Avoid wildcard queries;

(2) Avoid terms queries with very large batches of values (hundreds or thousands);

(3) Make full use of the inverted index: use the keyword type wherever possible;

(4) When the data volume is large, narrow down the indices by time before searching;

(5) Set up a sensible routing mechanism.

1.4. Other tuning

Deployment tuning, business tuning, etc.

If you cover some of the points above, the interviewer will have a rough picture of your hands-on and operations experience.

2. What is the inverted index in Elasticsearch?

Interviewer: I want to know your understanding of basic concepts.

Answer: A plain-language explanation is enough.

Traditional retrieval scans the articles one by one to find where a keyword occurs.

An inverted index applies a tokenization strategy to build a table that maps words to the articles containing them; this dictionary plus mapping table is the inverted index. With it, finding the articles for a word becomes a near constant-time dictionary lookup instead of a scan over every article, which greatly improves retrieval efficiency.

The more formal answer:

An inverted index is the opposite of recording which words an article contains: it starts from a word and records which documents that word appears in. It consists of two parts: a term dictionary and the inverted lists (postings lists).
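
The two parts can be seen in a toy version. This is a minimal sketch, assuming whitespace tokenization and lowercase normalization (real analyzers do much more); the function names and sample documents are made up for illustration. The keys of the result play the role of the term dictionary, and the values are the postings lists.

```python
from collections import defaultdict

def tokenize(text):
    """Naive word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene uses an inverted index",
    3: "An inverted index maps words to documents",
}
index = build_inverted_index(docs)
```

Looking up `index["lucene"]` returns the ids of the documents containing that word without touching the text of any document.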

The term dictionary of the inverted index is implemented on top of the FST (Finite State Transducer) data structure.

Lucene has used FST extensively since version 4. FST has two advantages:

(1) Small footprint. By sharing the prefixes and suffixes of the words in the dictionary, it reduces storage space;

(2) Fast lookups: O(len(str)) query time complexity.
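
A full FST is too long for a short example, so the sketch below uses a plain trie instead. A trie shares prefixes only (an FST also shares suffixes), but it is enough to show both properties: shared prefixes save nodes, and a lookup walks one node per character. The class names and sample terms are made up for illustration, and this is not Lucene's implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None  # e.g. a pointer to the term's postings list

class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, term, value):
        node = self.root
        for ch in term:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.value = value

    def lookup(self, term):
        """Walk one node per character: O(len(term)), whatever the dictionary size."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

trie = Trie()
for i, term in enumerate(["search", "searched", "searching", "seat"]):
    trie.insert(term, i)
```

The four terms contain 27 characters in total, yet the trie needs only 12 character nodes plus the root because the common prefixes are stored once.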

3. What do you do when Elasticsearch index data grows very large? How do you tune and deploy?

Interviewer: I want to know your ability to operate and maintain large data volumes.

Answer: Index data should be planned early, the so-called "design first, code later", so that a sudden surge in data does not exceed the cluster's processing capacity and affect online customer search or other business.

How to tune was touched on in question 1; here are the details:

3.1 Dynamic index level

Roll indices over using template + time + Rollover API. For example, at design time define the blog index template so that index names take the form blog_index_<timestamp>, with a new index for each day's data. The benefit is that a surge in data does not make a single index extremely large, approaching Lucene's hard limit of roughly 2^31 documents per shard or reaching TB+ of storage.

Once a single index becomes very large, storage and other risks follow, so plan ahead and avoid the problem early.
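
The naming scheme and the alias switch that goes with it can be sketched as follows. This is a minimal sketch: the helper names and the `%Y%m%d` date suffix are assumptions (the text only says the name ends in a timestamp), and the second helper only builds the body for the `_aliases` endpoint rather than calling a cluster.

```python
from datetime import date

def daily_index_name(prefix, day):
    """Name of the index holding one day's data, e.g. blog_index_20190801."""
    return f"{prefix}_{day:%Y%m%d}"

def switch_write_alias(alias, old_index, new_index):
    """Body for POST /_aliases that moves the alias to the new index in one call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

Writers keep using the alias, so the application does not need to know which daily index is current.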

3.2 Storage layer

Store hot data (for example, the last three days or the last week) separately from the rest of the data.

If cold data no longer receives new writes, you can periodically run force_merge plus shrink on it to save storage space and improve search efficiency.

3.3 Deployment layer

If no planning was done up front, this is the emergency strategy.

Thanks to ES's own support for dynamic scaling, adding machines on the fly can relieve pressure on the cluster. Note: if the master nodes and the rest of the layout were planned sensibly, machines can be added without restarting the cluster.

4. How does Elasticsearch implement master election?

Interviewer: I want to know whether you understand the underlying principles of an ES cluster, not just the business level.

Answer:

Prerequisites:

(1) Only master-eligible nodes (node.master: true) can become the master.

(2) The minimum number of master-eligible nodes (minimum_master_nodes) exists to prevent split brain.

The election selects the master node and returns it on success; otherwise it returns null. The process is roughly as follows:

Step 1: Confirm that the number of master-eligible nodes reaches the value of discovery.zen.minimum_master_nodes set in elasticsearch.yml;

Step 2: Compare: first check whether a node is master-eligible; master-eligible nodes are returned with priority.

If both nodes are master-eligible, the one with the smaller id becomes the master. Note that the id here is a string.
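
The two steps above can be sketched in a few lines. This is a simplified illustration, not the ZenDiscovery code: the function name and the node dictionaries are hypothetical, and only the quorum check and the id comparison described here are modelled.

```python
def elect_master(nodes, minimum_master_nodes):
    """nodes: list of dicts with 'id' (str) and 'master_eligible' (bool).
    Returns the id of the elected master, or None if there is no quorum."""
    candidates = [n["id"] for n in nodes if n["master_eligible"]]
    if len(candidates) < minimum_master_nodes:
        return None  # step 1 failed: too few master-eligible nodes
    return min(candidates)  # step 2: smallest id wins, compared as strings
```

Because ids are compared as strings, "10" sorts before "9", which is why the type of the id matters.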

Aside: how to get the node id.

    GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name
    ip port heapPercent heapMax id name

5. Describe in detail how Elasticsearch indexes a document

Interviewer: I want to know whether you understand the underlying principles of ES, not just the business level.

Answer:

"Indexing a document" here means writing a document into ES, that is, the process of creating its index entries.

Document writes include single-document writes and bulk writes. Only the single-document flow is described here.

Keep the diagram from the official documentation in mind.

Step 1: The client sends a write request to some node in the cluster. (If no routing/coordinating node is specified, the node that receives the request acts as the coordinating node.)

Step 2: Node 1 receives the request and uses the document id to determine that the document belongs to shard 0. Because the primary of shard 0 is allocated to another node, say node 3, the request is forwarded to node 3.

Step 3: Node 3 performs the write on the primary shard. If it succeeds, node 3 forwards the request in parallel to the replica shards on node 1 and node 2 and waits for their results. Once all replicas report success, node 3 reports success to the coordinating node (node 1), which reports success to the requesting client.

If the interviewer follows up: how does step 2 find the shard a document belongs to?

Answer: Through the routing algorithm, which computes the target shard id from the routing value (by default the document id).

    shard = hash(_routing) % (num_of_primary_shards)
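
The routing formula above can be tried out directly. This is a minimal sketch: `shard_for` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch actually uses (murmur3), so the shard numbers it produces will not match a real cluster.

```python
import zlib

def shard_for(routing, num_primary_shards):
    """Pick the primary shard for a routing value (the document id by default).
    crc32 stands in for the murmur3 hash used by Elasticsearch."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards
```

The same routing value always maps to the same shard, and the result depends on the number of primary shards, which is why that number is fixed when the index is created.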

6. Describe the Elasticsearch search process in detail

Interviewer: I want to know whether you understand the underlying principles of ES search, not just the business level.

Answer:

A search is broken into two phases: "Query Then Fetch".

The purpose of the query phase is to locate the matching documents without fetching them.

The steps are as follows:

(1) Suppose an index has 5 primaries with 1 replica each, 10 shards in total. A request is served by one copy (primary or replica) of each shard.

(2) Each shard runs the query locally and puts its results into a local sorted priority queue.

(3) The results of step 2 are sent to the coordinating node, which produces a globally sorted list.

The purpose of the fetch phase is to retrieve the data.

The coordinating node fetches all the documents in that list and returns them to the client.
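
The two phases can be simulated in memory. This is a minimal sketch under stated assumptions: each shard is a plain dictionary, the score is a simple count of the word "search" (not Lucene's scoring), and all function names are made up for illustration.

```python
import heapq

def query_phase(shards, score, size):
    """Each shard returns its local top `size` hits as (score, doc_id) only."""
    per_shard = []
    for shard in shards:
        hits = [(score(doc), doc_id) for doc_id, doc in shard.items()]
        per_shard.append(heapq.nlargest(size, hits))
    return per_shard

def merge_on_coordinator(per_shard, size):
    """The coordinating node merges the shard lists into one global top `size`."""
    return heapq.nlargest(size, (hit for hits in per_shard for hit in hits))

def fetch_phase(shards, top_hits):
    """Fetch the full documents only for the globally selected ids."""
    by_id = {doc_id: doc for shard in shards for doc_id, doc in shard.items()}
    return [by_id[doc_id] for _, doc_id in top_hits]

shards = [
    {"a": "lucene search", "b": "search search engine"},
    {"c": "database"},
    {"d": "search"},
]
score = lambda doc: doc.split().count("search")
top = merge_on_coordinator(query_phase(shards, score, size=2), size=2)
results = fetch_phase(shards, top)
```

Only ids and scores travel in the query phase; document bodies are read once, for the hits that made the global list.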

7. How do you optimize Linux settings when deploying Elasticsearch?

Interviewer: I want to know your ES cluster operations capability.

Answer:

(1) Disable swap;

(2) Set the heap to min(node memory / 2, 32 GB);

(3) Raise the maximum number of file handles;

(4) Adjust thread pool and queue sizes according to business needs;

(5) For disk storage, use RAID 10 to improve single-node performance and protect against single-node storage failures.
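
The heap rule in item (2) is simple arithmetic. The sketch below turns it into the two JVM flags; the function name and the `cap_gb` parameter are assumptions for illustration. Many deployments set the cap slightly below 32 GB so the JVM can keep using compressed object pointers.

```python
def jvm_heap_flags(node_memory_gb, cap_gb=32):
    """Heap = min(node memory / 2, cap), with -Xms and -Xmx set to the same value."""
    heap_gb = int(min(node_memory_gb / 2, cap_gb))
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]
```

For a 16 GB node this gives an 8 GB heap and leaves the other half of memory to the operating system's file cache, which Lucene relies on.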

8. What is the internal structure of Lucene?

Interviewer: I want to know the breadth and depth of your knowledge.

Answer:

Lucene has two processes, indexing and searching, and three key points: index creation, the index itself, and search. You can expand on this outline.

9. How does Elasticsearch implement master election?

(1) Master election is handled by the ZenDiscovery module, which mainly consists of Ping (the RPC nodes use to discover each other) and Unicast (the unicast module holds a host list that controls which nodes need to be pinged).

(2) All master-eligible nodes (node.master: true) are sorted by nodeId in lexicographic order. In each election, every node sorts the nodes it knows about and picks the first one (position 0), provisionally treating it as the master.

(3) If a node receives enough votes (n/2 + 1) and has also voted for itself, it becomes the master. Otherwise the election is repeated until these conditions are met.

(4) Supplement: the master node is responsible for cluster, node, and index management, not document-level management; data nodes can have HTTP turned off.

10. In an Elasticsearch cluster of, say, 20 nodes, 10 nodes elect one master and the other 10 elect a different one. What do you do?

(1) When the cluster has at least 3 master-eligible nodes, split brain can be prevented by setting the minimum number of votes (discovery.zen.minimum_master_nodes) to more than half of all master-eligible nodes;

(2) When there are only two master-eligible nodes, change the configuration so that only one of them is master-eligible and use the other as a data node, which avoids split brain.
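
The "more than half" rule is easy to check with numbers. This is a minimal sketch with hypothetical helper names; it only models the quorum arithmetic, not the discovery protocol.

```python
def minimum_master_nodes(master_eligible_count):
    """Quorum for discovery.zen.minimum_master_nodes: more than half."""
    return master_eligible_count // 2 + 1

def can_elect(partition_size, master_eligible_count):
    """A network partition may elect a master only if it holds a quorum."""
    return partition_size >= minimum_master_nodes(master_eligible_count)
```

With 20 master-eligible nodes the quorum is 11, so in a 10/10 split neither half can elect a master, and two masters can never exist at the same time.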

11. How do clients select specific nodes to execute requests when connecting to the cluster?

TransportClient uses the transport module to connect remotely to an Elasticsearch cluster. It does not join the cluster; it simply takes one or more initial transport addresses and communicates with them in round-robin fashion.
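
Round-robin selection over a fixed address list looks like this. It is a minimal sketch of the idea only, not the TransportClient code: the class name and the addresses are hypothetical.

```python
from itertools import cycle

class RoundRobinNodes:
    """Hands out the configured transport addresses in turn."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._cycle = cycle(list(addresses))

    def next_node(self):
        return next(self._cycle)

nodes = RoundRobinNodes(["10.0.0.1:9300", "10.0.0.2:9300", "10.0.0.3:9300"])
picked = [nodes.next_node() for _ in range(4)]
```

After the last address the selection wraps around to the first, so requests are spread evenly over the listed nodes.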

12. Describe in detail how Elasticsearch indexes a document

By default, the coordinating node uses the document id in the calculation (a custom routing value is also supported) to pick the shard the request is routed to.

    shard = hash(document_id) % (num_of_primary_shards)

(1) When the node holding the shard receives the request from the coordinating node, it writes the document to the memory buffer, which is then written to the filesystem cache periodically (every second by default). This move from memory buffer to filesystem cache is called refresh;

(2) In some situations the data in the memory buffer and the filesystem cache can be lost. ES guarantees reliability through the translog: when a request is received it is also written to the translog, and the translog is only cleared once the data in the filesystem cache has been written to disk. That process is called flush;

(3) During a flush, the in-memory buffer is cleared and its content is written to a new segment; the segments are fsynced, a new commit point is created, and the content is flushed to disk; the old translog is deleted and a new one is started.

Flush is triggered on a timer (every 30 minutes by default) or when the translog grows too large (512 MB by default).
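
The relationship between the memory buffer, refresh, flush, and the translog can be modelled with a toy class. This is a minimal sketch, not Elasticsearch's real classes: `ShardSketch` and its attributes are hypothetical, and timers and size thresholds are left out so that refresh and flush are called by hand.

```python
class ShardSketch:
    """Toy model of one shard's write path."""

    def __init__(self):
        self.memory_buffer = []    # indexed but not yet searchable
        self.cached_segments = []  # searchable, in the filesystem cache only
        self.disk_segments = []    # fsynced, survives a crash
        self.translog = []         # operations since the last flush

    def index(self, doc):
        self.memory_buffer.append(doc)
        self.translog.append(doc)  # logged alongside the buffer for durability

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.memory_buffer:
            self.cached_segments.append(tuple(self.memory_buffer))
            self.memory_buffer = []

    def flush(self):
        """Persist segments to disk and start a fresh translog."""
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def searchable_docs(self):
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment]

shard = ShardSketch()
shard.index({"id": 1})
visible_before_refresh = shard.searchable_docs()
shard.refresh()
visible_after_refresh = shard.searchable_docs()
translog_before_flush = len(shard.translog)
shard.flush()
```

A document is invisible to search until the next refresh, and the translog still holds it until a flush has put its segment safely on disk.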

Addendum: about Lucene segments:

(1) A Lucene index is made up of multiple segments, and each segment is itself a fully functional inverted index.

(2) Segments are immutable, which lets Lucene add new documents to the index incrementally without rebuilding it from scratch.

(3) For each search request, all segments in the index are searched, and each segment consumes CPU clock cycles, file handles, and memory. This means that the higher the number of segments, the lower the search performance.

(4) To solve this problem, Elasticsearch merges segments into a larger segment, commits the new merged segments to disk, and removes those old segments.
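
Merging can be illustrated with segments represented as small inverted indexes. This is a minimal sketch: `merge_segments` and the sample data are made up for illustration. It also drops documents marked as deleted, which is how Lucene reclaims their space during a merge.

```python
def merge_segments(segments, deleted_ids=frozenset()):
    """Merge several immutable segments (term -> sorted doc ids) into one new
    segment, leaving out documents that were marked as deleted."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    result = {}
    for term, doc_ids in merged.items():
        live = sorted(doc_ids - set(deleted_ids))
        if live:
            result[term] = live
    return result

seg1 = {"lucene": [1, 2], "index": [2]}
seg2 = {"lucene": [3], "merge": [3, 4]}
merged = merge_segments([seg1, seg2], deleted_ids={2})
```

The input segments are never modified; the merge writes a new segment, after which the old ones can be removed, and a search then has one segment to visit instead of two.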

