Practical guide | Elasticsearch interview questions from BAT and other top-tier companies

Lists of Elasticsearch interview questions from BAT-level companies circulate online, but most give only the questions and some of the answers, not all. I took some time to work through them.

Since these are interview questions, everyone will answer from their own business scenario, and there is no single standard answer. Corrections and criticism in the comments are welcome.

1. How many shards do you have in your ES cluster?

Interviewer: I want to know the scenarios and scale of ES the candidate has worked with, and whether they have done large-scale index design, planning, and tuning. Answer: answer truthfully, based on your own practice. For example: the ES cluster has 13 nodes; indexes are split by channel into 20+ indexes and also by date, so 20+ new indexes are added every day; each index has 10 shards; 100 million+ documents are added per day; and the daily index of each channel is kept within 150 GB.

Tuning techniques at the index level alone include:

1.1. Optimization in the design stage

1) Based on how the business data grows, create indexes from date-based templates and roll them over with the Rollover API;
2) Use aliases for index management;
3) Run force_merge on indexes in the early morning (off-peak) every day to release space;
4) Separate hot and cold data: store hot data on SSD to improve retrieval efficiency, and periodically shrink cold data to reduce storage;
5) Use Curator for index lifecycle management;
6) Configure analyzers only for the fields that actually need full-text analysis;
7) At the mapping stage, consider the attributes of each field carefully: whether it needs to be searchable, whether it needs to be stored, and so on.
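Items 1) and 2) can be sketched as plain request descriptions. This is a minimal sketch, not a definitive setup: it only builds the requests as Python dictionaries and sends nothing. The alias name `blog_write`, the prefix `blog_index`, and the thresholds are assumptions for illustration; the `_rollover` endpoint, its `conditions` body, and the `is_write_index` alias flag are part of the Elasticsearch REST API.

```python
def initial_index_request(prefix: str, alias: str) -> dict:
    """Describe the request that creates the first index behind a write alias.

    The "-000001" suffix lets the Rollover API derive the next index name.
    """
    return {
        "method": "PUT",
        "path": f"/{prefix}-000001",
        "body": {"aliases": {alias: {"is_write_index": True}}},
    }


def rollover_request(alias: str, max_age: str = "1d", max_size: str = "150gb") -> dict:
    """Describe a Rollover API call; thresholds here are illustrative assumptions."""
    return {
        "method": "POST",
        "path": f"/{alias}/_rollover",
        "body": {"conditions": {"max_age": max_age, "max_size": max_size}},
    }


if __name__ == "__main__":
    print(initial_index_request("blog_index", "blog_write"))
    print(rollover_request("blog_write"))
```

Writers and readers only ever address the alias, so the index behind it can be rolled over without changing application code.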

1.2. Write tuning

1) Set the number of replicas to 0 before writing;
2) Before writing, set refresh_interval to -1 to disable the refresh mechanism;
3) Use bulk requests for the writes themselves;
4) Restore the number of replicas and the refresh interval after writing;
5) Use automatically generated IDs whenever possible.
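The write-tuning steps above can be sketched as follows. This is a rough illustration that only builds request bodies: the settings body is what one would send to `PUT /<index>/_settings`, and the payload is the newline-delimited JSON expected by `POST /_bulk`. The function names and the default values being restored (1 replica, `1s` refresh) are assumptions.

```python
import json


def load_settings(before_load: bool, replicas: int = 1, refresh: str = "1s") -> dict:
    """Settings body for PUT /<index>/_settings before or after a bulk load."""
    if before_load:
        # Steps 1 and 2: no replicas, refresh disabled.
        return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}
    # Step 4: restore the values used in normal operation (assumed defaults).
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh}}


def bulk_payload(index: str, docs: list) -> str:
    """Build the NDJSON body for POST /_bulk.

    No "_id" is given in the action line, so Elasticsearch generates one (step 5).
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(load_settings(before_load=True))
    print(bulk_payload("blog_index", [{"title": "a"}, {"title": "b"}]), end="")
```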

1.3. Query tuning

1) Avoid wildcard queries;
2) Avoid terms queries with very large term lists (hundreds or more);
3) Make full use of the inverted index: use the keyword type wherever possible;
4) When the data volume is large, narrow down the indexes by time before searching;
5) Set a reasonable routing mechanism.
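Item 4) amounts to resolving a time range to a list of index names before the query is sent. A minimal sketch, assuming one index per day named `<prefix>_YYYY.MM.DD` (the naming scheme and the function name are assumptions, not something the original specifies):

```python
from datetime import date, timedelta


def indexes_for_range(prefix: str, start: date, end: date) -> list:
    """Return the daily index names covering [start, end], inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [
        f"{prefix}_{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


if __name__ == "__main__":
    names = indexes_for_range("blog_index", date(2024, 1, 30), date(2024, 2, 1))
    # The joined names can be used as the index part of a search URL.
    print(",".join(names))
```

Searching three daily indexes instead of a wildcard over all of them keeps the number of shards touched per request small.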

1.4. Other tuning

Deployment tuning, business tuning, etc.

From answers along these lines, the interviewer can form a general assessment of your previous hands-on and operations experience.

2. What is the inverted index in Elasticsearch?

Interviewer: I want to know your understanding of basic concepts. Answer: A simple explanation is ok.

Traditional retrieval scans the articles one by one to find where a keyword occurs. An inverted index instead uses an analysis (word segmentation) strategy to build a mapping table from words to the articles that contain them; this dictionary plus mapping table is the inverted index. With it, finding the articles for a word becomes a dictionary lookup rather than a scan over every article (often loosely described as O(1)), which greatly improves retrieval efficiency.

The more formal explanation:

In contrast to a forward index, which records the words each article contains, an inverted index starts from a word and records which documents that word appears in. It consists of two parts: a dictionary and an inverted (postings) list.
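A toy version of the dictionary plus postings list can be built in a few lines. This is only a sketch of the idea: it splits on whitespace and lowercases, whereas a real analyzer does much more, and the function name is made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


if __name__ == "__main__":
    docs = {
        1: "Elasticsearch is built on Lucene",
        2: "Lucene uses an inverted index",
        3: "an index maps terms to documents",
    }
    index = build_inverted_index(docs)
    print(index["lucene"])  # [1, 2]
    print(index["index"])   # [2, 3]
```

The keys of the resulting dictionary play the role of the term dictionary, and each value is that term's postings list.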

The underlying implementation of the inverted index's dictionary is based on the FST (Finite State Transducer) data structure, which Lucene has used extensively since version 4. FST has two advantages:

1) Small space footprint. By sharing the prefixes and suffixes of the words in the dictionary, storage space is reduced.
2) Fast queries. Lookup time complexity is O(len(str)).
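To make the two advantages concrete, here is a sketch using a plain trie. This is a deliberate simplification and not Lucene's FST: a trie shares only prefixes, while an FST also shares suffixes and attaches outputs to terms. It still shows why shared prefixes save space and why a lookup costs one step per character. All class and method names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "is_term")

    def __init__(self):
        self.children = {}
        self.is_term = False


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, term: str) -> None:
        node = self.root
        for ch in term:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_term = True

    def contains(self, term: str) -> bool:
        # One dictionary step per character: O(len(term)).
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_term


if __name__ == "__main__":
    trie = Trie()
    for word in ("search", "searcher", "seat"):
        trie.insert(word)
    # 18 characters in total, but only 9 nodes besides the root.
    print(trie.node_count, trie.contains("searcher"), trie.contains("sea"))
```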

3. What do you do when Elasticsearch holds too much index data? How do you tune and deploy it?

Interviewer: I want to know your ability to operate and maintain large data volumes. Answer: index data should be planned early, following the principle of "design first, code later". This avoids a situation where a sudden data surge exceeds the cluster's processing capacity and affects online customer search or other business. How to tune was covered in Question 1; the details are as follows:

3.1 Dynamic index Level

Create indexes by rolling them with template + time + the Rollover API. For example, define at design time that the blog index template uses the format blog_index_<timestamp>, with data growing every day.

The benefit is that a data surge cannot inflate a single index until it approaches the document-count limit (Lucene caps a single shard at roughly 2^31 documents) or until its storage reaches the TB range or more.

Once a single index grows that large, storage and other risks follow, so plan ahead and avoid the problem early.

3.2 Storage Layer

Hot data (for example, data from the last three days or the last week) is stored separately from the rest. Cold data that no longer receives writes can periodically be force_merged and shrunk to save storage space and improve search efficiency.
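Deciding which indexes count as hot can be sketched like this. It is a rough illustration under stated assumptions: indexes are named `<prefix>_YYYY.MM.DD`, and "hot" means younger than `hot_days` days; both the naming and the helper names are hypothetical. The `_forcemerge` endpoint and its `max_num_segments` parameter are part of the Elasticsearch REST API.

```python
from datetime import date, datetime


def split_hot_cold(index_names: list, today: date, hot_days: int = 7) -> tuple:
    """Split daily indexes into (hot, cold) by the date in their name."""
    hot, cold = [], []
    for name in index_names:
        day = datetime.strptime(name.rsplit("_", 1)[1], "%Y.%m.%d").date()
        if (today - day).days < hot_days:
            hot.append(name)
        else:
            cold.append(name)
    return hot, cold


def forcemerge_path(index_name: str) -> str:
    """Path for a POST request that merges a cold index down to one segment."""
    return f"/{index_name}/_forcemerge?max_num_segments=1"


if __name__ == "__main__":
    names = ["blog_index_2024.03.01", "blog_index_2024.03.09", "blog_index_2024.03.10"]
    hot, cold = split_hot_cold(names, today=date(2024, 3, 10))
    print(hot)
    print([forcemerge_path(n) for n in cold])
```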

3.3 Deployment Layer

If no planning was done in advance, this is the contingency strategy. ES supports dynamic scaling, so adding machines on the fly can relieve pressure on the cluster. Note: if the master nodes and related settings were planned sensibly, new machines can be added without restarting the cluster.

4. How does Elasticsearch elect its master node?

Interviewer: I want to know whether you understand the underlying principles of ES clustering, not just the business level. Answer: Preconditions:

1) Only master-eligible nodes (node.master: true) can become the master node.
2) The purpose of the minimum number of master-eligible nodes (discovery.zen.minimum_master_nodes) is to prevent split brain.

I read various analyses online and several source-code analysis books on this and still found it hazy. In the source, the election routine returns the chosen master node on success and null otherwise. The election process is roughly as follows:

Step 1: Confirm that the number of master-eligible nodes reaches the value set in elasticsearch.yml: discovery.zen.minimum_master_nodes;
Step 2: Compare: first check master eligibility, and nodes that are master-eligible take priority. If both nodes are master-eligible, the one with the smaller ID becomes the master. Note that the ID here is a string.
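The two steps can be sketched as follows. This is a simplified model of the description above, not the actual Zen discovery code, which takes more state into account; the `Node` class and function name are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    master_eligible: bool


def elect_master(nodes: List[Node], minimum_master_nodes: int) -> Optional[Node]:
    """Return the elected master, or None if there is no quorum of candidates."""
    candidates = [n for n in nodes if n.master_eligible]
    # Step 1: enough master-eligible nodes must be visible.
    if len(candidates) < minimum_master_nodes:
        return None
    # Step 2: the smallest ID wins; IDs are compared as strings.
    return min(candidates, key=lambda n: n.node_id)


if __name__ == "__main__":
    cluster = [Node("9", True), Node("10", True), Node("1", False)]
    # As strings, "10" sorts before "9"; node "1" is not eligible.
    print(elect_master(cluster, minimum_master_nodes=2))
    print(elect_master(cluster, minimum_master_nodes=3))
```

The string comparison is why the note about the ID type matters: a numeric comparison would have picked node "9" instead.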

Off-topic: how to get the node ID.

    GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name
    ip        port heapPercent heapMax id   name
    127.0.0.1 9300          39   1.9gb Hk9w Hk9wFwU

5. Describe in detail how Elasticsearch indexes a document

Interviewer: I want to know whether you understand the underlying principles of ES, not just the business level. Answer: "indexing a document" here should be understood as writing a document into ES and building the index. Writes come in two forms, single-document and bulk; this section describes the single-document write path.

Remember this diagram from the official documentation.

Step 1: The client sends a write request to a node in the cluster. (If no routing/coordinating node is specified, the node that receives the request acts as the coordinating node.)

Step 2: Node 1 receives the request and uses the document ID to determine that the document belongs to shard 0. The request is forwarded to another node, say node 3, because the primary of shard 0 is allocated on node 3.

Step 3: Node 3 performs the write on the primary shard. If it succeeds, node 3 forwards the request in parallel to the replica shards on node 1 and node 2 and waits for the results. Once all replica shards report success, node 3 reports success to the coordinating node (node 1), and node 1 reports success to the requesting client.

If the interviewer follows up: how is the shard for the document determined in step 2? Answer: by the routing algorithm, which computes the target shard ID from the routing value (by default, the document ID).

    shard = hash(_routing) % (num_of_primary_shards)
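The formula can be tried out with a small sketch. Note the assumption: `zlib.crc32` is used here only as a stand-in deterministic hash so the example runs with the standard library; Elasticsearch uses its own hash function, so the shard numbers produced below will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Compute hash(routing) % num_primary_shards with a stand-in hash."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        # The same ID always lands on the same shard for a fixed shard count.
        print(doc_id, pick_shard(doc_id, 5), pick_shard(doc_id, 6))
```

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards, which is why the primary shard count is fixed when an index is created.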

6. Describe in detail how Elasticsearch performs a search

Interviewer: I want to know whether you understand the underlying principles of ES search, not just the business level. Answer: a search is split into two phases, "query then fetch". The query phase locates the matching documents without fetching their content. The steps are as follows:

1) Suppose an index has 5 primaries with 1 replica each, 10 shards in total; a single request hits one copy (primary or replica) of each shard.
2) Each shard runs the query locally and puts its results into a local sorted priority queue.
3) The results of step 2 are sent to the coordinating node, which merges them into a globally sorted list.

The fetch phase retrieves the data itself. The coordinating node fetches the required documents from the relevant shards and returns them to the client.
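The two phases can be simulated in memory. This is a toy sketch under loud assumptions: shards are plain dictionaries, the "score" is simply how often the term occurs, and all names are illustrative; it only mirrors the shape of query then fetch.

```python
import heapq


def query_phase(shard_docs: dict, term: str, size: int) -> list:
    """Run on each shard: return its local top hits as (score, doc_id), no content."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def search(shards: list, term: str, size: int) -> list:
    """Run on the coordinating node: merge local results, then fetch documents."""
    # Query phase: collect (score, doc_id, shard_no) from every shard.
    local = [
        (score, doc_id, shard_no)
        for shard_no, docs in enumerate(shards)
        for score, doc_id in query_phase(docs, term, size)
    ]
    top = heapq.nlargest(size, local)
    # Fetch phase: only now read the content of the globally best documents.
    return [(doc_id, shards[shard_no][doc_id]) for _, doc_id, shard_no in top]


if __name__ == "__main__":
    shards = [
        {"a": "es es es search", "b": "lucene"},
        {"c": "es search", "d": "es es"},
    ]
    print(search(shards, "es", size=2))
```

Each shard returns at most `size` lightweight entries, so the coordinating node never handles full documents until it knows which ones it actually needs.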

7. How do you optimize Linux settings when deploying Elasticsearch?

Interviewer: I want to know the operation and maintenance capability of ES cluster. Answer:

1) Disable swap;
2) Set the heap size to min(node memory / 2, 32 GB);
3) Raise the maximum number of file handles;
4) Adjust thread pool and queue sizes according to business needs;
5) Disk RAID mode: use RAID 10 where possible, to improve single-node performance and protect against storage failure on a single node.
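Item 2) is simple arithmetic and can be written down directly. A minimal sketch of the rule as stated above; the function names are illustrative, and many deployments stay slightly below the 32 GB ceiling in practice.

```python
def recommended_heap_gb(node_memory_gb: float) -> float:
    """Heap size per the rule above: min(node memory / 2, 32 GB)."""
    if node_memory_gb <= 0:
        raise ValueError("node_memory_gb must be positive")
    return min(node_memory_gb / 2, 32.0)


def jvm_heap_flags(node_memory_gb: float) -> list:
    """JVM flags with identical minimum and maximum heap, rounded down to whole GB."""
    heap = int(recommended_heap_gb(node_memory_gb))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for memory in (16, 64, 128):
        print(memory, recommended_heap_gb(memory), jvm_heap_flags(memory))
```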

8. What is the internal structure of Lucene?

Interviewer: I want to know the breadth and depth of your knowledge. Answer:

Lucene consists of an indexing process and a search process, covering three key points: index creation, the index itself, and search. You can expand on this outline in your answer.

Summary

When I first saw these questions they felt both familiar and unfamiliar. Answering them well in an interview takes real preparation and deep understanding. To check that my answers are reasonably accurate, I went through the source code, the official documentation, and some in-depth blog posts. There is still a long way to go in learning Elasticsearch, and the only way forward is to keep grinding at it.
