1. How many shards do you have in your ES cluster?

Interviewer: wants to know the application scenarios and scale of Elasticsearch you have worked with, and whether you have designed, planned, and tuned large indices.

Answer: answer truthfully from your own practical experience. For example: the ES cluster has 13 nodes; indices are split into 20+ indices by channel, each rolled daily by date, with 10 shards per index and 100 million+ new documents per day; each channel's index is kept under 150 GB. Index-level tuning alone includes:

1.1. Optimization in the design stage
  • Based on incremental service requirements, create indices from date-based templates and roll them with the rollover API;
  • Use aliases for index management;
  • Run force_merge during off-peak hours (e.g., early morning) to reclaim space;
  • Store hot data on SSDs to improve retrieval efficiency; periodically shrink cold data to reduce storage;
  • Use curator for index lifecycle management;
  • Configure analyzers (word segmentation) only, and appropriately, for the fields that need them;
  • In the mapping stage, plan each field's attributes fully, such as whether it needs to be indexed (retrievable) and stored.
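The design points above can be sketched as the request bodies involved. This is an illustrative Python sketch only: the index name channel_logs, the alias, and the thresholds are hypothetical examples, not values from a real deployment.

```python
# Illustrative sketch of the design-stage conventions above.  The names
# and numbers are hypothetical examples.

# An index template applying fixed settings to a date-named index series.
index_template = {
    "index_patterns": ["channel_logs-*"],  # matches channel_logs-2023.01.01, ...
    "settings": {
        "number_of_shards": 10,            # 10 shards per index, as above
        "number_of_replicas": 1,
    },
    "aliases": {"channel_logs": {}},       # manage the whole series via one alias
}

# Rollover conditions: roll to a new index when either limit is hit,
# keeping each index under the per-channel size target.
rollover_conditions = {
    "conditions": {
        "max_age": "1d",                   # daily rollover
        "max_size": "150gb",               # size ceiling per index
    }
}

def needs_rollover(age_days: float, size_gb: float) -> bool:
    """Local illustration of the rollover decision."""
    return age_days >= 1 or size_gb >= 150
```

In practice these dicts would correspond roughly to the JSON bodies of an index-template PUT and a rollover POST against the alias.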
1.2. Write tuning
  • Before writing, set the number of replicas to 0;
  • Before writing, disable refresh by setting refresh_interval to -1;
  • During the write process, use bulk writes;
  • After writing, restore the replica count and refresh interval;
  • Use automatically generated document IDs whenever possible.
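The sequence above can be expressed as a minimal sketch. It is illustrative only: apply_settings stands in for an index-settings update call and does not talk to a real cluster.

```python
# The write-tuning sequence above, as the settings payloads sent before
# and after a bulk load (illustrative; no real cluster involved).

BULK_LOAD_SETTINGS = {"number_of_replicas": 0, "refresh_interval": "-1"}
NORMAL_SETTINGS = {"number_of_replicas": 1, "refresh_interval": "1s"}

def bulk_load(docs, apply_settings):
    """Apply load-friendly settings, write in bulk, then restore."""
    apply_settings(BULK_LOAD_SETTINGS)   # replicas = 0, refresh disabled
    written = list(docs)                 # stand-in for a _bulk request
    apply_settings(NORMAL_SETTINGS)      # restore replicas and refresh
    return written

applied = []
result = bulk_load(range(3), applied.append)
```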
1.3. Query tuning
  • Disable wildcard queries;
  • Avoid terms queries with very large value lists (hundreds of terms);
  • Make full use of the inverted index: use the keyword type wherever possible;
  • When the data volume is large, narrow down the target indices by time range before searching;
  • Set up a proper routing mechanism.
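The "narrow by time before searching" point can be illustrated with a small helper that expands a date range into the daily index names to query instead of hitting a catch-all pattern. The logs-YYYY-MM-DD naming here is a hypothetical example.

```python
# Given a date range, compute which daily indices to query (sketch).

from datetime import date, timedelta

def indices_for_range(prefix: str, start: date, end: date) -> list:
    """Return the daily index names covering [start, end], inclusive."""
    days = (end - start).days
    return [f"{prefix}-{(start + timedelta(d)).isoformat()}" for d in range(days + 1)]
```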
1.4. Other tuning

Deployment tuning, business-level tuning, and so on. From the answers above, the interviewer forms a general assessment of your prior practical and operational experience.

2. What is an inverted index in Elasticsearch?

An inverted index maps each term to the list of documents that contain it, which is what makes lookup by term fast. For the term dictionary, Lucene has made extensive use of the FST (finite state transducer) data structure since version 4. FST has two advantages:

- Small space footprint. By reusing the prefixes and suffixes of words in the dictionary, it reduces storage space.
- Fast lookups: O(len(str)) query time complexity.
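To make the inverted-index idea concrete, here is a toy sketch (not how Lucene actually stores data): each term maps to the sorted IDs of the documents containing it, so lookup is by term rather than by document.

```python
# A toy inverted index: term -> sorted list of document IDs.

from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """docs: {doc_id: text}. Returns {term: sorted doc_id list}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace "analyzer"
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "quick brown fox", 2: "quick dogs", 3: "lazy dogs"}
idx = build_inverted_index(docs)
```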

3. What do you do when the data in an Elasticsearch index keeps growing? How do you tune and deploy for it?

Interviewer: I want to know the operation and maintenance ability of large data volume.

3.1 Dynamic Index Level

Create indices with templates + time-based names + rolling via the rollover API.

Design phase definition: a blog index template in the form blog_index_<timestamp>, with data growing daily. The advantage of this approach is that no single index grows very large, approaching the upper limit of roughly 2^31 documents per shard (Lucene's cap) or TB+ of storage.

Once a single index becomes too large, storage and other risks follow, so think ahead and avoid the problem early.

3.2 Storage Layer

Hot data (for example, data from the last three days or the last week) is stored separately; other (cold) data is stored elsewhere.

If no new data is written to the cold data, you can periodically run force_merge plus shrink to compress it, saving storage space and improving search efficiency.

3.3 Deployment Layer

This is a contingency strategy for when capacity was not planned in advance. Thanks to ES's dynamic scaling, adding machines on the fly can relieve cluster pressure. Note: if the master nodes and the rest of the topology are planned sensibly, machines can be added dynamically without restarting the cluster.

4. How does Elasticsearch implement master election?

You can list the nodes and their node IDs with:

  GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name

The response columns are: ip port heapPercent heapMax id name. The election process itself is described under the master-election question further below.

5. Describe in detail the process of Elasticsearch indexing documents. (See the indexing-process question further below.)

6. Describe the Elasticsearch search process in detail.

The search is decomposed into two phases: “Query then Fetch”.

The purpose of the query phase is to locate the matching documents (their IDs and scores) without fetching their contents.

The steps are as follows:

  1. Suppose an index has 5 primary shards with 1 replica each, 10 shards in total; a request will hit one copy (primary or replica) of each shard.
  2. Each shard runs the query locally and places its results in a local sorted priority queue.
  3. Each shard returns the results of step 2 to the coordinating node, which merges them into a globally sorted list.

The purpose of the FETCH phase is to fetch data.

The coordinating node fetches the full documents for the globally selected hits and returns them to the client.
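The two phases above can be simulated locally. In this sketch, the query phase gathers each shard's local top-k (score, doc_id) pairs and merges them globally; the fetch phase then retrieves full documents only for the global winners.

```python
# Query-then-fetch simulation: shards return only (score, doc_id) in the
# query phase; full documents are fetched only for the global top-k.

import heapq

def query_phase(shards, k):
    """Each shard contributes its local top-k (score, doc_id) pairs."""
    per_shard = [heapq.nlargest(k, shard) for shard in shards]
    # The coordinating node merges local results into a global top-k.
    return heapq.nlargest(k, (hit for local in per_shard for hit in local))

def fetch_phase(global_hits, store):
    """Fetch full documents only for the globally selected hits."""
    return [store[doc_id] for _, doc_id in global_hits]

shards = [[(0.9, "a"), (0.2, "b")], [(0.8, "c"), (0.7, "d")]]
store = {"a": "doc A", "b": "doc B", "c": "doc C", "d": "doc D"}
top = query_phase(shards, 2)
docs = fetch_phase(top, store)
```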

7. How do you optimize Linux settings for an Elasticsearch deployment?

Interviewer: I want to know the operation and maintenance capability of ES cluster.

Answer:

  • Disable memory swapping;
  • Set heap memory to min(node memory / 2, 32 GB);
  • Raise the maximum number of file handles;
  • Adjust thread pool and queue sizes according to service needs;
  • For disk storage, RAID 10 can improve single-node performance and guard against single-disk failures.
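The heap rule above, min(node memory / 2, 32 GB), as a one-line helper:

```python
def heap_size_gb(node_memory_gb: float) -> float:
    """Recommended ES heap: half of RAM, capped at 32 GB."""
    return min(node_memory_gb / 2, 32.0)
```

The 32 GB cap matters because beyond it the JVM can no longer use compressed object pointers, wasting much of the extra heap.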

8. What is the internal structure of Lucene?

Interviewer: I want to know the breadth and depth of your knowledge.

Answer:

Lucene implements the full indexing and search pipeline: index creation, index maintenance, and search over the index. You can expand on each part from there.

9. How does Elasticsearch implement master election?

  1. The election is handled by Elasticsearch's ZenDiscovery module, which is responsible for node discovery via ping (RPC between nodes) and unicast (the unicast module holds a host list controlling which nodes to ping).
  2. All master-eligible nodes (node.master: true) are sorted by nodeId in dictionary order; in each election round, each node votes for the first master-eligible node it knows (position 0).
  3. If a node receives at least n/2 + 1 votes and also votes for itself, it becomes master. Otherwise a new election round starts until these conditions are met.
  4. Note: the master node is responsible for cluster, node, and index management, not document-level management. Data nodes can have HTTP disabled.
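Steps 2 and 3 can be modeled as a toy election. The node names are hypothetical, and real ZenDiscovery is considerably more involved; this only illustrates "sort by nodeId, vote for the first, require n/2 + 1 votes".

```python
# Toy master election: each voter votes for the first master-eligible
# nodeId it knows (dictionary order); a winner needs n/2 + 1 votes.

def elect_master(views, total_voters):
    """views: one list per voter of the master-eligible nodeIds it knows."""
    quorum = total_voters // 2 + 1          # n/2 + 1 votes required
    votes = {}
    for view in views:
        pick = sorted(view)[0]              # sort by nodeId, vote for position 0
        votes[pick] = votes.get(pick, 0) + 1
    winner, count = max(votes.items(), key=lambda kv: kv[1])
    return winner if count >= quorum else None

# All three voters know node-a, which sorts first, so it wins with 3 >= 2 votes.
master = elect_master([["node-b", "node-a"], ["node-a", "node-c"], ["node-a", "node-b"]], 3)
```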

10. What if half of the nodes in a cluster elect one master and the other half elect another (split brain)? How do you avoid it?

  • When the cluster has at least three master-eligible nodes, split brain can be prevented by setting the minimum number of votes (discovery.zen.minimum_master_nodes) to more than half of all candidate nodes (quorum).
  • When there are only two candidates, keep just one of them master-eligible and use the other purely as a data node, which avoids the split-brain problem.

11. How do clients select specific nodes to execute requests when connecting to the cluster?

The TransportClient uses the transport module to connect to the Elasticsearch cluster remotely. It does not join the cluster; it simply holds one or more initial transport addresses and communicates with them in round-robin fashion.

12. Describe the process of indexing documents in Elasticsearch.

By default, the coordinating node uses the document ID in the calculation (custom routing is also supported) to select the appropriate shard for the document:

shard = hash(document_id) % (num_of_primary_shards)

  1. When the node holding the shard receives the request from the coordinating node, it writes the document to its memory buffer and periodically (every 1 second by default) moves it to the filesystem cache. This move from memory buffer to filesystem cache is called refresh;
  2. Of course, data in the memory buffer and the filesystem cache is not yet durable and could be lost. ES guarantees durability through the translog: each operation is also recorded in the translog, so data can be recovered once the translog is persisted to disk;
  3. During flush, the in-memory buffer is cleared, the contents are written to a new segment, the segments are fsynced to disk and a new commit point is created, the old translog is deleted, and a new translog is started;
  4. Flush is triggered on a timer (every 30 minutes by default) or when the translog becomes too large (512 MB by default).
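The routing formula above can be made concrete. Elasticsearch actually hashes the routing value with murmur3; a stdlib CRC32 stands in here so the sketch stays self-contained.

```python
# shard = hash(document_id) % num_of_primary_shards, sketched with CRC32
# (Elasticsearch itself uses murmur3 on the routing value).

import zlib

def route_shard(document_id: str, num_primary_shards: int) -> int:
    """Deterministically map a document ID to a primary shard number."""
    return zlib.crc32(document_id.encode()) % num_primary_shards

shard = route_shard("doc-42", 5)
```

This is also why the number of primary shards cannot be changed after index creation: the modulus would change and existing documents would no longer route to the shard where they were stored.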

Addendum: about Lucene segments:

  1. A Lucene index is composed of multiple segments, each of which is itself a fully functional inverted index.
  2. Segments are immutable, allowing Lucene to incrementally add new documents to the index without rebuilding the index from scratch.
  3. For each search request, all segments in the index are searched, and each segment consumes CPU clock cycles, file handles, and memory. This means that the higher the number of segments, the lower the search performance.
  4. To solve this problem, Elasticsearch merges small segments into a larger segment, commits the new merged segments to disk, and removes those old segments.

13. What is Elasticsearch?

Elasticsearch is a distributed, RESTful search and data analytics engine.

  • Queries: Elasticsearch allows you to perform and merge multiple types of searches — structured, unstructured, geographic, metric — in any way you want.
  • Analysis: It’s one thing to find the ten documents that best match your query. But what if you’re dealing with a billion lines of logs? Elasticsearch aggregation allows you to think big and explore trends and patterns in your data.
  • Speed: Elasticsearch is fast. Really, really fast.
  • Scalability: It runs on laptops. It can also run on hundreds of servers that host petabytes of data.
  • Elasticity: Elasticsearch runs in a distributed environment and was designed with this in mind from the beginning.
  • Flexibility: many use cases. Numbers, text, geo, structured, unstructured: all data types are welcome.
  • Hadoop & Spark: the Elasticsearch-Hadoop connector integrates Elasticsearch with Hadoop and Spark.

14. What are some use cases of Elasticsearch?

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large amounts of data quickly and in near real time.

Here are some use cases for Elasticsearch:

  1. You run an online store and you allow your customers to search for the products you sell. In this case, you can use Elasticsearch to store the entire product catalog and inventory and provide search and auto-complete suggestions for them.
  2. You want to collect log or transaction data, and you want to analyze and mine that data for trends, statistics, summaries, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse the data, and have Logstash feed it into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information you are interested in.
  3. You run a price alert platform that allows price-savvy customers to specify rules like: "I am interested in buying a specific electronic device, and I want to be notified if any vendor's price drops below $X within the next month." In this case, you can push the vendors' prices into Elasticsearch and use its Percolator feature (reverse search) to match price movements against customer queries, finally pushing an alert to the customer when a match is found.
  4. You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions over large amounts of data (think millions or billions of records). In this case, you can store the data in Elasticsearch and use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards visualizing the aspects of the data that matter to you. You can also use Elasticsearch's aggregation capabilities to run complex business-intelligence queries against the data.

15. Describe how Elasticsearch updates and deletes documents.

  1. Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place.
  2. Each segment on disk has a corresponding .del file. When a delete request is issued, the document is not actually removed; it is marked as deleted in the .del file. The document will still match queries, but is filtered out of the results. When segments are merged, documents marked as deleted in the .del file are not written into the new segment.
  3. When a new document is created, Elasticsearch assigns it a version number. On update, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version still matches queries but is filtered out of the results.
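A toy model of the .del mechanism above (purely illustrative): segments are append-only, a delete or update only adds a tombstone for the old version, and search filters tombstoned versions out of the results.

```python
# Immutable "segments" with tombstones standing in for the .del files.

class ToySegments:
    def __init__(self):
        self.segments = []      # list of {doc_id: (version, body)} dicts
        self.deleted = set()    # (doc_id, version) tombstones, like .del files

    def index(self, doc_id, body):
        version = 1 + sum(1 for seg in self.segments if doc_id in seg)
        if version > 1:
            # Update: mark the old version deleted, write a new segment.
            self.deleted.add((doc_id, version - 1))
        self.segments.append({doc_id: (version, body)})

    def delete(self, doc_id):
        # Delete: the data stays on disk, only a tombstone is recorded.
        for seg in self.segments:
            if doc_id in seg:
                self.deleted.add((doc_id, seg[doc_id][0]))

    def search(self):
        # Tombstoned versions match internally but are filtered from results.
        hits = {}
        for seg in self.segments:
            for doc_id, (version, body) in seg.items():
                if (doc_id, version) not in self.deleted:
                    hits[doc_id] = body
        return hits

store = ToySegments()
store.index("a", "v1")
store.index("a", "v2")   # update: v1 tombstoned, v2 in a new segment
store.index("b", "x")
store.delete("b")        # delete: tombstoned, still physically present
```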

16. Describe the Elasticsearch search process.

17. In Elasticsearch, how do you find the inverted index for a given word?

  1. Lucene's indexing process follows the basic steps of full-text indexing: it writes the inverted table to disk in Lucene's index file format.
  2. Lucene's search process reads the indexed information back in that same file format and then computes a score for each matching document.

18. What Linux settings should be optimized when deploying Elasticsearch?

  1. Machines with 64 GB of ram are ideal, but 32 GB and 16 GB machines are also common. Less than 8 GB is counterproductive.
  2. If you have to choose between faster CPUs and more cores, more cores is better. The extra concurrency provided by multiple cores far outweighs a slightly faster clock rate.
  3. If you can afford SSDs, they far outperform any rotating media and improve both query and indexing performance on the node.
  4. Avoid clustering across multiple data centers, even if they are in close proximity. Clustering across large geographical distances is definitely avoided.
  5. Make sure the JVM running your application is exactly the same as the server’s JVM. In several places in Elasticsearch, Java’s native serialization is used.
  6. You can set gateway.recover_after_nodes, gateway.expected_nodes, and gateway.recover_after_time to avoid excessive shard shuffling during cluster restarts. This can reduce data recovery from hours to seconds.
  7. Elasticsearch uses unicast discovery by default to prevent nodes from joining a cluster unintentionally; with the default configuration, only nodes running on the same machine form a cluster automatically. Prefer unicast over multicast.
  8. Do not arbitrarily change the size of the garbage collector (CMS) and individual thread pools.
  9. Give the ES heap (no more than) half of your memory, and never more than 32 GB, via the ES_HEAP_SIZE environment variable; the rest of the memory serves Lucene through the OS file cache.
  10. Swapping memory to disk is fatal to server performance. If memory gets swapped, a 100-microsecond operation can become one of 10 milliseconds; now imagine all of those delays adding up across many operations. It is not hard to see how terrible swapping is for performance.
  11. Lucene uses a large number of files. Elasticsearch also uses a lot of sockets to communicate between nodes and HTTP clients. All of this requires sufficient file descriptors. You should increase your file descriptor and set it to a large value, such as 64,000.

19. What do you need to watch for regarding GC when using Elasticsearch?

  1. The inverted (term) dictionary index needs to stay resident in memory and cannot be GC'd, so the segment memory growth trend on data nodes must be monitored.
  2. Each kind of cache (field cache, filter cache, indexing cache, bulk queue, etc.) should be set to a reasonable size. Check that the heap is still sufficient in the worst case, i.e., when all caches are full: is there heap left for other tasks? Avoid using the clear cache API to free memory.
  3. Avoid searches and aggregations that return very large result sets. For scenarios that need to pull large amounts of data, use the scan & scroll API.
  4. Cluster stats are held resident in memory and do not scale horizontally; a very large cluster can be split into several clusters joined by a tribe node.
  5. To know whether the heap is sufficient, continuously monitor the cluster's heap usage under the actual workload.
  6. Understand memory requirements from the monitoring data, and configure circuit breakers properly to minimize the risk of out-of-memory errors.

20. How does Elasticsearch implement aggregations on large datasets (tens of millions of documents)?

The first approximate aggregation Elasticsearch provides is the cardinality metric, which gives the cardinality of a field: the number of distinct (unique) values it contains. It is based on the HLL (HyperLogLog) algorithm: HLL first hashes the input values, then makes a probabilistic estimate of the cardinality from the bits of the hash results. Its features: configurable precision that controls memory usage (more precision = more memory); very accurate on small data sets; and a fixed, configurable amount of memory for deduplication. Whether there are thousands or billions of unique values, the memory used depends only on the configured precision.
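A minimal HyperLogLog sketch illustrates the idea (illustrative only: Elasticsearch uses HyperLogLog++ with murmur3, while md5 stands in here). Each value is hashed; some hash bits pick a register; each register keeps the maximum leading-zero rank seen, so memory stays fixed no matter how many values stream through.

```python
# Minimal HyperLogLog: fixed-size register array, probabilistic count.

import hashlib
import math

M = 1024                              # registers; more registers = more precision
ALPHA = 0.7213 / (1 + 1.079 / M)      # standard HLL bias-correction constant

def _hash64(value: str) -> int:
    """Deterministic 64-bit hash (md5 stands in for murmur3)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class HLL:
    def __init__(self):
        self.registers = [0] * M

    def add(self, value: str):
        h = _hash64(value)
        idx = h & (M - 1)             # low 10 bits choose a register
        rest = h >> 10                # remaining 54 bits
        rank = 55 - rest.bit_length() # position of the first set bit, 1-based
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        est = ALPHA * M * M / z
        zeros = self.registers.count(0)
        if est <= 2.5 * M and zeros:  # small-range correction
            est = M * math.log(M / zeros)
        return est

hll = HLL()
for i in range(5000):
    hll.add(f"user-{i}")              # 5000 distinct values
```

Note that re-adding values already seen cannot change any register, which is exactly why duplicates do not inflate the estimate.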

21. How does Elasticsearch ensure read and write consistency under concurrency?

  1. Optimistic concurrency control via version numbers can be used to ensure a new version is not overwritten by an old one, leaving the application layer to handle specific conflicts.
  2. For write operations, the consistency level supports quorum / one / all, defaulting to quorum: a write proceeds only when a majority of the shard's copies are available. Even so, a write to a replica can still fail (for example, due to network issues); the replica is then considered faulty and the shard is rebuilt on another node.
  3. For read operations, replication can be set to sync (the default), which makes the operation return only after both the primary and replica shards have completed. If replication is set to async, you can still read the latest version by setting the search request parameter _preference to primary, so the query goes to the primary shard.
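The quorum rule in point 2 is simple arithmetic:

```python
# consistency=quorum: int((primary + number_of_replicas) / 2) + 1
# shard copies of a document's shard group must be available.

def write_quorum(number_of_replicas: int, primaries: int = 1) -> int:
    """Copies required for a quorum write on one shard group."""
    return (primaries + number_of_replicas) // 2 + 1
```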