Elasticsearch interview questions


1. How does Elasticsearch implement master election?

1. All nodes eligible to become master are sorted by nodeId, and each node votes for the first (0th) node in its sorted list.
2. If a node receives a quorum of votes (n/2 + 1) and that node also elects itself, it becomes the master; otherwise a new election round is started.
3. To avoid the split-brain problem, the minimum number of master-eligible nodes required for an election should be set to a quorum: n/2 + 1. A toy sketch of this logic follows.
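A minimal Python sketch of the vote-counting logic described above (illustrative only, not the actual ZenDiscovery implementation; the node IDs and voting loop are simplified assumptions):

```python
# Toy illustration of quorum-based master election (not real ES code).
def elect_master(candidate_node_ids):
    """Every node votes for the first nodeId in sorted order."""
    votes = {}
    first = sorted(candidate_node_ids)[0]      # the 0th node after sorting
    for _voter in candidate_node_ids:
        votes[first] = votes.get(first, 0) + 1

    quorum = len(candidate_node_ids) // 2 + 1  # n/2 + 1
    if votes.get(first, 0) >= quorum:
        return first                           # this node becomes master
    return None                                # no quorum: hold a new election

print(elect_master(["node-3", "node-1", "node-2"]))  # -> node-1
```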

2. Describe the process of indexing a document in Elasticsearch.

1. When a shard node receives a request from the coordinating node, it writes the document to the memory buffer, which is then written to the filesystem cache periodically (default: every 1 second). This move from memory buffer to filesystem cache is called refresh.
2. Of course, data in the memory buffer and filesystem cache can still be lost, so ES guarantees durability through the translog mechanism: the operation is also recorded in the translog, and the translog is only cleared once the filesystem-cache data has been flushed to disk.
3. During a flush, the in-memory buffer is cleared, its contents are written to a new segment, the segment is fsync'd and a new commit point is written to disk, and the old translog is deleted and a new translog is started.
4. A flush is triggered on a timer (default: every 30 minutes) or when the translog becomes too large (default: 512 MB). The refresh/flush behavior can be tuned as sketched below.
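For reference, the refresh and flush behavior above can be observed and tuned through the index APIs. A sketch using Python's requests library (host and index name are placeholders):

```python
import requests

ES = "http://localhost:9200"

# Change the refresh interval (default "1s"); "-1" disables periodic refresh.
requests.put(f"{ES}/my_index/_settings",
             json={"index": {"refresh_interval": "30s"}})

# Trigger a refresh manually so freshly indexed docs become searchable.
requests.post(f"{ES}/my_index/_refresh")

# Trigger a flush: segments are committed to disk and the translog is cleared.
requests.post(f"{ES}/my_index/_flush")
```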

3. Describe how Elasticsearch updates and deletes documents.

1. Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or changed in place to reflect their new state.
2. Every segment on disk has a corresponding .del file. When a delete request is issued, the document is not actually removed; it is marked as deleted in the .del file. It will still match queries, but is filtered out of the results. When segments are merged, documents marked as deleted in the .del file are not written to the new segment.
3. When a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version still matches queries but is filtered out of the results. The version bookkeeping is sketched below.
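The version bookkeeping is visible directly in the API responses. A sketch with requests (placeholder host and index; the typeless _doc endpoint assumes ES 6.x or later):

```python
import requests

ES = "http://localhost:9200"

# Index a document: _version starts at 1.
r = requests.put(f"{ES}/blog/_doc/1", json={"title": "hello"})
print(r.json()["_version"])   # 1

# An "update" indexes a new version into a new segment and marks
# the old version as deleted in the .del file.
r = requests.put(f"{ES}/blog/_doc/1", json={"title": "hello again"})
print(r.json()["_version"])   # 2

# A delete only marks the document; physical removal happens at segment merge.
requests.delete(f"{ES}/blog/_doc/1")
```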

4. Describe the Elasticsearch search process in detail.

1. The search is executed as a two-phase process known as query then fetch.
2. In the initial query phase, the query is broadcast to one copy (primary or replica) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents of size from + size. Note: the search reads the filesystem cache, but some data may still be in the memory buffer, so search is near-real-time.
3. Each shard returns the IDs and sort values of the documents in its own priority queue to the coordinating node, which merges them into its own priority queue to produce a globally sorted result list.
4. Next comes the fetch phase: the coordinating node determines which documents actually need to be fetched and issues multi-GET requests to the relevant shards. Each shard loads and enriches the documents and, if necessary, returns them to the coordinating node. Once all documents have been retrieved, the coordinating node returns the results to the client.
5. Supplementary: with the query_then_fetch search type, document relevance scoring uses each shard's own term statistics, which may be inaccurate when the document count is small. dfs_query_then_fetch adds a pre-query step that collects global term and document frequencies; scoring is more accurate, but performance is worse. Both are sketched below.
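Both phases are driven by a single search request, with from + size determining each shard's priority-queue size. A sketch (placeholder index):

```python
import requests

ES = "http://localhost:9200"
body = {"from": 0, "size": 10,
        "query": {"match": {"title": "elasticsearch"}}}

# Default query-then-fetch: each shard builds a queue of from + size docs.
r = requests.get(f"{ES}/my_index/_search", json=body)

# dfs_query_then_fetch: pre-collects global term/document frequencies
# for more accurate scoring, at the cost of an extra round trip.
r = requests.get(f"{ES}/my_index/_search",
                 params={"search_type": "dfs_query_then_fetch"},
                 json=body)
print(r.json()["hits"]["total"])
```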

5. How does Elasticsearch implement aggregations on large data sets (tens of millions of documents)?

The approximate aggregation Elasticsearch provides for this is the cardinality metric. It returns the cardinality of a field: the number of distinct, or unique, values. It is based on the HyperLogLog (HLL) algorithm: HLL first hashes the input, then makes a probabilistic cardinality estimate from the bits of the hash results. Its characteristics: configurable precision that controls memory usage (more precision = more memory); near-perfect accuracy on small data sets; and a fixed, configurable memory footprint for deduplication. Whether there are thousands or billions of unique values, the memory used depends only on the configured precision.
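A typical cardinality aggregation, with precision_threshold controlling the accuracy/memory trade-off (index and field names are made up for illustration):

```python
import requests

ES = "http://localhost:9200"

body = {
    "size": 0,  # only the aggregation result is needed, not the hits
    "aggs": {
        "unique_users": {
            "cardinality": {
                "field": "user_id",
                # Counts are near-exact below this many unique values;
                # beyond it the HLL approximation takes over (max 40000).
                "precision_threshold": 3000,
            }
        }
    },
}
r = requests.get(f"{ES}/logs/_search", json=body)
print(r.json()["aggregations"]["unique_users"]["value"])
```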

6. How does Elasticsearch ensure read/write consistency under concurrency?

1. Optimistic concurrency control via version numbers can be used so that an old version never overwrites a newer one; resolving the actual conflict is left to the application layer (see the sketch below).
2. For write operations, the consistency level supports quorum/one/all and defaults to quorum: a write proceeds only when a majority of shard copies are available. Even when a majority is available, a replica write may still fail for network reasons; the replica is then marked as faulty and the shard is rebuilt on another node.
3. For read operations, replication can be set to sync (the default), so the operation returns only after both the primary and replica shards have completed. If replication is set to async, you can set the search request parameter _preference to primary to query the primary shard and ensure the document is the latest version.
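What version-based optimistic concurrency control looks like in practice (a sketch using external versioning; note that the consistency/replication parameters above and the primary-shard preference come from older ES releases and have been removed or replaced in newer ones):

```python
import requests

ES = "http://localhost:9200"

# Write with an external version: ES rejects the write with HTTP 409
# if the stored version is already >= the supplied one, so an old
# version can never overwrite a newer one.
r = requests.put(f"{ES}/products/_doc/1",
                 params={"version": 5, "version_type": "external"},
                 json={"price": 99})
if r.status_code == 409:
    print("version conflict - resolve it in the application layer")
```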

7. What are clusters, nodes, indices, documents, and types in Elasticsearch?

Cluster: a collection of one or more nodes (servers) that together hold the entire data set and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, "elasticsearch" by default. This name matters because a node joins a cluster by name, so it can only be part of that cluster.
Node: a single server that is part of a cluster. It stores data and participates in the cluster's indexing and search functions.
Index: like a "database" in a relational database. It has a mapping that defines multiple types. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards. E.g. MySQL => database, Elasticsearch => index.
Document: similar to a row in a relational database. The difference is that each document in an index can have a different structure (fields), but common fields should have the same data type.
Type: similar to a table in a relational database. The analogy: MySQL => databases => tables => columns/rows; Elasticsearch => indices => types => documents with fields.

8. What is the inverted index in Elasticsearch?

1. The inverted index is the heart of a search engine. A search engine's primary goal is to provide fast lookups of the documents in which the search terms occur. An inverted index is a data structure, something like a hash map, that directs the user from a word to the documents or web pages containing it; its main goal is to search quickly across millions of documents.
2. Traditional retrieval scans through each article one by one to find the positions of the keywords. An inverted index instead uses a tokenization strategy to build a mapping table between words and articles; this dictionary plus mapping table is the inverted index. With it, articles can be retrieved with O(1)-style efficiency, greatly improving retrieval speed.
3. The academic formulation: an inverted index, as opposed to a forward index listing which words an article contains, starts from a word and records which documents that word appears in. It consists of two parts: a dictionary and the inverted (postings) lists.
4. The dictionary's underlying implementation is based on the FST (Finite State Transducer) data structure, which Lucene has used extensively since version 4. FST has two advantages: 1) a small footprint, because sharing word prefixes and suffixes within the dictionary reduces storage; 2) fast queries, with O(len(str)) query time complexity. A toy inverted index follows.
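A toy inverted index in pure Python makes the dictionary + postings-list structure concrete (naive whitespace splitting stands in for a real analyzer):

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick foxes leap",
}

# dictionary: term -> postings list (the set of doc IDs containing the term)
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # crude tokenization
        inverted[term].add(doc_id)

# Retrieval is a dictionary lookup instead of scanning every document.
print(sorted(inverted["brown"]))   # [1, 2]
print(sorted(inverted["quick"]))   # [1, 3]
```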

9. What is an analyzer in Elasticsearch?

1. When data is indexed in Elasticsearch, it is internally transformed by the analyzer defined for the index. An analyzer consists of one tokenizer and zero or more token filters, and can be preceded by one or more character filters.
2. The analysis module lets you register analyzers under logical names, which can then be referenced in mapping definitions or in certain APIs. Elasticsearch ships with a number of prebuilt analyzers that are ready to use. Alternatively, you can combine the built-in character filters, tokenizers, and token filters to create custom analyzers, as sketched below.
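The _analyze API shows the char filter → tokenizer → token filter pipeline directly, and a custom analyzer is just a named combination of those parts. A sketch (placeholder host and index):

```python
import requests

ES = "http://localhost:9200"

# Inspect what the built-in standard analyzer produces.
r = requests.post(f"{ES}/_analyze", json={
    "analyzer": "standard",
    "text": "The QUICK Brown-Foxes!",
})
print([t["token"] for t in r.json()["tokens"]])
# -> ['the', 'quick', 'brown', 'foxes']

# Register a custom analyzer: char filter + tokenizer + token filters.
requests.put(f"{ES}/my_index", json={
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
})
```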

10. What are the enabled, index, and store attributes used for?

1. The enabled attribute applies to Elasticsearch-specific or generated fields, such as _index and _size; user-supplied fields do not have an enabled attribute. Store means the data is stored by Lucene and will be returned if you ask for it.
2. Stored fields are not necessarily searchable. By default fields are not stored individually, but the full source document is. Since you usually want the default (which makes sense), don't set the store attribute; it is the index attribute that makes a field searchable.
3. The index attribute is what enables searching: only indexed fields can be searched. The reason for the distinction is that indexed fields are transformed during analysis, so the original raw data cannot be recovered from the index when needed.
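How the three attributes appear in a mapping (a sketch assuming ES 7+, where mapping types were removed; older versions use slightly different syntax, and the field names are made up):

```python
import requests

requests.put("http://localhost:9200/my_index", json={
    "mappings": {
        "properties": {
            # Searchable, and also stored separately from _source.
            "title": {"type": "text", "store": True},
            # Kept in _source but not indexed: cannot be searched.
            "raw_payload": {"type": "keyword", "index": False},
            # enabled: false -> the object is kept in _source but
            # neither parsed nor indexed at all.
            "session_data": {"type": "object", "enabled": False},
        }
    }
})
```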

11. How large is your index data, how many shards does it have, and what Elasticsearch tuning methods do you use?

Example: an ES cluster architecture of 13 nodes; indexes are split by channel into 20+ indexes, rolled daily by date, with 10 shards per index and 100-million-plus documents written per day; each channel's single-day index size is kept under 150 GB.

1.1 Index design tuning
1) Create indexes from date-based templates and roll them with the rollover API as business volume grows;
2) Use aliases for index management;
3) Run force_merge on indexes in the early morning off-peak hours to free space;
4) Use a hot/cold separation mechanism: keep hot data on SSDs to improve retrieval efficiency, and periodically shrink cold data to reduce its storage footprint;
5) Use curator for index lifecycle management;
6) Configure analyzers appropriately, and only for fields that need tokenization;
7) At the mapping stage, plan each field's attributes in full: whether it needs to be searchable, whether it needs to be stored, and so on.

1.2 Write tuning (the settings involved are sketched in code below)
1) Set the number of replicas to 0 before bulk loading;
2) Disable the refresh mechanism before loading by setting refresh_interval to -1;
3) Use bulk writes during loading;
4) Restore the replica count and refresh interval after loading;
5) Use auto-generated IDs whenever possible.

1.3 Query tuning
1) Disable wildcard queries;
2) Disable batch terms queries with hundreds of terms;
3) Make full use of the inverted index: prefer keyword fields for exact matching;
4) When the data volume is large, narrow the target indexes down by time range before searching;
5) Set up a reasonable routing mechanism.

1.4 Other tuning
Deployment tuning, business-level tuning, and so on.
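The write-tuning steps 1), 2), and 4) reduce to a couple of settings calls (sketch; placeholder host and index):

```python
import requests

ES = "http://localhost:9200"
SETTINGS = f"{ES}/my_index/_settings"

# Before the bulk load: no replicas, no periodic refresh.
requests.put(SETTINGS, json={"index": {"number_of_replicas": 0,
                                       "refresh_interval": "-1"}})

# ... bulk writes via the _bulk API happen here ...

# After the load: restore replicas and the refresh interval.
requests.put(SETTINGS, json={"index": {"number_of_replicas": 1,
                                       "refresh_interval": "1s"}})

# Off-peak housekeeping: merge segments to free space.
requests.post(f"{ES}/my_index/_forcemerge",
              params={"max_num_segments": 1})
```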

12. How do you plan for and cope with surging data volumes in an Elasticsearch cluster?

1. Dynamic index level: create rolling indexes based on template + time + the rollover API (see the sketch below). Example: define at design time a blog index template of the form blog_index_<timestamp>, with data added daily. The benefit is that no single index's data volume can surge toward the per-shard upper limit of 2^31 - 1 documents, with index storage reaching TB+ or more. Once a single index grows that large, storage and other risks follow, so think ahead and avoid it early.
2. Storage layer: store hot and cold data separately. Hot data (for example the last three days' or week's data) is kept apart from the rest; if cold indexes no longer receive writes, you can periodically run force_merge plus shrink on them to save storage space and improve search efficiency.
3. Deployment layer: if nothing was planned ahead of time, this is the emergency option. Thanks to ES's dynamic scaling, adding machines can relieve cluster pressure. Note: if the master nodes and the rest of the topology were planned reasonably, machines can be added dynamically without restarting the cluster.
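A rollover call against a write alias, as in point 1 (sketch; the alias name and thresholds are illustrative, and the max_size condition requires ES 6.1+):

```python
import requests

ES = "http://localhost:9200"

# "blog_write" is an alias pointing at the current blog_index_<timestamp>.
# When any condition is met, ES creates the next index and moves the alias.
r = requests.post(f"{ES}/blog_write/_rollover", json={
    "conditions": {
        "max_age": "1d",          # roll at least daily
        "max_docs": 100_000_000,  # ...or at 100M documents
        "max_size": "150gb",      # ...or at 150 GB
    }
})
print(r.json().get("rolled_over"))
```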

13. What do you need to know when using Elasticsearch?

Since ES is written in Java, most of the attention goes to memory and GC:

1. The inverted index dictionary must stay resident in memory and cannot be GC'd, so the growth trend of segment memory on data nodes needs to be monitored.
2. Set appropriate sizes for the various caches, such as the field data cache, filter cache, indexing buffer, and bulk queue, and work out whether the heap is still sufficient in the worst case, i.e. whether heap space remains for other tasks when all of these are full. Avoid relying on clear cache to free memory.
3. Avoid searches and aggregations that return very large result sets; for scenarios that need to pull large amounts of data, use the scan & scroll API instead.
4. To know whether the heap is sufficient, you must monitor the cluster's heap usage under your actual workload (a sketch follows).
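Heap usage can be watched with the nodes stats API (sketch, placeholder host):

```python
import requests

r = requests.get("http://localhost:9200/_nodes/stats/jvm")
for node_id, node in r.json()["nodes"].items():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    print(f'{node["name"]}: heap {heap}% used')
```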

14. What types of queries does Elasticsearch support?

Two broad types: exact match and full-text search.
Exact match: term, exists, terms (term set), range, prefix, ids, wildcard, regexp, fuzzy, etc.
Full-text search: match, match_phrase, multi_match, match_phrase_prefix, query_string, etc.
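The difference in a nutshell: term looks up the literal value in the inverted index, while match analyzes the query text first. A sketch (placeholder index and fields):

```python
import requests

ES = "http://localhost:9200"

# Exact match: the value is not analyzed; best used on keyword fields.
term_q = {"query": {"term": {"status": "published"}}}

# Full text: the query string is analyzed, then matched term by term.
match_q = {"query": {"match": {"title": "quick brown fox"}}}

for body in (term_q, match_q):
    r = requests.get(f"{ES}/articles/_search", json=body)
    print(r.json()["hits"]["total"])
```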

15. Can you list the main field data types available in Elasticsearch?

1. String data types: text, which supports full-text search, and keyword, for exact matching.
2. Numeric data types: byte, short, integer, long, float, double, half_float, scaled_float.
3. date, date_nanos (nanosecond-precision dates), boolean, binary (a Base64-encoded string), etc.
4. Range types: integer_range, long_range, double_range, float_range, date_range.
5. Complex data types: object and nested.
6. Geo types for geographic data.
7. Special types such as arrays (all values in an array should have the same data type).
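Several of these types combined in one mapping (sketch, ES 7+ syntax, made-up field names):

```python
import requests

requests.put("http://localhost:9200/products", json={
    "mappings": {
        "properties": {
            "name":        {"type": "text"},      # full-text search
            "sku":         {"type": "keyword"},   # exact match
            "price":       {"type": "scaled_float", "scaling_factor": 100},
            "in_stock":    {"type": "boolean"},
            "created_at":  {"type": "date"},
            "valid_range": {"type": "date_range"},
            "location":    {"type": "geo_point"},
            "variants":    {"type": "nested"},    # array of objects
        }
    }
})
```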

16. How do I monitor the Elasticsearch cluster status?

Marvel makes it easy to monitor Elasticsearch via Kibana. You can view your cluster health and performance in real time, as well as analyze past cluster, index, and node metrics.Copy the code
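Besides Marvel/Kibana, the cluster's overall state is always available from the health API (sketch, placeholder host):

```python
import requests

h = requests.get("http://localhost:9200/_cluster/health").json()
print(h["status"], h["number_of_nodes"], h["active_shards"])
# e.g. "green 13 410"
```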

17. How can Elasticsearch be used to implement personalized search?

(1) Personalized search built on Elasticsearch and word2vec product vectors achieved click-through and conversion rates noticeably higher than the original search;
(2) the word2vec-based product vectors have a further advantage: they can also be used to recommend similar products;
(3) using word2vec for personalized search or recommendation has limitations, because it can only process time-series data such as a user's click history and cannot fully model user preferences, so there is still considerable room for improvement.

18. Do you know dictionary trees?

| Data structure | Pros and cons |
| --- | --- |
| Array/List | Lookup by binary search; hard to keep balanced |
| HashMap/TreeMap | High performance, but heavy memory consumption, almost three times the raw data |
| Skip List | Skip list; fast word lookup; implemented in Lucene, Redis, and HBase |
| Trie | Suited to English dictionaries; if the system holds many strings with almost no common prefixes, the trie consumes a great deal of memory |
| Double Array Trie | Suited to Chinese dictionaries; small memory footprint; many word-segmentation tools use this algorithm |
| Ternary Search Tree | A trie variant in which each node has three children, trading between the trie's lookup speed and a more compact layout |
| FST (Finite State Transducer) | A finite state transfer machine; Lucene 4 has an open-source implementation and uses it extensively |

The core idea of a trie is to trade space for time: it uses the common prefixes of strings to reduce query-time overhead and thereby improve lookup efficiency. A trie has three basic properties:

1. The root node contains no characters. Each node except the root node contains only one character.

2. Concatenating the characters along the path from the root to a node yields the string corresponding to that node.

3. All children of each node contain different characters.

1. As you can see, the number of possible nodes at level i of a trie is 26^i, so to save space we can use dynamic linked lists, or arrays to simulate them; the space cost is then no more than the number of words × the word length.
2. Implementation: either allocate an alphabet-sized array per node, or hang a linked list off each node and record the tree with the left-child/right-sibling representation.
3. For a Chinese dictionary tree, each node's children are stored in a hash table, so no space is wasted and lookups keep the O(1) complexity of hashing (a minimal sketch follows).
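A minimal trie with hash-table children, as in point 3 (pure-Python sketch):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode (hash table per node)
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()    # the root holds no character

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):       # O(len(word)), independent of dictionary size
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
for w in ("tea", "ten", "team"):  # the shared prefix "te" is stored once
    t.insert(w)
print(t.search("tea"), t.search("te"))  # True False
```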

19. Does ElasticSearch have a schema?

1. ElasticSearch can have a schema. A schema describes one or more fields, saying what type a document is and how its different fields should be handled. In Elasticsearch, the schema is a mapping: it describes the fields in a JSON document, their data types, and how they should be indexed in the underlying Lucene index. In Elasticsearch terminology, this schema is therefore usually called a "mapping".
2. Elasticsearch can also be schema-flexible, meaning documents can be indexed without explicitly providing a schema: if no mapping is specified, Elasticsearch by default generates one dynamically when it detects new fields in a document during indexing.

20. Why use Elasticsearch?

Because the product data in our mall will grow very large, the previous fuzzy queries, with leading wildcards, abandon the index and turn product lookups into full-table scans, which is extremely inefficient on a database with millions of rows. So we built a full-text index with ES and added the product fields we query most often, such as product name, description, price, and ID, to the index to speed up queries.