“This is the fourth day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Elasticsearch architecture principles

1. Node type of Elasticsearch

Elasticsearch has two node types: the Master node and the DataNode.

1.1 The Master Node

When an Elasticsearch node starts, it uses the Zen Discovery mechanism to find the other nodes in the cluster and establish connections, and a Master node is elected.

discovery.seed_hosts: ["192.168.21.130", "192.168.21.131", "192.168.21.132"]

A master node is elected from among the master-eligible nodes.

cluster.initial_master_nodes: ["node1", "node2","node3"] 

The Master node is responsible for managing indexes (creating and deleting them), allocating shards, maintaining metadata, and managing cluster node state. It does not handle data writes or queries, so in a production environment it can run with relatively little memory, but the machine must be stable.

1.2 The DataNode

An Elasticsearch cluster has N DataNodes. DataNodes are responsible for writing and retrieving data, so most of the load in an Elasticsearch cluster falls on them. In a production environment they should be configured with ample memory.

2. Shard and Replica Mechanisms

2.1 Sharding (Shard)

Elasticsearch is a distributed search engine: an index's data is split across different server nodes, and each piece is called a shard. Elasticsearch manages shards automatically; if the shards of an index become unevenly distributed, it migrates them so they are balanced across servers.

2.2 Replicas

Replicas exist for fault tolerance in Elasticsearch: if the node holding a shard became unavailable, the whole index would otherwise become unavailable, so each shard needs a replica. By default an index has one shard, and every shard has one Primary Shard and some number of Replica Shards. A Primary Shard and its Replica Shard are never placed on the same node.

2.3 Specifying the Number of Shards and Replicas

// Create an index with a specified number of shards and replicas
PUT /job_idx_shard_temp
{
  "mappings": {
    "properties": {
      "id":       {"type": "long",    "store": true},
      "area":     {"type": "keyword", "store": true},
      "exp":      {"type": "keyword", "store": true},
      "edu":      {"type": "keyword", "store": true},
      "salary":   {"type": "keyword", "store": true},
      "job_type": {"type": "keyword", "store": true},
      "cmp":      {"type": "keyword", "store": true},
      "pv":       {"type": "keyword", "store": true},
      "title":    {"type": "text",    "store": true},
      "jd":       {"type": "text"}
    }
  },
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

// View indexes with their primary and replica shard counts
GET /_cat/indices?v

3. Elasticsearch Read and Write Principles

3.1 Writing Principles of Elasticsearch Documents



1. The client sends the request to any DataNode, for example node2. Node2 then becomes the coordinating node.

2. The coordinating node computes the shard the document should be written to:

shard = hash(routing) % number_of_primary_shards

where routing is a variable value that defaults to the document _id.

3. The coordinating node routes the request to the DataNode that holds the corresponding Primary Shard (suppose the Primary Shard is on node1 and the Replica Shard on node2).

4. The Primary Shard on node1 processes the request, writes the data to the index, and synchronizes it to the Replica Shard.

5. Once both the Primary Shard and the Replica Shard have saved the data, the response is returned to the client.
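The routing formula in step 2 can be sketched in Python. Note that Elasticsearch actually hashes the routing value with Murmur3; MD5 is used below only so the sketch runs with the standard library, and the modulo logic is the same:

```python
import hashlib

def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """Illustrates shard = hash(routing) % number_of_primary_shards.

    Elasticsearch actually uses Murmur3; MD5 is a stand-in here so the
    sketch needs only the standard library.
    """
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# routing defaults to the document _id, so the same _id always
# lands on the same primary shard
shard = route_to_shard("doc-42", 3)
```

Because the shard count is part of the formula, changing number_of_primary_shards after indexing would reroute every document, which is why the primary shard count of an index cannot be changed.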

3.2 How Elasticsearch Queries Documents



1. The client sends a query request; the DataNode that receives it becomes the coordinating node.

2. The coordinating node broadcasts the query to every DataNode, and the shards on those nodes each process the request.

3. Each shard runs the query, puts the matching data into a priority queue, and returns the document IDs, node information, and shard information to the coordinating node.

4. The coordinating node gathers all the results and sorts them globally.

5. The coordinating node sends GET requests to the shards holding those document IDs; the shards return the document data, and the coordinating node finally returns it to the client.

4. How Elasticsearch Persists Index Data

4.1 Writing to the Filesystem Cache

When data is written to an Elasticsearch shard, it first goes into an in-memory buffer. The buffer is then refreshed into a segment in the filesystem cache, at which point the segment becomes searchable (note that it has not yet been flushed to disk).

4.2 The Translog Guarantees Fault Tolerance

While data is written to memory, it is also recorded in the translog. If a failure occurs before a refresh completes, the data is recovered from the translog. Once the segments in the filesystem cache have been flushed to disk, the translog is cleared.
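The recovery idea can be sketched as a toy write-ahead log. All names below are illustrative, not Elasticsearch internals:

```python
class TranslogSketch:
    """Toy model of the translog idea: every write is appended to a log
    before it exists only in memory, so a crash can be recovered by
    replaying the log. Illustrative only, not Elasticsearch internals."""

    def __init__(self):
        self.memory_buffer = {}   # not yet flushed to disk
        self.translog = []        # append-only operation log
        self.disk = {}            # segments flushed to disk

    def index(self, doc_id, doc):
        self.translog.append(("index", doc_id, doc))
        self.memory_buffer[doc_id] = doc

    def flush(self):
        # flush: segments go to disk, then the translog is cleared
        self.disk.update(self.memory_buffer)
        self.memory_buffer.clear()
        self.translog.clear()

    def recover(self):
        # after a crash the in-memory buffer is lost; replay the translog
        self.memory_buffer.clear()
        for op, doc_id, doc in self.translog:
            if op == "index":
                self.memory_buffer[doc_id] = doc

node = TranslogSketch()
node.index("1", {"title": "hello"})
node.memory_buffer.clear()      # simulate losing in-memory data
node.recover()                  # the translog brings it back
```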

4.3 Flushing to Disk

By default, Elasticsearch flushes filesystem-cached data to disk every 30 minutes.

4.4 Segment Merging

When there are too many segments, Elasticsearch periodically merges them into larger ones to reduce I/O overhead during queries. It is during this stage that previously deleted data is actually physically removed.

5. Manually control the accuracy of search results

5.1 In the following search, a document matches if its remark field contains the term java or the term developer:

GET /es_db/_search 
{ 
"query": { 
"match": { 
"remark": "java developer" 
} 
} 
} 

If the document's remark field must contain both java and developer, use the following syntax:

GET /es_db/_search 
{ 
"query": { 
"match": { 
"remark": { 
"query": "java developer", 
"operator": "and" 
} 
} 
} 
} 

In the syntax above, changing operator to or gives the same results as the first search; or is the default operator for a match search. If the result documents only need to contain a certain proportion of the search terms, use minimum_should_match. It accepts either a percentage or a fixed number. A percentage means that proportion of the analyzed terms must match, rounded down when it does not divide evenly: with three search terms, requiring at least two matches is expressed as 67%, while 66% lets Elasticsearch accept a single matching term. A fixed number is the minimum count of terms that must match.

GET /es_db/_search 
{ 
 "query": { 
 "match": { 
 "remark": { 
 "query": "java architect assistant", 
 "minimum_should_match": "68%" 
 } 
} 
} 
}
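The rounding behavior described above can be sketched as a small helper. This is a simplified model of minimum_should_match (the real option also supports negative values and combinations, which are ignored here):

```python
def required_matches(term_count: int, minimum_should_match: str) -> int:
    """How many of the analyzed terms must match.

    Mirrors the behavior described above: a percentage is applied to the
    term count and rounded DOWN; a plain integer is taken as-is. This is
    a simplified sketch of Elasticsearch's minimum_should_match handling.
    """
    if minimum_should_match.endswith("%"):
        pct = int(minimum_should_match[:-1])
        return term_count * pct // 100
    return int(minimum_should_match)

# three terms: "java architect assistant"
two_required = required_matches(3, "67%")  # at least two terms must match
one_enough = required_matches(3, "66%")    # one matching term is enough
```

With three terms, 67% yields 2 required matches while 66% yields only 1, which is exactly the rounding-down effect described above.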

The same kind of control can be achieved with bool + should. The following case requires the remark field to match at least two of the terms java, developer, and assistant:

GET /es_db/_search 
{ "query": { 
 "bool": { 
 "should": [ 
   { 
   "match": { 
   "remark": "java" 
   } 
   }, 
   { 
   "match": { 
   "remark": "developer" 
   } 
   }, 
   { 
   "match": { 
   "remark": "assistant" 
   } 
   } 
 ], 
 "minimum_should_match": 2
 }
 }
}

5.2. The low-level rewrite of match

In fact, when a match search executes, Elasticsearch usually rewrites the search criteria into lower-level queries to produce the final result. For example:

GET /es_db/_search 
{ 
"query": { 
"match": { 
"remark": "java developer" 
} 
} 
} 

#After conversion: 
GET /es_db/_search 
{ 
"query": { "bool": { 
"should": [ 
{ 
"term": { 
"remark": "java" 
} 
}, 
{ 
"term": { 
"remark": { 
"value": "developer" 
} 
} 
} 
] 
} 
} 
} 


#An exact match
GET /es_db/_search 
{ 
"query": { 
"match": { 
"remark": { 
"query": "java developer", 
"operator": "and" 
} 
} 
} 
} 

#After conversion: 
GET /es_db/_search 
{ 
"query": { 
"bool": { 
"must": [ 
{ 
"term": { 
"remark": "java" 
} }, 
{ 
"term": { 
"remark": { 
"value": "developer" 
} 
} 
} 
] 
} 
} 
} 


#Percentage matching
GET /es_db/_search
{
  "query": {
    "match": {
      "remark": {
        "query": "java architect assistant",
        "minimum_should_match": "68%"
      }
    }
  }
}
#After conversion: 
GET /es_db/_search 
{ 
"query": { 
"bool": { 
"should": [ 
{ 
"term": { 
"remark": "java" 
} 
}, 
{ 
"term": { 
"remark": "architect" 
} 
}, { 
"term": { 
"remark": "assistant" 
} 
} 
], 
"minimum_should_match": 2 
} 
} 
} 

Tip: if time permits, use the rewritten (bool + term) syntax, which executes more efficiently. If the development cycle is short and the workload heavy, use the simplified match notation.

5.3 boost weight control

Search for documents whose remark field contains java; if remark also contains developer or architect, documents containing architect should rank first (i.e., matching architect raises the relevance score). Boost is generally used for relevance ranking, for example the composite ranking in e-commerce: sales volume, advertising spend, review score, inventory, and unit price are combined, with advertising spend weighted highest and inventory lowest.


 GET /es_db/_search 
 {
 "query":{
 "bool":{
 "must":[
 {
 "match":{
 "remark":"java"
 }
 }
 ],
 "should":[
 {
 "match":{
 "remark":{
 "query":"developer",
 "boost":1
 }
 }
 },
 {
 "match":{
 "remark":{
 "query":"architect",
 "boost":3
 }
 }
 }
 ]
 }
 }
 }
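The ordering effect of boost can be sketched with a deliberately simplified scoring model. Real Elasticsearch scoring (BM25, normalization) is far more involved, and all numbers here are illustrative; only the ranking effect of boost is the point:

```python
def should_score(base_scores: dict, boosts: dict, matched: set) -> float:
    """Sum each matched clause's base relevance score times its boost.

    A deliberate simplification of Elasticsearch scoring: the real engine
    uses BM25 and query normalization, but the ordering effect of boost is
    the same -- a higher boost lifts documents matching that clause.
    """
    return sum(base_scores[clause] * boosts.get(clause, 1.0)
               for clause in matched)

base = {"developer": 1.0, "architect": 1.0}   # equal base relevance
boosts = {"developer": 1.0, "architect": 3.0}

doc_with_architect = should_score(base, boosts, {"architect"})
doc_with_developer = should_score(base, boosts, {"developer"})
# boost 3 pushes architect matches above developer matches
```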

5.4 Implement the Best Fields strategy for multi-field search based on dis_max

Best fields strategy: search within a single field and match as many of the search terms as possible in that field. The opposite, matching the search terms across as many fields as possible, is the most fields strategy. Search engines such as Baidu use the best fields approach.

Advantages: precisely matched documents rank as high as possible, and minimum_should_match can discard documents that match only a long tail of the terms, so they do not distort the ranking. For example, if we search four keywords and a document matches only one, that document is probably not what we want. Disadvantages: the resulting scores can be relatively uneven. dis_max syntax: take only the highest relevance score among the sub-queries and sort by it.

In the case below, the relevance score of matching rod in the name field is compared with that of matching java developer in the remark field; whichever is higher is used to sort the results.

GET /es_db/_search 
{ 
"query": { 
"dis_max": { 
"queries": [ 
{ 
"match": { "name": "rod" 
} 
}, 
{ 
"match": { 
"remark": "java developer" 
} 
} 
] 
} 
} 
} 

#Returns the result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4, "relation" : "eq" },
    "max_score" : 1.6375021,
    "hits" : [
      { "_index" : "es_db", "_type" : "_doc", "_id" : "3", "_score" : 1.6375021,
        "_source" : { "name" : "rod", "sex" : 0, "age" : 26, "address" : "Guangzhou baiyun mountain park", "remark" : "PHP developer" } },
      { "_index" : "es_db", "_type" : "_doc", "_id" : "1", "_score" : 1.4691012,
        "_source" : { "name" : "zhang", "sex" : 1, "age" : 25, "address" : "guangzhou tianhe park", "remark" : "java developer" } },
      { "_index" : "es_db", "_type" : "_doc", "_id" : "2", "_score" : 0.5598161,
        "_source" : { "name" : "bill", "sex" : 1, "age" : 28, "address" : "guangzhou li wan building", "remark" : "java assistant" } },
      { "_index" : "es_db", "_type" : "_doc", "_id" : "5", "_score" : 0.46919835,
        "_source" : { "name" : "xiao ming", "sex" : 0, "age" : 19, "address" : "", "remark" : "java architect assistant" } }
    ]
  }
}

5.5. Optimize dis_max search based on tie_breaker parameter

The tie_breaker parameter optimizes the dis_max search. By default dis_max sorts by the single highest clause score and ignores the others, but sometimes the other clauses' scores should influence the final ranking. tie_breaker multiplies each of the other clauses' relevance scores by its value and adds them into the result. If not defined it is 0, which is why the other clauses' scores are normally ignored.

GET /es_db/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "rod" } },
        { "match": { "remark": "java developer" } }
      ],
      "tie_breaker": 0.5
    }
  }
}
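The combination rule described above can be sketched as a one-line scoring function. This models only the score combination, not how each clause's score is produced:

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max: take the best clause score, plus tie_breaker times each of
    the other clause scores. tie_breaker defaults to 0, which ignores the
    other clauses entirely. A sketch of the combination rule, not of how
    Elasticsearch computes each clause's score."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    rest = sum(clause_scores) - best
    return best + tie_breaker * rest

# two clauses scored 2.0 and 1.0
pure = dis_max_score([2.0, 1.0])           # pure dis_max: only the best counts
with_tb = dis_max_score([2.0, 1.0], 0.5)   # the other clause contributes half
```

With tie_breaker 0.5, a document matching both clauses scores 2.0 + 0.5 * 1.0 = 2.5, ranking above a document that matched only the best clause.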

5.6. Simplify dis_max + tie_breaker with multi_match

The same search result can be achieved with different syntax in Elasticsearch; any form that produces the desired result does the job. For example:

GET /es_db/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "rod" } },
        {
          "match": {
            "remark": {
              "query": "java developer",
              "boost": 2,
              "minimum_should_match": 2
            }
          }
        }
      ],
      "tie_breaker": 0.5
    }
  }
}
#Returns the result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 1.6375021,
    "hits" : [
      { "_index" : "es_db", "_type" : "_doc", "_id" : "3", "_score" : 1.6375021,
        "_source" : { "name" : "rod", "sex" : 0, "age" : 26, "address" : "Guangzhou baiyun mountain park", "remark" : "PHP developer" } }
    ]
  }
}


#The multi_match syntax: type is commonly best_fields or most_fields; "^n" after a field name sets its boost to n.
GET /es_db/_search
{
  "query": {
    "multi_match": {
      "query": "rod java developer",
      "fields": ["name", "remark^2"],
      "type": "best_fields",
      "tie_breaker": 0.5,
      "minimum_should_match": "50%"
    }
  }
}

5.7. Cross Fields Search

Cross fields: searching for a single identifier whose data is spread across multiple fields is called a cross fields search. For example, a person's name may be split into family name and given name, and an address into province, city, district, county, and street. Searching documents by a person's full name or full address is a cross fields search. Such searches are usually implemented with the most fields strategy, because the data is not in any one field. By default the search logic behaves like most fields, and the relevance score is calculated the same way as in the best fields strategy. When the cross fields strategy is used, it takes an additional operator parameter that controls how the search terms must match across the multiple fields. Elasticsearch also provides a dedicated cross_fields type, with the following syntax:

GET /es_db/_search 
{ 
"query": { 
"multi_match": { 
"query": "java developer", 
"fields": ["name", "remark"], 
"type": "cross_fields", 
"operator" : "and" 
} 
} 
} 
Copy the code

The syntax above means: java must match in either the name or remark field, and developer must also match in either the name or remark field. The problem with the most fields strategy: because it tries to match as many fields as possible, precise ordering of results suffers, and in a cross fields search minimum_should_match cannot be used to drop long-tail matches. So the most fields and cross fields strategies each leave gaps, and for commercial projects the best fields strategy is generally recommended.

5.8. copy_to combined fields

On JD.com, if you type "mobile phone" into the search box and click search, which field is matched: the product's category name, product name, selling point, or description? Searching a single field is not appropriate, but neither is searching _all, because _all may include fields such as images and price. Suppose instead there were one field containing (but not limited to) the category name, product name, and selling point. Could the search be matched against that one field?

{
  "category_name": "phone",
  "product_name": "OnePlus 6T phone",
  "price": 568800,
  "sell_point": "The best Android phone in China",
  "tags": ["8G+128G", "256G expandable"]
}

copy_to copies multiple fields into one field, forming a combined field. It solves the cross fields search problem and, in commercial projects, the problem of choosing a default search field. To use copy_to, the mapping must be defined manually when the index is created. The syntax is:

PUT /es_db/_mapping
{
"properties": {
"provice" : {
"type": "text",
"analyzer": "standard",
"copy_to": "address"
},
"city" : {
"type": "text",
"analyzer": "standard",
"copy_to": "address"
},
"street" : {
"type": "text",
"analyzer": "standard",
"copy_to": "address"
},
"address" : {
"type": "text",
"analyzer": "standard"
}
}
}
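The effect of the mapping above can be sketched by emulating copy_to at ingestion time. Note this is only a behavioral model: real copy_to copies values into the indexed terms of the target field, not into the stored _source document.

```python
def apply_copy_to(doc: dict, copy_rules: dict) -> dict:
    """Emulates the copy_to mapping above: the values of provice/city/street
    are also appended to the combined address field at index time.
    A sketch of the observable behavior, not how Elasticsearch stores it
    (real copy_to affects the indexed terms, not _source)."""
    indexed = dict(doc)
    for source_field, target_field in copy_rules.items():
        if source_field in doc:
            existing = indexed.get(target_field, "")
            indexed[target_field] = (existing + " " + doc[source_field]).strip()
    return indexed

rules = {"provice": "address", "city": "address", "street": "address"}
doc = {"provice": "Guangdong", "city": "Guangzhou", "street": "Tianhe Road"}
indexed = apply_copy_to(doc, rules)
# indexed["address"] now combines all three parts, so one field can be searched
```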

The mapping above defines four new fields: provice, city, street, and address. The values of provice, city, and street are automatically copied into address, forming a combined field. An address search can then match against the address field alone, avoiding the problems caused by the most fields strategy. When maintaining data, the address field needs no special handling: as a combined field it is maintained automatically by Elasticsearch, much like a derived property in Java code. It may not exist physically when stored, but it exists logically, since address is composed of the three physical attributes province, city, and street.

5.9. Approximate Match

All of the searches above are exact matches: if a document contains "java assistant" and you search for jave, nothing is found, because the term jave does not exist in the document. For example, this search returns no hits:

GET _search 
{ 
"query" : { 
"match" : { 
"name" : "jave" 
}
} 
} 

Sometimes the requirement is special, for example: hello world must appear as a complete, indivisible phrase; or a document should match if it contains both hello and world, scoring higher the closer the two words are to each other. Such special searches are approximate searches. Matching hello world data with the condition hell, or offering search suggestions for the partial input h, are also approximate searches. The plain match syntax cannot express these requirements.

5.10. match_phrase

Phrase search: the search condition is not split into separate terms, i.e., the search criteria are indivisible. If hello world must match as an indivisible phrase, we can use the phrase search match_phrase. The syntax is as follows:

GET _search 
{ 
"query": { 
"match_phrase": { 
"remark": "java assistant" 
} 
} 
} 

1) The match_phrase principle: term positions. How does Elasticsearch implement a match phrase search? match_phrase is similar to match in that the search condition is analyzed first, splitting it into hello and world. Since the condition is analyzed into terms, how is the phrase constraint enforced? The answer lies in how the inverted index is built: when indexing, Elasticsearch first analyzes the document data, for example:

GET _analyze 
{ 
"text": "hello world, java spark", 
"analyzer": "standard" 
}

The analysis result is:

{ 
"tokens": [ 
{ 
"token": "hello", 
"start_offset": 0, 
"end_offset": 5, 
"type": "<ALPHANUM>", 
"position": 0 
}, 
{ 
"token": "world", 
"start_offset": 6, 
"end_offset": 11, 
"type": "<ALPHANUM>", 
"position": 1 
}, 
{ 
"token": "java", 
"start_offset": 13, 
"end_offset": 17, 
"type": "<ALPHANUM>", 
"position": 2 
}, 
{ 
"token": "spark", 
"start_offset": 18, 
"end_offset": 23, 
"type": "<ALPHANUM>", 
"position": 3 
} 
] 
} 

As the result above shows, besides splitting the text into terms, Elasticsearch also records a position for each term: its index within the analyzed text. When a match_phrase search for "hello world" runs, the condition is split into hello and world and both are looked up in the inverted index. If both terms appear in the same field of a document, Elasticsearch then checks whether their positions are consecutive: if they are, the phrase matches; if not, it fails.

2) The match_phrase slop parameter. Suppose the search condition is hello spark but the indexed data is "hello world, java spark". A plain match_phrase finds nothing. A plain match would find it, but cannot express the requirement that documents rank higher the closer hello and spark are to each other. For this, match_phrase provides slop: the maximum number of single-position moves a term may make for the phrase to match. Among all matches, the closer the terms are, the higher the relevance score and the ranking. A match_phrase search that uses slop is called a proximity search.

Example: the data is "hello world, java spark" and the search is match_phrase: "hello spark" with slop: 3 (the term may move at most three times). The condition is split into hello and spark, expected to be adjacent. spark can then be moved according to slop:

position: 0     1     2    3
doc:      hello world java spark
search:   hello spark
move 1:   hello       spark
move 2:   hello            spark

The match succeeds after two moves; a third is not needed. Now suppose the search is match_phrase: "spark hello" with slop: 5 (at most five moves). The condition is split into spark and hello, expected adjacent in that order:

position: 0     1     2    3
doc:      hello world java spark
search:   spark hello
move 1:   spark/hello
move 2:   hello spark
move 3:   hello       spark
move 4:   hello            spark

The match succeeds after four moves; a fifth is not needed. If no match is found within the allowed slop moves, no result is returned. With Chinese analysis the move count is harder to reason about, because analyzed Chinese terms can overlap; it usually takes several attempts. Test cases:

GET _analyze 
{ 
"text": "hello world, java spark", 
"analyzer": "standard" 
} 

POST /test_a/_doc/3 
{ 
"f" : "hello world, java spark" 
} 

GET /test_a/_search 
{ 
"query": { 
"match_phrase": { 
"f" : { 
"query": "hello spark", 
"slop" : 2 
} 
} 
} 
}

GET /test_a/_search 
{ 
"query": { 
"match_phrase": { 
"f" : { 
"query": "spark hello", 
"slop" : 4 
} 
} 
} 
} 
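For the two-term case, the minimum slop works out to the gap between the actual offset of the second term and its expected offset. A simplified sketch (real match_phrase handles any number of terms and repeated tokens):

```python
def slop_needed(doc_positions, query_terms):
    """Minimum slop needed for a two-term phrase to match, given the token
    positions recorded in the inverted index. For two terms, the number of
    single-position moves is the difference between the actual offset and
    the expected offset (second term directly after the first). Simplified:
    real match_phrase handles any number of terms and repeated tokens."""
    t1, t2 = query_terms
    actual_gap = doc_positions[t2] - doc_positions[t1]
    expected_gap = 1  # in the query, the second term directly follows the first
    return abs(actual_gap - expected_gap)

# "hello world, java spark" -> positions from the _analyze output above
positions = {"hello": 0, "world": 1, "java": 2, "spark": 3}

hello_spark_moves = slop_needed(positions, ("hello", "spark"))  # 2 moves
spark_hello_moves = slop_needed(positions, ("spark", "hello"))  # 4 moves
```

This reproduces the walkthrough above: "hello spark" matches with slop 2, while the reversed "spark hello" needs slop 4, which is why the two test queries use those values.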

Chinese analyzer example (original Chinese text translated):

GET _analyze
{
  "text": "China, one of the most powerful countries in the world",
  "analyzer": "ik_max_word"
}

POST /test_a/_doc/1
{
  "f" : "China, one of the most powerful countries in the world"
}

GET /test_a/_search
{
  "query": {
    "match_phrase": {
      "f": {
        "query": "China's best",
        "slop": 5
      }
    }
  }
}

GET /test_a/_search
{
  "query": {
    "match_phrase": {
      "f": {
        "query": "China's best",
        "slop": 9
      }
    }
  }
}

6. Experience sharing

Use match and proximity search together to balance recall and precision. Recall is the proportion of relevant documents that a search returns: if index A has 100 relevant documents, the fraction of them that come back is the recall. Precision is how accurate the returned results are: for the condition hello java, ranking documents that closely match the phrase hello java at the top reflects precision.

If a search uses only match_phrase syntax (including proximity search), recall suffers, because every result must contain the phrase. If it uses only match, precision suffers, because results are ordered purely by the relevance-score algorithm. When both recall and precision matter, combine match with a proximity search. Test cases:

POST /test_a/_doc/3
{
"f" : "hello, java is very good, spark is also very good"
}

POST /test_a/_doc/4
{
"f" : "java and spark, development language "
}

POST /test_a/_doc/5
{ "f" : "Java Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs."
}

POST /test_a/_doc/6
{
"f" : "java spark and, development language "
}

GET /test_a/_search
{
"query": {
"match": {
"f": "java spark"
}
}
}

GET /test_a/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"f": "java spark"
}
}
],
"should": [
{
"match_phrase": {
"f": {
"query": "java spark",
"slop" : 50
}
}
}
]
}
}
}
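Recall and precision as defined above can be sketched with set arithmetic. The document IDs below are illustrative, not actual scores from the queries above:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant documents that the search returned."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the returned documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

relevant = {"3", "5", "6"}          # docs a user would consider relevant
match_only = {"3", "4", "5", "6"}   # match: high recall, lower precision
phrase_only = {"3"}                 # match_phrase: high precision, lower recall

match_recall = recall(match_only, relevant)       # 3 of 3 relevant found
match_precision = precision(match_only, relevant) # but 1 of 4 results is noise
phrase_precision = precision(phrase_only, relevant)
```

The combined bool query above keeps match in must (preserving recall) and adds match_phrase with a generous slop in should (boosting precision of the ranking).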

7. Prefix Search

Searches with prefix matching are typically used on keyword fields, i.e., fields that are not analyzed. Syntax:

GET /test_a/_search 
{ 
"query": { 
"prefix": { 
"f.keyword": { 
"value": "J" 
} 
} 
} 
} 

Note: prefix searches target keyword fields, and keyword field data is case-sensitive. Prefix search is inefficient and does not compute relevance scores. The shorter the prefix, the less efficient the search, so if you must use prefix search, use as long a prefix as possible: a longer prefix narrows the work, while a short one forces the search to scan far more of the index.

8. Wildcard Search

Elasticsearch also supports wildcards, though they differ from those in Java or databases. Wildcards can be applied to the inverted index as well as to keyword fields. Common wildcards: ? matches exactly one character; * matches 0 to N characters.

GET /test_a/_search
{
  "query": {
    "wildcard": {
      "f.keyword": {
        "value": "?e*o*"
      }
    }
  }
}

Performance is also low and a full index scan is required. Not recommended.
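The cost is easy to see in a sketch: with a leading wildcard, every term in the dictionary must be tested one by one, since the sorted term dictionary cannot be used to narrow the range. Python's fnmatch happens to use the same ? / * semantics as the query above; the term list here is illustrative:

```python
import fnmatch

def wildcard_terms(terms, pattern):
    """Why wildcard search is slow: a pattern that starts with a wildcard
    forces a test of every term in the index (a full scan). fnmatch uses
    the same ? / * semantics as the wildcard query above."""
    return [t for t in terms if fnmatch.fnmatch(t, pattern)]

index_terms = ["hello", "demo", "java", "spark", "network"]
matches = wildcard_terms(index_terms, "?e*o*")
# "java" and "spark" fail because their second character is not "e"
```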

9. Regular Expression Search

Elasticsearch supports regular expressions, applicable to the inverted index or to keyword fields. Common symbols: [] defines a range, e.g. [0-9] matches a digit from 0 to 9; . matches any one character; + means the preceding expression may appear one or more times.

GET /test_a/_search
{
  "query": {
    "regexp": {
      "f.keyword": "[A-z].+"
    }
  }
}

Performance is also low, requiring a full index scan.

10. Search recommendations

Search as you type: for example, if some indexed data starts with hello, then typing hello should suggest the related entries (similar to Baidu's search box). Syntax:

GET /test_a/_search
{
  "query": {
    "match_phrase_prefix": {
      "f": {
        "query": "java s",
        "slop": 10,
        "max_expansions": 10
      }
    }
  }
}

max_expansions specifies the maximum number of terms the prefix may expand to; beyond that limit, no further terms are matched. The limitation of this syntax is that only the last term is treated as a prefix. Execution performance is poor, since the last term must be matched by scanning all the slop-qualified entries in the inverted index. Because it is inefficient, always set max_expansions if you must use it.

11. Fuzzy Search

Search conditions often contain typos, e.g. hello world typed as hello word. Fuzzy search corrects such spelling errors (it works well for English but barely for Chinese). fuzziness is the number of character edits allowed to correct the spelling (an edit can change, add, or remove a letter). f below is the name of the field to search.

GET /test_a/_search 
{ 
"query": { 
"fuzzy": { 
"f" : { 
"value" : "word", 
"fuzziness": 2 
} 
} 
} 
} 
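The notion of distance that fuzziness bounds can be sketched as Levenshtein edit distance. (Elasticsearch's fuzzy matching also counts transpositions of adjacent characters, which this plain Levenshtein sketch does not.)

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the number of single-character insertions,
    deletions, or substitutions needed to turn a into b -- the notion of
    distance that fuzziness bounds. (Elasticsearch additionally counts
    adjacent transpositions as one edit, which this sketch omits.)"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

# "word" is one insertion away from "world", so fuzziness 2 matches it
word_dist = edit_distance("word", "world")
jave_dist = edit_distance("jave", "java")   # one substitution
```

Since both typos are within distance 1, the fuzzy query above with fuzziness 2 would match "world"; larger fuzziness values match more typos but also more unrelated terms.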