Overview

In this article, we will introduce several approaches to partial search. The browser search box we use every day, which pops up a drop-down of suggestions as you type, is also built on the principles of partial search.

Prefix search

Let's start with partial matching, the equivalent of MySQL's `where content like '%love%'`. A glance at the database tells you this query cannot use an index and is very inefficient.

Elasticsearch has dedicated support for this kind of search and offers several partial-search query types. This section focuses on prefix matching on not_analyzed (exact-value) fields.

Prefix search syntax

Common fields with prefix-search requirements include zip codes, product serial numbers, courier tracking numbers and certificate numbers. These values carry logical structure: a prefix may encode information such as region or year. Let's take zip codes as an example:

PUT /demo_index
{
  "mappings": {
    "address": {
      "properties": {
        "postcode": {
          "type": "keyword"
        }
      }
    }
  }
}

POST /demo_index/address/_bulk
{ "index": { "_id": 1 }}
{ "postcode" : "510000"}
{ "index": { "_id": 2 }}
{ "postcode" : "514000"}
{ "index": { "_id": 3 }}
{ "postcode" : "527100"}
{ "index": { "_id": 4 }}
{ "postcode" : "511500"}
{ "index": { "_id": 5 }}
{ "postcode" : "511100"}

Example prefix search:

GET /demo_index/address/_search
{
  "query": {
    "prefix": {
      "postcode": {
        "value": "511"
      }
    }
  }
}

The search results are as expected:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "demo_index",
        "_type": "address",
        "_id": "5",
        "_score": 1,
        "_source": {
          "postcode": "511100"
        }
      },
      {
        "_index": "demo_index",
        "_type": "address",
        "_id": "4",
        "_score": 1,
        "_source": {
          "postcode": "511500"
        }
      }
    ]
  }
}

Principle of Prefix Search

A prefix query does not calculate a relevance score; _score is fixed at 1. The only difference from a prefix filter is that the filter caches its bitsets.

Let’s examine the sample search process:

  1. When a document is indexed, the inverted index is built first. A keyword field is not analyzed, so each value is indexed directly as a single term:
postcode doc ids
510000 1
514000 2
527100 3
511500 4
511100 5
  2. A full-text match search for the string "511" finds no term equal to "511" and returns nothing.
  3. A prefix search scans the term list: it finds "511500", keeps scanning, finds "511100", and continues until the entire term list has been traversed, then returns the results.

From this process we can see that match searches are very fast, while prefix search is relatively slow because it traverses the term list; still, some scenarios can only be served by prefix search. The longer the prefix, the fewer terms it can match and the better the performance. If the prefix is very short, containing only a single character, the volume of matching data will hurt performance.
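The traversal above can be sketched in plain Python. This is a toy model of a sorted term dictionary, not Elasticsearch internals; because term lists are kept sorted, the scan can stop as soon as a term no longer matches the prefix, which is why longer (more selective) prefixes perform better:

```python
# Toy model of the prefix scan described above: the term dictionary
# maps each term to the doc ids that contain it.
from bisect import bisect_left

term_dict = {
    "510000": [1],
    "514000": [2],
    "527100": [3],
    "511500": [4],
    "511100": [5],
}

def prefix_search(terms, prefix):
    """Collect doc ids for every term starting with `prefix`."""
    sorted_terms = sorted(terms)
    # Jump straight to the first possible candidate.
    start = bisect_left(sorted_terms, prefix)
    doc_ids = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        doc_ids.extend(terms[term])
    return sorted(doc_ids)

print(prefix_search(term_dict, "511"))  # [4, 5]
```

A prefix of "5" would force the scan over every term, while "511" touches only two, mirroring the performance note above.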

Wildcard and regular searches

Wildcard search and regular-expression search are similar in kind to prefix search, but offer a few more features.

Wildcards

General notation: `?` matches any single character; `*` matches zero or more characters. Example:

GET /demo_index/address/_search
{
  "query": {
    "wildcard": {
      "postcode": {
        "value": "*110*"
      }
    }
  }
}
Regular-expression search

That is, the search string is written as a regular expression. The usual format:

GET /demo_index/address/_search
{
  "query": {
    "regexp": {
      "postcode": {
        "value": "511[0-9]+"
      }
    }
  }
}

Both are somewhat advanced syntax that let us write more flexible query requests, but they perform poorly, so we rarely use them.
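As a rough illustration of why they are slow (not Elasticsearch internals), both query types boil down to testing candidate terms one by one. Python's fnmatch and re modules can mimic the semantics over a hypothetical term list; note that Elasticsearch anchors the regexp to the whole term, which `re.fullmatch` reproduces, and the pattern here is illustrative:

```python
import fnmatch
import re

# Hypothetical term list for the postcode field.
terms = ["510000", "511100", "511500", "514000", "527100"]

# Wildcard '*110*': a leading '*' defeats any prefix optimisation,
# so every term must be tested.
wildcard_hits = [t for t in terms if fnmatch.fnmatch(t, "*110*")]

# Regexp matching, anchored to the whole term.
regexp_hits = [t for t in terms if re.fullmatch(r"511[0-9]+", t)]

print(wildcard_hits)  # ['511100']
print(regexp_hits)    # ['511100', '511500']
```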

Instant search

When you search for "Elasticsearch" on Google, you'll see a drop-down box showing related terms:

The browser captures every input event: for each character typed, it sends a request to the backend, uses what you have typed so far as a search prefix, and returns the current top 10 related hot queries to help you complete your input. Baidu has a similar feature.

The implementation is based on prefix search, although the Google/Baidu backends are far more complex. We can simulate instant search from the Elasticsearch side:

GET /demo_index/website/_search 
{
  "query": {
    "match_phrase_prefix": {
      "title": "Elasticsearch q"
    }
  }
}

The principle is the same as match_phrase, except that the last term is searched as a prefix. For "Elasticsearch q", the query matches documents that contain "Elasticsearch" followed by a term beginning with "q".

Of course, this query also supports the slop parameter.

The max_expansions parameter

The max_expansions parameter can be used to reduce the performance problems caused by short prefixes. The recommended value is 50, as shown in the following example:

GET /demo_index/website/_search 
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "Elasticsearch q",
        "max_expansions": 50
      }
    }
  }
}

max_expansions controls how many terms the prefix is allowed to expand to. The engine finds the first term matching the prefix "q", then collects further matching terms in alphabetical order until no more terms match or the max_expansions limit is reached.

When we search on Google, every keystroke fires a request, so max_expansions caps the amount of matching work per request. We keep typing until the query is complete or we pick a suggestion, and only then click the search button to run the full search.

Be sure to set max_expansions when using match_phrase_prefix, otherwise performance will be unacceptable when only the first character has been typed.
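The expansion step can be sketched like this (a simplified model; the term list and the exact cutoff behaviour are illustrative, not Lucene's implementation):

```python
# Sketch of how max_expansions bounds prefix expansion: starting from
# the first term that matches the prefix, collect terms in sorted
# (alphabetical) order until none match or the cap is reached.
def expand_prefix(terms, prefix, max_expansions=50):
    expanded = []
    for term in sorted(terms):
        if term.startswith(prefix):
            expanded.append(term)
            if len(expanded) >= max_expansions:
                break  # stop early: this is the performance safeguard
        elif expanded:
            break  # sorted order: no further term can match
    return expanded

terms = ["quake", "quality", "queen", "query", "question", "quick", "quiet"]
print(expand_prefix(terms, "qu", max_expansions=3))  # first 3 matches only
```

With max_expansions=3, only the first three matching terms are considered, no matter how many terms share the prefix.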

Applying ngrams

None of the queries above required any special index setup. This approach is called query time: it is non-intrusive and flexible, but usually pays for that with search performance. The alternative is called index time: the index configuration is intrusive and some preparation for the search is done in advance, which boosts search performance enormously. If a feature needs to feel real-time, moving it from query time to index time is good practice.

Whether prefix search is adequate depends on the scenario. If it sits at the entrance of a top-level feature that carries most of the traffic, it is worth looking at ngrams at index time before settling for prefix queries.

What are ngrams?

A prefix query matches by brute force: the whole process is somewhat blind and touches many terms, so performance is low. But if we split keywords into fixed-length pieces in advance, we can fall back to the efficient match query. An ngram is essentially a sliding window over a keyword, and the window length is configurable. Take "Elastic" as an example:

  • Length 1: [E, l, a, s, t, i, c]
  • Length 2: [El, la, as, st, ti, ic]
  • Length 3: [Ela, las, ast, sti, tic]
  • Length 4: [Elas, last, asti, stic]
  • Length 5: [Elast, lasti, astic]
  • Length 6: [Elasti, lastic]
  • Length 7: [Elastic]

As you can see, the longer the window, the fewer grams are produced. Every gram is added to the inverted index and can be hit by a match search.

There is also a special edge ngram, which keeps only the grams anchored at the start of the word, as follows:

  • Length 1: E
  • Length 2: El
  • Length 3: Ela
  • Length 4: Elas
  • Length 5: Elast
  • Length 6: Elasti
  • Length 7: Elastic

This split is particularly in line with our search habits.
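Both kinds of split are easy to reproduce in Python (a sketch of the windowing idea, not the Lucene token filters themselves):

```python
def ngrams(word, n):
    """All substrings of length n: the sliding window described above."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Only the grams anchored at the start of the word."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("Elastic", 3))    # ['Ela', 'las', 'ast', 'sti', 'tic']
print(edge_ngrams("Elastic"))  # ['E', 'El', 'Ela', ..., 'Elastic']
```

Indexing every edge ngram of "Elastic" is exactly what lets a later match query on the prefix "Ela" hit the document without any term-list traversal.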

Case study

  1. Create an index, specifying filter
PUT /demo_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

The filter means: for every term received from the tokenizer, generate edge ngrams with a minimum length of 1 and a maximum of 20.

  2. Use the token filter above in the custom analyzer autocomplete:
PUT /demo_index/_mapping/_doc
{
  "properties": {
      "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}
  3. Test the effect:
GET /demo_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "love you"
}

Response results:

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "lo",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "lov",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "love",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "y",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "yo",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "you",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

The test results are in line with expectations.

  4. Add some test data:
PUT /demo_index/_doc/_bulk
{ "index": { "_id": "1"}}
{ "title": "love"}
{ "index": { "_id": "2"}}
{ "title": "love me"}
{ "index": { "_id": "3"}}
{ "title": "love you"}
{ "index": { "_id": "4"}}
{ "title": "love every one"}
  5. Run a simple match query:
GET /demo_index/_doc/_search
{
    "query": {
        "match": {
            "title": "love ev"
        }
    }
}

Response results:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.83003354,
    "hits": [
      {
        "_index": "demo_index",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.83003354,
        "_source": {
          "title": "love every one"
        }
      },
      {
        "_index": "demo_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.41501677,
        "_source": {
          "title": "love"
        }
      }
    ]
  }
}

With a plain match query this is still full-text search, so the document containing only "love" is returned as well, just with a lower score.

  6. Use match_phrase

match_phrase is recommended here: it requires every term to be present and the positions to be exactly adjacent, which matches our expectation for autocomplete.

GET /demo_index/_doc/_search
{
    "query": {
        "match_phrase": {
            "title": "love ev"
        }
    }
}

Notice that most of the work is done at indexing time; the queries themselves are plain match or match_phrase, which is far more efficient than a prefix query.

Search suggestions

Elasticsearch also supports a completion suggester, also known as auto-completion.

How completion suggest works

When creating the index, specify completion as the field type. Elasticsearch generates all possible completions for the search field and places them into a finite state transducer (FST), an optimized graph structure.

When a search is performed, Elasticsearch follows the matching path through the graph character by character. Once the user's input is exhausted, it explores every possible ending of the current path and generates the list of suggestions, which are cached in memory.

Performance-wise, completion suggest is much faster than any term-based query.
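A simple in-memory trie approximates the idea (the real structure is an FST, which is far more compact; this sketch and its sample titles are purely illustrative):

```python
# Toy prefix tree approximating the completion suggester: follow the
# input character by character, then enumerate every completion
# reachable from that node.
class Trie:
    def __init__(self):
        self.children = {}
        self.is_end = False

    def insert(self, phrase):
        node = self
        for ch in phrase:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def suggest(self, prefix):
        node = self
        for ch in prefix:  # walk the matching path
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def collect(n, acc):  # enumerate all endings of the path
            if n.is_end:
                results.append(prefix + acc)
            for ch, child in sorted(n.children.items()):
                collect(child, acc + ch)

        collect(node, "")
        return results

trie = Trie()
for title in ["children music London Bridge",
              "children music Twinkle",
              "children music sunshine"]:
    trie.insert(title)

print(trie.suggest("children music"))
```

Because no term list is scanned, lookup cost depends only on the length of the input and the number of completions, which is why this style of suggestion is so fast.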

Example

  1. Give title a sub-field title.suggest of type completion:
PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "suggest": {
              "type": "completion"
            }
          }
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

  2. Insert some sample data:
PUT /music/children/_bulk
{ "index": { "_id": "1"}}
{ "title": "children music London Bridge", "content": "London Bridge is falling down"}
{ "index": { "_id": "2"}}
{ "title": "children music Twinkle", "content": "twinkle twinkle little star"}
{ "index": { "_id": "3"}}
{ "title": "children music sunshine", "content": "you are my sunshine"}
  3. Send the search request:
GET /music/children/_search
{
  "suggest": {
    "my-suggest": {
      "prefix": "children music",
      "completion": {
        "field": "title.suggest"
      }
    }
  }
}

The response is as follows, with omissions:

{
  "took": 26,
  "timed_out": false,
  "suggest": {
    "my-suggest": [
      {
        "text": "children music",
        "offset": 0,
        "length": 14,
        "options": [
          {
            "text": "children music London Bridge",
            "_index": "music",
            "_type": "children",
            "_id": "1",
            "_score": 1,
            "_source": {
              "title": "children music London Bridge",
              "content": "London Bridge is falling down"
            }
          },
          {
            "text": "children music Twinkle",
            "_index": "music",
            "_type": "children",
            "_id": "2",
            "_score": 1,
            "_source": {
              "title": "children music Twinkle",
              "content": "twinkle twinkle little star"
            }
          },
          {
            "text": "children music sunshine",
            "_index": "music",
            "_type": "children",
            "_id": "3",
            "_score": 1,
            "_source": {
              "title": "children music sunshine",
              "content": "you are my sunshine"
            }
          }
        ]
      }
    ]
  }
}

The returned values can then be surfaced on the front end as prompts, for example to populate a browser-style drop-down.

Fuzzy search

Fuzzy search tolerates misspelled words and can be used for spelling correction. Example:

GET /music/children/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "teath",
        "fuzziness": 2
      }
    }
  }
}

fuzziness: the maximum number of character edits allowed when correcting a term; it is capped at 2.
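Under the hood, fuzziness is Levenshtein edit distance: each insertion, deletion or substitution of a single character counts as one edit. A minimal sketch of the metric (Elasticsearch actually uses an optimized Levenshtein automaton, not this dynamic program):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# "teath" is one substitution away from "teeth", well within fuzziness 2.
print(edit_distance("teath", "teeth"))  # 1
```

Any term within the configured edit distance of the query term is considered a match.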

Typical usage: a match query with a nested fuzziness parameter set to AUTO.

GET /music/children/_search
{
  "query": {
    "match": {
      "name": {
        "query": "teath",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}

A passing familiarity with fuzzy search is enough here.

Summary

This article introduced the basics of prefix, wildcard and regex search, briefly explained the performance impact of prefix search and how to control it, covered ngrams as a classic index-time practice for partial search and search-as-you-type, and finally introduced the typical usage of fuzzy search for general awareness.

I focus on Java, high concurrency and distributed architecture. For more technical content and experience sharing, follow the public account: Java Architecture Community, or scan the QR code to join the Java Architecture Community WeChat group and discuss technology!
