Ngrams and edge ngrams are two more unique ways of tokenizing text in Elasticsearch. An ngram splits a token into multiple subtokens, one for each part of the original word. Both the ngram and edge ngram filters let you specify a min_gram and a max_gram setting. These settings control the sizes of the tokens the word is split into. This can be confusing, so let's look at an example. Suppose you want to analyze the word “spaghetti” with the ngram analyzer; let's start with the simplest case, 1-grams (also known as unigrams).

Consider a real-world example such as Google search: every time we type the first few letters, a list of matching suggestions appears. This is the autocomplete feature, and in Elasticsearch we can implement it with edge ngrams.

1-grams

The 1-grams for “spaghetti” are s, p, a, g, h, e, t, t, i. The string is split into smaller tokens according to the ngram size; in this case each token is a single character, because we are talking about unigrams.

Bigrams

If you were to split the string into two-letter groups (that is, size 2), you would get the following smaller tokens: sp, pa, ag, gh, he, et, tt, ti.

Trigrams

Similarly, if you use size 3, you get the tokens spa, pag, agh, ghe, het, ett, tti.

Setting min_gram and max_gram

When using this analyzer, you need to set two different sizes: one specifying the smallest ngrams to generate (the min_gram setting) and one specifying the largest (the max_gram setting). Using the previous example, if you set min_gram to 2 and max_gram to 3, you get the combined tokens from the two previous examples:

sp, spa, pa, pag, ag, agh, gh, ghe, he, het, et, ett, tt, tti, ti

If instead you set min_gram to 1 and max_gram to 3, you get even more tokens, starting with s, sp, spa, p, pa, pag, a, …
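
As a minimal sketch (the index, analyzer, and tokenizer names here are my own), this is what that configuration looks like with an ngram tokenizer; running _analyze on “spaghetti” with min_gram 2 and max_gram 3 returns exactly the token list above:

PUT ngram_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}

POST ngram_demo/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "spaghetti"
}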

Analyzing text this way has an interesting advantage. When you query the text, the query is split into ngrams the same way, so suppose you search for the misspelled word “spaghety”. One way to handle this is a fuzzy query, which lets you specify an edit distance for words to check for matches. But you can get similar behavior with ngrams. Let's compare the bigrams generated by the original word (“spaghetti”) and the misspelled one (“spaghety”):

  • “spaghetti” bigrams: sp, pa, ag, gh, he, et, tt, ti
  • “spaghety” bigrams: sp, pa, ag, gh, he, et, ty

You can see that six of the tokens overlap, so a query containing “spaghety” still matches the document containing “spaghetti”. Keep in mind that this also means more words than you intend may match the original “spaghetti”, so be sure to test your query relevance!
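
Here is a minimal sketch of this in practice (the bigram_demo index, analyzer, and field names are my own): we index “spaghetti” into a bigram-analyzed field, and a search for the misspelled “spaghety” still finds the document because six of its bigrams match:

PUT bigram_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "bigram_analyzer": {
          "tokenizer": "bigram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "bigram_analyzer"
      }
    }
  }
}

PUT bigram_demo/_doc/1
{
  "name": "spaghetti"
}

POST bigram_demo/_refresh

GET bigram_demo/_search
{
  "query": {
    "match": {
      "name": "spaghety"
    }
  }
}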

Another useful property of ngrams is that they let you analyze text when you don't know the language beforehand, or when a language combines words differently than other European languages do. This also has the advantage of handling multiple languages with a single analyzer, without having to specify each one.

Edge ngrams

A variant of regular ngram splitting, called edge ngrams, builds ngrams only from the front edge of the word. In the “spaghetti” example, if you set min_gram to 2 and max_gram to 6, you get the following tokens:

sp, spa, spag, spagh, spaghe

You can see that each token is built from the front edge of the word. This helps when searching for words that share the same prefix, without actually running a prefix query. If you need to build ngrams from the back of a word instead, older Elasticsearch versions provided a side setting to take the edge from the back rather than the default front.
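
In recent Elasticsearch versions the side setting is no longer available; a common equivalent, sketched below with names of my own choosing, is to place an edge_ngram token filter between two reverse token filters so the ngrams are taken from the back of each word:

PUT back_edge_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "back_edge_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "reverse",
            "back_edge_filter",
            "reverse"
          ]
        }
      },
      "filter": {
        "back_edge_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 6
        }
      }
    }
  }
}

POST back_edge_demo/_analyze
{
  "analyzer": "back_edge_analyzer",
  "text": "spaghetti"
}

For “spaghetti” this produces ti, tti, etti, hetti, and ghetti.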

Ngram settings

Ngrams are a great way to analyze text when you don't know what language it is, because they can analyze languages that have no spaces between words. An example of configuring an edge ngram analyzer with min_gram and max_gram is shown below:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

We can now analyze a string using the my_analyzer we just created:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}

The results are as follows:

{
  "tokens" : [
    {
      "token" : "Qu",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Qui",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Quic",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "Quick",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "Fo",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Fox",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Foxe",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "Foxes",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    }
  ]
}

Since we defined min_gram as 2, the generated tokens start at length 2. Note also that the standalone “2” produces no token, because it is only one character long.
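
For comparison, the _analyze API also accepts an inline tokenizer definition, so we can sketch what min_gram set to 1 would produce; the output then additionally includes the single-character tokens 2, Q, and F:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 10,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "2 Quick Foxes."
}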

In general, we recommend using the same analyzer at index time and at search time. In the case of the edge_ngram tokenizer, the advice is different: it only makes sense to use the edge_ngram tokenizer at index time, to ensure that partial words are available for matching in the index. At search time, just search for the terms the user has typed, for example: Quick Fo.

Here is an example of how to configure a field for search-as-you-type:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

In our example, we use two different analyzers for indexing and searching: autocomplete and autocomplete_search.

PUT my_index/_doc/1
{
  "title": "Quick Foxes" 
}

POST my_index/_refresh

Above, we indexed a document and refreshed the index. Let's search:

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo", 
        "operator": "and"
      }
    }
  }
}

The result is as follows:

{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.5753642, "hits" : [{" _index ":" my_index ", "_type" : "_doc", "_id" : "1", "_score" : 0.5753642, "_source" : {" title ":" Quick Foxes "}}}}]Copy the code

In this case, the autocomplete analyzer breaks the string “Quick Foxes” into [qu, qui, quic, quick, fo, fox, foxe, foxes], while the autocomplete_search analyzer turns the query into the terms [quick, fo], both of which appear in the index.
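
We can verify this with two _analyze calls against the analyzers we defined:

POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Quick Foxes"
}

POST my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "Quick Fo"
}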

Of course, we can also do the following search:

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Fo"
      }
    }
  }
}

It returns the same result as above, because the term fo is also present in the index.

Shingles

Like ngrams and edge ngrams, there is a filter called shingle (no, not the disease!). The shingle token filter is basically ngrams at the token level rather than the character level.

Think of our favorite word, “spaghetti”. Using ngrams with min and max settings of 1 and 3, Elasticsearch generates the tokens s, sp, spa, p, pa, pag, a, ag, and so on. A shingle filter does this at the token level instead, so if you have the text “foo bar baz” and use a min_shingle_size of 2 and a max_shingle_size of 3, you generate the following tokens:

foo, foo bar, foo bar baz, bar, bar baz, baz

Why are the single-token outputs still included? Because the shingle filter includes the original tokens by default: the tokenizer first produces the tokens foo, bar, and baz, which are then passed to the shingle token filter, which generates foo bar, foo bar baz, and bar baz. All of these tokens combine to form the final token stream. You can disable this behavior by setting the output_unigrams option to false if you don't want the original tokens foo, bar, and baz.

The next listing shows an example of a shingle token filter; note that min_shingle_size must be greater than or equal to 2.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "shingle-filter"
          ]
        }
      },
      "filter": {
        "shingle-filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false
        }
      }
    }
  }
}

Here, we define a filter called shingle-filter, with a minimum shingle size of 2 and a maximum shingle size of 3. We also set output_unigrams to false so that the original tokens are not included in the final result.

Let’s do an example and see what the result is:

GET /my_index/_analyze
{
  "text": "foo bar baz",
  "analyzer": "shingle"
}

The result displayed is:

{
  "tokens" : [
    {
      "token" : "foo bar",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "foo bar baz",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "bar baz",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "shingle",
      "position" : 1
    }
  ]
}
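
To actually search against shingles you would map a field with this analyzer; the listing above defines only the analyzer, so the mapping below is a sketch with names of my own. A multi-word query then matches whole shingles rather than independent terms:

PUT shingle_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "shingle-filter"
          ]
        }
      },
      "filter": {
        "shingle-filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "shingle_analyzer"
      }
    }
  }
}

PUT shingle_demo/_doc/1
{
  "title": "foo bar baz"
}

POST shingle_demo/_refresh

GET shingle_demo/_search
{
  "query": {
    "match": {
      "title": "foo bar"
    }
  }
}

Note that with output_unigrams set to false, a single-word query such as foo produces no tokens at all and therefore matches nothing, which is why shingle fields are usually used alongside a standard-analyzed field.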
