Same, But Different: Making Elasticsearch More Powerful with Synonyms

By Christoph Buscher

Without a doubt, using synonyms is one of the most important techniques in the search engineer’s toolbox. Novices sometimes underestimate how important synonyms are, yet they are indispensable to almost any search system. At the same time, even power users sometimes underestimate the complexities and nuances that come with using them. The synonym filter is part of the analysis process that turns input text into searchable terms. While it is relatively simple to get started with, it can be used in a variety of ways, and using it successfully in a real-world context requires a solid understanding of a few concepts.

We recently made some improvements to analysis in Elasticsearch. Probably the most significant one is the ability to reload the analyzers used at search time, so that users can change synonyms and have the changes picked up without reopening the index. In addition to demonstrating the new API, this post answers some common questions about synonym usage and points out a few things you should be aware of.

Why use synonyms?

To help you understand the power and flexibility of synonyms, let’s take a quick look at how most search engines work today. They analyze documents and queries and break them down into their smallest units (often called tokens, which are really just abstract symbols). At search time, matching is based on simple string comparison, so if the query contains a minor spelling error (such as “hous” instead of “house”) or uses a plural noun (“houses”) while the document contains the singular (“house”), the search engine will not match that document. Tools such as stemmers or fuzzy queries solve some of the most common problems of this kind, but they do not bridge the gap between related concepts or ideas, nor do they reconcile slightly different word usage in documents and queries.
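To make the matching problem concrete, here is a minimal sketch of a toy inverted index in Python. It is not how Elasticsearch is implemented; the analyze, build_index and search helpers are hypothetical names introduced only for illustration, and the analyzer simply lowercases and splits on whitespace.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (no stemming)."""
    return text.lower().split()

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain at least one query token."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

docs = {1: "a small house", 2: "the elevator is broken"}
index = build_index(docs)

print(search(index, "house"))   # {1}
print(search(index, "houses"))  # set(): plural does not equal singular
print(search(index, "lift"))    # set(): related word, different token
```

A stemmer would fix the “houses” case, but nothing in this toy pipeline can know that “lift” and “elevator” mean the same thing. That is the gap synonym rules are meant to fill.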

This is where synonyms come in handy. The English word “synonym” comes from Greek and is composed of the words σύν (syn, which means “together”) and ὄνομα (onoma, which means “name”). As the etymology suggests, synonyms are different words that have exactly or nearly the same meaning in the same language or domain. In practice the range is very wide: it includes common synonyms (“tired” and “sleepy”), abbreviations (“lb.” and “pound”), different spellings of products in e-commerce search (“iPod” and “i-pod”), regional differences (such as “lift” in British English and “elevator” in American English), specialist terms and their everyday equivalents (such as “canine” and “dog”), or simply two ways of expressing the same concept (“universe” and “cosmos”). By providing appropriate synonym rules, the search engineer tells the search engine which words have similar meanings in their domain and should be treated alike.

It is important for a search engine to know which words in a document match the query even though they do not look the same. Because this requires very specific domain knowledge, the user has to provide the appropriate rules. The synonym filter can be used in a custom analyzer, where it replaces or adds terms based on user-defined rules. It can be applied at index time, so that, for example, both variants of a word are stored in the index, or at search time, to expand the query terms so that they match more relevant documents. We will discuss the pros and cons of both approaches later.

Several situations to watch out for when using synonyms

The synonym filter is a very flexible tool, which makes it easy to overuse. For example, people sometimes use it in place of a stemmer, ending up with a large synonym file that lists the various inflected forms of verbs and nouns (for English). While this approach may work, it generally performs worse and is harder to maintain than a real stemmer or lemmatizer.

The same goes for correcting spelling mistakes. If there are only a few particularly common misspellings, for example on an e-commerce platform, it can sometimes be worthwhile to correct them with synonyms. However, if the problem is more widespread, fuzzy queries or character-level n-grams are likely to be more sustainable.

It is also worth considering alternatives to synonym expansion in the analysis chain. Sometimes, instead of using synonyms in the more restrictive analysis process, it is more flexible and manageable to enrich documents in an ingest pipeline or in some other client-side process. For example, you can use a named entity recognition (NER) framework to detect named entities in your documents and then encode them with your own identifiers in your preprocessing pipeline or at ingest time. If you apply the same process to user queries before sending them to Elasticsearch, you achieve the same effect, but often with more control.

In addition, you may be tempted to use synonyms to handle other kinds of “sameness”, such as grouping specific animal species under a common term, or even building taxonomy support for your domain. This is where things get really interesting and there is a lot to explore, but keep in mind that synonyms are not always the best choice, and using them inappropriately can cause your system to behave in unexpected ways.
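As a rough illustration of the preprocessing alternative, the sketch below replaces known entity names with a shared identifier in both documents and queries before they would be sent to Elasticsearch. A plain dictionary lookup stands in for a real NER framework here, and the ENTITY_IDS table and the normalize function are hypothetical names made up for this example.

```python
import re

# Hypothetical lookup table: surface forms mapped to one canonical identifier.
# A real system would get these from an NER framework or a curated list.
ENTITY_IDS = {
    "new york city": "ENT_NYC",
    "nyc": "ENT_NYC",
}

def normalize(text):
    """Lowercase the text and replace known entity names with their identifier."""
    text = text.lower()
    # Replace longer surface forms first so that shorter ones cannot split them.
    for surface in sorted(ENTITY_IDS, key=len, reverse=True):
        pattern = r"\b" + re.escape(surface) + r"\b"
        text = re.sub(pattern, ENTITY_IDS[surface], text)
    return text

# Apply the same step to documents at ingest time and to queries at search time.
print(normalize("Hotels in New York City"))  # hotels in ENT_NYC
print(normalize("NYC hotels"))               # ENT_NYC hotels
```

Because both sides go through the same function, a query for “NYC” and a document mentioning “New York City” end up sharing the token ENT_NYC, without any synonym rule in the analysis chain.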

Synonyms used in indexing versus synonyms used in searching

Synonyms are used in analyzers, which can be applied at index time or at search time. One of the most common questions about synonym filters in Elasticsearch is: “Should I use them at index time, at search time, or both?” Let’s first look at applying synonyms at index time. This means that terms are replaced or expanded once, when the document is indexed, and the result is stored permanently in the search index.

Using synonyms when indexing has several disadvantages:

Because all synonyms must be indexed, the index size is larger.
Search scores (which rely on word statistics) can be affected because synonyms are also counted, so statistics for less common words can be skewed.
You cannot change the synonym rules for existing documents unless you re-index them.

The last two are particularly serious disadvantages. The only potential benefit of index-time synonyms is performance: because the expansion has already been done up front, it does not have to be repeated for every query, where it might result in more terms needing to be matched. In practice, however, this is rarely a real problem.

In contrast, using synonyms in the search analyzer avoids many of these problems:

The index size is not affected.
The word statistics in the corpus remain the same.
If you need to change the synonym rules, you do not need to re-index the document.

These advantages usually outweigh the only disadvantage, which is that synonym expansion has to be performed for every query and may result in more terms needing to be matched. On top of that, expanding synonyms at search time allows you to use the more sophisticated synonym_graph token filter, which handles multi-word synonyms correctly and is designed to be used only in search analyzers.

In general, the benefits of using synonyms at search time usually outweigh the small performance gain you might get from using them at index time.

However, search-time synonyms used to come with another issue. Although changing the synonym rules does not require re-indexing documents, it did require temporarily closing and reopening the index. This was necessary because analyzers are only instantiated when an index is created, when a node is restarted, or when a closed index is reopened. To make changes to a synonym rule file visible to an index, you first had to update the file on all nodes and then close and reopen the index. That problem has now been solved.
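The trade-off can be illustrated with a small simulation. The following sketch is a toy model, not Elasticsearch code: the SYNONYMS table and the expand, build_index and search helpers are hypothetical. It expands tokens either while building the index or while running the query and compares the results.

```python
# Hypothetical two-way synonym table for the rule "universe, cosmos".
SYNONYMS = {"universe": ["cosmos"], "cosmos": ["universe"]}

def expand(tokens):
    """Keep every token and append its synonyms, if any."""
    out = []
    for token in tokens:
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out

def build_index(docs, expand_at_index_time):
    """Build a token -> document ids map, optionally expanding synonyms."""
    index = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        if expand_at_index_time:
            tokens = expand(tokens)
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, expand_at_search_time):
    """Return matching document ids, optionally expanding the query terms."""
    tokens = query.lower().split()
    if expand_at_search_time:
        tokens = expand(tokens)
    hits = set()
    for token in tokens:
        hits |= index.get(token, set())
    return hits

docs = {1: "the universe is vast", 2: "cosmos documentary"}
index_time = build_index(docs, expand_at_index_time=True)
search_time = build_index(docs, expand_at_index_time=False)

# Both strategies find both documents for the query "cosmos" ...
print(search(index_time, "cosmos", expand_at_search_time=False))  # {1, 2}
print(search(search_time, "cosmos", expand_at_search_time=True))  # {1, 2}

# ... but the index-time variant stores more postings and changes the
# document frequency of "cosmos", which feeds into scoring.
print(sum(len(ids) for ids in index_time.values()))   # 8
print(sum(len(ids) for ids in search_time.values()))  # 6
print(len(index_time["cosmos"]), len(search_time["cosmos"]))  # 2 1
```

Changing SYNONYMS afterwards only affects the search-time variant immediately; the index-time variant would have to be rebuilt from the documents.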

Synonyms, reloaded

Starting with Elasticsearch 7.3, changes to synonym files can be picked up without reopening the index. We have added an endpoint that lets users trigger a reload of analyzer resources on demand. Calling this new endpoint reloads all analyzers of the targeted indices that contain components marked as updateable. This, in turn, means that those components can only be used at search time.

For a synonym filter, marking it as updateable and calling the reload API makes changes to the synonym file on each node visible to the analysis process. You still cannot update synonym rules that are defined inline in the filter definition (with the synonyms parameter), but inline rules should mostly be used for occasional testing anyway. Configuring synonyms through a file has several advantages:

Easier management! In a production system, there can be many synonym rules, and since they can significantly affect search relevancy, they should be considered an integral part of the configuration and should be versioned and tested for any updates.
Synonyms usually come from other sources or are created by algorithms running on your data. Reading from a file, you do not need to add these synonyms to the filter configuration.
The same synonym file can be used in different filters.
Large synonym rule sets defined inline take up a lot of memory in the Elasticsearch cluster state, which stores the meta information related to index settings. To avoid unnecessarily increasing the size of the cluster state, we recommend storing large synonym rule sets in files.

For demonstration purposes, let’s assume that you have added a file named my_synonyms.txt to the config directory of each Elasticsearch node, and that it initially contains only the following rule:
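Since synonym files should be versioned and tested like any other configuration, it can help to check them before deploying. The sketch below parses the two rule shapes used in this post, comma-separated equivalents and explicit “=>” mappings, and skips blank lines and lines starting with “#”. The parse_rule function is a hypothetical helper, not part of Elasticsearch, and treating a single-term line as an error is an assumption of this example rather than the filter’s documented behavior.

```python
def parse_rule(line):
    """Parse one synonym rule into (inputs, outputs).

    Returns None for blank lines and comment lines starting with '#'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if "=>" in line:
        # Explicit mapping: terms on the left are replaced by terms on the right.
        left, right = line.split("=>", 1)
        inputs = [t.strip() for t in left.split(",") if t.strip()]
        outputs = [t.strip() for t in right.split(",") if t.strip()]
        if not inputs or not outputs:
            raise ValueError(f"incomplete mapping: {line!r}")
    else:
        # Equivalent synonyms: every term can stand in for every other term.
        inputs = [t.strip() for t in line.split(",") if t.strip()]
        outputs = list(inputs)
        if len(inputs) < 2:
            # Assumption of this sketch: a lone term is most likely a typo.
            raise ValueError(f"rule needs at least two terms: {line!r}")
    return inputs, outputs

print(parse_rule("universe, cosmos"))    # (['universe', 'cosmos'], ['universe', 'cosmos'])
print(parse_rule("Cosmos => Universe"))  # (['Cosmos'], ['Universe'])
print(parse_rule("# a comment"))         # None
```

Running a check like this in a test suite or CI job catches malformed lines before the file reaches any node.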

universe, cosmos

Next, we define an analyzer whose synonym filter references this file:

PUT /synonym_test
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": ["my_synonyms"]
          }
        },
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms_path": "my_synonyms.txt",
            "updateable": true
          }
        }
      }
    }
  }
}

Note that we mark the synonym filter as updateable. This is important because only updateable filters are reloaded when we call the new reload endpoint. It has a side effect, however: an analyzer that contains updateable filters cannot be used at index time. But let’s first check that the synonym is applied correctly by running a short test against the _analyze endpoint:

GET /synonym_test/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "cosmos"
}

This request should return two tokens, one of which is “universe”, as expected. Next, we add another rule to the my_synonyms.txt file by adding a second line:

lift, elevator

In earlier versions, you would have to close and reopen the index at this point for the change to take effect. Now you can simply call the new endpoint:

POST /synonym_test/_reload_search_analyzers

This request does not require a body, and it can be restricted to one or more indices using the usual index wildcard patterns. The response includes information about which analyzers have been reloaded and which nodes are affected:

{
  [...],
  "reload_details": [
    {
      "index": "synonym_test",
      "reloaded_analyzers": ["synonym_analyzer"],
      "reloaded_node_ids": ["FXbmbgG_SsOrNRssrYcPow"]
    }
  ]
}

Running the _analyze request from above with the text “lift” will now also return “elevator” as a second token.
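If you trigger the reload from a script, the reload_details section can be checked automatically. The following is a minimal sketch using only the Python standard library. The base URL, the helper names, and the idea of comparing against a list of expected node ids are assumptions of this example, and which nodes you expect to see is something you need to decide for your own cluster.

```python
import json
import urllib.request

def reload_search_analyzers(base_url, index_pattern):
    """POST /<index_pattern>/_reload_search_analyzers and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/{index_pattern}/_reload_search_analyzers", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def nodes_missing_reload(response, index, expected_node_ids):
    """Return the expected node ids that did not report a reload for the index."""
    reloaded = set()
    for detail in response.get("reload_details", []):
        if detail["index"] == index:
            reloaded.update(detail["reloaded_node_ids"])
    return set(expected_node_ids) - reloaded

# Checking the sample response from this post (no cluster needed for this part).
sample = {
    "reload_details": [
        {
            "index": "synonym_test",
            "reloaded_analyzers": ["synonym_analyzer"],
            "reloaded_node_ids": ["FXbmbgG_SsOrNRssrYcPow"],
        }
    ]
}
print(nodes_missing_reload(sample, "synonym_test", ["FXbmbgG_SsOrNRssrYcPow"]))  # set()
```

Against a live cluster you would pass the result of reload_search_analyzers("http://localhost:9200", "synonym_test") instead of the sample dictionary, assuming the node is reachable at that address without authentication.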

However, there are some caveats. As mentioned above, filters marked as updateable can only be used at search time, so the correct way to use the synonym analyzer defined above at the field level is as follows:

POST /synonym_test/_mapping
{
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "standard",
      "search_analyzer": "synonym_analyzer"
    }
  }
}

Also, reloading only applies to synonyms loaded from a file; changing synonyms that are defined inline in the filter settings is not supported. Finally, in practice you need to make sure that updates to the synonym file reach all nodes in the cluster. If analyzers on different nodes see different versions of the file, you may get different search results depending on which node serves the search. If you notice this kind of inconsistency, first check that the synonym files are identical on every node, and then trigger the reload again.

To summarize, the new _reload_search_analyzers endpoint lets you quickly change the synonyms that are applied at query time without having to reopen the index. For example, by examining the query logs you can find out where the words users search with differ from the words in your indexed documents, and then add synonyms for them at any time. But adding synonyms can have unexpected negative effects on relevance scoring, so we recommend doing some form of testing first (for example A/B testing or the Ranking Evaluation API) before applying the changes directly in production.
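One simple way to check that every node has the same file is to compare checksums. The sketch below groups nodes by the SHA-256 digest of their synonym file content. How you collect the file content from each node (configuration management, SSH, a shared artifact store) is left open, and the node names and helper functions are hypothetical.

```python
import hashlib

def file_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a synonym file's content."""
    return hashlib.sha256(content).hexdigest()

def group_nodes_by_digest(contents_by_node):
    """Group node names by the digest of the file content found on each node."""
    groups = {}
    for node, content in contents_by_node.items():
        groups.setdefault(file_digest(content), []).append(node)
    return groups

# Hypothetical snapshot: node-3 still has the old one-line file.
contents = {
    "node-1": b"universe, cosmos\nlift, elevator\n",
    "node-2": b"universe, cosmos\nlift, elevator\n",
    "node-3": b"universe, cosmos\n",
}

groups = group_nodes_by_digest(contents)
print(len(groups) == 1)         # False: the files are not all identical
print(sorted(groups.values()))  # [['node-1', 'node-2'], ['node-3']]
```

Only when all nodes fall into a single group does it make sense to trigger the reload.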

As part of the analysis chain

Another common question about synonym filters concerns their behavior in more complex analysis chains. In most cases, the synonym filter is preceded by some common character or token filters, such as the lowercase filter. This means that all tokens flowing through the analysis chain are already lowercased when the synonym filter is applied. Does this mean that the terms in the synonym rules also need to be lowercase in order to match? Let’s take a look at this simple example:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": ["lowercase", "my_synonyms"]
          }
        },
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": ["Eins, Uno, One", "Cosmos => Universe"]
          }
        }
      }
    }
  }
}

GET /test_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "one"
}

With this example you can verify that the lowercase input text is expanded to three tokens, which means that the lowercase filter is also applied to the rules of the synonym filter. The same is true for the right-hand side of replacement rules (such as “Cosmos => Universe”), as you can see from the lowercase output of the following request:

GET /test_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "cosmos"
}
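The effect can be modelled in a few lines. The following toy simulation is not the actual Lucene implementation, and build_synonym_map and its helpers are hypothetical names. It runs the terms of each rule through the filters that precede the synonym filter before the rule is stored.

```python
def lowercase(tokens):
    """Stand-in for the lowercase token filter."""
    return [token.lower() for token in tokens]

def split_terms(text):
    """Split a comma-separated list of terms and trim whitespace."""
    return [term.strip() for term in text.split(",") if term.strip()]

def apply_filters(tokens, filters):
    """Run tokens through each preceding filter in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

def build_synonym_map(rules, preceding_filters):
    """Analyze both sides of every rule with the preceding filters."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            left, right = rule.split("=>", 1)
            inputs = apply_filters(split_terms(left), preceding_filters)
            outputs = apply_filters(split_terms(right), preceding_filters)
        else:
            inputs = outputs = apply_filters(split_terms(rule), preceding_filters)
        for term in inputs:
            mapping.setdefault(term, []).extend(outputs)
    return mapping

rules = ["Eins, Uno, One", "Cosmos => Universe"]
with_lowercase = build_synonym_map(rules, [lowercase])
without_filters = build_synonym_map(rules, [])

print(with_lowercase["one"])       # ['eins', 'uno', 'one']
print(with_lowercase["cosmos"])    # ['universe']
print(without_filters.get("one"))  # None: the rule only knows "One"
```

The order of the expanded terms in this toy model follows the rule text and is not meant to predict the exact token order that _analyze returns.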

In general, a synonym filter analyzes its rules using the tokenizer and the filters that precede it in the analysis chain. However, there are a few notable exceptions: several filters that output stacked tokens (such as the common_grams or phonetic filters) are not allowed before a synonym filter, and trying to use them there raises an error. Other filters, such as compound word filters or synonym filters themselves, are skipped when they precede another synonym filter in the chain. The latter rule is important for chaining synonym filters, as the next example shows.

What happens if you use two or more synonym filters in a row? Does the output of the first become the input of the second, making the chaining of synonym filters a transitive operation? Let’s try it with the following example:

PUT /synonym_chaining
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "first_synonyms": {
            "type": "synonym",
            "synonyms": ["a => b", "e => f"]
          },
          "second_synonyms": {
            "type": "synonym",
            "synonyms": ["b => c", "d => e"]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "filter": [
              "first_synonyms",
              "second_synonyms"
            ],
            "tokenizer": "whitespace"
          }
        }
      }
    }
  }
}

GET /synonym_chaining/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "a"
}

The output token is “c”, which means that the two filters have been applied in sequence: the first filter replaces “a” with “b”, and the second filter in turn replaces that with “c”. If you change the input to “d”, it is replaced with “e” (the first filter does not apply). But if you change the input to “e”, the token is replaced with “f” by the first filter, leaving the second filter with nothing to match.

Remember that we mentioned some exceptions to analyzing rules with the preceding token filters? If the second_synonyms filter above had applied the first filter’s rules to its own rule set, it would have changed its rule d => e into d => f (because the first filter’s rule e => f would have been applied). This behavior used to be a source of annoyance in earlier versions of Elasticsearch, which is why synonym filters are now skipped when the rules of a later synonym filter are analyzed. From version 6.6 on, chaining works as described.
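The three cases can be reproduced with a toy model in which each synonym filter is a simple replacement table applied to the token stream. This is only a sketch of the observable behavior, and make_synonym_filter and analyze are hypothetical helpers.

```python
def make_synonym_filter(rules):
    """Build a token filter from simple 'x => y' replacement rules."""
    mapping = {}
    for rule in rules:
        left, right = (side.strip() for side in rule.split("=>"))
        mapping[left] = right

    def apply(tokens):
        # Replace tokens that have a rule, pass everything else through.
        return [mapping.get(token, token) for token in tokens]

    return apply

first_synonyms = make_synonym_filter(["a => b", "e => f"])
second_synonyms = make_synonym_filter(["b => c", "d => e"])

def analyze(text):
    """Whitespace tokenizer followed by both filters in order."""
    tokens = text.split()
    return second_synonyms(first_synonyms(tokens))

print(analyze("a"))  # ['c']: a -> b by the first filter, b -> c by the second
print(analyze("d"))  # ['e']: only the second filter matches
print(analyze("e"))  # ['f']: the first filter matches, the second has nothing to do
```

Note that the second filter’s rule d => e is kept exactly as written here; it is not rewritten by the first filter’s rule e => f.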

Looking to the future

In this short blog post we have only scratched the surface of what synonyms can be used for, and tried to answer some of the common questions that come up when using them. Synonyms can be a powerful tool for increasing the recall of your search system, but there are some important details that you need to know and experiment with, ideally backed by systematic relevance testing.

We have added a new API in Elasticsearch 7.3 that lets you reload the analyzers applied at search time. This API makes this kind of experimentation much easier because you no longer have to close and reopen the index. It also allows you to update search-time synonym rules without taking your index offline. The API is just one small step in a series of improvements that we hope will make it easier for users to manage synonyms in large clusters. Feel free to tell us what you think, and send us feedback or questions in the comments section of this SegmentFault post. Happy analyzing!
