
Elasticsearch: How to search for emoji

Posted on June 24, 2022, 2:55 a.m. by 李詩婷
Category: The back-end Tag: elasticsearch

Elasticsearch is a very popular search engine. It tokenizes text in order to provide full-text search. In practice, we often find that text contains emoji, such as smiling faces, animals and so on. How should we search for these emoji?

🏻 => 🏻, light skin tone, skin tone, type 1–2
🏼 => 🏼, medium-light skin tone, skin tone, type 3
🏽 => 🏽, medium skin tone, skin tone, type 4
🏾 => 🏾, medium-dark skin tone, skin tone, type 5
🏿 => 🏿, dark skin tone, skin tone, type 6
...
♭ => ♭, bemolle, flat, music, note
♯ => ♯, diese, diesis, music, note, sharp
😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
🤣 => 🤣, face, floor, laugh, rofl, rolling, rolling on the floor laughing, rotfl
😂 => 😂, face, face with tears of joy, joy, laugh, tear
🙂 => 🙂, face, slightly smiling face, smile
🙃 => 🙃, face, upside-down
😉 => 😉, face, wink, winking face
🐅 => 🐅, tiger
🐆 => 🐆, leopard
🐴 => 🐴, face, horse
🐎 => 🐎, equestrian, horse, racehorse, racing
🦄 => 🦄, face, unicorn
🦓 => 🦓, stripe, zebra
🦌 => 🦌, deer

The list above pairs emoji with descriptive keywords. For example, we would like a search for grin to also find documents that contain the 😀 emoji. In today's article, we'll show you how to search for emoji.

 

Installation

If you haven't already installed Elasticsearch and Kibana, see the previous article "Elastic: A Beginner's Guide" to do so. We must also install the ICU analysis plugin; for details, see the previous article "Elasticsearch: An Introduction to the ICU Analyzer". Run the following command from the root directory of Elasticsearch:

./bin/elasticsearch-plugin install analysis-icu

Once the installation finishes, we can confirm that the plugin is present by running:

./bin/elasticsearch-plugin list

The installation and listing commands produce output like the following:

$ ./bin/elasticsearch-plugin install analysis-icu
- Installing analysis-icu
- Downloading analysis-icu from elastic
[=================================================] 100%   
- Installed analysis-icu
$ ./bin/elasticsearch-plugin list
analysis-icu

After installing the ICU analysis plugin, we must restart Elasticsearch for it to take effect.

 

Search for Emoji

Let's start with a simple experiment:

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "I live in 🇨🇳 and I'm 👩‍🚀"
}

The request above uses icu_tokenizer to tokenize "I live in 🇨🇳 and I'm 👩‍🚀". The 👩‍🚀 emoji is unusual because it is a combination of the more classic 👩 and 🚀 emojis. The national flag of China is also special: it is a combination of 🇨 and 🇳. So we are not only talking about splitting Unicode code points properly; the tokenizer really has to understand emoji.

The result of the above request is:

{
  "tokens" : [
    { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "live", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "in", "start_offset" : 7, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "🇨🇳", "start_offset" : 10, "end_offset" : 14, "type" : "<EMOJI>", "position" : 3 },
    { "token" : "and", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "I'm", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "👩‍🚀", "start_offset" : 24, "end_offset" : 29, "type" : "<EMOJI>", "position" : 6 }
  ]
}

As we can see, the emoji are tokenized correctly, so they can be searched.
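
To see why a single visible emoji can span 4 or 5 offset units in the result above, it can help to inspect its code points outside Elasticsearch. The following is a minimal Python sketch and not part of the original walkthrough. It assumes the flag of China and, purely as an illustration, a woman astronaut emoji, both written as escape sequences. The helper names code_points and utf16_units are made up for this example.

```python
import unicodedata

# Emoji written as escapes so the example does not depend on file encoding.
FLAG_CN = "\U0001F1E8\U0001F1F3"          # regional indicators C + N
ASTRONAUT = "\U0001F469\u200D\U0001F680"  # woman + zero width joiner + rocket


def code_points(text):
    """Return (hex code point, Unicode name) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def utf16_units(text):
    """Count UTF-16 code units, the unit in which token offsets are reported."""
    return len(text.encode("utf-16-le")) // 2


for label, emoji in (("flag", FLAG_CN), ("astronaut", ASTRONAUT)):
    print(label, utf16_units(emoji), code_points(emoji))
```

The flag consists of two regional indicator code points (4 UTF-16 units), while the astronaut is three code points held together by a zero width joiner (5 UTF-16 units). A tokenizer that is not emoji-aware could split either of them into separate pieces.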

In practice, we usually want more than to search for the emoji characters themselves. For example, suppose we want to find the following document:

PUT emoji-capable/_doc/1
{
  "content": "I like 🐅"
}

The document above contains 🐅, a tiger. We would like a search for tiger to find this document. How do we do that?

On GitHub, there is a project at github.com/jolicode/em... . It contains a directory, github.com/jolicode/em... , which is essentially a catalog of synonyms. We now download one of its files, github.com/jolicode/em... , into the config directory of the local Elasticsearch installation:

config
├── analysis
│   └── cldr-emoji-annotation-synonyms-en.txt
...

On my computer:

$ pwd
/Users/liuxg/elastic1/elasticsearch-7.11.0/config
$ tree -L 3
.
├── analysis
│   └── cldr-emoji-annotation-synonyms-en.txt
├── elasticsearch.keystore
├── elasticsearch.yml
├── jvm.options
├── jvm.options.d
...

The file cldr-emoji-annotation-synonyms-en.txt contains synonyms for common emoji. For example:

😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
....
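
To get a feel for the structure of such a file before handing it to Elasticsearch, here is a small Python sketch. It is an illustration only: it assumes the Solr-style "emoji => emoji, keyword, ..." rule format, uses three sample rules instead of the real file, and the function names parse_rules and emoji_by_keyword are hypothetical.

```python
from collections import defaultdict

# A few sample rules in the "emoji => emoji, keyword, ..." form (escapes = 😀 🐅 🐆).
SAMPLE_RULES = """\
# comment lines and blank lines are skipped
\U0001F600 => \U0001F600, face, grin, grinning face
\U0001F405 => \U0001F405, tiger
\U0001F406 => \U0001F406, leopard
"""


def parse_rules(text):
    """Map each left-hand emoji to the list of terms on the right-hand side."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        left, _, right = line.partition("=>")
        rules[left.strip()] = [t.strip() for t in right.split(",") if t.strip()]
    return rules


def emoji_by_keyword(rules):
    """Build a reverse index: keyword -> emoji whose rule mentions it."""
    index = defaultdict(list)
    for emoji, terms in rules.items():
        for term in terms:
            if term != emoji:
                index[term].append(emoji)
    return dict(index)


rules = parse_rules(SAMPLE_RULES)
keywords = emoji_by_keyword(rules)
print(keywords["tiger"])
```

Each rule keeps the emoji itself on the right-hand side, which is why both the emoji and its keywords remain searchable after expansion.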

With the file in place, we create an index that uses it:

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt" 
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}

Above, we defined the english_with_emoji analyzer and assigned that same analyzer to the content field. We can test it with the _analyze API:

GET emoji-capable/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "I like 🐅"
}

The command above returns:

{
  "tokens" : [
    { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "like", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "🐅", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 },
    { "token" : "tiger", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 }
  ]
}

As expected, it returns the token tiger as well as 🐅, so the document can be found by searching for either one. In the same way:

GET emoji-capable/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "😀 means happy"
}

It returns:

{
  "tokens" : [
    { "token" : "😀", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "face", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "grin", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "grinning", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "means", "start_offset" : 3, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "face", "start_offset" : 3, "end_offset" : 8, "type" : "SYNONYM", "position" : 1 },
    { "token" : "happy", "start_offset" : 9, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}

This shows that a search for face, grinning, or grin will also return the document. Note that the analyzer defined above has no lowercase filter, so these terms match only in lowercase.
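
The token output above can be mimicked with a toy model, which may make it clearer why the second "face" token lands at position 1. This Python sketch is a deliberate simplification and not how Elasticsearch is implemented: it splits on whitespace instead of using icu_tokenizer, the RULES table is a hypothetical two-entry stand-in for the synonym file, and the functions analyze, tokens and matches are invented for the example.

```python
# Hypothetical rule table mirroring two entries of the synonym file (😀 and 🐅).
RULES = {
    "\U0001F600": ["\U0001F600", "face", "grin", "grinning face"],
    "\U0001F405": ["\U0001F405", "tiger"],
}


def analyze(text, rules=RULES):
    """Return (position, token) pairs: whitespace tokens plus their synonyms."""
    out = []
    for position, token in enumerate(text.split()):
        if token not in rules:
            out.append((position, token))
            continue
        for synonym in rules[token]:
            # Each further word of a multi-word synonym takes the next position.
            for offset, word in enumerate(synonym.split()):
                out.append((position + offset, word))
    return out


def tokens(text, rules=RULES):
    """The set of tokens produced for text, ignoring positions."""
    return {token for _, token in analyze(text, rules)}


def matches(query, document, rules=RULES):
    """Model a match query: a hit needs at least one token in common."""
    return bool(tokens(query, rules) & tokens(document, rules))


print(analyze("\U0001F600 means happy"))
print(matches("tiger", "I like \U0001F405"))
```

Because the query text goes through the same expansion as the document, searching for the emoji or for one of its keywords ends up comparing the same tokens. The model is also case-sensitive, like the analyzer defined above.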

Now we index the following two documents:

PUT emoji-capable/_doc/1
{
  "content": "I like 🐅"
}

PUT emoji-capable/_doc/2
{
  "content": "😀 means happy"
}

We search the documents as follows:

GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "🐅"
    }
  }
}

Or:

GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "tiger"
    }
  }
}

Both searches return the first document:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.8514803,
    "hits" : [
      {
        "_index" : "emoji-capable",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.8514803,
        "_source" : {
          "content" : "I like 🐅"
        }
      }
    ]
  }
}

Similarly, we can run the following search:

GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "😀"
    }
  }
}

Or:

GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "grin"
    }
  }
}

Both searches return the second document:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.8514803,
    "hits" : [
      {
        "_index" : "emoji-capable",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8514803,
        "_source" : {
          "content" : "😀 means happy"
        }
      }
    ]
  }
}
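
The same searches can also be issued from application code rather than from Kibana. The sketch below uses only the Python standard library and rests on several assumptions that do not come from the article: a node reachable at http://localhost:9200 with security disabled, and the helper names match_query, build_request and search, which are invented here.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, security disabled
INDEX = "emoji-capable"


def match_query(field, text):
    """Build the body of a match query like the ones shown above."""
    return {"query": {"match": {field: text}}}


def build_request(body, index=INDEX, base_url=ES_URL):
    """Prepare (but do not send) a POST request to the _search endpoint."""
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(field, text):
    """Send the query and return the _source of every hit."""
    request = build_request(match_query(field, text))
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [hit["_source"] for hit in payload["hits"]["hits"]]
```

With the two documents indexed, calling search("content", "tiger") or search("content", "\U0001F405") should return the source of the first document. The body is encoded as UTF-8 so that emoji in the query text reach Elasticsearch intact.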

 
