In an Elasticsearch ingest node, we can use various processors to transform our data, but each processor has a specific, narrowly defined function. Is there a more flexible way to program Elasticsearch? If so, what language does it use?

Elasticsearch uses a language called Painless, which was built specifically for it. Painless is a simple, secure scripting language designed for use with Elasticsearch. It is the default scripting language and can be safely used for inline and stored scripts. Its syntax is Groovy-like; Elasticsearch versions after 6.0 no longer support Groovy, JavaScript, or Python.

Using scripts, you can evaluate custom expressions in Elasticsearch. For example, you can use scripts to return “Script fields” as part of a search request, or to evaluate a custom score for a query.

 

How to use scripts:

The syntax of the script is:

"script": {
  "lang": "...",
  "source" | "id": "...",
  "params": { ... }
}
  • The default value of lang is “painless”; in practice it can be omitted unless another scripting language is installed
  • source can hold an inline script, or id can reference a stored script
  • Named parameters can be passed to the script through params

Painless uses Java-style statements and supports branching, looping, and other control structures, so it reserves some keywords that cannot be used as identifiers. Painless has far fewer keywords than Java, only 15 in all. The following table lists all of them and shows which statement types Painless supports.

Painless keyword

if else while do for
in continue break return new
try catch throw this instanceof

Painless supports all the control statements of Java syntax except switch.
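For example, since switch is unavailable, a multi-way branch must be written as an if/else chain. A minimal, illustrative Painless sketch (the variable names here are made up):

// if/else chain instead of switch (switch is not a Painless keyword)
int level = 2;
String label;
if (level == 1) {
  label = "low";
} else if (level == 2) {
  label = "medium";
} else {
  label = "high";
}
return label;

A body like this can serve as the source of any script object.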

 

A simple use example of Painless

The inline script

Let’s start by creating a simple document:

PUT twitter/_doc/1
{
  "user": "hello",
  "message": "nice weather today, go for a walk",
  "uid": 2,
  "age": 20,
  "city": "Beijing",
  "province": "Beijing",
  "country": "China",
  "address": "Haidian District, Beijing, China",
  "location": {
    "lat": "39.970718",
    "lon": "116.325747"
  }
}

Now suppose we want to change age in this document to 30. One way is to read the entire document, change age to 30, and write it back the same way. That takes several steps: read the data, modify it, write it again. Obviously it's a bit of a hassle. With Painless we can modify it directly:

POST twitter/_update/1
{
  "script": {
    "source": "ctx._source.age = 30"
  }
}

The source here holds our Painless code. Because the code is written directly inside the DSL, it is called an inline script. We access the age in _source directly via ctx._source.age, modifying it programmatically. The result of the run is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 16,
  "_seq_no" : 20,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "hello",
    "message" : "nice weather today, go for a walk",
    "uid" : 2,
    "age" : 30,
    "city" : "Beijing",
    "province" : "Beijing",
    "country" : "China",
    "address" : "Haidian District, Beijing, China",
    "location" : {
      "lat" : "39.970718",
      "lon" : "116.325747"
    }
  }
}

Obviously the age has changed to 30. The approach above works, but a script must be recompiled every time its source text changes. Compiled scripts are cached and can be reused later, so hard-coding a different age forces a recompile. A better approach is this:

POST twitter/_update/1
{
  "script": {
    "source": "ctx._source.age = params.value",
    "params": {
      "value": 34
    }
  }
}

In this way, the source of our script never changes, so it only needs to be compiled once. On subsequent calls, only the parameters in params change.

In Elasticsearch, the two scripts:

"script": {
  "source": "ctx._source.num_of_views += 2"
}

and

"script": {
  "source": "ctx._source.num_of_views += 3"
}

are treated as two different scripts that must be compiled separately, so the best practice is to pass values in through params.

In addition to the above update, we can also use script query to continue searching our documents:

GET twitter/_search
{
  "query": {
    "script": {
      "script": {
        "source": "doc['city'].contains(params.name)",
        "lang": "painless",
        "params": {
          "name": "Beijing"
        }
      }
    }
  }
}

The script above queries for all documents whose city field contains “Beijing”.

 

Stored script

 

Scripts can also be stored in the cluster state and later called by ID:

PUT _scripts/add_age
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.age += params.value"
  }
}

Here, we define a stored script called add_age that adds a value to the age field in _source. We can then call it by ID:

POST twitter/_update/1
{
  "script": {
    "id": "add_age",
    "params": {
      "value": 2
    }
  }
}

From the above implementation, we can see that age will be incremented by 2.

 

Access the fields in source

The syntax used to access field values in Painless depends on the context. Elasticsearch defines many different Painless contexts, including: ingest processor, update, update by query, sort, filter, and so on.

Context          How fields are accessed
Ingest node      via ctx: ctx.field_name
Updates          via ctx._source: ctx._source.field_name

Updates here include _update, _reindex, and _update_by_query. Understanding the context is very important: ctx exposes different fields depending on which API is in use. The examples below analyze some of these situations concretely.
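As a minimal sketch of this difference, the same assignment is written differently in the two contexts (the field name my_field is hypothetical):

// In an ingest processor, fields hang directly off ctx:
"script": {
  "source": "ctx.my_field = 10"
}

// In _update, _update_by_query, and _reindex, fields live under ctx._source:
"script": {
  "source": "ctx._source.my_field = 10"
}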

 

Painless script example

First we create a pipeline called add_field_c. For more information on how to create a pipeline, see my previous article “How to use the Pipeline API to handle events in Elasticsearch”.

Example 1

PUT _ingest/pipeline/add_field_c
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.field_c = (ctx.field_a + ctx.field_b) * params.value",
        "params": {
          "value": 2
        }
      }
    }
  ]
}

This pipeline creates a new field, field_c, whose value is the sum of field_a and field_b multiplied by 2. Let's create a document like this:

PUT test_script/_doc/1?pipeline=add_field_c
{
  "field_a": 10,
  "field_b": 20
}

Here, we use the pipeline add_field_c. The results are as follows:

{
  "took" : 147,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_script",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field_c" : 60,
          "field_a" : 10,
          "field_b" : 20
        }
      }
    ]
  }
}

Obviously, we can see that field_c was created successfully.

Example 2

In ingest, you can use a script processor to modify metadata such as _index and _type. Here is an example of an ingest pipeline that redirects the document to index my_index with type _doc, regardless of what was provided in the original index request:

PUT _ingest/pipeline/my_index
{
    "description": "use index:my_index and type:_doc",
    "processors": [
      {
        "script": {
          "source": """
            ctx._index = 'my_index';
            ctx._type = '_doc';
          """
        }
      }
    ]
}

Using the above pipeline, we can try to index a document to any_index:

PUT any_index/_doc/1?pipeline=my_index
{
  "message": "text"
}

The results are as follows:

{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 89,
  "_primary_term": 1
}

That is, the actual document is stored in my_index, not any_index.

 

Example 3

PUT _ingest/pipeline/blogs_pipeline
{
  "processors": [
    {
      "script": {
        "source": """
          if (ctx.category == "") { 
             ctx.category = "None"
          } 
"""
      }
    }
  ]
}

The pipeline above checks whether the category field is empty and changes it to “None” if so. Again, using the earlier test_script index:

PUT test_script/_doc/2?pipeline=blogs_pipeline
{
  "field_a": 5,
  "field_b": 10,
  "category": ""
}

GET test_script/_doc/2

The results are as follows:

{
  "_index" : "test_script",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 2,
  "_seq_no" : 6,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "field_a" : 5,
    "field_b" : 10,
    "category" : "None"
  }
}

Obviously, it has changed the category field to “None”.

 

Example 4

POST _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fixed"
  },
  "script": {
    "source": """
      if (ctx._source.category == "") {
          ctx._source.category = "None" 
      }
"""
  }
}

The example above writes “None” if category is empty during reindexing. As these two examples show, in a pipeline we work directly on ctx.field, while in updates we work on fields under ctx._source. This is the context difference mentioned earlier.

 

Example 5

PUT test/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

You can add a tag to the tags list with an update script (tags is just a list, so the tag is added even if it already exists):

POST test/_update/1
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

Display result:

GET test/_doc/1
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 4,
  "_seq_no" : 3,
  "_primary_term" : 11,
  "found" : true,
  "_source" : {
    "counter" : 1,
    "tags" : [
      "red",
      "blue"
    ]
  }
}

Shows that “blue” has been successfully added to the tags list.

You can also remove tags from the tags list. The Painless function that deletes a tag takes the array index of the element to be deleted. To avoid possible runtime errors, you first need to ensure that the tag exists. If the list contains duplicates of the tag, this script removes only one match.

POST test/_update/1
{
  "script": {
    "source": "if (ctx._source.tags.contains(params.tag)) { ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) }",
    "lang": "painless",
    "params": {
      "tag": "blue"
    }
  }
}

GET test/_doc/1

Display result:

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 5,
  "_seq_no" : 4,
  "_primary_term" : 11,
  "found" : true,
  "_source" : {
    "counter" : 1,
    "tags" : [
      "red"
    ]
  }
}

“blue” has clearly been removed.

 

A simple Painless scripting exercise

To illustrate how Painless works, let’s load some hockey statistics into the Elasticsearch index:

PUT hockey/_bulk?refresh
{"index":{"_id":1}}
{"first":"johnny","last":"gaudreau","goals":[9,27,1],"assists":[17,46,0],"gp":[26,82,1],"born":"1993/08/13"}
{"index":{"_id":2}}
{"first":"sean","last":"monohan","goals":[7,54,26],"assists":[11,26,13],"gp":[26,82,82],"born":"1994/10/12"}
{"index":{"_id":3}}
{"first":"jiri","last":"hudler","goals":[5,34,36],"assists":[11,62,42],"gp":[24,80,79],"born":"1984/01/04"}
{"index":{"_id":4}}
{"first":"micheal","last":"frolik","goals":[4,6,15],"assists":[8,23,15],"gp":[26,82,82],"born":"1988/02/17"}
{"index":{"_id":5}}
{"first":"sam","last":"bennett","goals":[5,0],"assists":[8,0],"gp":[26,0],"born":"1996/06/20"}
{"index":{"_id":6}}
{"first":"dennis","last":"wideman","goals":[0,26,15],"assists":[11,30,24],"gp":[26,81,82],"born":"1983/03/20"}
{"index":{"_id":7}}
{"first":"david","last":"jones","goals":[7,19,5],"assists":[3,17,4],"gp":[26,45,34],"born":"1984/08/10"}
{"index":{"_id":8}}
{"first":"tj","last":"brodie","goals":[2,14,7],"assists":[8,42,30],"gp":[26,82,82],"born":"1990/06/07"}
{"index":{"_id":39}}
{"first":"mark","last":"giordano","goals":[6,30,15],"assists":[3,30,24],"gp":[26,60,63],"born":"1983/10/03"}
{"index":{"_id":10}}
{"first":"mikael","last":"backlund","goals":[3,15,13],"assists":[6,24,18],"gp":[26,82,82],"born":"1989/03/17"}
{"index":{"_id":11}}
{"first":"joe","last":"colborne","goals":[3,18,13],"assists":[6,20,24],"gp":[26,67,82],"born":"1990/01/30"}

Use Painless to access values in Doc

The values in a document can be accessed through a Map value called doc. For example, the following script computes a player's total number of goals. This example uses the int type and a for loop.

GET hockey/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": """
            int total = 0;
            for (int i = 0; i < doc['goals'].length; ++i) {
              total += doc['goals'][i];
            }
            return total;
          """
        }
      }
    }
  }
}

Here we calculate the _score of each document with a script: each player's goals are summed to form the final _score. We access the field values through the doc map as doc['goals']. The result displayed is:

{
  "took" : 25,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 11,
      "relation" : "eq"
    },
    "max_score" : 87.0,
    "hits" : [
      {
        "_index" : "hockey",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 87.0,
        "_source" : {
          "first" : "sean",
          "last" : "monohan",
          "goals" : [ 7, 54, 26 ],
          "assists" : [ 11, 26, 13 ],
          "gp" : [ 26, 82, 82 ],
          "born" : "1994/10/12"
        }
      },
      ...

Alternatively, you can use script_fields instead of function_score to do the same:

GET hockey/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "total_goals": {
      "script": {
        "lang": "painless",
        "source": """
          int total = 0;
          for (int i = 0; i < doc['goals'].length; ++i) {
            total += doc['goals'][i];
          }
          return total;
        """
      }
    }
  }
}

The result displayed is:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 11,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "hockey",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "total_goals" : [ 37 ]
        }
      },
      {
        "_index" : "hockey",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "total_goals" : [ 87 ]
        }
      },
      ...

The following example uses a Painless script to sort players by their combined first and last names. The names are accessed with doc['first.keyword'].value and doc['last.keyword'].value.

GET hockey/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "string",
      "order": "asc",
      "script": {
        "lang": "painless",
        "source": "doc['first.keyword'].value + ' ' + doc['last.keyword'].value"
      }
    }
  }
}

Check for missing items

doc['field'].value throws an exception if the field is missing from the document.

To check the document for missing values, call doc [‘field’].size() == 0.
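As an illustrative sketch, a script field can guard against missing values this way. It assumes the hockey index above and a nick field that exists on only some documents; the script-field name and the 'none' default are made up:

GET hockey/_search
{
  "script_fields": {
    "nick_or_default": {
      "script": {
        "lang": "painless",
        "source": """
          if (doc['nick.keyword'].size() == 0) {
            return 'none';
          }
          return doc['nick.keyword'].value;
        """
      }
    }
  }
}

Documents without a nick field then return 'none' instead of raising a runtime error.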

 

Update the field with Painless

You can also easily update fields. You can access the original source of the field using ctx._source.<field-name>.

First, let’s look at the player’s source data by submitting the following request:

GET hockey/_search
{
  "stored_fields": [
    "_id",
    "_source"
  ],
  "query": {
    "term": {
      "_id": 1
    }
  }
}

The result displayed is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "hockey",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "first" : "johnny",
          "last" : "gaudreau",
          "goals" : [
            9,
            27,
            1
          ],
          "assists" : [
            17,
            46,
            0
          ],
          "gp" : [
            26,
            82,
            1
          ],
          "born" : "1993/08/13"
        }
      }
    ]
  }
}

To change player 1’s last name to hockey, simply set ctx._source.last to the new value:

POST hockey/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.last = params.last",
    "params": {
      "last": "hockey"
    }
  }
}

You can also add fields to a document. For example, this script restores the player's last name and adds a new field, nick, set to “hockey”.

POST hockey/_update/1
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.last = params.last;
      ctx._source.nick = params.nick
    """,
    "params": {
      "last": "gaudreau",
      "nick": "hockey"
    }
  }
}
Copy the code

The result displayed is:

GET hockey/_doc/1
{
  "_index" : "hockey",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 11,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "first" : "johnny",
    "last" : "gaudreau",
    "goals" : [
      9,
      27,
      1
    ],
    "assists" : [
      17,
      46,
      0
    ],
    "gp" : [
      26,
      82,
      1
    ],
    "born" : "1993/08/13",
    "nick" : "hockey"
  }
}

A new field called “nick” has been added.

We can even manipulate date types to extract information such as the year of birth:

GET hockey/_search
{
  "script_fields": {
    "birth_year": {
      "script": {
        "source": "doc.born.value.year"
      }
    }
  }
}

Display result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 11,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "hockey",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "birth_year" : [ 1994 ]
        }
      },
      ...

 

Script Caching

When Elasticsearch sees a new script for the first time, it compiles it and stores the compiled version in a cache. Both inline and stored scripts are cached, and new scripts can evict cached ones. By default, up to 100 scripts are cached. We can change the cache size with script.cache.max_size, or set an expiration time with script.cache.expire. These settings go in config/elasticsearch.yml.
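As a sketch, such settings might look like this in config/elasticsearch.yml (the values here are illustrative, not recommendations):

# cache up to 200 compiled scripts instead of the default 100
script.cache.max_size: 200
# evict compiled scripts that have not been used for 10 minutes
script.cache.expire: 10m

Both settings are node-level, so they take effect after a node restart.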

 

Script debugging

Scripts that can't be debugged are very hard to work with. A good debugging facility is extremely useful when writing scripts.

Debug.explain

Painless doesn't have a REPL, and while one would be nice someday, it wouldn't tell you the whole story about debugging a Painless script embedded in Elasticsearch, because the data, or “context”, the script can access is so important. Currently, the best way to debug an embedded script is to throw an exception at a chosen location. Although you can throw your own exception (throw new Exception('whatever')), Painless's sandbox prevents you from accessing useful information such as the type of an object. So Painless provides a utility method, Debug.explain, that throws the exception for you. For example, you can use _explain to explore the context available to a script query.

PUT /hockey/_doc/1?refresh
{
  "first": "johnny",
  "last": "gaudreau",
  "goals": [9, 27, 1],
  "assists": [17, 46, 0],
  "gp": [26, 82, 1]
}

POST /hockey/_explain/1
{
  "query": {
    "script": {
      "script": "Debug.explain(doc.goals)"
    }
  }
}

The response shows that the class of doc.goals is org.elasticsearch.index.fielddata.ScriptDocValues.Longs:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "painless_class": "org.elasticsearch.index.fielddata.ScriptDocValues.Longs",
        "to_string": "[1, 9, 27]",
        "java_class": "org.elasticsearch.index.fielddata.ScriptDocValues$Longs",
        "script_stack": [
          "Debug.explain(doc.goals)",
          "                 ^---- HERE"
        ],
        "script": "Debug.explain(doc.goals)",
        "lang": "painless"
      }
    ],
    "type": "script_exception",
    "reason": "runtime error",
    "painless_class": "org.elasticsearch.index.fielddata.ScriptDocValues.Longs",
    "to_string": "[1, 9, 27]",
    "java_class": "org.elasticsearch.index.fielddata.ScriptDocValues$Longs",
    "script_stack": [
      "Debug.explain(doc.goals)",
      "                 ^---- HERE"
    ],
    "script": "Debug.explain(doc.goals)",
    "lang": "painless",
    "caused_by": {
      "type": "painless_explain_error",
      "reason": null
    }
  },
  "status": 400
}

You can use the same technique to see that _source is a LinkedHashMap in the _update API:

POST /hockey/_update/1
{
  "script": "Debug.explain(ctx._source)"
}
Copy the code

The results are as follows:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[localhost][127.0.0.1:9300][indices:data/write/update[s]]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to execute script",
    "caused_by": {
      "type": "script_exception",
      "reason": "runtime error",
      "painless_class": "java.util.LinkedHashMap",
      "to_string": "{first=johnny, last=gaudreau, goals=[9, 27, 1], assists=[17, 46, 0], gp=[26, 82, 1], born=1993/08/13, nick=hockey}",
      "java_class": "java.util.LinkedHashMap",
      "script_stack": [
        "Debug.explain(ctx._source)",
        "              ^---- HERE"
      ],
      "script": "Debug.explain(ctx._source)",
      "lang": "painless",
      "caused_by": {
        "type": "painless_explain_error",
        "reason": null
      }
    }
  },
  "status": 400
}

 
