Developers often don’t get the results they expect from default aggregations, and basic aggregations have their limitations. One example is changing the offset of a histogram: since Elasticsearch does not provide this functionality natively here, we can use scripts to get the results we need. We will also cover other aggregation tasks that rely on scripts. My previous article “Getting Started with Elasticsearch (3)” touched on some of this; in today’s article, we’ll take it a step further.

There are two articles in this series:

  • Script aggregation (1)
  • Script aggregation (2)

 

Preparing the data

To support the following examples, we index a set of documents containing details about the employees of a fictional company, including each employee’s name, age, position, and salary. We create an employee index:

PUT /employee/_doc/1
{
  "name": "Bob",
  "age": 35,
  "about": "Bob joined the company as a full time technology consultant in the year 2012",
  "position": "consultant",
  "salary": 5000,
  "experience": "3-years",
  "married": 1,
  "fullTime": true
}
PUT /employee/_doc/2
{
  "name": "Jack",
  "age": 30,
  "about": "Jack joined the company as a part time management consultant in the year 2013",
  "position": "Management consultant",
  "salary": 3000,
  "experience": "3-years",
  "married": 0,
  "fullTime": false
}
PUT /employee/_doc/3
{
  "name": "Tom",
  "age": 33,
  "about": "Tom is serving as the operations manager of the firm from the year 2011",
  "position": "Operations manager",
  "salary": 7000,
  "experience": "7-years",
  "married": 1,
  "fullTime": true
}

We run the above three commands in Kibana’s Dev Tools. This creates an index called employee. For brevity, we index only three documents in this tutorial; of course, you can change the values in these examples and index more documents.

 

Use scripts to change default histogram values

Suppose our supervisor needs a histogram showing how many employees fall into each given pay interval (bucket), with the histogram divided into intervals of 3,000. We now have the interval and the data to perform the histogram aggregation. There is one problem with this aggregation: with an interval of 3,000, a plain bucketing produces cut-off points at 3,000, 6,000, 9,000, and so on. A normal histogram aggregation looks like this:

GET employee/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "histogram": {
        "field": "salary",
        "interval": 3000
      }
    }
  }
}

The aggregation above shows the result:

"Aggregations" : {" NAME ": {" buckets" : [{" key ": 3000.0," doc_count ": 2}, {" key" : 6000.0, "doc_count" : 1}]}Copy the code

The above shows two documents in the 3,000–6,000 bucket and only one document above 6,000.

Our supervisor clarifies that she needs the breakdown so we know who earns in the range 0–3,000, then 3,000–6,000, and so on. This amounts to offsetting the histogram values, which cannot be done with a normal Elasticsearch aggregation alone. The point of this article, of course, is that we can do it using scripts.

This is the query that can help us:

GET employee/_search
{
  "size": 0, 
  "aggs": {
    "histogramData": {
      "histogram": {
        "field": "salary",
        "interval": 3000,
        "script": "_value + 2000"
      }
    }
  }
}

The script adds 2,000 to each value before Elasticsearch applies the interval (3,000) to compute the buckets. A salary of 3,000 becomes 5,000 and lands in the bucket with key 3,000; 5,000 becomes 7,000 (bucket 6,000); and 7,000 becomes 9,000, which the script places in the last bucket, 9,000. We can now separate the employees cleanly into the required salary intervals.

The result is as follows:

"Aggregations" : {"buckets" : [{"key" : 3000.0, "doc_count" : 1}, {"key" : 6000.0, "doc_count" : 1}, {" key ": 9000.0," doc_count ": 1}]}Copy the code

For more on using this kind of offset, you can refer to my previous article on fine-tuning statistics with Kibana’s advanced aggregation settings.

 

Use scripts to split values in fields

Next, we will extract only specific data from a specific field for aggregation. The documents in the index contain an experience field with values of the form “x-years” (where “x” is a number).

If we tried a regular terms aggregation, we would get buckets with names like “3-years”, “4-years”, and “7-years”. Suppose, however, that we need the buckets to be named “3”, “4”, and “7”. We can split the value on the character “-” and use only the first element of the split. To run a script against the field value, the field needs to be of type keyword, so we define a new index:

PUT employee_new
{
  "mappings": {
    "properties": {
      "experience": {
        "type": "keyword"
      }
    }
  }
}

We import the previous employee data using the reindex method:

POST _reindex
{
  "source": {
    "index": "employee"
  },
  "dest": {
    "index": "employee_new"
  }
}

Our aggregated script is as follows:

GET employee_new/_search
{
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "experience",
        "script": "_value.substring(0, 1)"
      }
    }
  }
}

The result of the above aggregation is:

  "aggregations" : {
    "urls" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "3",
          "doc_count" : 2
        },
        {
          "key" : "7",
          "doc_count" : 1
        }
      ]
    }
  }

Above, we see the keys “3” and “7”, that is, 3 years and 7 years. We can also write the aggregation as:

GET employee_new/_search
{
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "script": "doc['experience'].value.substring(0, 1)"
      }
    }
  }
}

This produces the same result as the previous query.
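The string manipulation both scripts perform can be sketched in plain Python (the values are hard-coded from our three documents):

```python
values = ["3-years", "3-years", "7-years"]  # experience values of the 3 docs

# Equivalent of substring(0, 1): take the first character.
keys_substring = [v[0:1] for v in values]

# A split-based alternative: take everything before the "-".
# (Unlike substring(0, 1), this still works if the year count
# ever reaches two digits, e.g. "10-years".)
keys_split = [v.split("-")[0] for v in values]

print(keys_substring)  # ['3', '3', '7']
print(keys_split)      # ['3', '3', '7']
```

Both approaches yield the bucket keys “3” (doc_count 2) and “7” (doc_count 1) seen above; the split variant is the safer choice if the field values may grow beyond a single digit.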

 

Use scripts to perform term aggregation on multiple fields

When using the terms aggregation, we may benefit from aggregating over multiple fields. Suppose we want a terms aggregation on the about field. By default, that aggregation only gives us document counts for the most popular terms in that one field. We may also need a separate terms aggregation on the position field, which returns document counts for the top terms of that field. Looking further into this example, we can see why we need a terms aggregation over both fields at once: it matters whenever we need both results combined in the same bucket.

No such option is available in the Elasticsearch terms aggregation, so let’s try scripts, which make this quite simple. This is how we aggregate terms over the about and position fields:

GET employee/_search
{
  "size": 0, 
  "aggs": {
    "union_demo": {
      "terms": {
        "size": 30,
        "script": "doc['about.keyword'].value + ' ' + doc['position.keyword'].value"
      }
    }
  }
}

Notice that here we pass a size parameter set to 30. We do this because the about field contains many distinct values, so the number of buckets can exceed 10, and by default a terms aggregation returns only the top 10. The script in this query shows us the combined terms aggregated from the two fields.

The result of the above aggregation is:

  "aggregations" : {
    "union_demo" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Bob joined the company as a full time technology consultant in the year 2012 consultant",
          "doc_count" : 1
        },
        {
          "key" : "Jack joined the company as a part time management consultant in the year 2013 Management consultant",
          "doc_count" : 1
        },
        {
          "key" : "Tom is serving as the operations manager of the firm from the year 2011 Operations manager",
          "doc_count" : 1
        }
      ]
    }
  }

Above, we combined the two fields, about and position, into a single key for the aggregation.
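The combined key the script builds behaves like a composite bucket key; the counting logic can be sketched in plain Python (the field values below are abbreviated stand-ins for our three documents):

```python
from collections import Counter

# Simplified stand-ins for the two fields we combine
# (shortened from the indexed documents).
docs = [
    {"about": "full time technology consultant", "position": "consultant"},
    {"about": "part time management consultant", "position": "Management consultant"},
    {"about": "operations manager of the firm", "position": "Operations manager"},
]

# The script builds one bucket key per document by concatenating
# both field values, so each distinct combination gets its own bucket.
buckets = Counter(d["about"] + " " + d["position"] for d in docs)
print(len(buckets))                           # 3
print(all(c == 1 for c in buckets.values()))  # True
```

Because every employee has a unique about/position combination, each of the three buckets holds exactly one document, matching the aggregation result above.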

 

Conclusion

In this scripting tutorial, you’ve seen how to implement several types of aggregations that are not possible with out-of-the-box Elasticsearch functionality: offsetting the buckets of a histogram aggregation, splitting the values of a field, and performing a terms aggregation over multiple fields. All of this is done through scripting.