A bucket is a group of data.

Preparing the data

  • Create the tvs index
curl PUT ip:port/tvs
{
  "mappings": {
    "properties": {
      "price": {
      	"type": "long"
      },
      "color": {
      	"type": "keyword"
      },
      "brand": {
      	"type": "keyword"
      },
      "sold_date": {
      	"type": "date"
      }
    }
  }
}
  • Load the test data
curl POST ip:port/tvs/_bulk
{ "index": {} }
{ "price": 1000, "color": "red", "brand": "changhong", "sold_date": "2016-10-28" }
{ "index": {} }
{ "price": 2000, "color": "red", "brand": "changhong", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 3000, "color": "green", "brand": "millet", "sold_date": "2016-05-18" }
{ "index": {} }
{ "price": 1500, "color": "blue", "brand": "TCL", "sold_date": "2016-07-02" }
{ "index": {} }
{ "price": 1200, "color": "green", "brand": "TCL", "sold_date": "2016-08-19" }
{ "index": {} }
{ "price": 2000, "color": "red", "brand": "changhong", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 8000, "color": "red", "brand": "samsung", "sold_date": "2017-01-01" }
{ "index": {} }
{ "price": 2500, "color": "blue", "brand": "millet", "sold_date": "2017-02-12" }

1. Basic functions

A metric is an aggregate calculation performed on the documents in a bucket, such as count, avg, max, min, or sum.

Grouping by quantity

Find which TV color sells the most:

curl GET ip:port/tvs/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "terms" : { 
              "field" : "color"
            }
        }
    }
}

Request parameters

  • size: set to 0 so that only the aggregation results are returned, without the raw documents the aggregation ran over;
  • aggs: fixed syntax, indicating that a grouping aggregation is to be performed on the data;
  • popular_colors: the name of this aggregation, chosen by the user;
  • terms: groups documents by the values of a field;
  • field: the field to group by.

Response

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "popular_colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "red", "doc_count" : 4 },
        { "key" : "green", "doc_count" : 2 },
        { "key" : "blue", "doc_count" : 2 }
      ]
    }
  }
}

Response fields

  • hits.hits: empty because the request specified size=0; otherwise it would contain the matching documents that were aggregated.
  • aggregations: the aggregation results.
  • popular_colors: the user-defined aggregation name.
  • buckets: the buckets produced for the field we specified.
  • key: the field value shared by the documents in a bucket.
  • doc_count: the number of documents in the bucket.

Counting the documents in each group is not a separate metric: it is the default behavior of Elasticsearch aggregation analysis, and the terms aggregation returns a doc_count for every bucket.
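As a rough local illustration of what the terms aggregation computes, the following Python sketch buckets the sample colors and counts each bucket. It is not Elasticsearch client code; the function name terms_buckets is made up for this example, and the list of colors simply mirrors the test data.

```python
from collections import Counter

# The color field of the eight sample documents.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def terms_buckets(values):
    """Group values into buckets and count the documents in each one."""
    counts = Counter(values)
    # Like the terms aggregation, list the largest buckets first
    # (ties keep first-seen order in this sketch).
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]

buckets = terms_buckets(colors)
print(buckets)
```

The doc_count falls out of the grouping itself, which is why no metric has to be requested to obtain it.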

Statistical mean

Compute the average price of TVs of each color:

curl GET ip:port/tvs/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": { 
            "avg_price": { 
               "avg": {
                  "field": "price" 
               }
            }
         }
      }
   }
}

The nested aggs sits at the same level as terms and performs a metric operation on each bucket.

Response

{
  ...
  "aggregations" : {
    "colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "red", "doc_count" : 4, "avg_price" : { "value" : 3250.0 } },
        { "key" : "green", "doc_count" : 2, "avg_price" : { "value" : 2100.0 } },
        { "key" : "blue", "doc_count" : 2, "avg_price" : { "value" : 2000.0 } }
      ]
    }
  }
}

The avg_price value is the result of the metric calculation: the average of the price field over all documents in each bucket.
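The bucket-then-metric idea can be checked by hand. The sketch below is a local Python approximation of it, assuming the eight sample documents; avg_by_bucket is a hypothetical helper, not part of any Elasticsearch API.

```python
from collections import defaultdict

# (color, price) pairs mirroring the sample documents.
docs = [("red", 1000), ("red", 2000), ("green", 3000), ("blue", 1500),
        ("green", 1200), ("red", 2000), ("red", 8000), ("blue", 2500)]

def avg_by_bucket(pairs):
    """Bucket prices by key, then apply the avg metric to each bucket."""
    grouped = defaultdict(list)
    for key, price in pairs:
        grouped[key].append(price)
    return {key: sum(prices) / len(prices) for key, prices in grouped.items()}

avg_price = avg_by_bucket(docs)
print(avg_price)  # {'red': 3250.0, 'green': 2100.0, 'blue': 2000.0}
```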

Drill-down analysis

Each bucket is grouped again, and the aggregation analysis is performed on the smallest groups. For example: group TVs by color, and then compute the average price of each brand of TV within each color.

curl GET ip:port/tvs/_search 
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "color_avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "brand_avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

The nested group_by_brand aggregation groups by the brand field and computes the average price of each brand.

{
  ...
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "red",
          "doc_count" : 4,
          "color_avg_price" : { "value" : 3250.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "changhong", "doc_count" : 3, "brand_avg_price" : { "value" : 1666.6666666666667 } },
              { "key" : "samsung", "doc_count" : 1, "brand_avg_price" : { "value" : 8000.0 } }
            ]
          }
        },
        {
          "key" : "green",
          "doc_count" : 2,
          "color_avg_price" : { "value" : 2100.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 1, "brand_avg_price" : { "value" : 1200.0 } },
              { "key" : "millet", "doc_count" : 1, "brand_avg_price" : { "value" : 3000.0 } }
            ]
          }
        },
        {
          "key" : "blue",
          "doc_count" : 2,
          "color_avg_price" : { "value" : 2000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 1, "brand_avg_price" : { "value" : 1500.0 } },
              { "key" : "millet", "doc_count" : 1, "brand_avg_price" : { "value" : 2500.0 } }
            ]
          }
        }
      ]
    }
  }
}
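To make the two-level grouping concrete, here is a small Python sketch that drills down from color to brand over the sample documents. It only imitates the result shape locally; drill_down is a name invented for this example.

```python
from collections import defaultdict

# (color, brand, price) triples mirroring the sample documents.
docs = [("red", "changhong", 1000), ("red", "changhong", 2000),
        ("green", "millet", 3000), ("blue", "TCL", 1500),
        ("green", "TCL", 1200), ("red", "changhong", 2000),
        ("red", "samsung", 8000), ("blue", "millet", 2500)]

def drill_down(triples):
    """Group by color, then by brand, and average the price in each leaf group."""
    nested = defaultdict(lambda: defaultdict(list))
    for color, brand, price in triples:
        nested[color][brand].append(price)
    return {color: {brand: sum(p) / len(p) for brand, p in brands.items()}
            for color, brands in nested.items()}

brand_avg = drill_down(docs)
print(brand_avg["red"])
```

The metric is applied only at the innermost level here; in the request above, color_avg_price additionally applies the same metric one level up.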

Statistical extremum

Find the highest and lowest prices for each TV color:

curl GET ip:port/tvs/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "min_price" : { "min": { "field": "price"} }, 
            "max_price" : { "max": { "field": "price"} }
         }
      }
   }
}

2. Interval grouping

The histogram aggregation groups the values of a specified field into fixed-width intervals; if the field to group on is of type date, use the date_histogram aggregation instead.

histogram takes a field and assigns documents to buckets according to the interval each field value falls into:

curl GET ip:port/tvs/_search
{
   "size" : 0,
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 2000
         }
      }
   }
}

The request above groups the price field into intervals of width 2000 and returns:

{
  ...
  "aggregations" : {
    "price" : {
      "buckets" : [
        { "key" : 0.0, "doc_count" : 3 },
        { "key" : 2000.0, "doc_count" : 4 },
        { "key" : 4000.0, "doc_count" : 0 },
        { "key" : 6000.0, "doc_count" : 0 },
        { "key" : 8000.0, "doc_count" : 1 }
      ]
    }
  }
}
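The bucket keys above follow from rounding each value down to a multiple of the interval. The Python sketch below reproduces that rule locally, assuming no offset; histogram_buckets is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter

# The price field of the eight sample documents.
prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]

def histogram_buckets(values, interval):
    """Assign each value to the bucket keyed by the lower bound of its interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    lowest, highest = min(counts), max(counts)
    # Fill the gaps between the lowest and highest bucket with empty buckets.
    return {key: counts.get(key, 0)
            for key in range(lowest, highest + interval, interval)}

price_histogram = histogram_buckets(prices, 2000)
print(price_histogram)  # {0: 3, 2000: 4, 4000: 0, 6000: 0, 8000: 1}
```

Each bucket covers the half-open range from its key up to, but not including, the next key, so a price of exactly 2000 lands in the 2000 bucket.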

After grouping by interval, we can perform metric operations on each bucket, such as calculating the sum:

curl GET ip:port/tvs/_search
{
   "size" : 0,
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 2000
         },
         "aggs":{
            "revenue": {
               "sum": { 
                 "field" : "price"
               }
             }
         }
      }
   }
}

2.1. date_histogram

When the field to group by interval is of the date type, the date_histogram aggregation is required, for example:

curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      }
    }
  }
}

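Parameter descriptions

  • min_doc_count: a date interval is returned only if it contains at least this many documents; with 0, empty intervals are returned as well.
  • extended_bounds: forces buckets to be generated from this start date to this end date, even where no documents fall. It widens the range of buckets but does not filter out documents that lie outside it.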
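How the two parameters interact is easier to see on the sample dates. The following Python sketch imitates a monthly date_histogram locally; it is an approximation under simplified assumptions (dates without time zones), and monthly_histogram and month_index are names invented for this example.

```python
from collections import Counter
from datetime import date

# The sold_date field of the eight sample documents.
sold_dates = [date(2016, 10, 28), date(2016, 11, 5), date(2016, 5, 18),
              date(2016, 7, 2), date(2016, 8, 19), date(2016, 11, 5),
              date(2017, 1, 1), date(2017, 2, 12)]

def month_index(d):
    """Number months consecutively so that a range of months is easy to iterate."""
    return d.year * 12 + (d.month - 1)

def monthly_histogram(dates, bounds_min, bounds_max, min_doc_count=0):
    counts = Counter(month_index(d) for d in dates)
    # extended_bounds widens the bucket range; it never drops documents.
    first = min(min(counts), month_index(bounds_min))
    last = max(max(counts), month_index(bounds_max))
    buckets = {}
    for idx in range(first, last + 1):
        n = counts.get(idx, 0)
        if n >= min_doc_count:  # buckets below the threshold are omitted
            buckets[date(idx // 12, idx % 12 + 1, 1).isoformat()] = n
    return buckets

bounds = (date(2016, 1, 1), date(2017, 12, 31))
all_months = monthly_histogram(sold_dates, *bounds)
busy_months = monthly_histogram(sold_dates, *bounds, min_doc_count=1)
print(len(all_months), len(busy_months))  # 24 7
```

With min_doc_count set to 0, all 24 months between the bounds appear; raising it to 1 leaves only the 7 months that contain sales.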

TV sales per brand per quarter:

curl GET ip:port/tvs/_search 
{
  "size": 0,
  "aggs": {
    "group_by_sold_date": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "quarter",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      },
      "aggs": {
        "total_sum_price": {
          "sum": {
            "field": "price"
          }
        },  
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

This groups by date, drills down into each date group to group by brand, and then applies a sum metric to each subgroup. The results are as follows:

{
  ...
  "aggregations" : {
    "group_by_sold_date" : {
      "buckets" : [
        {
          "key_as_string" : "2016-01-01", "key" : 1451606400000, "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2016-04-01", "key" : 1459468800000, "doc_count" : 1,
          "total_sum_price" : { "value" : 3000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "millet", "doc_count" : 1, "sum_price" : { "value" : 3000.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2016-07-01", "key" : 1467331200000, "doc_count" : 2,
          "total_sum_price" : { "value" : 2700.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 2, "sum_price" : { "value" : 2700.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2016-10-01", "key" : 1475280000000, "doc_count" : 3,
          "total_sum_price" : { "value" : 5000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "changhong", "doc_count" : 3, "sum_price" : { "value" : 5000.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2017-01-01", "key" : 1483228800000, "doc_count" : 2,
          "total_sum_price" : { "value" : 10500.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "samsung", "doc_count" : 1, "sum_price" : { "value" : 8000.0 } },
              { "key" : "millet", "doc_count" : 1, "sum_price" : { "value" : 2500.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2017-04-01", "key" : 1491004800000, "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2017-07-01", "key" : 1498867200000, "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2017-10-01", "key" : 1506816000000, "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        }
      ]
    }
  }
}

3. Aggregation scope

The aggregation scope determines which documents take part in the aggregation analysis. It can be set with a query or a filter.

Aggregation combined with a search query

Every aggregation in Elasticsearch runs within a scope; when combined with a normal search request, that scope is the set of documents matched by the query.

Count sales by color for a specified brand:

curl GET ip:port/tvs/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "millet"
      }
    }
  },
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      }
    }
  }
}

Response

{
  "took" : 34,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "green",
          "doc_count" : 1
        },
        {
          "key" : "blue",
          "doc_count" : 1
        }
      ]
    }
  }
}

Aggregation combined with a filter

Compute the average price of all TVs priced at 1200 or more:

curl GET ip:port/tvs/_search 
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 1200
          }
        }
      }
    }
  },
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

To apply a filter to an individual bucket, use the filter aggregation inside aggs. For example, the average price of changhong TVs sold in the last 1 month, 3 months, and 6 months (in date math, uppercase M means months, while lowercase m means minutes):

curl GET ip:port/tvs/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "changhong"
      }
    }
  },
  "aggs": {
    "recent_1m": {
      "filter": { "range": { "sold_date": { "gte": "now-1M" } } },
      "aggs": {
        "recent_1m_avg_price": { "avg": { "field": "price" } }
      }
    },
    "recent_3m": {
      "filter": { "range": { "sold_date": { "gte": "now-3M" } } },
      "aggs": {
        "recent_3m_avg_price": { "avg": { "field": "price" } }
      }
    },
    "recent_6m": {
      "filter": { "range": { "sold_date": { "gte": "now-6M" } } },
      "aggs": {
        "recent_6m_avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
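Because the three filter buckets differ only in their time window, the request body can be generated rather than written by hand. The Python sketch below builds such a body as a plain dictionary; recent_avg_aggs is a hypothetical helper, and sending the body to a cluster is left out. It assumes the date math unit M (uppercase) for months, since lowercase m denotes minutes.

```python
import json

def recent_avg_aggs(months, date_field="sold_date", value_field="price"):
    """Build one filter bucket per time window, each with an avg sub-aggregation."""
    aggs = {}
    for n in months:
        aggs[f"recent_{n}m"] = {
            # Date math: uppercase M is months (lowercase m would be minutes).
            "filter": {"range": {date_field: {"gte": f"now-{n}M"}}},
            "aggs": {f"recent_{n}m_avg_price": {"avg": {"field": value_field}}},
        }
    return aggs

body = {
    "size": 0,
    "query": {"term": {"brand": {"value": "changhong"}}},
    "aggs": recent_avg_aggs([1, 3, 6]),
}
print(json.dumps(body, indent=2))
```

Each filter narrows only its own bucket, while the top-level query still limits all three buckets to changhong documents.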

4. Global bucket

Sometimes one aggregation request needs to return two results side by side; the global bucket serves this requirement:

  1. Aggregation results within the specified scope;
  2. Aggregation results across all documents, not limited by the scope.

Compare the average price of changhong TVs with the average price across all brands:

curl GET ip:port/tvs/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "changhong"
      }
    }
  },
  "aggs": {
    "single_brand_avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "all_brand_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The query in the request above limits the scope, so single_brand_avg_price is computed only over the documents in that scope, while the global bucket makes the aggregations nested inside it run over all documents.

Response

{
  "took" : 35,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "all" : {
      "doc_count" : 8,
      "all_brand_avg_price" : { "value" : 2650.0 }
    },
    "single_brand_avg_price" : { "value" : 1666.6666666666667 }
  }
}
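The two numbers in the response can be verified against the sample data. This Python sketch computes the scoped and the global average locally; scoped_and_global_avg is a name made up for the illustration and says nothing about how Elasticsearch implements the global bucket.

```python
# (brand, price) pairs mirroring the sample documents.
docs = [("changhong", 1000), ("changhong", 2000), ("millet", 3000), ("TCL", 1500),
        ("TCL", 1200), ("changhong", 2000), ("samsung", 8000), ("millet", 2500)]

def avg(prices):
    return sum(prices) / len(prices)

def scoped_and_global_avg(pairs, brand):
    """Return (average within the query scope, average over every document)."""
    scoped = [price for b, price in pairs if b == brand]  # what the query matches
    everything = [price for _, price in pairs]            # what the global bucket sees
    return avg(scoped), avg(everything)

single_brand_avg, all_brand_avg = scoped_and_global_avg(docs, "changhong")
print(single_brand_avg, all_brand_avg)
```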

Generally speaking, some metric operations in aggregation analysis, such as max, min, and avg, are easy to run in parallel across multiple shards. Once the coordinating node has received the results from each shard, it needs only a simple calculation to produce the final result. Taking max as an example:

  1. The coordinating node broadcasts the request to all shards;
  2. Each shard computes its local maximum of the field and returns it to the coordinating node;
  3. The coordinating node picks the largest of the values returned by the shards, which is the final maximum.

These algorithms scale horizontally as machines are added: they need no coordination (machines do not exchange intermediate results) and consume very little memory (a single integer represents the maximum).
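The three steps can be sketched in a few lines of Python. The split of the sample prices across three shards is hypothetical, as are the function names; the point is only that each shard returns a tiny partial result that the coordinating node can merge. For avg, the partial result is a sum and a count rather than a local average.

```python
# Hypothetical split of the eight sample prices across three shards.
shards = [[1000, 2000, 3000], [1500, 1200, 2000], [8000, 2500]]

def shard_partial(prices):
    """Step 2: each shard reduces its own documents to a small partial result."""
    return {"max": max(prices), "sum": sum(prices), "count": len(prices)}

def coordinate(shard_lists):
    """Steps 1 and 3: fan out to every shard, then merge the partial results."""
    partials = [shard_partial(prices) for prices in shard_lists]
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"max": max(p["max"] for p in partials), "avg": total / count}

result = coordinate(shards)
print(result)  # {'max': 8000, 'avg': 2650.0}
```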

However, other algorithms are hard to execute in parallel, such as count(DISTINCT). Filtering out the distinct values on each shard is not enough, because the same value can appear on several shards: the coordinating node has to collect the values returned by every shard and deduplicate them in memory. This can be time-consuming when the data volume is very large.

For this reason, Elasticsearch uses approximation algorithms to improve performance. They give results that are close to, but not 100%, accurate: a small estimation error is accepted in return for fast execution and minimal memory consumption.
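Why distinct counts resist the simple merge used for max can be shown with the sample brands. In this Python sketch the shard layout is hypothetical and both functions are illustrative only; it demonstrates the exact approach, not the approximation algorithm Elasticsearch uses.

```python
# Hypothetical split of the sample brand values across three shards.
shards = [["changhong", "changhong", "millet"],
          ["TCL", "TCL", "changhong"],
          ["samsung", "millet"]]

def naive_distinct(shard_lists):
    """Wrong: summing per-shard distinct counts double-counts shared values."""
    return sum(len(set(values)) for values in shard_lists)

def exact_distinct(shard_lists):
    """Right but costly: the coordinator holds every distinct value in memory."""
    merged = set()
    for values in shard_lists:
        merged |= set(values)
    return len(merged)

print(naive_distinct(shards), exact_distinct(shards))  # 6 4
```

The exact merge needs memory proportional to the number of distinct values, which is the cost that an approximate count, such as the one returned by the cardinality aggregation, trades away for a small error.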