These are notes from learning the Query DSL, aimed at Elasticsearch beginners; experienced users can skip ahead ~

Query DSL, also known as query expressions, is a very flexible and expressive query language. It uses a JSON interface to implement rich queries, making your query statements flexible, precise, readable, and easy to debug.

Query and Filtering

Data retrieval in Elasticsearch falls into two categories: query and filtering.

Query: a query scores the retrieved results; the focus is on how well a document matches. For example, when searching for "operations of coffee", how well does a document's title match? Elasticsearch computes the relevance between the query and each document, records the resulting score in the _score field, and finally sorts all retrieved documents by _score.

Filter: a filter does not score the retrieved results; the focus is simply whether a document matches. For example, does the title match "operations of coffee"? It either matches or it does not. Because it only determines a yes/no match, filtering is very fast to compute, and filter results can be cached in memory, so a filter performs much better than a query.
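As an illustrative sketch (the title field and search text here are hypothetical), the same match clause behaves differently depending on context: in query context it contributes to _score, while wrapped in a bool filter it only decides match or no-match and can be cached:

```
# Query context: relevance is computed and stored in _score
GET /_search
{
  "query": {
    "match": { "title": "operations of coffee" }
  }
}

# Filter context: no scoring, result is cacheable
GET /_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "title": "operations of coffee" } }
      ]
    }
  }
}
```

In the second form every matching document comes back with the same neutral score, which is exactly what you want for yes/no conditions.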

A simple query

The simplest DSL query expression is as follows:

```
GET /_search
{
  "query": { "match_all": {} }
}
```

/_search searches the contents of all indexes in the entire ES cluster

query is the query keyword, and aggs is the aggregation keyword

match_all matches all documents; match_none matches no documents

Return result:

```
{
  "took": 6729,
  "timed_out": false,
  "num_reduce_phases": 6,
  "_shards": {
    "total": 2611,
    "successful": 2611,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7662397664,
    "max_score": 1,
    "hits": [
      {
        "_index": ".kibana",
        "_type": "doc",
        "_id": "url:ec540365d822e8955cf2fa085db189c2",
        "_score": 1,
        "_source": {
          "type": "url",
          "updated_at": "2018-05-09T07:19:46.075Z",
          "url": {
            "url": "/app/kibana",
            "accessCount": 0,
            "createDate": "2018-05-09T07:19:46.075Z",
            "accessDate": "2018-05-09T07:19:46.075Z"
          }
        }
      },
      ... other results omitted ...
    ]
  }
}
```

took: how many milliseconds it took to execute the entire search request

Timed_out: indicates whether the query times out

Note that when timed_out is true, a result is still returned: it is the data ES had gathered when the request timed out, so the returned data may be incomplete.

Also, when timed_out is true, even though the connection is closed, the query does not stop in the background; it continues to run.
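A sketch of how this comes up in practice: you can bound how long a search is allowed to run with the timeout parameter in the request body (the 2s value here is an arbitrary example). When the limit is reached, timed_out comes back true and partial results are returned:

```
GET /_search
{
  "timeout": "2s",
  "query": { "match_all": {} }
}
```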

_shards: information about the shards that participated in the query, including the number of successful and failed shards

hits: information about the matched documents. total is the total number of matched documents, and max_score is the largest _score among them

By default, the hits array inside hits contains the first ten documents of the query result; each document includes the _index, _type, _id, _score, and _source fields

By default, the resulting documents are sorted in descending order by relevancy (_score), which means that the documents with the highest relevancy are returned first. Document relevancy refers to the degree to which the document content matches the query criteria, as described in query and filtering above

Specify the index

The above query searches all indexes in ES, but we usually only need to search one or a few of them; searching everything wastes resources. We can specify indexes in ES in the following ways:

  1. Specify a single fixed index, with ops-coffee-nginx-2019.05.15 as the index name

```
GET /ops-coffee-nginx-2019.05.15/_search
```

  2. Specify multiple fixed index names, separated by commas

```
GET /ops-coffee-nginx-2019.05.15,ops-coffee-nginx-2019.05.14/_search
```
  3. Use an asterisk wildcard to search the data under all matching indexes

```
GET /ops-coffee-nginx-*/_search
```

It is also possible to separate multiple matching indexes with commas

Paging query

The hits array only shows 10 documents by default. What if we want to see results 10 through 20, or 20 through 30? ES provides the size and from parameters for this.

size: the number of results to return in one request (i.e. the number of documents in hits); default 10

from: the number of initial results to skip, i.e. the offset into the result set; default 0

```
GET /ops-coffee-nginx-2019.05.15/_search
{
  "size": 5,
  "from": 10,
  "query": { "match_all": {} }
}
```

The above query will query all the data under the index ops-coffee-nginx-2019.05.15 and will show the data for documents 11 through 15 in hits
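As a quick sanity check on the arithmetic, here is a tiny helper (hypothetical, not part of any ES client) that converts a 1-based page number into the from/size pair used above:

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into Elasticsearch from/size values."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Skip everything before the requested page, then take one page of results.
    return {"from": (page - 1) * page_size, "size": page_size}
```

For example, page 3 with a page size of 5 yields from=10 and size=5, matching the query above that shows documents 11 through 15.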

The full text query

The match_all used above is one query keyword; it matches all records. ES has several other commonly used full-text query keywords:

match

The simplest kind of query. The following example finds all records whose host is ops-coffee.cn:

```
GET /ops-coffee-2019.05.15/_search
{
  "query": {
    "match": {
      "host": "ops-coffee.cn"
    }
  }
}
```

multi_match

Performs the same match query on multiple fields. The following example finds records containing ops-coffee.cn in either the host or http_referer field:

```
GET /ops-coffee-2019.05.15/_search
{
  "query": {
    "multi_match": {
      "query": "ops-coffee.cn",
      "fields": ["host", "http_referer"]
    }
  }
}
```

query_string

Lets you use AND and OR in the query string to build complex queries, for example:

```
GET /ops-coffee-2019.05.15/_search
{
  "query": {
    "query_string": {
      "query": "(a.ops-coffee.cn) OR (b.ops-coffee.cn)",
      "fields": ["host"]
    }
  }
}
```

Find all records whose host is a.ops-coffee.cn or b.ops-coffee.cn

You can also use the following method to combine more conditions to complete more complex query requests

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "query_string": {
      "query": "host:a.ops-coffee.cn OR (host:b.ops-coffee.cn AND status:403)"
    }
  }
}
```

The above represents all records of the query (host = a.ops-coffee.cn) or (host = b.ops-coffee.cn and status = 403)

There is a similar keyword, simple_query_string, which lets you replace the AND/OR of query_string with the + and | symbols.
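As a sketch, the earlier OR example could be rewritten with simple_query_string like this (index and field names reused from the examples above):

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "simple_query_string": {
      "query": "a.ops-coffee.cn | b.ops-coffee.cn",
      "fields": ["host"]
    }
  }
}
```

Here | stands for OR and + stands for AND; unlike query_string, simple_query_string never throws a syntax error on malformed input, which makes it safer for user-supplied search text.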

term

term is used for exact matches; the value can be a number, date, boolean, or a string with not_analyzed set.

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "term": {
      "status": {
        "value": 404
      }
    }
  }
}
```

term does not analyze the input text; it matches the exact value directly. To match several values at once, use terms:

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "terms": {
      "status": [403, 404]
    }
  }
}
```

range

range is used to query numbers or times within a specified interval.

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "range": {
      "status": {
        "gte": 400,
        "lte": 599
      }
    }
  }
}
```

The above searches all data with status between 400 and 599. The operators are gt (greater than), gte (greater than or equal to), lt (less than), and lte (less than or equal to).

When using a date in a range query, pay attention to the date format. Two date formats are officially supported:

  1. Timestamp; note that it is at millisecond granularity

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": 1557676800000,
        "lte": 1557680400000,
        "format": "epoch_millis"
      }
    }
  }
}
```
  2. Date string

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2019-05-13 18:30:00",
        "lte": "2019-05-14",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd",
        "time_zone": "+08:00"
      }
    }
  }
}
```

The date-string form is generally recommended because it reads more clearly. The input date can follow whatever pattern you prefer; you only need to specify the matching pattern in the format field. If there are multiple patterns, separate them with ||, as in the example, though I recommend sticking to a single date format.

If part of the date is missing, the missing part is filled in from the Unix epoch (January 1, 1970). For example, when you specify "format":"dd", a value of "gte":10 is interpreted as 1970-01-10T00:00:00.000Z.

Elasticsearch uses UTC time by default, so we need to set the time zone using time_zone to avoid errors

Combination query

Usually we need to combine a set of conditions to get the result we want; in that case we need the bool query provided by ES.

For example, to query all data whose host is ops-coffee.cn, whose http_x_forwarded_for is 111.18.78.128, and whose status is not 200, we can use the following statement:

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "host": "ops-coffee.cn"
          }
        },
        {
          "match": {
            "http_x_forwarded_for": "111.18.78.128"
          }
        }
      ],
      "must_not": {
        "match": {
          "status": 200
        }
      }
    }
  }
}
```

bool uses four keywords to combine the relationships between query clauses:

Must: similar to AND in SQL, must be included

Must_not: Similar to NOT in SQL, must NOT be included

should: matching any of these conditions increases the _score; failing to match them does not exclude the document. should only affects the _score of the results, not which documents are returned.

filter: similar to must, but no relevance score is computed for the results. In most cases logs have no relevance requirement, so filter is recommended when querying logs.
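As an illustrative sketch of should (reusing the index from above; the status boost is an arbitrary example), the clause below returns all ops-coffee.cn records but ranks those with status 404 higher, without excluding the rest:

```
GET /ops-coffee-2019.05.14/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "host": "ops-coffee.cn" }
      },
      "should": [
        { "match": { "status": 404 } }
      ]
    }
  }
}
```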

In closing

The ES Query DSL is broad and deep. This article is only a basic introduction, with content drawn from the official documentation.

There are many articles online about building an ELK deployment for log collection, but once you have collected the logs, how do you use this trove of data? Online you mostly find general concepts shared by big companies, without the actual implementation process. I am thinking about how to put this data to use; my initial idea is to search ES for traffic data per business or feature and then do trend analysis.


Related articles recommended reading:

  • ELK build MySQL slow log collection platform details
  • ELK log system uses Rsyslog to collect Nginx logs quickly and easily
  • General application log access scheme for ELK log system
  • Logstash Read Kafka data write HDFS
  • Kafka Group is used to realize high availability of Logstash under ELK architecture
  • Filebeat Registry file interpretation
  • Elasticsearch Query Getting started with DSL Query