Exploring your data

Sample data

Now that we know the basics, let’s try things out with a more realistic data set. I have prepared a sample of fake bank customer account documents. Each document has the following structure:

{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "[email protected]",
    "city": "Hobucken",
    "state": "CO"
}

For the curious, this data was generated with www.json-generator.com, so please ignore the actual values and semantics of the data, as they are all randomly generated.

Loading sample data

You can download the sample data (accounts.json) here, save it to your current directory, and load it into your cluster with the following commands:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/account/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"

The response is as follows:

health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  L7sSYV2cQXmu6_4rJWVIww   5   1       1000            0   128.6 KB       128.6 KB

(Note: It is normal for a new index to show yellow or red for a short time, because its shards have to be allocated when the index is created, and this can lag if the cluster is busy.)

This means we successfully bulk indexed 1000 documents into the bank index (under the account type used in the bulk request).
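For reference, the _bulk endpoint expects newline-delimited JSON: an action line, then the document itself on the next line, with a newline at the very end of the body. The sketch below builds such a payload in Python. The helper name to_bulk_payload and the use of account_number as the document ID are assumptions made for illustration; the downloaded file may be laid out differently.

```python
import json

def to_bulk_payload(accounts):
    """Build a newline-delimited _bulk body from a list of account dicts."""
    lines = []
    for account in accounts:
        # Action line: index this document, using account_number as its ID (an assumption).
        lines.append(json.dumps({"index": {"_id": str(account["account_number"])}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(account))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"

# Two shortened sample documents (only a few of the real fields are kept).
sample = [
    {"account_number": 0, "balance": 16623, "state": "CO"},
    {"account_number": 1, "balance": 39225, "state": "IL"},
]
payload = to_bulk_payload(sample)
print(payload)
```

Each document contributes two lines, so a file holding 1000 accounts in this layout would contain 2000 lines.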


Search API

Now let’s try some simple searches. There are two basic ways to run a search: one sends the search parameters through the REST request URI, and the other sends them through the REST request body. The request body approach is more expressive and lets you define the search in a more readable JSON format. We will show one example of the request URI approach, but for the rest of the tutorial we will use the request body exclusively. (For quick tests or simple troubleshooting, the URI approach is faster to type.)

The REST API for search is accessible from the _search endpoint. The following example returns all documents in the bank index:

curl 'localhost:9200/bank/_search?q=*&pretty'

Let’s first dissect the search request. We are searching (the _search endpoint) in the bank index, and the q=* parameter tells Elasticsearch to match all documents in the index. The pretty parameter, as seen earlier, simply tells Elasticsearch to return pretty-printed JSON. The response is as follows:

{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"[email protected]","city":"Dante","state":"TN"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      ...

As for the response, we see the following:

  • took – the time Elasticsearch took to execute the search, in milliseconds
  • timed_out – whether the search timed out
  • _shards – how many shards were searched, and how many of them succeeded or failed
  • hits – the search results
  • hits.total – the total number of documents matching the search criteria
  • hits.hits – the actual array of search results (the first 10 documents by default)
  • _score and max_score – ignore these fields for now
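As a rough illustration of how a client might read these fields, here is a minimal Python sketch that works on an already-parsed response. The function name summarize and the shortened sample response are made up for this example, and it assumes hits.total is a plain number, as in the response shown in this tutorial.

```python
def summarize(response):
    """Pull the commonly used fields out of a parsed _search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "shards_failed": response["_shards"]["failed"],
        "total": hits["total"],
        # Each hit wraps the original document in its _source field.
        "documents": [hit["_source"] for hit in hits["hits"]],
    }

# A trimmed-down response in the shape shown above (fields shortened for illustration).
response = {
    "took": 63,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1000,
        "max_score": 1.0,
        "hits": [
            {"_index": "bank", "_type": "account", "_id": "1", "_score": 1.0,
             "_source": {"account_number": 1, "balance": 39225}},
        ],
    },
}
summary = summarize(response)
print(summary["total"], summary["documents"][0]["balance"])
```

Note that hits.total counts every matching document, while hits.hits only carries the page of results that was actually returned.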

Here is the same search expressed with the request body approach:

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} }
}'

The difference is that instead of passing q=* in the URI, we POST a JSON-style request body to the _search API. We’ll discuss JSON-style queries in the next section.

The response is as follows:

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"[email protected]","city":"Dante","state":"TN"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "13",
      ...

It is important to understand that once you get your search results back, Elasticsearch is completely done with the request: it does not hold any server-side resources or keep an open cursor into your results. This is in stark contrast to many SQL-based storage engines, where you might fetch a subset of the query results first and then go back to the server to fetch the rest through a stateful server-side cursor.
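Because every search is a self-contained HTTP request, it is easy to issue one from any client. The following Python sketch only constructs the request object and does not send it; the helper name build_search_request is hypothetical, and the host is assumed to be the same localhost:9200 used in the curl examples.

```python
import json
import urllib.request

def build_search_request(index, body, host="http://localhost:9200"):
    """Create (but do not send) a POST request for the _search endpoint."""
    return urllib.request.Request(
        url=f"{host}/{index}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request("bank", {"query": {"match_all": {}}})
print(request.get_method(), request.full_url)

# To run the search against a live cluster (not executed here):
# with urllib.request.urlopen(request) as reply:
#     result = json.load(reply)
```

Each call produces a complete request; nothing has to be kept open on the server between two searches.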


Introducing the query language

Elasticsearch provides a JSON-style domain-specific language for executing queries, referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance, but the best way to learn it is to start with a few basic examples.

Going back to the previous example, we executed the following query:

{
  "query": { "match_all": {} }
}

Dissecting the statement above, the query part tells us what our query definition is, and match_all is the type of query we want to run. The match_all query simply searches all documents in the specified index.

In addition to the query parameter, we can pass other parameters to influence the search results. For example, the following match_all query returns only the first document:

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "size": 1
}'

Note that if size is not specified, 10 documents are returned by default.

In the following example, the match_all query returns documents 11 through 20:

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}'

The from parameter (0-based, default 0) specifies which document to start from, and the size parameter specifies how many documents to return starting at that position. This feature is useful when implementing paged search results.
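To make the arithmetic behind paging explicit, here is a small Python sketch that turns a 1-based page number into the from and size values described above. The function name page_body is made up for this example.

```python
def page_body(page, page_size=10):
    """Return a match_all search body for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # 0-based offset of the first hit on the page
        "size": page_size,               # number of hits on the page
    }

print(page_body(2))
```

Page 2 with a page size of 10 yields from 10 and size 10, which is exactly the request that returns documents 11 through 20.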

The following example runs a match_all query, sorts the results by account balance in descending order, and returns the top 10 documents (the default size):

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}'


Perform a search

Now that we’ve seen some basic search parameters, let’s dig further into the Query DSL. Let’s start with the fields returned in the search results. By default, every hit includes the full JSON document, which is called the source (the _source field of the search hit). If we don’t want the entire document returned, we can request only a few fields from it.

The following example shows how search results return only the account_number and balance fields.

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}'

Note that the example above simply trims the _source field. Each hit still has a _source field, but it contains only account_number and balance.

If you come from a SQL background, this is similar in concept to the field list in a SQL SELECT ... FROM statement.
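The effect of listing fields in _source can be pictured with a simplified Python model. This is only an illustration of the idea for top-level field names; the helper name filter_source is hypothetical, and the sketch does not cover wildcard patterns or nested paths.

```python
def filter_source(document, fields):
    """Keep only the requested top-level fields, as _source filtering would."""
    return {name: document[name] for name in fields if name in document}

# A shortened account document (only a few of the real fields are kept).
account = {"account_number": 1, "balance": 39225, "firstname": "Amber", "state": "IL"}
print(filter_source(account, ["account_number", "balance"]))
```

The document stored in the index is unchanged; only the copy returned in the hit is trimmed.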

Let’s move on to the query part. We have seen how match_all is used to match all documents in an index. Now let’s introduce a new query type, the match query, which can be thought of as the basic fielded search query (a search against a specific field or set of fields).

The following example returns the document with account number 20:

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "account_number": 20 } }
}'

The following example returns all documents whose address contains the term "mill":

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "address": "mill" } }
}'

The following example returns all documents whose addresses contain Mill or Lane.

curl -XPOST 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "address": "mill lane" } }
}'

(Note: from here on, the curl command is omitted and only the request body is shown.)
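The "mill or lane" behaviour of the match query can be sketched with a deliberately simplified Python model. It assumes lowercasing and whitespace splitting as a stand-in for the real text analysis, ignores scoring entirely, and uses partly made-up addresses; the function names tokenize and match are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def match(document, field, query_text):
    """True if the field shares at least one term with the query text."""
    field_terms = set(tokenize(str(document[field])))
    return any(term in field_terms for term in tokenize(query_text))

# Illustrative documents; "198 Mill Lane" is invented for this example.
accounts = [
    {"account_number": 1, "address": "880 Holmes Lane"},
    {"account_number": 2, "address": "198 Mill Lane"},
    {"account_number": 3, "address": "244 Columbus Place"},
]
hits = [a["account_number"] for a in accounts if match(a, "address", "mill lane")]
print(hits)
```

Accounts 1 and 2 match because each address shares at least one term with the query text, while account 3 shares none.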

The following example uses a variant of the match query (match_phrase) and returns all documents whose address contains the phrase "mill lane":

{
  "query": { "match_phrase": { "address": "mill lane" } }
}

Now let’s introduce the bool (Boolean) query. The bool query lets us compose smaller queries into bigger ones using Boolean logic.

The following example combines two match queries and returns all documents whose address contains both "mill" and "lane":

{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the example above, the bool must clause specifies a list of queries that must all be true for a document to be considered a match.

Instead, the following example combines two match queries to return all documents whose addresses contain Mill or Lane.

{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the example above, the bool should clause specifies a list of queries, at least one of which must be true for a document to be considered a match.

The following example combines two match queries and returns all documents whose address contains neither "mill" nor "lane":

{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the example above, the bool must_not clause specifies a list of queries, none of which may be true for a document to be considered a match.

We can also combine must, should, and must_not clauses inside a single bool query. Furthermore, we can nest bool queries inside any of these clauses to mimic complex multi-level Boolean logic.
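The way the three clause types combine can be modelled in a few lines of Python. This is a simplified sketch that ignores scoring and the filter clause; the names contains and bool_query and the sample accounts are made up for illustration. It assumes that should clauses are only required to match when there is no must clause.

```python
def contains(field, term):
    """Build a predicate: does the field contain the term (case-insensitive)?"""
    return lambda doc: term.lower() in str(doc[field]).lower().split()

def bool_query(must=(), should=(), must_not=()):
    """Combine predicates the way must / should / must_not are described above."""
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        # With no must clauses, at least one should clause has to match.
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches

# Illustrative documents, not taken from the sample data set.
accounts = [
    {"age": 40, "state": "ID", "address": "12 Mill Lane"},
    {"age": 40, "state": "TX", "address": "7 Oak Street"},
    {"age": 29, "state": "CO", "address": "244 Columbus Place"},
]
query = bool_query(must=[contains("age", "40")], must_not=[contains("state", "ID")])
print([a["state"] for a in accounts if query(a)])
```

Because bool_query returns an ordinary predicate, one bool_query can be passed as a clause of another, which mirrors how bool queries nest.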

(Note: this is similar to multi-level nesting of AND and OR in SQL, but the JSON form is more verbose. Avoid overly complex logic; otherwise, tracking down a business logic problem through an N-level nested query DSL becomes painful.)

The following example returns the accounts of everyone who is 40 years old but does not live in ID (Idaho):

{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

Filtering

In the previous section we skipped over a small detail called the document score (the _score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query we specified. The higher the score, the more relevant the document; the lower the score, the less relevant the document.

However, queries don’t always need to produce scores, especially if they are only used to “filter” a collection of documents. Elasticsearch detects this and automatically optimizes query execution to avoid useless score calculations.

The bool query that we introduced in the previous section also supports filter clauses, which let us use a query to restrict the documents matched by the other clauses without changing how scores are computed. As an example, let’s introduce the range query, which lets us filter documents by a range of values. It is generally used for numeric or date filtering.

The following example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance greater than or equal to 20000 and less than or equal to 30000.

{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

Looking at the example above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and filter parts. In this case a range filter makes perfect sense, because every document that falls within the range matches equally: no document is more relevant than another.
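When the bounds come from user input, it can be convenient to generate this request body in code. A minimal Python sketch follows; the function name balance_range_query is hypothetical, and it simply reproduces the JSON structure shown above with both bounds inclusive.

```python
import json

def balance_range_query(low, high):
    """Build a bool query that keeps accounts with low <= balance <= high."""
    if low > high:
        raise ValueError("low must not exceed high")
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                # The filter clause restricts the result set without affecting scores.
                "filter": {"range": {"balance": {"gte": low, "lte": high}}},
            }
        }
    }

body = balance_range_query(20000, 30000)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body of a _search call in the same way as the handwritten examples.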

In addition to match_all, match, bool, and range, there are many other query types available, which we won’t go into here. Now that we have a basic understanding of how queries work, it should not be too difficult to apply this knowledge when learning and experimenting with the other query types.


Aggregations

Aggregations provide the ability to group data and extract statistics from it. The easiest way to think about them is as a rough equivalent of SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, a single search can return the matching documents and, separately from those hits, the aggregated results in the same response. This is powerful because you can run queries and multiple aggregations and get the results of both in one shot through a concise API, avoiding extra network round trips.

To start with, this example groups all the accounts by state and returns the top 10 states (the default) sorted by account count in descending order (also the default):

{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      }
    }
  }
}

In SQL, the aggregation above is similar in concept to:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC

The response is as follows:

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "buckets" : [ {
        "key" : "al",
        "doc_count" : 21
      }, {
        "key" : "tx",
        "doc_count" : 17
      }, {
        "key" : "id",
        "doc_count" : 15
      }, {
        "key" : "ma",
        "doc_count" : 15
      }, {
        "key" : "md",
        "doc_count" : 15
      }, {
        "key" : "pa",
        "doc_count" : 15
      }, {
        "key" : "dc",
        "doc_count" : 14
      }, {
        "key" : "me",
        "doc_count" : 14
      }, {
        "key" : "mo",
        "doc_count" : 14
      }, {
        "key" : "nd",
        "doc_count" : 14
      } ]
    }
  }
}

We can see that there are 21 accounts in al, followed by 17 in tx, 15 in id, and so on.

Notice that we set size to 0 because we only want to focus on the aggregated results in the response.
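A client usually wants the buckets in a more convenient shape than the raw response. The following Python sketch converts them into a plain dictionary; the function name bucket_counts is made up, and the sample response is a shortened copy of the one shown above.

```python
def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for a terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

# First three buckets of the response above; hits is empty because size is 0.
response = {
    "hits": {"total": 1000, "max_score": 0.0, "hits": []},
    "aggregations": {
        "group_by_state": {
            "buckets": [
                {"key": "al", "doc_count": 21},
                {"key": "tx", "doc_count": 17},
                {"key": "id", "doc_count": 15},
            ]
        }
    },
}
counts = bucket_counts(response, "group_by_state")
print(counts)
```

The aggregation name used to read the result (group_by_state) is the same name chosen in the request.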

Building on the previous aggregation, the following example also calculates the average account balance for each state:

{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all aggregations: you can nest aggregations inside aggregations arbitrarily to extract the summaries you need from your data.
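What the nested aggregation computes can be pictured with a small in-memory Python equivalent. This is only a conceptual model on a handful of documents, not how Elasticsearch executes the aggregation; the function name average_balance_by_state is hypothetical and one of the balances is invented.

```python
from collections import defaultdict

def average_balance_by_state(accounts):
    """Group accounts by state and compute count and average balance per group."""
    groups = defaultdict(list)
    for account in accounts:
        groups[account["state"]].append(account["balance"])
    return {
        state: {
            "doc_count": len(balances),
            "average_balance": sum(balances) / len(balances),
        }
        for state, balances in groups.items()
    }

accounts = [
    {"state": "IL", "balance": 39225},
    {"state": "TN", "balance": 5686},
    {"state": "IL", "balance": 1000},  # made-up value for illustration
]
result = average_balance_by_state(accounts)
print(result)
```

The outer grouping corresponds to the terms aggregation, and the per-group average corresponds to the nested avg aggregation.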

Building on the previous aggregation, we can now sort the states by average balance in descending order:

{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

The following example shows how to group accounts by age bracket (20-29, 30-39, and 40-49), then by gender, and finally calculate the average account balance per age bracket, per gender:

{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
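The ranges in this request treat from as inclusive and to as exclusive, which is why "to": 30 corresponds to the 20-29 bracket. The Python sketch below generates the same ranges list and shows which bracket a given age falls into. The function names and the "from-to" label format are made up for illustration and are not the keys Elasticsearch returns.

```python
def age_ranges(start=20, stop=50, width=10):
    """Build the ranges list for the range aggregation: 20-30, 30-40, 40-50."""
    return [{"from": low, "to": low + width} for low in range(start, stop, width)]

def bucket_for(age, ranges):
    """Return a 'from-to' label for the range containing age, or None."""
    for r in ranges:
        if r["from"] <= age < r["to"]:  # from is inclusive, to is exclusive
            return f'{r["from"]}-{r["to"]}'
    return None

ranges = age_ranges()
print(ranges)
print(bucket_for(29, ranges), bucket_for(30, ranges), bucket_for(50, ranges))
```

An age of exactly 30 lands in the second bracket, and an age of 50 falls outside all three.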

There are many aggregation methods that are not covered here, but if you want to explore them further, the aggregation reference guide is a good place to start.


Conclusion

Elasticsearch is both a simple and a complex product. So far we’ve learned the basics of what it is, looked at some of its internals, and seen how to work with it using the REST API. Hopefully this tutorial has given you a better understanding of Elasticsearch and, more importantly, inspired you to experiment with more of its features.
