An introduction to Elasticsearch

The introduction

Elasticsearch (ES) is a highly scalable, open-source full-text search and analytics engine. It searches in near real time and is commonly used as the search engine behind applications that need complex search features.

Usage scenarios of ES:

  • Online stores: searching for products.

  • Log analysis with ES, Logstash, and Kibana (the ELK stack).

The rest of this tutorial walks you through installing ES, starting it, browsing a cluster, and performing CRUD operations on data. By the end you should have a good working understanding of ES, and hopefully you’ll find it inspiring.

The basic concept

ES has several core concepts, and understanding them from the beginning will help you a lot later on.

Near real time (NRT)

ES is a near-real-time search engine, meaning there is only a slight delay (about one second) between the time a document is indexed and the time it becomes searchable.

The cluster

Multiple ES servers can form a cluster, and searches can be run against any node. The cluster has a default name, “elasticsearch” (which can be changed), and this name must be unique within your environment, because nodes join a cluster by its name.

Make sure that you do not reuse the same cluster name in different environments, or nodes may join the wrong cluster.

Node

A node is a single server that is part of a cluster, stores data, and participates in the cluster’s indexing and search capabilities. Like a cluster, a node is identified by a name, which by default is a random universally unique identifier (UUID) assigned to the node at startup. If you do not want the defaults, you can define any node name you want. The name matters for administration, because it lets you determine which servers in your network correspond to which nodes in the Elasticsearch cluster.
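For example, both the cluster name and the node name can be set in config/elasticsearch.yml. This is only a minimal sketch; the values below are placeholders:

# config/elasticsearch.yml (illustrative values)
cluster.name: my_cluster_name
node.name: my_node_name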

The index

An index is a collection of documents with some similar characteristics. For example, you could have an index for customer data, another index for product catalogs, and another index for order data. An index is identified by a name (which must be all lowercase) that is used to refer to the index when indexing, searching, updating, and deleting documents within it. You can define as many indexes as you want in a single cluster.

If you are familiar with MySQL, you can, for now, think of an index as roughly analogous to a database in MySQL.

Type

Historically, an index could have multiple types. For example, an index could have an article type, a user type, and a comment type. Multiple types can no longer be created in an index, and the whole concept of types will be removed in a future release.

Indexes created in Elasticsearch 7.0.0 or later no longer accept a _default_ mapping. Indexes created in 6.x will continue to work in Elasticsearch 6.x as before. The use of types in the APIs is deprecated in 7.0, with breaking changes to the index creation, put mapping, get mapping, put template, get template, and get field mapping APIs.

The document

A document is the basic unit of information that can be indexed. For example, you can have a document for a customer, a document for a product, and, of course, a document for an order. Documents are expressed in JavaScript Object Notation (JSON), a ubiquitous format for data exchange on the Internet.

You can store as many documents as you want in an index/type. Note that although a document physically resides in an index, a document must in fact be indexed into (assigned to) an index and a type.

Shards and replicas

Indexes may store large amounts of data that may exceed the hardware limits of a single node. For example, a single index of 1 billion documents occupying 1TB of disk space might not fit on a single node’s disk or be too slow to satisfy a single node’s search requests on its own.

To solve this problem, ElasticSearch provides the ability to subdivide an index into multiple fragments, called shards. When creating an index, you only need to define the number of shards required. Each shard is itself a fully functional and independent “index” that can be hosted on any node in the cluster.

Why sharding?

  • It allows you to horizontally split/scale your content volume

  • It allows you to distribute and operate in parallel across shards (possibly on multiple nodes), thereby improving performance/throughput

The mechanism for how shards are allocated and how their documents are aggregated back into search requests is completely managed by ElasticSearch and is transparent to you as the user.

In a network/cloud environment where failures can occur at any time, a failover mechanism is strongly recommended in case a shard or node somehow goes offline or disappears for whatever reason. For this purpose, Elasticsearch lets you make one or more copies of an index’s shards; these copies are called replica shards, or replicas for short.

Why replicas?

  • High availability in the event of a shard/node failure. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard it was copied from.

  • Allows you to scale the search volume/throughput because searches can be performed in parallel on all replicas.

In summary, each index can be split into multiple shards. An index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the originals that were copied from) and replica shards (the copies of the primaries).

You can define the number of shards and replicas per index at index creation time. You can also change the number of replicas dynamically at any time after the index is created. You can change the number of shards of an existing index using the shrink and split APIs, but it is recommended to think through the number of shards and replicas up front, when creating the index.
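For example, the number of replicas can be changed on a live index through the index settings API. This is only a sketch; my_index is a placeholder index name:

curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'{ "index": { "number_of_replicas": 2 } }'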

By default, each index in Elasticsearch is allocated one primary shard and one replica, which means that if you have at least two nodes in the cluster, the index will have one primary shard and one replica shard (one complete replica), for a total of two shards per index.

Each Elasticsearch shard is a Lucene index, and a single Lucene index can hold only a limited number of documents: as of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat APIs (covered later).
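As a quick sketch of that kind of monitoring, the _cat/shards endpoint lists each shard along with its document count and size on disk:

curl -X GET "localhost:9200/_cat/shards?v"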

Now let’s get to the fun part…

The installation

Binaries are available from www.elastic.co/downloads, along with all past releases. Platform-specific archives are available for each release for Windows, Linux, and macOS, as well as DEB and RPM packages for Linux and an MSI installer package for Windows.

Linux

For simplicity, let’s use the tar package for the installation.

Download the Elasticsearch 7.1.1 Linux tar archive:

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.1.1-linux-x86_64.tar.gz

Unpack it:

tar -xvf elasticsearch-7.1.1-linux-x86_64.tar.gz

Once extracted, a set of files and folders will be created in the current directory; then we’ll go into the bin directory.

cd elasticsearch-7.1.1/bin

Start the node (and a single-node cluster):

./elasticsearch

Windows

For Windows users, we recommend using the MSI installation package. The package includes a graphical user interface (GUI) to guide you through the installation process.

Download address

Then double-click the downloaded file to launch the GUI. On the first screen, select the installation directory:

Then choose whether to install ElasticSearch as a service or manually start ElasticSearch as needed. To be consistent with the Linux example, choose not to install as a service:

For the configuration, just leave the defaults:

Uncheck all plug-ins to not install any:

Click the Install button to install ElasticSearch:

By default, ElasticSearch will be installed in %ProgramFiles%\Elastic\ElasticSearch. Go to the installation directory, open the command prompt, and enter

.\elasticsearch.exe

Successfully running the node

If everything goes well with the installation, you should see the following stack of messages:

[2018-09-13T12:20:01,766][INFO ][o.e.e.NodeEnvironment    ] [localhost.localdomain] using [1] data paths, mounts [[/home (/dev/mapper/fedora-home)]], net usable_space [335.3gb], net total_space [410.3gb], types [ext4]
[2018-09-13T12:20:01,772][INFO ][o.e.e.NodeEnvironment    ] [localhost.localdomain] heap size [990.7mb], compressed ordinary object pointers [true]
[2018-09-13T12:20:01,774][INFO ][o.e.n.Node               ] [localhost.localdomain] node name [localhost.localdomain], node ID [B0aEHNagTiWx7SYj-l4NTw]
[2018-09-13T12:20:01,775][INFO ][o.e.n.Node               ] [localhost.localdomain] version[7.1.1], pid[13030], OS[Linux/4.16.11-100.fc26.x86_64/amd64], JVM["Oracle Corporation"/...]
[2018-09-13T12:20:01,775][INFO ][o.e.n.Node               ] [localhost.localdomain] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, ...]
[2018-09-13T12:20:02,543][INFO ][o.e.p.PluginsService     ] [localhost.localdomain] loaded module [ingest-common]
[2018-09-13T12:20:02,544][INFO ][o.e.p.PluginsService     ] [localhost.localdomain] loaded module [lang-mustache]
... (additional "loaded module" lines omitted) ...
[2018-09-13T12:20:02,545][INFO ][o.e.p.PluginsService     ] [localhost.localdomain] no plugins loaded
[2018-09-13T12:20:04,657][INFO ][o.e.d.DiscoveryModule    ] [localhost.localdomain] using discovery type [zen]
[2018-09-13T12:20:05,006][INFO ][o.e.n.Node               ] [localhost.localdomain] initialized
[2018-09-13T12:20:05,007][INFO ][o.e.n.Node               ] [localhost.localdomain] starting ...
[2018-09-13T12:20:05,202][INFO ][o.e.t.TransportService   ] [localhost.localdomain] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-09-13T12:20:05,221][o.e.b.BootstrapChecks    ] [localhost.localdomain] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65535]
[2018-09-13T12:20:05,221][o.e.b.BootstrapChecks    ] [localhost.localdomain] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2018-09-13T12:20:08,355][INFO ][o.e.c.s.MasterService    ] [localhost.localdomain] elected-as-master ([0] nodes joined), reason: master node changed {previous [], current [{localhost.localdomain}{B0aEHNagTiWx7SYj-l4NTw}{...}{127.0.0.1}{127.0.0.1:9300}]}
[2018-09-13T12:20:08,360][INFO ][o.e.c.s.ClusterApplierService] [localhost.localdomain] master node changed {previous [], current [{localhost.localdomain}{B0aEHNagTiWx7SYj-l4NTw}{...}{127.0.0.1}{127.0.0.1:9300}]}, reason: apply cluster state (from master, committed version [1], source [elected-as-master ([0] nodes joined)])
[2018-09-13T12:20:08,384][INFO ][o.e.h.n.Netty4HttpServerTransport] [localhost.localdomain] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-09-13T12:20:08,384][INFO ][o.e.n.Node               ] [localhost.localdomain] started

We can see that the node named “localhost.localdomain” (the node name will differ on your machine) has started and elected itself as master in a single-node cluster. Don’t worry about what master means just yet. The most important thing here is that we have started one node within one cluster.

As mentioned earlier, we can override cluster or node names. When you start ElasticSearch, you can do this from the command line, as shown below:

./elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name

Also notice the HTTP line, which contains the HTTP address (127.0.0.1 in the log above) and port (9200) from which the node can be reached. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port can be configured if necessary.
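For instance, if you wanted to run on a different port, the HTTP port can be overridden from the command line in the same way as the cluster and node names above. This is only an illustrative sketch; 9201 is an arbitrary example value:

./elasticsearch -Ehttp.port=9201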

Browse the cluster

Using the REST API

Now that we have a node (and cluster) up and running, the next step is to understand how to communicate with it. Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with the cluster. Among the things that can be done with the API are the following:

  • Check the health, status, and statistics of clusters, nodes, and indexes.

  • Manage cluster, node, and index data and metadata.

  • Perform CRUD (create, read, update, and delete) and search operations against indexes.

  • Perform advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others.

Cluster health

Let’s start with a basic health check, which we can use to see how the cluster is doing. We’ll use curl to do this, but you can use any tool that lets you make HTTP/REST calls. Assume we’re still on the same node where we started Elasticsearch and have opened another shell window.

To check the cluster health, we will use the _cat API. You can run the following command in the Kibana console by clicking “View in Console”, or use curl by clicking the “copy as curl” link below it and pasting it into a terminal.

curl -X GET "localhost:9200/_cat/health? v"Copy the code

Response results:

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1475247709 17:01:49  elasticsearch green           1         1      0   0    0    0        0             0                  -                100.0%

You can see that the cluster named “elasticsearch” has a green status. Whenever we ask for the cluster health, we get back green, yellow, or red.

  • Green – All is well (cluster fully functional)

  • Yellow – All data available, but some copies have not been allocated (cluster is fully functioning)

  • Red – Some data is unavailable for whatever reason (the cluster is only partially functional)

Note: When the cluster is red, it will continue to serve search requests from available shards, but you may need to fix it as soon as possible because there are unallocated shards.

From the response above, we can see a total of 1 node, and that we have 0 shards because we have no data in the cluster yet. Note that since we are using the default cluster name (elasticsearch), and since Elasticsearch by default looks for other nodes on the same machine, you could accidentally start more than one node on your machine and have them all join a single cluster. In that scenario, you might see more than one node in the response above.

We can also get a list of nodes in the cluster:

curl -X GET "localhost:9200/_cat/nodes? v"Copy the code

Response results:

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           10           5   5    4.46                          mdi       *      PB2sgZY

We can see a node named “PB2sgZY”, which is a single node in the current cluster.

View all indexes

Now let’s take a look at our index:

curl -X GET "localhost:9200/_cat/indices? v"Copy the code

Response results:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

This means we don’t have indexes in the cluster yet.

Create indexes

Now, let’s create an index named “Customer” and list all indexes again:

curl -X PUT "localhost:9200/customer? pretty"curl -X GET "localhost:9200/_cat/indices? v"Copy the code

The first command creates an index named “customer” using the PUT verb. We simply append pretty to the end of the call to pretty-print the JSON response (if any).

Response results:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer 95SQ4TSUT7mWBT7VNHH67A   1   1          0            0       260b           260b

The result of the second command tells us that we now have one index named customer, with one primary shard and one replica (the defaults), containing zero documents.

You may also notice that the customer index has a yellow health status. Recall from our earlier discussion that yellow means some replicas have not been allocated. This happens for this index because, by default, Elasticsearch creates one replica for it. Since only one node is currently running, that replica cannot be allocated (for high availability) until another node joins the cluster. Once the replica is allocated on a second node, the health status of the index turns green.

Index and query a document

Now let’s put something in the customer index. We will index a simple customer document with ID 1 in the customer index as follows:

curl -X PUT "localhost:9200/customer/_doc/1? pretty" -H 'Content-Type: application/json' -d'{ "name": "John Doe"}'Copy the code

Response results:

{  "_index" : "customer"."_type" : "_doc"."_id" : "1"."_version" : 1,  "result" : "created"."_shards" : {    "total": 2."successful" : 1,    "failed": 0}."_seq_no": 0."_primary_term" : 1}Copy the code

From above, we can see that a new customer document has been successfully created in the customer index. The document also has an internal ID 1, which we specified when we indexed it.

Note that Elasticsearch does not require you to explicitly create an index before indexing documents into it. In the previous example, if the customer index did not already exist, Elasticsearch would have created it automatically.

Now let’s retrieve the document we just indexed:

curl -X GET "localhost:9200/customer/_doc/1? pretty"Copy the code

Response results:

{  "_index" : "customer"."_type" : "_doc"."_id" : "1"."_version" : 1,  "_seq_no": 25."_primary_term" : 1,  "found" : true."_source" : { "name": "John Doe" }}Copy the code

Nothing unusual here, apart from the field found, which states that a document with the requested ID 1 was found, and the field _source, which returns the full JSON document that we indexed in the previous step.

Remove the index

Now, let’s delete the index we just created and list all indexes again:

curl -X DELETE "localhost:9200/customer? pretty"curl -X GET "localhost:9200/_cat/indices? v"Copy the code

Response results:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

This means that the index was successfully dropped, and we are now back where we started with nothing in the cluster.

Before we continue, let’s take a closer look at some of the API commands we’ve learned so far:

Curl -x PUT "localhost:9200/customer/ 1" curl -x PUT "localhost:9200/customer/ 1" -h 'content-type: application/json' -d'{ "name": $$$$$$$$$$$$$$$$$$$$$$Copy the code

If we take a closer look at the above commands, we can actually see the pattern of how data is accessed in ElasticSearch. This pattern can be summarized as follows:

<HTTP Verb> /<Index>/<Endpoint>/<ID>

This REST access pattern is common in all API commands, and if you can remember it, you’ll be off to a good start in mastering ElasticSearch.

Modify the data

ElasticSearch provides near real-time data manipulation and search capabilities. By default, a one-second delay (refresh interval) is expected between indexing/updating/deleting data and data appearing in search results. This is an important difference from other platforms, such as SQL, where data is available immediately after a transaction completes.
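If you need a freshly indexed document to be searchable immediately (in a test, for example), you can trigger a refresh on the index by hand. This is just a sketch using the refresh API against the customer index from our examples:

curl -X POST "localhost:9200/customer/_refresh?pretty"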

Create/replace documents (modify data)

We’ve seen how to index individual documents before. Let’s recall this command again:

curl -X PUT "localhost:9200/customer/_doc/1? pretty" -H 'Content-Type: application/json' -d'{ "name": "John Doe"}'Copy the code

Again, the above will index the specified document to the customer index with ID 1. If we execute the above command again with a different (or the same) document, ElasticSearch replaces (recreates) a new document with ID 1 on top of the existing one:

curl -X PUT "localhost:9200/customer/_doc/1? pretty" -H 'Content-Type: application/json' -d'{ "name": "Jane Doe"}'Copy the code

The above changes the name of the document with ID 1 from “John Doe” to “Jane Doe”. On the other hand, if we use a different ID, new documents are created and existing documents in the index remain unchanged.
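For example, a command along these lines (shown here as a sketch) indexes a document under a new ID, 2:

curl -X PUT "localhost:9200/customer/_doc/2?pretty" -H 'Content-Type: application/json' -d'{ "name": "Jane Doe" }'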

The command above indexes a new document with an ID of 2.

Specifying the ID is optional when indexing a document. If you don’t specify one, Elasticsearch generates a random ID. The ID Elasticsearch actually generates (or whatever we specified explicitly in the earlier examples) is returned as part of the index API call.

This example shows how to index a document without an explicit ID:

curl -X POST "localhost:9200/customer/_doc? pretty" -H 'Content-Type: application/json' -d'{ "name": "Jane Doe"}'Copy the code

Notice that in this case we used the POST verb instead of PUT, because we didn’t specify an ID.

Update the data

In addition to adding and replacing documents, we can also update them. Note that under the hood Elasticsearch does not actually update a document in place; instead, it deletes the old document and indexes a new one.

This example updates the document with ID 1, changing the name field to “Jane Doe”:

curl -X POST "localhost:9200/customer/_update/1? pretty" -H 'Content-Type: application/json' -d'{ "doc": { "name": "Jane Doe" }}'Copy the code

This example shows how to update a previous document (ID 1) by changing the name field to “Jane Doe” while adding an age field to it:

curl -X POST "localhost:9200/customer/_update/1? pretty" -H 'Content-Type: application/json' -d'{ "doc": { "name": "Jane Doe", "age": 20 }}'Copy the code

Updates can also be performed using a simple script, and this example uses the script to increase the age by 5:

curl -X POST "localhost:9200/customer/_update/1? pretty" -H 'Content-Type: application/json' -d'{ "script" : "ctx._source.age += 5"}'Copy the code

Elasticsearch also provides the ability to update multiple documents matching a given condition, much like UPDATE ... WHERE ... in SQL. We will cover this in more detail in later chapters.
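As a rough preview, the _update_by_query API takes a script and a query; the script is applied to every document matching the query. The script and query below are purely illustrative:

curl -X POST "localhost:9200/customer/_update_by_query?pretty" -H 'Content-Type: application/json' -d'{ "script": { "source": "ctx._source.age += 1" }, "query": { "match": { "name": "Jane Doe" } } }'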

Delete the data

Deleting documents is fairly simple.

This example shows how to delete a previous customer with ID 2:

curl -X DELETE "localhost:9200/customer/_doc/2? pretty"Copy the code

See the _delete_by_query API for deleting all documents that match a particular query. It is worth noting that deleting an entire index is much more efficient than deleting all of its documents with the delete-by-query API. The _delete_by_query API will be covered in more detail later.
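For reference, here is a minimal sketch of a delete-by-query call; the query shown is illustrative and would delete every document whose name matches "John Doe":

curl -X POST "localhost:9200/customer/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'{ "query": { "match": { "name": "John Doe" } } }'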

Batch processing

In addition to indexing, updating, and deleting individual documents, Elasticsearch can do all of this in bulk using the _bulk API. This capability is important because it provides a very efficient mechanism for performing multiple operations as quickly as possible, with as few network round trips as possible.

As a simple example, the following call indexes two documents (ID 1-John Doe and ID 2-Jane Doe) in a batch operation:

curl -X POST "localhost:9200/customer/_bulk? pretty" -H 'Content-Type: application/json' -d'{"index":{"_id":"1"}}{"name": "John Doe" }{"index":{"_id":"2"}}{"name": "Jane Doe" }'Copy the code

This example updates the first document (ID 1) and then deletes the second document (ID 2) in a batch operation:

curl -X POST "localhost:9200/customer/_bulk? pretty" -H 'Content-Type: application/json' -d'{"update":{"_id":"1"}}{"doc": { "name": "John Doe becomes Jane Doe" } }{"delete":{"_id":"2"}}'Copy the code

Note that the delete action has no corresponding source document after it, because a delete only requires the ID of the document to be removed.

The bulk API does not fail as a whole just because one of its operations fails (it keeps going and eventually reports the status of every operation). If a single operation fails for some reason, the remaining operations after it are still processed. When the bulk API returns, it provides a status for each operation (in the same order they were sent), so you can check whether a specific operation failed.
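To give an idea of what that looks like, here is a trimmed, illustrative sketch of a bulk response (the field values are made up); the errors flag and the per-item statuses are what you would inspect:

{
  "took" : 30,
  "errors" : false,
  "items" : [
    { "update" : { "_index" : "customer", "_id" : "1", "status" : 200, "result" : "updated" } },
    { "delete" : { "_index" : "customer", "_id" : "2", "status" : 200, "result" : "deleted" } }
  ]
}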

Browsing data

Sample data

Now that we know the basics, let’s try working with a more realistic data set. I have prepared a sample of fictitious JSON documents describing customer bank accounts. Each document has the following structure:

{    "account_number": 0."balance": 16623,    "firstname": "Bradshaw"."lastname": "Mckenzie"."age": 29."gender": "F"."address": "244 Columbus Place"."employer": "Euron"."email": "[email protected]"."city": "Hobucken"."state": "CO"}Copy the code

This data was generated using www.json-generator.com, so please ignore the actual values and semantics of the data; they are all randomly generated.

Loading sample data

You can download the sample dataset (accounts.json) here. Extract it to the current directory and then load it into the cluster as follows:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk? pretty&refresh" --data-binary "@accounts.json"curl "localhost:9200/_cat/indices? v"Copy the code

Response results:

health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  L7sSYV2cQXmu6_4rJWVIww   5   1       1000            0    128.6kb        128.6kb

This means that we just successfully bulk-indexed 1,000 documents into the bank index.

Search API

Now let’s start with some simple searches. There are two basic ways to run a search:

  • One is to send search parameters through the REST request URI.

  • The other is to send search parameters through the REST request body.

The request body approach allows you to be more expressive and to define the search in a more readable JSON format. We’ll try an example of the request URI method, but for the rest of the tutorial, we’ll just use the request body method.

The REST API for search is accessible from the _search endpoint. This example returns all documents in the bank index:

curl -X GET "localhost:9200/bank/_search? q=*&sort=account_number:asc&pretty"Copy the code

Let’s first dissect the search call. We are searching (_search) in the bank index, and the q=* parameter tells Elasticsearch to match all documents in the index. The sort=account_number:asc parameter tells it to sort the results in ascending order by each document’s account_number field. Finally, the pretty parameter just tells Elasticsearch to return nicely formatted JSON.

Corresponding results (partially shown):

{  "took" : 63,  "timed_out" : false."_shards" : {    "total" : 5,    "successful" : 5,    "skipped": 0."failed": 0}."hits" : {    "total" : {        "value": 1000,        "relation": "eq"    },    "max_score" : null,    "hits": [{"_index" : "bank"."_type" : "_doc"."_id" : "0"."sort": [0]."_score" : null,      "_source" : {"account_number": 0."balance": 16623,"firstname":"Bradshaw"."lastname":"Mckenzie"."age": 29."gender":"F"."address":"244 Columbus Place"."employer":"Euron"."email":"[email protected]"."city":"Hobucken"."state":"CO"}}, {"_index" : "bank"."_type" : "_doc"."_id" : "1"."sort": [1]."_score" : null,      "_source" : {"account_number": 1,"balance": 39225,"firstname":"Amber"."lastname":"Duke"."age": 32."gender":"M"."address":"880 Holmes Lane"."employer":"Pyrami"."email":"[email protected]"."city":"Brogan"."state":"IL"}},... ] }}Copy the code

As for the response, we can see:

  • took – the time, in milliseconds, Elasticsearch took to execute the search

  • timed_out – tells us whether the search timed out

  • _shards – tells us how many shards were searched, and how many succeeded/failed

  • hits – the search results

  • hits.total – an object containing information about the total number of documents matching our search criteria

  • hits.total.value – the total number of hits

  • hits.total.relation – whether hits.total.value is the exact hit count (in which case it equals eq) or a lower bound on the total number of hits (greater than or equal to, in which case it equals gte)

  • hits.hits – the actual array of search results (defaults to the first 10 documents)

  • hits.sort – the sort key for the results (missing when sorting by score)

  • hits._score and max_score – ignore these fields for now

The accuracy of hits.total is controlled by the request parameter track_total_hits. When track_total_hits is set to true, the request tracks the total number of hits exactly ("relation": "eq"). Its default value is 10,000, which means hit totals are counted exactly only up to 10,000 documents. You can force an exact count by explicitly setting track_total_hits to true. The details will be covered in a later section.
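As a quick sketch, the parameter can simply be added to the search request body:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "track_total_hits": true, "query": { "match_all": {} } }'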

This is the way to use request body search:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ]}'Copy the code

The difference here is that instead of passing q=* in the URI, we provide a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.

It’s important to understand that once you get your search results back, Elasticsearch is completely done with the request; it does not maintain any kind of server-side resource or keep a cursor open into the results. This is in stark contrast to many other platforms, such as SQL, where you may initially get a partial subset of the query results and then have to keep going back to the server to fetch (or page through) the rest using some sort of stateful, server-side cursor.

Introducing a query language

Elasticsearch provides a JSON-style query language that you can use to perform queries. This is called the Query DSL. The query language is so comprehensive that it may seem intimidating at first glance, but the best way to actually learn it is to start with a few basic examples.

Going back to the previous example, we executed this query:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }}'Copy the code

Looking closely at the above, the query part tells us what our query definition is, and match_all is the type of query we want to run; match_all simply searches for all documents in the specified index.

Besides the query parameter, we can pass other parameters to influence the search results. In the previous section we passed sort; here we pass size:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }, "size": 1}'Copy the code

Note that if the size is not specified, the default is 10.

The following example executes a match_all and returns documents 10 through 19 (from and size are analogous to MySQL’s LIMIT offset, count):

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }, "from": 10, "size": 10}'Copy the code

The from parameter (0-based) specifies the document index to start from, and the size parameter specifies how many documents to return starting at from. This feature is useful when implementing paging of search results.

Note that if from is not specified, the default value is 0.

This example performs the match_all operation, sorts the results in descending order of account balance, and returns the top 10 (default size) documents.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }, "sort": { "balance": { "order": "desc" } }}'Copy the code

Perform a search

Now that we’ve seen some basic search parameters, let’s dig into the Query DSL a bit more. Let’s first look at the document fields that are returned. By default, the full JSON document is returned as part of every search hit. This is referred to as the source (the _source field in the search hits). If we don’t want the whole source document returned, we can request only a few fields from it.

This example shows how to return two fields from the search, account_number and balance (inside _source):

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_all": {} }, "_source": ["account_number", "balance"]}'Copy the code

Note that the example above simply reduces the _source field. It still returns only one field named _source, but now it contains just the account_number and balance fields.

If you are coming from MySQL, this is conceptually somewhat similar to the SQL SELECT field list.

Now let’s go to the query section. Earlier, we saw how to use the match_all query to match all documents. Now let’s introduce a new query called a match query, which can be thought of as a basic field search query (that is, a search for a specific field or set of fields).

A match query is similar to a conditional (WHERE) query in MySQL. For example, this query returns the account numbered 20:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match": { "account_number": 20 } }}'Copy the code

This example returns all accounts with “mill” in the address:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match": { "address": "mill" } }}'Copy the code

This example returns all accounts whose address contains “mill” or “lane” :

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match": { "address": "mill lane" } }}'Copy the code

A variant of match, match_phrase, returns all accounts whose address contains the phrase “mill lane”:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "match_phrase": { "address": "mill lane" } }}'Copy the code

Note: with match, a space separates the input into two terms, and documents matching any of the terms are returned. match_phrase, by contrast, treats the whole string as a single phrase and matches documents whose indexed field contains that phrase.

Now let’s introduce bool queries. The bool query allows us to combine smaller queries into larger ones using Boolean logic.

If you’re familiar with MySQL, a bool query is roughly the equivalent of AND, OR, and NOT.

This example contains two matching queries that return all accounts with addresses containing “mill” and “lane” :

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "must": [ { "match": { "address": "mill" } }, { "match": { "address": "lane" } } ] } }}'Copy the code

In the example above, the bool must clause specifies all queries that must be true to treat the document as a match.

Instead, this example contains two matching queries and returns all accounts with addresses containing “mill” or “lane” :

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "should": [ { "match": { "address": "mill" } }, { "match": { "address": "lane" } } ] } }}'Copy the code

In the example above, the bool should clause specifies a list of queries, any of which must be true for the document to be considered a match.

This example contains two matching queries that return all accounts with neither “mill” nor “lane” in the address:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "must_not": [ { "match": { "address": "mill" } }, { "match": { "address": "lane" } } ] } }}'Copy the code

In the example above, the bool must_not clause specifies a list of queries, none of which may be true for a document to be considered a match.

We can combine must, should, and must_not clauses simultaneously within a bool query. Furthermore, we can nest bool queries inside these bool clauses to model arbitrarily complex, multi-level Boolean logic.

This example returns the accounts of everyone aged 40 who does not live in ID (Idaho):

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "must": [ { "match": { "age": "40" } } ], "must_not": [ { "match": { "state": "ID" } } ] } }}'Copy the code

Executing filters

In the previous section, we skipped over a small detail called Document Score (the _score field in the search results). A score is a number that is a relative measure of how well the document matches the search query we specify. The higher the score, the more relevant the document, and the lower the score, the less relevant the document.

But queries do not always need to generate scores, especially if they are only used to “filter” a set of documents. Elasticsearch automatically optimizes query execution to avoid calculating useless scores.

The bool query we introduced in the previous section also supports the Filter clause, which allows us to use the query to restrict documents that will be matched by other clauses without changing the way scores are computed. As an example, let’s introduce range queries, which allow us to filter documents based on a series of values. This is often used for number or date filtering.

In this example, a bool query is used to find all accounts whose balance is between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance greater than or equal to 20000 and less than or equal to 30000.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "must": { "match_all": {} }, "filter": { "range": { "balance": { "gte": 20000, "lte": 30000}}}}}}'Copy the code

Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). Any other queries could be substituted into the query and filter parts. In this case, the range query makes a lot of sense, because documents that fall into the range all match “equally”: no document is more relevant than another (since the range clause is a filter, no scores are computed for it).

In addition to match_all, match, bool, and range queries, there are many other query types available, and we won’t go into them here. Now that we have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge when learning and experimenting with other query types.

Perform aggregation (analogous to mysql aggregation functions)

Aggregation provides the ability to group data and extract statistics. The easiest way to think about aggregation is to roughly equate it with SQL GROUP BY and SQL aggregation functions. In Elasticsearch, you can perform a search that returns hits, while returning aggregated results separated from all hits in a response. This is very powerful and efficient because you can run queries and multiple aggregations and get the results of both (or any) operations at once, avoiding network roundtripping with a clean and simplified API.

To start, this example groups all accounts by state, and then returns the top 10 (default) states, sorted by count in descending order (also the default):

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword" } } }}'Copy the code

In SQL, the above aggregation is conceptually equivalent to:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10;

Response results:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits" : {
    "total" : {
        "value": 1000,
        "relation": "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets" : [
        { "key" : "ID", "doc_count" : 27 },
        { "key" : "TX", "doc_count" : 27 },
        { "key" : "AL", "doc_count" : 25 },
        { "key" : "MD", "doc_count" : 25 },
        { "key" : "TN", "doc_count" : 23 },
        { "key" : "MA", "doc_count" : 21 },
        { "key" : "NC", "doc_count" : 21 },
        { "key" : "ND", "doc_count" : 21 },
        { "key" : "ME", "doc_count" : 20 },
        { "key" : "MO", "doc_count" : 20 }
      ]
    }
  }
}

We can see that ID (Idaho) has 27 accounts, followed by TX (Texas) with 27, then AL (Alabama) with 25, and so on.

Note that we set size=0 to not show the search results because we only want to see the aggregated results in the response.

Building on the previous aggregation, this example calculates the average account balance per state (again, only for the top 10 states sorted by count in descending order):

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword" }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } }}'Copy the code

Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all aggregations: you can nest aggregations inside aggregations arbitrarily to extract whatever summaries you need from your data.

Based on the previous aggregation, we now sort the average balance in descending order:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "order": { "average_balance": "desc" } }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } }}'Copy the code

This example shows how we can group by age (20-29, 30-39, 40-49), then by gender, and finally get the average account balance for each age group and each gender:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{ "size": 0, "aggs": { "group_by_age": { "range": { "field": "age", "ranges": [ { "from": 20, "to": 30 }, { "from": 30, "to": 40 }, { "from": 40, "to": 50 } ] }, "aggs": { "group_by_gender": { "terms": { "field": "gender.keyword" }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } } }}'Copy the code

There are many other aggregation features that we won’t discuss in detail here. If you want to experiment further, the aggregate reference guide is a good place to start.

Conclusion

Elasticsearch is both a simple and a complex product. So far we have seen what it is, how to look inside it, and how to work with it using some of the REST APIs. Hopefully this tutorial has given you a better understanding of what Elasticsearch is and, more importantly, inspired you to experiment with its many other great features!