Introduction to Elasticsearch

Elasticsearch is a highly extensible, open source, Lucene-based full-text search and analysis engine. It allows you to store, search, and analyze large amounts of data quickly, in near real time, and supports multi-tenancy.

Elasticsearch is developed in Java and uses Lucene at its core for all indexing and searching, but it aims to hide Lucene’s complexity behind a simple RESTful API to make full-text search easy.

Elasticsearch is more than just Lucene and full-text search, however. It can also be described as follows:

  • A distributed real-time document store where every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Scalable to hundreds of servers, handling petabytes of structured or unstructured data

Furthermore, all of these functions are integrated into a single service that your application can interact with through a simple RESTful API, clients in various languages, and even the command line.

Version selection

The first thing to consider when deciding to use Elasticsearch is the version. There are currently three stable major versions of Elasticsearch: 2.x, 5.x, 6.x (excluding 0.x and 1.x).

All past releases of Elasticsearch are listed on the Elastic website; the latest at the time of writing is Elasticsearch 6.4.2. You might notice that there is no 3.x or 4.x: ES jumped straight from 2.4.6 to 5.0.0. Why is that?

This was done to unify version numbers across the ELK (Elasticsearch, Logstash, Kibana) stack so that users don’t get confused.

Elasticsearch, Kibana, and Logstash are all part of the Elastic Stack and are often used together. To avoid version confusion, they need unified version numbers, at least at the major version level.

When Elasticsearch was at 2.x (2.4.6, the last 2.x release, came out on July 25, 2017), Kibana was at 4.x (Kibana 4.6.5 was also released on July 25, 2017). Kibana’s next major version was bound to be 5.x, so Elasticsearch released its next major version as 5.0.0. After unification, choosing versions is straightforward: pick an Elasticsearch version, then pick the same version of Kibana, without worrying about incompatibility.

Version selection can be considered from the following aspects:

  • Version issues: 2.x is old, so you miss out on new features, and its performance is inferior to 5.x. 6.x is fairly new, with relatively little material online (worth studying if you have enough development time).
  • Data migration: 2.x data can be migrated directly to 5.x, and 5.x data can be migrated directly to 6.x. However, 2.x data cannot be migrated directly to 6.x.
  • Peripheral tools: with 2.x, peripheral tools are awkward because you have to look up the matching version of tools such as Kibana, and they are not easy to match. From 5.x onward, the major version numbers of Kibana and the other tools are unified.
  • SQL: on 2.x, 5.x, and 6.x you can install the elasticsearch-sql plug-in to query Elasticsearch using familiar SQL syntax. From 6.3.0 onward, an SQL module is built in as part of X-Pack.
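
Two of the rules above (keep the major versions of the stack aligned, and migrate data one major version at a time) can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the set of direct migration steps encodes just what this article states, not an official compatibility matrix.

```python
def major(version: str) -> int:
    """Return the major number of a dotted version string such as '6.4.0'."""
    return int(version.split(".")[0])


def same_major(es_version: str, kibana_version: str) -> bool:
    """From 5.x onward, Elasticsearch and Kibana should share a major version."""
    return major(es_version) == major(kibana_version)


# Direct migration steps named in the text: 2.x -> 5.x and 5.x -> 6.x.
# (Assumption: nothing else is encoded here, not even same-major upgrades.)
DIRECT_MIGRATIONS = {(2, 5), (5, 6)}


def can_migrate_directly(src: str, dst: str) -> bool:
    """True if the text describes a direct data migration from src to dst."""
    return (major(src), major(dst)) in DIRECT_MIGRATIONS


print(same_major("6.4.0", "6.4.0"))             # True
print(can_migrate_directly("2.4.6", "6.4.0"))   # False: go through 5.x first
```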

I chose Elasticsearch 6.4.0 because it was new, along with Kibana 6.4.0. However, when local development was complete and the project was ready for deployment, the operations (O&M) team told me to switch to 5.6.0, because the rest of the company ran 5.6.0 and a single version is easier to maintain. Fortunately, the API did not change much.

Environment setup

Install Elasticsearch

Elasticsearch 5.0 and later require at least Java 8. Run the following commands to check your Java version, and install or upgrade Java as required.

java -version
echo $JAVA_HOME

You can download your version of Elasticsearch from elastic.co/download.

If you are running a cluster, configure your cluster settings in …\elasticsearch-5.6.0\config\elasticsearch.yml:

cluster.name: my-application   # cluster name
path.data: /path/to/data       # ES data storage path
path.logs: /path/to/logs       # ES log storage path
node.name: node-1              # name of the current node
network.host: 192.168.0.1      # IP address of the current node; 0.0.0.0 listens on all interfaces
http.port: 9200                # HTTP port (9200 by default)

Run Elasticsearch

Once Elasticsearch is set up, run the following command in the installation directory to start it:

Linux

./bin/elasticsearch

Windows

D:\...\elasticsearch-6.4.0\bin\elasticsearch.bat

After a successful start (the startup log will contain a “started” message), open http://localhost:9200/?pretty in a browser. You should see a response like the following (details vary slightly by version):

{
  "name" : "AGXQ3qy",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "mg9t4Yi2TRud1JNwRY0bPA",
  "version" : {
    "number" : "6.4.0",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "595516e",
    "build_date" : "2018-08-17T23:18:47.308994Z",
    "build_snapshot" : false,
    "lucene_version" : "7.4.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Your Elasticsearch cluster is up and running and ready to use.
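
The same check can be scripted instead of done in a browser. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the sample body is a trimmed copy of the response above so that the parsing part runs without a live node.

```python
import json
import urllib.request


def fetch_root_info(url: str = "http://localhost:9200/") -> str:
    """Fetch the root endpoint of a running node (needs a live Elasticsearch)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def summarize_root_info(body: str) -> dict:
    """Pick out the fields that matter when checking which version is running."""
    info = json.loads(body)
    return {
        "node": info["name"],
        "cluster": info["cluster_name"],
        "es_version": info["version"]["number"],
        "lucene_version": info["version"]["lucene_version"],
    }


# Trimmed sample shaped like the response shown above.
SAMPLE_BODY = """
{
  "name": "AGXQ3qy",
  "cluster_name": "elasticsearch",
  "version": {"number": "6.4.0", "lucene_version": "7.4.0"},
  "tagline": "You Know, for Search"
}
"""

summary = summarize_root_info(SAMPLE_BODY)
print(summary["cluster"], summary["es_version"])
```

With a node running locally, `summarize_root_info(fetch_root_info())` would report the live values instead.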

Visual Web Interface

Because Elasticsearch is operated through a RESTful API, working with it directly is not very intuitive, so let’s install a graphical interface to make it easier to use. elasticsearch-head and Kibana are two commonly used options.

elasticsearch-head

elasticsearch-head is a web front end for browsing and interacting with an Elasticsearch cluster. It is a visualization tool for cluster management, data visualization, and running create, delete, and query operations. elasticsearch-head is hosted on GitHub, where it can be downloaded or forked.

There are two ways to install and run elasticsearch-head:

Run as a plugin for ElasticSearch (preferred method)

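  1. Run elasticsearch/bin/elasticsearch-plugin install mobz/elasticsearch-head.
  2. Open http://localhost:9200/_plugin/head/.

Note: before 5.0 the install command is plugin install …; from 5.0 onward it is elasticsearch-plugin install …. Note also that Elasticsearch 5.0 removed support for site plugins, so the plugin method applies only to older versions; on 5.x and later, run elasticsearch-head standalone.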

Run as a standalone WebApp

  1. git clone git://github.com/mobz/elasticsearch-head.git.
  2. Open index.html in your browser. elasticsearch-head requires a modern browser.
  3. By default, elasticsearch-head tries to connect to a cluster node at http://localhost:9200/. If necessary, enter a different node address in the connection box and click Connect.

The cluster status information is displayed as follows:

I won’t go into detail about elasticsearch-head, because it is a bit tricky to install (it requires Node.js) and there are plenty of resources about it online.

Kibana

Kibana and Elasticsearch are both Elastic products. Kibana is an open source analytics and visualization platform designed to work with Elasticsearch. You use Kibana to search, view, and interact with data stored in Elasticsearch indices. You can easily perform advanced data analysis and visualize your data in a variety of charts, tables, and maps.

Kibana makes it easy for you to understand large amounts of data. Its simple browser-based interface enables you to quickly create and share dynamic dashboards that display changes to Elasticsearch queries in real time.

Setting up Kibana is very easy. You can install Kibana and start exploring your Elasticsearch index in minutes – no code, no extra infrastructure required.

Download the Kibana version that matches your Elasticsearch version, unzip it, and set it up as follows:

  • Download and unzip Kibana.
  • Open config/kibana.yml in an editor.
  • Set elasticsearch.url to point at your Elasticsearch instance, for example a local one: elasticsearch.url: "http://localhost:9200".
  • Run bin/kibana (or bin\kibana.bat on Windows).
  • Open http://localhost:5601 in a browser.

If Kibana starts successfully but cannot be reached, check your firewall (for example, turn it off temporarily) and try again.

Basic concepts

Before we get down to business, let’s look at some of the core concepts of ES. Understanding them from the start will greatly simplify the learning process.

Elasticsearch is a near real-time (NRT) search platform. This means there is a slight delay (usually one second) between the time a document is indexed and the time it becomes searchable. Its core concepts include clusters, nodes, shards, and replicas.

Cluster

A cluster is a collection of nodes with the same cluster.name that work together, share data, and provide failover and scaling capabilities. A single node can also form a cluster on its own.

A cluster is identified by a unique name, which is elasticsearch by default. This name is important because a node can only be part of a cluster if it is configured to join that cluster by name.

Make sure you use different cluster names in different environments; otherwise you may end up with nodes joining the wrong cluster.

[Cluster Health Status]

Cluster status is indicated by green, yellow, or red:

  • Green – Everything is fine (the cluster is fully functional).
  • Yellow – All data is available, but some replicas have not been allocated (the cluster is fully functional).
  • Red – Some data is unavailable for some reason (the cluster is partially functional).

Note: When the cluster is red, it continues to serve search requests from the available shards, but you should fix it as soon as possible because there are unassigned shards.
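
The status colour and the shard counters reported by the cluster health API can be read together. The sketch below assumes a response dictionary shaped like the output of GET /_cluster/health; the values are illustrative (a single-node cluster with 28 active and 5 unassigned shards), and the percentage formula is an assumption that active shards are divided by active plus initializing plus unassigned shards.

```python
STATUS_MEANING = {
    "green": "everything is fine",
    "yellow": "all data is available, but some replicas are not allocated",
    "red": "some data is unavailable",
}


def active_shards_percent(health: dict) -> float:
    """Share of shards that are active, as a percentage (assumed formula)."""
    active = health["active_shards"]
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    return 100.0 * active / total if total else 100.0


def describe(health: dict) -> str:
    """One-line human-readable summary of a cluster health response."""
    status = health["status"]
    percent = active_shards_percent(health)
    return (f"{health['cluster_name']}: {status} ({STATUS_MEANING[status]}), "
            f"{percent:.1f}% of shards active")


sample_health = {
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "number_of_nodes": 1,
    "active_shards": 28,
    "initializing_shards": 0,
    "unassigned_shards": 5,
}

print(describe(sample_health))
```

A single-node cluster typically stays yellow for exactly this reason: the replicas have no second node to live on, so they remain unassigned.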

To check cluster health, run GET /_cluster/health from the Kibana console. It returns information like the following:

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 28,
  "active_shards": 28,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 84.84848484848484
}

Node

A node is a single running ES instance. It stores data and participates in the cluster’s indexing and search capabilities.

Like a cluster, a node is identified by a name, which by default is a random universally unique identifier (UUID) assigned to the node at startup. If you do not want the default, you can define any node name you like. This name is important for administration, where you need to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

A node can be configured to join a specific cluster by cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start several nodes on your network and they can discover each other, they will automatically form and join a single cluster named elasticsearch.

Index

An index is a collection of documents with some similar characteristics. For example, you can have an index of store data, an index of goods, and an index of order data.

An index is identified by a name (which must be all lowercase) that is used to refer to the index when indexing, searching, updating, and deleting documents within it.

Type

A type used to be a logical category/partition of an index that let you store different kinds of documents in the same index, for example one type for users and another type for blog posts.

Types are deprecated as of 6.0.0: it is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version.

Document

A document is the basic unit of information that can be indexed. For example, you can have one document for a single customer, another document for a single product, and yet another for a single order. The document is expressed in JSON (JavaScript Object Notation), a common format for data interchange on the Internet.

Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, it must be indexed into (assigned to) a type inside that index.

Shard

Indexes can store large amounts of data that may exceed the hardware limits of a single node. For example, a single index of a billion documents occupying 1TB of disk space might not fit on a single node’s disk, or might be too slow to serve search requests individually from a single node.

To solve this problem, Elasticsearch provides the ability to subdivide an index into multiple pieces called shards. When creating an index, you only need to define the number of shards you need. Each shard is itself a fully functional and independent “index” that can be hosted on any node in the cluster.

Sharding is important for two main reasons:

  • It allows you to horizontally split/scale your content volume
  • It allows you to distribute and parallelize operations across shards (possibly on multiple nodes), thereby improving performance/throughput

The way shards are distributed and how their documents are aggregated back into search requests is completely managed by Elasticsearch and is transparent to the user.

In a network/cloud environment where failures can be expected at any time, it is useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replica

A replica is a copy of a shard. Replicas provide high availability in case a shard/node fails, and they allow you to scale search volume/throughput because searches can be executed on all replicas in parallel.

In summary, each index can be split into multiple shards. An index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the original shards that were copied from) and replica shards (the copies of the primary shards).

You can define the number of shards and replicas per index when the index is created. You can change the number of replicas dynamically at any time after the index is created. You can change the number of shards of an existing index using the _shrink and _split APIs, but this is not a trivial task, so planning the correct number of shards in advance is the best approach.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.

Summary

Let’s assume a cluster consisting of three nodes (Node1, Node2, and Node3). It has two primary shards (P0, P1), and each primary shard has two replica shards (R0, R1). Copies of the same shard are never placed on the same node, so our cluster looks like “a cluster with three nodes and one index” in the figure below.
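
The shard arithmetic can be worked through in code. This is a rough sketch, not how Elasticsearch allocates shards internally: the function names are hypothetical, and the second function assumes the only constraint is the rule stated above, that copies of the same shard never share a node.

```python
def total_shards(primaries: int, replicas: int) -> int:
    """Each primary shard exists once, plus once more per configured replica."""
    return primaries * (1 + replicas)


def allocatable_shards(primaries: int, replicas: int, nodes: int) -> int:
    """Shards that can actually be placed when copies must sit on distinct nodes.

    Assumption: no other allocation constraints (disk space, awareness, ...).
    """
    copies_per_shard = min(1 + replicas, nodes)
    return primaries * copies_per_shard


print(total_shards(5, 1))            # 10: the default of 5 primaries + 1 replica
print(allocatable_shards(5, 1, 1))   # 5: on one node the replicas stay unassigned
print(allocatable_shards(2, 2, 3))   # 6: two primaries, two replicas each, three nodes
```

The middle case is why a single-node cluster with default settings reports yellow: only the primaries can be placed.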

This is similar to a relational database cluster. Suppose there is a user table and I am worried that it will hold too much data, so I create several user tables (that is, shards) and split the user data across them according to some rule. I am also worried that a table might fail and lose data, so I back up each table (that is, a replica).

Replicas are multiplication: the more replicas, the more storage is consumed, but the safer the data. Shards are division: the more shards there are, the smaller and more scattered the data in each shard.

In addition, we can draw a comparison diagram to compare traditional relational databases:

  • Relational Databases -> Databases -> Tables -> Rows -> Columns
  • Elasticsearch -> Indices -> Types -> Documents -> Fields

An Elasticsearch cluster can contain multiple indices, each index can contain multiple types, each type can contain multiple documents, and each document can contain multiple fields.

The analogy only goes so far, since these are two different products after all, and as mentioned above, types may be removed in a later version. So in general we create one index per type. For example, we create one index for fresh food and another for groceries, rather than a single goods index that contains both fresh food and groceries.

You may have noticed that the word “index” has several meanings in the context of Elasticsearch, so it is worth distinguishing them:

Meanings of “index”

  • Index (noun): as mentioned above, an index is like a database in a traditional relational database. It is the place where related documents are stored. The plural of index is indices or indexes.
  • Index (verb): to index a document means to store it in an index (noun) so that it can be retrieved and queried. This is much like the INSERT keyword in SQL, except that if the document already exists, the new document replaces the old one.
  • Inverted index: traditional databases add an index, such as a B-tree index, to specific columns to speed up retrieval. Elasticsearch and Lucene use a data structure called an inverted index for the same purpose.
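
To make the idea of an inverted index concrete, here is a toy version in Python. It is a deliberately simplified sketch: it only lowercases and splits on whitespace, whereas Lucene’s real analysis and index structures are far more elaborate. The document IDs and function names are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each lowercased token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: dict, term: str) -> list:
    """Look up one term; a dictionary hit instead of a scan over every document."""
    return sorted(index.get(term.lower(), set()))


docs = {
    1: "San Antonio Spurs",
    2: "Los Angeles Lakers",
    3: "Golden State Warriors",
}
index = build_inverted_index(docs)

print(search(index, "Golden"))   # [3]
print(search(index, "lakers"))   # [2]
```

The point of the structure is that a lookup goes from term to documents, which is the reverse of storing documents and scanning their text.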

Interact with Elasticsearch

Currently, there are two ways to interact with Elasticsearch: client APIs and the RESTful API.

Client API mode:

Elasticsearch provides official clients for several languages, including Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby, as well as many community-provided clients and plug-ins, all of which are listed under Elasticsearch Clients. I’ll write another article later to explain them in detail.

JSON over HTTP:

All other languages can communicate with Elasticsearch through the RESTful API on port 9200, using your favorite web client. In fact, you can even use curl to interact with Elasticsearch.

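An Elasticsearch request consists of the same parts as any HTTP request:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The parts marked with < > are:

  • VERB: the HTTP method, such as GET, POST, PUT, HEAD, or DELETE.
  • PROTOCOL: either http or https.
  • HOST: the hostname of any node in the cluster, or localhost for a node on your local machine.
  • PORT: the port of the Elasticsearch HTTP service, 9200 by default.
  • PATH: the API endpoint, for example _cluster/health.
  • QUERY_STRING: optional query-string parameters, for example ?pretty.
  • BODY: a JSON-encoded request body, if the request needs one.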
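
An Elasticsearch request has the usual HTTP parts: a verb, a URL made of protocol, host, port, path, and query string, and an optional JSON body. As a rough Python counterpart to the curl form, the sketch below assembles those parts with the standard library. The helper name `build_request` and its defaults are assumptions for illustration; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def build_request(verb, path, body=None, query=None,
                  protocol="http", host="localhost", port=9200):
    """Assemble an HTTP request from the same parts as the curl template."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        url += "?" + urllib.parse.urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=verb)


health_request = build_request("GET", "_cluster/health", query={"pretty": "true"})
print(health_request.get_method(), health_request.full_url)
# GET http://localhost:9200/_cluster/health?pretty=true

search_request = build_request("POST", "/_search", body={"query": {"match_all": {}}})
print(search_request.data.decode("utf-8"))
```

Against a running node, `urllib.request.urlopen(health_request)` would perform the call and return the JSON response.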

Data format

An object in an application is rarely just a simple list of keys and values. Typically, they have more complex data structures, which may include dates, geographic information, other objects, or arrays, etc.

Maybe one day you want to store these objects in a database. Using row and column storage in a relational database is akin to squeezing an expressive object into a very large spreadsheet: you have to flatten the object to fit the table structure, usually one column for each field, and you have to reconstruct the object with each query.

Elasticsearch is document-oriented, meaning it stores entire objects or documents. It not only stores them but also indexes the contents of each document so that they can be searched. In Elasticsearch, you index, search, sort, and filter documents, not rows of columnar data. This is a fundamentally different way of thinking about data and is why Elasticsearch can perform complex full-text search.

Elasticsearch uses JavaScript Object Notation or JSON as the serialization format for documents. JSON serialization is supported by most programming languages and has become the standard format in the NoSQL world. It’s simple, concise and easy to read. Almost all languages have modules that can convert any data structure or object to JSON format, though the details vary.
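
In Python, for instance, the standard `json` module does this conversion. The sketch below serializes a nested object without flattening it and reads it back; the field values mirror the sample employee document used in this article.

```python
import json

# A nested application object: plain values plus a list, no flattening needed.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

# Serialize to the JSON text that would be sent as a document body.
document = json.dumps(employee, indent=2)
print(document)

# Parsing the text gives back an equal object, structure intact.
restored = json.loads(document)
print(restored["interests"][0])   # sports
```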

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

Working with indexes

With everything in place, let’s get started and explore Elasticsearch. First, let’s look at what is currently stored across all indices:

GET _search
{
  "query": {
    "match_all": {}
  }
}

The following result information is obtained:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": ".kibana",
        "_type": "config",
        "_id": "5.6.0",
        "_score": 1,
        "_source": {
          "buildNum": 15523
        }
      }
    ]
  }
}

As you can see, there is currently only one index, .kibana, and it is not ours: it belongs to Kibana.

Create the first simple index

The new NBA season has started, and I expect most of you are keeping an eye out for great games. Let’s create an index of NBA teams to begin our learning journey. Note that the index name must be lowercase.

PUT nba
{
  "settings":{
    "number_of_shards": 3,   
    "number_of_replicas": 1	
  },
  "mappings":{
    "nba":{
      "properties":{
        "name_cn":{ 
          "type":"text"
        },
        "name_en":{
          "type":"text"
        },
        "gymnasium":{
          "type":"text"
        },
        "topStar":{
          "type":"text"
        },
        "championship":{
          "type":"integer"
        },
        "date":{
          "type":"date",
          "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

Field descriptions:

Field name            Description
nba                   Index name
number_of_shards      Number of shards
number_of_replicas    Number of replicas
name_cn               Team name in Chinese
name_en               Team name in English
gymnasium             Name of the home arena
championship          Number of championships
topStar               Star player
date                  Date the team joined the NBA
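
The format setting on the date field lists alternatives separated by ||: a value is accepted if it matches any one of them. The sketch below mimics that behavior in Python for the three alternatives used here (date and time, date only, milliseconds since the epoch). It is an approximation for illustration only: Python’s strptime is not identical to Elasticsearch’s date parser, and the function name is hypothetical.

```python
from datetime import datetime, timezone

# Python equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd".
TEXT_PATTERNS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_team_date(value):
    """Try each alternative in turn, the way a ||-separated format does."""
    # epoch_millis: an integer number of milliseconds since 1970-01-01 UTC.
    if isinstance(value, int) and not isinstance(value, bool):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    for pattern in TEXT_PATTERNS:
        try:
            return datetime.strptime(value, pattern).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this alternative did not match; try the next one
    raise ValueError(f"{value!r} matches none of the configured formats")


print(parse_team_date("1949-06-13"))
print(parse_team_date("1995-04-12 08:30:00"))
print(parse_team_date(0))
```

A value such as "13/06/1949" matches none of the alternatives and is rejected, which is what happens to a document whose date field does not fit the mapping.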

If the request is well formed, we get the following response indicating success:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "nba"
}

Adding index data

After the index is created, we add team data to it. The numbers 1, 2, 3, and so on in the request paths are the document IDs we specify.

In fact, we can push data directly without creating the index mapping above. In that case, however, ES automatically sets the field types based on the data, which carries the risk of inaccurate mappings.

PUT /nba/nba/1
{
  "name_en": "San Antonio Spurs SAS",
  "name_cn": "San Antonio Spurs",
  "gymnasium": "AT&T Center",
  "championship": 5,
  "topStar": "Tim Duncan",
  "date": "1995-04-12"
}

PUT /nba/nba/2
{
  "name_en": "Los Angeles Lakers",
  "name_cn": "Los Angeles Lakers",
  "gymnasium": "Staples Center",
  "championship": 16,
  "topStar": "Kobe Bryant",
  "date": "1947-05-12"
}

PUT /nba/nba/3
{
  "name_en": "Golden State Warriors",
  "name_cn": "Golden State Warriors",
  "gymnasium": "Oracle Arena",
  "championship": 6,
  "topStar": "Stephen Curry",
  "date": "1949-06-13"
}

PUT /nba/nba/4
{
  "name_en": "Miami Heat",
  "name_cn": "Miami Heat",
  "gymnasium": "American Airlines Arena",
  "championship": 3,
  "topStar": "LeBron James",
  "date": "1988-06-13"
}

PUT /nba/nba/5
{
  "name_en": "Cleveland Cavaliers",
  "name_cn": "Cleveland Cavaliers",
  "gymnasium": "Quicken Loans Arena",
  "championship": 1,
  "topStar": "LeBron James",
  "date": "1970-06-13"
}

If a document is indexed successfully, the following response is returned:

{
  "_index": "nba",
  "_type": "nba",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}
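
As noted above, indexing documents without an explicit mapping leaves Elasticsearch to guess each field type from the first value it sees. The sketch below imitates that guessing in a heavily simplified form, to show where the guess can differ from what we declared. The rules encoded here (whole numbers become long, date-looking strings become date, other strings become text) are a simplification for illustration, and the regular expression is only an approximation of the default date detection.

```python
import re

# Rough stand-in for default date detection: ISO-style dates such as 1995-04-12.
DATE_LIKE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\S*)?$")


def guess_field_type(value) -> str:
    """Guess a field type from a single JSON value (simplified sketch)."""
    if isinstance(value, bool):        # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_LIKE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


team = {"name_en": "Miami Heat", "championship": 3, "date": "1988-06-13"}
guessed = {field: guess_field_type(value) for field, value in team.items()}
print(guessed)
```

Here championship would be guessed as long rather than the integer we declared, and a value such as "1988-06-13 10:00:00" would be treated as text in this sketch, which is the kind of mismatch an explicit mapping avoids.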

Querying index data

Above we used match_all to query across all indices. We can also query only the index we need.

Elasticsearch provides a rich and flexible query language called the Query DSL, which allows you to build more complex and powerful searches.

1. Match queries: match, match_all

Let’s start with the simplest search request, one that returns every team.

Query all teams:

GET /nba/nba/_search
{
    "query": {
        "match_all": {}
    }
}

The query results are as follows:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "nba",
        "_type": "nba",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name_en": "Los Angeles Lakers",
          "name_cn": "Los Angeles Lakers",
          "gymnasium": "Staples Center",
          "championship": 16,
          "topStar": "Kobe Bryant",
          "date": "1947-05-12"
        }
      },
      {
        "_index": "nba",
        "_type": "nba",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name_en": "San Antonio Spurs SAS",
          "name_cn": "San Antonio Spurs",
          "gymnasium": "AT&T Center",
          "championship": 5,
          "topStar": "Tim Duncan",
          "date": "1995-04-12"
        }
      },
      {
        "_index": "nba",
        "_type": "nba",
        "_id": "3",
        "_score": 1,
        "_source": {
          "name_en": "Golden State Warriors",
          "name_cn": "Golden State Warriors",
          "gymnasium": "Oracle Arena",
          "championship": 6,
          "topStar": "Stephen Curry",
          "date": "1949-06-13"
        }
      }
      ···
    ]
  }
}

The response is divided into two parts:

{
----------------first part--------------------
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
---------------second part---------------------
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The first part contains information about the shards that were searched, and the second part, wrapped in hits, contains the matching documents.

Note: Not only does the response tell us which documents were matched, but it also contains the complete content of those documents – all the information we need to show the user the search results.
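
Because each hit carries the full document in _source, client code usually only needs to check the shard section and then unwrap the hits. A minimal sketch follows, assuming the response has already been parsed into a Python dictionary; the helper names are hypothetical and the sample is a shortened version of the response above.

```python
def all_shards_ok(response: dict) -> bool:
    """First part of the response: did every shard answer without failure?"""
    shards = response["_shards"]
    return shards["failed"] == 0 and shards["successful"] == shards["total"]


def extract_sources(response: dict) -> list:
    """Second part of the response: unwrap the documents inside hits.hits."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


sample_response = {
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 3, "successful": 3, "skipped": 0, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 1,
        "hits": [
            {"_index": "nba", "_type": "nba", "_id": "2", "_score": 1,
             "_source": {"name_en": "Los Angeles Lakers", "championship": 16}},
            {"_index": "nba", "_type": "nba", "_id": "3", "_score": 1,
             "_source": {"name_en": "Golden State Warriors", "championship": 6}},
        ],
    },
}

print(all_shards_ok(sample_response))
team_names = [doc["name_en"] for doc in extract_sources(sample_response)]
print(team_names)
```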

Query the team whose English name is “Golden State Warriors”:

GET /nba/nba/_search
{
   "query": {
        "match": {
            "name_en": "Golden State Warriors"
        }
    }
}

The query results are as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.9646256,
    "hits": [
      {
        "_index": "nba",
        "_type": "nba",
        "_id": "3",
        "_score": 1.9646256,
        "_source": {
          "name_en": "Golden State Warriors",
          "name_cn": "Golden State Warriors",
          "gymnasium": "Oracle Arena",
          "championship": 6,
          "topStar": "Stephen Curry",
          "date": "1949-06-13"
        }
      }
    ]
  }
}

2. Filter queries: filter

Let’s make the search a little more complicated. We still want teams whose star is LeBron James, but only those with more than one championship. The query changes slightly to add a filter, which lets us run a structured search efficiently:

GET /nba/nba/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "championship": { "gt": 1 }
        }
      },
      "must": {
        "match": { "topStar": "lebron James" }
      }
    }
  }
}
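
To see what this query selects, the sketch below applies the same two conditions to the five teams in plain Python. It is only an emulation for intuition, not how Elasticsearch evaluates queries: the match part is approximated as "shares at least one lowercased word", the range part as a simple comparison, and no relevance score is computed.

```python
query = {
    "query": {
        "bool": {
            "filter": {"range": {"championship": {"gt": 1}}},
            "must": {"match": {"topStar": "lebron James"}},
        }
    }
}

teams = [
    {"name_en": "San Antonio Spurs SAS", "championship": 5, "topStar": "Tim Duncan"},
    {"name_en": "Los Angeles Lakers", "championship": 16, "topStar": "Kobe Bryant"},
    {"name_en": "Golden State Warriors", "championship": 6, "topStar": "Stephen Curry"},
    {"name_en": "Miami Heat", "championship": 3, "topStar": "LeBron James"},
    {"name_en": "Cleveland Cavaliers", "championship": 1, "topStar": "LeBron James"},
]


def tokens(text: str) -> set:
    """Crude stand-in for text analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def matches(doc: dict, q: dict) -> bool:
    """Apply the must/match and filter/range clauses of the bool query to one doc."""
    bool_query = q["query"]["bool"]
    match_field, match_text = next(iter(bool_query["must"]["match"].items()))
    range_field, bounds = next(iter(bool_query["filter"]["range"].items()))
    text_ok = bool(tokens(doc[match_field]) & tokens(match_text))
    range_ok = doc[range_field] > bounds["gt"]
    return text_ok and range_ok


result = [team["name_en"] for team in teams if matches(team, query)]
print(result)   # ['Miami Heat']
```

The Cavaliers also match on the star’s name but are removed by the filter, since one championship is not greater than one.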

This shows how Elasticsearch performs full-text searches on fields and returns the most relevant results first. The concept of relevance is very important in Elasticsearch, and it is the biggest difference from traditional relational databases, where a record either matches or does not.

Conclusion

Due to space limitations, this introduction to the most basic concepts of Elasticsearch stops here for now, although there is much more to write. After nearly a month of research and development, I have a general understanding of ES, but many details still need thought. ES is powerful and feature-rich, and using it fluently will take some time.

Later posts will continue to cover ES, probably including building an ES service in Java, SQL queries in ES, and so on. If you are interested, follow my blog; to keep the quality up, I update about once a week.

In addition, I have included a mind map summarizing this article, made by my friend ReyCG, to make the basic information about Elasticsearch clearer.


Personal public account: JaJian
