I’m “Beer and Latiao.”

Just do good deeds, and don’t ask about the future.

What is Elasticsearch?

Elasticsearch is a document-oriented NoSQL database: a distributed, RESTful search and analytics engine at the core of the Elastic Stack, where it centrally stores your data. Elasticsearch, Logstash, and Kibana are often combined into a log-analysis system, commonly known as ELK.

To put it bluntly, it is a database whose searches are ridiculously fast (though inserts and updates are slower — otherwise other databases would be out of a job). Besides being fast, it can tokenize text (word segmentation), which makes it a great fit for search features such as product search in an online mall. Why is it fast? We’ll get to that later when we cover how it works — it isn’t just caching; it’s the underlying design. Being NoSQL, the data format can be fairly arbitrary. Elasticsearch also exposes a rich RESTful API, so the code you write against it is cheap. Finally, it is distributed: high performance (fast search), highly available (it keeps working when some nodes go down), and scalable (nodes can be added easily, even online, to get past the memory limits of a single machine), which makes it well suited to distributed system development.

Basic concepts of Elasticsearch

For a quick feel for Elasticsearch (ES for short), let’s compare some of its concepts with MySQL’s.

Elasticsearch | MySQL
Field         | Column (attribute)
Document      | Row (record)
Type          | Table
Index         | Database

Isn’t that much clearer? Elasticsearch is called document-based because its unit of record (the smallest unit of search) is a document. For example, here is a document:

{
    "email":      "[email protected]",
    "first_name": "John",
    "last_name":  "Smith",
    "info": {
        "bio":       "Eco-warrior and defender of the weak",
        "age":       25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}

The document format looks a lot like JSON. The fields are email, first_name, and so on. Since the structure is JSON, a value can easily be of any type — that is the benefit of NoSQL.

Document

A document in ES corresponds to an object in your application code (Java, for example). Each field of a document can hold any type. However, once a document is inserted into an index (we omit Type in this description, though it still exists) and a field is used for the first time, ES fixes that field’s type. For example, if you insert a user whose name is a string, then in any document inserted later, the name field must also be a string. It is therefore recommended to define the type of each field (a mapping) before inserting documents.
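The type-locking behavior can be illustrated with a small Python sketch — a toy model of dynamic mapping, not Elasticsearch’s real implementation:

```python
# Toy model of Elasticsearch dynamic mapping: the first value seen for a
# field fixes that field's type; later documents must conform.

class ToyIndex:
    def __init__(self):
        self.mapping = {}  # field name -> Python type

    def insert(self, doc):
        for field, value in doc.items():
            # setdefault fixes the type on first use of the field
            expected = self.mapping.setdefault(field, type(value))
            if not isinstance(value, expected):
                raise TypeError(
                    f"field '{field}' is mapped as {expected.__name__}, "
                    f"got {type(value).__name__}")

index = ToyIndex()
index.insert({"name": "Kate", "age": 24})  # fixes name=str, age=int
try:
    index.insert({"name": 42})             # conflicts with the mapping
except TypeError as e:
    print(e)
```

Real ES rejects such a document with a mapping conflict error in much the same spirit; defining an explicit mapping up front avoids the surprise.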

If you do not specify an ID when inserting a document, ES generates one for you automatically. Numeric IDs are recommended, since they are much faster to look up. In a mall system it is a good idea to generate product IDs with the snowflake algorithm, which both avoids the security problems of auto-increment IDs and sidesteps the slow retrieval of string IDs.
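A minimal snowflake-style generator looks like the following sketch. The bit layout (41-bit millisecond timestamp, 10-bit worker id, 12-bit sequence) is the common convention; real deployments also handle clock drift, which this sketch does not:

```python
import time

# Simplified snowflake-style ID generator:
#   [41-bit ms timestamp][10-bit worker id][12-bit per-ms sequence]

EPOCH = 1288834974657  # custom epoch in ms (Twitter's, used as an example)

class Snowflake:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        ms = int(time.time() * 1000)
        if ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:        # sequence exhausted: wait 1 ms
                while ms <= self.last_ms:
                    ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = ms
        return ((ms - EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
ids = [gen.next_id() for _ in range(5)]
print(ids)  # strictly increasing integers that fit in 64 bits
```

Because the timestamp occupies the high bits, IDs sort by creation time, which is exactly the property that makes them pleasant for search and range scans.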

Type

Regarding Type: in 6.x, an index could have multiple types. In 7.x, an index can have only one type, the default being _doc; 7.x deprecates types, and they are removed entirely in 8.x. Even so, in 7.x a document must still belong to a type.

Index

Indexes in ES are commonly said to resemble databases in MySQL, though I suspect the index will come to feel more like MySQL’s table concept over time. We put documents whose fields share the same characteristics into the same index; that way, the field types can be defined in advance through a mapping. Also note that index names must be all lowercase, so camelCase is not an option.

Nodes and Shards

Since ES is almost always deployed as a cluster in production, the concept of a node is essential. A node is one ES instance, i.e. one Java process; these processes run on different servers to increase the availability of ES.

ES nodes can be classified into three types according to their functions:

  1. Master node: responsible for cluster-level operations, such as creating or deleting indexes, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. Every node can read the cluster state, but only the master node may modify it.
  2. Data node: mainly stores data; it handles document create/read/update/delete, search, and aggregations. Data nodes are demanding on CPU, memory, and I/O; when resources run short, new data nodes can be added to scale out.
  3. Client node: handles request routing and distributes search and index operations. Master and data nodes can also route and forward requests, but for efficiency it is recommended to run dedicated client nodes in production.

Sharding is similar to splitting a large MySQL table: an index is split into several small indexes (shards) distributed across different nodes (different servers), and each small index is fully functional. When a request arrives from a client, the client node finds the shards holding the relevant data and queries them; the process is transparent to the user, who appears to be operating on a single index. With sharding, you avoid the physical limits of a single node and increase throughput. Decide at the start how many shards an index needs, because changing the number of primary shards later is a cumbersome process.
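The routing that makes this transparent boils down to `shard = hash(routing_value) % number_of_primary_shards`, where the routing value defaults to the document ID. A sketch (Elasticsearch actually uses murmur3; md5 here is just a stable stand-in):

```python
import hashlib

# Sketch of Elasticsearch-style document routing:
#   shard = hash(routing_value) % number_of_primary_shards

def route(doc_id: str, num_primary_shards: int) -> int:
    # md5 stands in for murmur3; any stable hash shows the idea
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return h % num_primary_shards

for doc_id in ["1", "2", "3", "4", "5"]:
    print("doc", doc_id, "-> shard", route(doc_id, 3))
```

This formula is also why changing the number of primary shards later is painful: the modulus changes, every document routes to a different shard, and the data must be reindexed.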

As a distributed database, ES must also give us data redundancy: the shard replica, a copy of a shard placed on another node. Note that a shard and its replica must be on different nodes! Replicas can also improve throughput. Unlike the number of primary shards, the number of replicas can be changed easily.

Having covered the concepts, go back to the diagram at the beginning of this section: an index with three shards spread over three nodes, each shard with a replica on a different node.

Elasticsearch index principle

Now that you have a basic understanding of Elasticsearch, you can already use it in a project once you pick up the basic operations (I will cover those in a later post). So catch your breath and read the rest with a relaxed mind: let’s talk about why its index is fast.

First of all, we know that MySQL’s underlying data structure is the B+Tree, which is already very fast, yet Elasticsearch retrieval is faster still. How does Elasticsearch do it? Two things: an optimized storage structure, and more efficient interaction with the disk.

Let’s start with Elasticsearch’s storage structure: the inverted index. Its general logic is as follows.

To illustrate this concept, let’s take a look at the following example of our user data:

ID | Name  | Age
1  | Kate  | 24
2  | John  | 24
3  | Bill  | 29
4  | Kate  | 26
5  | Brand | 29

Elasticsearch builds two inverted indexes for the data above, one per field:

Term (Name) | Posting List
Kate        | 1, 4
Brand       | 5
John        | 2
Bill        | 3

Term (Age) | Posting List
24         | 1, 2
26         | 4
29         | 3, 5

Each field has a set of terms, and each term is followed by the IDs of the documents that contain it; this set of IDs is the posting list. Once you have the IDs, fetching the corresponding documents from disk is fast.
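Building the two posting-list tables above takes only a few lines; here is a minimal Python sketch over the sample user rows:

```python
# Build per-field inverted indexes (term -> posting list of doc IDs)
# for the sample user data above.

rows = [
    (1, "Kate", 24), (2, "John", 24), (3, "Bill", 29),
    (4, "Kate", 26), (5, "Brand", 29),
]

def inverted_index(rows, field):
    index = {}
    for doc_id, name, age in rows:
        term = name if field == "name" else age
        index.setdefault(term, []).append(doc_id)
    return index

by_name = inverted_index(rows, "name")
by_age = inverted_index(rows, "age")
print(by_name["Kate"])  # [1, 4]
print(by_age[29])       # [3, 5]
```

A query like “name = Kate” never scans the rows: it looks up the term once and gets the posting list directly.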

Have you noticed that searching the terms is faster when they are in order? If the terms are sorted, a binary search finds one in O(log N) time, on par with a B-Tree. This sorted set of terms is the Term Dictionary.
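The binary search over a sorted term dictionary is exactly what Python’s `bisect` gives you; a tiny sketch:

```python
import bisect

# Binary search over a sorted term dictionary: O(log N) per lookup.
terms = sorted(["Kate", "Brand", "John", "Bill"])

def find(term):
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

print(find("John"))  # position in the sorted dictionary
print(find("Zoe"))   # -1: not present
```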

The term dictionary can get very large, so Elasticsearch keeps a small prefix tree in memory whose nodes point into the term dictionary — for example, one node records where the terms starting with ‘A’ begin, another where those starting with ‘Z’ begin. This tree is the Term Index. A prefix in the Term Index does not have to be a single leading character: A, Ab, and Abz can all be nodes in the tree. Also, the Term Dictionary may be too large to hold in memory, so it is kept on disk.
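A toy version of this two-level lookup — an in-memory prefix map pointing to offsets in the (sorted, on-disk) term dictionary — might look like this. Real engines use an FST for the index and binary search within a block; the linear scan here just keeps the sketch short:

```python
# Toy term index: an in-memory map from prefixes to offsets into the
# sorted term dictionary, so a lookup only touches one block of it.

term_dictionary = ["Ab", "Abz", "Bill", "Brand", "John", "Kate"]  # sorted

# prefix (first char) -> position of the first term with that prefix
term_index = {}
for pos, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], pos)

def lookup(term):
    start = term_index.get(term[0])
    if start is None:
        return -1
    # scan only this prefix's block (binary search in real engines)
    for pos in range(start, len(term_dictionary)):
        if term_dictionary[pos] == term:
            return pos
        if term_dictionary[pos][0] != term[0]:
            break
    return -1

print(lookup("Abz"))   # found inside the 'A' block
print(lookup("Kate"))  # found inside the 'K' block
```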

Putting it together, a lookup goes: Term Index (in memory) → Term Dictionary (on disk) → Posting List → documents. Is the picture a little clearer now?

Because the Term Index lives in memory, it pays to compress it to reduce memory usage. The compression uses an FST (finite state transducer), which is a bit involved to explain here; the takeaway is that it compresses the shared parts of terms so the index fits in far less memory.

We have compressed the terms; can we compress the Posting List as well to save space? If you have used Redis, you will immediately think of a bitmap: a huge array of 0s and 1s, with a 1 where an ID is present and a 0 where it is not. For example, the posting list 3, 5 above becomes 0, 0, 1, 0, 1 as a bitmap. The space is often much smaller, but if a posting list stores only the IDs 1 and 10,000,001, the bitmap becomes enormous for just two entries.

So Roaring bitmaps were invented. The idea is simply quotient-and-remainder storage, with 65536 as the divisor: each ID is split into a quotient (id // 65536, its high 16 bits) and a remainder (id % 65536, its low 16 bits), and IDs are grouped by quotient. Take the IDs 1000, 62101, 131385, 196658: grouped this way they become [(0, 1000), (0, 62101)], [], [(2, 313)], [(3, 50)] — notice there is no ID with quotient 1, hence the empty group. The remainders within each group are then stored in a small container, such as a short array or a 65536-bit bitmap.
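The grouping step can be sketched in a few lines of Python (just the high/low 16-bit split; real Roaring bitmaps additionally choose an array or bitmap container per group based on its size):

```python
# Roaring-bitmap-style grouping: split each id into a container key
# (id // 65536, the high 16 bits) and a value (id % 65536, the low 16 bits).

def roaring_group(ids):
    containers = {}
    for i in sorted(ids):
        containers.setdefault(i // 65536, []).append(i % 65536)
    return containers

groups = roaring_group([1000, 62101, 131385, 196658])
print(groups)
# container 0 holds [1000, 62101]; container 2 holds [313]; container 3 holds [50]
```

Every stored value now fits in 16 bits regardless of how large the original IDs were, which is where the space saving comes from.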