Preface

In search, Lucene is the most popular library. A few years ago the industry was asking: do you know Lucene? Do you understand how an inverted index works? Those questions are now out of date, because many projects today use a distributed search engine built on Lucene: ElasticSearch, or ES for short.

Distributed search has now basically become standard for Java systems across most of the Internet industry, and ES is particularly popular. A few years ago, before ES caught on, we generally used Solr. In recent years, however, most companies and projects have shifted to ES.

What follows is a brief overview of the architecture of ES as a distributed search engine.

The infrastructure

ElasticSearch is designed as a distributed search engine and is based on Lucene. The core idea is to start multiple ES process instances on multiple machines to form an ES cluster.

The basic unit for storing data in ES is an index. For example, if you want to store some order data in ES, you create an index named order_idx, and all order data is written to this index. An index is roughly equivalent to a table in MySQL.

Index -> type -> Mapping -> document -> field.

To make this a little more concrete, let me draw an analogy. But remember that an analogy is only an aid to understanding, not an equivalence.

As a first approximation, an index is like a table in MySQL. An index can have multiple types. The fields of each type are almost the same, with slight differences. Say we have an order index. Just as in a MySQL order table, some orders are for physical goods, such as a dress or a pair of shoes, and some are for virtual goods, such as game point cards or phone top-ups. Most fields of the two kinds of order are the same, but a few differ.

Therefore, two types are created in the order index: one for physical goods orders and one for virtual goods orders. Most fields of these two types are the same, but some are different.

In many cases an index holds only one type, but an index can indeed hold several. (Note that mapping types were deprecated in ElasticSearch 7.x and removed in 8.0, and since 6.0 a newly created index can hold only a single type.) You can think of an index as a category of tables, with each type corresponding to one table in MySQL. In other words, if a type is a specific table, an index groups several similar types together.

Each type has a mapping, which is the table structure definition for that type. Just as you must define a MySQL table's structure when you create it (which fields it has and what data type each field is), the mapping defines the fields of the type and their data types.

What you actually write into a type in an index is a piece of data called a document. A document corresponds to a row in a MySQL table. Each document has multiple fields, and each field holds one value in that document.
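As a rough illustration of the mapping and document concepts, the sketch below writes both as plain Python dictionaries, in the JSON shape used by typeless indices (ElasticSearch 7.x and later). The field names (`order_id`, `goods_name`, `amount`, `created_at`) and the helper `unmapped_fields` are hypothetical and exist only for this example.

```python
# Hypothetical mapping for the order index: the "table structure".
# Field names are made up for illustration; "keyword", "text",
# "double" and "date" are ElasticSearch field data types.
order_mapping = {
    "mappings": {
        "properties": {
            "order_id": {"type": "keyword"},
            "goods_name": {"type": "text"},
            "amount": {"type": "double"},
            "created_at": {"type": "date"},
        }
    }
}

# One document: the "row". Each key is a field, each value is its value.
order_document = {
    "order_id": "A1001",
    "goods_name": "running shoes",
    "amount": 59.9,
    "created_at": "2020-01-15",
}


def unmapped_fields(mapping, document):
    """Return the document fields that the mapping does not declare."""
    declared = mapping["mappings"]["properties"]
    return sorted(name for name in document if name not in declared)


print(unmapped_fields(order_mapping, order_document))
```

By default ElasticSearch adds an undeclared field to the mapping dynamically rather than rejecting the document, so the helper is only there to make the relationship between a mapping and a document visible.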

An index can be split into multiple shards, and each shard stores part of the data. Splitting has two benefits. The first is horizontal scaling. For example, if you have 3 TB of data in 3 shards, each shard holds 1 TB. If the data grows to 4 TB, how do you expand? Because the number of primary shards is fixed when an index is created, one common approach is to create a new index with 4 shards and reindex the data into it. The second is improved performance. Data is distributed across multiple shards, that is, across multiple servers, so operations execute in parallel on multiple machines, which improves throughput and performance.
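One way to picture how documents spread across shards is hash-based routing: hash a routing key (by default the document id) and take the remainder modulo the number of shards. ES routes documents with a formula of this general shape, but the sketch below is only an approximation: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for the hash function ES actually uses.

```python
import zlib


def pick_shard(routing_key: str, number_of_shards: int) -> int:
    """Map a routing key (e.g. a document id) to a shard number.

    zlib.crc32 is a stand-in hash for this sketch, not the hash ES uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


order_ids = [f"order-{n}" for n in range(1000)]

# Distribute 1000 hypothetical order ids over 3 shards.
per_shard = [0, 0, 0]
for order_id in order_ids:
    per_shard[pick_shard(order_id, 3)] += 1
print(per_shard)

# Count how many ids would land on a different shard with 4 shards.
moved = sum(1 for o in order_ids if pick_shard(o, 3) != pick_shard(o, 4))
print(moved, "of", len(order_ids), "ids map to a different shard")
```

The second count hints at why the shard count cannot simply be changed in place: with modulo routing, a different shard count sends many existing ids to a different shard, so the data has to be redistributed.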

Each shard's data also has multiple copies. Every shard has one primary shard, which is responsible for writing data, and several replica shards. After data is written to the primary shard, it is synchronized to the replica shards.

With this replica scheme, every shard's data exists in multiple copies. If one machine goes down, it doesn't matter, because other machines hold copies of the data. This is what gives ES high availability.
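The primary/replica write path can be sketched as a toy in-memory model. This is not how ES is implemented; the classes `ShardCopy` and `ReplicatedShard` and the node names are hypothetical, and the sketch assumes every replica is reachable when the write happens.

```python
class ShardCopy:
    """One copy of a shard, living on one node (names are hypothetical)."""

    def __init__(self, node: str, is_primary: bool):
        self.node = node
        self.is_primary = is_primary
        self.docs = {}


class ReplicatedShard:
    """A primary copy plus its replica copies."""

    def __init__(self, primary_node: str, replica_nodes: list):
        self.primary = ShardCopy(primary_node, is_primary=True)
        self.replicas = [ShardCopy(n, is_primary=False) for n in replica_nodes]

    def write(self, doc_id: str, doc: dict) -> None:
        # The write is applied on the primary first...
        self.primary.docs[doc_id] = doc
        # ...and then copied to every replica.
        for replica in self.replicas:
            replica.docs[doc_id] = dict(doc)


shard = ReplicatedShard("node-1", ["node-2", "node-3"])
shard.write("A1001", {"goods_name": "running shoes"})

# Even if node-1 is lost, the document is still held on node-2 and node-3.
survivors = [r.node for r in shard.replicas if "A1001" in r.docs]
print(survivors)
```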

If there are multiple nodes in an ES cluster, one node is automatically elected as the master node. The master node does management work, such as maintaining index metadata and switching shards between the primary and replica roles. If the master node goes down, a new master node is elected.

If a non-master node goes down, the master node transfers the primary role of each primary shard on the failed node to a replica shard on another machine. When you later repair the failed machine and restart it, the master node allocates the missing replica shards to it and synchronizes the data changes made in the meantime, restoring the cluster to normal.

To put it more simply: if a non-master node goes down, the primary shards on that node are gone. The master then promotes the replica shard corresponding to each of those primary shards (on another machine) to primary. When the failed machine is repaired, the shards on the repaired node come back as replica shards, not as primary shards.
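The failover sequence described above can be sketched as a small simulation that tracks the role of one shard's copy on each node. The `Cluster` class and node names are hypothetical, and promoting the alphabetically first surviving node is an arbitrary choice made for this sketch, not the rule ES applies.

```python
class Cluster:
    """Toy model of one shard's copies across nodes (all names hypothetical)."""

    def __init__(self, primary_node: str, replica_nodes: list):
        # Maps node name -> role of the shard copy on that node.
        self.roles = {primary_node: "primary"}
        for node in replica_nodes:
            self.roles[node] = "replica"

    def node_failed(self, node: str) -> None:
        """What the master does when a non-master node goes down."""
        lost_role = self.roles.pop(node)
        if lost_role == "primary" and self.roles:
            # Promote a surviving replica to primary (arbitrary pick here).
            survivor = sorted(self.roles)[0]
            self.roles[survivor] = "primary"

    def node_recovered(self, node: str) -> None:
        """A repaired node rejoins holding a replica, not the primary."""
        self.roles[node] = "replica"


cluster = Cluster("node-1", ["node-2", "node-3"])
cluster.node_failed("node-1")
print(cluster.roles)
cluster.node_recovered("node-1")
print(cluster.roles)
```

After the failure, node-2 holds the primary; after recovery, node-1 rejoins as a replica and node-2 keeps the primary role.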

This is the basic architecture of ElasticSearch as a distributed search engine.