Preface

“The world is awash in data. For years, we have been overwhelmed by the flow and volume of data generated by our systems. Existing technologies focus on how to store and structure warehouses full of data. That is all well and good until you actually need to make decisions based on that data in real time. Elasticsearch is a distributed, scalable, real-time search and data analysis engine. Whether you need full-text search, real-time statistics on structured data, or a combination of the two, Elasticsearch can do it. Elasticsearch is more than just full-text search; we also introduce structured search, data analysis, complex human language processing, geographic location, and relationships between objects.”

- Elasticsearch: The Definitive Guide » Preface (https://www.elastic.co/guide/cn/elasticsearch/guide/cn/preface.html)

The excerpt above, from Elasticsearch: The Definitive Guide, describes the superpowers ES offers. This series of articles is a practical summary drawn from real work. Its ultimate goal is to explore how to migrate smoothly to ES (on a Java technology stack) on top of existing business requirements, in order to compensate for the limitations of relational databases and improve our ability to process data.

This series is tentatively planned in four parts:

  • Part 1: Why we need ES, and its basic concepts.
  • Part 2: The migration strategy, technology selection, and strategies for synchronizing data from MySQL to ES in various scenarios.
  • Part 3: Integrating Spring Boot with ES and using the APIs provided by ES to “translate SQL”.
  • Part 4: Building on the technical foundation of the earlier parts, we try to solve a realistic headache: how to use ES when a single RDBMS table holds tens of millions of rows and queries join multiple tables.

So without further ado, let’s see why we need to use ES.

I. Why Elasticsearch is needed

1. Fast, really fast

The main reason we use ES is that it is fast. Once the data volume reaches tens of millions of rows, optimizing a single relational table by adding indexes or by splitting databases and tables (sharding) often yields unsatisfactory results, and sharding adds considerable complexity. ES, by contrast, handles tens of millions of documents with ease.

To achieve this speed, ES uses inverted indexes backed by finite state transducers (FSTs) for full-text retrieval, BKD trees for storing numeric and geolocation data, and columnar storage for analytics. And because ES indexes all fields by default, any field can be searched at query time, in near real time.
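To make the idea of an inverted index concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the whitespace tokenizer, the function names, and the sample documents are made up for this example, and Lucene's real implementation (FST term dictionary, compressed postings lists) is far more sophisticated.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample data, for illustration only.
docs = {
    1: "A Madman's Diary by Lu Xun",
    2: "The True Story of Ah Q by Lu Xun",
    3: "Rickshaw Boy by Lao She",
}
index = build_inverted_index(docs)
print(search(index, "lu xun"))
print(search(index, "lao she"))
```

The point of the structure is that a query looks up each term directly and intersects the resulting id lists, instead of scanning the text of every document.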

2. Not just full text search

This is another important reason to use ES. In a traditional relational database, we often resort to fuzzy queries to get the data we want:

SELECT * FROM author WHERE name LIKE '%Lu Xun%';

Even if we create an index on name, a pattern with a leading wildcard cannot use it, so the database falls back to a full table scan. Implementing the same query in ES, however, is very simple, and thanks to the storage and indexing strategy ES adopts, the desired results come back in real time.
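As a rough illustration of how simple the ES side can be, the sketch below builds the request body for a full-text match query on the name field. The helper name like_to_match_query is hypothetical, and the body is only constructed here, not sent anywhere; in practice it would be posted to the _search endpoint of an index (an index called author is assumed). Note that a match query analyzes the text and matches on terms, so it is not an exact equivalent of the substring semantics of LIKE.

```python
import json


def like_to_match_query(field, text):
    """Build a search request body with a full-text match query on one field."""
    return {"query": {"match": {field: text}}}


# Roughly corresponds to the fuzzy SQL query on author.name shown above.
body = like_to_match_query("name", "Lu Xun")
print(json.dumps(body, indent=2))
```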

In essence, ES is a search and analysis engine. In the abstract, a search engine does three things: it collects data, builds an index over that data, and ranks results by relevance. First, collecting data is our job in ES, for example by synchronizing data from a relational database into ES. Second, ES indexes the synchronized data to make later queries fast. Third, and most important, comes relevance ranking: with such a large amount of data, not every item is equally important, so ranking is crucial for a search engine and determines the quality of the search.

The same holds in ES. When we run a query, ES not only retrieves the data but also ranks the documents matched by the keywords according to relevance. This is closer to how people think. When we search for the keyword “Lu Xun”, we do not want every article and report that happens to contain the words Lu Xun. Instead, we hope to find Lu Xun's most important works, personal background, life experience, historical evaluation, and similar information, in order of relevance.

This is something traditional relational databases cannot provide: ES “seems” to understand what we really want.
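To illustrate what ranking by relevance means, here is a small sketch of textbook BM25 scoring, the family of scoring functions ES uses by default. It is a simplified illustration, not Lucene's exact implementation: the tokenizer, the function name bm25_rank, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # commonly used BM25 parameters


def bm25_rank(docs, query):
    """Return ids of matching documents, most relevant first (textbook BM25)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(terms) for terms in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms weigh more; long documents are penalized.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(terms) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sample data, for illustration only.
docs = {
    1: "Lu Xun biography and major works of Lu Xun",
    2: "A long news report about modern literature that mentions Lu Xun only once",
    3: "Rickshaw Boy is a novel by Lao She",
}
ranking = bm25_rank(docs, "lu xun")
print(ranking)
```

The short document that is mostly about Lu Xun ranks ahead of the long report that mentions him once, and the unrelated document is not returned at all.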

3. A complete ecosystem

Elasticsearch is the core of the Elastic Stack, which also includes Logstash, Filebeat, Kibana, and more.

We can use this stack for a variety of purposes. A typical example is a distributed log collection system built with ELK + Filebeat: Filebeat collects the logs of each microservice and pushes them to a Logstash pipeline for processing, Logstash forwards them to ES, and Kibana is finally used to display and query the logs.

4. Scalability

For most databases, taking advantage of additional resources added horizontally often requires significant application changes. Elasticsearch, on the other hand, is distributed by nature. In an ES cluster, nodes can be added and removed at any time, and the cluster redistributes the data evenly across them. It knows how to manage multiple nodes to improve scalability and availability, which also means your application does not have to deal with this problem.


Of course, every technology has its proper application scenarios. ES does not support transactions and is better suited to scenarios with many queries and few updates, so we need to be aware of these limitations when choosing a technology stack.

II. Core Concepts

Whether we are developing against an ES cluster or maintaining one, it is important to be clear about the core concepts of ES. Here we introduce them from the smallest to the largest.

1. Fields

A field is the smallest independent unit of data in ES. Each field has its own data type, which we can define explicitly to override the type ES infers automatically. We can also configure analysis, tokenization, and so on for individual fields.

Core data types include String, Numeric, Date, Boolean, Binary, Range, and so on, while complex types include Object and Nested. For details, refer to the official documentation.

2. Documents

A document in ES is equivalent to a row in an RDBMS, except that ES stores the document directly as JSON (so it can be nested) rather than “flattening” it as an RDBMS does. This is a major difference between NoSQL and relational databases.

Here is an example of a document:

{
  "_id": 3,
  "_type": "your index type",
  "_index": "your index name",
  "_source": {
    "age": 28,
    "name": ["Daniel"],
    "year": 1989
  }
}

3. Mapping

“Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.”

In other words, a mapping defines how documents and their fields are stored and indexed. A mapping can be used to define the following:

  • Which string fields should be treated as full-text fields
  • Which fields contain numbers, dates, or geolocations
  • The format of date values
  • Custom rules to control the mapping for dynamically added fields
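The four points above can be sketched as the body of a create-index request. The example below only builds and prints that body as a Python dictionary; it does not talk to a cluster. The function name and the field names (name, age, birthday, location) are hypothetical and chosen for illustration, while text, integer, date, geo_point, format, and dynamic_templates are standard mapping types and parameters.

```python
import json


def build_author_mapping():
    """Build a hypothetical create-index body covering the points listed above."""
    return {
        "mappings": {
            # Rule for dynamically added fields: map new string fields to keyword.
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                }
            ],
            "properties": {
                # Full-text field: analyzed and searchable by individual terms.
                "name": {"type": "text"},
                # Numeric field.
                "age": {"type": "integer"},
                # Date field with an explicit format.
                "birthday": {"type": "date", "format": "yyyy-MM-dd"},
                # Geolocation field.
                "location": {"type": "geo_point"},
            },
        }
    }


mapping_body = build_author_mapping()
print(json.dumps(mapping_body, indent=2))
```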

4. Index

An index is the largest unit of data storage in ES: a collection of documents with similar characteristics. With the gradual removal of types since ES 7.0, the index has gone from being a “database”-like concept to a de facto “table”. We can think of it as a table in an RDBMS, but note that an index is only a logical concept; the actual data is distributed across shards.

5. Shards

A few more words are needed here, because understanding shards is essential to understanding how an ES cluster works (capacity expansion, fault tolerance, routing).

First, each shard is an instance of a Lucene index, which we can think of as a separate search engine that indexes a subset of data in the Elasticsearch cluster and handles related queries.

Shards are divided into two types: Primary Shard and Replica Shard.

  • Primary shard: All data is stored in primary shards, so the number of primary shards determines the maximum amount of data an index can hold. Once the number of primary shards is specified at index creation time, however, it cannot be changed. This is because, when a document is indexed, it is routed to a primary shard by hashing a routing value (by default the document _id), much like the routing strategy used when sharding databases and tables in an RDBMS. If we changed the number of primary shards, previously indexed documents could no longer be located. The number is set with number_of_shards in the index settings at creation time; by default, the maximum value is 1024.

  • Replica shard: We can specify any number of replica shards for each primary shard, depending on the actual hardware resources. As mentioned above, every shard can handle queries, so adding replica shards (with a corresponding increase in hardware resources) improves the system's processing capacity. The number of replicas per primary shard is set with the number_of_replicas parameter in the index settings.

For fault tolerance, a primary shard and its replicas are never placed on the same node, which prevents data loss when the node's host goes down.
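The reason the number of primary shards cannot change can be seen in a small sketch of the routing idea, in its simplified form shard = hash(routing) % number_of_primary_shards. This is an illustration only: zlib.crc32 stands in for the Murmur3 hash that ES uses internally, and the function name and document ids are hypothetical.

```python
import zlib


def route_to_shard(routing, num_primary_shards):
    """Simplified routing: hash(routing) % number of primary shards."""
    # Assumption: crc32 is a stand-in for the hash function ES really uses.
    return zlib.crc32(str(routing).encode("utf-8")) % num_primary_shards


# Hypothetical document ids, routed first with 5 and then with 6 primary shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
before = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}
after = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard count is the modulus, changing it sends most existing document ids to a different shard number, so a lookup by id would look in the wrong place.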

6. Instances and Nodes

“A node is a running instance of Elasticsearch which belongs to a cluster.”

A node is a running ES instance that belongs to a cluster. Typically we deploy one node per server, but we can also start multiple nodes on a single server to test a cluster. When we start a node and want it to join an existing cluster, we configure the cluster name and the IP + port used for communication in the configuration file. ES then discovers the cluster automatically via unicast and tries to join it.

Nodes are classified into the following types:

  • Master-eligible node: Can be elected as the master, which manages and configures the cluster, for example by adding or removing nodes.
  • Data node: Stores the documents and performs data-related operations such as CRUD, search, and aggregations.
  • Coordinating node: Handles request routing, merging of query result sets, and smart load balancing.
  • Ingest node: Preprocesses documents before indexing.
  • Machine learning node: Primarily used for machine learning tasks, but requires a Basic License.

For details about ES nodes, refer to the official documentation.

7. Cluster

  • In ES, a cluster is made up of one or more ES nodes. Each cluster has a unique name that nodes use to join it.
  • Each cluster has one master node. If the master node fails, another master-eligible node is elected to replace it.

ES also supports cross-cluster replication, cross-cluster search, and other features. For details, refer to Cross-cluster replication and Search across clusters.


In the two sections above, we have looked at the data processing power of ES and some of its basic concepts. The principle of the inverted index used by ES has already been analyzed extensively online, so I will not repeat the details here. For the specifics, I recommend the article Secrets of Time Series Databases (2): Indexes.

References

  • Elasticsearch Reference [7.8] » Mapping

  • Elasticsearch Reference [7.8] » Index modules

  • Elasticsearch Reference [7.8] » Set up Elasticsearch » Configuring Elasticsearch » Nodes

  • Elasticsearch Reference [7.8] » Glossary of terms

  • Elasticsearch: What is it? Why do you need it?