Preface

This article covers the basic theory behind ElasticSearch. I am a hands-on person and not fond of theory, since the official documentation already covers it in great detail. But after using ElasticSearch for a while, I found that some points can only be understood with a certain amount of theoretical background. I wrote this article to make those points easier for beginners, and I hope readers find it helpful.

ElasticSearch theory

What is ElasticSearch

Elasticsearch is a JSON-based distributed search and analytics engine. It is accessed through RESTful web service interfaces and stores data as schema-less JSON (JavaScript Object Notation) documents. It is written in Java, which allows it to run on different platforms, and it lets users search very large amounts of data very quickly.

What can ElasticSearch do

  • A distributed real-time document store where every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Scalable to hundreds of servers, processing petabytes of structured or unstructured data

What is Lucene

Apache Lucene organizes all the information written to an index into an inverted index, a data structure that maps terms to the documents containing them. It works differently from a traditional relational database in that an inverted index is term-oriented rather than document-oriented. A Lucene index also stores other information, such as term vectors.

Each Lucene index is composed of multiple segments. A segment is written once but queried many times, and once created it is never modified. Segments are combined during segment merging, whose timing is determined by Lucene's internal mechanisms. After a merge there are fewer segments, but each one is larger. Merging is very I/O intensive, and it is also the point at which information that is no longer used gets cleaned up.

In Lucene, the process of turning data into an inverted index, and whole strings into searchable terms, is called analysis. Text analysis is performed by an analyzer, which consists of a tokenizer, token filters, and character mappers, each doing what its name suggests.
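To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is only an illustration of the term-to-documents mapping, not how Lucene is implemented: the `analyze` function is a toy stand-in for a real analyzer, and the sample documents are made up.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene stores an inverted index",
    3: "Kibana visualizes Elasticsearch data",
}

index = build_inverted_index(docs)
print(index["lucene"])         # IDs of documents containing "lucene"
print(index["elasticsearch"])  # IDs of documents containing "elasticsearch"
```

A search for a term is then a single dictionary lookup rather than a scan over every document, which is the property that makes full-text search fast.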

ELK architecture

ELK is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources at the same time, transforms it, and sends it to a "stash" such as Elasticsearch. Kibana lets users visualize the data in Elasticsearch with graphs and charts.

ElasticSearch terminology

Cluster

A cluster consists of one or more nodes that share the same cluster name. Each cluster has a single master node, which is elected automatically. If the current master node fails, another node is automatically elected as the new master.

Node

A node belongs to a cluster. Usually a server runs one node, but a server can run several nodes for testing purposes. At startup, a node discovers an existing cluster with the same cluster name and tries to join it. A node's roles are determined by settings in elasticsearch.yml. Master nodes and data nodes are essential; the other roles can be added as needed. To prevent split brain and to simplify later maintenance, it is recommended to separate the node roles.

elasticsearch.yml configuration:

  1. The combination of node.master: true and node.data: true indicates that this node is both a master node and stores data. If a node is elected as the true master node, it also stores data, which puts more pressure on the node. This is the default configuration for each node of ElasticSearch, which is fine in a test environment. It is not recommended to do this in practice, as this is equivalent to mixing the roles of the master node and the data node.

  2. The combination of node.master: false and node.data: true indicates that this node is not eligible to be the master node, and therefore does not participate in elections, but only stores data. This node is called the data node. You need to configure several such nodes in a cluster to store data and provide storage and query services.

  3. The combination of node.master: true and node.data: false indicates that the node does not store data, is qualified as a master node, can participate in elections, and may become a true master node. This node is called a master node.

  4. The combination of node.master: false and node.data: false indicates that this node neither serves as a master node nor stores data. Such a node acts as a client node, mainly used to load-balance large volumes of requests.

  5. node.ingest: true means the node runs preprocessing pipelines and is not responsible for data or cluster-related work. It preprocesses documents before they are indexed: it intercepts bulk and index requests, applies the transformations, and then passes the documents back to the bulk and index APIs. Users can define a pipeline that specifies a series of processors.
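The four master/data combinations above can be summarized in a small sketch. This assumes the node.master, node.data, and node.ingest settings described in this article; the preset names and the `render_yml` helper are made up for illustration and are not part of ElasticSearch.

```python
# Hypothetical preset names mapped to the setting combinations listed above.
ROLE_PRESETS = {
    "master+data": {"node.master": True, "node.data": True},   # default, fine for testing
    "data": {"node.master": False, "node.data": True},         # stores data only
    "master": {"node.master": True, "node.data": False},       # master-eligible only
    "client": {"node.master": False, "node.data": False},      # load balancing only
}


def render_yml(preset, ingest=False):
    """Return elasticsearch.yml lines for the given role preset."""
    settings = dict(ROLE_PRESETS[preset])
    settings["node.ingest"] = ingest
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


print(render_yml("data"))
print(render_yml("client", ingest=True))
```

Keeping the combinations in one table like this makes it easier to check that each server in a planned cluster has exactly the role you intended.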

The node roles above can be configured as needed. If you only have three servers with ordinary hardware, the master and data roles can share nodes in a test environment, that is, node.master: true and node.data: true. In a production environment it is best to separate the roles, especially master nodes and data nodes, even if the master node has to run on a lower-spec server. Whether to deploy client nodes depends on the workload, for example when there are a large number of queries and many of them are aggregations. Ingest nodes likewise depend on the situation and can be deployed if you use the ingest API. Cluster planning will be covered in a future article.

Index

An index is the logical place where Elasticsearch stores its data, and it can be divided into smaller parts. You can think of an index as a table in a relational database. However, the index structure is designed for fast and efficient full-text search, and in particular it does not store the original values. If you know MongoDB, you can think of an Elasticsearch index as a MongoDB collection. If you are familiar with CouchDB, you can think of an index as a CouchDB database. Elasticsearch can store an index on one machine or spread it across multiple servers. Each index has one or more shards, and each shard can have multiple replicas.

A comparison between a relational database and ElasticSearch:

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields

However, since mapping types are deprecated as of Elasticsearch 7.x, for everyday use I recommend treating an index as a table in a database. Where a type still has to be specified, set it to the same name as the index unless there is a reason not to.

While we are here, a word about creating the index structure. In a relational database you must create a table before you can add data, but in ElasticSearch you can insert data directly and the index structure is created automatically from the first document. This is not suitable in many cases, though. To create the structure ourselves, we need to understand index settings and mappings.

Setting

Settings manage the important attributes of an index, such as shards and replicas, and determine the final configuration of the index. For starters, these three parameters are enough:

  • number_of_shards: the number of primary shards. It cannot be changed after the index is created.
  • refresh_interval: how often ES refreshes its cache so that newly written data becomes searchable. If writes are frequent but queries do not need real-time results, a higher value improves performance. This value can be changed later.
  • number_of_replicas: the number of replicas of the index. A value of at least 1 is recommended.
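The three parameters above go into the "settings" section of the request body used when creating an index. Below is a minimal sketch that only builds that JSON body; the `index_settings` helper and the example values are made up for illustration, and sending the body to a cluster is left out.

```python
import json


def index_settings(shards, replicas, refresh_interval="1s"):
    """Build the settings body for an index-creation request."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "number_of_shards": shards,            # fixed once the index exists
            "number_of_replicas": replicas,        # can be changed later
            "refresh_interval": refresh_interval,  # can be changed later
        }
    }


# Example values only: 3 primary shards, 1 replica each, refresh every 30 seconds.
body = index_settings(shards=3, replicas=1, refresh_interval="30s")
print(json.dumps(body, indent=2))
```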

Mapping

A mapping can be understood as the table structure of a relational database: it specifies the type of each field. Common types include text, keyword, byte, short, integer, long, float, double, boolean, date, and so on. Both text and keyword hold strings. A text field is analyzed (split into terms) for full-text search, while a keyword field is kept as a single value and used for sorting or aggregation.
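As a rough illustration, a mapping for a blog-article index might look like the body built below. The field names are invented for this example, and the layout assumes the typeless "mappings" / "properties" format of Elasticsearch 7.x.

```python
import json

# Hypothetical fields for a blog-article index.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},       # analyzed for full-text search
            "author": {"type": "keyword"},   # exact value, for sorting/aggregation
            "views": {"type": "long"},
            "published": {"type": "date"},
        }
    }
}

properties = mapping_body["mappings"]["properties"]

# Fields that are not analyzed text, and so are natural choices for sorting.
sortable = [name for name, spec in properties.items() if spec["type"] != "text"]

print(json.dumps(mapping_body, indent=2))
print(sortable)
```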

Shard

A shard is a single Lucene instance. It is a low-level unit managed by Elasticsearch. An index is a logical namespace that points to primary and replica shards. In practice you only need to specify the number of shards and do not have to do much else: Elasticsearch automatically manages all shards in the cluster, and it can move shards from one node to another when a node fails or when new nodes are added.

  • Primary shard: Each document is stored in a single primary shard. When you index a document, it is first written to the primary shard and then copied to that shard's replicas. By default, an index has five primary shards (one since Elasticsearch 7.0). You can specify the number of primary shards in advance, but it cannot be changed once the index has been created.

  • Replica shard: Each primary shard has zero or more replicas. A replica is a copy of a primary shard and serves two purposes:

    1. High availability: when a primary shard fails, one of its replicas can be promoted to primary.
    2. Better performance: a query can be served by either the primary shard or a replica.

    By default, each primary shard has one replica, and the number of replicas can be changed dynamically later. A replica is never placed on the same node as its primary shard.

Shard settings are important! The number of primary shards cannot be changed once the index has been created, so plan it carefully in advance.
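One way to see why the primary shard count is fixed: a document's shard is chosen as hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The sketch below uses zlib.crc32 as a stand-in hash (ElasticSearch uses its own hash function), and the document IDs are made up.

```python
import zlib


def pick_shard(routing, num_primary_shards):
    """Choose a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in for the hash ElasticSearch really uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs, used as the routing value.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Count the documents that would land on a different shard
# if the index went from 5 to 6 primary shards.
moved = sum(1 for doc_id in doc_ids if pick_shard(doc_id, 5) != pick_shard(doc_id, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because changing the divisor changes the result for most documents, existing documents could no longer be found on the shard where they were stored, which is why the number has to be chosen up front.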

Document

The main entity stored in Elasticsearch is called a document. In relational database terms, a document is equivalent to a row in a table. Comparing Elasticsearch documents with MongoDB documents, both can have varying structures, but in Elasticsearch the same field must have the same type everywhere. This means that all documents containing a title field must use the same type for it, such as string.

A document is composed of fields. A field may appear more than once in the same document, in which case it is called a multivalued field. Each field has a type, such as text, number, or date. Field types can also be complex, where a field contains subdocuments or arrays. The field type matters in Elasticsearch because it tells the engine how operations such as analysis or sorting should be performed. Fortunately, the type can be detected automatically, but defining mappings explicitly is still recommended.

Unlike rows in a relational database, documents do not need a fixed structure. Each document can have different fields, and you do not have to decide on the fields during development. Of course, you can use mappings to enforce a document structure. From the client's perspective, a document is a JSON object.

Each document is stored in an index and has a document type and a unique identifier, which Elasticsearch can generate automatically. The identifier must be unique within its document type, which means that two documents of different types in the same index can have the same identifier.
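The rule that a field must keep the same type across documents can be illustrated with a deliberately simplified check. The `json_type` and `conflicting_fields` helpers and the sample documents are invented for this sketch. Real Elasticsearch applies its own mapping and coercion rules, so this is not how it validates documents.

```python
def json_type(value):
    """Return a coarse JSON type name; arrays take the type of their elements."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return json_type(value[0]) if value else "empty"
    if isinstance(value, dict):
        return "object"
    return "null"


def conflicting_fields(docs):
    """List the fields whose coarse type differs between documents."""
    seen = {}
    conflicts = set()
    for doc in docs:
        for name, value in doc.items():
            kind = json_type(value)
            if seen.setdefault(name, kind) != kind:
                conflicts.add(name)
    return sorted(conflicts)


# Hypothetical documents: "tags" is multivalued in one and single-valued in
# the other, which is fine; "author" is an object in one and a string in the other.
doc_a = {"title": "First post", "tags": ["search", "intro"], "author": {"name": "alice"}}
doc_b = {"title": "Second post", "tags": "search", "author": "bob"}

print(conflicting_fields([doc_a, doc_b]))
```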

  • Document type: In Elasticsearch, one index can store objects that serve many different purposes. For example, a blog application can store both articles and comments. Document types let us easily distinguish between different objects in a single index. Each document can have a different structure, but in a real deployment, separating documents by type helps a lot with data manipulation. One limitation to keep in mind is that different document types cannot define different types for the same property. For example, a field called title must have the same type across all document types in the same index.

  • Core data types:
    • String data types: text and keyword
    • Numeric data types: long, integer, short, byte, double, float, half_float, scaled_float
    • Date data type: date
    • Boolean data type: boolean
    • Binary data type: binary
    • Range data types: integer_range, float_range, long_range, double_range, date_range

  • Complex data types:
    • Object data type: object, used for a single JSON object
    • Nested data type: nested, used for arrays of JSON objects

  • Geographic data types:
    • Geo-point data type: geo_point, used for latitude/longitude points
    • Geo-shape data type: geo_shape, used for complex shapes such as polygons

  • Specialized data types:
    • IP data type: ip, used for IPv4 and IPv6 addresses
    • Completion data type: completion, provides auto-complete suggestions
    • Token count data type: token_count, counts the number of tokens in a string
    • mapper-murmur3: murmur3, computes a hash of the value and stores it in the index
    • mapper-annotated-text: annotated_text, indexes text containing special markup (usually used to identify named entities)
    • Percolator data type: percolator, accepts queries written in the Query DSL
    • Join data type: join, defines parent/child relationships for documents within the same index
    • Alias data type: alias, defines an alias for an existing field

  • Multi-field: It is often useful to index the same field in different ways for different purposes. For example, a string field can be mapped as text for full-text search and as keyword for sorting or aggregation. Alternatively, a text field can be indexed with the standard analyzer, the english analyzer, and the french analyzer. This is the purpose of multi-fields. Most data types support multi-fields through the fields parameter.

  • Mapping: As mentioned in the Lucene section, analysis is the process of preparing input text for indexing and searching. Each field in a document must be analyzed according to its type. For example, numeric fields and text fields extracted from a web page are analyzed differently: numbers in the former should not be sorted alphabetically, while the first step for the latter is to strip the HTML tags, which are just noise. Elasticsearch stores this information about fields in the mapping.

  • Routing: When a document is stored, it goes into a single primary shard, and that shard is chosen by a hash value. By default, the hash is computed from the document ID. If the document has a parent document, it is computed from the parent's ID instead. The routing value can also be set explicitly when the document is stored. You can ignore this attribute in the early stages of learning and use the default, then come back to it once you have some experience with ElasticSearch.

  • Alias: An alias is an additional name for one or more indexes that allows them to be queried under that name. One alias can point to multiple indexes, and one index can belong to multiple aliases. Note that an alias pointing to more than one index can be used for queries but not for writes, unless one of the indexes is designated as the write index.
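As an illustration of aliases, the sketch below builds the request body for the `_aliases` endpoint, which takes a list of "add" and "remove" actions. The `alias_actions` helper and the index names are made up, and only the JSON body is built here.

```python
import json


def alias_actions(alias, add=(), remove=()):
    """Build an _aliases request body that repoints an alias in one request."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove]
    actions += [{"add": {"index": name, "alias": alias}} for name in add]
    return {"actions": actions}


# Hypothetical monthly indexes: move the "logs" alias from January to February.
alias_body = alias_actions("logs", add=["logs-2020.02"], remove=["logs-2020.01"])
print(json.dumps(alias_body, indent=2))
```

Clients that always query the alias name do not need to change when the underlying index is swapped.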

Other

This article introduces the basic theory of ElasticSearch, but there is much more to it than that. To learn more, see the official ElasticSearch website.

In the future, I will probably write an article on cluster planning for ElasticSearch, explained through examples and covering machines, nodes, indexes, shards and replicas, and configuration.

Reference: www.elastic.co/guide/en/el…

Reference Book: The Definitive Guide to ElasticSearch

Writing original content is not easy. If you found this article useful, please give it a recommendation! Your support is my biggest motivation for writing.

Copyright notice:
cnblogs: www.cnblogs.com/xuwujing
CSDN: blog.csdn.net/qazwsxpcm
Juejin: juejin.cn/user/365003…
Personal blog: www.panchengming.com