Cabbage Java self study room covers core knowledge


1. Introduction to ElasticSearch

Elasticsearch is a Lucene-based search server. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, Elasticsearch is a popular enterprise-level search engine. Designed for cloud computing, it is stable, reliable, fast, and easy to install and use.

Elasticsearch is a distributed, scalable, real-time search and data analysis engine. It makes it easy to search, analyze, and explore large amounts of data, and its horizontal scalability makes data more valuable in production environments. Indexing and search work roughly as follows: first, users submit data to Elasticsearch, where an analyzer splits it into terms and stores the terms, together with their weights, in the index. When users search, Elasticsearch ranks and scores the matching documents by those weights and presents the results to the user.

Elasticsearch was developed with a data collection and log parsing engine called Logstash and an analysis and visualization platform called Kibana. The three products are designed as an integrated solution called the “Elastic Stack” (formerly the “ELK Stack”).

Elasticsearch can be used to search all kinds of documents. It provides scalable, near-real-time search and supports multi-tenancy. Elasticsearch is distributed, which means an index can be split into shards, each with zero or more replicas. Each node hosts one or more shards and acts as a coordinator, delegating operations to the correct shards; rebalancing and routing are done automatically. Related data is usually stored in the same index, which consists of one or more primary shards and zero or more replica shards. Once an index is created, the number of primary shards cannot be changed.

Elasticsearch uses Lucene and tries to expose all of its features through JSON and Java APIs. It supports faceting and percolation, which is useful for notifications when a new document matches a registered query. Another feature, called "gateways," deals with the long-term persistence of indexes; for example, after a server crash, an index can be recovered from the gateway. Elasticsearch supports real-time GET requests and is suitable as a NoSQL data store, but it lacks distributed transactions.

2. Basic concepts of ElasticSearch

Top down Elasticsearch architecture:

2.1. Cluster

Cluster: ES is a distributed search engine, usually deployed on multiple physical machines. Machines configured with the same Cluster Name discover each other and organize themselves into a cluster; the combination of multiple ES servers is collectively called an ES cluster.

Health status of the ES cluster:

  • Green – All primary shards and replica shards are allocated normally
  • Yellow – All primary shards are allocated normally, but some replica shards are not
  • Red – Some primary shards failed to allocate (for example, an index was created while the server's disk usage exceeded 85%)

2.2. Node

Node: an Elasticsearch host in the same cluster.

  1. A node is an instance of Elasticsearch, which is essentially a Java process;
  2. Each node has a name, which is specified in a configuration file or at startup;
  3. After a node starts, it is assigned a UID that is stored in the data directory.

Main Node types:

  • Data Node – Stores index data. Data nodes hold shards and perform data-related operations such as CRUD, search, and aggregations.
  • Client Node – Does not store indexes; forwards client requests to Data Nodes.
  • Master Node – Does not store indexes. Manages the cluster: maintains routing information, decides whether nodes are available, relocates shards when a node appears or disappears, and coordinates recovery when a node fails. (All master-eligible nodes elect one of themselves as the active Master.)

Other Node types:

  • Coordination Node – Responsible for receiving Client requests, distributing the requests to the appropriate nodes, and finally putting the results together

Each Node functions as a Coordination Node by default

  • Hot&Warm Node – Data nodes with different hardware configurations to implement the Hot&Warm architecture and reduce the cost of cluster deployment
  • Machine Learning Node – Runs Machine Learning jobs for exception detection
  • Ingest Node – A node for data preprocessing and transformation. It supports pipeline configuration and can be used to filter and transform data, similar to the filter function in Logstash.
  • Tribe Node – Connects to different Elasticsearch clusters and supports treating them as a single cluster (superseded by Cross Cluster Search starting in 5.3).

2.3. Primary Shard

Primary shard: a physical subset of an index (described below). The same index can be physically split into multiple shards and distributed across different nodes. A shard is implemented as a Lucene index: the data is partitioned across shards, which in turn are placed on the nodes of the cluster, and each shard is a separate Lucene instance.

Note: the number of primary shards of an index in ES is specified at index creation and cannot be changed afterwards. So when you start building an index, estimate the data size and set the number of shards to a reasonable value.
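The reason the shard count is fixed can be sketched with the default routing rule, shard = hash(_routing) % number_of_primary_shards. Elasticsearch actually hashes the routing value (the document `_id` by default) with murmur3; in this illustration `crc32` stands in for it:

```python
import zlib

def route(doc_id: str, number_of_primary_shards: int) -> int:
    # crc32 stands in for ES's murmur3 hash of the routing value (_id by default)
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

# Every document id maps deterministically to one primary shard.
placement_3 = {d: route(d, 3) for d in ["doc-1", "doc-2", "doc-3"]}
placement_5 = {d: route(d, 5) for d in ["doc-1", "doc-2", "doc-3"]}
# If the shard count changed from 3 to 5, the modulo result would change
# for most documents, invalidating where they were already stored --
# hence the number of primary shards is fixed at index creation.
```

Because the placement of every already-indexed document depends on the modulus, changing it would require re-routing (re-indexing) all existing data.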

If the number of shards is too small:

  1. Adding nodes cannot achieve horizontal scaling
  2. A single shard holds too much data, making data redistribution time-consuming

If the number of shards is too large:

  1. It affects the relevance scoring of search results and the accuracy of statistics
  2. Too many shards on a single node waste resources and hurt performance

2.4. Replica Shard

Replica Shard: each primary shard can have one or more replicas; the number of replicas is set by the user. ES tries to distribute different shards of the same index to different nodes to improve fault tolerance: an index can keep working as long as not all machines holding its shards are down.

  • Replica shards improve data redundancy
  • When a primary shard fails, a replica shard can be automatically promoted to primary
  • Replica shards also reduce the query pressure on the primary shard (queries consume significant system resources)

2.5. Index

Index: a logical concept, a collection of retrievable document objects, similar to a database in a DBMS. Multiple indexes can be created in the same cluster. For example, a common approach in production is to create one index per month of data, to keep the size of each individual index manageable.

An index is a container for documents, a combination of a class of documents:

  • Index represents the concept of logical space: each Index has its own Mapping definition, which defines the field name and field type of the contained document
  • The Shard embodies the concept of physical space: the data in the index is scattered across the Shard

An Elasticsearch cluster can contain multiple indexes; each index can contain multiple types, each type multiple documents, and each document multiple fields. In this sense Elasticsearch is also a kind of NoSQL store.

Compared with a traditional relational database, ES concepts map roughly as follows:

Database -> Tables -> Rows -> Columns
Index -> Types -> Documents -> Fields

2.6. Type

Type: The next-level concept of an index, roughly equivalent to a table in a database. An index can contain multiple types.

  • Prior to 7.0, multiple Types could be set for an index
  • Type has been deprecated since 6.0; from 7.0, an index can have only one Type, "_doc"

2.7. Documents

Document: a document, in the search-engine sense, is the basic retrievable unit in ES, equivalent to a row or record in a database.

  1. Elasticsearch is document-oriented, and a document is the smallest unit of all searchable data
  2. The document is serialized to JSON format and remains in Elasticsearch
  3. Each document has a UniqueID

Document metadata (metadata that describes the document):

  • _index – The name of the index to which the document belongs
  • _type – Name of the type to which the document belongs
  • _source – Raw Json data for the document
  • _version – Indicates the version of a document
  • _score – Relevance score

2.8. Fields

Field: corresponds to a column in a database. In ES, every document is actually stored as JSON. A document can be viewed as a collection of fields. An article, for example, might contain information about the subject, abstract, body, author, date, and so on, each of which is a field, which is finally consolidated into a JSON string and dropped to disk.

2.9. Mapping

Mapping plays the role of a database schema, constraining field types; in Elasticsearch, a Mapping can be specified explicitly or created automatically from the document data.

Index Mapping and Settings:

  • Mapping defines the types of document fields
  • Setting defines different data distributions

2.10. The REST API

Elasticsearch provides a comprehensive and powerful REST API for interacting with the cluster:

  1. Check cluster, node, and index health, status, and statistics
  2. Manage your cluster, node, and index data and metadata
  3. Perform CRUD (Create, read, update, and delete) and search operations on indexes
  4. Perform advanced search operations, such as paging, sorting, filtering, scripting, aggregation, etc
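As a rough sketch, the operations above map onto URL patterns like the following; the host, index name, and document id here are illustrative assumptions:

```python
BASE = "http://localhost:9200"  # assumed local ES instance

def health_url() -> str:
    # 1. cluster health and status
    return f"{BASE}/_cluster/health"

def doc_url(index: str, doc_id: str) -> str:
    # 3. CRUD on a single document: GET/PUT/DELETE <index>/_doc/<id>
    return f"{BASE}/{index}/_doc/{doc_id}"

def search_url(index: str) -> str:
    # 4. search with paging, sorting, filtering, aggregations: <index>/_search
    return f"{BASE}/{index}/_search"

print(doc_url("articles", "1"))  # → http://localhost:9200/articles/_doc/1
```

Any HTTP client (curl, Kibana Dev Tools, or a language client) can issue requests against these endpoints.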

3. Elasticsearch inverted index

3.1. Forward index

Forward index: to answer a query, traverse every document, comparing each of its terms with the keyword and counting a hit whenever they match. This approach requires scanning every document, so its performance is poor.

3.2. Inverted index

Inverted index: look up the keyword in the term list produced by analyzing the documents; when the keyword matches a term, all documents containing that term can be found directly through it, which gives much better performance.

An inverted index can greatly improve retrieval speed. Here is a practical example to illustrate what an inverted index is and why it is faster than a traditional database scan. Suppose we have the following data:

Document A:

I love study.

Document B:

And study make me happy.

Now enter the keyword we want to retrieve: “Study happy”.

As you can imagine, in a traditional database, you would use an SQL statement that looks like this to find a matching record in a database table:

SELECT column_name(s)
FROM table_name
WHERE (column_name LIKE 'study' and column_name LIKE 'happy');

If the table has 10 million records, this SQL scans all 10 million in turn, and the time consumed grows with the number of records. This approach is inadequate when dealing with massive amounts of data.

So can we change the way of retrieval?

Since we want to find specified words (keywords) in documents (records), what we are ultimately asking is whether a word exists in a document, and in which documents. So we can build an index table keyed directly by the words themselves, and find all documents containing a keyword by looking it up directly: this is the inverted index.

In the traditional database storage model of the example above, each record simply stores the full text, and every query scans it.

How to create an inverted index: from the two texts in documents A and B, "I love study." and "And study make me happy.", an index like the following can be built (each term maps to the documents that contain it):

  • study – documents A, B
  • happy – document B
  • i, love – document A
  • and, make, me – document B

When we search for the keywords "study happy", we can directly see that study appears in both document A and document B, while happy appears only in document B. Compared with a traditional database scan, this method is more direct and efficient, especially with large amounts of data.

ElasticSearch can also see that the keywords we entered hit once in document A and twice in document B, so document B is more relevant than document A.

This is one of the great things about ElasticSearch, not only is it able to quickly retrieve content, but it can also sort documents by relevance and display them in search results.
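The whole example can be sketched in a few lines. The analysis step here is just lowercasing and splitting on non-letters, a stand-in for a real analyzer, and hit-count stands in for real relevance scoring:

```python
import re
from collections import defaultdict

docs = {
    "A": "I love study.",
    "B": "And study make me happy.",
}

index = defaultdict(list)            # term -> postings list of doc ids
for doc_id, text in docs.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        index[term].append(doc_id)

def search(query: str):
    # score each document by how many query terms it contains,
    # then sort by descending score
    scores = defaultdict(int)
    for term in re.findall(r"[a-z]+", query.lower()):
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: -scores[d])

print(search("study happy"))         # → ['B', 'A']
```

Document B matches both terms while A matches only "study", so B ranks first, exactly as described above.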

3.3. Lucene dictionary tree

Lucene's inverted index adds a leftmost layer, the "term index" (a dictionary tree), which stores not all words but only word prefixes. Through this tree, Lucene finds the block where the word resides, i.e. its approximate position, then binary-searches within the block to find the exact term, and finally retrieves the list of documents for that term.

Here’s an example:

Consider a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". The tree does not contain all terms; it contains prefixes of terms. The term index lets you quickly locate an offset into the Term Dictionary, from which you then search forward. In addition, with compression techniques (see Lucene's Finite State Transducers), the term index can be just a few dozenths of the size of all terms, making it feasible to cache the whole term index in memory.

The actual implementation in Lucene is more complex: different fields use different dictionary index structures, such as FST models and BKD trees, and the actual posting lists are not simple linked lists but structures like SkipList and BitSet.
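The "term index → term dictionary" lookup described above can be sketched as follows. The block size and the plain sorted list used as the term index are simplifications of Lucene's prefix FST:

```python
import bisect

# the term dictionary: a sorted list of terms, split into fixed-size blocks
terms = sorted(["a", "to", "tea", "ted", "ten", "i", "in", "inn"])
BLOCK = 3
block_starts = terms[::BLOCK]        # small in-memory term index: first term of each block

def lookup(term: str) -> bool:
    # 1. term index: locate the block that could contain the term
    b = bisect.bisect_right(block_starts, term) - 1
    if b < 0:
        return False
    block = terms[b * BLOCK:(b + 1) * BLOCK]
    # 2. binary search inside the block
    j = bisect.bisect_left(block, term)
    return j < len(block) and block[j] == term

print(lookup("ted"), lookup("tex"))  # → True False
```

Only `block_starts` needs to stay in memory; the blocks themselves can live on disk, which is the point of the two-level structure.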

4. Principles of Elasticsearch word segmentation

When a document is indexed, an inverted index is created for each field (unless the field is marked as not indexed in the mapping).

Building the inverted index means splitting documents into terms with an Analyzer; each term then points to the collection of documents containing it. At query time, Elasticsearch decides, based on the query type, whether to analyze the query string, then matches the resulting terms against the inverted index to find the corresponding documents.

Such as:

"I am a Chinese" can be split into four terms: "i", "am", "a", and "chinese"

An Analyzer is a component that splits a piece of text into words according to certain rules and normalizes those words. An Elasticsearch analyzer is composed of CharFilters, a Tokenizer, and TokenFilters.

Analyzer = CharFilters (0 or more) + Tokenizer (exactly one) + TokenFilters (0 or more)

  • CharFilters: transform the raw characters using character filters
  • Tokenizer: split the text into individual tokens
  • TokenFilters: transform each token using token filters

Processing order: CharFilters -> Tokenizer -> TokenFilters

If only one Analyzer is set in the Mapping, it is used both for indexing documents and for search queries. Different Analyzers can also be used for indexing and for query analysis.

A special case is that some queries are analyzed and some are not. For example, a Match Query is analyzed with the Search Analyzer and then matched against the inverted index of the corresponding field, while a Term Query does not analyze the query content and matches it directly against the inverted index.
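The composition rule above can be sketched directly. The concrete html-strip, whitespace, and lowercase components below are illustrative stand-ins for real CharFilter/Tokenizer/TokenFilter implementations:

```python
import re

def html_strip_char_filter(text: str) -> str:       # a CharFilter
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text: str) -> list:        # the Tokenizer (exactly one)
    return text.split()

def lowercase_token_filter(tokens: list) -> list:   # a TokenFilter
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    # CharFilters (0+) -> Tokenizer (1) -> TokenFilters (0+)
    for cf in char_filters:
        text = cf(text)
    tokens = tokenizer(text)
    for tf in token_filters:
        tokens = tf(tokens)
    return tokens

result = analyze("<b>I am</b> a Chinese",
                 [html_strip_char_filter],
                 whitespace_tokenizer,
                 [lowercase_token_filter])
print(result)  # → ['i', 'am', 'a', 'chinese']
```

Swapping any stage (say, a stemming TokenFilter) changes the produced terms without touching the rest of the chain, which is exactly the flexibility the Analyzer composition provides.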

4.1. ElasticSearch built-in analyzers

  • Standard Analyzer: splits on non-alphanumeric characters and lowercases
Test text: a*B!c d4e 5f 7-h    Result: a, b, c, d4e, 5f, 7, h
  • Simple Analyzer: splits on non-alphabetic characters and lowercases
Test text: a*B!c d4e 5f 7-h    Result: a, b, c, d, e, f, h
  • Whitespace Analyzer: splits on whitespace characters, keeping case
Test text: a*B!c d4e 5f 7-h    Result: a*B!c, d4e, 5f, 7-h
  • Stop Analyzer: splits on non-alphabetic characters, lowercases, and removes stop words (English by default: the, a, an, this, of, at, etc.)
Test text: The apple is red    Result: apple, red
  • Language Analyzer: segments using the grammar of the specified language (English by default); there is no built-in Chinese analyzer

  • Pattern Analyzer: segments using a specified regular expression; the default is \W+, i.e. one or more non-word characters
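The first three behaviours can be approximated with simple regular expressions on a test text like `a*B!c d4e 5f 7-h` (approximations only, not the real Lucene implementations):

```python
import re

text = "a*B!c d4e 5f 7-h"

standard   = re.findall(r"[a-z0-9]+", text.lower())  # letters and digits, lowercased
simple     = re.findall(r"[a-z]+", text.lower())     # letters only, lowercased
whitespace = text.split()                            # whitespace only, case kept

print(standard)    # → ['a', 'b', 'c', 'd4e', '5f', '7', 'h']
print(simple)      # → ['a', 'b', 'c', 'd', 'e', 'f', 'h']
print(whitespace)  # → ['a*B!c', 'd4e', '5f', '7-h']
```

Note how the Simple analyzer breaks "d4e" apart at the digit while Standard keeps it as one token, and Whitespace preserves both case and punctuation.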

4.2. ElasticSearch Chinese word segmentation

  • SmartCN: a simple tokenizer for Chinese or mixed Chinese-English text
  • IK Analyzer: a smarter and friendlier Chinese tokenizer

A rough comparison of the two:

Tokenizer    Advantage                                              Disadvantage
SmartCN      Official plug-in                                       Poor Chinese segmentation quality
IK           Easy to use; supports custom and remote dictionaries   Dictionary must be maintained manually; no part-of-speech recognition

5. Data types of Elasticsearch

5.1. String type

String: used in earlier versions of ElasticSearch. Since ElasticSearch 5.x, string is no longer supported; it was replaced by text and keyword.

  • Text: use text when a field is to be searched in full text, such as an email body or product description. A text field's content is analyzed: the string is split into terms before the inverted index is generated. Fields of type text are not used for sorting and are rarely used for aggregations.
  • Keyword: the keyword type applies to structured fields such as email addresses, host names, status codes, and tags, and to fields that need to be filtered (for example, finding published posts by a status attribute), sorted, or aggregated. Fields of type keyword can only be matched by exact value.
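A minimal mapping using both types might look like this; the index and field names are made-up examples:

```python
import json

# "description" is analyzed full text; "status" is an exact-value keyword
mapping = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},
            "status": {"type": "keyword"},
        }
    }
}

body = json.dumps(mapping)  # would be sent as the body of PUT /<index>
```

With this mapping, a match query on `description` is analyzed, while filters, sorts, and aggregations on `status` compare the stored value exactly.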

5.2. Integer type

type Value range
byte The signed 8-bit integer ranges from -128 to 127.
short The signed 16-bit integer ranges from -32768 to 32767.
integer The signed 32-bit integer ranges from −2^31 to 2^31-1.
long The signed 64-bit integer ranges from −2^63 to 2^63-1.

Choose the narrowest data type that meets your requirements. For example, if the maximum value of a field will not exceed 100, choose byte. The oldest verified human age on record is 122 years, so for an age field, short is more than sufficient. The shorter the field, the more efficient indexing and searching are.

5.3. Floating point types

type Value range
float 32-bit single-precision floating point number
double 64-bit double-precision floating point number
half_float 16-bit half-precision IEEE 754 floating point number
scaled_float A floating point number backed by a long and a fixed scaling factor; e.g. a price of 57.34 with a scaling factor of 100 is stored as 5734

For float, half_float, and scaled_float, -0.0 and +0.0 are distinct values: a term query for -0.0 will not match +0.0. Similarly, a range query with an upper bound of -0.0 will not match +0.0, and one with a lower bound of +0.0 will not match -0.0. For values that only need fixed precision, such as prices accurate to the cent, prefer scaled_float with an appropriate scaling factor: 57.34 with a scaling factor of 100 is stored as 5734.
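The scaled_float storage rule described above is simple enough to sketch:

```python
def scaled_store(value: float, scaling_factor: int) -> int:
    # the float is multiplied by the factor and rounded to the nearest long
    return round(value * scaling_factor)

def scaled_load(stored: int, scaling_factor: int) -> float:
    return stored / scaling_factor

print(scaled_store(57.34, 100))  # → 5734, the price example from the table above
print(scaled_load(5734, 100))    # → 57.34
```

Storing a long instead of a float is what lets scaled_float index and compress as efficiently as an integer field.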

5.4. The date type

Date type representations can be in one of the following formats:

  • A string in a date format, such as "2018-01-13" or "2018-01-13 12:10:30"
  • A long integer: milliseconds since the epoch (milliseconds-since-the-epoch)
  • An integer: seconds since the epoch (seconds-since-the-epoch)

5.5. The Boolean type

A boolean field accepts strings or numbers representing true or false:

  • True values: true, "true", "on", "yes", "1", …
  • False values: false, "false", "off", "no", "0", "" (empty string), 0, 0.0

5.6. The binary type

Binary fields hold binary data stored in the index as a base64-encoded string and can be used to store data such as images. By default a binary field is stored but not indexed, so it cannot be searched. Two settings are available:

  • doc_values: whether the field should be stored on disk so it can be used later for sorting, aggregations, or script queries. Accepts true and false (default);
  • store: whether the field's value should be stored and retrievable separately from _source. Accepts true and false (default).

5.7. An array type

There is no dedicated array type in ES; use [] to hold multiple values. All values in an array must be of the same data type; mixed-type arrays are not supported:

  • Array of strings: ["one", "two"];
  • Array of integers: [1, 2];
  • Array of arrays: [1, [2, 3]], which is equivalent to [1, 2, 3];
  • Array of objects: [{"name": "Tom", "age": 20}, {"name": "Jerry", "age": 18}].

5.8. The object type

JSON documents are hierarchical: documents can contain internal objects, and internal objects can contain internal objects.

5.9. The IP type

An ip field is used to store an IPv4 or IPv6 address and is essentially stored as a numeric value.

5.10. Geo type

Geographic point types are used to store latitude and longitude pairs of geographic locations and can be used to:

  • Find geographic points within a certain range;
  • Aggregate documents by geographic location or distance relative to a central point;
  • Integrate distance into document relevance scores;
  • Sort documents by distance.

There are some other types that are less used, such as nested, geo_shape, token_count, etc., which I will not cover here.
