This is day 14 of my participation in the August More Text Challenge. For details, see: August More Text Challenge

🌈 past review

Thank you for reading; I hope this helps. If you find flaws in this post, please leave a comment or send me a private message through my profile page. Thank you for your advice. I’m XiaoLin, a boy who can write bugs and rap.

Today is a special day, and I wish you all well: may every Jack have his Jill, and may everyone without a valentine get a 20K pay rise.

  • 🌹 The most comprehensive summary of the Linux commands every back-end developer needs (super comprehensive, super detailed): bookmarking this one is enough! 🌹
  • 🌈 Is MySQL really just CRUD? ✨ See the difference between 2K and 12K (part 2)
  • 🌈 Is MySQL really just CRUD? ✨ See the difference between 2K and 12K (part 1)

Introduction to ElasticSearch

1.1 What is full-text search

In full-text search, an indexing program scans every word in a text and builds an index entry for each word, recording how many times and where the word occurs in the text. When the user runs a query, the search program looks it up in that index, much like looking up a word through the index table of a dictionary.

Retrieval therefore has two steps: build the index, then search it.

Full-text retrieval takes text as its object and finds the texts that contain the specified terms. Completeness, accuracy, and speed are the key metrics for a full-text retrieval system. Features of full-text search:

  1. Only text is processed.
  2. Semantics are not dealt with.
  3. Search in English is case insensitive.
  4. The result list is sorted by relevance.

1.2 What is ElasticSearch

ElasticSearch (ES) is an open source search engine built on Apache Lucene and a popular enterprise search engine. Lucene itself is considered the best-performing open source search engine toolkit to date, but its API is relatively complex and requires a deep understanding of search theory, which makes it hard to integrate into real applications. ES is written in Java and provides an easy-to-use RESTful API, so developers can build search features through that API and avoid the complexity of Lucene.

1.3 Birth of ElasticSearch

Many years ago, Shay Banon, an unemployed and recently married developer, went with his wife to London, where she was studying cooking. While looking for a job, he started working with an early version of Lucene in order to build a recipe search engine for his wife.

Working directly with Lucene can be difficult, so Shay started abstracting the Lucene code so that Java programmers could add search capabilities to their applications. He released his first open source project, called “Compass.”

Shay then took a job in a high-performance, distributed environment with in-memory data grids, where the need for a high-performance, real-time, distributed search engine was obvious. He decided to rewrite the Compass library as a standalone service called Elasticsearch.

The first public version of Elasticsearch appeared in February 2010, and since then it has become one of the most popular projects on GitHub, with over 300 code contributors. A company has been formed around Elasticsearch to provide commercial support and develop new features, but Elasticsearch will always be open source and available to everyone. Shay’s wife is still waiting for her recipe search…

1.4 Application Scenarios of ElasticSearch

ES mainly uses lightweight JSON as its data storage format, which is similar to MongoDB. It also supports geolocation queries and makes it convenient to mix location and text in one query. It is also a leader in statistics, log storage and analysis, and visualization. Application scenarios in China and abroad include:

  1. Overseas: Wikipedia uses ES to provide full-text search with highlighted keywords; StackOverflow combines full-text search with location queries; GitHub uses Elasticsearch to search 130 billion lines of code.
  2. Domestic: Baidu (ES is used in cloud analysis, network alliance, prediction, library, wallet, risk control, and other businesses; a single cluster imports 30TB+ of data per day, 60TB+ per day in total), Sina, Alibaba, Tencent, and other companies all use ES.

Install ElasticSearch

2.1 Environment preparation

  1. CentOS 7
  2. JDK 1.8 or above
  3. ElasticSearch 6.8.0

2.2 Download ElasticSearch

You can download ElasticSearch from the official website:

wget http://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.8.0.tar.gz

2.3 Install JDK

2.3.1 Download JDK

# The default install location is under /usr/java/, e.g. /usr/java/jdk1.8.0_171-amd64/
# (the directory name follows the version of the RPM you install)
rpm -ivh jdk-8u181-linux-x64.rpm

2.3.2 Configuring environment Variables

vim /etc/profile

At the end of the configuration file add:

export JAVA_HOME=/usr/java/jdk1.8.0_171-amd64
export PATH=$PATH:$JAVA_HOME/bin

2.3.3 Reloading the system configuration

source /etc/profile

2.4 Install ElasticSearch (Linux)

2.4.1 Add a new user and grant permission

# Create a new group named es
groupadd es

# Create the new user xiaolin and add it to the es group
useradd xiaolin -g es

# Set the password for user xiaolin
passwd xiaolin

# Make xiaolin the owner of /usr and everything under it
chown -R xiaolin /usr

2.4.2 Decompression

tar -zxvf elasticsearch-6.8.0.tar.gz

2.4.3 Understanding the directory structure

  • bin: executable binary files
  • config: configuration files
  • lib: libraries the runtime depends on
  • logs: runtime log files
  • modules: modules the runtime depends on
  • plugins: official and third-party plugins are installed here

2.4.4 Starting services

Go to the bin directory and start the ES service

./elasticsearch

2.4.5 Testing

By default, the web (HTTP) service port is 9200 and the native Java port (TCP) is 9300, and the service can be accessed by any user.

# curl acts as a stand-in for a browser here and checks whether ES was installed successfully.
# Remote connections are not allowed by default.
curl http://localhost:9200

2.4.6 Enabling the Remote Connection

Note: The ES service is protected by default and only allows local clients to connect. Remote connections must be enabled if you want to access the ES service from a remote client

We just need to

Basic concepts of ElasticSearch

3.1 Near Real Time (NRT)

Elasticsearch is a near-real-time search platform. This means that there is a slight delay (usually 1 second) between indexing a document and the document becoming searchable.

3.2 Index

Operation process of ElasticSearch:

  1. When ElasticSearch handles an add operation, it first adds the data to the index and then segments the text fields into words according to the specified word segmentation rules.
  2. Segmenting a field yields a list of root words (terms), which ElasticSearch stores in an inverted index table that associates each term with the documents containing it.
  3. When a user enters query keywords during full-text search, ElasticSearch segments the keywords in the same way and matches the resulting terms against the inverted index. If a term matches an entry in the inverted index, the documents whose IDs are associated with that entry satisfy the search criteria.
  4. ElasticSearch collects the documents that satisfy the search criteria, scores and sorts them, and returns them.
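The four steps above can be sketched in a few lines of Python. This is only a toy model, not how ES is implemented: the `analyze` function, the `TinyIndex` class, and the sample sentences are all made up for illustration, and the "segmentation" is just lowercasing plus stripping a trailing "s".

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, split on non-letters, strip a trailing 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


class TinyIndex:
    def __init__(self):
        self.docs = {}                    # doc id -> original text
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        # Step 1: store the document, then segment its text.
        self.docs[doc_id] = text
        # Step 2: record every resulting term in the inverted index.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Step 3: segment the query the same way and look each term up.
        scores = defaultdict(int)
        for term in set(analyze(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1
        # Step 4: rank by number of matched terms (ties broken by id).
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [(d, scores[d]) for d in ranked]


index = TinyIndex()
index.add(1, "Elasticsearch is a distributed search engine")
index.add(2, "Lucene is a search library")
index.add(3, "Recipes for cooking")
print(index.search("search engines"))
```

Because the query goes through the same analyzer as the documents, "engines" in the query still finds "engine" in document 1.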

An index is a collection of documents with somewhat similar characteristics. For example, you can have an index for customer data, another index for catalog data, and an index for order data. An index is identified by a name (which must be all lowercase) and is used when indexing, searching, updating, and deleting documents in the index. Indexes are similar to the concept of Database in relational databases. In a cluster, you can define as many indexes as you want.

3.3 Type

In an index, you can define one or more types. A type is a logical classification/partition of your index, the semantics of which are entirely up to you. Typically, a type is defined for documents that have a common set of fields. For example, let’s say you run a blogging platform and store all your data in one index. In this index, you can define one type for user data, another type for blog data, and another type for comment data. Types are similar to the concept of tables in relational databases. Different ES versions have different rules for types:

Version | Type support
5.x | Multiple types per index are supported
6.x | An index can have only one type
7.x | Custom types are no longer supported by default (the default type is _doc)

3.4 Mapping

Mapping is an important part of ES. It is similar to the schema of a table in a traditional relational database and defines the data structure of a type in an index. In ES, the type (equivalent to a table) and its mapping (equivalent to a schema) can be created manually or by default: by default, ES automatically creates the type and its mapping based on the inserted data. A mapping includes each field’s name, its data type, and how the field is indexed.
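To make this concrete, here is a rough sketch of what a manually created mapping could look like as a request body, built as a Python dict. The index name `blog` and all field names are hypothetical; the nesting under the `_doc` type follows the ES 6.x layout used in this article and would look different in 7.x.

```python
import json

# Hypothetical index "blog" with the single type "_doc" (ES 6.x layout).
mapping_body = {
    "mappings": {
        "_doc": {
            "properties": {
                "title": {"type": "text"},      # analyzed for full-text search
                "author": {"type": "keyword"},  # indexed as one exact term
                "published": {"type": "date"},
                # "index": False keeps the value but makes it non-searchable.
                "views": {"type": "integer", "index": False},
            }
        }
    }
}

# The body would be sent when creating the index.
request_line = "PUT /blog"
print(request_line)
print(json.dumps(mapping_body, indent=2))
```

Each entry under `properties` carries the three pieces named above: the field name, its data type, and (optionally) how it is indexed.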

3.5 Document

A document is a basic unit of information that can be indexed, like a record in a table. For example, you can have a document for an employee or a document for an item. Documents are represented in JSON (JavaScript Object Notation), a lightweight data interchange format.
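As a small illustration, an employee document might look like the following. The field names, values, and the index name `company` are invented for this sketch; only the `/<index>/<type>/<id>` addressing reflects the ES 6.x convention.

```python
import json

# Hypothetical employee document; every field here is made up for illustration.
employee = {
    "name": "Jane Doe",
    "age": 30,
    "skills": ["Java", "MySQL", "Linux"],
    "address": {"city": "London"},
}

body = json.dumps(employee)

# In ES 6.x a document is addressed as /<index>/<type>/<id>.
request_line = "PUT /company/_doc/1"
print(request_line)
print(body)
```

Nested objects and arrays are allowed, which is one way a document differs from a flat table row.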

3.6 Shard

An index can store an amount of data that exceeds the hardware limits of a single node. For example, an index with 1 billion documents may take up 1 TB of disk space, and no single node may have that much disk space, or a single node may be too slow to serve search requests on its own. To solve this problem, Elasticsearch lets you divide the index into multiple pieces, each of which is called a shard. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent “index” that can be placed on any node in the cluster.
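Specifying the shard count at creation time could look roughly like this. The helper `create_index_request` and the index name `orders` are hypothetical; `number_of_shards` is the index setting ES uses for the primary shard count.

```python
import json


def create_index_request(name, shards):
    """Build the request line and JSON body for creating an index (sketch)."""
    if name != name.lower():
        raise ValueError("index names must be all lowercase")
    if shards < 1:
        raise ValueError("an index needs at least one shard")
    body = {"settings": {"number_of_shards": shards}}
    return "PUT /" + name, json.dumps(body)


line, body = create_index_request("orders", 5)
print(line)
print(body)
```

The function only builds the request; sending it to a running cluster (for example with curl against port 9200) is left out.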

Sharding is important for two reasons:

  1. Allows you to split/expand your content horizontally.
  2. Allows you to do distributed, parallel operations on top of shards, improving performance/throughput.

How a shard is distributed, how its documents are aggregated, and how search requests are made are completely managed by Elasticsearch. It’s transparent to you as a user and you don’t have to worry about it.
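Although ES handles this for you, the basic idea of deciding which shard a document belongs to can be sketched as "hash the routing value, then take it modulo the number of primary shards". In the sketch below, `pick_shard` is a made-up helper and `zlib.crc32` is only a stand-in for the hash function ES uses internally, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def pick_shard(routing, num_primary_shards):
    """Map a routing value (by default the document id) to a shard number."""
    # zlib.crc32 is a stand-in hash, chosen only because it is deterministic.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


for doc_id in ["1", "2", "3", "42"]:
    print("document", doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the result depends on the shard count, the same id would land on a different shard if that count changed.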

3.6.1 Sharding principle

Traditional databases store a single value per field, but this is not sufficient for full-text retrieval. Every word in a text field needs to be searchable, which for a database means being able to index multiple values in a single field. The data structure that best supports multiple values per field is the inverted index.

Elasticsearch uses a structure called an inverted index, which is suitable for fast full-text searches.

As the name suggests, if there is an inverted index, there must also be a forward index. More precisely, the two are called the forward index and the inverted (reverse) index.

3.6.2 Forward index

In a forward index, the search engine assigns a file ID to every file to be searched and stores K-V pairs that map each file ID to the keywords in that file, together with the keyword counts. During a search, the entries are matched against the search keyword file by file.

However, the number of documents on the Internet covered by a search engine is astronomical, and such an index structure simply cannot rank results in real time. That is where the inverted index comes in!
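A forward index can be pictured as a map from file ID to the keywords (with counts) found in that file. The following is a minimal sketch with made-up sample texts; `search_forward` is a hypothetical helper that shows why lookups are slow: every entry has to be visited.

```python
from collections import Counter

# Hypothetical sample files, keyed by file id.
files = {
    1: "elasticsearch is a search engine",
    2: "lucene is a search library",
    3: "a recipe for soup",
}

# Forward index: file id -> keyword counts for that file.
forward_index = {fid: Counter(text.split()) for fid, text in files.items()}


def search_forward(keyword):
    """Scan every file entry and keep the ids whose keywords include the term."""
    return [fid for fid, counts in forward_index.items() if keyword in counts]


print(forward_index[1])
print(search_forward("search"))
```

The cost of `search_forward` grows with the number of files, no matter how rare the keyword is.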

3.6.3 Inverted index

Inverted indexes convert the mapping of the file ID to the keyword to the mapping of the keyword to the file ID. Each keyword corresponds to a series of files in which the keyword appears.

An inverted index consists of a list of all the distinct words that appear in any document, and for each word a list of the documents containing it. For example, suppose we have two documents, and each document’s content field contains the following:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first break the content field of each document into separate words (we call them “entries” or “tokens”), create a sorted list of all the non-repeated entries, and then list which document each entry appears in.

Now, if we want to search for quick and brown, we just need to find the documents in which each term appears.

Both documents match, but the first document matches better than the second. If we use a simple similarity algorithm that just counts the number of matched terms, then we can say that the first document is more relevant to our query than the second.
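Here is a small sketch of that example in Python, using the two sentences above and searching for the lowercase terms quick and brown. It assumes the simplest possible tokenization (split on whitespace, no lowercasing or stemming), so "Quick" and "quick" count as different entries; the `score` helper is made up and simply counts matched terms.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Build the inverted index: entry -> set of document ids containing it.
# Whitespace split only, so "Quick" and "quick" stay separate entries.
inverted = defaultdict(set)
for doc_id, content in docs.items():
    for token in content.split():
        inverted[token].add(doc_id)

# Print the sorted list of entries with the documents they appear in.
for term in sorted(inverted):
    print(f"{term:<8}{sorted(inverted[term])}")


def score(query):
    """Count, per document, how many query terms it contains."""
    hits = defaultdict(int)
    for term in query.split():
        for doc_id in inverted.get(term, ()):
            hits[doc_id] += 1
    return dict(hits)


print(score("quick brown"))
```

Document 1 contains both quick and brown, while document 2 contains only brown (it has "Quick" with a capital Q), which is why the first document ranks higher under this simple count.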

3.7 Replica

In a network/cloud environment, failures can happen at any time: a shard or node may go offline or disappear for any reason. Having a failover mechanism is therefore very useful and highly recommended. For this purpose, Elasticsearch allows you to create one or more copies of a shard. These copies are called replica shards.

Replica shards are important for two main reasons:

  1. Provides high availability in the event of shard/node failures. For this reason, it is important to note that a replica shard is never placed on the same node as its original/primary shard.
  2. Expand your search volume/throughput, as searches can run in parallel across all replicas.
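The rule that a copy never shares a node with its primary can be illustrated with a toy placement. This is not the allocation algorithm ES actually uses: `place_shards` is a made-up round-robin helper that assumes one replica per primary, and the node names are invented.

```python
def place_shards(num_shards, nodes):
    """Toy round-robin placement with one replica (R) per primary (P)."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate a replica from its primary")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        # The replica goes to the next node, so it never shares a node with its primary.
        replica_node = nodes[(shard + 1) % len(nodes)]
        layout[primary_node].append(f"P{shard}")
        layout[replica_node].append(f"R{shard}")
    return layout


def shards_still_available(layout, failed_node):
    """Shard numbers that still have at least one copy after a node is lost."""
    return {
        int(copy[1:])
        for node, copies in layout.items()
        if node != failed_node
        for copy in copies
    }


layout = place_shards(3, ["node-1", "node-2", "node-3"])
print(layout)
print(shards_still_available(layout, "node-1"))
```

With this layout, losing any single node still leaves one copy of every shard, and searches can be spread over both the primary and the replica copies.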