Introduction

Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene(TM). Elasticsearch is more than just Lucene, though. The sections below explain what makes Elasticsearch distributed, scalable, high-performance, and highly available.

What is search

When we want to find some information, we use a search engine to get it, for example to look up a game or a book we like. That is most people's first impression of search. Put bluntly, in any scenario, search means finding the information you want to know.

  • Search today also includes vertical search: specialized search engines for a particular industry, such as e-commerce websites, news websites, and various apps. They are a subdivision and extension of general search engines: they extract the data they need, process it, and return it to the user in some form.

What if you use a database for search

Consider a query such as SELECT * FROM products WHERE product_name LIKE '%clothes%' (assuming no other optimization is in place), or a similar match on another field. This approach has several disadvantages:

  • 1. The text in the searched field of each record can be very long. For example, a "product introduction" field may contain thousands or tens of thousands of characters, and a LIKE match has to scan that text in every record.

  • 2. This approach only finds records that contain the exact, contiguous string "clothes". In some records the characters of the keyword are not contiguous because other characters are inserted in between. Those records are not returned, even though they are products we want to find. This is where the downside of the approach becomes obvious.

In general, implementing search on top of a database is unreliable, and its performance is poor.
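The second drawback can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for whatever database is actually in use. The product names, the `like_search` helper, and the two-word query are hypothetical and chosen only for illustration.

```python
import sqlite3

def like_search(names, keyword):
    """Return product names matching LIKE '%keyword%' (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (product_name) VALUES (?)",
        [(name,) for name in names],
    )
    rows = conn.execute(
        "SELECT product_name FROM products WHERE product_name LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Made-up sample data: the second name contains both words, but not contiguously.
names = ["winter clothes for kids", "winter wool clothes", "summer hat"]
matches = like_search(names, "winter clothes")
print(matches)
```

Only the first product is returned. "winter wool clothes" is missed because the keyword does not appear as one contiguous string, which is the limitation described above.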

What is full-text search

First of all, what is an inverted index? Let’s start with a picture that has four records.

[Figure: inverted index built from four records, with terms such as "biochemical" and "movie"]

Full-text search splits the text into terms and stores them in an inverted index. The content entered by the user is then analyzed and matched against the inverted index. This whole process is full-text search.
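The process just described can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene implements it: the whitespace tokenizer, the function names, and the four sample records are all assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Very naive analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Analyze the query, then return IDs of documents matching any term."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return sorted(hits)

# Hypothetical sample records, not the ones from the original figure.
docs = {
    1: "biochemical movie",
    2: "biochemical poster",
    3: "movie news",
    4: "travel guide",
}
index = build_inverted_index(docs)
print(search(index, "biochemical movie"))
```

Unlike the LIKE query, the two query terms do not have to be adjacent: any record containing "biochemical" or "movie" is found by looking up each term in the index, without scanning the text of every record.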

What is Elasticsearch

Lucene is a jar package that implements the inverted index algorithm and other full-text search functions. So why would anyone want Elasticsearch when there is already Lucene?

First, when there is a large amount of data, such as 1 PB, it is not practical to keep it all on one machine, so the data has to be spread across multiple machines and the system becomes distributed. That raises several questions. When the front end requests data, which machine should it go to? If a machine goes down, the data on it becomes unavailable, so high availability cannot be guaranteed. And how is the data placed on each machine in the first place? With Lucene alone, all of this requires manual handling and maintenance. That is where Elasticsearch comes in: it solves these problems on top of Lucene.
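To make the "which machine holds the data" question concrete, here is a toy sketch of hash-based routing with one replica per shard. It is a simplification under stated assumptions: Elasticsearch uses its own hash function and allocation logic, while this sketch uses `zlib.crc32` and a "replica on the next node" rule purely as stand-ins. All function and node names are hypothetical.

```python
import zlib

def pick_shard(doc_id, num_shards):
    """Map a document ID to a shard number deterministically."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def shard_copies(shard, nodes):
    """Place the primary copy on one node and a replica on the next node."""
    primary = nodes[shard % len(nodes)]
    replica = nodes[(shard + 1) % len(nodes)]
    return primary, replica

def node_for_read(doc_id, num_shards, nodes, down=frozenset()):
    """Return a live node holding the document, or None if both copies are down."""
    shard = pick_shard(doc_id, num_shards)
    for node in shard_copies(shard, nodes):
        if node not in down:
            return node
    return None

nodes = ["node-1", "node-2", "node-3"]
primary, replica = shard_copies(pick_shard("product-42", 6), nodes)
# Simulate the primary's machine going down: the read falls back to the replica.
fallback = node_for_read("product-42", 6, nodes, down={primary})
print(primary, replica, fallback)
```

Because the shard is computed from the document ID, any node can work out where a document lives, and keeping a second copy on a different node means a single machine failure does not make the data unavailable.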

Some examples of its advantages

  1. High performance: data distribution across multiple nodes is maintained automatically for indexing, and search requests are distributed across multiple nodes for execution.
  2. High availability: redundant copies of the data are maintained automatically, so that data is not lost when some machines go down.
  3. More advanced features: Elasticsearch wraps higher-level functionality so that we can quickly develop applications, including more complex ones: complex search, aggregation and analytics, location-based search (for example, coffee shops within one kilometer), and so on.
  4. Dynamic capacity expansion: when the data volume grows rapidly, we only need to add machines. For example, if two machines store 1.2 TB of data, each machine stores 600 GB. Redistribution does not need to be done manually; a new machine simply joins the cluster and data is allocated to it automatically.
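The arithmetic behind dynamic expansion can be sketched as follows. This is a toy model, assuming the 1.2 TB is split into six equal shards of 200 GB and that shards are spread round-robin. The real allocator in Elasticsearch weighs more factors, and the function names here are hypothetical.

```python
def assign_shards(num_shards, nodes):
    """Spread shard numbers across nodes round-robin (toy allocator)."""
    allocation = {node: [] for node in nodes}
    for shard in range(num_shards):
        allocation[nodes[shard % len(nodes)]].append(shard)
    return allocation

def stored_gb(allocation, shard_size_gb):
    """Compute how many GB each node stores, given equal-sized shards."""
    return {node: len(shards) * shard_size_gb for node, shards in allocation.items()}

# Assumption: 1.2 TB of data split into 6 shards of 200 GB each.
two_nodes = stored_gb(assign_shards(6, ["node-1", "node-2"]), 200)
three_nodes = stored_gb(assign_shards(6, ["node-1", "node-2", "node-3"]), 200)
print(two_nodes)
print(three_nodes)
```

With two nodes each one holds 600 GB. After a third node joins and the shards are reassigned, each holds 400 GB, with no manual partitioning of the data.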