Tech Blog Test: Elasticsearch

E-commerce and search engine products that sit on top of large databases often face the problem that retrieving product information takes too long. This poor user experience may drive away potential customers. The lag arises because the product was designed around a relational database: the data is scattered across multiple tables, and joining and processing those tables to produce search results is far slower. Companies are therefore looking for alternative data stores that support fast retrieval, and Elasticsearch (ES) is a great way to solve these problems.


What is Elasticsearch?

Elasticsearch is a Lucene-based search engine. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface.

In other words, Elasticsearch is an open source, standalone database server developed in Java. It is used mainly for full-text search and analysis. It takes data from a variety of sources and stores it in a sophisticated format that is highly optimized for search. As mentioned above, Elasticsearch uses Apache Lucene as the heart of its search. Since Lucene is just a library, it can be difficult to use directly. But don't worry: Elasticsearch encapsulates all the search engine operations and exposes them through a corresponding RESTful API. Elasticsearch is a fast and efficient way to store, search, and analyze large amounts of data, and is especially useful when dealing with semi-structured data (e.g., natural language).
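As a minimal sketch of what "using the RESTful API" can look like, the snippet below builds (but does not send) a full-text search request with Python's standard library. The host `localhost:9200`, the index name `song`, and the helper name `build_search_request` are assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: a local single-node cluster listening on the default HTTP port.
ES_URL = "http://localhost:9200"


def build_search_request(index, field, text):
    """Build (but do not send) a full-text `match` search on one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("song", "songName", "cry")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run the search against a live cluster, one could send the request:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Keeping the request construction separate from sending it makes the shape of the call easy to inspect without a running cluster.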

What can Elasticsearch do?

When we search on GitHub, Elasticsearch helps us not only find individual code repositories but also run code-level searches and highlight the search terms. It can also power product recommendations when you shop online, and it helps locate nearby passengers and drivers when you hail a ride home after work. The Elastic Stack (ELK), which combines Elasticsearch with Kibana, Logstash and Beats, is widely used for big data and near real-time analysis, in fields such as log analysis, metrics monitoring and information security. It can help you explore massive amounts of structured and unstructured data, create visual reports on demand, and set alert thresholds on monitored data.


Elasticsearch feature history for versions 5, 6, and 7

V5.x

Lucene 6.x
Improved performance; the default scoring mechanism changed from TF-IDF to BM25
Support for ingest nodes, the Completion Suggester, and the Java REST client
Type is marked as deprecated; the keyword field type is supported
Performance optimization
- Index throughput is greatly improved by reducing internal contention: concurrent updates to the same document no longer block each other, and less locking is needed when syncing the transaction log
- Instant Aggregations, which cache aggregation results at the shard level
- New Profile API
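To make the switch from TF-IDF to BM25 more concrete, here is a small sketch of the textbook BM25 score for a single term, next to a simplified TF-IDF variant (square-root term frequency times a smoothed IDF). The function names and the sample numbers are assumptions for illustration, and the formulas are not claimed to match Lucene's implementation term for term; `k1=1.2` and `b=0.75` are the commonly used BM25 defaults.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents are penalised, shorter ones are boosted.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # The term-frequency part saturates: it can never exceed (k1 + 1).
    tf_part = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf * tf_part


def simple_tfidf_term_score(term_freq, num_docs, docs_with_term):
    """Simplified TF-IDF: keeps growing as the term is repeated."""
    idf = 1 + math.log((num_docs + 1) / (docs_with_term + 1))
    return math.sqrt(term_freq) * idf


# Hypothetical corpus: 1000 documents, 10 of which contain the term.
for tf in (1, 10, 100):
    bm25 = bm25_term_score(tf, doc_len=100, avg_doc_len=100,
                           num_docs=1000, docs_with_term=10)
    tfidf = simple_tfidf_term_score(tf, num_docs=1000, docs_with_term=10)
    print(f"tf={tf:3d}  bm25={bm25:.3f}  tfidf={tfidf:.3f}")
```

The practical difference this illustrates is saturation: repeating a term many times keeps raising the TF-IDF score, while the BM25 score levels off.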

V6.x

Lucene 7.x
Removal of types: starting with 6.0, newly created indices can no longer contain multiple types
Cross-cluster search: keep the original indices in the 5.x cluster and search the 6.x and 5.x clusters together
Cross-cluster replication (CCR)
Friendlier upgrades and data migration: easier migration between major releases and a smoother upgrade experience
Performance optimization
- Improved handling of sparse fields to reduce storage cost
- Index sorting can be used to speed up queries

V7.x

Lucene 8.0
Major change: support for multiple types under a single index is officially removed
Starting with 7.1, the Security features are free of charge
ECK (Elastic Cloud on Kubernetes) lets users configure, manage, and operate Elasticsearch clusters on Kubernetes
TransportClient is deprecated, so ES 7 Java code should use the REST client
New features
- New cluster coordination
- A more feature-complete REST client
- Script Score Query, the next generation of scoring methods
Performance optimization
- The default number of primary shards changed from 5 to 1 to avoid oversharding
- Faster top-k retrieval
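As a rough illustration of the Script Score Query mentioned above, the sketch below assembles a `script_score` request body as a Python dictionary. The field names `songName` and `price`, the helper name `script_score_query`, and the scoring formula itself are assumptions chosen for illustration; only the overall `script_score` / `query` / `script.source` structure comes from the query DSL.

```python
import json


def script_score_query(field, text, script_source):
    """Wrap a `match` query in a `script_score` query with a custom script."""
    return {
        "query": {
            "script_score": {
                # The inner query selects documents and yields the base _score.
                "query": {"match": {field: text}},
                # The script computes the final score for each matching document.
                "script": {"source": script_source},
            }
        }
    }


# Hypothetical formula: favour cheaper songs by dividing the text score by the price.
body = script_score_query("songName", "cry", "_score / (1 + doc['price'].value)")
print(json.dumps(body, indent=2))
```

Such a body would be sent to the `_search` endpoint of an index; the script must not produce negative scores.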

Basic concepts of Elasticsearch

Elasticsearch (Index, Document, Type)

Index

An Index is a container for documents: a collection of documents of the same kind
- The Index is a logical-space concept: each Index has its own Mapping, which defines the field names and field types of the documents it contains
- The Shard is a physical-space concept: the data in an index is spread across its Shards
Index Mapping and Settings
- The Mapping defines the types of the document fields
- The Settings define how the data is distributed, such as the number of shards and replicas

Settings: define how the data is distributed

{
  "movies" : {
    "settings" : {
      "index" : {
        "creation_date" : "1570452552",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "pB0UsxjfQT2fW-s8Uy-Nsg",
        "version" : {
          "created" : "2030599"
        }
      }
    }
  }
}
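To show how `number_of_shards` relates to where a document ends up, here is a simplified sketch of hash-based routing: a hash of the routing value (by default the document ID) taken modulo the number of primary shards. Elasticsearch uses its own Murmur3-based hash; `zlib.crc32` is a stand-in here, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Simplified routing: hash(routing value) modulo the primary shard count.

    zlib.crc32 is only a stand-in for the hash Elasticsearch really uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


NUMBER_OF_SHARDS = 5  # matches "number_of_shards" in the settings above

placement = {}
for doc_id in ["1", "2", "3", "4", "5", "6", "7", "8"]:
    placement.setdefault(pick_shard(doc_id, NUMBER_OF_SHARDS), []).append(doc_id)

for shard, doc_ids in sorted(placement.items()):
    print(f"shard {shard}: documents {doc_ids}")
```

Because the shard is derived from the shard count, the same document would map to a different shard if the count changed, which is one reason the number of primary shards is fixed when an index is created.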

Define the types of document fields

{
    "movie": {
        "mappings": {
            "doc": {
                "properties": {
                    "songName": {
                        "type": "text"
                    },
                    "singer": {
                        "type": "text"
                    },
                    "price": {
                        "type": "integer"
                    }
                }
            }
        }
    }
}
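The mapping above still nests its fields under a type name ("doc"). As a hedged sketch of how the same fields might be declared when creating an index in 7.x, where mappings are typeless, the snippet below builds (but does not send) a create-index request. The host, the index name `song`, and the helper name `build_create_index_request` are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: a local single-node cluster listening on the default HTTP port.
ES_URL = "http://localhost:9200"


def build_create_index_request(index, shards, replicas, properties):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        # Settings: how the data is distributed.
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        },
        # Mappings: field names and types, with no type level in 7.x.
        "mappings": {"properties": properties},
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


song_properties = {
    "songName": {"type": "text"},
    "singer": {"type": "text"},
    "price": {"type": "integer"},
}
create_request = build_create_index_request("song", 1, 1, song_properties)
print(create_request.get_method(), create_request.full_url)
print(json.dumps(json.loads(create_request.data), indent=2))
```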

"Index" has different meanings depending on context. In ES, as a noun it refers to an index created in the cluster; as a verb it refers to the process of saving a document to ES, i.e. the process of building the inverted index. Elsewhere, "index" more often refers to a B-tree index or an inverted index.
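To illustrate what an inverted index is, here is a toy sketch in Python: it maps each term to the set of document IDs that contain it and answers queries by intersecting those sets. The whitespace tokenizer, the function names, and the sample documents are assumptions for illustration; real analyzers and Lucene's data structures are far more sophisticated.

```python
from collections import defaultdict


def tokenize(text):
    """Very naive analysis: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Return the IDs of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by document ID.
docs = {
    "1": "Say Good Don't Cry",
    "2": "Good Night",
    "3": "Don't Cry For Me",
}
inverted = build_inverted_index(docs)
print(sorted(search(inverted, "good")))
print(sorted(search(inverted, "don't cry")))
```

Looking up a term is a dictionary access rather than a scan over every document, which is the core reason full-text search over an inverted index is fast.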

Document

Elasticsearch is document-oriented, and a document is the smallest unit of searchable data, for example:
- A log entry in a log file
- The details of a movie
- The details of a song
The document is serialized to JSON format and saved in Elasticsearch
- A JSON object consists of fields
- Each field has a corresponding field type (string, numeric, boolean, date, binary, range)
Each document has a unique ID
- You can specify the ID yourself or let Elasticsearch generate it automatically

Example

{
  "songName" : "Say Good Don't Cry",
  "singer" : "Jay Chou",
  "price" : 3
}

Metadata for the document

{
  "_index" : "song",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "songName" : "Say Good Don't Cry",
    "singer" : "Jay Chou",
    "price" : 3
  }
}

Metadata annotates a document with related information
- _index: the name of the index the document belongs to
- _type: the name of the type the document belongs to
- _id: the unique ID of the document
- _source: the raw JSON data of the document
- _all: consolidates the contents of all fields into this field (deprecated in 6.0 and removed in 7.0)
- _version: the version of the document
- _score: the relevance score
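The metadata fields can be pictured with a toy in-memory store that mimics only the shape of a "get document" response. The class `ToyIndex` is hypothetical, and `uuid` is a stand-in for Elasticsearch's own automatic ID generation; this is a sketch of the bookkeeping, not of how Elasticsearch stores documents.

```python
import uuid


class ToyIndex:
    """In-memory stand-in that tracks _id, _version and _source per document."""

    def __init__(self, name):
        self.name = name
        self._docs = {}

    def index(self, source, doc_id=None):
        """Store a document; generate an ID if none is given. Returns the ID."""
        doc_id = doc_id or uuid.uuid4().hex
        previous = self._docs.get(doc_id)
        # Re-indexing an existing ID replaces the document and bumps the version.
        version = previous["_version"] + 1 if previous else 1
        self._docs[doc_id] = {"_version": version, "_source": dict(source)}
        return doc_id

    def get(self, doc_id):
        """Return the document wrapped in its metadata."""
        base = {"_index": self.name, "_type": "_doc", "_id": doc_id}
        stored = self._docs.get(doc_id)
        if stored is None:
            return {**base, "found": False}
        return {**base, "_version": stored["_version"], "found": True,
                "_source": stored["_source"]}


songs = ToyIndex("song")
songs.index({"songName": "Say Good Don't Cry", "singer": "Jay Chou", "price": 3}, doc_id="1")
songs.index({"songName": "Say Good Don't Cry", "singer": "Jay Chou", "price": 4}, doc_id="1")
auto_id = songs.index({"songName": "Good Night", "singer": "Example Singer", "price": 2})
print(songs.get("1"))
print(songs.get(auto_id)["_id"] == auto_id)
```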

Type

Prior to 6.0, multiple types could be defined in a single index.
Since 6.0, types have been deprecated. Starting with 7.0, an index can have only one type: "_doc".

RDBMS VS Elasticsearch

The following is a rough analogy between an RDBMS and Elasticsearch. An Elasticsearch cluster can contain multiple indices (databases), and each index can contain a doc type (table). Each type contains multiple documents (rows), and each document contains multiple fields (columns). The query DSL is the equivalent of SQL in an RDBMS.

RDBMS	Elasticsearch
Schema	Mapping
Table	Index(Type)
Column	Field
Row	Document
SQL	DSL
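To make the SQL versus DSL row of the table more tangible, the sketch below runs a small SQL query against an in-memory SQLite table and shows a query DSL body that expresses roughly the same request. The table contents are hypothetical sample data, and the DSL body is only built, not sent to a cluster.

```python
import json
import sqlite3

# RDBMS side: a table with rows and columns (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song (songName TEXT, singer TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO song VALUES (?, ?, ?)",
    [
        ("Say Good Don't Cry", "Jay Chou", 3),
        ("Example Song A", "Example Singer", 5),
        ("Example Song B", "Example Singer", 2),
    ],
)
rows = conn.execute(
    "SELECT songName FROM song WHERE price <= ? ORDER BY price", (3,)
).fetchall()
print(rows)

# Elasticsearch side: roughly the same request expressed in the query DSL.
dsl_body = {
    "_source": ["songName"],                       # SELECT songName
    "query": {"range": {"price": {"lte": 3}}},     # WHERE price <= 3
    "sort": [{"price": "asc"}],                    # ORDER BY price
}
print(json.dumps(dsl_body, indent=2))
```

The DSL body would be posted to the `_search` endpoint of the `song` index, which plays the role of the table in this analogy.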

Summary

Elasticsearch can return search results in around 10 milliseconds, whereas a traditional SQL database management system may take more than 10 seconds to return the data for the same search query. Because Elasticsearch has a distributed architecture, it can scale to thousands of servers and hold petabytes of data. We don't have to manage the complexity of the distributed design ourselves because ES handles it automatically. There are many ways to index and query documents, and with ES we can easily run fast full-text searches over massive amounts of data and get the results we want.

Author: peak. Link: www.jianshu.com/p/1dc661517… Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

