Elasticsearch

A distributed full-text search engine

I. Use Cases

  1. Information search
    • E-commerce sites
    • Job sites
    • News sites
  2. Log collection and analysis – ELK
  3. Data analysis – product sales, visit counts, spending amounts

II. Core Concepts

  1. Index – comparable to a database
  2. Shard – a partition of an index
    • Each shard corresponds to one Lucene index
    • Each shard has its own translog
  3. Type (to be removed) – comparable to a table
  4. Document – comparable to a row
  5. Field – comparable to a column
  6. Mapping – comparable to a schema (field constraints)

III. API

Append ?explain to the request URL to see why a statement failed

1. Index

  • Create an index – PUT /{index}
  • Check whether an index exists – HEAD /{index}
  • View index properties
    • Single – GET /{index}
    • Multiple – GET /{index1},{index2},{index3}
    • All – GET _all
    • All, as a table – GET /_cat/indices?v
  • Open an index – POST /{index}/_open
  • Close an index – POST /{index}/_close
  • Delete indices – DELETE /{index1},{index2},{index3}
  • Index migration – POST _reindex
    • version_type
      • internal [default] – copies documents directly and overwrites any existing document with the same ID
      • external – keeps the source version during migration and updates an existing document only when the source version is newer
    • op_type
      • create – only creates missing documents; an existing document causes a version conflict
    • conflicts
      • proceed – continues when an existing document is encountered and reports only the number of conflicts
    • query – filters the source documents; sorting and a limit on the number of documents are also supported
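
As a rough illustration of how the _reindex options above combine, the sketch below assembles a request body in Python. The helper name build_reindex_body and the index names are hypothetical, and max_docs assumes a reasonably recent Elasticsearch version; the resulting body would be sent with POST _reindex.

```python
import json


def build_reindex_body(source_index, dest_index, query=None,
                       version_type="internal", op_type=None,
                       proceed_on_conflict=False, max_docs=None):
    """Assemble a body for POST _reindex (hypothetical helper)."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query            # copy only matching documents
    dest = {"index": dest_index, "version_type": version_type}
    if op_type is not None:
        dest["op_type"] = op_type          # "create" never overwrites
    body = {"source": source, "dest": dest}
    if proceed_on_conflict:
        body["conflicts"] = "proceed"      # count conflicts instead of aborting
    if max_docs is not None:
        body["max_docs"] = max_docs        # limit the number of documents copied
    return body


# Made-up index names and filter, for illustration only.
body = build_reindex_body(
    "jobs_v1", "jobs_v2",
    query={"term": {"city": "beijing"}},
    op_type="create",
    proceed_on_conflict=True,
)
print(json.dumps(body, indent=2))
```

With op_type set to create and conflicts set to proceed, documents that already exist in the destination are skipped and only counted.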

2. Mapping

  • Create a mapping – PUT /{index}/_mapping
PUT /{index}/_mapping
{
    "properties": {
        "<field name>": {
            "type": "<type>",
            "index": true,
            "store": true,
            "analyzer": "<analyzer>"
        }
    }
}

PUT /lagou-company-index/_mapping
{
    "properties": {
        "name": {
            "type": "text",
            "analyzer": "ik_max_word"
        },
        "job": {
            "type": "text",
            "analyzer": "ik_max_word"
        },
        "logo": {
            "type": "keyword",
            "index": false
        },
        "payment": {
            "type": "float"
        }
    }
}
  • View a mapping – GET /{index}/_mapping
  • View all mappings
    • GET _mapping
    • GET _all/_mapping
  • Modify a mapping – PUT /{index}/_mapping
    • Fields can only be added; existing field mappings cannot be changed
    • To change an existing field, delete the index and recreate it with the new mapping
  • Create an index together with its mapping – PUT /{index}

3. Document

  • Add a document
    • With a specified ID – POST /{index}/_doc/{id}
    • With an auto-generated ID – POST /{index}/_doc
  • View a document
    • By ID – GET /{index}/_doc/{id}
    • By condition – GET /{index}/_search
    • Return only selected fields – GET /{index}/_doc/{id}?_source=field1,field2
  • Update a document
    • Full update (the original document is deleted, then the new one is added) – PUT /{index}/_doc/{id} – the document is created if the ID does not exist
    • Partial update (modify individual fields) – POST /{index}/_update/{id}
  • Delete a document
    • By ID – DELETE /{index}/_doc/{id}
    • By condition – POST /{index}/_delete_by_query
  • Batch get
    • GET /_mget
    • GET /{index}/_mget
  • Batch add, delete, and update – POST /_bulk
    {"<action>": {"_index": "<index name>", "_id": "<id>"}}
    {<document data>}
    • create – adds a document
    • index – adds a document or fully replaces an existing one – equivalent to PUT
    • update – partially updates a document
    • delete – deletes a document

    A batch of 1,000 to 5,000 documents, with a total size of 5 MB to 15 MB, is recommended
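
The _bulk body is newline-delimited JSON: one action line, followed by a data line for every action except delete. A minimal sketch of building such a payload and splitting work into batches follows; the helper names, the jobs index, and the sample documents are made up for illustration.

```python
import json


def bulk_lines(actions):
    """Serialize (action, meta, source) tuples into a _bulk NDJSON body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if action != "delete":             # delete has no data line
            # update wraps the partial document in "doc"
            data = {"doc": source} if action == "update" else source
            lines.append(json.dumps(data))
    return "\n".join(lines) + "\n"         # the body must end with a newline


def chunks(items, size=1000):
    """Split a list into batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


actions = [
    ("index", {"_index": "jobs", "_id": "1"}, {"title": "engineer"}),
    ("update", {"_index": "jobs", "_id": "1"}, {"title": "senior engineer"}),
    ("delete", {"_index": "jobs", "_id": "2"}, None),
]
payload = bulk_lines(actions)
batches = list(chunks(list(range(2500)), size=1000))
print(payload)
```

Batching by document count is the simpler rule; a real loader would probably also cap each batch by its serialized size.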

IV. Mapping Attributes

  • type – field type
    • String
      • text – analyzed (split into terms); cannot be aggregated
      • keyword – not analyzed; can be aggregated
    • Numeric
      • byte
      • short
      • integer
      • long
      • double
      • float
      • half_float
      • scaled_float – high precision; a scaling factor (scaling_factor) must be specified
    • Date – [suggestion] store as a long holding milliseconds
    • Array
      • If any element of the array matches, the document is considered a match
      • When sorting, ascending order uses the smallest element of the array and descending order uses the largest
    • Object
    • geo_point – latitude and longitude
  • index – whether the field is indexed, that is, searchable – [default] true
  • store – whether the field value is stored separately, which speeds up retrieval but uses extra space – [default] false
  • analyzer – the analyzer (word splitter)
    • Chinese
      • ik_max_word [commonly used] – finest-grained splitting
      • ik_smart – coarsest-grained splitting
  • dynamic – how unknown fields are handled
    • true – mapped automatically
    • false – ignored
    • strict – rejected with an error
  • date_detection – whether date detection is enabled – when set to false, strings always stay strings
  • dynamic_date_formats – the formats used to recognize dates in strings
  • dynamic_templates – applies different mappings to different fields or data types
  • refresh_interval – index refresh interval – [default] 1 second
  • index.translog.durability – how the translog is flushed to disk – [default] request (fsync after every request); async flushes at a fixed interval
  • index.translog.sync_interval – translog flush interval – [default] 5 seconds

PUT /{index}
{
    "settings": {
        "number_of_shards": <number of shards>,
        "number_of_replicas": <number of replicas>,
        "refresh_interval": "<index refresh interval>",
        "index.translog.durability": "async",
        "index.translog.sync_interval": "5s"
    },
    "mappings": {
        "dynamic": "<dynamic mapping mode>",
        "date_detection": <true or false>,
        "dynamic_date_formats": ["MM/dd/yyyy"],
        "properties": {
            "<field name>": {
                "<mapping attribute name>": "<mapping attribute value>"
            }
        }
    }
}

V. Search Types

POST /{index}/_search
{
    "query": {
        "<search type>": {
            "<field to search>": "<search value>"
        }
    },
    "sort": [
        {"<field to sort by>": {"order": "asc"}}
    ],
    "highlight": {
        "pre_tags": "<font color='pink'>",
        "post_tags": "</font>",
        "fields": [{"<field to highlight>": {}}]
    },
    "from": <offset of the first hit, i.e. (page number - 1) * size>,
    "size": <number of hits per page>
}
  • match_all – matches all documents
  • match – analyzes the search text into terms; the terms are combined with OR by default, and the operator attribute changes this to AND
  • match_phrase – analyzes the search text; the target document must contain all terms in the same order
  • multi_match – searches several specified fields; terms are combined with OR
    • Field names may use the * wildcard – *_name
    • Use ^ to boost the weight of a field – subject^3
  • term – exact lookup; the search text is not analyzed
  • query_string – searches a specified field or all fields; the query string is parsed with operators such as AND, OR, and ~
  • range – range search, used for numbers and dates
  • exists – matches documents in which the field has a non-null value
  • prefix – prefix match
  • wildcard – wildcard match
  • regexp – regular expression match
  • fuzzy – fuzzy match
  • bool – compound query
    • must – must match
    • filter – must match; does not affect the score; results are cached in memory, so repeated searches are faster
    • should – should match
    • must_not – must not match; does not affect the score
  • dis_max – when several queries match, only the highest score is used as the document score; the default behavior is to add the scores together
  • suggest – search suggestions
    • completion – matches the search text as a prefix and returns suggestions
      • preserve_separators – whether separators in the search text are preserved
      • preserve_position_increments – whether to ignore a stop word when it is the first word of the suggestion
    • phrase – analyzes the search text, compares it with the indexed text, and suggests corrected phrases
    • term – analyzes the search text and returns suggestions for each term (suggest_mode)
      • missing – suggests only for terms that are not found in the index
      • always – suggests whether or not the term is found in the index
      • popular – suggests only terms that occur more frequently than the original term
    • context – similar to completion, with categories added for further filtering

    Production recommendation:

    completion → no match → phrase → no match → term
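
The fallback order in the production recommendation can be sketched as a simple chain: try each suggester in turn and stop at the first one that returns something. The three suggester functions below are stand-ins with hard-coded results, not Elasticsearch calls; a real version would send a suggest body to _search.

```python
def suggest_with_fallback(text, suggesters):
    """Try each suggester in order; return the first non-empty result."""
    for name, suggest in suggesters:
        results = suggest(text)
        if results:
            return name, results
    return None, []


# Stand-in suggesters with hard-coded data (assumptions for the sketch).
def completion(text):
    return [w for w in ("elastic", "elasticsearch") if w.startswith(text)]


def phrase(text):
    return ["elastic search"] if text == "elastc search" else []


def term(text):
    return ["elastic"] if text == "elastc" else []


chain = [("completion", completion), ("phrase", phrase), ("term", term)]
print(suggest_with_fallback("ela", chain))
print(suggest_with_fallback("elastc", chain))
```

Completion is tried first because it is the cheapest and the most precise; the phrase and term suggesters only run when nothing matched as a prefix.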

VI. Aggregation Analysis

"aggregations" : {
    "<aggregation_name>" : {                              <!-- aggregation name -->
        "<aggregation_type>" : {                          <!-- aggregation type -->
            <aggregation_body>                            <!-- aggregation body: which fields are aggregated -->
        }
        [,"meta" : { [<meta_data_body>] } ]?              <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]?    <!-- sub-aggregations -->
    }
    [,"<aggregation_name_2>" : { ... } ]*                 <!-- another aggregation name -->
}

1. Aggregation kinds

  • Metric aggregation – computes a statistic over a set of documents
  • Bucket aggregation – groups documents into buckets before the statistics are computed
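
To make the bucket-then-metric idea concrete, here is a minimal sketch that builds a request body with a terms bucket and an avg metric nested inside it. The helper name and the city and payment fields are assumptions for illustration; the body would be sent to /{index}/_search.

```python
import json


def terms_with_avg(bucket_field, metric_field, bucket_size=10):
    """Group by bucket_field, then average metric_field within each bucket."""
    return {
        "size": 0,                                   # return no hits, only aggregations
        "aggs": {                                    # "aggs" is short for "aggregations"
            "by_" + bucket_field: {
                "terms": {"field": bucket_field, "size": bucket_size},
                "aggs": {
                    "avg_" + metric_field: {"avg": {"field": metric_field}}
                },
            }
        },
    }


body = terms_with_avg("city", "payment")
print(json.dumps(body, indent=2))
```

The bucket field should be a keyword or numeric field, since text fields cannot be aggregated.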

2. Metrics

  1. Maximum – max
  2. Minimum – min
  3. Sum – sum
  4. Average – avg
  5. Count – count
  6. Number of values present in a field – value_count
  7. Distinct count – cardinality
  8. Basic statistics – stats – includes max, min, sum, avg, and count
  9. Extended statistics – extended_stats – adds sum of squares, variance, and standard deviation
  10. Percentiles – percentiles – the percentiles to compute can be specified
  11. Percentile ranks – percentile_ranks – the percentage of values that fall at or below given values
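
For a feel of what these metrics report, the sketch below computes the same figures locally over a small made-up list. It assumes population variance and an exact distinct count; Elasticsearch computes cardinality approximately, so this is an illustration rather than a reimplementation.

```python
import math


def extended_stats(values):
    """Compute stats / extended_stats style figures for a list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance (assumed); clamp tiny negative rounding errors.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "sum": total,
        "avg": avg,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
        "cardinality": len(set(values)),   # exact here, approximate in Elasticsearch
    }


payments = [10, 20, 30, 40]   # made-up sample values
stats = extended_stats(payments)
print(stats)
```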

VII. Distributed Cluster

1. Roles

  • Cluster – a group of nodes that share the same cluster name
  • Node
    • master – whether the node is eligible to be elected master – [default] true
    • data – whether the node stores data – [default] true
  • Shard – a data partition of an index
    • The number of primary shards cannot be changed unless the index is rebuilt
    • By default, each primary shard has one replica shard, and the two are never placed on the same node
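
The fixed primary shard count follows from how documents are routed: the target shard is derived from a hash of the routing value (the document ID by default) modulo the number of primary shards. The sketch below uses zlib.crc32 as a stand-in hash, which is an assumption for illustration only; Elasticsearch uses its own hash function.

```python
import zlib


def shard_for(routing, num_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


ids = ["1", "2", "3", "4", "5"]
before = {doc_id: shard_for(doc_id, 5) for doc_id in ids}
after = {doc_id: shard_for(doc_id, 6) for doc_id in ids}

# Documents whose shard would change if the shard count went from 5 to 6.
moved = [doc_id for doc_id in ids if before[doc_id] != after[doc_id]]
print(before, after, moved)
```

Because changing the divisor changes where existing documents are expected to live, they could no longer be found without reindexing everything.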

2. Characteristics

  • New nodes are discovered automatically
  • Nodes are peers – any node can receive a request and forward it to the nodes that hold the data
  • When a node goes down, the lost data is recovered from replica shards
  • Search latency is on the order of hundreds of milliseconds

3. Deployment and planning

Principles

  • With 30 GB of JVM heap, cap each shard at 30 GB, then calculate the total number of shards from the data volume
  • Divide the total number of shards by 1.5 to 3 to get the number of nodes
  • Set the number of replicas to 2 to ensure high availability
  • When search performance degrades, increase the number of replicas to improve concurrent search capacity
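
As a worked example of the sizing rules above, the sketch below turns a data volume into a shard count and a node range. The function name and the 900 GB figure are made up, and the calculation assumes the data volume covers primary shards only.

```python
import math


def plan_cluster(data_gb, max_shard_gb=30):
    """Return (shards, min_nodes, max_nodes) from the rules of thumb above."""
    shards = math.ceil(data_gb / max_shard_gb)   # cap each shard at 30 GB
    min_nodes = math.ceil(shards / 3)            # about 3 shards per node
    max_nodes = math.ceil(shards / 1.5)          # about 1.5 shards per node
    return shards, min_nodes, max_nodes


print(plan_cluster(900))   # (30, 10, 20)
```

For 900 GB this gives 30 shards spread over 10 to 20 nodes.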

Applications

  • Search – tens of millions to billions of documents – two to four nodes
  • Online analytical processing – ELK – billions of records – dozens to hundreds of nodes

4. Consistency guarantees

  • ?wait_for_active_shards=<number of active shard copies>&timeout=<timeout>

VIII. Relationships

  1. Application-side join – the indexes stay independent and are joined in the application – suitable for a small number of documents
  2. Data denormalization and nested objects (nested documents)
    • Redundant fields trade indexing performance for search performance
    • Redundant fields should rarely change
    • Suitable when there are few relationships to handle
    • Suited to read-heavy, write-light scenarios
  3. Parent/child relationships
    • Trade search performance for indexing performance
    • A single search cannot return both parent and child documents
    • Parent and child documents must be on the same shard
    • Suited to write-heavy, read-light scenarios

IX. Persistence

  1. refresh
    • Writes the in-memory buffer to a new segment, making the documents searchable
    • [Default] runs every 1 second
  2. flush
    • Flushes all segments to disk, creates a commit point, and clears the translog
    • [Default] runs every 30 minutes

    When a node crashes and restarts, the translog is replayed from the last commit point to recover the data
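
The interplay of buffer, segments, translog, and commit point can be pictured with a toy model. This is a deliberately simplified in-memory sketch, not how Elasticsearch is implemented, and it assumes the translog itself survives the crash.

```python
class ToyShard:
    """Toy model of refresh, flush, and translog replay."""

    def __init__(self):
        self.buffer = []       # indexed but not yet searchable
        self.segments = []     # searchable documents
        self.committed = []    # documents covered by the last commit point
        self.translog = []     # operations since the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)            # every write is also logged

    def refresh(self):
        self.segments.extend(self.buffer)    # new segment becomes searchable
        self.buffer.clear()

    def flush(self):
        self.refresh()
        self.committed = list(self.segments)  # create a commit point
        self.translog.clear()                 # logged operations are now durable

    def crash_and_recover(self):
        # In-memory state is lost; rebuild from the commit point plus the translog.
        self.buffer = []
        self.segments = list(self.committed) + list(self.translog)

    def search(self):
        return list(self.segments)


shard = ToyShard()
shard.index("a")
visible_before_refresh = shard.search()   # nothing is searchable yet
shard.flush()
shard.index("b")                          # only in the buffer and the translog
shard.crash_and_recover()
recovered = shard.search()
print(visible_before_refresh, recovered)
```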

X. Concurrency Control

  1. Built-in versioning – ?if_seq_no=<sequence number>&if_primary_term=<primary term>
  2. Custom (external) versioning – ?version=<version>&version_type=external
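
The built-in check is optimistic concurrency control: a write carries the sequence number and primary term it last read, and is rejected if the document has changed since. The toy in-memory index below only mimics that rule; the class and exception names are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the supplied seq_no / primary_term is stale."""


class ToyIndex:
    def __init__(self):
        self.docs = {}           # id -> (source, seq_no, primary_term)
        self.next_seq_no = 0
        self.primary_term = 1

    def put(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            expected = (if_seq_no, if_primary_term)
            if current is None or (current[1], current[2]) != expected:
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return seq_no, self.primary_term


index = ToyIndex()
seq_no, term = index.put("1", {"stock": 10})
# First writer succeeds because it holds the current seq_no.
index.put("1", {"stock": 9}, if_seq_no=seq_no, if_primary_term=term)
try:
    # Second writer still holds the old seq_no and is rejected.
    index.put("1", {"stock": 8}, if_seq_no=seq_no, if_primary_term=term)
    conflict = False
except VersionConflict:
    conflict = True
print(conflict)
```

A rejected writer would normally re-read the document, reapply its change, and retry with the fresh values.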

XI. Pagination

  1. from + size – ordinary paging; deep paging causes performance problems
  2. scroll – keeps a snapshot of all matching results – not suitable for real-time search, suitable for background batch processing
  3. search_after – determines the next page from the sort values of the last hit on the previous page – cannot jump to an arbitrary page
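
A minimal sketch of the request bodies for the first and third schemes follows. Note that from is the zero-based offset of the first hit, so a page number has to be converted. The helper names, the payment and id fields, and the sample hit are assumptions; search_after needs a sort that includes a unique tiebreaker field.

```python
def page_body(query, page, size):
    """from + size paging: convert a 1-based page number to an offset."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"query": query, "from": (page - 1) * size, "size": size}


def search_after_body(query, sort, size, last_hit=None):
    """search_after paging: continue after the last hit of the previous page."""
    body = {"query": query, "sort": sort, "size": size}
    if last_hit is not None:
        body["search_after"] = last_hit["sort"]   # sort values of the last hit
    return body


query = {"match_all": {}}
third_page = page_body(query, page=3, size=20)

sort = [{"payment": "desc"}, {"id": "asc"}]          # "id" is a hypothetical unique field
first = search_after_body(query, sort, 10)
last_hit = {"_id": "42", "sort": [9500.0, "42"]}     # made-up last hit of page one
second = search_after_body(query, sort, 10, last_hit)
print(third_page, second)
```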

XII. Performance Optimization

  1. Set the number of replicas to 0 during the initial bulk load
  2. Let Elasticsearch generate document IDs automatically, which avoids the disk read needed to check for an existing ID
  3. Do not analyze or index unimportant fields
  4. Increase the index refresh interval – default 1 second
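
Points 1 and 4 can be applied as a pair of settings updates, one before a large load and one after. The sketch below only builds the bodies; the helper name and the 30s interval are assumptions, and each body would be sent with PUT /{index}/_settings.

```python
def bulk_load_settings(loading, replicas_after=1, refresh_after="1s"):
    """Return index settings to apply before (loading=True) or after a bulk load."""
    if loading:
        # No replicas and a slower refresh while loading (30s is an arbitrary choice).
        return {"index": {"number_of_replicas": 0, "refresh_interval": "30s"}}
    # Restore replicas and the normal refresh interval once the load is done.
    return {"index": {"number_of_replicas": replicas_after,
                      "refresh_interval": refresh_after}}


before_load = bulk_load_settings(True)
after_load = bulk_load_settings(False, replicas_after=2)
print(before_load, after_load)
```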