Elasticsearch document concurrent processing and document routing


These are notes from a recorded video tutorial. The notes are concise; for the complete material, refer to the video. Video download link: https://pan.baidu.com/s/1TwyO… Extract code: AEE2

1. Basic operations on Elasticsearch documents

1.1 New Document

Start by creating a new index.

Then add a document to the index:

PUT blog/_doc/1
{
  "title":"6.elasticsearch",
  "date":"2020-11-05",
  "content":"Start by creating a new index."
}

Here, 1 is the ID of the new document.

After adding successfully, the response will look like this:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

_index is the index the document belongs to.
_type is the type of the document.
_id is the ID of the document.
_version is the version of the document (it is tracked per document and increments by 1 automatically each time the document is updated).
result is the outcome of the operation.
_shards holds the shard information.
_seq_no and _primary_term are also used for version control (they are tracked for the index rather than for a single document).

After adding successfully, you can view the added document:

You can also add a document without specifying an ID, in which case ES generates one automatically. If you do not specify an ID, you must use a POST request instead of a PUT request.

POST blog/_doc
{
  "title":"666",
  "date":"2020-11-05",
  "content":"Start by creating a new index."
}

1.2 Obtaining documents

ES provides the GET API for viewing the documents it stores. It is used as follows:

GET blog/_doc/RuWrl3UByGJWB5WucKtP

The above command means to get a document with the ID RuWrl3UByGJWB5WucKtP.

If you retrieve a document that does not exist, the following message is returned:

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "2",
  "found" : false
}

If you only want to check whether a document exists, you can use a HEAD request:

HEAD blog/_doc/1

If the document does not exist, the response status is 404 (Not Found).

If the document exists, the response status is 200 (OK).

You can also get documents in bulk.

GET blog/_mget
{
  "ids":["1","RuWrl3UByGJWB5WucKtP"]
}

A GET request can carry a request body.

Some HTTP client libraries, such as those used in JavaScript, do not allow GET requests to have a request body. In fact, RFC 7231 does not define how the body of a GET request should be handled, which creates a degree of confusion: some HTTP servers accept GET requests with a body, while others do not. Although the ES engineers prefer GET for queries, ES also supports POST for queries for the sake of compatibility. For example, the batch query above can also be sent as a POST request.
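As a rough illustration of the POST fallback, the following Python sketch builds (but does not send) the same _mget request with the standard library. The address http://localhost:9200 and the helper name build_mget_request are assumptions made for this example; they are not part of the original notes.

```python
import json
from urllib.request import Request

# Assumed local ES address; adjust for your own cluster.
ES_URL = "http://localhost:9200"

def build_mget_request(index, ids, method="POST"):
    """Build (without sending) an _mget request for the given document IDs."""
    body = json.dumps({"ids": ids}).encode("utf-8")
    return Request(
        url=f"{ES_URL}/{index}/_mget",
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )

# POST is used by default, so the body survives clients that drop GET bodies.
req = build_mget_request("blog", ["1", "RuWrl3UByGJWB5WucKtP"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually send the request, it could be passed to urllib.request.urlopen, provided an ES node is listening at the assumed address.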

1.3 Document Update

1.3.1 General Update

Note that every time the document is updated, its _version increases by 1.

You can update the entire document directly:

PUT blog/_doc/RuWrl3UByGJWB5WucKtP
{
  "title":"666"
}

In this way, the updated document overwrites the original document.

Most of the time, we only want to update some of the document's fields, which we can do with a script.

POST blog/_update/1
{
  "script": {
    "lang": "painless",
    "source":"ctx._source.title=params.title",
    "params": {
      "title":"666666"
    }
  }
}

The update request takes the form POST {index}/_update/{id}.

In the script, lang is the scripting language, and painless is one of the scripting languages built into ES. source is the script to be executed. ctx is a context object, and through ctx you can access _source, _id, _version, and so on.

You can also add fields to the document:

POST blog/_update/1
{
  "script": {
    "lang": "painless",
    "source":"ctx._source.tags=[\"java\",\"php\"]"
  }
}

Once the update succeeds, the document contains the new tags field.

Arrays can also be modified through scripting languages. For example, add another tag:

POST blog/_update/1
{
  "script":{
    "lang": "painless",
    "source":"ctx._source.tags.add(\"js\")"
  }
}

You can also build slightly more complex logic with if/else. The following script deletes the document if its tags contain java and does nothing otherwise:

POST blog/_update/1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.tags.contains(\"java\")){ctx.op=\"delete\"}else{ctx.op=\"none\"}"
  }
}

1.3.2 Query Update

Find the document through a conditional query, and then update it.

For example, set the content of every document whose title contains 666 to 888.

POST blog/_update_by_query
{
  "script": {
    "source": "ctx._source.content=\"888\"",
    "lang": "painless"
  },
  "query": {
    "term": {
      "title":"666"
    }
  }
}

1.4 Delete document

1.4.1 Delete according to ID

Remove a document from the index.

Delete a document with the ID TuUpmHUByGJWB5WuMasV.

DELETE blog/_doc/TuUpmHUByGJWB5WuMasV

If you specified a routing value when you added the document, you must specify the same routing value when you delete it; otherwise the deletion fails.

1.4.2 Query deletion

Query deletion is a POST request.

For example, delete the documents whose title contains 666:

POST blog/_delete_by_query
{
  "query":{
    "term":{
      "title":"666"
    }
  }
}

You can also delete all documents in an index:

POST blog/_delete_by_query
{
  "query":{
    "match_all":{
      
    }
  }
}

1.5 Batch operation

Bulk indexing, bulk deletion, bulk update, and other batch operations can be performed in ES through the Bulk API.

You first need to write all the bulk operations to a JSON file, which is then uploaded and executed via a POST request.

For example, create a new file named aaa.json with the following contents:

First line: index indicates that an index operation is to be performed (this is the action; other actions include create, delete, and update). _index defines the index name, meaning that an index named user is to be created, and _id means that the new document has the ID 666.

The second line is the argument of the first line.

The third line is an update action.

The fourth line is the argument of the third line.

Note that the file must end with an empty line (a trailing newline).
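The four-line layout described above can be sketched in Python as follows. The helper build_bulk_body and the username field with its values are hypothetical, chosen only to illustrate the format; only the index name user and the ID 666 come from the description above.

```python
import json

def build_bulk_body(actions):
    """Serialize (action, payload) pairs into newline-delimited JSON."""
    lines = []
    for action, payload in actions:
        lines.append(json.dumps(action))
        if payload is not None:  # a delete action has no payload line
            lines.append(json.dumps(payload))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

actions = [
    # Lines 1 and 2: index a document with ID 666 into the user index.
    ({"index": {"_index": "user", "_id": "666"}}, {"username": "javaboy"}),
    # Lines 3 and 4: update that same document (hypothetical field value).
    ({"update": {"_index": "user", "_id": "666"}}, {"doc": {"username": "javaboy666"}}),
]

body = build_bulk_body(actions)
print(body, end="")
```

Writing body to aaa.json would give a file in the shape the description expects, including the final newline.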

After the aaa.json file is created, run the following command in the directory that contains it:

curl -XPOST "http://localhost:9200/user/_bulk" -H "content-type:application/json" --data-binary @aaa.json

When the execution completes, an index named user is created, a document is added to it, and that document is then updated.

2. ElasticSearch document routing

ES is a distributed system. When we store a document in ES, the document is actually stored in one of the index's primary shards, which lives on one of the nodes in the cluster.

For example, create a new index with two shards and zero replicas as follows:

Next, save a document to the index:

PUT blog/_doc/a
{
  "title":"a"
}

After the document is saved successfully, you can see in which shard the document was saved:

GET _cat/shards/blog?v

The result looks like this:

index shard prirep state   docs store ip        node
blog  1     p      STARTED 0    208b  127.0.0.1 slave01
blog  0     p      STARTED 1    1.6kb 127.0.0.1 master

From this result, you can see that the document is saved into shard 0.

So what rule does ES use to assign documents to shards?

The routing mechanism in ES uses a hash algorithm: documents whose routing values hash to the same result are placed in the same primary shard. The shard is calculated as follows:

shard=hash(routing) % number_of_primary_shards

The routing value can be an arbitrary string. By default, ES uses the document’s ID as the routing value, turns it into a number with a hash function, and takes that number modulo the number of primary shards. The remainder is the shard number.
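The formula can be tried out with a small Python sketch. Note the assumption: CRC32 is used here only as a stand-in hash, whereas ES uses its own Murmur3-based hash, so the shard numbers printed by this sketch will not match those of a real cluster. The function name pick_shard is likewise made up for the example.

```python
import zlib

def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (CRC32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# By default the document ID is the routing value.
for doc_id in ["a", "b", "c", "d"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 2))

# The same routing value always lands on the same shard,
# but the result depends on the number of primary shards.
print(pick_shard("javaboy", 2), pick_shard("javaboy", 5))
```

Because the shard count is part of the calculation, the same routing value may map to a different shard once the number of primary shards changes.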

The biggest advantage of the default routing mode is load balancing: it distributes data evenly across the shards. Its main disadvantage is that a search cannot tell which shard holds the matching documents, so the request is broadcast to all shards. In addition, because the number of primary shards is part of the formula, it is not convenient to change the number of shards later.

Routing can also be customized by the developer as follows:

PUT blog/_doc/d?routing=javaboy
{
  "title":"d"
}

If a document was added with a routing value, the same routing value must also be specified for queries, deletes, and updates.

GET blog/_doc/d?routing=javaboy

Custom routing may lead to load imbalance, so whether to use it should be decided according to the actual situation.

Typical scenario:

For user data, we can use userid as routing, so that the data of the same user can be kept in the same shard. When retrieving, we can also use userid as routing, so that data can be accurately retrieved from a certain shard.

3. ElasticSearch version control

When we use the ES API to update a document, ES first reads the original document, then applies the update, and then re-indexes the entire document. No matter how many updates you perform, the document saved in ES is the result of the last update. But if two threads try to update the same document at the same time, problems can arise.

The solution is locking.

3.1 Locks

Pessimistic locking

A pessimistic lock assumes the worst: every time data is read, it assumes that someone else may modify it, so it blocks every operation that could damage data integrity. Pessimistic locks, such as row locks and table locks, are common in relational databases.

Optimistic locking

An optimistic lock assumes the best: every time you read data, you assume that nobody else will modify it, so you do not lock it, and you check data integrity only when you commit. This approach improves throughput by eliminating the overhead of locking.

ES uses optimistic locking.
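To make the idea concrete, here is a minimal in-memory sketch of optimistic locking in Python. It is not how ES is implemented; the Store class, the VersionConflict exception, and the add_tag helper are hypothetical names invented for this illustration.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the version that was read."""

class Store:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def read(self, doc_id):
        return self.docs[doc_id]

    def write(self, doc_id, source, expected_version=None):
        current_version = self.docs.get(doc_id, (0, None))[0]
        # The check happens only at commit time; no lock is held in between.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self.docs[doc_id] = (current_version + 1, source)
        return current_version + 1

def add_tag(store, doc_id, tag, max_retries=3):
    """Read-modify-write that re-reads and retries when a conflict is detected."""
    for _ in range(max_retries):
        version, source = store.read(doc_id)
        updated = {**source, "tags": source.get("tags", []) + [tag]}
        try:
            return store.write(doc_id, updated, expected_version=version)
        except VersionConflict:
            continue  # another writer committed first; try again
    raise VersionConflict("gave up after retries")

store = Store()
store.write("1", {"title": "666"})
version, source = store.read("1")  # writer A reads the document
store.write("1", {**source, "title": "888"}, expected_version=version)  # writer B commits first
try:
    # Writer A still holds the old version, so its write is rejected.
    store.write("1", {**source, "tags": ["java"]}, expected_version=version)
except VersionConflict as e:
    print("conflict detected:", e)
print("new version:", add_tag(store, "1", "java"))
```

The stale write is rejected instead of silently overwriting writer B's change, and the retry loop applies the tag on top of the latest state.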

3.2 Version control

Before ES 6.7

Prior to ES 6.7, version and version_type were used for optimistic concurrency control. As mentioned earlier, version increments automatically every time a document is modified, and ES uses the version field to ensure that changes are applied in order.

Version is divided into internal and external version control.

3.2.1 Internal version

The internal version is maintained by ES itself. When a document is created, ES assigns the version of the document a value of 1.

Each time the user modifies the document, the version number is incremented by 1.

If internal versions are used, ES requires that the version parameter’s value be equivalent to the version value in the ES document for the operation to succeed.

3.2.2 External version

The version can also be maintained externally.

When you add a document, you specify a version number:

PUT blog/_doc/1?version=200&version_type=external
{
  "title":"2222"
}

In future updates, the version should be larger than the existing version number.

version_type=external or version_type=external_gt means that the version supplied with an update must be greater than the existing version number.
version_type=external_gte means that the version supplied with an update must be greater than or equal to the existing version number.
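The comparison rules for the external version types can be summarized in a few lines of Python. This is only a sketch of the rule stated above, not ES code; the function name version_accepted is made up for the example.

```python
def version_accepted(existing, incoming, version_type="external"):
    """Return True if a write carrying `incoming` may replace version `existing`."""
    if version_type in ("external", "external_gt"):
        return incoming > existing
    if version_type == "external_gte":
        return incoming >= existing
    raise ValueError(f"unsupported version_type: {version_type}")

print(version_accepted(200, 201))                  # True: 201 > 200
print(version_accepted(200, 200))                  # False: not strictly greater
print(version_accepted(200, 200, "external_gte"))  # True: equal is allowed
```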

3.2.3 Latest scheme (after ES6.7)

Two parameters, if_seq_no and if_primary_term, are now used for concurrency control.

seq_no does not belong to a single document; it is tracked at the index level (more precisely, by the primary shard that handles the write). By contrast, version belongs to a single document, and the versions of different documents are independent of each other. seq_no is now used for concurrency control when updating documents. Because it is not tied to one document, seq_no increments automatically whenever any document in the same shard is modified or added.

Optimistic concurrency control can now be done with seq_no and primary_term.

PUT blog/_doc/2?if_seq_no=5&if_primary_term=1
{
  "title":"6666"
}
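The behavior of these two parameters can be imitated with a small in-memory model. The sketch below is a simplification under stated assumptions: TinyIndex and SeqNoConflict are hypothetical names, a single counter stands in for the sequence numbers that a real primary shard assigns, and primary_term never changes.

```python
class SeqNoConflict(Exception):
    """Raised when if_seq_no / if_primary_term do not match the stored document."""

class TinyIndex:
    """In-memory stand-in with one sequence counter shared by all documents."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}  # doc_id -> {"_seq_no", "_primary_term", "_source"}

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            # Conditional write: succeed only if the document is unchanged.
            if (current is None
                    or current["_seq_no"] != if_seq_no
                    or current["_primary_term"] != if_primary_term):
                raise SeqNoConflict(f"document {doc_id} changed since it was read")
        self.docs[doc_id] = {
            "_seq_no": self.next_seq_no,
            "_primary_term": self.primary_term,
            "_source": source,
        }
        self.next_seq_no += 1  # shared counter: any write to any document advances it
        return self.docs[doc_id]

idx = TinyIndex()
idx.index("1", {"title": "a"})
seen = idx.index("2", {"title": "b"})

# First conditional update matches what was read, so it succeeds.
idx.index("2", {"title": "6666"},
          if_seq_no=seen["_seq_no"], if_primary_term=seen["_primary_term"])

try:
    # Reusing the old seq_no now fails: the document has moved on.
    idx.index("2", {"title": "stale"},
              if_seq_no=seen["_seq_no"], if_primary_term=seen["_primary_term"])
except SeqNoConflict as e:
    print("rejected:", e)
```

In practice, the values to pass as if_seq_no and if_primary_term are the _seq_no and _primary_term returned when the document was last read or written.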
