What is Elasticsearch

Wherever there is data, there is a need to search it, and search is inseparable from search engines. Baidu and Google are both very large and complex search engines that index almost all of the open web pages and data on the Internet. Elasticsearch is a full-text search engine that can quickly store, search, and analyze massive amounts of data.



Why Elasticsearch

Elasticsearch is an open source search engine built on top of Apache Lucene™, a full-text search engine library.

What is Lucene? Lucene may be the most advanced, high-performance, full-featured search engine library in existence, open source or proprietary, but it is just that: a library. To use Lucene we have to write Java and work with the Lucene package directly, and we need some understanding of information retrieval to follow how it works, so it is not that easy to use.

To solve this problem, Elasticsearch was born. Elasticsearch is also written in Java. It uses Lucene for indexing and searching internally, but its goal is to make full text retrieval easy. It provides a simple and consistent RESTful API for storage and retrieval.

So is Elasticsearch just a simplified wrapper around Lucene? No. Elasticsearch is not just Lucene, and it is not just a full-text search engine. It can be described more accurately as follows:

  • A distributed real-time document store where each field can be indexed and searched
  • A distributed real-time analysis search engine
  • Capable of scaling out to hundreds of server nodes and supporting petabytes of structured or unstructured data
All in all, it is a pretty awesome search engine; Wikipedia, Stack Overflow, and GitHub all use it to power their search.



Installing Elasticsearch

Elasticsearch can be downloaded from the official Elasticsearch website:

https://www.elastic.co/downloads/elasticsearch

At the same time, there are installation instructions on the official website.

To start Elasticsearch, run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows).

I am using a Mac, so I recommend installing it with Homebrew:

[Python]

brew install elasticsearch

Elasticsearch runs on port 9200 by default. Open the following URL in your browser:


http://localhost:9200/

You can see something like this:

[Python]

{
  "name": "atntrTf",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "e64hkjGtTp6_G2h1Xxdv5g",
  "version": {
    "number": "6.2.4",
    "build_hash": "ccec39f",
    "build_date": "2018-04-12T20:37:28.497551Z",
    "build_snapshot": false,
    "lucene_version": "7.2.1",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Elasticsearch has been installed and started successfully. The version of Elasticsearch is 6.2.4.

Let’s take a look at the basic concepts of Elasticsearch and how it interconnects with Python.



Elasticsearch related concepts

Elasticsearch has several basic concepts, such as nodes, indexes, documents, and so on. These concepts will help you to understand Elasticsearch.



Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, and each server can run multiple Elasticsearch instances.

A single Instance of Elasticsearch is called a Node. A group of nodes forms a Cluster.
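
For a quick look at the cluster a node belongs to, you can ask for its health from Python (a minimal sketch of my own, assuming the elasticsearch client introduced later in this article and a node running locally on the default port):

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to localhost:9200 by default
# Returns the cluster name, number of nodes, and health status (green/yellow/red)
print(es.cluster.health())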



Index

Elasticsearch indexes every field and, after processing, writes the result into an inverted index. When looking up data, it queries this index directly.

The top-level unit of data management in Elasticsearch is called an Index, which is similar to the concept of a database in MySQL or MongoDB. It is also worth noting that the name of each Index (that is, each database) must be lowercase.



Document

A single record inside an Index is called a Document. A number of Documents make up an Index.

A Document is represented in JSON format.
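
For example, one of the news records used later in this article looks like this as a Document:

[Python]

{
    "title": "Did the United States leave Iraq a mess?",
    "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
    "date": "2011-12-16"
}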

Documents in the same Index are not required to have the same structure (schema), but keeping them consistent improves search efficiency.



Type

Documents can be grouped. For example, in a weather Index they could be grouped by city (Beijing, Shanghai) or by climate (sunny days, rainy days). Such a grouping is called a Type; it is a virtual logical grouping used to filter Documents, similar to a table in MySQL or a collection in MongoDB.

Different Types should have similar schemas. For example, an id field cannot be a string in one Type and a number in another; this is a difference from tables in a relational database. Data of completely different natures (such as products and logs) should be stored in two separate Indices rather than as two Types in one Index (although that is possible).

Starting with Elastic 6.x, each Index is only allowed to contain a single Type, and Type is planned to be removed entirely in later versions.



Fields

Each Document is a JSON-like structure containing many fields, and each field has its corresponding value. Multiple fields make up a Document; they can be thought of as the columns of a MySQL table.

In Elasticsearch, documents belong to a Type that exists in an Index. We can draw some simple comparisons to traditional relational databases:

[Python]

Relational DB  -> Databases -> Tables -> Rows      -> Columns
Elasticsearch  -> Indices   -> Types  -> Documents -> Fields

These are some of the basic concepts of Elasticsearch, compared to relational databases.



Using Elasticsearch with Python

Elasticsearch provides a rich set of RESTful APIs for storing and querying data, and you can access them directly with curl or any other HTTP client. Here, though, let's go straight to using Elasticsearch from Python.
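
For example, the root endpoint we opened in the browser earlier can also be fetched programmatically. Here is a quick sketch using the requests library (my own choice of HTTP client; any other would do), against a local node on port 9200:

[Python]

import requests

# Fetch basic node information from the root REST endpoint
response = requests.get('http://localhost:9200/')
print(response.json())  # the same JSON shown in the browser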

Python has a client library with the same name, elasticsearch, which can be installed with pip:

[Python]

pip3 install elasticsearch

The official documentation is available at https://elasticsearch-py.readthedocs.io/. All usage can be found there, and the rest of this article is based on it.



Create Index

Let’s look at how to create an Index. Here we create an Index named news:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.create(index='news', ignore=400)
print(result)

If the creation is successful, the following information is displayed:

[Python]

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The result is returned in JSON (dictionary) form, and the acknowledged field being True indicates that the creation operation succeeded.

But if we execute the code again, the following result is returned:

[Python]

{'error': {'root_cause': [{'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}], 'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}, 'status': 400}

This indicates that creation failed with status 400, because the Index already exists.

Note that the code passes ignore=400, which means that if a 400 status code is returned, the error is ignored and no exception is thrown.

If we omit the ignore argument:

[Python]

es = Elasticsearch()
result = es.indices.create(index='news')
print(result)

An error is reported when executed again:

[Python]

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists')

Such an exception would interrupt the program, so we use the ignore parameter to tolerate these expected failure cases and let the program keep running.
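
If you prefer not to swallow errors wholesale, an alternative sketch (my own variation, not from the original article) is to check whether the Index exists before creating it:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Only create the index when it is not already present
if not es.indices.exists(index='news'):
    print(es.indices.create(index='news'))
else:
    print('index "news" already exists')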



Delete Index

Deleting an Index is just as simple:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.delete(index='news', ignore=[400, 404])
print(result)

Here, too, the ignore parameter is used so that the program does not break when the Index does not exist and the deletion fails.

If the deletion is successful, the following information is displayed:

[Python]

{'acknowledged': True}

If the Index has already been deleted (or never existed), the following output is displayed:


[Python]

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}, 'status': 404}

This result indicates that the Index does not exist, so the deletion failed. The result is again JSON, with a status code of 404, but because we passed the ignore parameter the 404 is swallowed: the program simply receives the JSON result instead of throwing an exception.



Insert data

Like MongoDB, Elasticsearch lets you insert structured dictionary data directly. The create() method is used to insert data:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index='news', ignore=400)

data = {
    'title': 'Did the United States leave Iraq a mess?',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'
}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

Here we declare a piece of news data containing a title and a link, then insert it by calling create(). We pass four parameters: index is the index name, doc_type is the document type, body is the content of the document, and id is the unique ID of the data.

The running results are as follows:

[Python]

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

The result field is 'created', indicating that the data was inserted successfully.

We can also insert data with index(). Unlike create(), which requires us to specify an id to uniquely identify the data, index() does not; if no id is given, one is generated automatically. Calling index() looks like this:

[Python]

es.index(index='news', doc_type='politics', body=data)

Internally, the create() method simply calls index(); it is a wrapper around index().
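
To double-check that the document really was stored, we can fetch it back by id with get() (a small verification step of my own, assuming the document with id 1 inserted above):

[Python]

# Retrieve the document we just inserted by its id
result = es.get(index='news', doc_type='politics', id=1)
print(result['_source'])  # the title/url dictionary we stored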



Update the data

Updating data is also very simple. We again need to specify the data's id and its new content, and call the update() method. The code is as follows:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {
    'title': 'Did the United States leave Iraq a mess?',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# The update API expects the partial document to be wrapped in a 'doc' key
result = es.update(index='news', doc_type='politics', id=1, body={'doc': data})
print(result)

Here we add a date field to the data and call the update() method, which results in the following:

[Python]

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

The result field is 'updated' and _version has been incremented to 2. Alternatively, the index() method can handle both cases for us: if a document with the given id does not exist it is inserted, and if it does exist it is overwritten, which is very convenient.
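
As a small sketch of that behaviour (my own example, not from the original article), re-indexing with an explicit id overwrites the existing document:

[Python]

# index() with an explicit id: creates the document if absent, overwrites it otherwise
result = es.index(index='news', doc_type='politics', id=1, body=data)
print(result['result'])  # 'created' on first run, 'updated' afterwards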



Delete the data

If you want to delete a piece of data, you can call the delete() method and specify the id of the data to be deleted.

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

The running results are as follows:

[Python]

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

The result field is 'deleted', indicating that the deletion succeeded, and _version has been incremented by 1 to 3.
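
After the deletion, fetching the same id raises an exception. A quick way to confirm this (a sketch of my own, using the client's NotFoundError):

[Python]

from elasticsearch.exceptions import NotFoundError

try:
    es.get(index='news', doc_type='politics', id=1)
except NotFoundError:
    print('document with id 1 no longer exists')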



Query data

The operations above are fairly basic and could just as well be done by an ordinary database such as MongoDB. What really makes Elasticsearch special is its extremely powerful retrieval capabilities.

For Chinese text we first need a word-segmentation (analysis) plugin. Here we use elasticsearch-analysis-ik, whose project page is:

https://github.com/medcl/elasticsearch-analysis-ik

We install it with the elasticsearch-plugin command. My Elasticsearch version is 6.2.4, so I install plugin version 6.2.4 to match:

[Python]

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Replace the version number here with the version number of your Elasticsearch.

Restart Elasticsearch after installation and it will automatically load the installed plug-ins.

First we create a new index and specify the field that needs word segmentation, as follows:

[Python]

from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

Here we first delete the old index, then create a new one, and then update its mapping. The mapping specifies the field to be segmented: its type is text, and both the analyzer and the search_analyzer are set to ik_max_word, i.e. the Chinese analysis plugin we just installed. If this were not specified, the default standard analyzer (geared towards English) would be used.
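
To see how ik_max_word actually segments Chinese text, you can call the analyze API on the new index (a sketch of my own; it assumes the plugin installed above and uses a sample phrase):

[Python]

# Ask Elasticsearch to segment a sample phrase with the IK analyzer
tokens = es.indices.analyze(index='news', body={
    'analyzer': 'ik_max_word',
    'text': '中国领事馆'
})
print([t['token'] for t in tokens['tokens']])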

Let’s insert some new data:

[Python]

datas = [
    {
        'title': 'Did the United States leave Iraq a mess?',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': 'Ministry of Public Security: School buses will enjoy highest right of way',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': 'South Korean police detain 1 Chinese fishing boat every day',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': 'Asian man suspected of shooting at Chinese Consulate in Los Angeles turns himself in',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]

for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Here we declare four pieces of data, each with title, url, and date fields, and insert them into Elasticsearch with index(), using the index name news and the type politics.

Next, let’s query the relevant content according to the keywords:

[Python]

result = es.search(index='news', doc_type='politics')
print(result)

You can see that all four inserted records are retrieved:

[Python]

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "Did the United States leave Iraq a mess?",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "Asian man suspected of shooting at Chinese Consulate in Los Angeles turns himself in",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "South Korean police detain 1 Chinese fishing boat every day",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "Ministry of Public Security: School buses will enjoy highest right of way",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

You can see that the matched documents are returned in the hits field, with total indicating the number of results and max_score the maximum match score.
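
Since the response is just a Python dictionary, pulling out the matched documents is straightforward (a small sketch of my own, iterating over the hits structure shown above):

[Python]

# Print the score and title of every matched document
for hit in result['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])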

We can also perform full-text search, which is where Elasticsearch really shines:

[Python]

import json
from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': 'Chinese Consulate'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

Here we use a DSL statement with a match query, passing the keyword "Chinese Consulate" as the condition on the title field. The result is as follows:

[Python]

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "Asian man suspected of shooting at Chinese Consulate in Los Angeles turns himself in",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "South Korean police detain 1 Chinese fishing boat every day",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

Here we get two matching results. The first has a score of 2.54 and the second 0.28. This is because the first document's title contains both "Chinese" and "Consulate", while the second contains only "Chinese" and not "Consulate"; it is still retrieved, but with a much lower score.

From this we can see that the specified field is searched in full text and the results are ranked by keyword relevance, which is already the rough prototype of a search engine.

Elasticsearch also supports many other query types; for details, see the Query DSL documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html
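
As one illustrative sketch (adapted by me from the Query DSL documentation, not part of the original article), a bool query can combine the match above with a date range filter:

[Python]

dsl = {
    'query': {
        'bool': {
            'must': [{'match': {'title': 'Chinese'}}],
            'filter': [{'range': {'date': {'gte': '2011-12-17'}}}]
        }
    }
}
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))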

This concludes the basic introduction to Elasticsearch and the basics of operating it from Python. These are only Elasticsearch's most basic features; there are many more powerful capabilities waiting to be explored.



Source:

https://cuiqingcai.com/6214.html
