Table of Contents

  • Full-text Search Engine Elasticsearch Tutorial
    • 1. Installation
    • 2. Basic concepts
      • 2.1 Node and Cluster
      • 2.2 Index
      • 2.3 Document
      • 2.4 Type
    • 3. Creating and deleting an Index
    • 4. Chinese word segmentation setup
    • 5. Data operations
      • 5.1 Adding Records
      • 5.2 Viewing Records
      • 5.3 Deleting Records
      • 5.4 Updating Records
    • 6. Data queries
      • 6.1 Returning All Records
      • 6.2 Full-text Search
      • 6.3 Logical Operations
    • 7. Reference links
      • 1. Foreword
      • 2. Installation
      • 3. Creating an index
      • 4. Search intervention
      • 5. Chinese word segmentation
      • 6. Summary
      • 7. Appendix
  • Search engine selection: Elasticsearch vs Solr
    • Elasticsearch introduction
    • Advantages and disadvantages of Elasticsearch
      • Advantages
      • Disadvantages
    • Solr introduction
    • Advantages and disadvantages of Solr
      • Advantages
      • Disadvantages
    • Elasticsearch compared with Solr
    • Actual production environment test
    • Elasticsearch vs. Solr
    • Other open source search engine solutions based on Lucene

Full-text Search Engine Elasticsearch Tutorial

Author: Ruan Yifeng

Original link: www.ruanyifeng.com


Full-text search is one of the most common requirements, and the open source Elasticsearch (Elastic for short) is currently the leading full-text search engine.

It can quickly store, search, and analyze huge volumes of data. Wikipedia, Stack Overflow, and GitHub all use it.

Under the hood, Elastic is built on the open source library Lucene. However, you cannot use Lucene directly; you have to write your own code to call its interfaces. Elastic is a wrapper around Lucene that provides REST APIs out of the box.

This article shows how to build your own full-text search engine with Elastic from scratch. Each step comes with detailed instructions so that you can follow along.

1. Installation

Elastic requires a Java 8 environment. If Java is not installed on your machine, refer to this article, and make sure the environment variable JAVA_HOME is set correctly.
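
If you are unsure whether the environment is ready, a quick check along these lines confirms the Java version and JAVA_HOME (the path shown is only an example):

 $ java -version        # should report a 1.8.x version
 $ echo $JAVA_HOME      # should print the JDK directory, e.g. /usr/lib/jvm/java-8-openjdk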

Once Java is installed, you can install Elastic by following the official documentation. The easiest way is to download the compressed package directly.

 $ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.zip
 $ unzip elasticsearch-5.5.1.zip
 $ cd elasticsearch-5.5.1/

Next, go to the unzipped directory and run the following command to start Elastic.

 $ ./bin/elasticsearch

If you get the error "max virtual memory areas vm.max_map_count [65530] is too low", run the following command.

 $ sudo sysctl -w vm.max_map_count=262144

If everything works, Elastic will run on its default port 9200. Open another command-line window and request that port to get a description of the node.

$ curl localhost:9200

{
  "name" : "atntrTf",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Tf9250XhQ6ee4h7YI11anA",
  "version" : {
    "number" : "5.5.1",
    "build_hash" : "19c13d0",
    "build_date" : "2017-07-18T20:44:24.823Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}

In the code above, requesting port 9200 makes Elastic return a JSON object with information about the current node, cluster, version, and so on.

Press Ctrl + C to stop Elastic.

By default, Elastic only allows local access. If remote access is required, modify the config/elasticsearch.yml file in the Elastic installation directory, uncomment network.host, change its value to 0.0.0.0, and then restart Elastic.

network.host: 0.0.0.0

In the line above, setting the value to 0.0.0.0 allows anyone to access the node. Do not do this for production services; set it to a specific IP address instead.

2. Basic concepts

2.1 Node and Cluster

Elastic is essentially a distributed database that allows multiple servers to work together; each server can run multiple Elastic instances.

A single Elastic instance is called a node. A group of nodes forms a cluster.

2.2 Index

Elastic indexes all fields and, after processing, writes an Inverted Index. When searching for data, it looks up this index directly.

The top-level unit of data management in Elastic is therefore called an Index. It is the equivalent of a single database. The name of each Index (that is, each database) must be lowercase.

The following command displays all indexes of the current node.

$ curl -X GET 'http://localhost:9200/_cat/indices?v'

2.3 Document

A single record inside an Index is called a Document. Many Documents together form an Index.

A Document is represented in JSON format; here is an example.

{
  "user": "Zhang San",
  "title": "Engineer",
  "desc": "Database management"
}

Documents in the same Index are not required to have the same structure (schema), but keeping them the same helps search efficiency.

2.4 Type

Documents can be grouped. For example, in a weather Index they could be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy days). Such a group is called a Type; it is a virtual, logical grouping used to filter Documents.

Different Types should have similar schemas. For example, an id field cannot be a string in one group and a number in another. This is one difference from tables in a relational database. Data of completely different natures (such as products and logs) should be stored as two Indexes rather than as two Types inside one Index (although that is possible).

The following command lists the types contained in each Index.

$ curl 'localhost:9200/_mapping?pretty=true'

According to the roadmap, Elastic 6.x will allow only one Type per Index, and later versions will remove Type entirely.

3. Creating and deleting an Index

To create an Index, send a PUT request to the Elastic server. The following example creates a new Index named weather.

 $ curl -X PUT 'localhost:9200/weather'

The server returns a JSON object in which the acknowledged field indicates that the operation succeeded.

 {  "acknowledged":true,  "shards_acknowledged":true}

We can then delete this Index by issuing a DELETE request.

 $ curl -X DELETE 'localhost:9200/weather'
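
If the deletion succeeds, the server again returns an acknowledgement; the response typically looks like the following (shown here only for illustration).

{ "acknowledged" : true }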

4. Chinese word segmentation setup

First, install a Chinese word segmentation plug-in. The ik plug-in is used here, but other plug-ins (such as smartcn) can also be considered.

 $ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip

This command installs version 5.5.1 of the plug-in, which matches Elastic 5.5.1.

Then restart Elastic and the newly installed plug-in will automatically load.

Then, create an Index and specify the fields that need Chinese word segmentation. This step varies depending on the data structure; the command below applies to this article only. Basically, every Chinese field that needs to be searched should be set up this way.

$ curl -X PUT 'localhost:9200/accounts' -d '
{
  "mappings": {
    "person": {
      "properties": {
        "user": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}'

In the code above, we create an Index named accounts containing a Type named person. The person Type has three fields.

  • user
  • title
  • desc

All three fields hold Chinese text and are of type text, so a Chinese analyzer has to be specified instead of the default English one.

In Elastic, a word splitter is called an analyzer. We specify an analyzer for each field.

 "user": {  "type": "text",  "analyzer": "ik_max_word",  "search_analyzer": "ik_max_word"}
Copy the code

In the code above, analyzer is the analyzer used for the field text, and search_analyzer is the analyzer used for the search terms. The ik_max_word analyzer is provided by the ik plug-in and splits the text into as many words as possible.
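
To check how ik_max_word segments a piece of text, you can call Elastic's _analyze API; the sample phrase below (meaning "database management") is only an illustration.

$ curl -X GET 'localhost:9200/_analyze' -d '
{
  "analyzer": "ik_max_word",
  "text": "数据库管理"
}'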

5. Data operations

5.1 Adding Records

To add a record to an Index, send a PUT request to the corresponding /Index/Type path. For example, send the following request to /accounts/person to add a person record.

$ curl -X PUT 'localhost:9200/accounts/person/1' -d '
{
  "user": "Zhang San",
  "title": "Engineer",
  "desc": "Database management"
}'

The JSON object returned by the server provides information such as Index, Type, Id, and Version.

 {  "_index":"accounts",  "_type":"person",  "_id":"1",  "_version":1,  "result":"created",  "_shards":{"total":2,"successful":1,"failed":0},  "created":true}

If you look closely, you can see that the request path is /accounts/person/1, where the final 1 is the Id of the record. It does not have to be a number; it can be any string (such as abc).

You can also add a record without specifying an Id. In this case, you need to make a POST request.

$ curl -X POST 'localhost:9200/accounts/person' -d '
{
  "user": "Li Si",
  "title": "Engineer",
  "desc": "System management"
}'

In the code above, we make a POST request to /accounts/person to add a record. In the JSON object returned by the server, the _id field is a random string.

 {  "_index":"accounts",  "_type":"person",  "_id":"AV3qGfrC6jMbsbXb6k1p",  "_version":1,  "result":"created",  "_shards":{"total":2,"successful":1,"failed":0},  "created":true}

Note that if you run the command above without having created the Index first (accounts in this example), Elastic will not report an error; it will silently create the specified Index. So be careful not to misspell the Index name.

5.2 Viewing Records

You can view this record by making a GET request to /Index/Type/Id.

$ curl 'localhost:9200/accounts/person/1?pretty=true'

The code above requests the record /accounts/person/1; the URL parameter pretty=true means the response is returned in an easy-to-read format.

In the returned data, the found field indicates that the query succeeded, and the _source field contains the original record.

{ "_index" : "accounts", "_type" : "person", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "user" : "Title" : "engineer ", "desc" :" database management "}}Copy the code

If the Id is incorrect and the data is not found, the found field is false.

$ curl 'localhost:9200/accounts/person/abc?pretty=true'

{
  "_index" : "accounts",
  "_type" : "person",
  "_id" : "abc",
  "found" : false
}

5.3 Deleting Records

To delete a record, issue a DELETE request.

 $ curl -X DELETE 'localhost:9200/accounts/person/1'

Do not actually delete this record now; we will need it later.

5.4 Updating Records

To update a record, send the data again with a PUT request.

$ curl -X PUT 'localhost:9200/accounts/person/1' -d '
{
  "user": "Zhang San",
  "title": "Engineer",
  "desc": "Database management, software development"
}'

{
  "_index": "accounts",
  "_type": "person",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {"total": 2, "successful": 1, "failed": 0},
  "created": false
}

In the code above, we changed the original desc from "Database management" to "Database management, software development". Several fields in the returned result have changed.

 "_version" : 2,"result" : "updated","created" : false

As you can see, the record Id stays the same, but _version changes from 1 to 2, result changes from created to updated, and created changes to false, because this time it is not a new record.

6. Data queries

6.1 Returning All Records

Using the GET method, request /Index/Type/_search directly and all records will be returned.

$ curl 'localhost:9200/accounts/person/_search'

{
  "took": 2,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "accounts",
        "_type": "person",
        "_id": "AV3qGfrC6jMbsbXb6k1p",
        "_score": 1.0,
        "_source": {
          "user": "Li Si",
          "title": "Engineer",
          "desc": "System management"
        }
      },
      {
        "_index": "accounts",
        "_type": "person",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "user": "Zhang San",
          "title": "Engineer",
          "desc": "Database management, software development"
        }
      }
    ]
  }
}

In the result above, the took field indicates how long the operation took (in milliseconds), the timed_out field indicates whether the operation timed out, and the hits field contains the matching records. The meanings of its sub-fields are as follows.

  • total: Number of returned records, 2 in this example.
  • max_score: The highest degree of matching, 1.0 in this example.
  • hits: An array of returned records.

Each returned record has a _score field that indicates the degree of match; by default the results are sorted in descending order of this field.

6.2 Full-text Search

Elastic queries are somewhat unusual: they use their own query syntax and require a GET request with a data body.

$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": { "match": { "desc": "software" } }
}'

The code above uses a match query, with the condition that the desc field contains the word "software". The result is as follows.

{ "took":3, "timed_out":false, "_shards":{"total":5,"successful":5,"failed":0}, "hits":{ "total":1, "Max_score" : 0.28582606, "hits" : [{" _index ":" accounts ", "_type" : "person", "_id" : "1", "_score" : 0.28582606, "_source" : {" user ", "zhang", "title" : "engineer", "desc" : "database management, software development"}}}}]Copy the code

By default Elastic returns 10 results at a time, which can be changed with the size field.

$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": { "match": { "desc": "management" } },
  "size": 1
}'

The code above specifies that only one result is returned at a time.

You can also specify the offset with the from field.

$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": { "match": { "desc": "management" } },
  "from": 1,
  "size": 1
}'

The code above specifies that, starting from position 1 (the default start is position 0), only one result is returned.

6.3 Logical Operations

If there are multiple search keywords, Elastic treats them as an OR relationship.

$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": { "match": { "desc": "software system" } }
}'

The code above searches for "software" OR "system".

If you want to perform an AND search for multiple keywords, you must use a Boolean query.

$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "desc": "software" } },
        { "match": { "desc": "system" } }
      ]
    }
  }
}'

7. Reference links

  • Elasticsearch official manual
  • A Practical Introduction to Elasticsearch

(End)

1. Foreword

When developing a website or app, you often need to build a search service. For example, a news app needs to search headlines and content, and a community app needs to search users and posts.

For simple requirements, you can use the database's LIKE fuzzy query, for example:

SELECT * FROM news WHERE title LIKE '%Ferrari sports car%'

This finds all headlines containing the keyword "Ferrari sports car", but the approach has obvious drawbacks:

1. Fuzzy query performance is very low; when the data volume is huge, it can easily bring the database service down.

2. It cannot find related data; it can only strictly match the keyword within the title.

Therefore, it is necessary to build a dedicated search service with advanced functions such as word segmentation and full-text search. Solr is one such search engine that lets you quickly build a search service suitable for your business.

2. Installation

Download the installation package from lucene.apache.org/solr/, decompress it, and go to the Solr directory:

wget 'apache.website-solution.net/lucene/solr…'

tar xvf solr-6.2.0.tgz

cd solr-6.2.0

The directory structure is as follows:

[Figure: Solr 6.2 directory structure]

Before starting the Solr service, verify that Java 1.8 is installed:

[Figure: checking the Java version]

Start the Solr service:

./bin/solr start -m 1g

Solr listens on port 8983 by default; the -m 1g option allocates 1 GB of memory to the JVM.

Access the Solr Admin Background in a browser:

http://127.0.0.1:8983/solr/#/

[Figure: Solr admin console]

Create a Solr application:

./bin/solr create -c my_news

This generates a my_news folder under the solr-6.2.0/server/solr directory, with the following structure:

[Figure: my_news directory structure]

At the same time, you can see my_news in the admin console:

[Figure: my_news core in the admin console]

3. Creating an index

We will import data from the MySQL database into Solr and index it.

First, you need to understand two concepts in Solr: field and fieldType, as shown in the following configuration example:

[Figure: schema.xml example]

field specifies a field's name, whether it is indexed/stored, and its field type.

fieldType specifies the name of a field type and the word segmentation plug-ins that may be used when querying/indexing.

Rename the default managed-schema configuration file in solr-6.2.0/server/solr/my_news/conf to schema.xml and add a new fieldType:

[Figure: text_ik fieldType configuration]

Create a lib directory under my_news and copy ik-analyzer-solr5-5.x.jar into it, as follows:

[Figure: my_news directory structure with the lib directory]

Restart the service in the Solr installation directory:

./bin/solr restart

You can see the new type in the admin console:

[Figure: text_ik type in the admin console]

Next, create the fields title and content, corresponding to our database columns, with type text_ik:

[Figure: creating the title field]

Next, configure the import of data from MySQL.

Edit the conf/solrconfig.xml file and add the class library and data import configuration:

[Figure: class library configuration]

[Figure: dataimport handler configuration]

Create a new database connection configuration file conf/db-mysql-config.xml with the following contents:

[Figure: database connection configuration file]

Also copy the MySQL JDBC driver mysql-connector-java-5.1.39-bin.jar into the lib directory.

Run a full data import:
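
The full import can be triggered from the Dataimport page in the admin console or over HTTP. A sketch of the HTTP call, assuming the core is named my_news and the DataImportHandler is registered at /dataimport as configured above:

curl 'http://localhost:8983/solr/my_news/dataimport?command=full-import&clean=true&commit=true'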

Create a scheduled update script:

[Figure: periodic update script]

Add a scheduled task to update the index incrementally every 5 minutes:

[Figure: crontab entry for the scheduled task]
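
As a rough sketch (the exact update script used in the article is not reproduced here), such a crontab entry could call the DataImportHandler's delta-import command directly:

# run an incremental (delta) import every 5 minutes
*/5 * * * * curl -s 'http://localhost:8983/solr/my_news/dataimport?command=delta-import' > /dev/null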

Test the search results in the Solr admin console:

[Figure: search results with word segmentation]

At this point the basic search engine is built; external applications only need to send query parameters over HTTP to obtain search results.
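
For example, an application can query the core's select handler directly. The query below is only an illustration; it searches the title field created earlier and asks for a JSON response:

curl 'http://localhost:8983/solr/my_news/select?q=title:keyword&wt=json&rows=10'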

4. Search intervention

Human intervention in search results is often required, for example for editorial recommendations, paid ranking, or blocking certain results. Solr has the QueryElevationComponent plug-in built in; it reads the list of interventions for a search term from a configuration file and ranks the intervened documents ahead of the ordinary search results.

In the solrconfig.xml file, you can see:

[Figure: elevator search component and /elevate request handler configuration]

This defines a search component named elevator and applies it to the /elevate request handler. The intervention results are configured in elevate.xml, in the same directory as solrconfig.xml.

Restart Solr. Now, when searching for the configured keyword, documents with id 1 and 4 appear first, and the document with id 3 is excluded from the results.
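
To see the intervention take effect, send the query to the /elevate handler instead of /select; the query term below is just a placeholder:

curl 'http://localhost:8983/solr/my_news/elevate?q=keyword&wt=json'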

[Figure: results without intervention]

When there is a search intervention:

[Figure: results with intervention]

Intervening in search results through a configuration file is simple, but every update requires a Solr restart to take effect. You can develop your own intervention component modeled on the QueryElevationComponent class, for example one that reads the intervention configuration from Redis.

5. Chinese word segmentation

The quality of Chinese search depends heavily on word segmentation, which can be tested in the Solr admin console:

[Figure: word segmentation test results]

The example above shows the result of segmenting "University of Science and Technology Beijing" with the IKAnalyzer plug-in. When users search for "Beijing", "University of Science and Technology Beijing", "university of science and technology", "science and technology", or "university", documents containing "University of Science and Technology Beijing" in their text will be found.
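
The same test can also be run over HTTP with Solr's field analysis endpoint; a sketch, assuming the text_ik field type created earlier:

curl -G 'http://localhost:8983/solr/my_news/analysis/field' \
    --data-urlencode 'analysis.fieldtype=text_ik' \
    --data-urlencode 'analysis.fieldvalue=北京科技大学' \
    --data-urlencode 'wt=json'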

Commonly used Chinese word segmentation plug-ins include IKAnalyzer, mmseg4j, and Solr's built-in smartcn. Each has its own strengths and weaknesses; test them against your own business scenario and choose accordingly.

Word segmentation plug-ins generally have a default dictionary and an extended dictionary. The default dictionary contains the most commonly used Chinese words. If it does not meet your needs, for example in a specialized field, you can add words to the extended dictionary manually so that the plug-in can recognize the new words.

[Figure: example of an extended dictionary configuration for the word segmentation plug-in]

The word segmentation plug-in can also use a stop-word dictionary to remove meaningless filler words (such as "of") from the segmentation results, for example:

[Figure: stop-word configuration]

6. Summary

This has covered some of the most common features of Solr, but Solr itself has many other rich features, such as distributed deployment.

I hope it helps.

7. Appendix

1. Reference Materials:

wiki.apache.org/solr/

Lucene.apache.org/solr/quicks…

Cwiki.apache.org/confluence/…

2. All configuration files and Jar packages used in the Demo above:

Github.com/Ceelog/Open…

3. Questions? Contact the author on Weibo/WeChat: @ceelog

Search engine selection: Elasticsearch vs Solr

This post was originally posted on my blog link: Elasticsearch vs. Solr

Elasticsearch introduction

Elasticsearch is a real-time distributed search and analysis engine. It helps you process large-scale data faster than ever before.

It can be used for full-text search, structured search and analysis, or you can combine all three.

Elasticsearch is a search engine built on top of Apache Lucene(TM). Lucene is the most advanced and efficient full-featured open source search engine framework available today.

But Lucene is just a framework; to take full advantage of its capabilities, you need to use Java and integrate Lucene into your application. It takes a lot of learning to understand how it works, and Lucene is genuinely complicated.

Elasticsearch uses Lucene as its internal engine, but you only need to use a unified API to perform full text search. You don’t need to know how Lucene works.

However, Elasticsearch is more than just Lucene. It not only provides full-text search but also does the following:

  • Distributed real-time document storage, with every field indexed and searchable.

  • Distributed search engine for real-time analysis.

  • Scalable to hundreds of servers, processing petabytes of structured or unstructured data.

With so much functionality built into a single server, you can easily talk to Elasticsearch's RESTful APIs through a client or from whatever programming language you like.

Getting started with Elasticsearch is quite easy. It comes with many sensible defaults, which lets beginners avoid dealing with complex theory right away.

It works out of the box after installation, so you can become productive with very little learning cost.

As you master its more advanced features, you can configure the engine more flexibly; Elasticsearch can be customized to your own requirements.

Use cases:

  • Wikipedia uses Elasticsearch for full text searches and highlights keywords, as well as search-as-you-type and did-you-mean suggestions.

  • The Guardian uses Elasticsearch to process its visitor logs so that it can feed back to its editors how the public reacts to different articles in real time.

  • StackOverflow combines full-text search with geolocation and related information to provide a more-like-this representation of related questions.

  • GitHub uses Elasticsearch to retrieve over 130 billion lines of code.

  • Goldman Sachs uses it to process an index of five terabytes of data a day, and many investment banks use it to analyze stock market movements.

Elasticsearch isn’t just for large enterprises though, it’s also helped startups like DataDog and Klout expand their capabilities.

Advantages and disadvantages of Elasticsearch:

Advantages

  1. Elasticsearch is distributed. No other components are required; replication is real-time, in what is called "push replication".
  2. Elasticsearch fully supports Apache Lucene's near-real-time search.
  3. Handling multi-tenancy requires no special configuration, whereas Solr requires more advanced settings.
  4. Elasticsearch uses the Gateway concept, which makes full backups simpler.
  5. The nodes form a peer-to-peer network structure; when some nodes fail, other nodes automatically take over their work.

Disadvantages

  1. Only one developer (the Elasticsearch GitHub organization is no longer in this state; it now has quite active maintainers).
  2. Not automatic enough (not a good fit for the new Index Warmup API).

Solr introduction

Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich-text (such as Word and PDF) handling. Solr is highly scalable and provides distributed search and index replication. Solr is the most popular enterprise search engine, and Solr 4 also adds NoSQL support.

Solr is a standalone full-text search server written in Java that runs in a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library as the core of full-text indexing and searching, and has REST-like HTTP/XML and JSON APIs. Solr's powerful external configuration lets it be adapted to many types of applications without Java coding, and its plug-in architecture supports more advanced customization.

Because Apache Lucene and Apache Solr merged in 2010, both projects are developed and maintained by the same Apache Software Foundation team. When referring to the technology or product, Lucene/Solr and Solr/Lucene mean the same thing.

Advantages and disadvantages of Solr

Advantages

  1. Solr has a much larger and more mature community of users, developers, and contributors.
  2. It can index documents in many formats, such as HTML, PDF, Microsoft Office, JSON, XML, and CSV.
  3. Solr is relatively mature and stable.
  4. Searching is faster when indexes are not being built at the same time.

Disadvantages

  1. While indexes are being built, search efficiency drops, so real-time indexing and searching is not very efficient.

Elasticsearch compared with Solr

Solr is faster when simply searching through existing data.

[Figure: Search Fresh Index While Idle]

When indexes are being built in real time, Solr suffers from I/O blocking and its query performance degrades, while Elasticsearch shows a clear advantage.

[Figure: search_fresh_index_while_indexing]

Solr becomes less efficient as the amount of data increases, while Elasticsearch doesn’t change significantly.

[Figure: search_fresh_index_while_indexing]

In summary, Solr’s architecture is not suitable for real-time search applications.

Actual production environment test

The chart below shows roughly a 50-fold improvement in average query speed after switching from Solr to Elasticsearch.

[Figure: average_execution_time]

Elasticsearch vs. Solr

  • Both are simple to install;

  • Solr uses Zookeeper for distributed management, while Elasticsearch has built-in distributed coordination and management.

  • Solr supports more data formats, while Elasticsearch only supports JSON files.

  • Solr offers more features, while Elasticsearch itself focuses on core features, with advanced features provided by third-party plugins.

  • Solr performs better than Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch in real-time search applications.

Solr is a great solution for traditional search applications, but Elasticsearch is better suited for emerging real-time search applications.

Other open source search engine solutions based on Lucene

  1. Use Lucene directly

    Note: Lucene is a Java search class library; by itself it is not a complete solution and requires additional development work.

    Advantages: A mature solution with many success stories. It is an Apache top-level project that continues to progress rapidly. It has a large and active development community with plenty of developers. As a class library it leaves plenty of room for customization and optimization: simple customization can meet most common requirements, and with optimization it can support searches on the order of 1 billion documents or more.

    Disadvantages: Additional development effort is required. Extension, distribution, reliability, and so on all need to be implemented by yourself. It is not real-time; there is a time delay between indexing and search, and the scalability of the current Lucene near-real-time search scheme still needs improvement.

  2. Katta

    Note: A Lucene-based, distributed, scalable, fault-tolerant, near-real-time search solution.

    Advantages: Works out of the box and can cooperate with Hadoop to achieve distribution; it has extensibility and fault-tolerance mechanisms.

    Disadvantages: It is only a search solution; the index-building part still has to be implemented yourself. The search functionality covers only the most basic requirements. There are fewer success stories and the project's maturity is somewhat lower. Because it has to support distribution, customization for complex query requirements is difficult.

  3. Hadoop contrib/index

    Note: A Map/Reduce-based distributed index-building solution that can be used together with Katta.

    Advantages: Distributed index building; scalability.

    Disadvantages: It is only an index-building solution and does not include a search implementation. It works in batch mode, with poor support for real-time search.

  4. LinkedIn's open source solutions

    Description: A range of Lucene-based solutions, including the near-real-time search library Zoie, the faceted search implementation Bobo, the machine learning algorithm library Decomposer, the data store Krati, the database schema wrapper Sensei, and more.

    Advantages: Proven solutions that support distribution and scaling, with rich functionality.

    Disadvantages: Too tied to LinkedIn's own use and not very customizable.

  5. Lucandra

    Note: Based on Lucene, with indexes stored in the Cassandra database.

    Advantages: See Cassandra's advantages.

    Disadvantages: See Cassandra's disadvantages. Also, this is just a demo and has not been validated much.

  6. HBasene

    Note: Based on Lucene, with indexes stored in the HBase database.

    Advantages: See HBase's advantages.

    Disadvantages: See HBase's disadvantages. In addition, in this implementation Lucene terms are stored as rows, while the posting list for each term is stored in columns; as the posting list of a single term grows, query speed is greatly affected.