Recently my team was assigned a task. Our project relies on full-text search, which is built on Solr, but the SolrCloud deployment is unstable: queries often return no data, and synchronization then has to be completed manually. The cluster is also maintained by another team, so our dependence on it is very strong. Whenever the Solr service has a problem, our project is basically unusable, because every query that depends on it returns no results. So we considered developing an adaptation layer that automatically switches to another search engine, Elasticsearch (ES), when a Solr search fails.
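The adaptation layer described above can be sketched roughly as follows. This is only a minimal illustration of the fallback idea, not the layer we actually built: `FallbackSearch`, `broken_solr` and `fake_es` are hypothetical names, and the two backends are plain functions standing in for real Solr and ES clients.

```python
from typing import Callable, List

# A search backend is modelled as any callable that takes a query string and
# returns a list of hits. Real Solr/ES clients would be wrapped to fit this.
SearchFn = Callable[[str], List[str]]


class FallbackSearch:
    """Try the primary backend first; use the secondary if it fails or is empty."""

    def __init__(self, primary: SearchFn, secondary: SearchFn) -> None:
        self.primary = primary
        self.secondary = secondary

    def search(self, query: str) -> List[str]:
        try:
            hits = self.primary(query)
            if hits:
                return hits
        except Exception:
            # Any primary failure (timeout, connection error, ...) triggers the fallback.
            pass
        return self.secondary(query)


# Hypothetical stand-ins for the two services.
def broken_solr(query: str) -> List[str]:
    raise ConnectionError("Solr unavailable")


def fake_es(query: str) -> List[str]:
    return [f"es-hit-for-{query}"]


adapter = FallbackSearch(broken_solr, fake_es)
print(adapter.search("RNG"))
```

Note that this sketch also falls back when the primary returns an empty result, which matches our symptom of queries coming back with no data, but it means a query that legitimately has no hits is sent to both backends.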

This problem could also be solved by improving the Solr cluster design or adding fault tolerance to the service. But setting aside whether my own design is reasonable, my leader wanted it developed, so I started setting up an ES service from scratch. Since I had never touched ES before, I am recording my development process in this series.

What is full-text search

What is a full-text search engine?

Baidu Encyclopedia defines it as follows: the full-text search engine is the mainstream type of search engine in wide use today. It works by having an indexing program scan every word in a document and build an index entry for each word, recording how many times and where the word appears in the document. When a user submits a query, the retrieval program searches the index built in advance and returns the results to the user. The process is similar to looking up a word through the index table of a dictionary.

This definition already gives a rough idea of full-text search. For a more detailed explanation, let’s start with the data we encounter in everyday life.

There are two kinds of data in our lives: structured data and unstructured data.

  • Structured data: data with a fixed format or limited length, such as databases and metadata.
  • Unstructured data: also known as full-text data, this is data of indefinite length or with no fixed format, such as emails and Word documents.

There is also a third kind, semi-structured data, such as XML and HTML. It can be processed as structured data when needed, or its plain text can be extracted and processed as unstructured data.

Matching these two kinds of data, search is also divided into two kinds: structured data search and unstructured data search.

Structured data can generally be stored and searched in the tables of a relational database (MySQL, Oracle, etc.), where indexes can also be built. For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text retrieval.

Sequential scan: as the name suggests, the content is scanned in order to find a specific keyword. For example, suppose you are given a newspaper and asked to find every place where the word “RNG” appears. You have to read the newspaper from front to back and mark each place where the keyword appears.

This method is undoubtedly the most time-consuming and least efficient. If the newspaper is set in small type and has many sections, or there are several newspapers, your eyes will be worn out by the time you finish scanning.
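A sequential scan like the one just described can be sketched in a few lines. This is only an illustration: the `sequential_scan` function and the sample newspaper text are made up for the example. The point is that every page has to be read in full for every query.

```python
from typing import Dict, List, Tuple


def sequential_scan(pages: Dict[str, str], keyword: str) -> List[Tuple[str, int]]:
    """Return (page name, character offset) for every occurrence of keyword."""
    hits = []
    for name, text in pages.items():
        # Walk through the whole text, jumping from one occurrence to the next.
        start = text.find(keyword)
        while start != -1:
            hits.append((name, start))
            start = text.find(keyword, start + 1)
    return hits


# Hypothetical newspaper content.
newspaper = {
    "front page": "RNG wins the opener. Fans cheer for RNG.",
    "sports": "EDG prepares for the next match.",
}
print(sequential_scan(newspaper, "RNG"))
```

The cost grows with the total amount of text, no matter how rare the keyword is.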

Full-text search: sequential scanning of unstructured data is slow. Can we optimize it? Why not give our unstructured data some structure? Part of the information in the unstructured data is extracted and reorganized so that it has a certain structure, and searching is then done against this structured data, which makes the search relatively fast. This is the basic idea of full-text retrieval. The information that is extracted from the unstructured data and then reorganized is called an index.

Take reading newspapers as an example again. Suppose we want to follow the news about the recent League of Legends S8 global finals, and we are all RNG fans. How do we quickly find the newspapers and sections that carry RNG news? The full-text search approach is to extract keywords from every section of every newspaper, such as “EDG”, “RNG”, “FW”, “Rangers” and “League of Legends”, and then build an index on these keywords. Through the index we can look up which newspapers and sections a keyword appears in. Note that this is different from a directory search engine.
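The newspaper example can be sketched as a small keyword index. This is an illustrative sketch under my own assumptions: the newspaper names, section names and article texts are invented, and `build_keyword_index` is a hypothetical helper, not part of any search engine.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Location = Tuple[str, str]  # (newspaper, section)


def build_keyword_index(
    articles: Dict[Location, str], keywords: List[str]
) -> Dict[str, Set[Location]]:
    """Map each keyword to the set of (newspaper, section) pairs that mention it."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for location, text in articles.items():
        for keyword in keywords:
            if keyword in text:
                index[keyword].add(location)
    return index


# Hypothetical newspapers and sections.
articles = {
    ("Daily Esports", "front page"): "RNG advances at the League of Legends finals",
    ("Daily Esports", "analysis"): "EDG and FW draft review",
    ("Game Weekly", "news"): "RNG roster update",
}
index = build_keyword_index(articles, ["EDG", "RNG", "FW", "League of Legends"])
print(sorted(index["RNG"]))
```

The text is read once while the index is built. After that, finding the newspapers and sections for “RNG” is a single dictionary lookup instead of a new scan.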

Why use a full-text search engine

A colleague once asked me why we need a search engine at all. All our data is in the database, and Oracle, SQL Server and other databases also provide query, retrieval and cluster analysis functions, so why not just query the database directly? Indeed, most of our query needs can be met by database queries. If a query is slow, we can improve it by building database indexes, optimizing the SQL, or even introducing a cache to return data faster. If the data volume grows further, we can split databases and tables to spread the query load.

So why use a full-text search engine? The main reasons are as follows:

  • Data type: full-text index search supports unstructured data and is better at quickly finding any word or phrase within large volumes of unstructured text. For example, Google and Baidu build their indexes from the keywords in web pages. When we enter a keyword, they return all the pages that the index matches for it. Application log search is another common case in everyday projects. Relational databases do not support searching this kind of unstructured text well.

  • Index maintenance: traditional databases are generally weak at full-text retrieval, because people rarely use a database to store large text fields. Full-text retrieval there requires scanning the entire table, and if the data volume is large, even optimizing the SQL has little effect. Indexes can be built, but they are cumbersome to maintain, because the index has to be rebuilt on INSERT and UPDATE operations.
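The full-table-scan point can be checked with a small experiment. As a stand-in for the databases mentioned above, I use SQLite from the Python standard library (my substitution, not something our project uses), with a made-up `articles` table and `idx_body` index. The sketch asks SQLite for its query plan for an exact match and for a substring match on the same indexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_body ON articles (body)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("RNG wins the opener",), ("EDG prepares for the next match",)],
)


def plan(sql: str) -> str:
    """Return the query-plan description SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows[0][-1]  # the last column holds the human-readable detail


exact_plan = plan("SELECT id FROM articles WHERE body = 'RNG wins the opener'")
substring_plan = plan("SELECT id FROM articles WHERE body LIKE '%RNG%'")
print(exact_plan)
print(substring_plan)
```

On the SQLite builds I am aware of, the exact match is reported as a SEARCH that uses the index, while the `LIKE '%RNG%'` query is reported as a SCAN over every row, even though an index exists on the column. Other databases word their plans differently, but the underlying limitation is similar.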

When to use a full-text search engine:

  1. The data to be searched is a large amount of unstructured text.
  2. The number of records is in the hundreds of thousands, millions, or more.
  3. A large number of interactive, text-based queries must be supported.
  4. Very flexible full-text search queries are needed.
  5. There is a particular need for highly relevant search results that a relational database cannot meet.
  6. There is relatively little need for different record types, non-text data operations, or secure transaction processing.

Lucene, Solr, Elasticsearch?

The main search engines today are Lucene, Solr and Elasticsearch.

All of them build their indexes as inverted indexes. So what is an inverted index?

An inverted index, also known as a postings file or inverted file, is an index structure used in full-text search to store a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.
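To make the “word to locations” mapping concrete, here is a minimal sketch of a positional inverted index and a phrase query on top of it. It is a teaching toy under my own assumptions (whitespace tokenization, lowercase terms, made-up documents) and is not how Lucene stores its index internally.

```python
from collections import defaultdict
from typing import Dict, List

# term -> {doc_id -> [positions of the term in that document]}
PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, str]) -> PositionalIndex:
    index: PositionalIndex = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index: PositionalIndex, phrase: str) -> List[int]:
    """Return ids of documents where the phrase terms appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    matches = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every following term must sit exactly one position further on.
            if all(
                start + offset in index[term].get(doc_id, [])
                for offset, term in enumerate(terms)
            ):
                matches.append(doc_id)
                break
    return sorted(matches)


docs = {
    1: "full text search with Lucene",
    2: "search the full newspaper text",
    3: "Lucene powers full text search",
}
index = build_positional_index(docs)
print(index["lucene"])
print(phrase_search(index, "full text"))
```

Storing positions as well as document ids is what makes phrase and proximity queries possible: document 2 contains both “full” and “text”, but not next to each other, so it does not match the phrase.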

Lucene

Lucene is a full-text search engine library written entirely in Java. It is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications.

Lucene provides powerful functionality through a simple API:

Scalable high-performance index

  • Over 150GB/hour on modern hardware
  • Small RAM requirements: only 1MB heap
  • Incremental indexing is just as fast as bulk indexing
  • The index size is about 20-30% of the size of the indexed text

Powerful, accurate and efficient search algorithms

  • Ranked searching: the best results are returned first
  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.
  • Fielded searching (e.g. title, author, contents)
  • Sorting by any field
  • Multiple-index searching with merged results
  • Simultaneous update and searching
  • Flexible faceting, highlighting, joins and result grouping
  • Fast, memory-efficient and typo-tolerant suggesters
  • Pluggable ranking models, including the vector space model and Okapi BM25
  • Configurable storage engine (codecs)
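Of the ranking models named in the list above, Okapi BM25 is easy to sketch. The function below follows the textbook form of the formula with commonly used parameter values (k1 = 1.2, b = 0.75). It is my own illustration over a made-up, pre-tokenized corpus, and it is not Lucene's actual implementation, which may differ in details.

```python
import math
from typing import List


def bm25_score(
    query_terms: List[str],
    doc: List[str],
    corpus: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one tokenized document against a query with textbook Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        doc_freq = sum(1 for d in corpus if term in d)
        if doc_freq == 0:
            continue  # a term that appears nowhere contributes nothing
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        tf = doc.count(term)
        # Longer-than-average documents are penalized through b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical tokenized documents.
corpus = [
    ["rng", "wins", "the", "final"],
    ["edg", "loses", "the", "final"],
    ["rng", "rng", "roster", "news"],
]
scores = [bm25_score(["rng"], doc, corpus) for doc in corpus]
print(scores)
```

The document that mentions the term twice scores highest, but less than twice as high as the one that mentions it once: term frequency saturates, which is the main difference from a plain count.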

Cross-platform solutions

  • Provided as open source software under the Apache license, it allows you to use Lucene in both commercial and open source applications
  • 100%-pure Java
  • Implementations available in other programming languages are index compatible

The Apache Software Foundation provides support for the open source software projects in the Apache community, Lucene among them.

But Lucene is only a library. To take full advantage of it, you need to use Java and integrate Lucene into your application. It takes a lot of study to understand how it works, and becoming proficient with Lucene is genuinely complicated.

Solr

Apache Solr is an open source search platform built on a Java library called Lucene. It provides Apache Lucene’s search capabilities in a user-friendly manner. Having been an industry player for nearly a decade, it is a mature product with a strong and wide user community. It provides distributed indexing, replication, load-balancing queries, and automatic failover and recovery. If it is properly deployed and managed, it can become a highly reliable, scalable, and fault-tolerant search engine. Many Internet giants such as Netflix, eBay, Instagram and Amazon (CloudSearch) use Solr because of its ability to index and search multiple sites.

The list of main features includes:

  • Full-text search
  • Hit highlighting
  • Faceted search
  • Real-time indexing
  • Dynamic clustering
  • Database integration
  • NoSQL features and rich document handling (e.g. Word and PDF files)

Elasticsearch

Elasticsearch is an open source (Apache 2 license) RESTful search engine built on the Apache Lucene library.

Elasticsearch was released a few years after Solr. It provides a distributed, multi-tenant capable full-text search engine with an HTTP web interface (REST) and schema-free JSON documents. Official client libraries are available for Java, Groovy, PHP, Ruby, Perl, Python, .NET and JavaScript.
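To give a feel for the HTTP and JSON interface, the sketch below builds a search request with the Python standard library, without sending it. The `_search` endpoint and the `match` query are part of the Elasticsearch query DSL as far as I know. The node address, the index name `articles` and the field name `title` are assumptions made up for the example.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, index: str, field: str, text: str
) -> urllib.request.Request:
    """Build (but do not send) a JSON search request using a `match` query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local node, index name and field name.
request = build_search_request("http://localhost:9200", "articles", "title", "RNG finals")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node. Any language with an HTTP client and a JSON library can talk to the service in the same way, which is why client libraries exist for so many languages.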

Because it is a distributed search engine, an index can be divided into shards, and each shard can have multiple replicas. Each Elasticsearch node can hold one or more shards, and a node can also act as a coordinator that delegates operations to the correct shard or shards.
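The idea of shards plus a coordinator can be sketched with plain dictionaries. This is a simplified model under my own assumptions: the `Coordinator` class is hypothetical, CRC32 is only a stand-in for whatever hash function the real engine uses, and replicas are not modelled at all.

```python
import zlib
from typing import Dict, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard by hashing the document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class Coordinator:
    """Routes writes to one shard and fans searches out to all shards."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[str, str]] = [{} for _ in range(num_shards)]

    def index(self, doc_id: str, text: str) -> int:
        shard = shard_for(doc_id, len(self.shards))
        self.shards[shard][doc_id] = text
        return shard

    def get(self, doc_id: str) -> str:
        # A lookup by id only needs to contact the shard that owns the id.
        return self.shards[shard_for(doc_id, len(self.shards))][doc_id]

    def search(self, keyword: str) -> List[str]:
        # A search has to ask every shard and merge the partial results.
        return sorted(
            doc_id
            for shard in self.shards
            for doc_id, text in shard.items()
            if keyword in text
        )


cluster = Coordinator(num_shards=3)
cluster.index("doc-1", "RNG wins")
cluster.index("doc-2", "EDG loses")
cluster.index("doc-3", "RNG roster")
print(cluster.search("RNG"))
```

Because the shard is derived from the document id and the shard count, changing the number of shards in this model would move documents around, which hints at why the shard count matters when an index is created.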

Elasticsearch is scalable and provides near real-time search. One of its main features is multi-tenancy.

The list of main features includes:

  • Distributed search
  • Multi-tenancy
  • Analytical search
  • Grouping and aggregation

Choosing between Elasticsearch and Solr

Because of its complexity, Lucene is rarely the first choice for search, except for companies that need to develop their own search framework on top of it. So here we focus on Elasticsearch and Solr.

Elasticsearch vs. Solr. Which is better? How are they different? Which one should you use?

Historical comparison

Apache Solr is a mature project with a large and active developer and user community, as well as the Apache brand. First released as open source in 2006, Solr long dominated the search engine space and was the engine of choice for anyone who needed search capabilities. Its maturity translates into rich functionality beyond simple text indexing and searching, such as faceting, grouping, powerful filtering, pluggable document processing, pluggable search chain components and language detection.

Solr dominated search for years. Then, around 2010, Elasticsearch appeared on the market as an alternative. Back then it was not nearly as stable as Solr, and it did not have Solr’s depth of functionality, mindshare or brand.

Elasticsearch is young, but it has advantages of its own. It was built on more modern principles for more modern use cases, and it was designed to make large indexes and high query rates easier to handle. Also, because it was so young and had no established community to work with, it was free to move forward without needing consensus or cooperation from others (users or developers), backward compatibility, or any of the other things more mature software usually has to deal with.

As a result, it exposed some very popular features (for example, near real-time search, or NRT) before Solr did. Technically, the NRT capability comes from Lucene, the underlying search library used by both Solr and Elasticsearch. Ironically, because Elasticsearch exposed NRT search first, people associate NRT search with Elasticsearch, even though Solr and Lucene are part of the same Apache project and one would therefore expect Solr to have been the first to offer such a highly demanded feature.

Comparison of characteristic differences

Both are popular, advanced open source search engines. Both are built around the same core search library, Lucene, but they are different. Like everything, each has its pros and cons, and depending on your needs and expectations either can be the better or worse fit. Solr and Elasticsearch are both evolving quickly, so without further ado, here is a list of their differences:

Feature | Solr/SolrCloud | Elasticsearch
Community and developers | Apache Software Foundation and community support | A single commercial entity and its employees
Node discovery | Apache ZooKeeper, mature and battle-tested in a large number of projects | Zen, built into Elasticsearch itself; requires dedicated master nodes for split-brain protection
Shard placement | Essentially static, manual work is required to migrate shards; starting with Solr 7, the Autoscaling API allows some dynamic operations | Dynamic, shards can be moved on demand depending on the cluster state
Caches | Global, invalidated with each segment change | Per segment, better suited to dynamically changing data
Analytics engine performance | Great for static data with exact calculations | Exactness of the results depends on data placement
Full-text search features | Lucene-based language analysis, multiple suggesters, spell checkers, rich highlighting support | Lucene-based language analysis, single suggest API implementation, highlighting rescoring
DevOps friendliness | Not fully there yet, but coming | Very good APIs
Non-flat data handling | Nested documents and parent-child support | Natural support through nested and object types, allowing almost unlimited nesting, plus parent-child support
Query DSL | JSON (limited), XML (limited), or URL parameters | JSON
Index/collection leader control | Leader placement control and leader rebalancing to even out the load on the nodes | Not possible
Machine learning | Built in, on top of streaming aggregations, focused on logistic regression, plus a learning-to-rank contrib module | Commercial feature, focused on anomalies and outliers and on time-series data


Comprehensive comparison

In addition, we can compare them in the following respects:

  • Let’s look at the Google search trends for the two products. Google Trends shows that Elasticsearch attracts far more interest than Solr, but that does not mean Apache Solr is dead. Although some may disagree, Solr is still one of the most popular search engines, with a strong community and open source support.

  • Compared with Solr, Elasticsearch is lightweight and easy to install; you can have it up and running in minutes. However, this ease of deployment and use can become a problem if Elasticsearch is not managed properly. JSON-based configuration is simple, but if you want to add a comment for every setting in the file, it is not for you. In general, Elasticsearch is the better choice if your application already uses JSON. Otherwise, use Solr, because its schema.xml and solrconfig.xml are well documented.

  • Community: Solr has a larger, more mature community of users, developers and contributors. ES has a smaller but active user community and a growing community of contributors. Solr is community-driven open source in the true sense: anyone can contribute to Solr, and new Solr developers (known as committers) are chosen on merit. Elasticsearch is technically open source, but less so in spirit. Anyone can see the source, and anyone can change it and offer a contribution, but only employees of the company behind Elasticsearch can actually commit changes to it. Solr contributors and committers come from many different organizations, while Elasticsearch committers come from a single company.

  • Maturity: Solr is more mature, but ES is growing fast and I consider it stable.

  • Documentation: Solr scores highly here. It is a very well-documented product with clear examples and API use cases. The Elasticsearch documentation is well organized, but it lacks good examples and clear configuration instructions.

Conclusion

Solr or Elasticsearch? Sometimes it is hard to find a clear answer. Whichever you choose, you first need to understand your use cases and future requirements. The points below summarize the characteristics of each.

Remember:

  • Elasticsearch is more popular among new developers due to its ease of use. However, if you’re used to working with Solr, stick with it, as there are no specific advantages to moving to Elasticsearch.

  • Elasticsearch is a better choice if you need to handle analytical queries in addition to text search.

  • Choose Elasticsearch if you need distributed indexing. It is the better choice for cloud and distributed environments that require good scalability and performance.

  • Both have good commercial support (consulting, production support, integration, etc.).

  • Both have good operational tooling, although Elasticsearch appeals more to the DevOps crowd because of its easy-to-use API, so a livelier tool ecosystem has grown up around it.

  • Elasticsearch dominates the open source log management use case, with many organizations indexing their logs in Elasticsearch to make them searchable. Solr can now be used for this purpose too, but it missed its chance to win that mindshare.

  • Solr is still more text-search oriented. Elasticsearch, on the other hand, is often used for filtering and grouping, that is, analytical query workloads, and not necessarily for text search. Elasticsearch developers have put a lot of effort into both the Lucene and Elasticsearch levels to make this type of query more efficient (lower memory footprint and CPU usage). Elasticsearch is therefore the better choice for applications that need not only text search but also complex search-time aggregations.

  • Elasticsearch is easier to get started with: a single download and a single command start everything. Solr has traditionally required more work and knowledge, but it has recently made great strides in removing that hurdle and now mostly has to work on changing its reputation.

  • In terms of performance, they are roughly the same. I say “roughly” because no one has done a comprehensive and unbiased benchmark. For 95% of use cases either choice will perform well, and for the remaining 5% you need to test both solutions with your specific data and access patterns.

  • Operationally, Elasticsearch is relatively simple to run: it has only one process. Solr, in SolrCloud, its fully distributed deployment mode comparable to Elasticsearch, depends on Apache ZooKeeper. ZooKeeper is very mature and very widely used, but it is still another moving part. That said, if you are using Hadoop, HBase, Spark, Kafka, or other newer distributed software, you are probably already running ZooKeeper somewhere in your organization.

  • Although Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper does a better job of preventing the dreaded split-brain problem that sometimes occurs in Elasticsearch clusters. To be fair, the Elasticsearch developers are aware of this issue and are working on improving this aspect of Elasticsearch.

  • If you like monitoring and metrics, use Elasticsearch and you will be in heaven. It exposes more metrics than the number of people you can squeeze into Times Square on New Year’s Eve! Solr exposes the key metrics, but not nearly as many as Elasticsearch.
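The split-brain problem mentioned in the list above is usually avoided with a majority quorum: a group of nodes may elect a master only if it can see more than half of the master-eligible nodes. The sketch below shows just that arithmetic. The function names are hypothetical, and this is the general technique rather than the actual code of ZooKeeper or Zen.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it sees a majority of nodes."""
    return visible_nodes >= quorum(master_eligible_nodes)


# A 3-node cluster splits into partitions of 2 nodes and 1 node.
print(can_elect_master(2, 3), can_elect_master(1, 3))
```

With three master-eligible nodes, only the two-node side of a network partition can elect a master, so two masters can never exist at the same time. With two nodes the quorum is also two, so neither side of a one-to-one split can proceed, which is why an odd number of master-eligible nodes is normally recommended.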

In short, both are feature-rich search engines that deliver more or less the same performance if they are designed and implemented properly. The overall content of this article is outlined in the following figure, carefully drawn and provided by fellow blogger ReyCG.


Personal public account: JaJian
