Have you seen Wreck-It Ralph 2? The way the movie gives every node of the Internet a concrete, physical form is wonderfully imaginative!




Every Internet user gets an avatar that travels from site to site in little transport pods, “stopping” at each one. A banner ad on a web page becomes a salesman holding up a sign and pitching to passers-by, and anyone interested follows him back to his humble little site. It’s fun to picture.




Of course, the most memorable part is the search engine, KnowsMore: a little guy behind a database counter who looks up directions whenever someone comes to ask a question and sends them on their way to the next stop.




The funniest detail is that every time you say a word, he guesses a flood of words and sentences before you can finish. It is a vivid picture of a search engine’s autocomplete, and his over-eager “instinctive association” leaves you not knowing whether to laugh or cry.




So what, besides the data itself, is behind a search engine that seems to know everything? Let’s find out with Elasticsearch!

How to use Elasticsearch


Elasticsearch is a search engine built on Apache Lucene(TM), a full-text search library. But Elasticsearch is more than a full-text search engine: it is also a distributed, real-time document store in which every field is indexed and searchable. As a distributed engine it scales out to hundreds of servers and handles petabytes of structured or unstructured data. Elasticsearch is document-oriented, storing documents with the same structure as the objects in your application, and on top of this document model it provides sophisticated indexing, full-text search, analytics, aggregations and more. Documents are expressed as JSON, and an inverted index is what lets it search large volumes of data quickly.
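As a quick illustration, here is a minimal sketch assuming the official elasticsearch-py client (7.x style) and a node on localhost:9200; the index name and fields are made up. Indexing one JSON document makes every field searchable through the inverted index:

# Minimal sketch of Elasticsearch's document model (assumed local node, made-up index).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# A document is just JSON; every field is indexed and searchable.
doc = {
    "title": "Wreck-It Ralph 2",
    "body": "KnowsMore runs the search engine inside the Internet.",
    "published": "2018-11-21",
}
es.index(index="movies", id=1, body=doc)   # store the document
es.indices.refresh(index="movies")         # make it visible to search

# Full-text search hits the inverted index built from the "body" field.
resp = es.search(index="movies", body={"query": {"match": {"body": "search engine"}}})
print(resp["hits"]["total"])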


When I ran multithreaded bulk inserts into Elasticsearch, the heap overflowed and searches slowed to a crawl; Elasticsearch is notorious for how much memory it needs and how heavily it depends on it. On top of that, searching Chinese text inevitably involves word segmentation. So here are some lessons learned about Elasticsearch memory and Chinese word segmentation.
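For the bulk-loading side, the sketch below shows one way to keep request sizes under control, assuming the official elasticsearch-py client and a local node; the index and field names are placeholders. Streaming the documents from a generator and keeping the chunk size moderate means neither the client nor the cluster has to hold one enormous request at once, which helps avoid the kind of memory pressure described above.

# Hedged sketch of bulk indexing with elasticsearch-py helpers (assumed local node).
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

def actions():
    # Generate documents lazily instead of building a giant list in memory.
    for i in range(1_000_000):
        yield {
            "_index": "news-2018-01",
            "_source": {"title": f"headline {i}", "published": "2018-01-01"},
        }

# bulk() slices the generator into requests of chunk_size docs each,
# so no single request body grows unbounded.
bulk(es, actions(), chunk_size=1000, request_timeout=120)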



Memory allocation for Elasticsearch


The default heap size of Elasticsearch is 1 GB.




In real business scenarios, however, this default quickly becomes a limitation: queries are served out of the heap, so Elasticsearch needs a bigger one. One way is the environment variable:

export ES_HEAP_SIZE=10g

Another is to pass the JVM options when starting the process:

./bin/elasticsearch -Xmx10g -Xms10g

Setting -Xms equal to -Xmx fixes the heap size up front, so the JVM never has to resize the heap after garbage collection, which removes that source of pressure.


How much memory should Elasticsearch be given on a single machine?


We all know memory matters to Elasticsearch, so why not hand it all of the machine’s memory? Because Lucene is designed to lean on the operating system’s file-system cache, and its performance depends on that interaction with the OS. If Elasticsearch takes all the memory, Lucene is left with almost none, and full-text retrieval performance suffers. So on a machine dedicated to Elasticsearch, give about 50% of the memory to the Elasticsearch heap and leave the other 50% to the operating system, where Lucene uses it as file cache to speed things up.


One more consideration: Elasticsearch’s memory should never be swapped out to disk, so we disable swapping for the process by turning on the mlockall switch in elasticsearch.yml:

bootstrap.mlockall: true

(In Elasticsearch 5.x and later this setting was renamed to bootstrap.memory_lock: true.)



Index design and word segmentation in Elasticsearch


In one of our real projects we used Elasticsearch for news retrieval. While designing it for performance, we found that retrieval became very slow once all the data piled up in a single index, whereas an index that stays under roughly 100 GB is not noticeably affected. So the design creates a new index every month and gives all of the monthly indices the same alias: queries go through the alias and therefore cover every index, while each individual index stays small enough to keep its performance. The first index and its mapping are created manually, and from then on a new index is created automatically on the first day of each month when data is written to Elasticsearch:
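Roughly, the setup looks like the sketch below, assuming the official elasticsearch-py client, a 7.x-style cluster (on 6.x the mapping sits under a type name) and the IK analysis plugin; the index prefix, alias and field names are invented for illustration.

# Sketch of the "one index per month, one shared alias" layout described above.
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

ALIAS = "news"
index_name = f"news-{date.today():%Y-%m}"   # e.g. news-2018-06

if not es.indices.exists(index=index_name):
    es.indices.create(
        index=index_name,
        body={
            "mappings": {
                "properties": {
                    "title":     {"type": "text", "analyzer": "ik_max_word"},
                    "content":   {"type": "text", "analyzer": "ik_max_word"},
                    "published": {"type": "date"},
                }
            },
            "aliases": {ALIAS: {}},          # every monthly index joins the alias
        },
    )

# Writes go to the current month's index; queries go through the alias
# and therefore cover every monthly index at once.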




During full-text retrieval I ran into some problems, because financial news demands both freshness and high relevance. I originally used the ik_max_word analyzer to segment the indexed content, and when results were then sorted by publication time, relevance turned out to be poor. ik_max_word depends heavily on its dictionary: besides the dictionary needing to be enriched, the analyzer also emits unnecessary sub-words. For example, “Hang Seng Electronics” (恒生电子) is segmented into three terms: “Hang Seng Electronics”, “Hang Seng” and “Electronics”. At query time a document that matches any one of those three terms is retrieved, and once hits are ordered by time instead of by score, poorly matched documents rise to the top, which makes the problem very obvious.
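To see the difference between the two IK analyzers on exactly this phrase, here is a small sketch, assuming elasticsearch-py and the elasticsearch-analysis-ik plugin installed on the cluster; the exact output depends on the dictionary in use.

# Compare the two IK analyzers on the troublesome phrase (assumed IK plugin installed).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

for analyzer in ("ik_max_word", "ik_smart"):
    resp = es.indices.analyze(body={"analyzer": analyzer, "text": "恒生电子"})
    tokens = [t["token"] for t in resp["tokens"]]
    print(analyzer, tokens)

# Typical output (dictionary-dependent):
#   ik_max_word ['恒生电子', '恒生', '电子']   <- fine-grained, extra sub-words
#   ik_smart    ['恒生电子']                   <- coarsest split only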




During optimization I switched the analyzer from ik_max_word to ik_smart and changed the mapping accordingly, so that indexing no longer produces those redundant sub-words.
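A sketch of the revised mapping, again assuming elasticsearch-py, a 7.x-style cluster and the IK plugin, with invented index and field names. Since an existing field’s analyzer cannot be changed in place, in practice the switch means creating a new index with the new mapping and reindexing into it:

# Create a new monthly index that uses ik_smart, then copy the old data over.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="news-2018-07",
    body={
        "mappings": {
            "properties": {
                "title":     {"type": "text", "analyzer": "ik_smart"},
                "content":   {"type": "text", "analyzer": "ik_smart"},
                "published": {"type": "date"},
            }
        },
        "aliases": {"news": {}},
    },
)

# Move the existing documents over with the _reindex API.
es.reindex(body={"source": {"index": "news-2018-06"}, "dest": {"index": "news-2018-07"}})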





On top of the new segmentation, the retrieval itself uses a bool query to combine the match conditions.
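What that might look like, as a hedged sketch: the alias, field names and clause layout are my guesses for this scenario, using elasticsearch-py.

# Bool query over the alias, sorted by publish time first and score second.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "query": {
        "bool": {
            "must": [                       # the phrase must actually match
                {"match_phrase": {"content": "恒生电子"}}
            ],
            "should": [                     # optional clauses only boost the score
                {"match": {"title": "恒生电子"}}
            ],
            "filter": [                     # restrict to recent news without scoring
                {"range": {"published": {"gte": "now-7d/d"}}}
            ],
        }
    },
    "sort": [{"published": {"order": "desc"}}, "_score"],
}

resp = es.search(index="news", body=query)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"], hit["sort"])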




This improves on the results of the previous version of the retrieval, but ik_smart’s ability to parse whole sentences is still fairly weak, so we can guess that real search engines bring NLP to bear to push this kind of optimization further.


Elasticsearch 6.x upgrades Lucene and changes the string mapping type (now text and keyword), and its retrieval performance is noticeably better than the old 2.x line. I am currently studying a demo that upgrades our 2.3.3 data to 6.x, along with segmentation-based retrieval; 6.x also supports a HanLP plugin. Anyone interested is welcome to leave a message in the background and join the research. Let’s build a real KnowsMore!