Technology selection for massive-data system architecture…

  • Full-text search engines (NLP, crawlers, e.g. Baidu)
  • Vertical search engines (e-commerce, OA, site search, video sites)

Requirements for a search engine:

  • Fast queries (efficient compression algorithms, fast encoding and decoding)
  • Accurate results (BM25, TF-IDF)
  • Comprehensive results (recall rate)
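
To make the "accurate results" requirement concrete, here is a minimal sketch of the classic BM25 formula. The function name `bm25_scores`, the sample documents, and the defaults `k1=1.2`, `b=0.75` are assumptions for illustration; Lucene's actual implementation differs in details.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query with the classic BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term that is absent from the doc contributes nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long docs are penalized relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-tokenized documents.
docs = [
    ["fast", "search", "engine"],
    ["search", "search", "results", "ranking"],
    ["database", "index"],
]
scores = bm25_scores(["search"], docs)
```

In this sample the second document mentions "search" twice and ranks first, while the third document does not contain the term and scores zero.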

Getting started with Elasticsearch

Environment installation

  • Operating system, JDK, and their compatibility
  • JDK version compatibility: 8 | 11 | (14)
  • Downloads and compatibility information: elastic.co/cn/downloads

Elasticsearch directory structure

  • bin: executable scripts, including ones for starting the ES service, managing plug-ins, and utility commands
  • config: configuration directory (ES configuration, role configuration, JVM configuration, etc.)
  • lib: Java libraries that ES depends on
  • data: default data directory, holding all node, shard, index, and document data. Must be changed in production.
  • logs: default log directory. Must be changed in production.
  • modules: all ES modules, such as Cluster, Discovery, Indices, etc.
  • plugins: installed plug-ins
  • jdk / jdk.app: bundled Java environment, shipped with ES from 7.0 onward

Startup

  • Run bin/elasticsearch, then open localhost:9200
  • Multi-node mode

    • Multiple installations, one node each
    • A single installation running multiple nodes

      elasticsearch -E … -E path.logs=log1 -E … -E …
      elasticsearch -E … -E path.logs=log1 -E … -E …

Cluster health

  • Health status

    • Green: all primary and replica shards are active; the cluster is healthy
    • Yellow: at least one replica shard is unavailable, but all primaries are active, so the data is still complete
    • Red: at least one primary shard is unavailable, so some data is unavailable
  • Health check

    • _cat/health
    • _cluster/health
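
The three health colors above follow a simple rule, which can be sketched as a toy model. The `(is_primary, is_active)` tuple representation and the function name `cluster_health` are assumptions for illustration, not how Elasticsearch represents shards internally.

```python
def cluster_health(shards):
    """Classify health from a list of (is_primary, is_active) shard copies."""
    # Any inactive primary means part of the data cannot be served: red.
    if any(is_primary and not active for is_primary, active in shards):
        return "red"
    # All primaries are active, but some replica is not: yellow.
    if any(not is_primary and not active for is_primary, active in shards):
        return "yellow"
    # Every primary and every replica is active: green.
    return "green"

all_active = [(True, True), (False, True)]
replica_down = [(True, True), (False, False)]
primary_down = [(True, False), (False, True)]
```

A single-node cluster with one configured replica typically reports yellow for exactly this reason: the replica has no second node to be assigned to.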


  • Verify that Kibana started successfully: localhost:5601
  • Configure the ES service address in kibana.yml: elasticsearch.hosts: ["http://localhost:9201"]
  • Shut down Kibana from the command line:

    • Close the window
    • Find the process: ps -ef | grep 5601, ps -ef | grep kibana, or lsof -i :5601
    • kill -9 pid
  • Causes of and fixes for the “Kibana server is not ready yet” error

    • The Kibana and Elasticsearch versions are incompatible (keep the versions identical)
    • The Elasticsearch service address differs from the elasticsearch.hosts value configured in Kibana (the Elasticsearch address is set in elasticsearch.yml; make elasticsearch.hosts match it)
    • Cross-origin access is disabled in Elasticsearch (enable it in elasticsearch.yml)
    • A firewall is enabled on the server (turn off the firewall or change the server's security policy)
    • Disk usage on the Elasticsearch host exceeds 90% (free up disk space; set up monitoring and alerting)

Seeing through to the essence: what an “index” really is

Index

  • Enables fast retrieval
  • Built on a data structure
  • Persisted as files

The composition of the database

Why B+ trees (MySQL) are not suited to big-data retrieval

  • MySQL, select * from user where …: 0.295 s
  • MySQL, millions of rows, no index: 3.365 s
  • MySQL, full-text index: 1.033 s
  • ES, ten million rows: .8 s

MySQL index structure: B-Trees visualization

The inverted index, fully explained

Inverted index data structures

Core algorithms of the inverted index

  • Posting list compression algorithms

    • FOR: Frame Of Reference

    • RBM: Roaring Bitmap

  • How the term index is searched

    • FST: Finite State Transducers, http://examples.mikemccandles… (how FSTs are implemented in Lucene)
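
The Frame Of Reference idea listed above can be sketched in a few lines: store the gaps between sorted doc IDs instead of the IDs themselves, then pack every gap with only as many bits as the largest gap needs. This is a simplified single-block sketch; the function names and the sample posting list are assumptions, and Lucene's real encoder works on fixed-size blocks with additional optimizations.

```python
def for_encode(doc_ids):
    """Delta-encode a sorted posting list and bit-pack the deltas."""
    # Gaps between neighbours are much smaller than the raw doc IDs.
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    # One shared bit width for the block, sized for the largest delta.
    bit_width = max(1, max(deltas).bit_length())
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * bit_width)
    return bit_width, len(deltas), packed

def for_decode(bit_width, count, packed):
    """Unpack the deltas and rebuild the original doc IDs by summing them."""
    mask = (1 << bit_width) - 1
    doc_ids, current = [], 0
    for i in range(count):
        current += (packed >> (i * bit_width)) & mask
        doc_ids.append(current)
    return doc_ids

posting = [73, 300, 302, 332, 343, 372]  # hypothetical sorted doc IDs
bit_width, count, packed = for_encode(posting)
restored = for_decode(bit_width, count, packed)
```

For this posting list the largest gap is 227, so every gap fits in 8 bits: 48 bits in total, compared with 192 bits for six 32-bit integers.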

Forward index vs. inverted index: first, understand the two data structures. Doc values map a document to the terms it contains; an inverted index maps a term to the IDs of the documents that contain it. In principle, an inverted index is a poor fit for aggregations because you cannot determine the total number of docs from an inverted index alone, and since text is analyzed by default, even the aggregated results may be inaccurate, so you need to create an additional not_analyzed field, which increases disk usage. As the simplest example, suppose we have a table of items, each of which has several tags, and we run the following query:

      {
        "query": {"match": {"tags": "cost performance"}},
        "aggs": {"tag_terms": {"terms": {"field": "tags.keyword"}}}
      }

This aggregation asks: for the items tagged “cost performance”, what are all the tags on those items?

If the aggregation ran on the inverted index, execution would look like this: scan every term in the inverted index and check whether the docs in that term's posting list are tagged “cost performance”. If so, record the term. Because we cannot know in advance whether the next term qualifies, every term has to be checked one by one, which amounts to a full scan.

With a forward index, which maps each doc to the terms it contains (doc ID => all terms in the current field), finding all the tags of the matching docs is simple: look up the key (doc ID) and read its values (all terms), with no full scan.

So the reason aggregation is efficient on a forward index comes down to the difference between the two data structures. It has nothing to do with combining it with the inverted index, which only pre-filters the data. This is why forward indexes are, in principle, friendly to aggregation queries.
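
The contrast can be sketched with plain dictionaries. The item data, variable names, and exact-match tag lookup are assumptions for illustration (a real match query runs on analyzed text); the point is only how much work each structure requires.

```python
from collections import Counter

# Hypothetical items and their tags.
items = {
    1: ["cost performance", "fast charging"],
    2: ["cost performance", "large screen"],
    3: ["large screen"],
}

# Inverted index: term -> IDs of the docs that contain it.
inverted = {}
for doc_id, tags in items.items():
    for tag in tags:
        inverted.setdefault(tag, set()).add(doc_id)

# Forward index (the doc values idea): doc ID -> terms in the field.
forward = {doc_id: list(tags) for doc_id, tags in items.items()}

# Query phase: the inverted index pre-filters the matching docs.
matching = inverted["cost performance"]

def aggregate_with_inverted(matching_ids):
    """Visit every term in the index, whether or not it is relevant."""
    counts = Counter()
    for term, posting in inverted.items():
        hits = len(posting & matching_ids)
        if hits:
            counts[term] = hits
    return counts

def aggregate_with_forward(matching_ids):
    """Visit only the matching docs and read their terms directly."""
    counts = Counter()
    for doc_id in matching_ids:
        counts.update(forward[doc_id])
    return counts

by_inverted = aggregate_with_inverted(matching)
by_forward = aggregate_with_forward(matching)
```

Both functions return the same counts, but the first loops over every term in the index, while the second touches only the docs that matched the query.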

Doc values are a serialized, column-oriented storage structure, in which the values also carry term-frequency data. This structure compresses very well (with the FOR and RBM compression algorithms). Lucene reads these files through mmap, so the data is essentially read from disk into the OS cache and decoded there, using the forward index layout. Because column-stored data can be compressed as efficiently as a posting list, the files are small, which greatly speeds up reading them from disk, and decoding in the OS cache is faster still. This is why doc values are the better fit for aggregation.

As a simple example, suppose twenty students are enrolled in tutoring classes. Each student can enroll in more than one class, and each class has a head teacher.

The forward index is the head teacher: for each class, which students it contains.

The inverted index is the student: for each student, which classes they are enrolled in.

Now we want to know which students are in the music class and the art class. Asking the head teachers takes just two questions. Asking the students means going to every single student and asking whether they enrolled in music or art; until you have asked all of them, you cannot know whether a student you have not asked yet is in one of those classes.

In this example, the head teacher corresponds to the forward index and each doc is a class. Each doc contains several terms, and each term is like a student. The head teacher knows which students are in each class, that is, which terms each doc contains. A student only knows which classes they belong to, that is, which docs contain this term.
