I. The principle of the inverted index

ES uses the inverted index, also known as a reverse index. There are two kinds of indexes: the forward index and the inverted index.

  1. Forward index

    A forward index uses the document ID as the key and records the value of every field in the document; the whole document is retrieved by querying its ID.

    However, to find which documents contain a given keyword, every document has to be scanned and matched, so retrieval efficiency is relatively low.
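
    A minimal Python sketch of a forward index (the document IDs and titles are made up for illustration, loosely matching the example table below):

    # Forward index: document ID -> field values of that document.
    forward_index = {
        1: {"title": "elasticsearch is the most popular search engine"},
        2: {"title": "the most popular database"},
        3: {"title": "elasticsearch tutorial"},
    }

    # Fetching a whole document by its ID is a single lookup ...
    print(forward_index[1])

    # ... but finding which documents contain a keyword means scanning every document.
    hits = [doc_id for doc_id, fields in forward_index.items()
            if "elasticsearch" in fields["title"].split()]
    print(hits)  # [1, 3]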

  2. Inverted index

    An inverted index uses a word or term as the key, establishing a mapping from Term to Document.

    Inverted index table structure, built after removing stop words:

    Term              Document ID list
    elasticsearch     1, 3
    most popular      1, 2
    search engine     1
    ...

    An inverted index is mainly composed of the term dictionary, the posting lists and the inverted file.

    Term Dictionary: the usual indexing unit of a search engine is the word, so the term dictionary is the collection of all the words that have ever appeared in the document collection.

    Posting List: the list of all documents in which a word appears, together with the positions and the frequency of the word in each document. Each record in the list is called a posting.

    Inverted File: the posting lists of all words are stored sequentially in a file on disk, called the inverted file. The inverted file is the physical file that holds the inverted index.
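
    A minimal Python sketch of building the term dictionary and posting lists for the same made-up documents; each posting records the document ID, the positions of the word and its frequency (stop-word removal is omitted):

    from collections import defaultdict

    docs = {
        1: "elasticsearch is the most popular search engine",
        2: "the most popular database",
        3: "elasticsearch tutorial",
    }

    # Term dictionary: term -> posting list.
    # Each posting is (doc_id, positions, frequency).
    inverted_index = defaultdict(list)

    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            inverted_index[term].append((doc_id, pos_list, len(pos_list)))

    print(inverted_index["elasticsearch"])
    # [(1, [0], 1), (3, [0], 1)] -> the word appears in documents 1 and 3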

  3. How to locate a term quickly

    A large document set may contain millions of terms, so finding a specific term by filtering them one by one is slow. How can it be located quickly?

    Sorting the terms and then using binary search is much faster than scanning everything; this sorted list of terms is the term dictionary. The target can then be found in about log N disk seeks, but random disk reads are still very expensive (roughly 10 ms per random access).

    So the goal is to read the disk as little as possible and cache data in memory. The whole term dictionary is too large to fit, so an index of the term dictionary is built: the term index.

    The term index is a prefix tree (trie-like structure):

    This tree does not contain all terms; it records only some prefixes of terms. The term index lets you quickly locate an offset in the term dictionary, and the search then continues sequentially from that position.

    So the term index does not need to store every term, only the mappings between some term prefixes and blocks of the term dictionary. Combined with FST (Finite State Transducer) compression, the term index is small enough to be cached in memory. After the block position in the term dictionary is found through the term index, the term is looked up on disk, which greatly reduces the number of random disk reads.
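
    A simplified Python sketch of the idea (not Lucene's actual FST implementation): the sorted term dictionary is split into blocks that stay "on disk", while a small in-memory term index maps the first term of each block to the block's position; the terms and block size are made up:

    import bisect

    # Sorted term dictionary, split into fixed-size blocks (pretend these live on disk).
    term_dictionary = sorted([
        "apple", "banana", "cat", "dog", "elastic", "elasticsearch",
        "engine", "index", "lucene", "search", "shard", "term",
    ])
    BLOCK_SIZE = 4
    blocks = [term_dictionary[i:i + BLOCK_SIZE]
              for i in range(0, len(term_dictionary), BLOCK_SIZE)]

    # In-memory term index: first term of each block -> block number.
    index_keys = [block[0] for block in blocks]

    def lookup(term):
        # Binary search in the small in-memory index picks the candidate block,
        # so only that one block has to be read from "disk" and searched.
        block_no = bisect.bisect_right(index_keys, term) - 1
        return block_no >= 0 and term in blocks[block_no]

    print(lookup("elasticsearch"))  # True
    print(lookup("mysql"))          # False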

II. TF/IDF/BM25 calculation

ES gives every search hit a relevance score, and ES has two scoring algorithms, TF-IDF and BM25:

  1. TF: Term Frequency, the number of times a word appears in a piece of content. Definition:

    TF = number of occurrences of the word in the document / total number of words in the document

    The more often a word appears, the more important it is. If ElasticSearch appears many times in an article while Spring appears only two or three times, the article is probably a professional article about ElasticSearch.
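
    A one-line illustration of the TF definition above, on a made-up sentence:

    # TF = occurrences of the word in the document / total number of words in the document
    doc = "elasticsearch is fast and elasticsearch is easy to scale"
    words = doc.split()
    tf = words.count("elasticsearch") / len(words)
    print(tf)  # 2 / 9 ≈ 0.222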

  2. IDF: Inverse Document Frequency, an indicator of how important a word is. Formula:

    IDF = log(total number of documents in the library / (number of documents containing the word + 1))

    log is the logarithm function. If every article contains a word, the IDF of that word is roughly log(1) = 0 and its importance is zero; the IDF of stop words is therefore close to 0.

    If a word appears in only a few articles, its IDF is large and its importance is higher. The + 1 in the denominator avoids division by zero.
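
    A small Python sketch of the IDF formula above, combined with TF into a classic TF-IDF score; the four-document corpus is made up, and this classic variant is not identical to the formula ES itself prints later in the explain output:

    import math

    docs = [
        "elasticsearch is the most popular search engine",
        "the most popular database",
        "the mysql tutorial",
        "redis cache basics",
    ]

    def tf(term, doc):
        words = doc.split()
        return words.count(term) / len(words)

    def idf(term):
        containing = sum(1 for d in docs if term in d.split())
        # the + 1 in the denominator avoids division by zero
        return math.log(len(docs) / (containing + 1))

    def tf_idf(term, doc):
        return tf(term, doc) * idf(term)

    print(idf("the"))            # in 3 of 4 docs -> log(4 / 4) = 0, carries no weight
    print(idf("elasticsearch"))  # in 1 of 4 docs -> log(4 / 2) ≈ 0.69
    print(tf_idf("elasticsearch", docs[0]))  # (1 / 7) * 0.69 ≈ 0.099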

  3. BM25

    BM25 is essentially an improvement of the TF-IDF algorithm. In TF-IDF, the larger the TF(t) part, the larger the value returned by the whole formula.

    BM25 optimizes exactly this point: as TF(t) gradually increases, the value returned by BM25 approaches a fixed limit instead of growing without bound.

    For example, with TF-IDF the score keeps rising as the keyword frequency in an article grows; one article may contain the keyword 40 times and others 60 or 80 times, yet the article with 40 occurrences may actually be the result the user expects.
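
    A small sketch of this saturation, using the BM25 term-frequency component that ES prints in its explain output (see below); k1 = 1.2 and b = 0.75 are the ES defaults, and dl = avgdl is assumed so that only the effect of the raw frequency is visible:

    def bm25_tf(freq, k1=1.2, b=0.75, dl=100, avgdl=100):
        # BM25 term-frequency part: freq / (freq + k1 * (1 - b + b * dl / avgdl))
        return freq / (freq + k1 * (1 - b + b * dl / avgdl))

    for freq in (1, 5, 10, 40, 60, 80):
        print(freq, round(bm25_tf(freq), 3))
    # 1  0.455
    # 5  0.806
    # 10 0.893
    # 40 0.971  <- 40, 60 and 80 occurrences score almost the same:
    # 60 0.980     the curve flattens instead of growing without bound
    # 80 0.985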

  4. Viewing the ES score calculation:

    Setting the explain flag to true lists the details of the score calculation:

    GET /movies/_search
    {
      "explain": true,
      "query": { "match": { "title": "heart" } }
    }

    The response then contains the detailed scoring breakdown:

    ."_explanation" : {
              "value": 6.0875173."description" : "weight(title:heart in 276) [PerFieldSimilarity], result of:"."details": [{"value": 6.0875173."description" : "Score (FREq =1.0), computed as Boost * IDF * TF from:"."details": [{"value": 2.2."description" : "boost"."details": []}, {"value": 5.8863263."description" : "Idf, computed as log(1 + (n-n + 0.5)/(N + 0.5)) from:"."details": [...Copy the code

    Total score: boost * idf * tf (boost is the amplification factor, 2.2 by default).

    The BM25 part shows up in the tf description: tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
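
    The same formulas can be plugged together in a short Python sketch of the total score; the corpus statistics N, n, dl and avgdl below are hypothetical stand-ins, not the real numbers behind the explain output above:

    import math

    def es_bm25_score(freq, N, n, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
        # The formulas shown in the explain output:
        #   idf   = log(1 + (N - n + 0.5) / (n + 0.5))
        #   tf    = freq / (freq + k1 * (1 - b + b * dl / avgdl))
        #   score = boost * idf * tf
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
        tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
        return boost * idf * tf

    # Hypothetical statistics: the word occurs once (freq = 1) in a 4-word title,
    # 2 of 1000 documents contain it, and the average title length is 5 words.
    print(es_bm25_score(freq=1, N=1000, n=2, dl=4, avgdl=5))  # ≈ 6.53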


This article was created and shared by Mirson. For further communication, please join QQ group 19310171 or visit www.softart.cn