Rucene is a search-engine core library rewritten in Rust, based on Lucene, by the Zhihu search technology team. Externally, Rucene powers recall for Zhihu's two core businesses, search and recommendation; internally, it supports Zhihu's company-wide logging service. In 2019 we open-sourced Rucene on GitHub; project address: github.com/zhihu/rucen…

SIMD stands for Single Instruction, Multiple Data. Broadly speaking, it means using the extended instruction sets provided by modern CPU architectures, such as SSE and AVX, to load and compute on multiple data elements at once with a single CPU instruction.
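As a tiny illustration of the idea, here is a minimal sketch using Rust's `std::arch` SSE intrinsics (it assumes an x86_64 target; the function name `simd_add4` is ours, not from any library): one `_mm_add_ps` instruction adds four f32 lanes at once.

```rust
use std::arch::x86_64::*;

// Add two groups of four f32s with a single SIMD instruction (x86_64 / SSE).
fn simd_add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr()); // load 4 floats into a 128-bit register
        let vb = _mm_loadu_ps(b.as_ptr());
        let vc = _mm_add_ps(va, vb); // one instruction performs all four additions
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), vc); // store the 4 results back to memory
        out
    }
}

fn main() {
    println!("{:?}", simd_add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]));
}
```

A scalar version would need four separate additions; the SIMD version does them in one instruction, which is where the speedup comes from.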

At present there are few articles on the Chinese internet that introduce SIMD optimization for concrete business workloads, especially in Rust. Based on optimization practice on Rucene by the Zhihu search team, this article summarizes how to optimize Rust programs with SIMD instructions.

There are three main ways to use SIMD to optimize a program. The first is to let the compiler do the optimization; this is the easiest option, and depends on the compiler's implementation. (While researching SIMD, I learned that to use SIMD in Java you need to write programs following certain conventions so that the JVM can auto-vectorize them as much as possible; readers interested in SIMD in Java can refer to Lucene's discussion of this topic in its latest version.) The second is to use third-party libraries: with the help of libraries others have packaged, a few simple calls give you SIMD optimization for certain specific scenarios, for example simdjson or Rust's faster crate. The third is to write SIMD instructions by hand. Given specific business characteristics, it is sometimes difficult to apply SIMD through an off-the-shelf wrapper library; in that case we have to write code that calls SIMD intrinsics ourselves. This article covers the three approaches in turn.

Rust compiler parameter tuning

We went through the relevant rustc and Cargo documentation provided by Rust and found nothing specifically about SIMD optimization. In the process, however, we found three compilation parameters that, properly configured, significantly improve Rucene's performance. Their usage is shown below; for details about each parameter, see the official documentation.

In Cargo.toml:

```toml
[profile.release]
lto = true
opt-level = 3
codegen-units = 1
```

Or passed to cargo via environment variables:

```shell
CARGO_PROFILE_RELEASE_LTO=true \
CARGO_PROFILE_RELEASE_OPT_LEVEL=3 \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
cargo build --release
```

There are two caveats to using these three parameters. First, compilation becomes very slow; our solution is to use the second (environment variable) method and enable them only for image release builds. Second, they do not help every application: for some of our simpler services these three parameters brought no benefit and only increased compilation time. We therefore recommend trying them when your Rust program is relatively complex.

Using the faster library to optimize Embedding calculation

In 2020 we tried moving Embedding-based recall scoring onto the engine side. When this first went online in gray release, P95 latency rose significantly; after optimizing the vector computation with the faster crate, the P95 increase came down to an acceptable range. Setting the Embedding details aside, the problem to solve is optimizing the inner product of two vectors. The first code snippet below is the ordinary inner product computation; the second is written with faster's SIMD-style API. As the vector dimension grows, the performance advantage of the SIMD version becomes obvious. For details on how to use the faster crate, refer to its documentation.

```rust
fn main() {
    let mut score = 0.0;
    let dots1: Vec<f32> = vec![1.04177, 0.28162, 2.02021];
    let dots2: Vec<f32> = vec![1.59189, 1.94172, 1.02021];
    for i in 0..dots1.len() {
        score += dots1[i] * dots2[i];
    }
    println!("score={}", score);
}
```
```rust
use faster::{f32s, IntoSIMDRefIterator, IntoSIMDZip, SIMDZippedIterator};

fn main() {
    let dots1: Vec<f32> = vec![1.04177, 0.28162, 2.02021];
    let dots2: Vec<f32> = vec![1.59189, 1.94172, 1.02021];
    let score = (dots1.simd_iter(f32s(0.0)), dots2.simd_iter(f32s(0.0)))
        .zip()
        .simd_reduce(f32s(0.0), |acc, (a, b)| acc + a * b)
        .sum();
    println!("score = {}", score);
}
```

Using SIMD instructions to optimize posting-list decompression

Rucene's efficient retrieval is built on the inverted index, in which each posting list is sorted in ascending document ID order and every 128 document IDs form one compressed block. Decompressing large numbers of blocks is a major performance bottleneck for a search engine serving online retrieval. Next, let's set search engines aside and state the problem more directly.

A block contains 128 integers in ascending order; say the first eight are 1, 3, 8, 15, 19, 31, 58, 100, …

Subtracting each value from its predecessor (for the first block, the first value is delta-encoded against 0) gives the following delta sequence: 1, 2, 5, 7, 4, 12, 27, 42, …

If 42 is the maximum of the 128 deltas, its binary representation is 101010, which has six significant bits, so every delta in the block is stored in six bits.
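The delta-encoding step above can be sketched in plain Rust (an illustrative sketch, not Rucene's actual code; `delta_encode` is a name we made up):

```rust
// Delta-encode a block of ascending doc IDs and compute the per-value bit
// width needed to store the whole block.
fn delta_encode(docs: &[u32]) -> (Vec<u32>, u32) {
    let mut prev = 0;
    let mut deltas = Vec::with_capacity(docs.len());
    for &d in docs {
        deltas.push(d - prev);
        prev = d;
    }
    // The number of significant bits in the largest delta decides the
    // bit width used for every delta in the block.
    let max = deltas.iter().copied().max().unwrap_or(0);
    let bits = 32 - max.leading_zeros();
    (deltas, bits)
}

fn main() {
    let (deltas, bits) = delta_encode(&[1, 3, 8, 15, 19, 31, 58, 100]);
    println!("deltas = {:?}, bits = {}", deltas, bits);
}
```

Running this on the eight example values yields the delta sequence above and a bit width of 6.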

The original storage scheme stores the deltas serially, one right next to another:

```
1      2      5      7      4      12     27     42     …
000001 000010 000101 000111 000100 001100 011011 101010 …
```

The SIMD storage scheme stores them in parallel: the first four deltas go into the low six bits of four adjacent i32s, and the next four deltas go into bits 7–12 of the same four i32s.

```
|<-- 1st i32 -->|<-- 2nd i32 -->|<-- 3rd i32 -->|<-- 4th i32 -->|
 …000100 000001  …001100 000010  …011011 000101  …101010 000111
     (4)    (1)     (12)    (2)     (27)    (5)     (42)    (7)
```

(The original post includes a diagram here that illustrates the layout more clearly.)

The four i32s act as one storage unit: delta 1 is stored in the low six bits of the first i32, delta 2 in the low six bits of the second i32, delta 3 in the low six bits of the third i32, delta 4 in the low six bits of the fourth i32, delta 5 in the second group of six bits of the first i32, and so on. Any bit width from 1 to 32 can be stored with 4 × width i32s: a 1-bit width needs 4 i32s, and a 32-bit width needs 128 i32s. The data is perfectly aligned; there is never a case where some of the four i32s are full and others are not, which makes it ideal for parallel operation.
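To make the layout concrete, here is a scalar (non-SIMD) sketch of packing 128 deltas into 4 × width i32s in this layout. It is an illustration under our own naming (`pack_simd_layout` is not Rucene's API), and it handles the case where a value straddles two i32s of the same lane, which happens whenever the bit width does not divide 32:

```rust
// Delta i goes into lane i % 4; within that lane's stream of i32s it sits at
// bit offset (i / 4) * bits, spilling into the lane's next i32 when it
// crosses a 32-bit boundary.
fn pack_simd_layout(deltas: &[u32], bits: u32) -> Vec<u32> {
    assert_eq!(deltas.len(), 128);
    let mut out = vec![0u32; (4 * bits) as usize];
    for (i, &v) in deltas.iter().enumerate() {
        let lane = i % 4;
        let pos = (i as u32 / 4) * bits; // bit offset inside the lane's stream
        let word = lane + 4 * (pos / 32) as usize; // which output i32
        let shift = pos % 32;
        out[word] |= v << shift;
        if shift + bits > 32 {
            // the value straddles two i32s of the same lane
            out[word + 4] |= v >> (32 - shift);
        }
    }
    out
}

fn main() {
    let mut deltas = [0u32; 128];
    deltas[..8].copy_from_slice(&[1, 2, 5, 7, 4, 12, 27, 42]);
    let packed = pack_simd_layout(&deltas, 6);
    // Low 6 bits of the first four i32s hold deltas 1, 2, 5, 7;
    // bits 7..12 hold deltas 4, 12, 27, 42.
    println!("{:06b} {:06b}", packed[0] & 0x3F, (packed[0] >> 6) & 0x3F);
}
```

With a 6-bit width the output is exactly 24 i32s, matching the 4 × width formula above.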

Once the storage layout is clear, the rest of the decompression is relatively simple. Four i32s are read with one SIMD load instruction; a SIMD bitwise AND takes the low 6 bits of each of the four i32s, yielding the first four integer values. The four i32s are then all shifted right by 6 bits at the same time, and another bitwise AND yields four more values, and so on. The snippets below show sample pseudocode and the main instructions used in this SIMD optimization. For details about the SIMD intrinsics, see the Rust standard library documentation or Intel's official instruction set reference; for the concrete implementation, see the corresponding MR in Rucene.

MR: github.com/zhihu/rucen…

```rust
// Pseudocode for unpacking the first 4 i32s:
mask = _mm_set1_epi32(0b111111)          // set the 6-bit mask in all 4 lanes
values = _mm_lddqu_si128(block_ptr)      // load four i32s
new_values = _mm_and_si128(values, mask) // recover 4 i32s
```
```rust
// Main SSE instructions used:
_mm_set1_epi32    // set all 4 i32 lanes to the same value
_mm_lddqu_si128   // load 4 i32s into a register
_mm_storeu_si128  // store 4 i32s to memory
_mm_and_si128     // 128-bit bitwise AND
_mm_or_si128      // 128-bit bitwise OR
_mm_slli_epi32    // shift all 4 i32s left by the specified number of bits
_mm_srli_epi32    // shift all 4 i32s right by the specified number of bits
_mm_sub_epi32     // lane-wise subtraction of 4 i32s from 4 i32s
_mm_add_epi32     // lane-wise addition of 4 i32s and 4 i32s
_mm_cvtsi128_si32 // extract the lowest of the 4 i32s
```
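Putting these instructions together, a minimal runnable sketch of the unpacking step might look like the following. This is our own illustration, not Rucene's implementation: `unpack8` is a made-up name, and we use the baseline-SSE2 `_mm_loadu_si128` in place of the SSE3 `_mm_lddqu_si128` so the example runs on any x86_64 target without feature detection.

```rust
use std::arch::x86_64::*;

// Unpack the first 8 six-bit deltas from four i32s stored in the SIMD layout:
// deltas 1,2,5,7 live in the low 6 bits of the four lanes, and deltas
// 4,12,27,42 in bits 7..12.
fn unpack8(block: &[i32; 4]) -> [i32; 8] {
    unsafe {
        let mask = _mm_set1_epi32(0b111111); // 6-bit mask in all 4 lanes
        let values = _mm_loadu_si128(block.as_ptr() as *const __m128i); // load 4 i32s
        let mut out = [0i32; 8];
        let first = _mm_and_si128(values, mask); // low 6 bits of each lane
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, first);
        // shift all lanes right by 6, mask again for the next 4 deltas
        let second = _mm_and_si128(_mm_srli_epi32::<6>(values), mask);
        _mm_storeu_si128(out.as_mut_ptr().add(4) as *mut __m128i, second);
        out
    }
}

fn main() {
    // Four i32s packing deltas 1,2,5,7 (low bits) and 4,12,27,42 (bits 7..12).
    let block = [(4 << 6) | 1, (12 << 6) | 2, (27 << 6) | 5, (42 << 6) | 7];
    println!("{:?}", unpack8(&block));
}
```

A full 6-bit decoder would repeat the shift-and-mask step across all 24 i32s of the block, handling the lane-boundary spills; this sketch only shows the first two rounds.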

Optimization results

We launched two optimized versions. The first is an on-demand decompression + compiler parameter tuning version; in this version, compiler parameter tuning alone brought roughly a 10% performance improvement. The second is the SIMD-optimized version, which is roughly 15% faster than the first. Overall, benchmarks show a significant improvement, and the engine's overall merger P95 dropped by more than 30%. Some performance screenshots follow:

Version Description:

Rucene-stdbase: Rucene baseline version
Rucene-partial: on-demand decompression + compiler parameter tuning version
Rucene-simdnew: SIMD-optimized version

Conclusion

Covering the three main ways SIMD can be used in a program, this article has summarized some practical experience the Zhihu search technology team gained while applying SIMD optimization to Rucene. Our own understanding of CPU instruction set technology is still fairly limited, and we are crossing the river by feeling for the stones; if there are errors in the article, readers are welcome to point them out.

Resources

  1. PFOR-DELTA decompression and lookup based on SIMD instructions: zhuanlan.zhihu.com/p/63662886
  2. Introduction to the new index compression algorithm PForDelta and its optimization with SIMD: yq.aliyun.com/articles/56…
  3. SIMD applications: www.zhihu.com/market/pub/…
  4. Intrinsics guide: software.intel.com/sites/landi…

Recruitment information

The Zhihu search technology team is currently hiring front-end, back-end, and search engine developers. Interested candidates are welcome to send a resume to [email protected].