Huawei | Rust Scientific Computing: Analysis and Practice of a Multidimensional Array Algorithm Library

By Li Yuan/Zhang Handong

This article is based on a talk at the Rust Meetup in Shenzhen on March 27. PPT and video links: disk.solarfs.io/sd/6e7b909b…


Introduction

Rust ndarray is an open source project developed by bluss, a senior scientific computing expert on the official Rust team, that implements Rust-based matrix and linear algebra operations. Its goal is to build a scientific computing community in Rust similar to those around NumPy and OpenBLAS. It serves as the foundation for machine vision, data mining, bioinformatics, and other kinds of scientific computing, and its main users in the community are related technology universities and research institutes. The author participated in the overall planning of this open source project: opening up the southbound technology stack of ndarray for various scenarios in the community, applying compiler techniques, parallelization, and software-hardware co-design to achieve breakthroughs in functionality and performance, and laying a solid base for the whole Rust scientific computing ecosystem.

At present, the community's needs for ndarray include adaptation to embedded environments, improvement of basic mechanisms, and better space utilization and computing performance: specifically, no_std support, flexible strides, the broadcast mechanism, parallel computation, and so on. Each is introduced in detail below.

No_std

First is the ndarray no_std work, which addresses adapting ndarray to embedded environments. std is short for the Rust standard library, which consists of the core library plus a number of other functional modules. The core library contains language core features such as types, pointers, synchronization, and memory management; the remaining modules cover non-core or hardware- and OS-related functionality such as file management, operating system adaptation, thread management, and networking. By default, the Rust compiler links the standard library when compiling and generating the final artifacts (rlibs). no_std, by contrast, means the compiler does not automatically bring in the standard library at compile time; instead, the programmer introduces the needed functional modules explicitly. This mechanism is mainly used in embedded development. Beyond the fact that the standard library may not even compile in a bare-metal environment without an operating system, storage resources in embedded environments are precious, and saving space saves cost. In a no_std environment, each generated rlib file is roughly 200KB smaller than its std counterpart, and since a project typically depends on more than one rlib, the overall savings are substantial.

In addition to this approach, space usage can also be significantly reduced by compiling the entire standard library into a dynamic link library, allowing multiple rlibs to link against the same shared library. Here we cover only the first method, no_std.

For a Rust library to support a no_std environment, two main things need to be done:

The first is to resolve the library's own dependence on std.

There are three main methods:

  1. Use the core library instead of the standard library for language core features.

  2. Manually introduce additional functional modules when you need functionality the core library does not provide.

  3. Use conditional compilation for feature tailoring.

Conditional compilation in Rust is mainly implemented through features defined by developers: attributes are added to parts of the program, and the compiler checks which features are enabled to decide which parts to compile.
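As a sketch of feature-based conditional compilation, consider the same function compiled differently depending on whether the crate is built with a `std` feature. The feature name follows common convention, but the code is illustrative, not ndarray's actual source:

```rust
// Compiled only when the crate is built with `--features std`
// (or when `default = ["std"]` is set in Cargo.toml).
#[cfg(feature = "std")]
fn backend() -> &'static str {
    "std"
}

// Fallback compiled into a no_std build; in a real library this branch
// might call a libm-style implementation for float math such as ln/exp.
#[cfg(not(feature = "std"))]
fn backend() -> &'static str {
    "core-only"
}

fn main() {
    // Built without the feature flag, the no_std branch is active.
    println!("active backend: {}", backend());
}
```

Note that the attribute selects whole items (functions, modules, impls), not individual statements inside a function.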

The second thing is to resolve the dependency libraries' own dependence on std. There are two main approaches. The first is to modify each dependent module so that it, too, becomes no_std-capable. The technique is the same as in the first step, but the problem recurses: to make a dependency no_std, you must also make its dependencies no_std, and so on. Therefore the second approach is generally adopted: use Cargo's conditional dependency feature. Anyone who has done Rust development knows that every project has a Cargo.toml file.

Take the no_std-ization of ndarray as an example. ndarray is an open source project, and the need for no_std came from the community. It was proposed last year by the owner of the RustCV community, an open source community specializing in computer vision algorithms in Rust. His idea was to run ndarray on robot chips in embedded environments, which would let robots carry various CV algorithms built on ndarray. I participated in the relevant issue discussion and took on the task. The work corresponds to the steps above. A simple technique is to add the statement "use core as std" to the project's lib.rs, which replaces the standard library with the core library throughout the project without changing every use std statement. Besides the core library, ndarray makes extensive use of Vec, slices, and other features from the standard library's alloc module, so the alloc module must be introduced manually in lib.rs. For floating-point operations that cannot be imported from a single module, such as logarithms and exponentials, conditional compilation is used: an attribute is added so that the marked program entities are compiled only in an std environment. "Program entities" here means, narrowly, functions: unlike C's macro-based conditional compilation, Rust's conditional compilation is attribute-based, so you cannot select individual statements inside a function the way C does with macros; instead, whole program entities are compiled or skipped according to their attributes.

For ndarray's dependencies, the matrix multiplication module was specifically made no_std-capable. Other libraries, such as BLAS, Serde serialization, and Rayon multithreading, use Cargo's conditional dependencies: either a no_std-compatible version is introduced in the no_std environment, or the feature is simply made unavailable there.
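Cargo's conditional dependency mechanism can be sketched in a Cargo.toml fragment like the following. The version numbers and feature wiring here are illustrative; check ndarray's actual Cargo.toml for the real configuration:

```toml
[features]
# Building with default features enables std; `--no-default-features`
# gives a no_std build.
default = ["std"]
std = ["matrixmultiply/std"]

[dependencies]
# Core dependency with its own std feature disabled by default.
matrixmultiply = { version = "0.3", default-features = false }
# Optional dependencies: only pulled in when their feature is requested,
# so an embedded build can simply leave them out.
serde = { version = "1.0", optional = true, default-features = false }
rayon = { version = "1.5", optional = true }
```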

Flexible strides

Next, I'll introduce the use of strides in multidimensional arrays. The memory model of multidimensional arrays in ndarray needs to be introduced first: it contains the data, a data pointer, dimensions, and strides. One notable difference from NumPy is the use of static dimensions: dimensions and strides from 1 to 6 are represented by fixed-length arrays, a consequence of Rust's nature that will be expanded on later. Of course, ndarray also supports dynamically sized arrays as dimensions. Static-dimension arrays operate much faster than dynamic-dimension ones, but the drawback is that the interaction logic between different dimension types is inconvenient and complex.

The stride, as the name implies, is the distance in memory between adjacent index positions along each axis, and it determines the order in which the pointer traverses the array. One important capability of ndarray is that different strides can express arrays with the same physical layout but different logical structures. The greatest benefit is avoiding the time and space cost of allocating a new array, which is invaluable in application scenarios with large amounts of data, such as big data applications and bioinformatics research. In some scenarios, different strides also significantly affect the efficiency of algorithms and programs.

The classic example is the difference between C-style and Fortran-style arrays. In a C-style array, elements along the last axis are adjacent in memory and elements along the first axis are farthest apart; Fortran order is the opposite: elements along the first axis are adjacent and those along the last axis are farthest apart. The two layouts differ enormously in efficiency in different operation scenarios, because the spatial locality of array traversal has a significant impact on access speed. If the operation iterates along the first axis, Fortran order is faster; otherwise, C order is faster. Another way to think about it is that a C-style array and a Fortran-style array over the same memory are logically transposes of each other.

On this basis, the definition and usefulness of negative strides arise naturally: when we want to access an axis of the array in reverse order, we only need to negate the stride of that axis, instead of reallocating memory and copying the data in reverse. For example, to flip an image (image data is generally a three-dimensional array of height, width, and RGB), we just change the stride of the horizontal axis to its negative, and we get the flipped image data without copying an image of the same size. Non-unit strides can also be seen as a convenient representation of slicing: by having the pointer jump through memory instead of traversing it sequentially, we obtain slices of the original array. This is widely used in feature extraction during neural network training. More specifically, there is the zero stride, which is equivalent to copying one piece of data many times; this is the basis for the broadcast feature covered later.
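The stride arithmetic described above can be sketched over a flat buffer. This is a toy model with hypothetical helper names, not ndarray's internal code:

```rust
// Offset of a multidimensional index into a flat buffer: the dot product
// of the index with the strides (strides in element units, may be negative).
fn index_to_offset(idx: &[isize], strides: &[isize]) -> isize {
    idx.iter().zip(strides).map(|(i, s)| i * s).sum()
}

fn main() {
    // A 2x3 array in C order, stored flat: strides are [3, 1].
    let data = [1, 2, 3, 4, 5, 6];
    let strides = [3isize, 1];
    // Element at [1, 2] lives at offset 1*3 + 2*1 = 5.
    assert_eq!(data[index_to_offset(&[1, 2], &strides) as usize], 6);

    // Reversing the last axis: its stride becomes -1 and the base offset
    // moves to the end of each row (offset 2). No data is copied.
    let rev_strides = [3isize, -1];
    let base = 2isize;
    let off = base + index_to_offset(&[0, 0], &rev_strides);
    assert_eq!(data[off as usize], 3); // row [1, 2, 3] reversed starts with 3
}
```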

To implement these strides, three problems must be solved: stride legality, contiguity judgment, and the addressing algorithm. Legality ensures the program is safe when strides are customized; most importantly, the pointer must never reach the same memory location from two different coordinates while moving according to the strides, otherwise read and write errors could result. Contiguity refers to whether the array under a given stride is arranged contiguously in memory; if so, it can be processed as a single block of memory during copying and traversal, which is much, much faster. The addressing algorithm also matters, because an addressing error that reaches memory outside the array's data makes the program vulnerable. The calculation divides into computing the pointer position and computing the addressing offset, which will not be expanded here.
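The contiguity judgment, for the C-order case, can be sketched as follows; this is illustrative, not ndarray's exact implementation:

```rust
// An array is C-contiguous if, walking axes from last to first, each
// stride equals the product of the extents of all later axes
// (strides in element units; axes of extent 1 impose no constraint).
fn is_c_contiguous(shape: &[usize], strides: &[isize]) -> bool {
    let mut expected = 1isize;
    for (&dim, &st) in shape.iter().zip(strides).rev() {
        if dim > 1 && st != expected {
            return false;
        }
        expected *= dim as isize;
    }
    true
}

fn main() {
    // 2x3 C-order array: strides [3, 1] are contiguous.
    assert!(is_c_contiguous(&[2, 3], &[3, 1]));
    // Fortran-order strides [1, 2] are not C-contiguous.
    assert!(!is_c_contiguous(&[2, 3], &[1, 2]));
}
```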

Broadcast feature

Broadcasting is an extremely important concept in multidimensional array operations: it defines the interaction logic between arrays of different dimensions. In the simplest case, to multiply a two-dimensional array by a one-dimensional array, the one-dimensional array is conceptually replicated into a two-dimensional array, and then the corresponding elements of the two two-dimensional arrays are multiplied to obtain the desired result. Such operations are very common, for example computing the distances from multiple landmarks to an origin in geographic calculations, or computing dispersion in data mining and clustering. When multiple arrays of different dimensions are combined in an operation, the broadcast mechanism follows similar rules to expand each array to the same shape before performing the operation. In machine learning, the covariance matrix between two one-dimensional inputs is often computed, which requires exactly such a mechanism.
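The expansion rule can be sketched as a computation on shapes, following the NumPy-style convention of aligning trailing axes; this is illustrative, not ndarray's implementation:

```rust
// Combine two shapes under broadcasting: align from the trailing axis;
// each pair of extents must be equal or one of them must be 1, and the
// result takes the larger extent. Returns None if the shapes conflict.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading axes are treated as extent 1.
        let x = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let y = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (x, y) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}

fn main() {
    // A 2x3 array combined with a length-3 vector broadcasts to 2x3.
    assert_eq!(broadcast_shape(&[2, 3], &[3]), Some(vec![2, 3]));
    // A 4x1 column against a length-3 row broadcasts to 4x3.
    assert_eq!(broadcast_shape(&[4, 1], &[3]), Some(vec![4, 3]));
    // Mismatched extents cannot broadcast.
    assert_eq!(broadcast_shape(&[2, 3], &[4]), None);
}
```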

The ndarray community proposed implementing the broadcast mechanism six years ago, but it remained unsolved until 2021. This is not because broadcasting is unimportant, but because of the syntactic limitations of Rust.

Specifically, when two arrays are broadcast together, there is no way to write down the type of the return value. As mentioned earlier, ndarray uses static dimensions, that is, fixed-length arrays such as [usize; 1] for one dimension and [usize; 2] for two. To Rust, [usize; 1] and [usize; 2] are different types, for memory safety reasons. There are also usize arrays of non-fixed length, but such an array cannot be used as a return value: its size is determined at run time rather than compile time, and Rust requires that a function's parameters and return values have sizes known at compile time, so that the stack frame size is determined when the function is called. Therefore, when an n-dimensional array is returned by value, its dimension must also be determined: either a concrete 1, 2, 3, and so on, or, a bit more cleverly, the same dimension as the first input or the same dimension as the second input. This is not enough for broadcasting, which must return the larger of the two input dimensions. The logic sounds simple, but the compiler cannot express it, because it cannot perform this kind of inference in a function signature. C does not have this problem, because C has no concept of static dimensions: an array of any length is just a reference to an address. Dynamic languages such as Java and Python do not have the problem either, because almost all of their objects are reference types, which means dereferencing an object on every access, and that of course is not as efficient as C or Rust. Can Rust simply return a reference to an address as C does? No, because Rust, as a memory-safe language, has ownership restrictions: an n-dimensional array created inside the broadcast function cannot have its address returned, because its ownership ends when the function returns. What about returning ownership? Even less possible, because, as stated, its size cannot be determined at compile time.

So is there a solution? As mentioned earlier, ndarray also supports dynamic arrays as dimensions. Dynamic arrays here are types such as Vec and Box: they use smart pointers, so although they own dynamically allocated memory, the handle itself has a fixed size and can therefore be used as a return type. The author proposed in the community to use a dynamic array as the dimension of the broadcast return value, but the owner immediately rejected it because dynamic dimensions in Rust are too slow.

But there is a way. If, instead of making the function work out which of the two input dimensions is larger, we tell the compiler which dimension to return, then broadcasting can be done. Specifically, there are 8 dimension types: zero through six plus the dynamic dimension. If we write a macro that implements a broadcast function for each of the 8x8 = 64 pairings between them, the returned dimension is determined in every case. This approach is theoretically feasible, but implementation runs into a bigger problem. Imagine a function that needs to broadcast two arrays whose dimensions are generic: which of the 64 broadcast functions should it call? Should the function itself be implemented again for all 64 cases? Obviously not.

But this problem can be solved with associated types, a feature of Rust's trait system. We put the implementation of inter-array broadcasting in a custom trait that takes a generic parameter representing the array to broadcast with, and internally represents the resulting array with an associated type Output. We then implement the trait for all pairs of array dimensions using macros. Thus, when broadcasting between arrays of any dimensions, we can use the broadcast feature simply by adding a bound in the where clause requiring that the array implements the trait. In fact, those who know the Rust compiler well will know that during monomorphization the compiler expands all generics into concrete types, each expansion generating a separate copy of the code, so this method is essentially the same as the one above. However, the Rust compiler team is currently working on a polymorphization implementation that can generate a single copy of code for similar functions at compile time, which interested readers can explore on their own.

However, such a method would restrict broadcasting to arrays with the same element type, which is a bit too restrictive, and the function declarations would be too verbose. So, along the same lines, we implement the trait only for the dimensions of the arrays, and name it DimMax, meaning the larger of the two dimensions.
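The associated-type trick can be shown with a toy model. The name DimMax mirrors the design described above, but the marker types and impls here are illustrative stand-ins, not ndarray's real Dim types:

```rust
// Toy dimension markers standing in for ndarray's dimension types.
trait Dimension {
    const NDIM: usize;
}
struct Ix1;
struct Ix2;
impl Dimension for Ix1 { const NDIM: usize = 1; }
impl Dimension for Ix2 { const NDIM: usize = 2; }

// The associated type `Output` tells the compiler, at the impl site,
// which dimension a broadcast between `Self` and `Other` produces.
trait DimMax<Other> {
    type Output: Dimension;
}

// One impl per pair; ndarray generates these with a macro.
impl DimMax<Ix1> for Ix1 { type Output = Ix1; }
impl DimMax<Ix2> for Ix1 { type Output = Ix2; }
impl DimMax<Ix1> for Ix2 { type Output = Ix2; }
impl DimMax<Ix2> for Ix2 { type Output = Ix2; }

// Generic code only needs a `where` bound to name the output dimension.
fn broadcast_ndim<A, B>() -> usize
where
    A: DimMax<B>,
{
    <A::Output as Dimension>::NDIM
}

fn main() {
    assert_eq!(broadcast_ndim::<Ix1, Ix2>(), 2);
    assert_eq!(broadcast_ndim::<Ix2, Ix1>(), 2);
    assert_eq!(broadcast_ndim::<Ix1, Ix1>(), 1);
}
```

Because every pairing has its own impl, the compiler can resolve Output statically for any combination of dimensions without the caller enumerating the cases.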

So, can it be simplified further by removing the extra bound from the where clause and implementing the trait for any two dimensions D1 and D2 of arbitrary length? That sounds nice, but it is again constrained by Rust's syntax: the associated type must be determined at the definition site, either written manually or derived from the associated types of the input types. Personally, though, I think this problem could be solved: could the compiler's trait implementation mechanism be made a little smarter, for example by allowing static constants to be computed while compiling a trait implementation? Since the length of a static dimension must be a constant, it should be possible to compute at compile time, say, the maximum of two constants and thereby obtain a definite associated type. Of course, this is just my conjecture for now, and it needs further research into the Rust compiler. Still, the restriction can be relaxed in some common cases, namely broadcasting between identical dimensions and broadcasting with the zero dimension: in both cases the broadcast dimension is the dimension itself, so the trait can be added directly to the definition of the dimension, and in these two specific cases the extra where clause can be avoided.

Parallel computing acceleration

Finally, I would like to share the current status and development of ndarray in parallel computing. ndarray currently uses the Rayon library to implement multithreaded parallel acceleration in some scenarios. Rayon is an iterator-based multithreading library. Its core idea is to recursively split an iterator into several sub-iterators, distribute the iteration tasks among them, and then schedule them onto multiple threads with a work-stealing algorithm to achieve parallelism. An iterator must satisfy three requirements: 1. it can be iterated from both the front and the back; 2. its exact length is known; 3. it can be split at an intermediate index into two mutually exclusive sub-iterators that have the same properties as the parent, as shown in the figure.
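The splitting idea can be sketched with plain standard-library threads: divide the index range at the midpoint into two disjoint halves and process them concurrently. This is a toy model of what Rayon's scheduler does recursively, not how Rayon is implemented:

```rust
use std::thread;

// Split the slice at its midpoint into two mutually exclusive halves and
// sum them on two scoped threads, then combine the partial results.
fn parallel_sum(data: &[i64]) -> i64 {
    let mid = data.len() / 2;
    let (left, right) = data.split_at(mid);
    thread::scope(|s| {
        let l = s.spawn(|| left.iter().sum::<i64>());
        let r = s.spawn(|| right.iter().sum::<i64>());
        l.join().unwrap() + r.join().unwrap()
    })
}

fn main() {
    let v: Vec<i64> = (0..1000).collect();
    assert_eq!(parallel_sum(&v), 499500);
}
```

Rayon generalizes this by splitting adaptively until the pieces match the number of idle worker threads, which is why the split-at-an-index requirement above is essential.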

ndarray implements multithreaded acceleration for traversing the individual elements of an array, traversing along a given axis, and element-wise traversal when operating on the corresponding elements of two arrays. Later, Lanes iterator parallelism was added: if we view an n-dimensional array as an (n-1)-dimensional array with one axis removed, each element of that array becomes a one-dimensional vector, called a lane. Its main application scenario is matrix multiplication in more than two dimensions, where each element of the result matrix is the dot product of two lanes. Such computations are also very common in deep learning scenarios such as text classification and natural language processing.

In addition to multithreading, there is another important parallel acceleration method: SIMD (single instruction, multiple data), that is, performing multiple data computations within the execution of a single machine instruction. As a simple example, the ARM vaddq_s32 instruction computes the sum of two 128-bit vectors, each containing four 32-bit integers, in the execution time of a single instruction, and is therefore four times faster than an ordinary scalar loop. Different hardware architectures have their own SIMD instruction sets, such as AVX, AVX512, and SSE for x86, and NEON and ASIMD for ARM. Accelerating n-dimensional array operations in ndarray with SIMD instructions takes three steps. 1. Make the Rust standard library support the SIMD instructions of the various architectures. This is exactly what the official SIMD working group is doing, and it is a lot of work: over 100,000 lines of code to support the ARM architecture alone and close to 200,000 for x86, not counting changes to the compiler. 2. SIMD acceleration of general operations based on those instructions. For example, universal intrinsics, which originated in OpenCV, provide general computing interfaces such as vector dot product, matrix multiplication, distance calculation, and sorting, together with their implementations; computations can be converted into SIMD vector calculations through these interfaces to accelerate algorithms. However, universal intrinsics are not convenient for the average developer, because they do not automatically generate SIMD instructions matching the length of the CPU's vector registers. The user must select SIMD instructions manually, which may leave SIMD performance underutilized or crash the program through mismatched instructions.
This leads to the third step: automatic detection of the CPU's SIMD configuration and automatic selection of SIMD instructions, making all SIMD instructions transparent to the user. Finally, there is a potential fourth step: automatic vectorization by the compiler, so that SIMD instructions are generated automatically for all operations. This is an extremely difficult direction; many LLVM contributors are working on it, and interested readers can try it.
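Runtime detection of the CPU's SIMD capability can be sketched with the standard library's std::arch on x86_64, falling back to scalar code elsewhere. The function names are illustrative; the intrinsics and the detection macro are the real std ones:

```rust
// Add two 8-lane f32 arrays, using AVX when the running CPU supports it.
fn add_slices(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check: only then is calling the AVX path sound.
        if is_x86_feature_detected!("avx") {
            return unsafe { add_avx(a, b) };
        }
    }
    // Scalar fallback for other architectures or older CPUs.
    let mut out = [0.0; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;
    // One 256-bit instruction adds all eight f32 lanes at once.
    let va = _mm256_loadu_ps(a.as_ptr());
    let vb = _mm256_loadu_ps(b.as_ptr());
    let vc = _mm256_add_ps(va, vb);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), vc);
    out
}

fn main() {
    let a = [1.0f32; 8];
    let b = [2.0f32; 8];
    assert_eq!(add_slices(&a, &b), [3.0f32; 8]);
}
```

This is the shape of "transparent SIMD": the caller sees one portable function, and the instruction selection happens behind it.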

Here is another introduction, to the stdarch repository in the Rust standard library. As part of the standard library, it provides all Rust developers with SIMD instruction sets for a variety of common hardware architectures. It is developed and maintained by the official SIMD working group and library team members. However, with the exception of the x86 platform, which was stabilized at the end of last year, all other architectures are unstable, so the full SIMD feature set is not yet available in stable Rust and will require contributions from the community. stdarch is closely related to the Rust compiler and LLVM: stdarch is responsible for classifying, encapsulating, and testing the instruction sets of different architectures and versions and for providing users with the corresponding functional interfaces, while the underlying assembly implementation and optimization live in the Rust compiler and LLVM, so stdarch needs to wrap the compiler and LLVM implementations. There are two cases. One is SIMD interfaces common to all architectures, such as addition, subtraction, multiplication, division, and bit operations; these are implemented by statically calling the related LLVM interfaces in the compiler's code generation, and are then introduced and wrapped by stdarch using the extern "platform-intrinsic" ABI. The other is architecture-specific instructions, such as the x86 VCOMI instruction or the ARM VSLI instruction, which are generally provided for specific computing scenarios; VSLI, for instance, performs a shift left and insert. In this case the relevant LLVM implementation is called and wrapped in the form of static links.
While the interfaces provided in stdarch still distinguish between architectures and vector register lengths, the SIMD transparency (or uSIMD) I mentioned builds on this to shield hardware differences from the user and provide more general computing interfaces.

This article has started from the concrete problems and technical solutions of the n-dimensional array library ndarray and extended into some reflections on the Rust language and its technology in the field of scientific computing. I hope it helps with your Rust learning and development.

From: the March issue of Rust Chinese Magazine (Rust_Magazine)