Preface

I picked up TVM again the other day, wondering whether it could indirectly replace TensorRT as a backend for GPU server deployments. TensorRT is certainly powerful on its own GPUs, and it is an "open source project" that NVIDIA has focused on. But TensorRT has a few minor drawbacks:

  • Optimized inference engines are bound to a specific GPU; for example, an engine generated on a 1080 Ti cannot be used on a 2080 Ti
  • Newer versions of TensorRT require newer versions of CUDA, which in turn require newer drivers
  • TensorRT is easy to use, but its inference optimization is closed source: a black box, like deep-learning alchemy. We can't be sure in advance that a model will convert successfully, and if it doesn't, we can't patch the source ourselves, which always feels like a shortcoming

The complaints above are just light-hearted grumbling (don't flame me). None of these is really a big problem if you're just tinkering on your own. In a production environment, though, with all its rules and constraints, the situations above can cause real friction, which makes TensorRT less pleasant to use.

If you don’t know what TVM is, you can check it out or take a look at the two articles I wrote earlier:

  • Step by step interpretation of the neural network compiler TVM(I) – a simple example
  • Step by step interpretation of neural network compiler TVM(2) – using TVM to complete the deployment of C++ end

However, this article is not about TVM. Besides the two articles above, I have drafted some further TVM analyses (parts three and four), but out of laziness and lack of time I haven't tidied them up yet; I'll post them once I have. The purpose of this article is to introduce Halide, which is closely related to TVM, so let me take this opportunity to give it a quick introduction.

What is Halide

Halide is a domain-specific language (DSL) for image processing that uses C++ as its host language. Its main purpose is to accelerate algorithms at the hardware and software level, independently of the design of the algorithm itself, which does require some understanding of the underlying hardware. Both traditional image processing and deep-learning applications draw on Halide's ideas to some extent.

For example, some algorithms in OpenCV (a traditional image-processing library) use a Halide backend, and TVM (a neural-network compiler) uses Halide's ideas to optimize neural-network operators.

So why Halide? Look at the picture above: for the same (local) Laplacian transform, a straightforward C++ implementation of the algorithm was slow; Adobe spent three months hand-optimizing it to make it about 10x faster; yet with Halide, just a few lines of code made the algorithm 20x faster than the straightforward version.

In short, Halide saves us a great deal of time that would otherwise go into hand-optimizing low-level code, and lets us focus purely on the design of the algorithm.

The mapping from the algorithm to its low-level optimization is handled by Halide for us.

Why can Halide optimize algorithms

Halide's defining characteristic is the separation between the implementation of the image computation (Functions and Expressions) and the scheduling of that computation onto hardware units, which is organized around Functions. In the end, the whole image algorithm is lowered into an efficient nest of FOR loops. Halide takes care of partitioning the data ranges and loading the data within those loops, and can overlap data loading with computation to hide load latency. The programmer specifies the schedule: policies such as hardware buffer sizes and line-buffer settings, so that efficient scheduling can be tailored to the characteristics of different hardware, all without modifying the implementation of the algorithm itself.

This part is excerpted from Zhihu: www.zhihu.com/question/29…

The “triangular forces” that determine the performance of an algorithm when executed on a hardware platform are as follows.

The design of the algorithm itself is one aspect; a good algorithm is often far more efficient. Another aspect is how the order of computation in the algorithm is organized, and Halide can change that order for a given hardware platform:

In particular, Halide can achieve parallelism and good cache locality for algorithms on the hardware platform:

For example

Let's take the classic image-blur example from Halide (you can run the following code on your own machine) and use OpenCV to manipulate the image:

First, we design an operation function that can blur the image:

// in is the input original image; blury is the output blurred image
void box_filter_3x3(const Mat &in, Mat &blury)
{
    Mat blurx(in.size(), in.type());

    for (int x = 1; x < in.cols - 1; x++)
        for (int y = 0; y < in.rows; y++)
            blurx.at<uint8_t>(y, x) = static_cast<uint8_t>(
                    (in.at<uint8_t>(y, x - 1) + in.at<uint8_t>(y, x) + in.at<uint8_t>(y, x + 1)) / 3);

    for (int x = 0; x < in.cols; x++)
        for (int y = 1; y < in.rows - 1; y++)
            blury.at<uint8_t>(y, x) = static_cast<uint8_t>(
                    (blurx.at<uint8_t>(y - 1, x) + blurx.at<uint8_t>(y, x) + blurx.at<uint8_t>(y + 1, x)) / 3);
}

The blur operation is very simple: we first average each pixel with its two horizontal neighbors along the x-axis, then do the same along the y-axis. This is equivalent to convolving the whole image with a 3x3 averaging kernel, so we won't dwell on it here.

We prepared a 1920x1080 image, ran the operation above on it 100 times, and recorded the time: Time used: 4521.72 ms.

Then we simply change the order of execution by changing the order of x and y in the nested loop above:

Mat blurx(in.size(), in.type());

// loop nest interchanged: y is now the outer loop
for (int y = 0; y < in.rows; y++)
    for (int x = 1; x < in.cols - 1; x++)
        blurx.at<uint8_t>(y, x) = static_cast<uint8_t>(
                (in.at<uint8_t>(y, x - 1) + in.at<uint8_t>(y, x) + in.at<uint8_t>(y, x + 1)) / 3);

// loop nest interchanged: y is now the outer loop
for (int y = 1; y < in.rows - 1; y++)
    for (int x = 0; x < in.cols; x++)
        blury.at<uint8_t>(y, x) = static_cast<uint8_t>(
                (blurx.at<uint8_t>(y - 1, x) + blurx.at<uint8_t>(y, x) + blurx.at<uint8_t>(y + 1, x)) / 3);

Again we run it 100 times and record the time: Time used: 3992.35 ms. The second version runs faster than the first. You might object that this isn't much faster, but this is only one sample image. If the image's aspect ratio were more extreme (say 1:10), or if we had to process tens of thousands of such operations at a time, then once the scale grows, the gap between the two would be far from trivial.
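To convince ourselves that the two loop orders really compute the same result, here is a minimal self-contained sketch (using a plain std::vector instead of cv::Mat; the function names are mine, purely for illustration) that runs both versions on the same input and compares the outputs:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy 3x3 box blur over a W x H grayscale image stored row-major.
// The two versions differ only in loop nesting order; results are identical.
using Image = std::vector<uint8_t>;

Image blur_x_outer(const Image& in, int W, int H) {
    Image blurx(in), blury(in);            // copies keep borders well-defined
    for (int x = 1; x < W - 1; x++)        // x outer: strided column walk
        for (int y = 0; y < H; y++)
            blurx[y * W + x] = static_cast<uint8_t>(
                (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3);
    for (int x = 0; x < W; x++)
        for (int y = 1; y < H - 1; y++)
            blury[y * W + x] = static_cast<uint8_t>(
                (blurx[(y - 1) * W + x] + blurx[y * W + x] + blurx[(y + 1) * W + x]) / 3);
    return blury;
}

Image blur_y_outer(const Image& in, int W, int H) {
    Image blurx(in), blury(in);
    for (int y = 0; y < H; y++)            // y outer: sequential row walk
        for (int x = 1; x < W - 1; x++)
            blurx[y * W + x] = static_cast<uint8_t>(
                (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3);
    for (int y = 1; y < H - 1; y++)
        for (int x = 0; x < W; x++)
            blury[y * W + x] = static_cast<uint8_t>(
                (blurx[(y - 1) * W + x] + blurx[y * W + x] + blurx[(y + 1) * W + x]) / 3);
    return blury;
}
```

The y-outer version walks memory sequentially (row-major layout), which is exactly why it is faster on real hardware even though the arithmetic is identical.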

Principle at hardware level

Why is this? Both versions execute the same algorithm, yet at different speeds. The difference has little to do with the algorithm itself and much to do with the design of the hardware, in particular parallelism and locality.

As we can see below, Adobe's engineers optimized the algorithm above at the hardware level, using SIMD, tiling, unrolling, and vectorization, making it 10x faster than before. This exploits the hardware's full performance without changing the design of the algorithm itself.

The official examples

Halide is a DSL and is easy to use. Here we download the source code and compile it; once that's done we can use it (the compilation steps are omitted here; you can find them on the official website):

First we include the Halide header and a few others:

#include "Halide.h"
#include <stdio.h>
#include <algorithm>

using namespace Halide;

Before using Halide for the first time, you need to know some of the syntax in Halide:

Then we use Halide to define two variables. On their own they are meaningless; we name them with the strings "x" and "y":

Var x("x"), y("y");

Then use Func to define a function to be executed and name it gradient.

Func gradient("gradient");

Now we define the execution logic of the function at each point: the value at (x, y) is x + y. Here x and y are both Vars, and the expression x + y is automatically converted to an Expr when assigned to gradient; think of it as handing the algebraic expression x + y to gradient. Finally, realize is called to execute the whole thing:

gradient(x, y) = x + y;
// Nothing is compiled or executed until realize is called
Buffer<int> output = gradient.realize(4, 4);

This logic is expressed in C++ as:

for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++) {
        printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
    }
}

The Halide pseudocode for the above implementation is:

produce gradient:
  for y:
    for x:
      gradient(...) = ...

Halide's default computation order is row-major, where x is the position of an element within a row and y is the position within a column:

If we change the order in which y and x are computed:

// Put y before x
gradient.reorder(y, x);

Now the computation proceeds column-first:

The corresponding pseudocode is:

produce gradient_col_major:
  for x:
    for y:
      gradient_col_major(...) = ...
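The document gives the C++ equivalent only for the default order, so here is a plain C++ sketch of the reordered, column-major traversal (the helper name is mine); it records the visit order to make the effect of reorder concrete:

```cpp
#include <utility>
#include <vector>

// Column-major traversal equivalent to gradient.reorder(y, x):
// x becomes the outer loop, so each column is finished before the next.
std::vector<std::pair<int, int>> visit_col_major(int width, int height) {
    std::vector<std::pair<int, int>> order;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            order.push_back({x, y});   // evaluate gradient(x, y) = x + y here
    return order;
}
```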

Splitting (split)

We can split any dimension. Suppose we still compute row-first, but split the x-axis into an outer loop and an inner loop, leaving the y-axis unchanged:

Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);

The corresponding C++ code is implemented as follows:

for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        for (int x_inner = 0; x_inner < 2; x_inner++) {
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}

Fusing (fuse)

Or, instead of splitting, we can fuse the x and y axes into a single loop:

Var fused;
gradient.fuse(x, y, fused);

The corresponding C++ implementation code is:

for (int fused = 0; fused < 4*4; fused++) {
    int y = fused / 4;
    int x = fused % 4;
    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
}

Keep in mind, however, that the split and fuse operations above are only demonstrations of what Halide can do. By themselves they are not actually useful, in the sense that the actual order of computation does not change.
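We can check that claim directly: the split loop nest below visits pixels in exactly the same order as the original row-major loops (a toy check in plain C++, not Halide itself; function names are mine):

```cpp
#include <utility>
#include <vector>

using Order = std::vector<std::pair<int, int>>;

// Plain row-major traversal.
Order plain(int w, int h) {
    Order o;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            o.push_back({x, y});
    return o;
}

// Same traversal with x split by a factor of 2, as in gradient.split(x, ..., 2).
// Assumes w is divisible by 2 for simplicity.
Order split_x(int w, int h) {
    Order o;
    for (int y = 0; y < h; y++)
        for (int x_outer = 0; x_outer < w / 2; x_outer++)
            for (int x_inner = 0; x_inner < 2; x_inner++)
                o.push_back({x_outer * 2 + x_inner, y});
    return o;
}
```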

Tiling (tile)

This step brings us to an important part of Halide: we split both the x and y axes by a factor of 4 and reorder the traversal path:

Var x_outer, x_inner, y_outer, y_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.split(y, y_outer, y_inner, 4);
gradient.reorder(x_inner, y_inner, x_outer, y_outer);

// The steps above can be abbreviated as
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);

The corresponding C++ calculation code is:

for (int y_outer = 0; y_outer < 2; y_outer++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        for (int y_inner = 0; y_inner < 4; y_inner++) {
            for (int x_inner = 0; x_inner < 4; x_inner++) {
                int x = x_outer * 4 + x_inner;
                int y = y_outer * 4 + y_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
    }
}

The visualization looks like this (note that the sample size here is 8×8):

The advantage of tiling is that it makes full use of adjacent pixels. In a blur, for example, neighboring output pixels read overlapping input data (the same element is used more than once), so tiling can greatly improve performance.

Vectorizing (vectorize)

Vectorization means using the CPU's SIMD units to process multiple data elements with a single instruction, taking full advantage of the hardware. On x86, for example, SSE instructions provide this capability.

In Halide, we first split the x loop in two with an inner factor of 4 (the inner x loop runs four times, and the outer loop count follows from the total, e.g. 2 x 4 = 8), and then turn the inner x loop into a vector:

Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);

Expressed in C++ is:

for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // The loop over x_inner has gone away, and has been
        // replaced by a vectorized version of the
        // expression. On x86 processors, Halide generates SSE
        // for all of this.
        int x_vec[] = {x_outer * 4 + 0,
                        x_outer * 4 + 1,
                        x_outer * 4 + 2,
                        x_outer * 4 + 3};
        int val[] = {x_vec[0] + y,
                        x_vec[1] + y,
                        x_vec[2] + y,
                        x_vec[3] + y};
        printf("Evaluating at <%d, %d, %d, %d>, <%d, %d, %d, %d>:"
                " <%d, %d, %d, %d>\n",
                x_vec[0], x_vec[1], x_vec[2], x_vec[3],
                y, y, y, y,
                val[0], val[1], val[2], val[3]);
    }
}

After visualization it becomes clear: the outer x loop runs twice per row, and the inner x has become a vector that can be computed by a single instruction:

Unrolling (unroll)

If multiple pixels in the image share overlapping data, we can unroll the loop so that the shared data is computed or loaded only once.

Below we again split the x-axis into inner and outer loops. Since the inner index only ever takes the values 0 and 1, unrolling the inner x loop means we no longer pay the overhead of incrementing and testing the inner loop variable on every iteration:

Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
gradient.unroll(x_inner);

The corresponding C++ code is:

printf("Equivalent C:\n");
for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // Instead of a for loop over x_inner, we get two
        // copies of the innermost statement.
        {
            int x_inner = 0;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
        {
            int x_inner = 1;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}

Fusing, Tiling, and Parallelizing

In this step we combine fusing, tiling, and parallelism on an 8×8 image. First we tile both the x and y axes by a factor of 4. Then we fuse the outer y and outer x loops (2 x 2 = 4 tiles), and finally run the fused loop in parallel, so all four tiles execute simultaneously:

Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);

The corresponding C++ code is:

// This outermost loop should be a parallel for loop, but that's hard in C.
for (int tile_index = 0; tile_index < 4; tile_index++) {
    int y_outer = tile_index / 2;
    int x_outer = tile_index % 2;
    for (int y_inner = 0; y_inner < 4; y_inner++) {
        for (int x_inner = 0; x_inner < 4; x_inner++) {
            int y = y_outer * 4 + y_inner;
            int x = x_outer * 4 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}

After visualization we can see that in the 8×8 image the top-left, bottom-left, top-right, and bottom-right quadrants run almost simultaneously (one tile_index each). Each quadrant is computed the same way as in the tiling section above, except that this time the tiles run in parallel:
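The "parallel for loop" the C comment asks for can be sketched with std::thread, one thread per fused tile index (a toy version, names mine; real Halide uses a thread pool rather than spawning a thread per tile). Because the tiles are disjoint, the parallel result matches the serial one exactly:

```cpp
#include <thread>
#include <vector>

// Evaluate gradient(x, y) = x + y over an 8x8 image, one thread per 4x4 tile.
std::vector<int> parallel_tiles() {
    std::vector<int> out(8 * 8);
    std::vector<std::thread> workers;
    for (int tile_index = 0; tile_index < 4; tile_index++) {
        workers.emplace_back([&out, tile_index] {
            int y_outer = tile_index / 2;
            int x_outer = tile_index % 2;
            for (int y_inner = 0; y_inner < 4; y_inner++)
                for (int x_inner = 0; x_inner < 4; x_inner++) {
                    int y = y_outer * 4 + y_inner;
                    int x = x_outer * 4 + x_inner;
                    out[y * 8 + x] = x + y;  // tiles are disjoint: no data race
                }
        });
    }
    for (auto& t : workers) t.join();
    return out;
}
```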

Putting it all together

This time let's take a larger image: we feed in a 350 x 250 input and optimize the operation on it:

First we tile it with factors of 64 x 64, then we fuse the outer y and x loop variables, and finally we parallelize. (Note that neither 350 nor 250 is divisible by 64; don't worry, Halide automatically handles the leftover or overhanging parts.)
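How can the leftovers be handled? One strategy, a simplified sketch of what is visible in the generated code later in this section, is to clamp each tile's starting offset with a min, so that the final, partial tile is shifted inward to overlap its neighbor rather than run out of bounds (the helper name is mine):

```cpp
#include <algorithm>
#include <vector>

// For an axis of the given extent split by `factor`, compute each tile's
// start offset, clamping so the last tile is shifted inward and never
// reads out of bounds. Assumes extent >= factor. (Simplified sketch, not
// Halide's exact code.)
std::vector<int> tile_starts(int extent, int factor) {
    std::vector<int> starts;
    int num_tiles = (extent + factor - 1) / factor;       // ceil division
    for (int t = 0; t < num_tiles; t++)
        starts.push_back(std::min(t * factor, extent - factor));
    return starts;
}
```

For extent 350 and factor 64 this gives 6 tiles; the last one starts at 286 instead of 320, so it overlaps the previous tile and ends exactly at 350.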

Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient_fast
    .tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
    .fuse(x_outer, y_outer, tile_index)
    .parallel(tile_index);
// Calls can be chained with '.' because each scheduling method returns
// a reference to the object itself

That's not all. We have tiled the whole image into 6 x 4 = 24 tiles; in this step we tile each of those tiles again, this time in a 4 x 2 pattern: y_inner is split by 2 (the new inner variable is named y_pairs) and x_inner is split by 4 (the new inner variable is named x_vectors). Then we vectorize x_vectors and unroll y_pairs.

Var x_inner_outer, y_inner_outer, x_vectors, y_pairs;
gradient_fast
    .tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
    .vectorize(x_vectors)
    .unroll(y_pairs);

The visualization results are as follows:

The corresponding C++ code is:

for (int tile_index = 0; tile_index < 6 * 4; tile_index++) {
    int y_outer = tile_index / 6;
    int x_outer = tile_index % 6;
    for (int y_inner_outer = 0; y_inner_outer < 64 / 2; y_inner_outer++) {
        for (int x_inner_outer = 0; x_inner_outer < 64 / 4; x_inner_outer++) {
            // We're vectorized across x
            int x = std::min(x_outer * 64, 350 - 64) + x_inner_outer * 4;
            int x_vec[4] = {x + 0,
                            x + 1,
                            x + 2,
                            x + 3};

            // And we unrolled across y
            int y_base = std::min(y_outer * 64, 250 - 64) + y_inner_outer * 2;
            {
                // y_pairs = 0
                int y = y_base + 0;
                int y_vec[4] = {y, y, y, y};
                int val[4] = {x_vec[0] + y_vec[0],
                              x_vec[1] + y_vec[1],
                              x_vec[2] + y_vec[2],
                              x_vec[3] + y_vec[3]};

                // Check the result.
                for (int i = 0; i < 4; i++) {
                    if (result(x_vec[i], y_vec[i]) != val[i]) {
                        printf("There was an error at %d %d!\n",
                               x_vec[i], y_vec[i]);
                        return -1;
                    }
                }
            }
            {
                // y_pairs = 1
                int y = y_base + 1;
                int y_vec[4] = {y, y, y, y};
                int val[4] = {x_vec[0] + y_vec[0],
                              x_vec[1] + y_vec[1],
                              x_vec[2] + y_vec[2],
                              x_vec[3] + y_vec[3]};

                // Check the result.
                for (int i = 0; i < 4; i++) {
                    if (result(x_vec[i], y_vec[i]) != val[i]) {
                        printf("There was an error at %d %d!\n",
                               x_vec[i], y_vec[i]);
                        return -1;
                    }
                }
            }
        }
    }
}

This concludes the basic operations in Halide.

One more thing

Oh, and if we wrote the blur algorithm described at the beginning of this article in Halide, it would look like this:

Func blur_3x3(Func input) {
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
  blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}

This famous snippet also appears on the official homepage as a showcase example.

The characteristics of Halide

As a low-level optimization framework, Halide has a few notable features:

Explicit programmer control

The compiler does exactly what you say.

Schedules cannot influence correctness. Exploration is fast and easy.

Explicit programmer control means that the order of computation we specify (independent of the algorithm itself) is exactly what happens: the compiler does what we say, and since schedules cannot affect correctness, it is fast and safe to experiment with them.

Stochastic search (autotuning)

Pick your favorite high-dimensional search.

Autotuning can be done by any optimizer with a search space. The best optimization factors are not known in advance, and different configurations yield different execution speeds on different hardware, so it makes sense to search the space of schedule parameters automatically and randomly.
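As a toy illustration of that idea (this is not Halide's actual autoscheduler; all names and the cost function are made up): randomly sample candidate schedule parameters, score each one, and keep the best. In a real autotuner the cost would come from timing the compiled pipeline on the target hardware; here it is a synthetic stand-in:

```cpp
#include <random>
#include <vector>

// Synthetic cost model standing in for a real benchmark run; its optimum
// happens to be a tile size of 64. Purely illustrative.
static int cost(int tile) { return (tile - 64) * (tile - 64); }

// Toy random search over candidate tile sizes.
int random_search(int trials, unsigned seed) {
    std::mt19937 gen(seed);
    std::vector<int> candidates = {8, 16, 32, 64, 128, 256};
    int best = candidates[0];
    for (int i = 0; i < trials; i++) {
        int c = candidates[gen() % candidates.size()];
        if (cost(c) < cost(best)) best = c;   // keep the best schedule found
    }
    return best;
}
```

With enough trials the search converges on the candidate with the lowest cost; the point is that no analytical model of the hardware is needed, only the ability to measure.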

Metaprogramming

Halide's design is closely tied to metaprogramming. Both its design and its execution model follow the metaprogramming idea: there is no concrete execution logic before the code is compiled, and the execution logic only takes shape at compile time.

Other notes

As a low-level optimizer independent of any particular algorithm, Halide naturally has many applications in today's deep learning. The OpenCV library uses Halide to optimize its low-level neural-network operators. The benchmark results are here, and they show that the Halide backend is not yet as fast as the plain C++ implementation. It is not clear whether the reason lies in Halide's own design or in how well Halide's optimizations fit neural-network operators; either way, it will take some time before Halide's real acceleration potential is realized (just as TVM also required a long wait).

Related questions:

Stackoverflow.com/questions/4…

By the way, Halide runs in two modes: JIT and AOT. In JIT mode, the algorithm and Halide's code generator are wrapped directly into a class that is called from other parts of the program and then compiled at run time; this is the convenient way to use it.

In embedded or cross-compilation environments, AOT mode is generally used. Here, a compile function is called ahead of time to run the algorithm through Halide's code generator for the target machine, producing a .o object file and a .h header file. The application project on the target machine then calls the algorithm's compute function through the header file and links the .o file at build time, so the Halide implementation of the algorithm can run in the target program.

Afterword

This article only briefly introduces the basics of Halide. To understand Halide deeply, read the official tutorials or the source code. Whether we are algorithm engineers designing algorithms or low-level engineers porting them to hardware platforms, Halide's ideas are worth our study and reflection.

Halide has a wide range of applications. The reason I wanted to learn about Halide is that I use the TVM library, which applies Halide's ideas to optimize neural-network operators and has achieved good results. In TVM we can also implement some of our own operators this way; it is compatible with Halide and easier to use.

There will be more TVM articles later... I'm cramming hard. If anything here is wrong, feel free to point it out.

References:

The images used above are partly derived from this:

  • Stellar.mit.edu/S/course/6/…
  • Halide’s official website: Halide-lang.org/

Follow me

  • If you are like-minded with me, Lao Pan is willing to communicate with you;
  • If you like Lao Pan’s content, welcome to follow and support.
  • If you like my articles, please give them a like 👍, a bookmark 📁, and a comment 💬 ~

If you want to know how Lao Pan learned and which pitfalls he stepped in, or want to talk with him about your own problems, follow the public account "Oldpan blog". Lao Pan will also share some of his private stash, hoping it helps you; click the mysterious portal to get it.