Written by | BBuf

A while ago I answered the Zhihu question "How to contribute to PyTorch?" (www.zhihu.com/question/50… ). In that answer I mentioned that while developing operators for OneFlow last year, I found several bugs in PyTorch operators with the help of the operator AutoTest framework and reported or fixed them in PyTorch. But that answer did not explain what the AutoTest framework looks like or the rationale behind it.

Therefore, this article introduces OneFlow's operator AutoTest framework and shows how the OneFlow deep learning framework gracefully handles the operator alignment task during operator development (the framework was originally developed by @daquexian and later extended and enriched by me and other colleagues into its current form). The AutoTest framework can also be easily ported to other deep learning training frameworks.

Github.com/Oneflow-Inc…

1. Traditional operator alignment

Not limited to OneFlow, any deep learning training framework, whether written by an organization or an individual, needs to verify the correctness of its operator implementations. So, how do deep learning frameworks generally verify operator correctness?

Taking Baidu's PaddlePaddle as an example, operator correctness is generally verified against results obtained from other standard libraries (for example, convolution operators are verified against cuDNN's convolution, and the erf operator is verified against SciPy's erf), or against results simulated with NumPy (for example, the full operator is verified with a NumPy simulation). PyTorch's tests also hard-code some test samples, comparing the known answers for fixed inputs against the operator's outputs to judge the correctness of the operator implementation.
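
To make the contrast concrete, here is a tiny illustration of the "compare against a standard library" style of test, using SciPy's erf as the reference (a minimal sketch written for this article, not Paddle's or OneFlow's actual test code):

import numpy as np
from scipy import special
import oneflow as flow

# generate a fixed random input and compare OneFlow's erf against SciPy's erf
x = np.random.randn(8).astype(np.float32)
out = flow.erf(flow.tensor(x)).numpy()
ref = special.erf(x)
assert np.allclose(out, ref, rtol=1e-4, atol=1e-5)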

There is nothing wrong with these methods, but writing such tests takes a lot of manpower, and in the early stage of operator development some corner cases are easy to overlook. In the case of OneFlow, since the operator behavior is aligned with PyTorch, what kind of test code can fully verify the correctness of the transposed convolution Op in various situations? One way is to enumerate each parameter:

import torch
import numpy as np
import oneflow as flow

# enumerate parameter combinations, build the same ConvTranspose2d in both
# frameworks with identical weights, and compare forward outputs and input gradients
for N in range(1, 5):
    for C_in in range(1, 10):
        for L_in in range(1, 10):
            for H_in in range(1, 10):
                for C_out in range(1, 10):
                    for Ksize in range(1, 10):
                        for Pad in range(1, 10):
                            for Dilation in range(1, 10):
                                for Stride in range(1, min(L_in, H_in)):
                                    for OutPad in range(1, min(Dilation, Stride)):
                                        try:
                                            torch_input = torch.randn(N, C_in, L_in, H_in)
                                            flow_input = flow.tensor(torch_input.numpy())
                                            torch_input.requires_grad = True
                                            flow_input.requires_grad = True
                                            torch_m = torch.nn.ConvTranspose2d(in_channels=C_in, out_channels=C_out, kernel_size=Ksize, padding=Pad, stride=Stride,
                                                output_padding=(OutPad), dilation=Dilation, bias=False)
                                            flow_m = flow.nn.ConvTranspose2d(in_channels=C_in, out_channels=C_out, kernel_size=Ksize, padding=Pad, stride=Stride,
                                                output_padding=(OutPad), dilation=Dilation, bias=False)
                                            flow_m.weight.data = flow.tensor(torch_m.weight.data.detach().numpy(), requires_grad=True)
                                            torch_out = torch_m(torch_input)
                                            flow_out = flow_m(flow_input)
                                            torch_out = torch_out.sum()
                                            flow_out = flow_out.sum()
                                            assert(np.allclose(torch_out.detach().numpy(), flow_out.detach().numpy(), 1e-06, 1e-06)), "forward not equal"
                                            torch_out.backward()
                                            flow_out.backward()
                                            print(torch_input.grad.detach().numpy())
                                            print(flow_input.grad.detach()[:N, :C_in, :L_in, :H_in].numpy())
                                            assert(np.allclose(torch_input.grad.detach().numpy(), flow_input.grad.detach()[:N, :C_in, :L_in, :H_in].numpy(), 1e-03, 1e-03)), "backward not equal"
                                        except Exception as e:
                                            print('Input Param Error')

This approach tests thoroughly, but it also has drawbacks. First, how should the upper bound of the enumeration be determined? If a large upper bound is given, the validation time of the operator becomes too long to be usable in CI; if the upper bound is small, some corner cases may be missed, making the test incomplete and increasing the risk of operator bugs.

To solve these problems of operator testing, @daquexian developed the operator AutoTest framework to handle the alignment of OneFlow operators with PyTorch operators. Later, on this basis, I also added some other features to the AutoTest framework, and I feel it is now in fairly good shape, so next I will give a comprehensive introduction to it.

The entire AutoTest framework consists of only two Python files, and it can be easily ported to any other deep learning framework for operator alignment tasks.

1.Github.com/Oneflow-Inc…

2.Github.com/Oneflow-Inc…

2. Operator AutoTest framework usage

Before introducing the principles, let's look at how the AutoTest framework is used. Taking the deconvolution (transposed convolution) operator above as an example, with the AutoTest framework the operator alignment test can be written as follows:

@autotest()
def test_deconv2d_with_random_data(test_case):
    channels = random(1, 6)
    m = torch.nn.ConvTranspose2d(
        in_channels=channels,
        out_channels=random(1, 20),
        kernel_size=random(1, 4),
        stride=random() | nothing(),
        padding=random(1, 3).to(int) | nothing(),
        dilation=random(1, 5) | nothing(),
        groups=random(1, 5) | nothing(),
        padding_mode=constant("zeros") | nothing(),
    )
    m.train(random())
    device = random_device()
    m.to(device)
    x = random_pytorch_tensor(ndim=4, dim1=channels).to(device)
    y = m(x)
    return y

Those familiar with PyTorch will notice that this operator test code has the same style as PyTorch code. Indeed, the AutoTest framework behaves like a "high level" PyTorch: its interface is the same as PyTorch's, but for a given input it runs both OneFlow and PyTorch and records every intermediate tensor as well as its gradient tensor. It then checks that the shapes and values of the corresponding OneFlow and PyTorch tensors are consistent, which completes the automatic test. The details of this process are explained later.

We can look at another example of testing the matmul operator:

@autotest()
def test_flow_matmul_with_random_data(test_case):
    k = random(1, 6)
    x = random_pytorch_tensor(ndim=2, dim1=k)
    y = random_pytorch_tensor(ndim=2, dim0=k)
    z = torch.matmul(x, y)
    return z

Here we construct two random tensors x and y with the random_pytorch_tensor method. Their shapes are [m, k] and [k, n] respectively, where the dimension values are randomly generated.

The automatic test framework then takes the randomly generated legal Op parameters and input tensors with exactly the same values and types (one copy for PyTorch and one for OneFlow), runs the PyTorch and OneFlow code, and carries out the automatic test of the operator. Since the usage of the framework is aligned with PyTorch's, writing test samples after developing an operator becomes easy; there is no need to pull in other standard libraries or use NumPy to simulate the operator's forward and backward computation, which greatly frees up productivity.
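
For intuition, here is a manual sketch (written for this article, not generated by the framework) of what the framework conceptually does for the matmul example above: build identical inputs for both frameworks, run both, then compare the outputs and the gradients:

import numpy as np
import torch
import oneflow as flow

m, k, n = 4, 3, 5
np_x = np.random.randn(m, k).astype(np.float32)
np_y = np.random.randn(k, n).astype(np.float32)

# one copy of the same data for each framework
torch_x = torch.tensor(np_x, requires_grad=True)
torch_y = torch.tensor(np_y, requires_grad=True)
flow_x = flow.tensor(np_x, requires_grad=True)
flow_y = flow.tensor(np_y, requires_grad=True)

torch_z = torch.matmul(torch_x, torch_y)
flow_z = flow.matmul(flow_x, flow_y)
assert np.allclose(torch_z.detach().numpy(), flow_z.detach().numpy(), 1e-4, 1e-5)

# compare the gradients as well
torch_z.sum().backward()
flow_z.sum().backward()
assert np.allclose(torch_x.grad.numpy(), flow_x.grad.numpy(), 1e-4, 1e-5)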

In addition, as long as enough tests are run, corner cases where a OneFlow operator and the corresponding PyTorch operator fail to align can be covered with high probability. When that happens, if we can obtain the corresponding reproduction sample, we can determine whether the OneFlow operator's implementation has a problem.

3. Operator AutoTest framework implementation ideas

Having seen how the AutoTest framework is used, let's explain how it is implemented. The implementation consists of two parts: one is how to generate random data, and the other is how to run the test program, record the intermediate tensors and the corresponding gradient tensors, and compare their shapes and values.

3.1 How to generate random data?

Random data here means not only the randomly generated input tensors, but also the Op's attribute parameters. For example, kernel_size=random(1, 4) in the deconvolution Op test above specifies that kernel_size will take a random value in the interval [1, 4).

This part is implemented in:

Github.com/Oneflow-Inc…

First let's take a look at which interfaces this file exports:

__all__ = [
    "random_tensor",
    "random_bool",
    "random_device",
    "random",
    "random_or_nothing",
    "oneof",
    "constant",
    "nothing"
]

These interfaces are classes that inherit from the generator base class and are used to generate random data, which can be either built-in types such as int or custom types such as Tensor. All the random parameters in the AutoTest framework are produced by these interfaces. Let's look at the implementation of the generator base class:

class generator:
    def __init__(self, children):
        self.children = children
        self._value = None

    def _init(self):
        self._value = None
        for x in self.children:
            x._init()

    def eval(self):
        self._init()
        return self.value()

    def _calc_value(self):
        raise NotImplementedError()

    def value(self):
        if self._value is None:
            self._value = self._calc_value()
        return self._value

    def size(self):
        return 1

    def __or__(self, other):
        other = pack(other)
        return oneof(
            self, other, possibility=self.size() / (self.size() + other.size())
        )

    def __ror__(self, other):
        return self | other

    def __add__(self, other):
        return add(self, other)

    def __radd__(self, other):
        return self + other

    def __sub__(self, other):
        return self + neg(other)

    def __rsub__(self, other):
        return neg(self - other)

    def __mul__(self, other):
        return mul(self, other)

    def __rmul__(self, other):
        return self * other

    def to(self, annotation):
        self._to(annotation)
        for x in self.children:
            x.to(annotation)
        return self

    def _to(self, annotation):
        pass

This class not only holds the value-related member functions _calc_value, value, and eval, but also a size member function that reflects how many pieces of data are generated. It also defines a series of magic methods that allow different generator subclasses to be combined with each other, increasing the flexibility of writing automatic tests. Finally, there is a to member function, which is overridden by classes inheriting from generator to determine the concrete type of the random data.
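
For example, here is a short usage sketch (written for this article, assuming the helpers exported above behave as their names suggest): .to(int) fixes the concrete type, | combines two generators into a oneof, and value() draws the actual random value.

kernel_size = random(1, 4).to(int)
print(kernel_size.value())   # an int in [1, 4)

stride = random(1, 3).to(int) | nothing()
print(stride.value())        # an int, or a Nothing() instance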

All the derived classes inherit from the generator base class and override member functions such as __init__, _calc_value, size, and _to. For example, the nothing derived class overrides _calc_value and returns an instance of a class that does nothing:

class Nothing:
    pass

class nothing(generator):
    def __init__(self):
        super().__init__([])

    def _calc_value(self):
        return Nothing()

As another example, the random generator derived class is defined as follows:

class random(generator):
    def __init__(self, low=1, high=6):
        self.low = pack(low)
        self.high = pack(high)
        super().__init__([self.low, self.high])
        self.annotation = None

    def _to(self, annotation):
        if self.annotation is not None:
            return
        if hasattr(annotation, "__origin__"):
            # PyTorch _size_2_t and similar types are defined by type variables,
            # leading to unexpected __args__ and __origin__
            #
            # >>> _size_2_t = Union[T, Tuple[T, T]][int]
            # >>> _size_2_t.__origin__
            # typing.Union[~T, typing.Tuple[~T, ~T]]
            #
            # So recreate a new annotation object by repr and eval
            #
            # >>> _size_2_t
            # typing.Union[int, typing.Tuple[int, int]]
            # >>> _size_2_t_new = eval(repr(annotation))
            # >>> _size_2_t_new.__origin__
            # typing.Union
            annotation = eval(repr(annotation))
        self.annotation = annotation

    def _generate(self, annotation):
        if hasattr(annotation, "__origin__"):
            if annotation.__origin__ is Union:
                x = random_util.choice(annotation.__args__)
                return self._generate(x)
            if annotation.__origin__ is Tuple or annotation.__origin__ is py_tuple:
                return [self._generate(x) for x in annotation.__args__]
            else:
                raise NotImplementedError(
                    f"Not implemented annotation {annotation} in random, type(annotation.__origin__) is {type(annotation.__origin__)}"
                )

        low, high = self.low.value(), self.high.value()

        if annotation == int:
            val = int(rng.integers(low, high))
        elif annotation == float:
            val = float(rng.random() * (high - low) + low)
        elif annotation == bool:
            val = random_util.choice([True, False])
        else:
            raise NotImplementedError(
                f"Not implemented annotation {annotation} in random"
            )
        return val

    def _calc_value(self):
        return self._generate(self.annotation)


def random_or_nothing(low, high):
    return oneof(random(low, high), nothing(), possibility=2 / 3)

One thing to note here is that a generator derived class holding an annotation property can either update that property through to (as the random class does), or ignore the annotation and directly construct a random result of the desired type in _calc_value (as the random_device class does).
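
As an illustration of the second pattern, a class like random_device might look roughly like this (a sketch written for this article, not the real OneFlow code; random_util is the same helper used by the random class above):

class random_device_sketch(generator):
    def __init__(self):
        super().__init__([])

    def _calc_value(self):
        # ignore any annotation and directly pick a device string at random
        return random_util.choice(["cpu", "cuda"])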

3.2 AutoTest core implementation

The core implementation of the AutoTest framework is:

Github.com/Oneflow-Inc…

The last two lines of this file are:

torch = GetDualObject("", torch_original, flow)
__all__ = ["autotest", "random_pytorch_tensor"]

In torch = GetDualObject("", torch_original, flow), torch_original denotes the original PyTorch framework, while the torch obtained from GetDualObject wraps the original PyTorch and OneFlow into the "high level" PyTorch mentioned above. So the key implementation here is the GetDualObject function. Let's not focus on what this function does for now, but on what it returns. Looking at the code, we can see that it returns a DualObject object, so let's examine the DualObject class first:

class DualObject:
    def __init__(self, name, pytorch, oneflow):
        self.name = name
        self.pytorch = pytorch
        self.oneflow = oneflow
        if isinstance(pytorch, torch_original.nn.Module):
            state_dict = pytorch.state_dict()
            state_dict = {k: v.detach().cpu().numpy() for (k, v) in state_dict.items()}
            oneflow.load_state_dict(state_dict, strict=False)
            if testing:
                dual_modules_to_test.append(self)
        if isinstance(pytorch, torch_original.Tensor):
            if testing:
                dual_objects_to_test.append(self)

    def __repr__(self):
        return f"PyTorch object:\n{self.pytorch}\n\nOneFlow object:\n{self.oneflow}"

    def __getattr__(self, key):
        pytorch_attr = getattr(self.pytorch, key)
        oneflow_attr = getattr(self.oneflow, key)
        new_name = f"{self.name}.{key}"
        global call_pytorch
        call_pytorch = self.pytorch
        return GetDualObject(new_name, pytorch_attr, oneflow_attr)

In __init__, a name and the PyTorch/OneFlow objects are passed in. For the exported torch, these are torch_original and flow; for the exported random_pytorch_tensor interface, they are pytorch_tensor and flow_tensor. Here is how random_pytorch_tensor is implemented:

def random_pytorch_tensor(
    ndim=None,
    dim0=1,
    dim1=None,
    dim2=None,
    dim3=None,
    dim4=None,
    low=0,
    high=1,
    dtype=float,
    requires_grad=True,
):
    if isinstance(requires_grad, generator):
        requires_grad = requires_grad.value()
    pytorch_tensor = (
        random_tensor(ndim, dim0, dim1, dim2, dim3, dim4, low, high, dtype)
        .value()
        .requires_grad_(requires_grad and dtype != int)
    )
    flow_tensor = flow.tensor(
        pytorch_tensor.detach().cpu().numpy(),
        requires_grad=(requires_grad and dtype != int),
    )
    return GetDualObject("unused", pytorch_tensor, flow_tensor)

You can see that it also obtains its return value by calling GetDualObject, just like the line that exports the high level torch. Note the two global lists dual_modules_to_test and dual_objects_to_test in __init__, which record nn.Module objects and Tensor objects respectively. In addition, the DualObject class overrides the __getattr__ magic method. Taking Flatten as an example, we can see which attributes this magic method fetches in an AutoTest program (a print(key) is added for demonstration):

def __getattr__(self, key):
    pytorch_attr = getattr(self.pytorch, key)
    oneflow_attr = getattr(self.oneflow, key)
    print(key)
    # print(pytorch_attr)
    # print(oneflow_attr)
    new_name = f"{self.name}.{key}"
    return GetDualObject(new_name, pytorch_attr, oneflow_attr)


@autotest(auto_backward=False)
def test_against_pytorch(test_case):
    m = torch.nn.Flatten(
        start_dim=random(1, 6) | nothing(), end_dim=random(1, 6) | nothing()
    )
    m.train(random())
    device = random_device()
    m.to(device)
    x = random_pytorch_tensor().to(device)
    y = m(x)
    return y

Then look at the keys printed in __getattr__:

nn
Flatten
train
to
to

You can see that any nn.Module or other torch function used in a test program decorated with the autotest() decorator triggers this method: it fetches the corresponding attribute from both the PyTorch object and the OneFlow object and again uses GetDualObject to return a new DualObject. We can print one of the DualObjects produced for Flatten:

PyTorch object:
<bound method Module.train of Flatten(start_dim=1, end_dim=-1)>

OneFlow object:
<bound method Module.train of Flatten(start_dim=1, end_dim=-1)>

The GetDualObject function generates a DualObject from the PyTorch object, the OneFlow object, and their name. It wraps the __call__ magic method of the PyTorch and OneFlow objects passed in (this is what makes the returned object behave like the high level PyTorch) and finally returns a DualObject. This process also involves skipping some magic functions we do not care about, checking whether the attributes of the passed-in objects are legal, and binding specific types of random data generated by the generator derived classes according to the annotations of the default parameters of nn.Module and other APIs (done in the get_args function). There is also special handling for Tensor methods, because Tensor methods are called differently (through getattr) from nn.Module and other functions (through __call__).

The overall implementation of GetDualObject can be found here:

Github.com/Oneflow-Inc…
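
To convey the idea without the full details, here is a highly simplified sketch written for this article (not the real implementation): calling the wrapped pair forwards the call to both frameworks with the same arguments and wraps the two results into a new DualObject.

def get_dual_object_sketch(name, pytorch, oneflow):
    def unwrap(arg, which):
        # pass DualObject arguments to each framework as its own tensor/module
        return getattr(arg, which) if isinstance(arg, DualObject) else arg

    class Dual(DualObject):
        def __call__(self, *args, **kwargs):
            # the real framework also evaluates generator-typed default arguments
            # here (get_args) so that both calls receive identical random values
            pytorch_res = pytorch(
                *[unwrap(a, "pytorch") for a in args],
                **{k: unwrap(v, "pytorch") for k, v in kwargs.items()},
            )
            oneflow_res = oneflow(
                *[unwrap(a, "oneflow") for a in args],
                **{k: unwrap(v, "oneflow") for k, v in kwargs.items()},
            )
            return get_dual_object_sketch(f"{name}()", pytorch_res, oneflow_res)

    return Dual(name, pytorch, oneflow)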

Finally, let's look at the implementation of the autotest() decorator:

def autotest(
    n=20,
    auto_backward=True,
    rtol=0.0001,
    atol=1e-05,
    check_graph=True,
    check_allclose=True,
):
    verbose = os.getenv("ONEFLOW_TEST_VERBOSE") is not None

    def deco(f):
        @functools.wraps(f)
        def new_f(test_case):
            nonlocal n
            loop_limit = n * 20
            loop = 0
            while n > 0:
                clear_note_fake_program()
                if loop > loop_limit:
                    raise ValueError("autotest stuck in an endless loop!")
                dual_modules_to_test.clear()
                dual_objects_to_test.clear()
                try:
                    global testing
                    testing = True
                    global testing_graph
                    if check_graph:
                        testing_graph = True
                    res = f(test_case)
                    testing = False
                    testing_graph = False
                except (PyTorchDoesNotSupportError, BothDoNotSupportError) as e:
                    if verbose:
                        print(f"{f.__name__}")
                        print(e)
                    loop += 1
                    continue
                if res is not None:
                    if not isinstance(res, collections.abc.Sequence):
                        res = [res]
                    func_outputs = res
                    for x in res:
                        if auto_backward:
                            if isinstance(x.pytorch, torch_original.Tensor):
                                call_tensor_id.append(id(x.pytorch))
                                x.sum().backward()
                        dual_objects_to_test.append(x)
                for x in dual_modules_to_test:
                    for key in x.pytorch.state_dict().keys():
                        if key not in x.oneflow.state_dict().keys():
                            warnings.warn(f"oneflow module don't have `{key}`")
                            continue
                        vis_parameters[key] = x.pytorch.state_dict()[key]
                        dual_objects_to_test.append(
                            GetDualObject(
                                "unused",
                                getattr(x.pytorch, key),
                                getattr(x.oneflow, key),
                            )
                        )
                        call_tensor_id.append(id(getattr(x.pytorch, key)))
                        dual_objects_to_test.append(
                            GetDualObject(
                                "unused",
                                getattr(x.pytorch, key).grad,
                                getattr(x.oneflow, key).grad,
                            )
                        )
                        call_tensor_id.append(id(getattr(x.pytorch, key).grad))
                for x in dual_objects_to_test:
                    if (
                        isinstance(x.pytorch, torch_original.Tensor)
                        and id(x.pytorch) not in call_tensor_id
                    ):
                        vis_tensor.append(x.pytorch)
                # check eager
                for x in dual_objects_to_test:
                    if check_allclose:
                        test_case.assertTrue(check_equality(x, rtol=rtol, atol=atol), x)
                if verbose:
                    print(f"{f.__name__} test eager passed.")
                n -= 1
                loop += 1
        return new_f

    return deco

In the decorator, res = f(test_case) executes the automatic test program wrapped by the decorator. Running it executes both PyTorch and OneFlow for the given inputs and records all the intermediate output tensors, including their gradient tensors, into the lists dual_modules_to_test and dual_objects_to_test. The decorator then walks through every tensor in these lists and checks whether the shapes and values on the PyTorch side and the OneFlow side are consistent. The comparison function is implemented in:

Github.com/Oneflow-Inc…

The idea is to extract the NumPy data from each pair of tensors and compare them. The autotest() decorator also exposes several parameters that control whether the backward pass is tested, how many times the test is executed, and the precision thresholds (rtol and atol) used in the final comparison.
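
Conceptually the comparison looks roughly like the following sketch (written for this article as an assumption, not the actual check_equality code):

import numpy as np

def check_tensor_equality_sketch(dual, rtol=1e-4, atol=1e-5):
    torch_t, flow_t = dual.pytorch, dual.oneflow
    if torch_t.grad is not None:
        if flow_t.grad is None:
            return False  # one side has a gradient while the other does not
        if not np.allclose(
            torch_t.grad.detach().cpu().numpy(),
            flow_t.grad.detach().cpu().numpy(),
            rtol=rtol,
            atol=atol,
        ):
            return False
    return np.allclose(
        torch_t.detach().cpu().numpy(),
        flow_t.detach().cpu().numpy(),
        rtol=rtol,
        atol=atol,
    )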

4. Automatically generating bug reproduction programs and data

Having covered what the AutoTest framework is and how to use it, let me show how to obtain a program that reproduces a bug, together with the corresponding input tensors and parameters. The principle is simple: record the APIs used during the GetDualObject process and stitch them together into a complete program (a toy sketch of the recording idea follows). The link below shows the effect in CI:
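
A toy sketch of the recording idea (an assumption written for this article, not OneFlow's actual mechanism): every time a dual API is invoked, append an equivalent line of source code, so that the collected lines form a standalone reproduction program.

recorded_lines = []

def note_call(api_name, *args, **kwargs):
    # render the call as source code text and remember it
    arg_repr = ", ".join(
        [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()]
    )
    recorded_lines.append(f"{api_name}({arg_repr})")

note_call("torch.nn.ConvTranspose2d", in_channels=2, out_channels=4, kernel_size=3)
note_call("m.to", "cuda")
# in a real setup, the collected program would be printed when a comparison fails
print("\n".join(recorded_lines))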

Github.com/Oneflow-Inc…

This example shows that in one CI run, OneFlow's conv_transpose2d operator and PyTorch's conv_transpose2d operator failed to align in a certain case, so when reporting the error CI also output the corresponding reproduction code and data, making it convenient for framework developers to locate and diagnose the problem:

The automatic test framework outputs the reproduction program and data when a OneFlow operator and the corresponding PyTorch operator are not aligned

In addition, the AutoTest framework is not only responsible for testing operators in Eager mode; it has also been extended to support nn.Graph and Eager Consistent, which greatly facilitates framework developers.

5. Summary

This article introduced OneFlow's operator AutoTest framework, which gives a deep learning framework a graceful way to do operator alignment, so that developers and users can write tests as easily as they write PyTorch code. The AutoTest framework is quite flexible and easy to use, and everyone is welcome to learn from it or use it.

(Originally published at: Zhuanlan.zhihu.com/p/458111952…)

OneFlow, a new generation of open source deep learning framework: Github.com/Oneflow-Inc…