  • Article reposted from the WeChat public account “Machine Learning Alchemy”
  • Author: Brother Lian Dan (reposted with permission)
  • Author’s contact: WeChat CYX645016617 (exchanges welcome)
  • Paper: “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow”
  • Official code: github.com/princeton-v…

“Introduction”: This article walks through the model structure alongside the official code.

Model structure

  • The model contains a Feature Encoder and a Context Encoder;
  • followed by a recurrent update structure. A minimal usage sketch is shown below, before the full class definition.
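Before going through the code line by line, here is a minimal usage sketch (my own, not from the article): it assumes the official repo's modules are importable and that the RAFT class defined just below is in scope; the args fields are a hypothetical setup mirroring what __init__ and forward() read, and the input size is arbitrary (it only needs to be divisible by 8).

import argparse
import torch

# hypothetical argument setup; only the fields RAFT.__init__ and forward() read
args = argparse.Namespace(small=False, mixed_precision=False)

model = RAFT(args)
model.eval()

# two random "frames" in [0, 255], shape [N, 3, H, W] with H and W divisible by 8
image1 = torch.randint(0, 255, (1, 3, 448, 1024)).float()
image2 = torch.randint(0, 255, (1, 3, 448, 1024)).float()

with torch.no_grad():
    flow_low, flow_up = model(image1, image2, iters=20, test_mode=True)

print(flow_up.shape)  # torch.Size([1, 2, 448, 1024])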
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast

# These modules come from the official RAFT repository:
from update import BasicUpdateBlock, SmallUpdateBlock
from extractor import BasicEncoder, SmallEncoder
from corr import CorrBlock, AlternateCorrBlock
from utils.utils import coords_grid, upflow8


class RAFT(nn.Module):
    def __init__(self, args):
        super(RAFT, self).__init__()
        self.args = args

        if args.small:
            self.hidden_dim = hdim = 96
            self.context_dim = cdim = 64
            args.corr_levels = 4
            args.corr_radius = 3
        
        else:
            self.hidden_dim = hdim = 128
            self.context_dim = cdim = 128
            args.corr_levels = 4
            args.corr_radius = 4

        if 'dropout' not in self.args:
            self.args.dropout = 0

        if 'alternate_corr' not in self.args:
            self.args.alternate_corr = False

        # feature network, context network, and update block
        if args.small:
            self.fnet = SmallEncoder(output_dim=128, norm_fn='instance', dropout=args.dropout)        
            self.cnet = SmallEncoder(output_dim=hdim+cdim, norm_fn='none', dropout=args.dropout)
            self.update_block = SmallUpdateBlock(self.args, hidden_dim=hdim)

        else:
            self.fnet = BasicEncoder(output_dim=256, norm_fn='instance', dropout=args.dropout)        
            self.cnet = BasicEncoder(output_dim=hdim+cdim, norm_fn='batch', dropout=args.dropout)
            self.update_block = BasicUpdateBlock(self.args, hidden_dim=hdim)

    def freeze_bn(self):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()

    def initialize_flow(self, img):
        """ Flow is represented as difference between two coordinate grids: flow = coords1 - coords0 """
        N, C, H, W = img.shape
        coords0 = coords_grid(N, H//8, W//8).to(img.device)
        coords1 = coords_grid(N, H//8, W//8).to(img.device)

        # optical flow computed as difference: flow = coords1 - coords0
        return coords0, coords1

    def upsample_flow(self, flow, mask):
        """ Upsample flow field [H/8, W/8, 2] -> [H, W, 2] using convex combination """
        N, _, H, W = flow.shape
        # mask holds, for each of the 8x8 output pixels, weights over a 3x3=9 neighborhood
        mask = mask.view(N, 1, 9, 8, 8, H, W)
        mask = torch.softmax(mask, dim=2)

        # gather the 3x3 neighborhood of each coarse flow vector (values scaled by 8)
        up_flow = F.unfold(8 * flow, [3, 3], padding=1)
        up_flow = up_flow.view(N, 2, 9, 1, 1, H, W)

        # convex combination: weighted sum over the 9 neighbors
        up_flow = torch.sum(mask * up_flow, dim=2)
        up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)
        return up_flow.reshape(N, 2, 8*H, 8*W)


    def forward(self, image1, image2, iters=12, flow_init=None, upsample=True, test_mode=False):
        """ Estimate optical flow between a pair of frames """

        image1 = 2 * (image1 / 255.0) - 1.0
        image2 = 2 * (image2 / 255.0) - 1.0

        image1 = image1.contiguous()
        image2 = image2.contiguous()

        hdim = self.hidden_dim
        cdim = self.context_dim

        # run the feature network
        with autocast(enabled=self.args.mixed_precision):
            fmap1, fmap2 = self.fnet([image1, image2])        
        
        fmap1 = fmap1.float()
        fmap2 = fmap2.float()

        if self.args.alternate_corr:
            # the alternate implementation indexes correlations with a custom CUDA kernel
            corr_fn = AlternateCorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
        else:
            corr_fn = CorrBlock(fmap1, fmap2, radius=self.args.corr_radius)

        # run the context network
        with autocast(enabled=self.args.mixed_precision):
            cnet = self.cnet(image1)
            net, inp = torch.split(cnet, [hdim, cdim], dim=1)
            net = torch.tanh(net)
            inp = torch.relu(inp)

        coords0, coords1 = self.initialize_flow(image1)

        if flow_init is not None:
            coords1 = coords1 + flow_init

        flow_predictions = []
        for itr in range(iters):
            coords1 = coords1.detach()
            corr = corr_fn(coords1) # index correlation volume

            flow = coords1 - coords0
            with autocast(enabled=self.args.mixed_precision):
                net, up_mask, delta_flow = self.update_block(net, inp, corr, flow)

            # F(t+1) = F(t) + \Delta(t)
            coords1 = coords1 + delta_flow

            # upsample predictions
            if up_mask is None:
                # upflow8 upsamples the flow 8x using bilinear interpolation
                flow_up = upflow8(coords1 - coords0)
            else:
                flow_up = self.upsample_flow(coords1 - coords0, up_mask)
            
            flow_predictions.append(flow_up)

        if test_mode:
            return coords1 - coords0, flow_up
            
        return flow_predictions

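The forward pass above relies on two small helpers from the repo's utils module, coords_grid and upflow8. For reference, here is a minimal sketch of what they do (paraphrased; consult utils/utils.py in the repo for the authoritative version):

import torch
import torch.nn.functional as F

def coords_grid(batch, ht, wd):
    # pixel coordinate grid, shape [batch, 2, ht, wd]; channel 0 is x, channel 1 is y
    coords = torch.meshgrid(torch.arange(ht), torch.arange(wd))
    coords = torch.stack(coords[::-1], dim=0).float()
    return coords[None].repeat(batch, 1, 1, 1)

def upflow8(flow, mode='bilinear'):
    # upsample 8x spatially and also scale the flow values by 8, since
    # displacements measured at 1/8 resolution are 8x larger at full resolution
    new_size = (8 * flow.shape[2], 8 * flow.shape[3])
    return 8 * F.interpolate(flow, size=new_size, mode=mode, align_corners=True)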

You can see that the model runs in the following steps (a sketch of the correlation computation follows this list):

  • Input two images, image1 and image2;
  • The two images are passed through fnet to obtain the feature maps fmap1 and fmap2;
  • fmap1 and fmap2 are passed into CorrBlock to build the correlation volume corr. Note that the default CorrBlock is pure PyTorch; the alternate_corr variant uses a custom CUDA kernel, for which the author provides the corresponding .cpp and .cu files;
  • image1 is passed through the context network cnet to obtain the net and inp features. The structure of cnet is basically the same as that of fnet;
  • Then the optical flow and the corresponding coordinate grids are initialized, and the loop begins;
  • In each iteration, net, inp, corr, and flow are fed into update_block, which returns the updated net, up_mask, and delta_flow.
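For intuition, here is a minimal sketch of the all-pairs correlation at the heart of CorrBlock (paraphrasing its corr() computation; the pyramid pooling and the windowed lookup around coords1 are omitted):

import torch

def all_pairs_correlation(fmap1, fmap2):
    # fmap1, fmap2: [N, D, H, W] feature maps at 1/8 resolution
    N, D, H, W = fmap1.shape
    fmap1 = fmap1.view(N, D, H * W)
    fmap2 = fmap2.view(N, D, H * W)

    # dot product between every pixel of fmap1 and every pixel of fmap2
    corr = torch.matmul(fmap1.transpose(1, 2), fmap2)  # [N, H*W, H*W]
    corr = corr.view(N, H, W, 1, H, W)

    # normalized by sqrt(D), as in the paper
    return corr / torch.sqrt(torch.tensor(D).float())

CorrBlock then average-pools the last two dimensions into a pyramid of corr_levels levels and, at each iteration, looks up a (2*corr_radius+1)^2 window around the current correspondence estimate.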

Encoder

SmallEncoder is just a BasicEncoder with a smaller memory footprint (fewer channels), so only BasicEncoder is shown here.

import torch
import torch.nn as nn

# ResidualBlock used below is defined alongside this class in the repo's extractor.py
class BasicEncoder(nn.Module):
    def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
        super(BasicEncoder, self).__init__()
        self.norm_fn = norm_fn

        if self.norm_fn == 'group':
            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
            
        elif self.norm_fn == 'batch':
            self.norm1 = nn.BatchNorm2d(64)

        elif self.norm_fn == 'instance':
            self.norm1 = nn.InstanceNorm2d(64)

        elif self.norm_fn == 'none':
            self.norm1 = nn.Sequential()

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.relu1 = nn.ReLU(inplace=True)

        self.in_planes = 64
        self.layer1 = self._make_layer(64,  stride=1)
        self.layer2 = self._make_layer(96, stride=2)
        self.layer3 = self._make_layer(128, stride=2)

        # output convolution
        self.conv2 = nn.Conv2d(128, output_dim, kernel_size=1)

        self.dropout = None
        if dropout > 0:
            self.dropout = nn.Dropout2d(p=dropout)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
                if m.weight is not None:
                    nn.init.constant_(m.weight, 1)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def _make_layer(self, dim, stride=1):
        layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
        layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
        layers = (layer1, layer2)
        
        self.in_planes = dim
        return nn.Sequential(*layers)


    def forward(self, x):

        # the input here may be a list/tuple of the two images to be matched
        is_list = isinstance(x, tuple) or isinstance(x, list)
        if is_list:
            batch_dim = x[0].shape[0]
            x = torch.cat(x, dim=0)

        x = self.conv1(x)  # a plain convolution layer, expanding the channels from 3 to 64
        x = self.norm1(x)  # the default is BatchNorm
        x = self.relu1(x)

        x = self.layer1(x)  # layers 1-3 each consist of two residual blocks
        x = self.layer2(x)
        x = self.layer3(x)

        x = self.conv2(x)

        if self.training and self.dropout is not None:
            x = self.dropout(x)

        if is_list:
            x = torch.split(x, [batch_dim, batch_dim], dim=0)

        return x
  • BasicEncoder downsamples the image three times (each by a factor of 2), so the output is 1/8 of the original resolution (a shape check follows below);
  • Except for the first and last layers, which are plain convolutions, the network is composed of residual modules, each containing two convolution layers.
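A quick shape check (a usage sketch of my own, assuming the BasicEncoder class above is in scope; the 384x512 input size is just an example):

import torch

encoder = BasicEncoder(output_dim=256, norm_fn='instance', dropout=0.0)
image = torch.randn(1, 3, 384, 512)  # [N, C, H, W]

features = encoder(image)
print(features.shape)  # torch.Size([1, 256, 48, 64]), i.e. H/8 x W/8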

UpdateBlock

import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicMotionEncoder(nn.Module):
    def __init__(self, args):
        super(BasicMotionEncoder, self).__init__()
        # one (2r+1)^2 lookup window per correlation pyramid level
        cor_planes = args.corr_levels * (2*args.corr_radius + 1)**2
        self.convc1 = nn.Conv2d(cor_planes, 256, 1, padding=0)
        self.convc2 = nn.Conv2d(256, 192, 3, padding=1)
        self.convf1 = nn.Conv2d(2, 128, 7, padding=3)
        self.convf2 = nn.Conv2d(128, 64, 3, padding=1)
        # 128-2 output channels: the 2-channel flow is concatenated back on in forward()
        self.conv = nn.Conv2d(64+192, 128-2, 3, padding=1)

    def forward(self, flow, corr):
        cor = F.relu(self.convc1(corr))
        cor = F.relu(self.convc2(cor))
        flo = F.relu(self.convf1(flow))
        flo = F.relu(self.convf2(flo))

        cor_flo = torch.cat([cor, flo], dim=1)
        out = F.relu(self.conv(cor_flo))
        return torch.cat([out, flow], dim=1)

class SepConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=192+128):
        super(SepConvGRU, self).__init__()
        # separable kernels: 1x5 convolutions for the horizontal pass
        self.convz1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convr1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convq1 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (1, 5), padding=(0, 2))

        # 5x1 convolutions for the vertical pass
        self.convz2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convr2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convq2 = nn.Conv2d(hidden_dim+input_dim, hidden_dim, (5, 1), padding=(2, 0))


    def forward(self, h, x):
        # horizontal
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz1(hx))
        r = torch.sigmoid(self.convr1(hx))
        q = torch.tanh(self.convq1(torch.cat([r*h, x], dim=1)))        
        h = (1-z) * h + z * q

        # vertical
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz2(hx))
        r = torch.sigmoid(self.convr2(hx))
        q = torch.tanh(self.convq2(torch.cat([r*h, x], dim=1)))       
        h = (1-z) * h + z * q

        return h

class FlowHead(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256):
        super(FlowHead, self).__init__()
        self.conv1 = nn.Conv2d(input_dim, hidden_dim, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_dim, 2, 3, padding=1)  # 2 output channels: (dx, dy)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.conv2(self.relu(self.conv1(x)))

class BasicUpdateBlock(nn.Module):
    def __init__(self, args, hidden_dim=128, input_dim=128):
        super(BasicUpdateBlock, self).__init__()
        self.args = args
        self.encoder = BasicMotionEncoder(args)
        self.gru = SepConvGRU(hidden_dim=hidden_dim, input_dim=128+hidden_dim)
        self.flow_head = FlowHead(hidden_dim, hidden_dim=256)

        # predicts 64*9 = 8*8*9 weights per coarse pixel for convex upsampling
        self.mask = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 64*9, 1, padding=0))

    def forward(self, net, inp, corr, flow, upsample=True):
        motion_features = self.encoder(flow, corr)
        inp = torch.cat([inp, motion_features], dim=1)

        net = self.gru(net, inp)
        delta_flow = self.flow_head(net)

        # scale the mask to balance gradients
        mask = .25 * self.mask(net)
        return net, mask, delta_flow
  • BasicMotionEncoder: fuses the correlation features with the current optical flow estimate;
  • SepConvGRU: an interesting convolutional GRU with separable (1x5 and 5x1) kernels.
    • There are two inputs: net (the hidden state) and the concatenation of inp with the motion features;

    • The two inputs are concatenated and passed through convolution layers to produce the update gate z and the reset gate r;

    • Then r*h is concatenated with x and convolved to obtain the candidate hidden state q;

    • The hidden state is then updated as h = (1-z)*h + z*q;

    • In other words, the net variable in the model is actually the hidden state of a GRU recurrent network (a shape-check sketch follows below).
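A quick shape check of the GRU update (a usage sketch of my own, assuming the SepConvGRU class above is in scope; the tensor sizes are illustrative):

import torch

gru = SepConvGRU(hidden_dim=128, input_dim=192+128)

h = torch.randn(1, 128, 48, 64)      # hidden state ("net"), [N, hidden_dim, H/8, W/8]
x = torch.randn(1, 192+128, 48, 64)  # motion features concatenated with context features

h_new = gru(h, x)
print(h_new.shape)  # torch.Size([1, 128, 48, 64]) -- the hidden state keeps its shape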

Conclusion

  • I learned about the convolutional GRU structure; I think it can be combined with many other architectures.