I am taking part in day 21 of the November Gwen Challenge. For event details, see: 2021 Last Gwen Challenge


Suppose I want to classify an image, deciding whether it shows a dog or a cat (two categories). The photo comes from a Xiaomi phone with a 64-megapixel (6.4e7-pixel) camera.

I use a single-layer, fully connected neural network.

So my input is a 6.4×10^7 (6.4e7)-dimensional vector, and my output is a one-hot two-dimensional vector.

So my parameter count is 2 × 6.4e7.

A single layer like this cannot really tell whether it is a dog or a cat, so I add a hidden layer that reduces the representation to 1000 dimensions.

Now the first layer's parameter count is 1e3 × 6.4e7 = 6.4e10.

So by adding just one hidden layer, the first layer's parameters already run well past a hundred million (6.4e10, in fact), which is far more than recognizing a single image should require. And a two-layer network is still unlikely to tell whether it is a cat or a dog.

So as the network gets deeper, the number of parameters in the fully connected layers explodes beyond anything practical.
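To make these numbers concrete, here is a quick sketch of the parameter arithmetic (plain Python, my own illustration; bias terms are ignored):

```python
# Rough fully connected parameter counts for a 64-megapixel input.
# Illustrative arithmetic only; bias terms are ignored.

input_dim = 64_000_000   # 6.4e7 pixels flattened into a vector
num_classes = 2          # cat vs. dog, one-hot output
hidden_dim = 1_000       # the 1000-dimensional hidden layer from the text

# Single fully connected layer: every output unit connects to every input pixel.
single_layer_params = num_classes * input_dim
print(f"single layer: {single_layer_params:.2e}")        # ~1.3e8

# Add one hidden layer: the first weight matrix alone is hidden_dim x input_dim.
first_layer_params = hidden_dim * input_dim
second_layer_params = num_classes * hidden_dim
print(f"with a hidden layer: {first_layer_params + second_layer_params:.2e}")  # ~6.4e10
```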

So what can we do?

This is where the convolution layer comes in.

What does a convolution layer do? It does not look at the whole picture at once; it processes the picture piece by piece.


(Figure: convolution)

The fully connected layer computes over the whole image at once (left), while the convolution layer divides the image into small blocks (right) and scans from left to right, top to bottom, computing only one small block at a time.

Since the image is handled one block at a time, two principles come into play:

  1. Locality
  2. Translation invariance

Translation invariance

Here’s a formula:


$$h_{i, j}=\sum_{k, l} w_{i, j, k, l} x_{k, l}=\sum_{a, b} v_{i, j, a, b} x_{i+a, j+b}$$

To explain the above formula:

  1. Why is w four-dimensional?

    Think back to a single-layer neural network whose input is a vector:


    $$[h]_{m \times 1}= [w]_{m \times n} \times [x]_{n \times 1}$$

    When the input is a matrix, it becomes:


    $$[h]_{i \times j} = [w]_{i \times j \times k \times l} * [x]_{k \times l}$$

    Note: I deliberately changed the multiplication sign here, because this is no longer matrix multiplication but an element-wise multiplication of matrix entries followed by a sum. Strictly speaking the computation does not literally work this way, but I think it is the easiest way to explain why w becomes four-dimensional.

  2. v is a reindexing of w: $v_{i, j, a, b} = w_{i, j, i+a, j+b}$

    Why reindex?

    Since the convolution kernel (the yellow block, i.e. the weights) has a fixed size, it covers a patch of x of that same size at every step, so we can describe the patch with new letters, a×b. That is, each step multiplies an a×b patch of x against the i×j×a×b weight tensor w.

    But the position keeps changing: x is indexed by k×l, and k and l are different at every step, so the patch of x sits somewhere different each time. Since the patch lives inside the whole picture, we can describe its position with the picture's output indices i and j: at each step the patch of x sits at i+a and j+b. So we replace k and l, and the whole formula becomes $[h]_{i \times j} = [v]_{i \times j \times a \times b} * [x]_{i+a,\, j+b}$.

    OK, one more question: x moves every time, but the convolution kernel is translation invariant: no matter where the patch moves, the kernel keeps the same size and the same parameter values. So why does v still carry the indices i and j? Exactly, that is the blind spot. Since the weights have nothing to do with the position of the patch in the picture, we can throw the position away and simply drop i and j from v. The whole formula then becomes $[h]_{i \times j} = [v]_{a \times b} * [x]_{i+a,\, j+b}$.

    And that is what we call translation invariance.
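The effect of this reindexing can also be checked numerically. Below is a small sketch (my own illustration with NumPy's einsum; the shapes are toy values, not taken from the text) showing how translation invariance collapses the huge position-dependent weight tensor into one small shared kernel:

```python
import numpy as np

# Toy sizes, chosen only for illustration.
H, W = 6, 8          # input image x is H x W
a, b = 3, 3          # kernel window size
out_h, out_w = H - a + 1, W - b + 1

x = np.random.rand(H, W)

def patches_of(x, a, b):
    """Stack every a x b patch of x into an array of shape (out_h, out_w, a, b)."""
    return np.stack([[x[i:i + a, j:j + b]
                      for j in range(x.shape[1] - b + 1)]
                     for i in range(x.shape[0] - a + 1)])

patches = patches_of(x, a, b)

# Without weight sharing: a separate a x b weight block for EVERY output cell (i, j).
# This is the [v]_{i x j x a x b} tensor from the text; it grows with the image size.
v_full = np.random.rand(out_h, out_w, a, b)
h_full = np.einsum("ijab,ijab->ij", v_full, patches)    # h_{i,j} = sum_{a,b} v_{i,j,a,b} x_{i+a,j+b}

# With translation invariance: one a x b kernel shared by every position.
v_shared = np.random.rand(a, b)
h_shared = np.einsum("ab,ijab->ij", v_shared, patches)  # h_{i,j} = sum_{a,b} v_{a,b} x_{i+a,j+b}

print(v_full.size, "weights without sharing")   # out_h * out_w * a * b = 216
print(v_shared.size, "weights with sharing")    # a * b = 9
```

Both versions produce an output of the same shape; only the number of weights changes.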

So at this point, our complicated formula becomes:


$$h_{i, j}=\sum_{a, b} v_{a, b} x_{i+a, j+b}$$

Strictly speaking, this formula is a two-dimensional cross-correlation, not a two-dimensional convolution.
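As a side note (my own illustration, not from the original text): in the signal-processing sense, a true convolution flips the kernel, i.e. it sums $v_{a, b}\, x_{i-a, j-b}$ rather than $v_{a, b}\, x_{i+a, j+b}$. Assuming SciPy is available, the difference is easy to see numerically:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Cross-correlation: slide the kernel as-is (what deep-learning "convolution" layers compute).
print(correlate2d(x, k, mode="valid"))

# True convolution: equivalent to cross-correlating with the kernel flipped in both axes.
print(convolve2d(x, k, mode="valid"))
print(correlate2d(x, np.flip(k), mode="valid"))   # matches convolve2d
```

In practice the distinction rarely matters for a learned layer, since the network can simply learn the flipped kernel.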

What about locality?

The formula above is already local by construction.

Because the size of w is fixed, each step only computes against a block of x of the same size; only the part covered by the yellow box enters the calculation.

Alternatively, we can turn the two-dimensional cross-correlation into a two-dimensional convolution:

When $|a|, |b| > \Delta$, let $v_{a, b} = 0$:


$$h_{i, j}=\sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} v_{a, b} x_{i+a, j+b}$$

That is, given a range $\Delta$, a and b are now summed from $-\Delta$ up to $+\Delta$; outside that range v is zero. This guarantees the operation only ever touches a window of size $(2\Delta+1) \times (2\Delta+1)$ (for example, $\Delta = 1$ gives a 3×3 window, with indices $-1, 0, 1$).

Although the two formulas look different, in practice they describe exactly the same thing: the $a \times b$ window simply becomes a $(2\Delta+1) \times (2\Delta+1)$ window, as the sketch below verifies.
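Here is that check (my own NumPy sketch with $\Delta = 1$, i.e. a 3×3 window): the $\Delta$-indexed sum and the plain a×b sliding window give identical results.

```python
import numpy as np

delta = 1                    # half-width of the window: the kernel covers (2*delta + 1)^2 cells
k = 2 * delta + 1            # 3 x 3 for delta = 1
x = np.random.rand(5, 7)
v = np.random.rand(k, k)     # v[a + delta, b + delta] stores v_{a,b} for a, b in [-delta, delta]

out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1

# Delta-indexed formula: h_{i,j} = sum_{a,b = -delta..delta} v_{a,b} x_{i+a, j+b}.
# The window is centred, so the first valid centre sits at offset `delta` inside x.
h_delta = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        for a in range(-delta, delta + 1):
            for b in range(-delta, delta + 1):
                h_delta[i, j] += v[a + delta, b + delta] * x[i + delta + a, j + delta + b]

# a x b indexed formula: slide a k x k window starting from the top-left corner.
h_ab = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        h_ab[i, j] = (v * x[i:i + k, j:j + k]).sum()

print(np.allclose(h_delta, h_ab))   # True
```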

With that, we have successfully obtained the convolution formula:

When $|a|, |b| > \Delta$, let $v_{a, b} = 0$:


$$h_{i, j}=\sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} v_{a, b} x_{i+a, j+b}$$

Now let's actually apply the formula above: say my image is 8×6, and I convolve it with a 3×3 convolution kernel.

According to the formula, the two 3×3 matrices are multiplied element-wise and summed, and that number fills the first cell of the convolution layer's output. The kernel then moves to the right; once it can no longer move right, it moves down and starts again from the left, until it has swept the entire image.

After one convolution, the output has two fewer rows and two fewer columns than the input. That is because we do not compute the positions where x and w only partially overlap; we only compute the region the kernel can cover completely. So a 3×3 convolution kernel loses two rows and two columns, and a 5×5 convolution kernel loses four rows and four columns.
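To confirm the shape arithmetic, here is a minimal sliding-window sketch (my own NumPy illustration of the scan described above, not code from the original series): an 8×6 input with a 3×3 kernel yields a 6×4 output.

```python
import numpy as np

def corr2d(x, kernel):
    """Slide the kernel over x left to right, top to bottom; multiply element-wise and sum."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1   # lose (kh - 1) rows, (kw - 1) columns
    h = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h[i, j] = (x[i:i + kh, j:j + kw] * kernel).sum()
    return h

x = np.random.rand(8, 6)           # the 8 x 6 image from the text
kernel = np.random.rand(3, 3)      # a 3 x 3 convolution kernel

print(corr2d(x, kernel).shape)                 # (6, 4): two rows and two columns fewer
print(corr2d(x, np.random.rand(5, 5)).shape)   # (4, 2): a 5 x 5 kernel loses four rows and columns
```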



  1. More posts in this Hands-on Deep Learning series can be found here: juejin.cn

  2. GitHub address: DeepLearningNotes/d2l (github.com)

Still being updated…