Machine learning algorithms and natural language processing

Public account original columnist: Don.Hub

Affiliation | Algorithm engineer at JD.com

School | Imperial College London

  • A detailed introduction to graph convolutional networks (GCN)

    • What is GCN
      • Overview of GCN
      • The model definition
    • Mathematical derivation
      • Graph Laplacian
    • ref

Graph neural networks are a relatively new field with a lot of room for exploration, so I have been meaning to get started with them. Graph convolutional networks (GCN) are among the most popular models in this area. I found a tutorial, "Complete Guide to Graph Convolutional Networks (GCN): Out of the Novice Village", and tried to make it out of the novice village, but with my limited background I still did not fully understand it after reading. So I decided to write an introduction myself. That said, the original article is a very good introduction; I refer to and cite many parts of it, and supplement it with the related derivations.

This article is divided into two parts: the first part introduces what GCN is, and the second part gives a detailed mathematical derivation.

What is GCN

Overview of GCN

The GCN discussed in this article comes from the paper Semi-Supervised Classification with Graph Convolutional Networks, one of the most classic papers in the GCN field.

From the GCN diagram we can see that a graph with $C$ input channels per node is taken as input; after passing through the hidden layers in the middle, we obtain an output with $F$ output channels per node.

A graph convolutional network can perform transformations at two levels:

Note that all graphs in this article refer specifically to undirected and unweighted graphs.

  • Graph level:

For example, the structure of the graph can be changed by introducing some form of pooling operation (see, e.g., Duvenaud et al., NIPS 2015). However, GCN does not operate at this level, which is why the graph structure of the network's input is the same as that of its output.

  • Node level:

In general, node-level operations do not change the structure of the graph, but only change its features/signals. The input is an $N \times C$ feature matrix ($N$: number of nodes in the graph, $C$: input feature dimension), and the output is an $N \times F$ matrix ($F$: output feature dimension).

A) A feature description $x_i$ for each node $i$, summarized in an $N \times C$ feature matrix $X$.

B) The structure of the graph, which can be represented by an adjacency matrix $A$ (or some other matrix derived from it).

We can easily reconstruct a graph from its adjacency matrix. For example, consider a graph $G = (V, E)$, where $V$ denotes the set of nodes and $E$ the set of edges.

We can construct an $N \times N$ adjacency matrix $A$, where $A_{ij} = 1$ if node $i$ and node $j$ are connected and $A_{ij} = 0$ otherwise. In this way we obtain $A$ from the graph $G$, and conversely, from $A$ we can also recover the structure of $G$.
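As a minimal sketch (assuming the 6-node example graph that appears in the code later in this article), such an adjacency matrix can be built from an edge list:

import torch

# Build the adjacency matrix of an undirected, unweighted graph from an edge list.
# The edges below are those of the 6-node example graph used later in this article.
edges = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]  # 1-indexed node pairs
N = 6
A = torch.zeros(N, N)
for i, j in edges:
    A[i - 1, j - 1] = 1
    A[j - 1, i - 1] = 1  # undirected graph: A is symmetric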

Therefore, each hidden layer of the network can be written as the following nonlinear function:

$$H^{(l+1)} = f(H^{(l)}, A)$$

with input layer $H^{(0)} = X$ and output layer $H^{(L)} = Z$, where $L$ is the number of layers. Different GCN models adopt different choices of the function $f(\cdot, \cdot)$.

The model definition

The propagation function used in the paper is as follows:

$$f(H^{(l)}, A) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad \hat{A} = A + I$$

When I first saw it, I was intimidated! The function looks very abstract. But let's first look at what it does, then take the formula apart step by step and see why it does what it does.

First, intuitively: each node's next-layer information is obtained as a weighted sum of its own previous-layer information and that of its adjacent nodes, followed by a linear transformation $W^{(l)}$ and a nonlinear transformation $\sigma(\cdot)$.


Let's break it down step by step, starting by defining a simple $f$ as the basic network layer.

We can easily adopt the simplest layer-wise propagation rule:

$$f(H^{(l)}, A) = \sigma\left(A H^{(l)} W^{(l)}\right)$$

We directly multiply $A$ and $H^{(l)}$, then apply a linear transformation through a weight matrix $W^{(l)}$, and finally apply a nonlinear activation function (ReLU, for example) to obtain the next layer's input $H^{(l+1)}$.

We need to pay special attention to the matrix product $A H^{(l)}$. What does it mean?

Let’s take a look at the figure below.

Suppose node $i$'s feature vector is $h_i = [i, i, i, i]$ (as in the code below). What do we get after the matrix multiplication $A H^{(0)}$?

Take the input layer $H^{(0)}$. Following the rules of matrix multiplication, we can easily write down each node's representation in the next layer. For node 1, for instance, the result is $h_2 + h_5 = [7, 7, 7, 7]$, and nodes 2 and 5 are exactly the neighbors of node 1. Refer to the code below for the full calculation.

import torch

A = torch.tensor([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0]
])

H_0 = torch.tensor([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3],
    [4, 4, 4, 4],
    [5, 5, 5, 5],
    [6, 6, 6, 6]
])

A.matmul(H_0)
>>> tensor([[ 7,  7,  7,  7],
        [ 9,  9,  9,  9],
        [ 6,  6,  6,  6],
        [14, 14, 14, 14],
        [ 7,  7,  7,  7],
        [ 4,  4,  4,  4]])

So we see that $A H^{(l)}$ is a fast way, via the adjacency matrix, of aggregating the information of each node's neighbors to obtain its next-layer input.

But is that perfect?

  • Problem 1: although we aggregate information from the surrounding nodes, a node's own information is lost (unless it has an edge pointing to itself).

The solution adopted is to manually add a self-loop to each node, i.e. $\hat{A} = A + I$, where $I$ is the identity matrix.

  • Problem 2: from the results above we can also see that after multiplying by $A$, the output becomes larger, i.e. the scale of the feature vectors changes; after multiple layers, the scale drifts further and further from that of the input.

So can we normalize the adjacency matrix $A$ so that each of its rows sums to 1? Then multiplying by $A$ would yield a weighted average of neighbor features instead of an ever-growing sum.

We can normalize $A$ by dividing each row by its row sum, and the row sum is exactly the degree of each node. In matrix form this is $\hat{A} = D^{-1} A$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$.

Let's look again at the example graph above:

  import torch

  A = torch.tensor([
      [0, 1, 0, 0, 1, 0],
      [1, 0, 1, 0, 1, 0],
      [0, 1, 0, 1, 0, 0],
      [0, 0, 1, 0, 1, 1],
      [1, 1, 0, 1, 0, 0],
      [0, 0, 0, 1, 0, 0]
  ], dtype=torch.float32)

  D = torch.tensor([
      [2, 0, 0, 0, 0, 0],
      [0, 3, 0, 0, 0, 0],
      [0, 0, 2, 0, 0, 0],
      [0, 0, 0, 3, 0, 0],
      [0, 0, 0, 0, 3, 0],
      [0, 0, 0, 0, 0, 1],
  ], dtype=torch.float32)

  hat_A = D.inverse().matmul(A)

  >>> hat_A
  tensor([[0.0000, 0.5000, 0.0000, 0.0000, 0.5000, 0.0000],
          [0.3333, 0.0000, 0.3333, 0.0000, 0.3333, 0.0000],
          [0.0000, 0.5000, 0.0000, 0.5000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.3333, 0.0000, 0.3333, 0.3333],
          [0.3333, 0.3333, 0.0000, 0.3333, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000]])

But in practice, the symmetric form of normalization is used: $\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D_{ii} = \sum_j A_{ij}$.




This is related to the Laplacian matrix, which we will cover in the next section. Numerically, we can check:

  D_minus_sqrt = D.inverse().sqrt()
  D_minus_sqrt.matmul(A).matmul(D_minus_sqrt)
  >>>
  tensor([[0.0000, 0.4082, 0.0000, 0.0000, 0.4082, 0.0000],
          [0.4082, 0.0000, 0.4082, 0.0000, 0.3333, 0.0000],
          [0.0000, 0.4082, 0.0000, 0.4082, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.4082, 0.0000, 0.3333, 0.5774],
          [0.4082, 0.3333, 0.0000, 0.3333, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.5774, 0.0000, 0.0000]])

If we combine these two tricks, we get the propagation rule used in the paper:

$$f(H^{(l)}, A) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$, and $\hat{D}$ is the degree matrix of $\hat{A}$ with $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$; $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ is the symmetric normalization of $\hat{A}$.

Mathematical derivation

Graph Laplacian

First of all, a graph can be represented in many ways: we can use an adjacency matrix, or we can use an incidence matrix. In the incidence matrix, each row represents an edge and each column represents a node; in each row, the node at which the edge starts is denoted 1 and the node at which it ends is denoted -1. Let's call this matrix $C$. The details are shown in the figure below.

The graph Laplacian is then defined as:

$$L = C^T C$$

C = torch.tensor([
    [1, -1, 0, 0],
    [1, 0, -1, 0],
    [1, 0, 0, -1],
    [0, -1, 1, 0],
    [0, 0, 1, -1],
    [0, -1, 0, 1],
])
C.T.matmul(C)
>>> tensor([[ 3, -1, -1, -1],
        [-1,  3, -1, -1],
        [-1, -1,  3, -1],
        [-1, -1, -1,  3]])

Looking at the diagonal entries $L_{ii} = \sum_k C_{ki} C_{ki}$: if edge $k$ is not incident to node $i$, the product $C_{ki} C_{ki}$ equals 0; if $C_{ki} = 1$ or $C_{ki} = -1$, the product equals 1. So the diagonal entries give the degree of each node.

For the off-diagonal entries $L_{ij} = \sum_k C_{ki} C_{kj}$: if nodes $i$ and $j$ are not connected, then $L_{ij} = 0$; otherwise the single edge between them contributes $1 \times (-1)$, so $L_{ij} = -1$. In other words, the off-diagonal entries are the negative of the adjacency matrix.

So we can derive that:

$$L = C^T C = D - A$$

As shown below (note that W represents the adjacency matrix)

To sum up, the entries of the Laplacian are:

$$L_{ij} = \begin{cases} \deg(v_i) & \text{if } i = j \\ -1 & \text{if } i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0 & \text{otherwise} \end{cases}$$

Refer to the following code for calculation

C = torch.tensor([
    [-1, 1, 0, 0, 0, 0],  # edge 1-2
    [-1, 0, 0, 0, 1, 0],  # edge 1-5
    [0, -1, 1, 0, 0, 0],  # edge 2-3
    [0, -1, 0, 0, 1, 0],  # edge 2-5
    [0, 0, -1, 1, 0, 0],  # edge 3-4
    [0, 0, 0, -1, 1, 0],  # edge 4-5
    [0, 0, 0, -1, 0, 1],  # edge 4-6
])
C.T.matmul(C)
>>>
tensor([[ 2, -1,  0,  0, -1,  0],
        [-1,  3, -1,  0, -1,  0],
        [ 0, -1,  2, -1,  0,  0],
        [ 0,  0, -1,  3, -1, -1],
        [-1, -1,  0, -1,  3,  0],
        [ 0,  0,  0, -1,  0,  1]])

We need to know the following properties of the Laplacian $L$:

  • $L$ is a symmetric matrix
  • Its eigenvalues are real and non-negative
  • Its eigenvectors are real and orthogonal, i.e. $U^T U = I$, where $U = [u_1, \dots, u_N]$ is the matrix of eigenvectors

To see why, let $\lambda_i$ be an eigenvalue of $L$ with eigenvector $u_i$, i.e. $L u_i = \lambda_i u_i$:

  • For the eigenvalues we have $\lambda_i = u_i^T L u_i = u_i^T C^T C u_i = \|C u_i\|^2 \geq 0$ (taking $u_i$ normalized so that $u_i^T u_i = 1$)
  • Symmetric normalized Laplacian: $L^{sym} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$

Its entries are 1 on the diagonal (for nodes with non-zero degree) and $-\frac{1}{\sqrt{\deg(v_i)\deg(v_j)}}$ off the diagonal when $v_i$ and $v_j$ are adjacent (0 otherwise).
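As a quick numerical check (a sketch reusing the tensors A and D defined in the code above), these properties can be verified with torch.linalg.eigh:

L = D - A                                 # combinatorial Laplacian of the example graph
eigvals, U = torch.linalg.eigh(L)         # applicable because L is symmetric
print(eigvals)                            # real and non-negative (up to floating-point error)
print(U.T @ U)                            # approximately the identity: eigenvectors are orthonormal
D_inv_sqrt = D.inverse().sqrt()
L_sym = torch.eye(6) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian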


We need to know that the convolution of two functions can be obtained via the following formula (the convolution theorem):

$$f * g = \mathcal{F}^{-1}\big(\mathcal{F}(f) \cdot \mathcal{F}(g)\big)$$

where $\mathcal{F}$ denotes the Fourier transform.

The graph Fourier transform correspondingly is:

$$\mathcal{F}(x) = U^T x$$

The inverse graph Fourier transform correspondingly is:

$$\mathcal{F}^{-1}(\hat{x}) = U \hat{x}$$

where $U$ is the matrix of eigenvectors of the Laplacian $L = U \Lambda U^T$; the Laplacian eigenvectors play the role of the Fourier basis.


We know the ordinary convolution formula from above: $f * g = \mathcal{F}^{-1}\big(\mathcal{F}(f) \cdot \mathcal{F}(g)\big)$.

Then the corresponding graph convolution formula is:

$$g * x = U\big((U^T g) \odot (U^T x)\big)$$

where $\odot$ denotes the element-wise (Hadamard) product.

Here $x$ is the feature/signal on the graph and $g$ is the filter. We would like the filter's receptive field to behave like a CNN's, i.e. to cover the region around the central node, so we define $g$ as a function of the Laplacian $L$; applying $L$ once is equivalent to propagating neighboring nodes' information once.

And because $L = U \Lambda U^T$, a function of $L$ acts as the same function of its eigenvalues: $g_\theta(L) = U g_\theta(\Lambda) U^T$. So we can regard the filter as a function $g_\theta(\Lambda)$ of the Laplacian eigenvalues, with parameters $\theta$.

So the graph convolution in the Fourier domain can be expressed as:

$$g_\theta * x = U g_\theta(\Lambda) U^T x$$
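A small sketch of this spectral filtering (the filter $g_\theta(\lambda) = e^{-\lambda}$ below is just an arbitrary illustration, not the paper's choice), reusing L_sym from the check above:

eigvals, U = torch.linalg.eigh(L_sym)   # L_sym = U diag(lambda) U^T
x = torch.randn(6)                      # a signal with one scalar feature per node
x_hat = U.T @ x                         # graph Fourier transform
g = torch.exp(-eigvals)                 # illustrative filter g_theta(lambda)
y = U @ (g * x_hat)                     # filter in the spectral domain, then transform back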

We know that evaluating this requires the eigendecomposition of the Laplacian matrix $L$ as well as multiplications with the dense matrix $U$, which involves a lot of expensive matrix operations. So the paper borrows Chebyshev polynomials to perform an approximate computation:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})$$

  • where $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$ is the rescaled eigenvalue matrix and $T_k$ is the Chebyshev polynomial of order $k$. The resulting filter is a $K$-th order polynomial in the Laplacian, i.e. it depends only on nodes that are at most $K$ steps away from the central node ($K$-th order neighborhood).

  • The Chebyshev polynomials are defined recursively: $T_0(x) = 1$, $T_1(x) = x$, and $T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)$.

Let's go back to the original graph convolution calculation:

$$g_{\theta'} * x \approx U \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})\, U^T x = \sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x$$

where $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$ (using the fact that $U T_k(\tilde{\Lambda}) U^T = T_k(\tilde{L})$).
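Below is a sketch of this Chebyshev-approximated filtering using the recurrence above (the coefficients theta are arbitrary placeholders and L_sym is reused from earlier; the paper's implementation differs in details):

def cheb_conv(L, x, theta, lambda_max=2.0):
    # Approximate g_theta * x with sum_k theta_k T_k(L_tilde) x
    L_tilde = (2.0 / lambda_max) * L - torch.eye(L.shape[0])
    Tx = [x, L_tilde @ x]                              # T_0 x = x, T_1 x = L_tilde x
    for _ in range(2, len(theta)):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])       # T_k = 2 L_tilde T_{k-1} - T_{k-2}
    return sum(t * tx for t, tx in zip(theta, Tx))

y = cheb_conv(L_sym, torch.randn(6), theta=torch.tensor([0.5, 0.3, 0.2]))  # K = 2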

The paper adopts only one layer of neighbor propagation, so we take $K = 1$; further assuming $\lambda_{max} \approx 2$, we obtain:

$$g_{\theta'} * x \approx \theta'_0 x + \theta'_1 (L - I_N)\, x = \theta'_0 x - \theta'_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x$$

(here $L$ is the symmetric normalized Laplacian, $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$).

In practical applications, to prevent overfitting and reduce the amount of computation, we set $\theta = \theta'_0 = -\theta'_1$ and get:

$$g_\theta * x \approx \theta \left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x$$

We then apply the renormalization trick: let $\hat{A} = A + I_N$ and $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, replacing $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ with $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$.

Generalizing to an input $X \in \mathbb{R}^{N \times C}$ with $C$ channels and $F$ filters, we get:

$$Z = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X \Theta$$

Adding the activation function $\sigma$, we get:

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ corresponds to the input $X$, and $W^{(l)}$ corresponds to the parameters $\Theta$.
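Finally, a minimal sketch in the spirit of the paper's two-layer model for node classification, $Z = \mathrm{softmax}\big(\hat{A}_{norm}\,\mathrm{ReLU}(\hat{A}_{norm} X W^{(0)})\, W^{(1)}\big)$ (dimensions and variable names below are illustrative, not taken from the paper's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0])                  # add self-loops
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt           # D_hat^-1/2 A_hat D_hat^-1/2
        H = F.relu(A_norm @ self.W0(X))                    # first GCN layer
        return A_norm @ self.W1(H)                         # logits; softmax/cross-entropy applied in the loss

model = TwoLayerGCN(in_dim=4, hidden_dim=8, num_classes=3)
logits = model(torch.randn(6, 4), A)   # A: the 6x6 adjacency matrix from the examples above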

ref

  • en.wikipedia.org/wiki/L

  • T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

  • math.stackexchange.com/
