Original post: cf020031308.github.io/… (2018)

  1. GCN training is accelerated through layer-wise sampling
  2. The variance of the estimate is reduced through importance sampling
  3. Cleaner separation of training and test sets: no edges involving test nodes are used during training

Method

FastGCN has the following assumptions:

  • Graph: a subgraph of an infinite graph
  • Nodes: a set of independently and identically distributed (i.i.d.) samples
  • Loss function and convolution layers: integrals (expectations) of node representation functions

Under these assumptions, an $M$-layer GCN (with per-node loss function $\ell$) is written as:

$$H^{(l+1)} = \sigma\big(\hat A H^{(l)} W^{(l)}\big), \qquad L = \frac{1}{n} \sum_{i=1}^{n} \ell\big(H^{(M)}(i, :)\big)$$

This can be expressed as expectations:

$$h^{(l+1)}(v) = \sigma\big(E_p\big[\hat A(v, u)\, h^{(l)}(u)\, W^{(l)}\big]\big), \qquad L = E_p\big[\ell\big(h^{(M)}(v)\big)\big]$$
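To see where the $n/t_l$ factor in the estimator below comes from: since $p$ is uniform over the $n$ nodes,

$$E_p\big[\hat A(v, u)\, h^{(l)}(u)\, W^{(l)}\big] = \frac{1}{n} \sum_{u=1}^{n} \hat A(v, u)\, h^{(l)}(u)\, W^{(l)},$$

so the full sum $\hat A H^{(l)} W^{(l)}$ is $n$ times this expectation, and a Monte Carlo estimate built from $t_l$ uniform samples has to be rescaled by $n/t_l$.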

Thus each layer (and the loss) is estimated by sampling nodes uniformly:

$$\tilde H^{(l+1)}(v, :) = \sigma\Big(\frac{n}{t_l} \sum_{j=1}^{t_l} \hat A\big(v, u_j^{(l)}\big)\, \tilde H^{(l)}\big(u_j^{(l)}, :\big)\, W^{(l)}\Big)$$

$$L = \frac{1}{t_M} \sum_{i=1}^{t_M} \ell\Big(\tilde H^{(M)}\big(u_i^{(M)}, :\big)\Big)$$

$$\nabla \tilde H^{(l+1)}(v, :) = \frac{n}{t_l} \sum_{j=1}^{t_l} \hat A\big(v, u_j^{(l)}\big)\, \nabla\Big(\tilde H^{(l)}\big(u_j^{(l)}, :\big)\, W^{(l)}\Big)$$

where $t_l$ denotes the number of nodes sampled at layer $l$, and $n$ is the total number of nodes.
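A minimal numpy sketch of this uniform estimator, under a few assumptions not taken from the paper (a dense $\hat A$, ReLU as $\sigma$, and made-up names):

```python
import numpy as np

def fastgcn_layer_uniform(A_hat, H_prev, W, t_l, rng):
    """Estimate sigma(A_hat @ H_prev @ W) from t_l uniformly sampled nodes."""
    n = A_hat.shape[0]
    u = rng.integers(0, n, size=t_l)          # u_j ~ Uniform{1..n}, with replacement
    # (n / t_l) * sum_j A_hat(:, u_j) H_prev(u_j, :) W  -- unbiased for A_hat @ H_prev @ W
    Z = (n / t_l) * (A_hat[:, u] @ H_prev[u] @ W)
    return np.maximum(Z, 0.0)                 # sigma = ReLU (assumption)

# Tiny usage example with random stand-in data
rng = np.random.default_rng(0)
n, d_in, d_out, t_l = 500, 32, 16, 50
A_hat = rng.random((n, n)) / n
H_prev = rng.random((n, d_in))
W = rng.random((d_in, d_out))
H_next = fastgcn_layer_uniform(A_hat, H_prev, W, t_l, rng)   # shape (n, d_out)
```

Only the $t_l$ sampled rows of $H^{(l)}$ and the corresponding columns of $\hat A$ are touched, which is where the per-layer speedup comes from.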

In order to reduce variance, Importance Sampling is introduced:


$$h^{(l+1)}(v) = \sigma\Big(E_q\Big[\hat A(v, u)\, h^{(l)}(u)\, W^{(l)} \cdot \frac{p(u)}{q(u)}\Big]\Big)$$
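Written out as a Monte Carlo estimate, with $p(u) = 1/n$ so that $p(u)/q(u) = 1/(n\,q(u))$, this would be:

$$\tilde H^{(l+1)}(v, :) = \sigma\Big(\frac{1}{t_l} \sum_{j=1}^{t_l} \frac{\hat A\big(v, u_j^{(l)}\big)}{q\big(u_j^{(l)}\big)}\, \tilde H^{(l)}\big(u_j^{(l)}, :\big)\, W^{(l)}\Big), \qquad u_j^{(l)} \sim q$$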

The distribution $q$ that theoretically minimizes the variance is


$$q^*(u) \propto \sqrt{E_v\big[\hat A(v, u)^2\big]} \cdot \big|h^{(l)}(u)\, W^{(l)}\big| \cdot p(u)$$

However, $W^{(l)}$ changes at every iteration, and the matrix product with $h^{(l)}$ is expensive to compute.

So the authors simplify it to


$$\hat q(u) \propto \sqrt{E_v\big[\hat A(v, u)^2\big]} \propto \big\|\hat A(:, u)\big\|_2$$

That is, nodes are sampled with probability proportional to the 2-norm of the corresponding column of $\hat A$. This is cheap to compute, and the distribution does not depend on the layer.
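A sketch of the importance-sampled layer under the same illustrative assumptions as the uniform version above (dense $\hat A$, ReLU as $\sigma$, made-up function name):

```python
import numpy as np

def fastgcn_layer_importance(A_hat, H_prev, W, t_l, rng):
    """Estimate sigma(A_hat @ H_prev @ W) with q(u) proportional to ||A_hat(:, u)||_2."""
    n = A_hat.shape[0]
    col_norms = np.linalg.norm(A_hat, axis=0)   # ||A_hat(:, u)||_2 for every node u
    q = col_norms / col_norms.sum()             # layer-independent sampling distribution
    u = rng.choice(n, size=t_l, p=q)            # u_j ~ q, with replacement
    # p(u) = 1/n, so the importance weight p(u_j) / q(u_j) equals 1 / (n * q(u_j))
    Z = (n / t_l) * (A_hat[:, u] / (n * q[u])) @ H_prev[u] @ W
    return np.maximum(Z, 0.0)                   # sigma = ReLU (assumption)
```

Since $q$ depends only on $\hat A$, it can be computed once and reused for every layer and every batch.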

However, the authors were unable to prove the validity of this estimate.

Experiments

Precomputing the first layer

Recall the GCN formula:


$$H^{(l+1)} = \sigma\big(\hat A H^{(l)} W^{(l)}\big)$$

Since $H^{(0)}$ is given and fixed, $\hat A H^{(0)}$ is constant throughout training, so this matrix product can be computed once and reused.
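A minimal sketch of this precomputation with random stand-in data (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hidden = 1000, 64, 16
A_hat = rng.random((n, n)) / n        # stand-in for the normalized adjacency
H0 = rng.random((n, d_in))            # fixed input features H^(0)
W0 = rng.random((d_in, d_hidden))     # first-layer weights, updated during training

# Precompute A_hat @ H0 once: it never changes across training iterations.
AH0 = A_hat @ H0

# Every first-layer forward pass then needs only one matrix product:
H1 = np.maximum(AH0 @ W0, 0.0)        # sigma(A_hat H^(0) W^(0)) with sigma = ReLU (assumption)
```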

The figure shows the result of using precomputation on the Pubmed dataset: accuracy (F1) is maintained while training time (s) drops significantly.

Strictly speaking, this is an improvement applicable to GCN in general; but because FastGCN estimates each layer by sampling and therefore does much less other work, the effect of this precomputation is especially pronounced for FastGCN.

Comparison of uniform sampling with importance sampling

From top to bottom are the Cora, Pubmed, and Reddit datasets.

Importance sampling (IS) is consistently more accurate; the authors attribute this to its smaller variance. I find a more intuitive explanation: because nodes are sampled with probability proportional to their column norms, the sampled nodes are more likely to be connected to the nodes of the layer above, so those upper-layer nodes actually have something to learn from.

Comparison with GraphSAGE and GCN

The NA entry appears because the Reddit dataset was too large for the original GCN implementation to handle, so it was replaced with a batched version, which is equivalent to FastGCN without sampling.

  • Similar accuracy across methods
  • Speed
    • The three datasets are progressively larger and denser, so GraphSAGE only shows its speed advantage over GCN on the largest one, Reddit
    • FastGCN is an order of magnitude faster than GCN on the Pubmed and Reddit datasets (note that the time axis is logarithmic)

Conclusion

  • Compared with GraphSAGE, which samples within each node’s neighborhood, FastGCN samples at the layer level. As a result, GraphSAGE’s speedup over GCN is only “linear”, while FastGCN’s is an order of magnitude.
  • One drawback of FastGCN is that when the graph is large and sparse (which is the realistic case), the nodes sampled at adjacent layers are likely to be disconnected from each other, which makes it impossible to learn.