1 Mathematical Question

1.1 Background
Stochastic Gradient Descent (SGD) is a commonly used optimization method for large-scale machine learning models. Specifically, a typical machine learning model is defined as follows:

$$\min_{x} f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) \,, \quad (1)$$

where $x \in \mathbb{R}^d$ represents the model parameter, $n$ is the number of samples, and $f_i(\cdot)$ denotes the loss function for the $i$-th sample. To optimize large-scale machine learning models, SGD randomly selects a subset of samples at each iteration to compute the stochastic gradient

$$g_t \triangleq \frac{1}{|S|}\sum_{i \in S} \nabla f_i(x_t) \,, \quad (2)$$

where $x_t$ is the model parameter at the $t$-th iteration, $S \subset [n]$ is the randomly selected subset, and $|S|$ denotes the size of the subset. Note that $g_t$ is an unbiased estimate of the full gradient, i.e. $\mathbb{E}[g_t] = \nabla f(x_t)$. SGD then uses this stochastic gradient to update the model parameter:

$$x_{t+1} = x_t - \eta g_t \,. \quad (3)$$
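The loop defined by Eqs. (2)-(3) can be sketched in a few lines of NumPy. This is an illustrative sketch only; the per-sample gradient functions, parameter names, and the toy usage below are my assumptions, not part of the assignment.

```python
import numpy as np

def sgd(grad_fns, x0, eta, T, batch_size, seed=0):
    """Minimal SGD loop following Eqs. (2)-(3).

    grad_fns: list of per-sample gradient functions, one per f_i.
    x0: initial parameter, eta: learning rate, T: number of iterations.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(grad_fns)
    for t in range(T):
        S = rng.choice(n, size=batch_size, replace=False)   # random subset S
        g = sum(grad_fns[i](x) for i in S) / batch_size     # stochastic gradient g_t
        x = x - eta * g                                     # update: x_{t+1} = x_t - eta * g_t
    return x
```

For example, with $f_i(x) = \frac{1}{2}(x - a_i)^2$ the minimizer of $f$ is the mean of the $a_i$, and the loop above drives $x$ toward it.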
To study the convergence rate of SGD, we have the following assumptions.
Assumption 1 The loss function $f(x)$ has a Lipschitz continuous gradient with parameter $L > 0$, i.e.

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \,, \quad \forall x \in \mathbb{R}^d, \ \forall y \in \mathbb{R}^d \,. \quad (4)$$

Assumption 2 The stochastic gradient has bounded variance, i.e.

$$\mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \le \sigma^2 \,. \quad (5)$$
Based on these two assumptions, we can prove the convergence of SGD as follows.
Proof. According to Assumption 1, we have

$$
\begin{aligned}
\mathbb{E}[f(x_{t+1})] &\le \mathbb{E}[f(x_t)] + \mathbb{E}[\langle \nabla f(x_t), x_{t+1} - x_t \rangle] + \frac{L}{2}\mathbb{E}[\|x_{t+1} - x_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta \mathbb{E}[\langle \nabla f(x_t), g_t \rangle] + \frac{\eta^2 L}{2}\mathbb{E}[\|g_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2}\mathbb{E}[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta\Big(1 - \frac{\eta L}{2}\Big)\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2}\mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta\Big(1 - \frac{\eta L}{2}\Big)\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} \,, \quad (6)
\end{aligned}
$$

where the second equality uses the unbiasedness $\mathbb{E}[g_t] = \nabla f(x_t)$, the third line expands the squared norm and drops the cross term $2\mathbb{E}[\langle g_t - \nabla f(x_t), \nabla f(x_t)\rangle]$, which vanishes in expectation, and the last inequality applies Assumption 2.
By setting $\eta < 1/L$ (so that $1 - \eta L/2 > 1/2$), we get

$$\mathbb{E}[f(x_{t+1})] \le \mathbb{E}[f(x_t)] - \frac{\eta}{2}\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} \,. \quad (7)$$
Summing over $t$ from $0$ to $T-1$ and telescoping, we have

$$\mathbb{E}[f(x_T)] \le f(x_0) - \frac{\eta}{2}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L T}{2} \,. \quad (8)$$
Finally, rearranging and dividing by $\eta T / 2$, we can get

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{2\big(f(x_0) - \mathbb{E}[f(x_T)]\big)}{\eta T} + \eta \sigma^2 L \,. \quad (9)$$

By choosing $\eta = O\big(\frac{1}{\sqrt{T}}\big)$, we have the convergence rate of SGD as follows:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{\sqrt{T}}\Big) \,. \quad (10)$$
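As a quick numerical sanity check of this rate (illustrative only; the quadratic objective and all names below are my assumptions), one can run SGD with $\eta = 1/\sqrt{T}$ on a toy problem and observe that the average squared full-gradient norm along the trajectory shrinks as $T$ grows:

```python
import numpy as np

def sgd_avg_grad_norm(T, seed=0):
    """Run T steps of SGD with eta = 1/sqrt(T) on f_i(x) = 0.5*(x - a_i)^2
    and return the trajectory average of ||grad f(x_t)||^2."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=1000)          # samples a_i; full gradient is x - mean(a)
    eta = 1.0 / np.sqrt(T)             # learning rate eta = O(1/sqrt(T))
    x = 5.0
    total = 0.0
    for t in range(T):
        i = rng.integers(len(a))
        g = x - a[i]                   # stochastic gradient of f_i
        total += (x - a.mean()) ** 2   # squared norm of the full gradient
        x -= eta * g                   # SGD update, Eq. (3)
    return total / T
```

Running with $T = 100$ and then $T = 10000$ should show a clear decrease of the average, consistent with Eq. (10).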
1.2 Question
Although SGD is cheap at each iteration, its stochastic gradient has large variance as an estimate of the full gradient, so its convergence is slow. Now, to accelerate convergence, assume we have a new gradient estimator $v_t$ for the full gradient $\nabla f(x_t)$, and we use it to update the model parameter:

$$x_{t+1} = x_t - \eta v_t \,. \quad (11)$$

This new estimator is not an unbiased estimator of $\nabla f(x_t)$, i.e. $\mathbb{E}[v_t] \ne \nabla f(x_t)$. But we have the following condition:

$$\sum_{t=0}^{T} \mathbb{E}[\|v_t - \nabla f(x_t)\|^2] \le \eta^2 L^2 \sum_{t=0}^{T} \mathbb{E}[\|v_t\|^2] \,. \quad (12)$$

Q: Given Eq. (4), Eq. (11), and Eq. (12), please prove that this method (using $v_t$) can achieve the following convergence rate:

$$\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{T}\Big) \,, \quad (13)$$

and please give the value of the learning rate $\eta$.
2 Coding Question
2.1 Background
Graph convolutional neural networks (GCNs) have attracted a surge of attention in recent years, due to their superior performance on graph data. A typical GCN layer is defined as follows:

$$Z^{(l)} = W^{(l)} H^{(l-1)} A \,, \qquad H^{(l)} = \sigma\big(Z^{(l)}\big) \,, \quad (14)$$

where $H^{(l)} = [h_1^{(l)}, h_2^{(l)}, \cdots, h_n^{(l)}] \in \mathbb{R}^{d_l \times n}$ is the hidden feature of all nodes in the $l$-th layer, $Z^{(l)} = [z_1^{(l)}, z_2^{(l)}, \cdots, z_n^{(l)}] \in \mathbb{R}^{d_l \times n}$ is the hidden feature before the non-linear activation, $\sigma(\cdot)$ is the non-linear activation function, and $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the model parameter in the $l$-th layer. Compared with fully connected neural networks, a GCN has an aggregation operation to leverage the relational information, denoted by $H^{(l-1)} A$, where $A \in \mathbb{R}^{n \times n}$ determines how to aggregate neighbours. Specifically, a typical fully connected layer is defined as follows:

$$z_i^{(l)} = W^{(l)} h_i^{(l-1)} \,. \quad (15)$$
In contrast, in the standard GCN [1], the GCN layer is defined as follows:

$$z_i^{(l)} = \sum_{j \in N_i} \frac{1}{\sqrt{|N_i||N_j|}} W^{(l)} h_j^{(l-1)} \,, \quad (16)$$

where $N_i$ denotes the neighbours of the $i$-th node and $|N_i|$ represents the degree of the $i$-th node.
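In matrix form, the per-node sum in Eq. (16) is the aggregation $W^{(l)} H^{(l-1)} \hat{A}$ of Eq. (14) with a symmetrically normalized adjacency $\hat{A}$. A minimal NumPy sketch of one such layer follows; the function name, the $d \times n$ column-per-node layout, and the assumption that self-loops are already included in A are all mine, not fixed by the assignment.

```python
import numpy as np

def gcn_layer(H, A, W, act=np.tanh):
    """One GCN layer following Eqs. (14)/(16): H' = act(W @ H @ Ahat).

    H: (d_in, n) node features (one column per node).
    A: (n, n) adjacency matrix, assumed to include self-loops.
    W: (d_out, d_in) layer weights.
    """
    deg = A.sum(axis=1)                      # node degrees |N_i|
    Dinv = np.diag(1.0 / np.sqrt(deg))
    Ahat = Dinv @ A @ Dinv                   # 1/sqrt(|N_i||N_j|) normalization
    Z = W @ H @ Ahat                         # aggregate neighbours, then transform
    return act(Z)
```

On a graph where all nodes have identical features and identical degrees, the aggregation leaves the columns identical, which is a quick way to sanity-check the normalization.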
The aggregation operation brings new challenges for training GCNs. In particular, different nodes depend on each other in a GCN, while samples are independent in regular deep neural networks. Therefore, existing GCNs usually use the full-gradient descent method to learn model parameters, which avoids dealing with the dependence between different nodes; see, e.g., https://github.com/tkipf/pygcn.
2.2 Question
Please use a stochastic gradient descent method (SGD or one of its variants) to train a GCN. In other words, please re-implement the training method in https://github.com/tkipf/pygcn. At each iteration, you need to sample a sub-graph rather than the entire graph to learn the model parameters (here is an example of using SGD to train a regular CNN: https://github.com/pytorch/examples/blob/master/mnist/main.py). Note that the prediction performance will typically become worse when using SGD (or its variants) to train a GCN; sampling a good subgraph will alleviate this issue.
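One way to sketch the sampling step is uniform node sampling with the induced subgraph; the function name and array layouts below are hypothetical, and in the actual implementation each sampled subgraph would be fed through the GCN before taking one SGD step on its loss.

```python
import numpy as np

def sample_induced_subgraph(A, X, labels, batch_size, rng):
    """Uniformly sample nodes and restrict A and the features to the
    induced subgraph.

    A: (n, n) adjacency matrix, X: (d, n) features (one column per node),
    labels: (n,). Uniform sampling is only one (naive) choice; samplers
    that preserve more edges (e.g. cluster- or neighbourhood-based) tend
    to give better prediction performance.
    """
    n = A.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)
    sub_A = A[np.ix_(idx, idx)]        # keep only edges inside the batch
    return sub_A, X[:, idx], labels[idx], idx
```

Edges leaving the sampled node set are dropped, which is exactly why a careless sampler hurts accuracy: the aggregation in Eq. (16) sees truncated neighbourhoods.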
References
[1] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.