 1 Mathematical Question

1.1 Background
Stochastic Gradient Descent (SGD) is a commonly used optimization method for large-scale machine learning
models. Specifically, a typical machine learning model is defined as follows:
\[
\min_{x}\; f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) , \tag{1}
\]
where $x \in \mathbb{R}^d$ represents the model parameter, $n$ is the number of samples, and $f_i(\cdot)$ denotes the loss function for the $i$-th sample. To optimize large-scale machine learning models, SGD randomly selects a subset of samples at each iteration to compute the stochastic gradient
\[
g_t \triangleq \frac{1}{|S|}\sum_{i \in S} \nabla f_i(x_t) , \tag{2}
\]
where $x_t$ is the model parameter at the $t$-th iteration, $S \subset [n]$ is the randomly selected subset, and $|S|$ denotes the size of the subset. Note that $g_t$ is an unbiased estimate of the full gradient, i.e. $\mathbb{E}[g_t] = \nabla f(x_t)$. Then, SGD uses this stochastic gradient to update the model parameter:
\[
x_{t+1} = x_t - \eta g_t . \tag{3}
\]
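As a concrete illustration of Eq. (2) and Eq. (3), here is a minimal Python sketch of one mini-batch SGD step. The least-squares loss, the batch size, and all variable names are illustrative assumptions chosen only to make the sketch runnable; they are not part of the problem statement.

```python
import numpy as np

# Illustrative problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2 (an assumed loss, for demonstration).
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def sgd_step(x, eta, batch_size=32):
    """One SGD iteration: sample S, form g_t as in Eq. (2), update as in Eq. (3)."""
    S = rng.choice(n, size=batch_size, replace=False)  # random subset S of [n]
    residual = A[S] @ x - b[S]                         # per-sample errors on the mini-batch
    g = A[S].T @ residual / batch_size                 # g_t = (1/|S|) * sum_{i in S} grad f_i(x)
    return x - eta * g                                 # x_{t+1} = x_t - eta * g_t

x = np.zeros(d)
for t in range(200):
    x = sgd_step(x, eta=0.01)
```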
To study the convergence rate of SGD, we make the following assumptions.
Assumption 1 The loss function $f(x)$ has a Lipschitz continuous gradient with parameter $L > 0$, i.e.
\[
\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| , \quad \forall x \in \mathbb{R}^d, \; \forall y \in \mathbb{R}^d . \tag{4}
\]
Assumption 2 The stochastic gradient has bounded variance, i.e.
\[
\mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \le \sigma^2 . \tag{5}
\]
Based on these two assumptions, we can prove the convergence of SGD as follows.
Proof. According to Assumption 1, we have
\[
\begin{aligned}
\mathbb{E}[f(x_{t+1})]
&\le \mathbb{E}[f(x_t)] + \mathbb{E}[\langle \nabla f(x_t), x_{t+1} - x_t \rangle] + \frac{L}{2}\,\mathbb{E}[\|x_{t+1} - x_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta\,\mathbb{E}[\langle \nabla f(x_t), g_t \rangle] + \frac{\eta^2 L}{2}\,\mathbb{E}[\|g_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta\,\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2}\,\mathbb{E}[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta\Big(1 - \frac{\eta L}{2}\Big)\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2}\,\mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta\Big(1 - \frac{\eta L}{2}\Big)\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} . \tag{6}
\end{aligned}
\]
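Both equalities in (6) rely on the unbiasedness $\mathbb{E}[g_t] = \nabla f(x_t)$. In particular, the step from the third to the fourth line can be spelled out as follows (a short elaboration added for clarity, using only quantities defined above):
\[
\mathbb{E}\big[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2\big]
= \mathbb{E}\big[\|g_t - \nabla f(x_t)\|^2\big]
+ 2\,\mathbb{E}\big[\langle g_t - \nabla f(x_t), \nabla f(x_t) \rangle\big]
+ \mathbb{E}\big[\|\nabla f(x_t)\|^2\big] ,
\]
where the cross term vanishes because $\mathbb{E}[g_t] = \nabla f(x_t)$; combining the resulting $\mathbb{E}[\|\nabla f(x_t)\|^2]$ term with the $-\eta\,\mathbb{E}[\|\nabla f(x_t)\|^2]$ term gives the fourth line.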
By setting $\eta < 1/L$, we can get
\[
\mathbb{E}[f(x_{t+1})] \le \mathbb{E}[f(x_t)] - \frac{\eta}{2}\,\mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} . \tag{7}
\]
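The coefficient simplification behind (7) is simply (a brief added remark):
\[
\eta < \frac{1}{L} \;\Longrightarrow\; 1 - \frac{\eta L}{2} > \frac{1}{2} \;\Longrightarrow\; -\eta\Big(1 - \frac{\eta L}{2}\Big) < -\frac{\eta}{2} .
\]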
Summing (7) over $t$ from $0$ to $T$ and telescoping, we have
\[
\mathbb{E}[f(x_{T+1})] \le f(x_0) - \frac{\eta}{2}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L T}{2} . \tag{8}
\]
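The telescoping step, spelled out for clarity:
\[
\sum_{t=0}^{T} \big(\mathbb{E}[f(x_{t+1})] - \mathbb{E}[f(x_t)]\big) = \mathbb{E}[f(x_{T+1})] - f(x_0) ,
\]
so adding (7) for $t = 0, 1, \ldots, T$ and cancelling the intermediate terms yields (8).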
Finally, rearranging (8) and dividing by $\eta T / 2$, we can get
\[
\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{2\big(f(x_0) - \mathbb{E}[f(x_{T+1})]\big)}{\eta T} + \eta \sigma^2 L . \tag{9}
\]
By choosing $\eta = O(\frac{1}{\sqrt{T}})$, we have the convergence rate of SGD as follows:
\[
\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{\sqrt{T}}\Big) . \tag{10}
\]
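To make the last step explicit, substituting, for instance, $\eta = 1/\sqrt{T}$ (for $T > L^2$, so that $\eta < 1/L$ as required above) into (9), and bounding $f(x_0) - \mathbb{E}[f(x_{T+1})]$ by the constant $f(x_0) - \min_x f(x)$, gives
\[
\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2]
\le \frac{2\big(f(x_0) - \mathbb{E}[f(x_{T+1})]\big)}{\sqrt{T}} + \frac{\sigma^2 L}{\sqrt{T}}
= O\Big(\frac{1}{\sqrt{T}}\Big) .
\]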
1.2 Question
Although each SGD iteration is cheap, the stochastic gradient has a large variance when estimating the full gradient, so the convergence speed of SGD is slow. Now, to accelerate the convergence speed, assume we have a new gradient estimator $v_t$ for the full gradient $\nabla f(x_t)$, and use it to update the model parameter:
\[
x_{t+1} = x_t - \eta v_t . \tag{11}
\]
This new estimator is not an unbiased estimator of $\nabla f(x_t)$, i.e. $\mathbb{E}[v_t] \ne \nabla f(x_t)$. However, we have the following condition:
\[
\sum_{t=0}^{T} \mathbb{E}[\|v_t - \nabla f(x_t)\|^2] \le \eta^2 L^2 \sum_{t=0}^{T} \mathbb{E}[\|v_t\|^2] . \tag{12}
\]
Q: Given Eq. (4), Eq. (11), and Eq. (12), please prove that this method (using $v_t$) can achieve the following convergence rate
\[
\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{T}\Big) , \tag{13}
\]
and please give the value of the learning rate $\eta$.
2 Coding Question
2.1 Background
Graph convolutional neural networks (GCNs) have attracted a surge of attention in recent years, due to their superior performance on graph data. A typical GCN layer is defined as follows:
\[
Z^{(l)} = W^{(l)} H^{(l-1)} A , \qquad H^{(l)} = \sigma(Z^{(l)}) , \tag{14}
\]
where $H^{(l)} = [h^{(l)}_1, h^{(l)}_2, \cdots, h^{(l)}_n] \in \mathbb{R}^{d_l \times n}$ is the hidden feature of all nodes in the $l$-th layer, $Z^{(l)} = [z^{(l)}_1, z^{(l)}_2, \cdots, z^{(l)}_n] \in \mathbb{R}^{d_l \times n}$ is the hidden feature before applying the non-linear activation operation, $\sigma(\cdot)$ is the non-linear activation function, and $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the model parameter in the $l$-th layer. Compared with fully connected neural networks, a GCN has an aggregation operation to leverage the relational information, denoted by $H^{(l-1)} A$, where $A \in \mathbb{R}^{n \times n}$ determines how to aggregate neighbours. Specifically, a typical fully connected layer is defined as follows:
\[
z^{(l)}_i = W^{(l)} h^{(l-1)}_i . \tag{15}
\]
In contrast, in the standard GCN [1], the GCN layer is defined as follows:
\[
z^{(l)}_i = \sum_{j \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|\,|\mathcal{N}_j|}} W^{(l)} h^{(l-1)}_j , \tag{16}
\]
where $\mathcal{N}_i$ denotes the neighbours of the $i$-th node and $|\mathcal{N}_i|$ represents the degree of the $i$-th node.
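As a rough illustration of Eq. (14) and Eq. (16), the sketch below implements a single graph convolution in PyTorch using a dense normalized adjacency matrix. The class name, the dense-matrix representation, the ReLU activation, and the random toy graph are all assumptions made for illustration; this is not the pygcn implementation referenced below.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN layer as in Eq. (14)/(16): Z = W H A_hat, H' = sigma(Z).

    Features are stored row-wise (n x d), so the update is computed as
    A_hat @ H @ W^T, the transpose of the column-wise form in Eq. (14).
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h, a_hat):
        # a_hat is the (n x n) normalized adjacency with entries 1 / sqrt(|N_i| |N_j|)
        # for j in N_i and 0 otherwise, matching the aggregation in Eq. (16).
        z = a_hat @ self.weight(h)   # aggregate the linearly transformed neighbour features
        return torch.relu(z)         # sigma(Z), with ReLU as the assumed activation

# Tiny usage example on a random toy graph (all sizes are arbitrary).
n, d_in, d_out = 5, 8, 4
h = torch.randn(n, d_in)
adj = (torch.rand(n, n) > 0.5).float()
adj = ((adj + adj.t() + torch.eye(n)) > 0).float()          # symmetric, with self-loops
deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
a_hat = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
out = GraphConvLayer(d_in, d_out)(h, a_hat)                 # shape: (n, d_out)
```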
The aggregation operation brings new challenges for training GCNs. In particular, different nodes in a GCN are dependent on each other, while samples are independent in regular deep neural networks. Therefore, existing GCNs usually use the full gradient descent method to learn the model parameters, in order to avoid dealing with the dependence between different nodes; see, for example, https://github.com/tkipf/pygcn.
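For reference, a full-gradient training step of the kind used in such implementations looks roughly like the following. This is only a schematic sketch, not the actual pygcn code; the model, optimizer, and tensor names are placeholders.

```python
import torch

def full_graph_step(model, optimizer, features, a_hat, labels, train_idx):
    """One full-gradient step: the forward/backward pass touches the entire graph."""
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()
    out = model(features, a_hat)                        # forward pass over all n nodes
    loss = loss_fn(out[train_idx], labels[train_idx])   # loss is evaluated on labelled nodes only
    loss.backward()                                     # gradient uses the whole graph
    optimizer.step()
```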
2.2 Question
Please use the stochastic gradient descent method (SGD or one of its variants) to train the GCN. In other words, please re-implement the training method in https://github.com/tkipf/pygcn. At each iteration, you need to sample a sub-graph rather than the entire graph to learn the model parameters (here is an example of using SGD to train a regular CNN: https://github.com/pytorch/examples/blob/master/mnist/main.py). Note that the prediction performance will become worse when using SGD (or its variants) to train a GCN; sampling a good subgraph will alleviate this issue.
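One possible shape for such a sampled-subgraph training loop is sketched below. This is only a schematic outline under assumed interfaces (a model taking (features, adjacency), dense tensors, uniform node sampling), not a solution to the question; in particular, the uniform sampling and the lack of re-normalization of the restricted adjacency are deliberate simplifications.

```python
import torch

def train_with_subgraph_sgd(model, features, a_hat, labels, train_idx,
                            epochs=100, batch_nodes=256, lr=0.01):
    """Schematic SGD-style training on sampled subgraphs (a sketch, not the pygcn code)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        # Draw a random subset of training nodes; uniform sampling is only one choice,
        # and smarter subgraph samplers usually give better accuracy, as noted above.
        perm = train_idx[torch.randperm(train_idx.numel())][:batch_nodes]
        sub_a = a_hat[perm][:, perm]      # adjacency restricted to the sampled nodes
                                          # (re-normalizing sub_a is omitted for brevity)
        sub_x = features[perm]            # features of the sampled nodes
        sub_y = labels[perm]
        optimizer.zero_grad()
        out = model(sub_x, sub_a)         # forward pass on the induced subgraph only
        loss = loss_fn(out, sub_y)
        loss.backward()
        optimizer.step()
```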
References
[1] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.