
1 Mathematical Question

1.1 Background

Stochastic Gradient Descent (SGD) is a commonly used optimization method for large-scale machine learning

models. Specifically, a typical machine learning model is defined as follows:

\min_x f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) , \quad (1)

where x \in \mathbb{R}^d represents the model parameter, n is the number of samples, and f_i(\cdot) denotes the loss function for the i-th sample. To optimize large-scale machine learning models, SGD randomly selects a subset of samples at each iteration to compute the stochastic gradient

g_t \triangleq \frac{1}{|S|} \sum_{i \in S} \nabla f_i(x_t) , \quad (2)

where x_t is the model parameter at the t-th iteration, S \subset [n] is the randomly selected subset, and |S| denotes the size of the subset. Note that g_t is an unbiased estimate of the full gradient, i.e., E[g_t] = \nabla f(x_t). Then, SGD uses this stochastic gradient to update the model parameter:

x_{t+1} = x_t - \eta g_t . \quad (3)
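The sampling step of Eq. (2) and the update of Eq. (3) can be sketched in a few lines (a minimal NumPy illustration on a toy least-squares problem; the problem, the data, and all names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true

def stochastic_gradient(x, batch):
    """Eq. (2): average gradient of f_i over the sampled subset S."""
    A_S = A[batch]
    return A_S.T @ (A_S @ x - b[batch]) / len(batch)

x = np.zeros(d)
eta = 0.05                                     # learning rate
for t in range(2000):
    S = rng.choice(n, size=32, replace=False)  # random subset S of [n]
    x = x - eta * stochastic_gradient(x, S)    # Eq. (3)

full_grad = A.T @ (A @ x - b) / n
print(np.linalg.norm(full_grad))               # small after training
```

Each iteration touches only 32 of the 1000 samples, which is the whole point of SGD for large-scale problems.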

To study the convergence rate of SGD, we have the following assumptions.

Assumption 1 The loss function f(x) has a Lipschitz continuous gradient with parameter L > 0, i.e.,

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| , \quad \forall x, y \in \mathbb{R}^d . \quad (4)

Assumption 2 The stochastic gradient has bounded variance, i.e.,

E[\|g_t - \nabla f(x_t)\|^2] \le \sigma^2 . \quad (5)

Based on these two assumptions, we can prove the convergence of SGD as follows.

Proof. According to Assumption 1, we have

E[f(x_{t+1})] \le E[f(x_t)] + E[\langle \nabla f(x_t), x_{t+1} - x_t \rangle] + \frac{L}{2} E[\|x_{t+1} - x_t\|^2]
= E[f(x_t)] - \eta E[\langle \nabla f(x_t), g_t \rangle] + \frac{\eta^2 L}{2} E[\|g_t\|^2]
= E[f(x_t)] - \eta E[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2} E[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2]
= E[f(x_t)] - \eta \Big(1 - \frac{\eta L}{2}\Big) E[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2} E[\|g_t - \nabla f(x_t)\|^2]
\le E[f(x_t)] - \eta \Big(1 - \frac{\eta L}{2}\Big) E[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} , \quad (6)

where the last inequality uses Assumption 2.
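The derivation in Eq. (6) relies on a bias-variance split of E[\|g_t\|^2], which may be worth spelling out:

```latex
E\big[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2\big]
  = E\big[\|g_t - \nabla f(x_t)\|^2\big] + E\big[\|\nabla f(x_t)\|^2\big]
    + 2\, E\big[\langle g_t - \nabla f(x_t), \nabla f(x_t) \rangle\big] .
```

The cross term vanishes because E[g_t \mid x_t] = \nabla f(x_t) together with the tower rule; the same fact justifies replacing E[\langle \nabla f(x_t), g_t \rangle] by E[\|\nabla f(x_t)\|^2] in the second equality.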

By setting \eta < 1/L, so that 1 - \eta L / 2 > 1/2, we can get

E[f(x_{t+1})] \le E[f(x_t)] - \frac{\eta}{2} E[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} . \quad (7)

Summing over t from 0 to T and telescoping, we have

E[f(x_{T+1})] \le f(x_0) - \frac{\eta}{2} \sum_{t=0}^{T} E[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L (T+1)}{2} . \quad (8)

Rearranging, we can get

\frac{1}{T+1} \sum_{t=0}^{T} E[\|\nabla f(x_t)\|^2] \le \frac{2 \big(f(x_0) - E[f(x_{T+1})]\big)}{\eta (T+1)} + \eta \sigma^2 L . \quad (9)

By choosing \eta = O(1/\sqrt{T}), we have the convergence rate of SGD as follows:

\frac{1}{T+1} \sum_{t=0}^{T} E[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{\sqrt{T}}\Big) . \quad (10)
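To make the last step concrete, one illustrative choice of step size is \eta = 1/\sqrt{T} (assuming T \ge L^2 so that \eta < 1/L, and writing f^* for a lower bound on f, so E[f(x_{T+1})] \ge f^*); substituting into Eq. (9) gives

```latex
\frac{1}{T+1} \sum_{t=0}^{T} E[\|\nabla f(x_t)\|^2]
  \le \frac{2 (f(x_0) - f^*) \sqrt{T}}{T+1} + \frac{\sigma^2 L}{\sqrt{T}}
  \le \frac{2 (f(x_0) - f^*) + \sigma^2 L}{\sqrt{T}}
  = O\Big(\frac{1}{\sqrt{T}}\Big) .
```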

1.2 Question

Although SGD is cheap at each iteration, its stochastic gradient has large variance as an estimate of the full gradient, so its convergence speed is slow. Now, to accelerate the convergence speed, assume we have a new gradient estimator v_t for the full gradient \nabla f(x_t), and use it to update the model parameter:

x_{t+1} = x_t - \eta v_t . \quad (11)

This new estimator is not an unbiased estimator of \nabla f(x_t), i.e., E[v_t] \ne \nabla f(x_t). However, it satisfies the following condition:

\sum_{t=0}^{T} E[\|v_t - \nabla f(x_t)\|^2] \le \eta^2 L^2 \sum_{t=0}^{T} E[\|v_t\|^2] . \quad (12)

Q: Given Eq. (4), Eq. (11), and Eq. (12), please prove that this method (using v_t) can achieve the following convergence rate

\frac{1}{T} \sum_{t=0}^{T} E[\|\nabla f(x_t)\|^2] \le O\Big(\frac{1}{T}\Big) , \quad (13)

and please give the value of the learning rate \eta.

2 Coding Question

2.1 Background

Graph convolutional neural networks (GCN) have attracted a surge of attention in recent years due to their superior performance on graph data. A typical GCN layer is defined as follows:

Z^{(l)} = W^{(l)} H^{(l-1)} A ,
H^{(l)} = \sigma(Z^{(l)}) , \quad (14)

where H^{(l)} = [h_1^{(l)}, h_2^{(l)}, \cdots, h_n^{(l)}] \in \mathbb{R}^{d_l \times n} is the hidden feature of all nodes in the l-th layer, Z^{(l)} = [z_1^{(l)}, z_2^{(l)}, \cdots, z_n^{(l)}] \in \mathbb{R}^{d_l \times n} is the hidden feature before the non-linear activation operation, \sigma(\cdot) is the non-linear activation function, and W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}} is the model parameter in the l-th layer. Compared with fully connected neural networks, GCN has an aggregation operation to leverage the relational information, denoted by H^{(l-1)} A, where A \in \mathbb{R}^{n \times n} determines how to aggregate neighbours. Specifically, a typical fully connected layer is defined as follows:

z_i^{(l)} = W^{(l)} h_i^{(l-1)} . \quad (15)


In contrast, in the standard GCN [1], the GCN layer is defined as follows:

z_i^{(l)} = \sum_{j \in N_i} \frac{1}{\sqrt{|N_i| |N_j|}} W^{(l)} h_j^{(l-1)} , \quad (16)

where N_i denotes the set of neighbours of the i-th node and |N_i| represents the degree of the i-th node.
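Eq. (16) is just Eq. (14) with A taken to be a symmetrically normalized adjacency matrix. A minimal NumPy sketch checking that the two forms agree (the tiny graph and all names are illustrative; self-loops are omitted for simplicity, although the standard GCN adds them):

```python
import numpy as np

# Tiny undirected graph on 4 nodes (adjacency matrix, no self-loops here)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

deg = adj.sum(axis=1)                    # |N_i|
A = adj / np.sqrt(np.outer(deg, deg))    # A_ji = 1 / sqrt(|N_j||N_i|) on edges

d_in, d_out, n = 3, 2, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(d_in, n))           # H^{(l-1)}: node features as columns
W = rng.normal(size=(d_out, d_in))       # W^{(l)}

# Matrix form of Eq. (14): Z^{(l)} = W^{(l)} H^{(l-1)} A
Z = W @ H @ A

# Per-node form of Eq. (16), for node i
i = 0
neighbours = np.nonzero(adj[i])[0]
z_i = sum(W @ H[:, j] / np.sqrt(deg[i] * deg[j]) for j in neighbours)

print(np.allclose(Z[:, i], z_i))         # the two forms agree
```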

The aggregation operation brings new challenges for training GCN. In particular, different nodes are dependent on each other in GCN, while samples are independent in regular deep neural networks. Therefore, existing GCNs usually use the full gradient descent method to learn model parameters, to avoid dealing with the dependence between different nodes, such as the implementation at https://github.com/tkipf/pygcn.

2.2 Question

Please use the stochastic gradient descent method (SGD or its variants) to train GCN. In other words, please re-implement the training method in https://github.com/tkipf/pygcn. At each iteration, you need to sample a sub-graph rather than the entire graph to learn the model parameters (here is an example of using SGD to train a regular CNN: https://github.com/pytorch/examples/blob/master/mnist/main.py). Note that the prediction performance will become worse when using SGD (or its variants) to train GCN; sampling a good subgraph will alleviate this issue.
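One simple way to obtain a sub-graph per iteration is node-induced sampling: sample a subset of nodes and keep only the edges among them. This is just one of several options (layer-wise or cluster-based samplers generally work better), and the toy graph and all names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: n nodes, random symmetric adjacency with self-loops
n = 100
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.maximum(adj, adj.T)
np.fill_diagonal(adj, 1.0)

X = rng.normal(size=(n, 8))       # node features (one row per node)

def sample_subgraph(batch_size):
    """Node-induced sub-graph: sample nodes, keep edges among them."""
    nodes = rng.choice(n, size=batch_size, replace=False)
    sub_adj = adj[np.ix_(nodes, nodes)]
    # Re-normalize on the sub-graph: D^{-1/2} A D^{-1/2}
    deg = sub_adj.sum(axis=1)
    deg[deg == 0] = 1.0           # guard against isolated sampled nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    sub_adj = sub_adj * np.outer(d_inv_sqrt, d_inv_sqrt)
    return nodes, sub_adj

nodes, sub_adj = sample_subgraph(batch_size=16)
H = sub_adj @ X[nodes]            # one aggregation step on the sub-graph only
print(H.shape)                    # (16, 8)
```

In an SGD training loop, each iteration would compute the loss and gradient only on the sampled nodes and their induced adjacency, instead of on the full graph.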

References

[1] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
