
ETF5952 Quantitative Methods for Risk Analysis

Semester 1, 2020

ASSIGNMENT 2

Deadline: 3PM, June 10, 2020

Important Instructions

• This assignment comprises 25% of the assessment for ETF5952. This is an individual, NOT a syndicate, assignment. On the Assignment Cover Sheet, read the references to plagiarism and collusion from University Statute 4.1, Part III - Academic Misconduct.

• Answer all questions, and start each question on a new page. Your assignment must be typed and you must submit a pdf file (A4 pages) with an Assignment Cover Sheet (from the ASSIGNMENTS section of Moodle).

Name your assignment: Surname Initials AS.pdf and upload this file to Moodle as follows:

1. Go to the “ASSIGNMENTS” section.

2. Click on the “ASSIGNMENT 2” link to upload.

3. The following message will appear momentarily: “File uploaded successfully.”

(To later confirm that your upload was successful, go to the “ASSIGNMENTS” section and click on the “Assignment 2” upload link. The uploaded file’s name will be shown.)

• If you have a valid reason for not meeting the deadline, you will be asked to submit what you have completed by the due date, and your grade will be assessed relative to opportunity. Without a valid reason, 10% of the assignment's allocated marks will be deducted for each day that it is late.

• Submit one pdf file only. Do NOT submit/attach R scripts or output files. Do not submit your assignment in a folder.

• You should summarize what you obtain in order to answer the questions, instead of providing all code and output. If you provide too much output relative to what the questions ask, we will consider that you may not understand the questions, and your answers will be subject to a point deduction.

• If you have questions regarding the materials, you are encouraged to use our consultation sessions. The course email should be used only for pointing out typos and for personal matters.


Question 1 (25 points: 5+5+5+5+5)

To answer this question, use a mobility data set for Australia, “move au.csv”. These data are extracted from Google's mobility data; see the Google site (https://www.google.com/covid19/mobility/) for more information. The data set contains 6 variables on mobility in 8 sub-regions of Australia from Feb 15 to May 7.

We consider a factor model for the $j$th variable $x_{i,t,j}$ for region $i$ and time $t$, given by

$$E[x_{i,t,j}] = \phi_{j,1}\nu_{it,1} + \phi_{j,2}\nu_{it,2} + \cdots + \phi_{j,6}\nu_{it,6}.$$

Here, since each variable can vary over time and across regions, the latent factors depend on both time and region (but the analysis is otherwise similar).

1. To estimate the factor model, apply Principal Component Analysis (PCA). Use the scale option to standardize the 6 variables. Report the plot of the variances of the PCs and explain which component is dominant (no more than 30 words).

2. Report the estimated loadings in a table. From the loadings, explain the effect of the first factor on the 6 variables (no more than 30 words).

3. Using the estimated factors, report a boxplot of the 6 factors and explain whether the result is consistent with the one in Question 1.1.

4. Add the estimated first factor to the original data set as a new variable. Also, set “date” as a date variable by using the “as.Date” function. Report a scatter plot with date on the x-axis and the first factor on the y-axis. Draw a horizontal line at y = 0. Interpret the variation in the first factor over time (no more than 50 words).

5. Notice that the first factor, $\nu_{it,1}$, can vary across regions. To see the regional variation, create a box plot of $\nu_{it,1}$ for each region (the “boxplot” function may not work well without adjustment; if so, I suggest using the ggplot2 package). Based on the variation of the first factor in Victoria relative to that in the other regions, explain whether human mobility in Victoria decreased (no more than 50 words).
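
The PCA workflow described in Question 1 might be sketched in R as follows. This is a minimal illustration on simulated data, not on the assignment data: the data frame `mob`, its column names (`v1` to `v6`, `region`, `date`, `f1`) and the simulated values are all assumptions made for this example. Only base R functions (`prcomp`, `screeplot`, `as.Date`, `boxplot`) are used.

```r
set.seed(1)

# Simulated stand-in for the mobility data (all names and values are hypothetical):
# 8 regions observed on 83 consecutive days, 6 variables driven by one common factor.
n_region <- 8
n_day    <- 83
common   <- rnorm(n_region * n_day)
mob <- data.frame(
  region = rep(paste0("R", 1:n_region), each = n_day),
  date   = rep(as.character(seq(as.Date("2020-02-15"), by = "day",
                                length.out = n_day)),
               times = n_region)
)
for (j in 1:6) mob[[paste0("v", j)]] <- common + rnorm(nrow(mob), sd = 0.5)

# PCA on the 6 variables; scale. = TRUE standardizes each variable first.
pca <- prcomp(mob[, paste0("v", 1:6)], scale. = TRUE)

screeplot(pca, main = "Variances of PCs")  # bar plot of the PC variances
loadings <- pca$rotation                   # 6 x 6 matrix of estimated loadings
factors  <- pca$x                          # estimated factors (PC scores)
boxplot(factors, main = "Estimated factors")

# Add the first factor to the data and convert the date column to Date class.
mob$f1   <- factors[, 1]
mob$date <- as.Date(mob$date)

plot(mob$date, mob$f1, xlab = "date", ylab = "first factor")
abline(h = 0)                              # horizontal line at y = 0

# First factor by region, one box per region.
boxplot(f1 ~ region, data = mob, las = 2)
```

The sign of a principal component is arbitrary, so the direction of the first factor should be read together with the signs of its loadings.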

Question 2 (25 points: 5+5+5+10)

We will use a type of difference-in-differences estimation to estimate the effect of Napster on music sales. In this question, use “cex basefile97 02.csv”, which is extracted from several data sets and downloaded from the Journal of Applied Econometrics. Before the analysis, you have to clean the data set. The data set is provided with the “readme.sh.txt” file. Check carefully which variables are in the data set. We do not use “newid”, “intno” and “firmth”. We use “cdall” and “weight” as the dependent variable and the weight, respectively. The variables “year” and “nint” are the key variables; treat the other variables as control variables. When you load the data set, notice that it has no variable names, so you have to use the option for no header (check “?read.csv”). We do NOT use “weight” as a weight in any of the regressions in this question.

Consequential marks will not be provided, so you are strongly encouraged to read the readme file and set up your data carefully (it is easy to select variables by column number: DATA[,3] means the 3rd variable and DATA[,3:7] means the 3rd to 7th variables). When you use gamlr, you do not need to report any hypothesis testing results.
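
As a small illustration of loading a headerless file and selecting columns by position, the following sketch reads a tiny in-memory CSV. The values, the number of columns and the order of the column names are hypothetical; the actual layout must be taken from the readme file.

```r
# A tiny in-memory stand-in for a CSV file without a header row
# (values and column order are hypothetical, for illustration only).
csv_text <- "1,1997,0,12.5,3
2,1999,1,0.0,4
3,2001,1,7.2,2"

# header = FALSE stops R from treating the first data row as variable names;
# the columns are then named V1, V2, ...
DATA <- read.csv(text = csv_text, header = FALSE)

# Assign names by hand once the readme has been checked (this order is made up).
names(DATA) <- c("id", "year", "nint", "cdall", "size")

third    <- DATA[, 3]      # the 3rd variable
controls <- DATA[, 3:5]    # the 3rd to 5th variables
str(DATA)
```

With a real file, `text = csv_text` would be replaced by the file path as the first argument.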

1. Let $y_{it}$ be music sales and let $d_{it}$ take 1 if the household HAS internet access and 0 otherwise, for household $i$ and year $t$. Napster started in 1999; let $\mathrm{t.nap}_t$ take 1 if $t \ge 1999$ and 0 otherwise. Set $\mathrm{d.nap}_{it} = d_{it} \times \mathrm{t.nap}_t$. Without internet access, people cannot use Napster. Thus, we consider the following model

$$y_{it} = \alpha + \beta_t + \gamma d_{it} + \delta\,\mathrm{d.nap}_{it} + \varepsilon_{it},$$

where $\alpha$, $\gamma$ and $\delta$ are parameters, $\beta_t$ is a year effect for $t$, and $\varepsilon_{it}$ is the error. The parameter $\delta$ measures the effect of Napster on music sales. Estimate this model using the data set. Report only the estimated effect of Napster and provide an interpretation of Napster’s effect (20 words).

2. Estimate the model in Question 2.1 with all available control variables. Report only the estimated effect of Napster and provide an interpretation of Napster’s effect (20 words).


3. Use lasso (gamlr) to estimate the model in Question 2.2 (single machine learning). Report only the estimated effect of Napster and provide an interpretation of Napster’s effect (20 words).

4. Use double machine learning to estimate the effect of Napster. First, apply lasso (gamlr) to estimate

$$d_{it} = \alpha + \beta_t + x_{it}'\lambda + \eta_{it},$$

where $x_{it}$ is a vector of control variables. Note that you also have to include the time effects $\beta_t$ (time dummies). Let $\hat{d}_{it}$ be the fitted values from this estimation.

Second, apply lasso (gamlr) to estimate

$$y_{it} = \alpha + \beta_t + \gamma d_{it} + \delta\,\mathrm{d.nap}_{it} + \phi(\hat{d}_{it} \times \mathrm{t.nap}_t) + x_{it}'\pi + \varepsilon_{it}.$$

Here, always keep the term $(\hat{d}_{it} \times \mathrm{t.nap}_t)$ in the model. Report only the estimated effect of Napster and provide an interpretation of Napster’s effect (20 words).
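
The four estimation steps in Question 2 could be sketched roughly as below. This is an illustration on simulated data under stated assumptions, not a solution for the assignment data: every variable name (`d`, `t_nap`, `d_nap`, `x`, `dhat`) and every simulated number is hypothetical, and the choice of which columns to leave unpenalized in step (3) is not specified in the text. The sketch assumes the gamlr package is installed and uses its `free` argument to leave a column unpenalized.

```r
library(gamlr)   # lasso; assumed to be installed
set.seed(1)

# Simulated household-year data (all names and numbers are hypothetical).
n     <- 2000
year  <- sample(1997:2002, n, replace = TRUE)
x     <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
d     <- as.numeric(0.8 * x[, 1] + rnorm(n) > 0)   # internet access depends on x1
t_nap <- as.numeric(year >= 1999)                  # post-Napster indicator
d_nap <- d * t_nap                                 # interaction term
y     <- 10 + 0.5 * (year - 1997) + 2 * d - 3 * d_nap + x[, 1] + rnorm(n)

# (1) Difference-in-differences with year effects, no controls.
fit1 <- lm(y ~ factor(year) + d + d_nap)
coef(fit1)["d_nap"]

# (2) Same model with all controls added.
fit2 <- lm(y ~ factor(year) + d + d_nap + x)
coef(fit2)["d_nap"]

# (3) Single lasso: d, d_nap, year dummies and controls in one design matrix.
yr   <- model.matrix(~ factor(year))[, -1]         # year dummies, first year as baseline
X3   <- cbind(d = d, d_nap = d_nap, yr, x)
fit3 <- gamlr(X3, y)
coef(fit3)["d_nap", ]                              # coefficient at the AICc-selected penalty

# (4) Double lasso: first predict d from year dummies and controls ...
fit_d <- gamlr(cbind(yr, x), d)
dhat  <- drop(as.matrix(predict(fit_d, cbind(yr, x), type = "response")))

# ... then keep dhat * t_nap unpenalized (column 3) in the outcome lasso.
X4   <- cbind(d = d, d_nap = d_nap, dhat_nap = dhat * t_nap, yr, x)
fit4 <- gamlr(X4, y, free = 3)
coef(fit4)["d_nap", ]
```

In the simulation the true interaction coefficient is set to -3, so the four estimates can be compared against a known value; with real data no such benchmark exists.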

Question 3 (25 points: 10+15)

In the lecture, the average treatment effect was introduced under a binary treatment status, but we often encounter randomized controlled trials with multiple treatments. Consider the case of three treatment statuses: no treatment, treatment 1 and treatment 2. Let $d_1$ be a dummy variable taking 1 for treatment 1 and 0 otherwise, and $d_2$ be a dummy variable taking 1 for treatment 2 and 0 otherwise.

1. We consider the following regression

$$y = \alpha + \beta d_1 + \gamma d_2 + \varepsilon,$$

where $\alpha$, $\beta$ and $\gamma$ are parameters and $\varepsilon$ is the error with $E[\varepsilon] = 0$. Using $\beta$ and $\gamma$ in this regression, explain what you can estimate. First, express only the final outcomes mathematically (no derivations) and second, explain each (no more than 10 words for each).

2. Suppose that we want to estimate the average difference between the effects of treatment 1 and treatment 2. To this end, consider and express a regression specification when only $y$, $d_1$ and $d_2$ are available, and denote the key parameter by $\delta$. Express mathematically what $\delta$ measures.
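
As a hedged illustration of how the dummy-variable regression in Question 3.1 relates to group means, the conditional expectations can be written out. This sketch assumes that the three statuses are mutually exclusive (so $d_1 d_2 = 0$) and that $E[\varepsilon \mid d_1, d_2] = 0$, which is stronger than the stated $E[\varepsilon] = 0$ but natural under randomization; both are assumptions added here, not statements from the question.

```latex
\begin{align*}
E[y \mid d_1 = 0,\, d_2 = 0] &= \alpha, \\
E[y \mid d_1 = 1,\, d_2 = 0] &= \alpha + \beta, \\
E[y \mid d_1 = 0,\, d_2 = 1] &= \alpha + \gamma.
\end{align*}
```

Under these assumptions, each coefficient can be read off by comparing the relevant pair of conditional means.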

Question 4 (25 points: 5+5+5+5+5)

Use “Hitters” from the ISLR package.

1. The original data set contains some missing values, denoted by NA. Drop the observations with missing values and then report the summary statistics of “Salary” only.

2. Report a histogram of Salary and explain salary inequality in Major League Baseball (no more than 15 words).

3. To understand the sources of the salary inequality, we use a regression tree for Salary with the remaining variables in the data set. Provide the estimation result and list all the characteristics of high-salary players (just make a list of the conditions: no explanation is required).

4. Friend A argues that the regression tree analysis based on Salary may be influenced by outliers. Explain whether the argument is correct or not (no more than 30 words).

5. Given Friend A’s argument, we consider an alternative formulation by taking the log of Salary. Estimate a regression tree for log Salary with the rest of the variables as regressors. Provide the estimation result and list the characteristics of high-salary players (just make a list of the conditions: no explanation is required).
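
The data preparation and tree fitting in Question 4 might look roughly like the sketch below. The text does not name a tree package, so the use of rpart here is an assumption (the course may have used a different one, in which case the output format will differ). The object names `hit`, `hit_log`, `tree_level` and `tree_log` are made up for this example, and the ISLR package is assumed to be installed.

```r
library(ISLR)    # provides the Hitters data
library(rpart)   # regression trees; an assumed choice, not named in the text

hit <- na.omit(Hitters)          # drop rows containing any NA
summary(hit$Salary)              # summary statistics of Salary only
hist(hit$Salary, main = "Salary", xlab = "Salary")

# Regression tree for Salary on all remaining variables.
tree_level <- rpart(Salary ~ ., data = hit, method = "anova")
print(tree_level)

# Copy of the data in which the Salary column holds log(Salary),
# so that the raw salary cannot enter the tree as a regressor.
hit_log <- hit
hit_log$Salary <- log(hit_log$Salary)
tree_log <- rpart(Salary ~ ., data = hit_log, method = "anova")
print(tree_log)
```

Printing an rpart object lists each split condition along the path to every leaf together with the mean response in that leaf, which is one way to read off the conditions that lead to the highest predicted values.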
