# Question 3: Stata exercise


Use the dataset housing.dta to perform the following exercises in Stata. A description of the data can be found in the file “About housing dataset.pdf”. We are interested in predicting the median house price (MEDV) using all 13 independent variables given in the dataset.

(i) Estimate a linear regression model that relates the median house price (MEDV) to the 13 independent variables given in the dataset, using all sample values. (Note: for this part, let the coefficient standard errors be computed using Stata's defaults.) Report and interpret the coefficient estimate of the variable CHAS. Is CHAS significant in predicting MEDV at the 1% level?
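Part (i) can be sketched as below. The predictor names (crim through lstat) follow the standard Boston-housing naming convention and are assumptions; substitute whatever names appear in housing.dta.

```stata
* Load the data (file path is an assumption -- adjust as needed)
use housing.dta, clear

* OLS of MEDV on all 13 predictors, with Stata's default (homoskedastic)
* standard errors. Variable names are assumed, not taken from the dataset.
regress medv crim zn indus chas nox rm age dis rad tax ptratio b lstat

* CHAS is a 0/1 dummy (tract bounds the Charles River), so its coefficient
* is the estimated difference in median price for riverside tracts, holding
* the other regressors fixed. Compare its p-value with 0.01.
```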

(ii) Plot the residuals against the fitted values to check for the presence of heteroskedasticity. Also perform White's heteroskedasticity test and interpret the result you obtain, using a significance level of 5%.
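A minimal sketch for part (ii), run immediately after the regression in part (i):

```stata
* Residuals vs. fitted values; a fanning pattern suggests heteroskedasticity
rvfplot, yline(0)

* White's general test for heteroskedasticity
estat imtest, white
* Reject the null of homoskedasticity at the 5% level if the p-value < 0.05.
```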

(iii) If the errors are heteroskedastic, what issue is this likely to cause for the OLS output obtained in part (i)? Report the least squares output again after correcting for that issue. Is CHAS now significant in predicting MEDV at the 1% level?

We now want to compare different model-fitting procedures for predicting median house prices using a linear model (you may ignore heteroskedasticity for the questions below).
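For part (iii): under heteroskedasticity the OLS coefficient estimates remain unbiased, but the default standard errors (and hence t-statistics and p-values) are invalid. A sketch of the corrected estimation, with the same assumed variable names as before:

```stata
* Re-estimate with heteroskedasticity-robust (White) standard errors.
* Coefficients are identical to part (i); only SEs, t-stats, and p-values change.
regress medv crim zn indus chas nox rm age dis rad tax ptratio b lstat, vce(robust)
```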

Sample splitting:

(iv) Split the dataset into a training set and a validation set in a 70-30 ratio. You may round the number of observations in each set to integer values. Use 12345 as the random-number seed when creating the random split. State how many observations are in the training and validation sets.
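One way to create the split is with splitsample (Stata 16+); the generated variable name `sample` is an assumption, with `sample==1` marking the training set and `sample==2` the validation set.

```stata
* 70-30 train/validation split with the required seed
splitsample, generate(sample) split(0.70 0.30) rseed(12345)

* Count observations in each group
tabulate sample
```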

(v) Fit a linear model using least squares on the training set. Report and interpret the R2 obtained. (Remember to store the OLS results for later analysis.)
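A sketch for part (v), assuming the split variable `sample` from part (iv) and the same assumed predictor names:

```stata
* OLS on the training observations only
regress medv crim zn indus chas nox rm age dis rad tax ptratio b lstat if sample == 1

* Store the results for the comparison in part (xiii)
estimates store ols

* e(r2) is the fraction of training-sample variation in MEDV
* explained by the fitted model.
```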

Ridge regression:

(vi) Fit a ridge regression model on the training set, with λ chosen by 10-fold cross-validation (use 12345 as the random-number seed). Report the chosen value of λ. (Remember to store the ridge results for later analysis.)
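In Stata 16+, ridge can be fit as an elastic net with mixing parameter α = 0. A hedged sketch, again with assumed variable names:

```stata
* Ridge regression on the training set; lambda chosen by 10-fold CV
elasticnet linear medv crim zn indus chas nox rm age dis rad tax ptratio b lstat ///
    if sample == 1, alpha(0) selection(cv, folds(10)) rseed(12345)

* Store for later comparison, then inspect the lambda grid;
* the CV-selected lambda is flagged in the knot table
estimates store ridge
lassoknots
```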

(vii) Report the ridge coefficients (unstandardised) obtained at the selected value of λ. Compare with the least squares estimates obtained in part (v).
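Assuming the stored results `ridge` and `ols` from the sketches above, the coefficients can be listed side by side:

```stata
* Ridge vs. OLS coefficients at the CV-selected lambda.
* display(coef, penalized) reports penalized coefficients on the
* original (unstandardised) scale of the regressors.
lassocoef ridge ols, display(coef, penalized)
```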

(viii) Plot the ridge coefficients (unstandardised) against λ. Comment on the nature of the ridge coefficient paths you observe as the tuning parameter λ increases.
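A sketch for the path plot, run after the ridge fit (or after `estimates restore ridge`):

```stata
* Coefficient paths as a function of lambda
coefpath
* As lambda increases, ridge shrinks every coefficient smoothly toward
* zero, but none is set exactly to zero.
```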

The lasso:

(ix) Fit a lasso model on the training set, with λ chosen by 10-fold cross-validation (use 12345 as the random-number seed). Report the chosen value of λ. (Remember to store the lasso results for later analysis.)
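A sketch for part (ix), parallel to the ridge fit and with the same assumed variable names:

```stata
* Lasso on the training set; lambda chosen by 10-fold CV (Stata 16+)
lasso linear medv crim zn indus chas nox rm age dis rad tax ptratio b lstat ///
    if sample == 1, selection(cv, folds(10)) rseed(12345)

* Store for later comparison; the CV-selected lambda is flagged
* in the knot table
estimates store lasso
lassoknots
```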

(x) Report the number of predictors selected by the lasso at the chosen λ.
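Assuming the stored result `lasso` from part (ix), the selected predictors can be listed directly:

```stata
* Predictors retained (nonzero coefficients) at the CV-selected lambda
lassocoef lasso
* The lasso estimation header also reports the number of nonzero coefficients.
```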

(xi) Plot the lasso coefficients (unstandardised) against λ. Comment on the nature of the lasso coefficient paths you observe as the tuning parameter λ increases.
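A sketch for the lasso path plot, run after the lasso fit (or after `estimates restore lasso`):

```stata
* Lasso coefficient paths as a function of lambda
coefpath
* Unlike ridge, the lasso sets coefficients exactly to zero one by one
* as lambda increases, so paths hit the axis and stay there.
```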

Comparison:

(xii) Using the coefficient paths obtained for ridge and the lasso in parts (viii) and (xi) respectively, compare the nature of the ridge and lasso coefficients when λ lies in the range 0.01 to 10: which of the two methods performs variable selection? Which one causes shrinkage of coefficients?

(xiii) Compare the mean squared error (MSE) and R2 of the least squares, ridge regression, and lasso fitting procedures on both the training and validation sets. Comment on your results.
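Assuming the three stored results `ols`, `ridge`, and `lasso` and the split variable `sample` from the sketches above, the comparison can be produced in one command:

```stata
* Goodness of fit (MSE and R-squared) for all three fits, reported
* separately for the training (sample==1) and validation (sample==2) groups
lassogof ols ridge lasso, over(sample)
```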