# Help With MAST90044 Assignment 1 Thinking and Reasoning with Data

MAST90044 Thinking and Reasoning with Data
Semester 1 2020
Assignment 1
Due: 8am, Monday 27 April
Instructions
• Assignments are to be submitted (uploaded) via Canvas.
• You must sign the plagiarism declaration. The link is available on the subject’s Canvas website.
• Your assignment should show all working and reasoning, as marks will be given for method as well as for correct answers.
• Paste any R code and output into the appropriate places so that it can be seen easily along with your
other work. Graphics from R can be resized within your document; make them smaller as necessary.
• Assignments count for 50% of the assessment in this subject. This one is worth 15%, and covers the
work done in weeks 1 to 4.
• Tutors will not help you directly with assignment questions. However, they may give some help with
R.
• Solutions to the assignment questions will be made available later.
• When constructing a panel of graphs with multiple plots, it is good to use the R command
par(mfrow = c(nrows,ncols)) where nrows is the number of rows and ncols the number of columns
in the panel. The default is (1,1).
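As a minimal sketch of the panel layout described in the last instruction, the following places a histogram and a normal Q-Q plot side by side and then restores the previous settings. The simulated variable `x` and the 1-by-2 layout are illustrative assumptions, not part of the assignment.

```r
# Illustrative only: simulated data, not the assignment data.
set.seed(1)
x <- rnorm(100)

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns

hist(x, main = "Histogram of x")
qqnorm(x)
qqline(x)

par(old_par)  # back to the previous layout
```

Saving and restoring the old settings avoids later plots being squeezed into the same panel by accident.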
Q.1. The data set unesco.csv, available on the LMS, contains demographic and economic information from
the 1990 UNESCO yearbook on about half the world’s countries. Definitions of the variables in the
data set are as follows:
• Birth rate per 1,000 of population
• Death rate per 1,000 of population
• Infant deaths per 1,000 of population
• Life expectancy at birth for males
• Life expectancy at birth for females
• Gross National Product (GNP) per capita
• Geopolitical group
1 Eastern Europe (former Soviet Satellite)
2 South America and Mexico
3 Western Europe, North America, Japan
4 Middle East
5 Asia
6 Africa
• Country
Ignoring geopolitical group:
(a) Summarise the GNP values using summary statistics and two graphical tools. Briefly describe any
obvious features of the distribution.
(b) Use two graphical tools to compare the observed distribution of infant deaths with a normal
distribution. Briefly comment.
(c) Graphically examine the relationship between the infant death rate and GNP. Calculate the
correlation coefficient between the two variables. Comment on how useful it is in this situation.
(d) Graphically examine the relationship between life expectancy at birth for females and the birth
rate. Comment on the strength or otherwise of the relationship. Formulate a statistical model to
describe the relationship. Graphically fit the model.
Taking geopolitical group into account:
(e) Use two graphical tools to examine the relationship between life expectancy at birth for males and
geopolitical group. Use suitable R functions to calculate the mean and standard deviation for each
group, and the number of countries in each group. Comment on any obvious differences between
the groups and identify any clear outliers.
(f) Calculate the net population growth rate per 1000 of population (we will call this “net growth”).
Type library(lattice) in R to ensure that the xyplot() function is available. Use xyplot
to examine the relationship between net growth and GNP for each geopolitical group separately.
Note that in the matrix of plots, group 1 will be placed in the bottom left hand corner, and you
proceed across the row of plots. Comment on what the plots show in regard to the relationship,
and any limitations of this type of plot here.
(g) Create a plot of net growth vs GNP for group 2 on its own. Calculate the correlation coefficient,
and comment on the strength and direction of the relationship.
[4 + 3 + 4 + 5 + 8 + 7 + 5 = 36 marks]
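For the per-group mean, standard deviation and count asked for in Q.1(e), one possible approach is `tapply()`. The sketch below uses a small made-up data frame: the column names `group` and `value` are hypothetical, and the numbers are not taken from unesco.csv.

```r
# Made-up toy data; column names are hypothetical, values are NOT from unesco.csv.
toy <- data.frame(
  group = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3)),
  value = c(60, 62, 64, 50, 54, 70, 72, 74, 76)
)

group_mean <- tapply(toy$value, toy$group, mean)    # mean per group
group_sd   <- tapply(toy$value, toy$group, sd)      # standard deviation per group
group_n    <- tapply(toy$value, toy$group, length)  # number of rows per group

# Combine the three summaries into one table, one row per group.
summary_table <- cbind(mean = group_mean, sd = group_sd, n = group_n)
print(summary_table)
```

The same pattern should carry over to the real data once the actual column names are substituted.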
Q.2. It is well known that quitting smoking is difficult. Many people who are trying to quit use nicotine
replacement methods like nicotine patches or nicotine gum to ease nicotine withdrawal symptoms. As
an alternative, medical researchers investigated whether the use of an antidepressant medication might
be a more effective aid to those attempting to give up cigarettes. In a study reported in the March 4,
1999 issue of the New England Journal of Medicine, researchers published results that compared the effectiveness
of nicotine patches to the effectiveness of the antidepressant bupropion, which is marketed under the
brand name Zyban. The study consisted of 893 participants who were randomly allocated to four
(i = 1, 2, 3, 4) treatment groups, listed in the table below. They did not know to which treatment
they were allocated, i.e. this was a single-blind study. The table shows the number of people not
smoking 6 months following the study, for each treatment.

| Treatment | Subjects not smoking (x_i) | Total subjects (n_i) |
|---|---|---|
| Placebo only | 30 | 160 |
| Nicotine patch | 52 | 244 |
| Zyban | 85 | 244 |
| Zyban and nicotine patch | 95 | 245 |

(a) Calculate the Wald, Agresti-Coull and Jeffreys prior 95% confidence intervals for each treatment
group separately. Draw the confidence intervals.
(b) Comment on the validity or otherwise of the assumptions made in these calculations.
(c) Find a point and an interval estimate of the difference in proportions of those not smoking after
6 months between the ‘Zyban + patch’ group and the group that used Zyban alone.
Give an interpretation of the confidence interval. Make one comment, with supporting evidence
from above, on the claim that using a patch in addition to Zyban is effective for quitting.
(d) Construct a Wald confidence interval to test the claim that using a nicotine patch is no more
effective than using nothing at all. Interpret the confidence interval and give a reason for your
choice of confidence interval method.
(e) Provide a single Wald confidence interval to test the claim that Zyban, with or without a patch,
is better than doing nothing or using a patch. Interpret the confidence interval and give a reason
for your choice of confidence interval method.
[5 + 4 + 4 + 5 + 6 = 24 marks]
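The three single-proportion intervals named in Q.2(a) can be sketched as below. The function name `binom_intervals` and the counts `x = 20`, `n = 100` are hypothetical and are not taken from the table. The Agresti-Coull form used here adds z²/2 successes and z²/2 failures, and the Jeffreys interval takes quantiles of a Beta(x + 0.5, n − x + 0.5) distribution; the subject's notes may define these slightly differently (for example, the "add two successes and two failures" simplification), so treat this as one common formulation rather than the required one.

```r
# Sketch of three confidence intervals for a binomial proportion.
# x = number of successes, n = number of trials (hypothetical inputs below).
binom_intervals <- function(x, n, conf = 0.95) {
  alpha <- 1 - conf
  z <- qnorm(1 - alpha / 2)

  # Wald: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
  p_hat <- x / n
  wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

  # Agresti-Coull: Wald-style interval around an adjusted proportion
  n_tilde <- n + z^2
  p_tilde <- (x + z^2 / 2) / n_tilde
  ac <- p_tilde + c(-1, 1) * z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)

  # Jeffreys: equal-tailed quantiles of the Beta(x + 0.5, n - x + 0.5) posterior
  jeff <- qbeta(c(alpha / 2, 1 - alpha / 2), x + 0.5, n - x + 0.5)

  out <- rbind(Wald = wald, AgrestiCoull = ac, Jeffreys = jeff)
  colnames(out) <- c("lower", "upper")
  out
}

# Hypothetical counts, not from the study table.
ci <- binom_intervals(x = 20, n = 100)
print(round(ci, 3))
```

Returning all three intervals in one matrix makes it easy to compare their widths or to draw them with `segments()`.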
Q.3. The chi-squared distribution, denoted by $X \sim \chi^2_\nu$, is used a great deal in statistics and science, and we
will meet it again later. The exact shape of the distribution depends on the degrees of freedom ($\nu$): at
larger values of $\nu$ the chi-squared distribution approaches a normal distribution, so smaller values of $\nu$ give a
stronger departure from the normal distribution. Here we will examine how quickly the sampling distribution of the sample
mean taken from a $\chi^2_2$ distribution converges to normality (or at least to symmetry).
(a) Take a large sample from the $\chi^2_2$ distribution and test its departure from normality using two
graphical tools. You will need the R function rchisq. Comment on the result.
(b) Examine the sampling distribution of the sample mean from samples of size 5, by generating 1000
such samples and looking at a plot of the density (make a comment about the distribution).
(c) Compare the sampling distribution of the sample mean for a range of sample sizes (e.g. 1, 5, 10,
20, 40, 80), and use your results to suggest how large the sample size needs to be for adequate
convergence. The mean of a $\chi^2_\nu$ distribution is $\nu$.
[5 + 3 + 5 = 13 marks]
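One way to simulate the sampling distribution of the sample mean in Q.3 is sketched below, for a chi-squared distribution with 2 degrees of freedom. The helper name `sim_means`, the seed and the plotting choices are assumptions made for illustration; other approaches (for example `replicate()`) would work equally well.

```r
# Sketch: simulate the sampling distribution of the sample mean.
set.seed(2020)  # arbitrary seed, for reproducibility only

sim_means <- function(n, reps = 1000, df = 2) {
  # Each column of the matrix is one sample of size n;
  # colMeans() then gives one sample mean per column.
  samples <- matrix(rchisq(n * reps, df = df), nrow = n)
  colMeans(samples)
}

means_5 <- sim_means(n = 5)

plot(density(means_5), main = "Sample means, n = 5")
abline(v = 2, lty = 2)  # theoretical mean of a chi-squared(2) distribution
```

Calling `sim_means()` with different values of `n` inside a `par(mfrow = ...)` panel gives a side-by-side view of how the shape changes with sample size.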
Total marks = 73