STAT 3675Q - STATISTICAL COMPUTING UCONN

Fall 2019 Marcos Prates

1. Objective

The IMDB Movies Dataset (file imdb.csv) contains information about over 10,000 movies. The names of the first twelve columns are self-explanatory (the duration is in seconds). The rest of the variables (Action, Adult, Adventure, . . .) are dummy variables (0/1) indicating whether the movie has the given genre.
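As a minimal sketch of how the data layout described above might be checked in R: the small data frame below is a hypothetical stand-in for the result of read.csv("imdb.csv"), and the column names title and duration are assumptions, not the actual names in the file. Only the genre names Action and Adventure come from the description.

```r
# Hypothetical stand-in for: movies <- read.csv("imdb.csv")
# Column names `title` and `duration` are assumed for illustration only.
movies <- data.frame(
  title     = c("Movie A", "Movie B", "Movie C"),
  duration  = c(5400, 7200, 6000),   # duration in seconds, as stated above
  Action    = c(1, 0, 1),            # genre dummy variables (0/1)
  Adventure = c(0, 1, 1)
)

# Convert the duration from seconds to minutes for easier reading
movies$duration_min <- movies$duration / 60

# Check that every genre column really contains only 0/1 values
genre_cols <- c("Action", "Adventure")
all_binary <- all(sapply(movies[genre_cols], function(x) all(x %in% c(0, 1))))

# Number of movies in each genre
genre_counts <- colSums(movies[genre_cols])
print(genre_counts)
```

With the real file, genre_cols would be replaced by the names of all columns after the first twelve.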

In this project, you will use R to apply a number of statistical methods that have been covered during the course.

• Projects are to be completed individually or in pairs.

• The project is worth 25% of the final grade.

Directions. You are asked to write a preliminary report and a report. Please follow the guidelines below carefully.

2. Preliminary report [30 points]

• Provide a single file named name_3675_prelim.pdf (or name1_name2_3675_prelim.pdf if you work with someone), where name is your full name.

• The preliminary report is due on Friday, November 22, 2019 at 11:59 PM. Submit it via HuskyCT. The PDF must be generated using R Markdown.

Your preliminary report must contain the following elements.

(a) A preliminary exploratory analysis including summary statistics and basic graphs (4 pages max)

(b) Pose scientific questions that are interesting to you and indicate what statistical methods may help answer those questions (1 page)

(c) Include the R code and all outputs.
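The following is one possible sketch of the kind of summary statistics and basic graphs item (a) asks for. It uses simulated data rather than imdb.csv, and the column names rating and duration are assumptions made for illustration; only Action is a name taken from the dataset description.

```r
# Simulated stand-in data; replace with the real imdb.csv columns.
set.seed(3675)
n <- 100
sim <- data.frame(
  rating   = round(runif(n, min = 1, max = 10), 1),   # hypothetical rating column
  duration = round(rnorm(n, mean = 6000, sd = 900)),  # hypothetical duration (seconds)
  Action   = rbinom(n, size = 1, prob = 0.3)          # genre dummy (0/1)
)

# Summary statistics for the numeric variables
print(summary(sim[c("rating", "duration")]))

# Mean rating by genre indicator
rating_by_action <- tapply(sim$rating, sim$Action, mean)
print(rating_by_action)

# Basic graphs: distribution of ratings, and ratings split by genre
hist(sim$rating, main = "Distribution of ratings", xlab = "Rating")
boxplot(rating ~ Action, data = sim,
        names = c("Not Action", "Action"), ylab = "Rating")
```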

3. Report [70 points]

For the report, provide a single file with the format name_3675_report.pdf (or name1_name2_3675_report.pdf), where name is your full name. The PDF must be generated using R Markdown.

• The report must be between 10 and 30 pages long (including the code and the graphs).

• The report is due on Sunday, December 8, 2019 at 11:59 PM. Submit it via HuskyCT.

(a) Include the preliminary report

(b) Include at least one regression method

(c) Include at least one ANOVA analysis

(d) Include at least one classification method
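To illustrate the three kinds of methods listed in (b) through (d), here is a minimal sketch on simulated data. The variables rating, duration_min, genre, and high_rating are hypothetical, and logistic regression is used only as one example of a classification method; other methods covered in the course may be equally appropriate.

```r
# Simulated stand-in data; all variable names here are assumptions.
set.seed(1)
n <- 200
sim <- data.frame(
  duration_min = rnorm(n, mean = 100, sd = 15),
  genre        = factor(sample(c("Action", "Comedy", "Drama"), n, replace = TRUE))
)
sim$rating      <- 5 + 0.01 * sim$duration_min + rnorm(n, sd = 1)
sim$high_rating <- as.integer(sim$rating > median(sim$rating))

# (b) Regression: simple linear regression of rating on duration
fit_lm <- lm(rating ~ duration_min, data = sim)
print(summary(fit_lm))

# (c) ANOVA: one-way analysis of variance of rating across genres
fit_aov <- aov(rating ~ genre, data = sim)
print(summary(fit_aov))

# (d) Classification: logistic regression for a binary outcome
fit_glm <- glm(high_rating ~ duration_min + genre, data = sim, family = binomial)
pred_class <- as.integer(predict(fit_glm, type = "response") > 0.5)
accuracy   <- mean(pred_class == sim$high_rating)  # in-sample accuracy only
print(accuracy)
```

The accuracy computed here is in-sample; a held-out test set or cross-validation would give a more honest estimate.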

For each method,

• Express all statistical models using mathematical formulae, and clearly state the meaning of the notation and the assumptions.
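As an example of the level of detail intended, a simple linear regression model could be written as follows. The choice of response and predictor (rating and duration) is hypothetical and serves only to show how notation and assumptions can be stated.

```latex
\[
  Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
  \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
  \qquad i = 1, \dots, n,
\]
% Notation:
%   Y_i        : rating of movie i (response)
%   x_i        : duration of movie i (predictor)
%   \beta_0    : intercept;  \beta_1 : slope
%   \varepsilon_i : random error for movie i
% Assumptions: linear mean function, independent errors,
%   constant variance \sigma^2, normally distributed errors.
```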

• Insert R code and necessary comments. Your output must contain the R code (do not use the echo=FALSE option).
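One way to make sure the code is shown throughout the document is to set the chunk option globally in a setup chunk. The sketch below assumes the knitr package is installed (it is used by R Markdown); the file name in the comment is only an example following the naming format above.

```r
# Show the R code of every chunk in the knitted document.
# The guard only avoids an error if knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = TRUE)
}

# The PDF can then be produced from the console, for example with:
#   rmarkdown::render("name_3675_report.Rmd", output_format = "pdf_document")
# (requires the rmarkdown package and a LaTeX installation)
```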

• Interpret extensively all outputs and graphs that you include.

4. Important dates

• November 22, 2019: Preliminary report is due

• December 8, 2019: Report is due

