# IPAL Programming in R Week 3

Programming in R - Week 3 Assignment

IPAL - The University of Chicago
Due: Sunday, July 19th at 11:59pm on Canvas
Structure
This assignment will focus on using real-world data to create a presentable, reproducible plot and regression
table that can be updated easily. As we covered in class, R Markdown is ideal for programmatically generating
reports that rely on constantly changing data. We will use this to create a report on the geographic distribution
of crime in Chicago, focusing on thefts. The end goal is to create a map, a regression table, and some short
writing discussing the implications of both.
As before, this problem set is fairly open-ended. It is broken into three sections: Section 1 is worth 20
points, and Sections 2 & 3 are worth 14 points each. Start by creating a new project/folder for this assignment. The
output for this assignment should NOT be a .R file. Instead, it should be two separate files: a .Rmd file with
your raw code and a finalized PDF or HTML document. The latter should include your code (in chunks,
as in the example below) along with its output:
x <- c(1:10,20:30) #Here's a random vector
y <- 8 * x #Multiplying every element of the vector by 8
mean(y) #Finding the mean value of all values in the vector 'y'
## [1] 125.7143
Part of the grade for this assignment will be based on the tidiness and quality of your output file.
This problem set will focus on mapping and analyzing thefts in Chicago. We can gather data about thefts
using the same Chicago Data Portal API that we used for Problem Set 2. Start by altering your API URL to
pull thefts instead of homicides. Be sure to increase the row limit returned by the API, as there are many more thefts
than homicides. Next, complete the following tasks:
1.1. Write a function called download_thefts to download any given year of crime data. The function should
have a single input, year, and should output a dataframe of thefts for that year.
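One possible shape for such a function is sketched below. The endpoint (the portal's "Crimes - 2001 to Present" dataset), the query parameters, the default limit, and the helper name build_thefts_url are all assumptions for illustration; adapt them to the URL you used in Problem Set 2.

```r
# Assumed endpoint: the "Crimes - 2001 to Present" dataset (id ijzp-q8t7).
base_url <- "https://data.cityofchicago.org/resource/ijzp-q8t7.json"

# Hypothetical helper: build the query URL for one year of thefts.
# format() keeps a large limit from being written as "1e+05".
build_thefts_url <- function(year, limit = 100000) {
  paste0(
    base_url,
    "?primary_type=THEFT",
    "&year=", year,
    "&$limit=", format(limit, scientific = FALSE)
  )
}

# Single input (year); returns a dataframe of thefts for that year.
# Nothing is downloaded until the function is called.
download_thefts <- function(year) {
  jsonlite::fromJSON(build_thefts_url(year))
}

build_thefts_url(2019)
```

Keeping the URL construction in its own helper makes it easy to print and inspect the request before downloading anything.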
1.2. Create a vector of years starting in 2016 and ending in 2019. Use your function download_thefts and
a for loop to iterate through each year in the vector and download the data relevant to that year. Use
bind_rows() to combine the data for each year into a final dataframe called thefts. You can also use the
map family of functions to complete this task. NOTE: You may need to drop column 22 from each year’s data
to avoid errors.
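The loop-and-combine pattern might look like the sketch below. It uses a hypothetical stand-in, download_stub(), so that it runs offline; in the assignment you would call download_thefts() instead.

```r
library(dplyr)

years <- 2016:2019

# Hypothetical stand-in for download_thefts() so the sketch runs offline.
download_stub <- function(year) {
  tibble(id = c("a", "b"), year = year)
}

# Pre-allocate a list, fill one slot per year, then stack the pieces.
yearly <- vector("list", length(years))
for (i in seq_along(years)) {
  # With real data, a step such as select(df, -22) may be needed here
  # (see the NOTE above about column 22).
  yearly[[i]] <- download_stub(years[i])
}
thefts_demo <- bind_rows(yearly)

nrow(thefts_demo)
```

Collecting the yearly dataframes in a list and binding once at the end avoids repeatedly copying a growing dataframe inside the loop.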
1.3. Using the thefts dataframe, use the same lubridate functions you used in Problem Set 2 to extract
the year, month, day, week, and hour columns. Additionally, drop any rows that have an NA value for the
latitude or longitude columns.
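As a rough sketch of this step on toy data (the column name date and its timestamp format are assumptions; check them against your own download):

```r
library(dplyr)
library(lubridate)

# Toy rows; the `date` column name and timestamp format are assumed.
toy <- tibble(
  date = c("2019-01-15 13:45:00", "2019-02-03 08:10:00"),
  latitude = c(41.88, NA),
  longitude = c(-87.63, -87.70)
)

toy_clean <- toy %>%
  mutate(
    date = ymd_hms(date),   # parse the timestamp text into a date-time
    year = year(date),
    month = month(date),
    day = day(date),
    week = week(date),
    hour = hour(date)
  ) %>%
  # Keep only rows where both coordinates are present.
  filter(!is.na(latitude), !is.na(longitude))

toy_clean
```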
1.4. Create a new column called classification in the thefts dataset. Using ifelse() or case_when(),
set classification equal to “petty” when the description column indicates a theft of less than \$500, pocket-picking,
or purse-snatching. Set classification equal to “grand” for all other values of description.
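A minimal sketch of the classification logic with case_when() follows. The exact description strings are assumptions; inspect unique(thefts$description) in your own data and adjust the vector accordingly.

```r
library(dplyr)

# Toy descriptions; the exact strings in the portal data are an assumption.
toy <- tibble(
  description = c("$500 AND UNDER", "POCKET-PICKING", "PURSE-SNATCHING",
                  "OVER $500", "RETAIL THEFT")
)

# Hypothetical lookup vector of the values treated as petty theft.
petty_types <- c("$500 AND UNDER", "POCKET-PICKING", "PURSE-SNATCHING")

toy <- toy %>%
  mutate(
    classification = case_when(
      description %in% petty_types ~ "petty",
      TRUE ~ "grand"   # everything else falls through to "grand"
    )
  )

toy
```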
Section 2: Mapping
Question 2.1
Now that we’ve loaded and cleaned our dataset, let’s take a look at the geographic distribution of thefts. We
need to convert the latitude and longitude columns in the data into spatial geometries (points) before
plotting. Use the st_as_sf() function to convert the respective columns into an sf geometry column. Specify
the CRS as 4326 when converting, and use the remove = FALSE argument to keep the original latitude and
longitude columns in the data. You will have a new column in your dataset named geometry if successful.
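The conversion could be sketched on two toy points as follows. Note that coords expects the x column (longitude) first. If your coordinates arrive as text from the API, they may need as.numeric() beforehand; that is an assumption about your download, not something this toy example requires.

```r
library(sf)

# Toy data with numeric coordinates.
toy <- data.frame(
  id = 1:2,
  latitude = c(41.8781, 41.9742),
  longitude = c(-87.6298, -87.9073)
)

# coords takes x (longitude) first, then y (latitude).
toy_sf <- st_as_sf(
  toy,
  coords = c("longitude", "latitude"),
  crs = 4326,
  remove = FALSE   # keep the original latitude/longitude columns
)

names(toy_sf)
```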
Question 2.2
Next, create a filtered version of the thefts dataset that contains only data from the first two months of
2019. Call this dataset thefts_fil.
Using thefts_fil, replicate the plot below using ggplot() and geom_sf().
[Figure: point map titled “Thefts in Chicago (Jan. & Feb. 2020)”, with a “Theft Category” legend (Grand, Petty). Source: City of Chicago Data Portal]
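A starting point for the filter and the point map might look like the sketch below. It runs on a four-row toy dataset, and the styling choices (point size, transparency, theme) are guesses rather than the settings used for the original figure.

```r
library(dplyr)
library(sf)
library(ggplot2)

# Toy stand-in for `thefts`; the real data comes from the earlier steps.
toy <- tibble(
  year = c(2019, 2019, 2019, 2018),
  month = c(1, 2, 3, 1),
  classification = c("petty", "grand", "petty", "grand"),
  latitude = c(41.88, 41.90, 41.85, 41.95),
  longitude = c(-87.63, -87.65, -87.70, -87.68)
) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)

# Keep only January and February of 2019.
thefts_fil_demo <- toy %>%
  filter(year == 2019, month %in% 1:2)

# One point per theft, colored by classification.
p <- ggplot(thefts_fil_demo) +
  geom_sf(aes(color = classification), size = 0.5, alpha = 0.5) +
  labs(
    title = "Thefts in Chicago",
    color = "Theft\nCategory",
    caption = "Source: City of Chicago Data Portal"
  ) +
  theme_void()

p
```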
This plot is still pretty hard to read, and it’s difficult to discern what conclusions we should draw from it. To
make the map clearer, we can aggregate the individual-level data to the Census tract level and examine data
from a longer time period.
Question 2.3
Start by downloading Census tracts for Cook County, IL using tidycensus. We want to get the geometries
for each tract, so be sure to set geometry = TRUE when using get_acs(). We also want to retrieve the total
population (variable code: “B01001_001”) for each tract. NOTE: You should use the 2016 5-year ACS for all
your boundaries and variables in this problem set. Save your Census tract data to a dataframe named cook.
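The request could be written roughly as below. It is wrapped in a hypothetical function, get_cook_tracts(), so that nothing is downloaded until you call it; running it assumes you have already registered a Census API key with tidycensus.

```r
# Hypothetical wrapper; calling it requires a Census API key
# (see tidycensus::census_api_key()) and an internet connection.
get_cook_tracts <- function() {
  tidycensus::get_acs(
    geography = "tract",
    variables = c(population = "B01001_001"),  # total population
    state = "IL",
    county = "Cook",
    year = 2016,
    survey = "acs5",
    geometry = TRUE   # attach tract boundaries as an sf geometry column
  )
}

# cook <- get_cook_tracts()
```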
Our goal is to determine which Census tract each theft occurred in. To do so, we need to perform a point-in-polygon merge. Use st_join() to perform a point-in-polygon merge of the full thefts dataset and cook.
You may have to change the CRS of cook to 4326 before performing the join. You can do this with the
st_transform() function. The lecture script contains a relevant example of the format for a point-in-polygon
merge. Save the result of your point-in-polygon merge to a dataframe named thefts_merged.
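To illustrate the mechanics, here is a small sketch with two made-up rectangular "tracts" and three made-up points. The helper make_box() and the GEOIDs "A" and "B" are hypothetical; the polygons are deliberately created in a different CRS (EPSG:4269) so that the st_transform() step has something to do.

```r
library(sf)

# Hypothetical helper that builds a rectangular polygon from its corners.
make_box <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two toy "tracts" in EPSG:4269, standing in for `cook`.
tracts <- st_sf(
  GEOID = c("A", "B"),
  geometry = st_sfc(
    make_box(-87.8, 41.8, -87.7, 41.9),
    make_box(-87.7, 41.8, -87.6, 41.9),
    crs = 4269
  )
)

# Toy theft points in EPSG:4326, standing in for `thefts`.
pts <- st_as_sf(
  data.frame(
    id = 1:3,
    longitude = c(-87.75, -87.65, -87.50),
    latitude = c(41.85, 41.85, 41.85)
  ),
  coords = c("longitude", "latitude"), crs = 4326, remove = FALSE
)

# Both layers must share a CRS before joining.
tracts_4326 <- st_transform(tracts, 4326)

# Points first, polygons second: each point picks up its tract's columns.
# Points that fall outside every polygon get NA for the tract columns.
merged <- st_join(pts, tracts_4326)
merged$GEOID
```

Putting the points on the left keeps one row per theft, which is the shape needed for counting thefts by tract afterwards.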
Question 2.4
The thefts_merged dataframe should now be a combination of the original thefts data and data from each
theft’s respective Census tract, including the tract’s GEOID. We can aggregate by this GEOID column to
get various summary statistics for each tract. Before aggregating, however, get rid of the geometry column.
The geometry column contains point geometries that are no longer needed after merging, and getting rid of
it will speed up future operations. Set the geometry column to NULL using the standard assignment operator
(<-) or st_set_geometry(). Next, use group_by() and summarize() to get the average number of thefts
per year for each Census tract. Assign the result to a new dataframe called thefts_agg.
Then join thefts_agg back to the cook dataframe using a simple left_join(). Drop any rows with NA
values from your resulting dataset. In the joined data, create a new variable called thefts_pc equal to the
average number of thefts per year per capita for each tract. Finally, replicate the following map to the best of your ability.
[Figure: tract-level map titled “Thefts in Chicago (2016 − 2020)”, with an “Avg. Thefts Per Capita Per Year” legend (ticks at 0.1 and 0.2). Source: City of Chicago Data Portal]
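The aggregation and per-capita calculation can be sketched on toy tables as follows. The column names avg_thefts and estimate are assumptions (estimate is the default value column returned by get_acs()), and dividing the total count by the number of years is one of several reasonable ways to define the yearly average.

```r
library(dplyr)

# Toy stand-in for thefts_merged after its geometry column has been dropped
# (for example with st_set_geometry(thefts_merged, NULL)).
toy_merged <- tibble(
  GEOID = c("A", "A", "A", "B", "B", "A"),
  year  = c(2016, 2016, 2017, 2016, 2017, 2017)
)

# Number of years covered by the data.
n_years <- n_distinct(toy_merged$year)

# Average thefts per year in each tract: total count / number of years.
thefts_agg_demo <- toy_merged %>%
  group_by(GEOID) %>%
  summarize(avg_thefts = n() / n_years)

# Toy stand-in for cook, with tract population in `estimate`.
toy_cook <- tibble(GEOID = c("A", "B", "C"), estimate = c(1000, 500, 800))

joined <- toy_cook %>%
  left_join(thefts_agg_demo, by = "GEOID") %>%
  filter(!is.na(avg_thefts)) %>%               # tract "C" has no thefts
  mutate(thefts_pc = avg_thefts / estimate)

joined
```

Starting the join from the tract table keeps the polygon geometry attached in the real data, which is what the map needs.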
Briefly answer the following questions in the text of your R Markdown document.
• Why do you think thefts per capita are higher in the Loop and on the northwest side (the second- and
first-most red areas, respectively)?
• What changes could we make to the map to further clarify the spatial distribution of thefts?
Section 3: Regression Analysis
Here, let’s try to formalize/test some of your answers to the questions above by running a regression.
Use the tidycensus package to retrieve median household income (“B19013_001”), percent white
(“B02001_002”), percent below the poverty line (“B17007_002”), and percent with a bachelor’s degree
(“B23006_023”) for each Census tract in Cook County. You’ll need to calculate the percentage values by
dividing the number of people for each value by the total tract population (for example, 50 white people /
200 total people × 100 = 25% white). Also note that the dependent variable is now “Average Thefts per
1000 people per Year” in order to make the regression table more concise.
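The percentage and per-1000 calculations amount to a few mutate() calls. The sketch below uses invented counts and hypothetical column names; the first row mirrors the 50 / 200 example above.

```r
library(dplyr)

# Toy tract-level counts in wide format; all column names are hypothetical.
toy_acs <- tibble(
  GEOID = c("A", "B"),
  total_pop = c(200, 1000),
  white = c(50, 400),
  poverty = c(20, 250),
  bachelors = c(30, 100),
  avg_thefts = c(2, 5)
)

toy_acs <- toy_acs %>%
  mutate(
    pct_white = white / total_pop * 100,
    pct_poverty = poverty / total_pop * 100,
    pct_bachelors = bachelors / total_pop * 100,
    # Rescale the dependent variable to thefts per 1000 residents.
    thefts_per_1000 = avg_thefts / total_pop * 1000
  )

toy_acs
```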
Question 3.1
Do your best to reproduce the regression results table below (Table 1) using the stargazer package/function.
Provide a brief interpretation of the results and comment on whether the coefficients seem plausible.
Table 1: Regression Results

| | Average Thefts per 1000 per Year |
| --- | --- |
| Population | 0.001∗∗ (0.000) |
| Median Household Income (1000s) | 0.141∗∗∗ (0.043) |
| Pct White | 0.276∗∗∗ (0.031) |
| Pct Poverty | 0.513∗∗∗ (0.177) |
| Pct Bachelor’s | 0.314∗∗∗ (0.072) |
| Constant | 17.534∗∗∗ (2.778) |
| Observations | 849 |
| R² | 0.167 |

Notes: ∗∗∗Significant at the 1 percent level.
∗∗Significant at the 5 percent level.
∗Significant at the 10 percent level.
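For the mechanics of fitting the model and formatting it with stargazer, a sketch on simulated data is given below. The data, the column names, and the choice of which statistics to omit are all assumptions; the numbers it produces have nothing to do with Table 1.

```r
library(stargazer)

set.seed(1)
n <- 100

# Simulated tract-level data; every column name here is hypothetical.
sim <- data.frame(
  population = round(runif(n, 1000, 8000)),
  med_income_1000s = runif(n, 20, 150),
  pct_white = runif(n, 0, 100),
  pct_poverty = runif(n, 0, 50),
  pct_bachelors = runif(n, 0, 60)
)
sim$thefts_per_1000 <- 10 + 0.1 * sim$pct_poverty + rnorm(n)

# Ordinary least squares with the five covariates from the table.
fit <- lm(
  thefts_per_1000 ~ population + med_income_1000s + pct_white +
    pct_poverty + pct_bachelors,
  data = sim
)

stargazer(
  fit,
  type = "text",  # use "latex" or "html" to match the knitted output format
  title = "Regression Results",
  dep.var.labels = "Average Thefts per 1000 per Year",
  covariate.labels = c("Population", "Median Household Income (1000s)",
                       "Pct White", "Pct Poverty", "Pct Bachelor's"),
  omit.stat = c("f", "ser", "adj.rsq")
)
```

When knitting to PDF or HTML, the chunk containing the stargazer() call typically needs the chunk option results='asis' so the table markup is rendered rather than printed verbatim.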
Question 3.2
Choose an additional variable from the Census that you think is relevant to include in your regression. You
can download a list of available variables and their associated codes by using the following command and
then using RStudio’s “filter” function to search for variables.
v18 <- load_variables(2018, "acs5", cache = TRUE)
view(v18)