EDA: Chicago red light camera violations

Chicago red light camera violations

In this post, I walk through a simple exploratory data analysis of red light camera violations in Chicago.

Data import

Data downloaded from the Chicago Data Portal.

library(tidyverse)
library(modelr)
(red_light_raw <- read_csv("~/Tresors/datasets/2019-04-28_chi-red-light.csv"))
## # A tibble: 462,769 x 15
##    INTERSECTION `CAMERA ID` ADDRESS `VIOLATION DATE` VIOLATIONS
##    <chr>              <dbl> <chr>   <chr>                 <dbl>
##  1 111TH AND H…        2424 800 W … 04/14/2019                2
##  2 111TH AND H…        2422 11100 … 04/14/2019                6
##  3 119TH AND H…        2402 11900 … 04/14/2019                4
##  4 119TH AND H…        2404 800 W … 04/14/2019                6
##  5 31ST ST AND…        2121 3100 S… 04/14/2019                4
##  6 31ST ST AND…        2123 400 E … 04/14/2019                2
##  7 35TH AND WE…        2092 3500 S… 04/14/2019                2
##  8 35TH AND WE…        2091 3500 S… 04/14/2019                2
##  9 55TH AND KE…        2162 5500 S… 04/14/2019                2
## 10 55TH AND WE…        2213 2400 W… 04/14/2019                2
## # … with 462,759 more rows, and 10 more variables: `X COORDINATE` <dbl>,
## #   `Y COORDINATE` <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>, LOCATION <chr>,
## #   `Historical Wards 2003-2015` <dbl>, `Zip Codes` <dbl>, `Community
## #   Areas` <dbl>, `Census Tracts` <dbl>, Wards <dbl>

These data reflect the daily volume of violations created by the City of Chicago Red Light Program for each camera since July 1, 2014.

  • INTERSECTION = Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection
  • CAMERA ID = A unique ID for each physical camera at an intersection, which may contain more than one camera
  • ADDRESS = The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera
  • VIOLATION DATE = The date of when the violations occurred. NOTE: The citation may be issued on a different date
  • VIOLATIONS = Number of violations for each camera on a particular day
  • X COORDINATE = The X Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East
  • Y COORDINATE = The Y Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East
  • LATITUDE = The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84
  • LONGITUDE = The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84
  • LOCATION = The coordinates of the camera(s) based on the LATITUDE and LONGITUDE columns. Geocoded using the WGS84

Data clean

(red_light <- red_light_raw %>% 
  janitor::clean_names() %>% 
  separate(violation_date, c("month", "day", "year"), "/") %>% 
  mutate_at(vars(month, day, year),
            ~ as.numeric(.)))
## # A tibble: 462,769 x 17
##    intersection camera_id address month   day  year violations x_coordinate
##    <chr>            <dbl> <chr>   <dbl> <dbl> <dbl>      <dbl>        <dbl>
##  1 111TH AND H…      2424 800 W …     4    14  2019          2     1172945.
##  2 111TH AND H…      2422 11100 …     4    14  2019          6     1172924.
##  3 119TH AND H…      2402 11900 …     4    14  2019          4     1173095.
##  4 119TH AND H…      2404 800 W …     4    14  2019          6     1173111.
##  5 31ST ST AND…      2121 3100 S…     4    14  2019          4          NA 
##  6 31ST ST AND…      2123 400 E …     4    14  2019          2     1179438.
##  7 35TH AND WE…      2092 3500 S…     4    14  2019          2     1160895.
##  8 35TH AND WE…      2091 3500 S…     4    14  2019          2     1160895.
##  9 55TH AND KE…      2162 5500 S…     4    14  2019          2     1155946.
## 10 55TH AND WE…      2213 2400 W…     4    14  2019          2     1161277.
## # … with 462,759 more rows, and 9 more variables: y_coordinate <dbl>,
## #   latitude <dbl>, longitude <dbl>, location <chr>,
## #   historical_wards_2003_2015 <dbl>, zip_codes <dbl>,
## #   community_areas <dbl>, census_tracts <dbl>, wards <dbl>

Simple EDA

I’m going to look at the number of red light violations by intersection across time.

ggplot(red_light, aes(year + month / 12, violations)) +
  geom_line(aes(group = intersection))

Turns out, this graph isn’t very useful. It’s hard to get an idea of what’s really going on here because there is so much lumped at bottom. For now, I’m going to focus on the more popular intersections (i.e., those with more violations).

What’s the pattern?

lsd_belmont <- red_light_popular %>% 
  filter(intersection == "LAKE SHORE DR AND BELMONT")

ggplot(lsd_belmont, aes(date, violations)) +
  geom_point(aes(group = intersection)) +
  geom_smooth(se = FALSE) +
  geom_jitter()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Now, let’s see how the monthly patterns change by year for the same intersection.

ggplot(lsd_belmont, aes(month, violations)) +
  geom_point(aes(group = year)) +
  geom_jitter() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Both of these graphs supports the seasonal trend idea–there is a bump in red light violations every summer, between May and August.

Now, I have a few questions:

  • Are these patterns the same for all intersections?
  • What’s driving these peaks? More drivers/tourists on the road in summer months?
  • What’s producing the gap between the high violations (above 25 violations) and low violations (below 25 violations)?
  • Is this pattern more pronounced in certain parts of the city (e.g., Lake Shore and Belmont is by a popular highway)?
  • What happened in the summer/fall of 2016?

Can we remove this pattern?

belmont_mod <- lm(log(violations) ~ factor(month), data = lsd_belmont)

lsd_belmont %>% 
  add_predictions(belmont_mod) %>% 
  ggplot(aes(date, pred)) +
  geom_line()

This shows the model has captured the seasonal pattern, but plotting the residuals will probably be more useful.

lsd_belmont %>% 
  add_residuals(belmont_mod) %>% 
  ggplot(aes(date, resid)) +
  geom_hline(yintercept = 0, color = "white", size = 3) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

It looks like the seasonal model does a good job of explaining the data, especially for 2017 on. And, by removing the strong monthly pattern, we can see the long-term trends much more clearly. There’s a steady increase from the beginning of the data to about 2016 when the number of red light violations peak. They then take a slight dip but remain relatively stable through the present.

More questions:

  • What’s driving this trend?
  • What happened around 2016?
  • Is this pattern the same for all intersections?

All intersections

Now, I want to extend this model to all of the intersections in the sample.

by_intersection <- red_light_popular %>% 
  group_by(intersection) %>% 
  nest()

intersection_model <- function(df) {
  lm(log10(violations) ~ factor(month), data = df)
}

partioned <- by_intersection %>% 
  mutate(
    model = map(data, intersection_model),
    resids = map2(data, model, add_residuals)
  ) %>% 
  unnest(resids)

ggplot(partioned, aes(date, resid)) +
  geom_line(aes(group = intersection), alpha = 1/5) +
  stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.25), color = "blue") +
  stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.75), color = "blue") +
  geom_smooth(se = FALSE, linetype = "dashed", color = "red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

After removing the seasonal pattern, the long-term trend becomes much more stable. However, some unexplained patterns remain.

Further questions:

  • What drove the increase in violations between 2014 and mid-2016?
  • What happened at the end of 2016/beginning of 2017?
  • Do violations occur systematically throughout the city, or are certain locations more likely to lead to higher violations?
Avatar
Jeremy R. Winget
Graduate Research Assistant & Lecturer

Related

comments powered by Disqus