EDA: Chicago red light camera violations

An exploratory data analysis of red light traffic camera violations in Chicago, IL.

Published

April 29, 2019

Chicago red light camera violations

In this post, I walk through a simple exploratory data analysis of red light camera violations in Chicago.

Data import

Data downloaded from the Chicago Data Portal.

Code
library(tidyverse)
library(modelr)
(red_light_raw <- read_csv("2019-04-28_chi-red-light.csv"))
## # A tibble: 466,107 × 10
##    INTERSECTION   `CAMERA ID` ADDRESS `VIOLATION DATE` VIOLATIONS `X COORDINATE`
##    <chr>                <dbl> <chr>   <chr>                 <dbl>          <dbl>
##  1 IRVING PARK A…        2763 4700 W… 04/09/2015                4             NA
##  2 VAN BUREN AND…        2054 2400 W… 04/14/2015                5             NA
##  3 115TH AND HAL…        2552 11500 … 04/08/2015                5             NA
##  4 IRVING PARK A…        2764 4700 W… 04/19/2015                4             NA
##  5 ELSTON AND IR…        1503 3700 W… 04/23/2015                3             NA
##  6 31ST AND CALI…        2064 2800 W… 09/14/2014                3             NA
##  7 31ST AND CALI…        2064 2800 W… 12/16/2014                1             NA
##  8 31ST AND CALI…        2064 2800 W… 01/30/2015                4             NA
##  9 115TH AND HAL…        2552 11500 … 03/28/2015               14             NA
## 10 WENTWORTH AND…        2261 5500 S… 04/06/2015               11             NA
## # ℹ 466,097 more rows
## # ℹ 4 more variables: `Y COORDINATE` <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LOCATION <chr>

These data reflect the daily volume of violations created by the City of Chicago Red Light Program for each camera since July 1, 2014.

  • INTERSECTION = Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection
  • CAMERA ID = A unique ID for each physical camera at an intersection, which may contain more than one camera
  • ADDRESS = The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera
  • VIOLATION DATE = The date of when the violations occurred. NOTE: The citation may be issued on a different date
  • VIOLATIONS = Number of violations for each camera on a particular day
  • X COORDINATE = The X Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East
  • Y COORDINATE = The Y Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East
  • LATITUDE = The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84
  • LONGITUDE = The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84
  • LOCATION = The coordinates of the camera(s) based on the LATITUDE and LONGITUDE columns. Geocoded using the WGS84

Data clean

Code
(red_light <- red_light_raw %>%
  janitor::clean_names() %>%
  separate(violation_date, c("month", "day", "year"), "/") %>%
  mutate_at(vars(month, day, year),
            ~ as.numeric(.)))
## # A tibble: 466,107 × 12
##    intersection      camera_id address month   day  year violations x_coordinate
##    <chr>                 <dbl> <chr>   <dbl> <dbl> <dbl>      <dbl>        <dbl>
##  1 IRVING PARK AND …      2763 4700 W…     4     9  2015          4           NA
##  2 VAN BUREN AND WE…      2054 2400 W…     4    14  2015          5           NA
##  3 115TH AND HALSTED      2552 11500 …     4     8  2015          5           NA
##  4 IRVING PARK AND …      2764 4700 W…     4    19  2015          4           NA
##  5 ELSTON AND IRVIN…      1503 3700 W…     4    23  2015          3           NA
##  6 31ST AND CALIFOR…      2064 2800 W…     9    14  2014          3           NA
##  7 31ST AND CALIFOR…      2064 2800 W…    12    16  2014          1           NA
##  8 31ST AND CALIFOR…      2064 2800 W…     1    30  2015          4           NA
##  9 115TH AND HALSTED      2552 11500 …     3    28  2015         14           NA
## 10 WENTWORTH AND GA…      2261 5500 S…     4     6  2015         11           NA
## # ℹ 466,097 more rows
## # ℹ 4 more variables: y_coordinate <dbl>, latitude <dbl>, longitude <dbl>,
## #   location <chr>

Simple EDA

I’m going to look at the number of red light violations by intersection across time.

Code
ggplot(red_light, aes(year + month / 12, violations)) +
  geom_line(aes(group = intersection))

Turns out, this graph isn’t very useful. It’s hard to get an idea of what’s really going on here because there is so much lumped at bottom. For now, I’m going to focus on the more popular intersections (i.e., those with more violations).

What’s the pattern?

Code
lsd_belmont <- red_light_popular %>%
  filter(intersection == "LAKE SHORE DR AND BELMONT")

ggplot(lsd_belmont, aes(date, violations)) +
  geom_point(aes(group = intersection)) +
  geom_smooth(se = FALSE) +
  geom_jitter()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Now, let’s see how the monthly patterns change by year for the same intersection.

Code
ggplot(lsd_belmont, aes(month, violations)) +
  geom_point(aes(group = year)) +
  geom_jitter() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Both of these graphs supports the seasonal trend idea–there is a bump in red light violations every summer, between May and August.

Now, I have a few questions:

  • Are these patterns the same for all intersections?
  • What’s driving these peaks? More drivers/tourists on the road in summer months?
  • What’s producing the gap between the high violations (above 25 violations) and low violations (below 25 violations)?
  • Is this pattern more pronounced in certain parts of the city (e.g., Lake Shore and Belmont is by a popular highway)?
  • What happened in the summer/fall of 2016?

Can we remove this pattern?

Code
belmont_mod <- lm(log(violations) ~ factor(month), data = lsd_belmont)

lsd_belmont %>%
  add_predictions(belmont_mod) %>%
  ggplot(aes(date, pred)) +
  geom_line()

This shows the model has captured the seasonal pattern, but plotting the residuals will probably be more useful.

Code
lsd_belmont %>%
  add_residuals(belmont_mod) %>%
  ggplot(aes(date, resid)) +
  geom_hline(yintercept = 0, color = "white", size = 3) +
  geom_point() +
  geom_smooth(se = FALSE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

It looks like the seasonal model does a good job of explaining the data, especially for 2017 on. And, by removing the strong monthly pattern, we can see the long-term trends much more clearly. There’s a steady increase from the beginning of the data to about 2016 when the number of red light violations peak. They then take a slight dip but remain relatively stable through the present.

More questions:

  • What’s driving this trend?
  • What happened around 2016?
  • Is this pattern the same for all intersections?

All intersections

Now, I want to extend this model to all of the intersections in the sample.

Code
by_intersection <- red_light_popular %>%
  group_by(intersection) %>%
  nest()

intersection_model <- function(df) {
  lm(log10(violations) ~ factor(month), data = df)
}

partioned <- by_intersection %>%
  mutate(
    model = map(data, intersection_model),
    resids = map2(data, model, add_residuals)
  ) %>%
  unnest(resids)

ggplot(partioned, aes(date, resid)) +
  geom_line(aes(group = intersection), alpha = 1/5) +
  stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.25), color = "blue") +
  stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.75), color = "blue") +
  geom_smooth(se = FALSE, linetype = "dashed", color = "red")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

After removing the seasonal pattern, the long-term trend becomes much more stable. However, some unexplained patterns remain.

Further questions:

  • What drove the increase in violations between 2014 and mid-2016?
  • What happened at the end of 2016/beginning of 2017?
  • Do violations occur systematically throughout the city, or are certain locations more likely to lead to higher violations?

Citation

BibTeX citation:
@online{2019,
  author = {},
  title = {EDA: {Chicago} Red Light Camera Violations},
  date = {2019-04-29},
  url = {https://www.jrwinget.com/blog/2019-04-29-chicago-red-light/},
  langid = {en}
}
For attribution, please cite this work as:
“EDA: Chicago Red Light Camera Violations.” 2019. April 29, 2019. https://www.jrwinget.com/blog/2019-04-29-chicago-red-light/.