# Chicago red light camera violations

In this post, I walk through a simple exploratory data analysis of red light camera violations in Chicago.

## Data import

Data downloaded from the Chicago Data Portal.

```
library(tidyverse)
library(modelr)
(red_light_raw <- read_csv("~/Tresors/datasets/2019-04-28_chi-red-light.csv"))
```

```
## # A tibble: 462,769 x 15
## INTERSECTION `CAMERA ID` ADDRESS `VIOLATION DATE` VIOLATIONS
## <chr> <dbl> <chr> <chr> <dbl>
## 1 111TH AND H… 2424 800 W … 04/14/2019 2
## 2 111TH AND H… 2422 11100 … 04/14/2019 6
## 3 119TH AND H… 2402 11900 … 04/14/2019 4
## 4 119TH AND H… 2404 800 W … 04/14/2019 6
## 5 31ST ST AND… 2121 3100 S… 04/14/2019 4
## 6 31ST ST AND… 2123 400 E … 04/14/2019 2
## 7 35TH AND WE… 2092 3500 S… 04/14/2019 2
## 8 35TH AND WE… 2091 3500 S… 04/14/2019 2
## 9 55TH AND KE… 2162 5500 S… 04/14/2019 2
## 10 55TH AND WE… 2213 2400 W… 04/14/2019 2
## # … with 462,759 more rows, and 10 more variables: `X COORDINATE` <dbl>,
## # `Y COORDINATE` <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>, LOCATION <chr>,
## # `Historical Wards 2003-2015` <dbl>, `Zip Codes` <dbl>, `Community
## # Areas` <dbl>, `Census Tracts` <dbl>, Wards <dbl>
```

These data reflect the daily volume of violations created by the City of Chicago Red Light Program for each camera since July 1, 2014.

**INTERSECTION**= Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection**CAMERA ID**= A unique ID for each physical camera at an intersection, which may contain more than one camera**ADDRESS**= The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera**VIOLATION DATE**= The date of when the violations occurred. NOTE: The citation may be issued on a different date**VIOLATIONS**= Number of violations for each camera on a particular day**X COORDINATE**= The X Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East**Y COORDINATE**= The Y Coordinate, measured in feet, of the location of the camera. Geocoded using Illinois State Plane East**LATITUDE**= The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84**LONGITUDE**= The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84**LOCATION**= The coordinates of the camera(s) based on the LATITUDE and LONGITUDE columns. Geocoded using the WGS84

## Data clean

```
(red_light <- red_light_raw %>%
janitor::clean_names() %>%
separate(violation_date, c("month", "day", "year"), "/") %>%
mutate_at(vars(month, day, year),
~ as.numeric(.)))
```

```
## # A tibble: 462,769 x 17
## intersection camera_id address month day year violations x_coordinate
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 111TH AND H… 2424 800 W … 4 14 2019 2 1172945.
## 2 111TH AND H… 2422 11100 … 4 14 2019 6 1172924.
## 3 119TH AND H… 2402 11900 … 4 14 2019 4 1173095.
## 4 119TH AND H… 2404 800 W … 4 14 2019 6 1173111.
## 5 31ST ST AND… 2121 3100 S… 4 14 2019 4 NA
## 6 31ST ST AND… 2123 400 E … 4 14 2019 2 1179438.
## 7 35TH AND WE… 2092 3500 S… 4 14 2019 2 1160895.
## 8 35TH AND WE… 2091 3500 S… 4 14 2019 2 1160895.
## 9 55TH AND KE… 2162 5500 S… 4 14 2019 2 1155946.
## 10 55TH AND WE… 2213 2400 W… 4 14 2019 2 1161277.
## # … with 462,759 more rows, and 9 more variables: y_coordinate <dbl>,
## # latitude <dbl>, longitude <dbl>, location <chr>,
## # historical_wards_2003_2015 <dbl>, zip_codes <dbl>,
## # community_areas <dbl>, census_tracts <dbl>, wards <dbl>
```

## Simple EDA

I’m going to look at the number of red light violations by intersection across time.

```
ggplot(red_light, aes(year + month / 12, violations)) +
geom_line(aes(group = intersection))
```

Turns out, this graph isn’t very useful. It’s hard to get an idea of what’s really going on here because there is so much lumped at bottom. For now, I’m going to focus on the more popular intersections (i.e., those with more violations).

## Focus on popular intersections

These results might be a bit misleading–maybe popular intersections are fundamentally different (e.g., more dangerous, more lucrative). But, it will at least be a good place to start. Since there are 173 intersections in this dataset, I’m going to arbitrarily select all intersections that lead to greater than 5.5 red light violations on average.

```
(intersections <- red_light %>%
group_by(intersection) %>%
summarize(avg = mean(violations)) %>%
arrange(desc(avg)))
```

```
## # A tibble: 183 x 2
## intersection avg
## <chr> <dbl>
## 1 CICERO AND I55 33.7
## 2 LAKE AND UPPER WACKER 31.7
## 3 LAKE SHORE DR AND BELMONT 26.8
## 4 VAN BUREN AND WESTERN 20.8
## 5 LAFAYETTE AND 87TH 18.6
## 6 STATE AND 79TH 16.7
## 7 STONEY ISLAND AND 76TH 15.9
## 8 LINCOLN AND MCCORMICK 15.2
## 9 WENTWORTH AND GARFIELD 14.7
## 10 ARCHER AND CICERO 14.4
## # … with 173 more rows
```

```
(red_light_popular <- red_light %>%
semi_join(filter(intersections, avg > 5.5)) %>%
mutate(date = year + (month - 1) / 12))
```

`## Joining, by = "intersection"`

```
## # A tibble: 134,776 x 18
## intersection camera_id address month day year violations x_coordinate
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31ST ST AND… 2121 3100 S… 4 14 2019 4 NA
## 2 31ST ST AND… 2123 400 E … 4 14 2019 2 1179438.
## 3 55TH AND WE… 2213 2400 W… 4 14 2019 2 1161277.
## 4 63RD AND ST… 2714 1 E 63… 4 14 2019 9 1177368.
## 5 63RD AND ST… 2712 6300 S… 4 14 2019 1 1177304.
## 6 75TH AND ST… 2621 7500 S… 4 14 2019 3 1177539.
## 7 75TH AND ST… 2624 1 E 75… 4 14 2019 10 1177602.
## 8 87TH AND VI… 2413 848 W … 4 14 2019 5 1172003.
## 9 99TH AND HA… 2433 800 W … 4 14 2019 4 1172711.
## 10 99TH AND HA… 2431 9900 S… 4 14 2019 4 1172690.
## # … with 134,766 more rows, and 10 more variables: y_coordinate <dbl>,
## # latitude <dbl>, longitude <dbl>, location <chr>,
## # historical_wards_2003_2015 <dbl>, zip_codes <dbl>,
## # community_areas <dbl>, census_tracts <dbl>, wards <dbl>, date <dbl>
```

Now, I replot the initial graph.

```
ggplot(red_light_popular, aes(date, violations)) +
geom_line(aes(group = intersection))
```

Decreasing the sample reduced the number of intersections that were lumped at the bottom, but there are still a lot of data there. Let’s try adding some transparency and a log10 transformation.

```
ggplot(red_light_popular, aes(date, violations)) +
geom_line(aes(group = intersection), alpha = 1/5) +
scale_y_log10() +
geom_smooth(se = FALSE)
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

The transformation does a fairly good job of shrinking the high violation intersections down and scaling the low violation intersections up. Now, it looks like there might be a reoccurring pattern occurring within a year. So, I’m wondering if there is a seasonal trend. To look at this, let’s focus on a single intersection for the moment: Lake Shore Dr. and Belmont.

## What’s the pattern?

```
lsd_belmont <- red_light_popular %>%
filter(intersection == "LAKE SHORE DR AND BELMONT")
ggplot(lsd_belmont, aes(date, violations)) +
geom_point(aes(group = intersection)) +
geom_smooth(se = FALSE) +
geom_jitter()
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

Now, let’s see how the monthly patterns change by year for the same intersection.

```
ggplot(lsd_belmont, aes(month, violations)) +
geom_point(aes(group = year)) +
geom_jitter() +
geom_smooth(se = FALSE)
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

Both of these graphs supports the seasonal trend idea–there is a bump in red light violations every summer, between May and August.

Now, I have a few questions:

- Are these patterns the same for all intersections?
- What’s driving these peaks? More drivers/tourists on the road in summer months?
- What’s producing the gap between the high violations (above 25 violations) and low violations (below 25 violations)?
- Is this pattern more pronounced in certain parts of the city (e.g., Lake Shore and Belmont is by a popular highway)?
- What happened in the summer/fall of 2016?

## Can we remove this pattern?

```
belmont_mod <- lm(log(violations) ~ factor(month), data = lsd_belmont)
lsd_belmont %>%
add_predictions(belmont_mod) %>%
ggplot(aes(date, pred)) +
geom_line()
```

This shows the model has captured the seasonal pattern, but plotting the residuals will probably be more useful.

```
lsd_belmont %>%
add_residuals(belmont_mod) %>%
ggplot(aes(date, resid)) +
geom_hline(yintercept = 0, color = "white", size = 3) +
geom_point() +
geom_smooth(se = FALSE)
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

It looks like the seasonal model does a good job of explaining the data, especially for 2017 on. And, by removing the strong monthly pattern, we can see the long-term trends much more clearly. There’s a steady increase from the beginning of the data to about 2016 when the number of red light violations peak. They then take a slight dip but remain relatively stable through the present.

More questions:

- What’s driving this trend?
- What happened around 2016?
- Is this pattern the same for all intersections?

## All intersections

Now, I want to extend this model to all of the intersections in the sample.

```
by_intersection <- red_light_popular %>%
group_by(intersection) %>%
nest()
intersection_model <- function(df) {
lm(log10(violations) ~ factor(month), data = df)
}
partioned <- by_intersection %>%
mutate(
model = map(data, intersection_model),
resids = map2(data, model, add_residuals)
) %>%
unnest(resids)
ggplot(partioned, aes(date, resid)) +
geom_line(aes(group = intersection), alpha = 1/5) +
stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.25), color = "blue") +
stat_summary(geom = "line", fun.y = function(x) quantile(x, 0.75), color = "blue") +
geom_smooth(se = FALSE, linetype = "dashed", color = "red")
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

After removing the seasonal pattern, the long-term trend becomes much more stable. However, some unexplained patterns remain.

Further questions:

- What drove the increase in violations between 2014 and mid-2016?
- What happened at the end of 2016/beginning of 2017?
- Do violations occur systematically throughout the city, or are certain locations more likely to lead to higher violations?