First TidyTuesday submission

It’s been quite some time since I’ve written here, so I thought I would use one my of 2019 #rstats goals as an excuse to brush off the dust.

In this post, I write about my first #tidytuesday submission of the Economist’s “TV’s golden age is real” data set (original #tidytuesday code here). I also make a few improvements to some of the graphs and add tables with the gt package.

Special thanks to Isabella Ghement for providing a few tips on how to improve the original graphs!

First, load the required packages and data:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggpmisc)
## For news about 'ggpmisc', please, see https://www.r4photobiology.info/
## For on-line documentation see https://docs.r4photobiology.info/ggpmisc/
library(ggrepel)
library(gt)
tv_rating <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-08/IMDb_Economist_tv_ratings.csv")
## Parsed with column specification:
## cols(
##   titleId = col_character(),
##   seasonNumber = col_double(),
##   title = col_character(),
##   date = col_date(format = ""),
##   av_rating = col_double(),
##   share = col_double(),
##   genres = col_character()
## )

Which years had the highest ratings?

tv_rating %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarize(
    n = n(),
    avg_rating = mean(av_rating)
  ) %>%
  filter(n > 25) %>%
  arrange(desc(avg_rating)) %>%
  ggplot() +
  aes(year, avg_rating) +
  geom_point() +
  geom_smooth(
    formula = y ~ x,
    method = "lm",
    se = FALSE,
    color = "red"
  ) +
  labs(
    title = "Which years had the highest ratings?",
    x = "Year",
    y = "Average rating"
  ) +
  stat_poly_eq(aes(label = paste("atop(", ..eq.label.., ",", ..adj.rr.label.., ")")),
    formula = y ~ x, color = "red", parse = TRUE
  ) +
  theme_light()

Looks like the newer the TV drama, the more likely it was to have a higher rating. Maybe some of this is variance is due to shows with multiple seasons. Let’s see how this changes when looking at individual shows and their respective run lengths.

Which titles had the highest ratings, and how long did they run?

titles <- tv_rating %>%
  group_by(title) %>%
  summarize(
    n = n(),
    first_yr = min(year(date)),
    last_yr = max(year(date)),
    num_seasons = max(seasonNumber),
    yrs_aired = (max(year(date) - min(year(date)))),
    avg_rating = mean(av_rating)
  ) %>%
  filter(n > 10) %>% # not enough cases to filter by 25 ratings per title
  arrange(desc(avg_rating))

# highest rated titles' run lengths
titles %>%
  mutate(title = fct_reorder(title, yrs_aired)) %>%
  ggplot() +
  aes(title, yrs_aired, fill = title) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Most popular series' run lengths",
    x = "TV Series Title",
    y = "Number of years aired"
  ) +
  theme_light() +
  theme(legend.position = "none")

# highest rated titles rating over time
titles %>%
  ggplot() +
  aes(yrs_aired, avg_rating) +
  geom_point() +
  geom_smooth(
    formula = y ~ x,
    method = "lm",
    se = FALSE,
    color = "red"
  ) +
  labs(
    title = "Most popular series' ratings over time",
    x = "Number of years ran",
    y = "Average rating"
  ) +
  stat_poly_eq(aes(label = paste("atop(", ..eq.label.., ",", ..adj.rr.label.., ")")),
    formula = y ~ x, color = "red", parse = TRUE
  ) +
  theme_light()

I haven’t seen all of these shows, but for the most part, their titles seem to describe suspenseful dramas (e.g., mystery, crime, maybe even thriller/horror). However, King of the Hill doesn’t really fit this description, so let’s take a look at the genre variable. Keeping in mind that all of these shows are dramas, I’m conceptualizing these as sub-genres of drama.

Is there a difference in ratings between comedies and tragedies?

genre_split %>%
  mutate(
    com_trag = case_when(
      genres == "Comedy" |
        genres == "Animation" ~ "comedy",
      genres == "Crime" |
        genres == "Horror" |
        genres == "Thriller" ~ "tragedy"
    )
  ) %>%
  filter(!is.na(com_trag)) %>%
  group_by(com_trag) %>%
  top_ratings() %>%
  mutate(com_trag = str_to_title(com_trag)) %>%
  gt() %>%
  tab_header(title = "Mean/median differences between comedies and tragedies") %>%
  fmt_number(
    columns = vars(avg_rating, med_rating),
    decimals = 3
  ) %>%
  cols_label(
    com_trag = "Drama type",
    n = "Number of responses",
    avg_rating = "Average rating",
    med_rating = "Median rating"
  )
Mean/median differences between comedies and tragedies
Drama type Number of responses Average rating Median rating
Tragedy 1106 8.110 8.168
Comedy 552 8.038 8.071

So, there is a difference, people tend to rate tragedies higher than comedies, but in the grand scheme of things, this difference quite small. The average distance between comedies and tragedies is only 0.07, and the median difference is 0.1. Thus, it seems there’s not much difference in viewer ratings among sub-genres, at least not in our sample. But, this isn’t actually that surprising: Our sample was already narrowed down to TV drama titles. Since all of the titles share this common characteristic, what we’re probably seeing is the consistency of viewers to rate TV dramas in a similar fashion. In other words, people tend to rate all TV dramas similarly, regardless of the story line/sub-genre.

Since sub-genres didn’t bare much useful information, let’s take a look at the actual titles within the dataset. All of the top-rated shows aired for multiple seasons, but I doubt every show that aired multiple seasons was popular. In fact, some earlier analyses showed a decline in ratings over time. So, let’s see what the data say.

How do viewer ratings changes over time by TV show title

# list most popular shows from earlier analysis (with extra picks of my own)
shows <- c("The X-Files", "Law & Order", "Midsomer Murders", "Law & Order: Special Victims Unit", "ER", "Grey's Anatomy", "CSI: Crime Scene Investigation", "Supernatural", "King of the Hill", "Doctor Who", "Criminal Minds", "Bones", "Murdoch Mysteries", "American Horror Story", "Are you Afraid of the Dark?", "Californication", "Elementary", "Lost", "Numb3rs", "Shameless", "The Walking Dead", "The Sopranos", "Scrubs", "Oz", "House", "Dexter")

tv_rating %>%
  filter(title %in% shows) %>%
  mutate(title = str_replace(title, "Special Victims Unit", "SVU")) %>%
  group_by(title) %>%
  ggplot() +
  aes(date, av_rating) +
  facet_wrap(~title) +
  geom_line() +
  labs(
    title = "Viewer ratings over time by TV show title",
    x = "Years aired",
    y = "Average rating"
  ) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90))

Okay, now we have some interesting patterns to interpret. Overall, it looks like even the most popular shows experienced a decline in enthusiasm the longer they aired. However, there are notable exceptions to this trend: Criminal Minds and Murdoch Mysteries have really taken off in the last few years, both receiving the highest ratings out of any of the tiles in this sample. American Horror Story appears to be making a comeback as well recently. Noticeably, these exceptions all fit the suspenseful-like dramas noted earlier. It seems, then, the most successful TV dramas are ones with intense or striking elements (e.g., crime, murder, horror, etc.). I wonder if this says anything about the culture of the viewers…lookin’ at you America 🤔

Avatar
Jeremy R. Winget
Graduate Research Assistant & Lecturer

Related

comments powered by Disqus