This is the second in a series of posts that demonstrates how the tidyverse can be used to easily explore and analyze NFL play-by-play data. In part one, I used the dplyr package to calculate the offensive success rate of each NFL offense in during the 2016 season. However, when we left off, I noted that really we should look at the success rate of both offenses and defenses in order to get a better idea of which teams were the best overall. For this, we’ll use the tidyr package, which is main tidyverse package for tidying data. But first, let’s talk about the pipe operator.
Embrace the pipe
The pipe operator (%>%
) comes from the magrittr package, and it is used to make code more readable. Compare the following two chunks of code.
library(tidyverse)
<- read_rds("data/nfl_pbp_2016.rds")
nfl_pbp
<- select(nfl_pbp, game_id = gid, play_id = pid, offense = off,
success defense = def, play_type = type, down = dwn,
to_go = ytg, gained = yds)
<- filter(success, play_type %in% c("PASS", "RUSH")) success
library(tidyverse)
<- read_rds("data/nfl_pbp_2016.rds")
nfl_pbp
<- nfl_pbp %>%
success select(game_id = gid, play_id = pid, offense = off, defense = def,
play_type = type, down = dwn, to_go = ytg, gained = yds) %>%
filter(play_type %in% c("PASS", "RUSH"))
Both code chunks read in the data and do some initial manipulations to calculate success rate using the dplyr package (see part one of this series for a more in depth explanation of how these functions work). In the first chunk, the data has to be referenced multiple times to select some variables from nfl_pbp
and then filter to only passing and rushing plays. However, in the second chunk, all of the data manipulation is contained in a single code stanza, which improves readability.
The pipe works by passing the output of the function on the left as the first argument to the function on the right. So a %>% f(x, y) %>% g(z)
is interpreted by the computer as g(f(a, x, y), z)
. These two statements are equivalent, but in the first, it is much easier to see that a
is passes to function f()
, which also has arguments x
and y
, and then the output of that is passes to function g()
, which also has argument z
. In the tidyverse, the first argument is always the data frame that the operation is being performed on, and a data frame with the specified operation performed is returned. Thus, all of the tidyverse functions are compatible with the %>%
operator. Many other base R functions, such as unique
, are also compatible with the pipe.
Throughout the rest of this post, and the following posts in this series, we’ll use the %>%
to make the code more readable and straight forward.
tidyr
In this post, we’ll be using the tidyr package to calculate the offensive and defensive success rate for each team. Let’s remind ourselves what the data looks like.
success#> # A tibble: 34,149 × 8
#> game_id play_id offense defense play_type down to_go gained
#> <int> <int> <chr> <chr> <chr> <int> <int> <int>
#> 1 4257 697182 DEN CAR PASS 1 10 11
#> 2 4257 697183 DEN CAR PASS 1 10 0
#> 3 4257 697184 DEN CAR PASS 2 10 0
#> 4 4257 697185 DEN CAR PASS 3 10 12
#> 5 4257 697186 DEN CAR PASS 1 10 5
#> 6 4257 697187 DEN CAR RUSH 2 5 13
#> 7 4257 697188 DEN CAR RUSH 1 10 5
#> 8 4257 697189 DEN CAR RUSH 2 5 0
#> 9 4257 697190 CAR DEN RUSH 1 10 6
#> 10 4257 697191 CAR DEN RUSH 2 4 11
#> # … with 34,139 more rows
In the previous post, we grouped the data frame by offense in order to calculate success rate. Now, we want to group by team and offensive or defensive unit to calculate success rate. The problem we face is that there is no unit
variable. Instead, the unit information is split across two columns: offense
and defense
. This is where the tidyr package comes into play. There are two main functions within tidyr that are used for tidying data:
gather
: collapse multiple columns into key-value pairs;spread
: split key-value pairs into multiple columns.
I’ll demonstrate how both of these functions are used in our analysis of NFL success rates.
Using tidyr
Before we start using the tidyr package, let’s calculate whether each play was a success or not using the processes from the previous post.
<- success %>%
success mutate(
needed = case_when(
== 1 ~ to_go * 0.45,
down == 2 ~ to_go * 0.60,
down TRUE ~ to_go * 1.00
),play_success = case_when(
>= needed ~ TRUE,
gained < needed ~ FALSE
gained
)
)
success#> # A tibble: 34,149 × 10
#> game_id play_id offense defense play_type down to_go gained needed
#> <int> <int> <chr> <chr> <chr> <int> <int> <int> <dbl>
#> 1 4257 697182 DEN CAR PASS 1 10 11 4.5
#> 2 4257 697183 DEN CAR PASS 1 10 0 4.5
#> 3 4257 697184 DEN CAR PASS 2 10 0 6
#> 4 4257 697185 DEN CAR PASS 3 10 12 10
#> 5 4257 697186 DEN CAR PASS 1 10 5 4.5
#> 6 4257 697187 DEN CAR RUSH 2 5 13 3
#> 7 4257 697188 DEN CAR RUSH 1 10 5 4.5
#> 8 4257 697189 DEN CAR RUSH 2 5 0 3
#> 9 4257 697190 CAR DEN RUSH 1 10 6 4.5
#> 10 4257 697191 CAR DEN RUSH 2 4 11 2.4
#> # … with 34,139 more rows, and 1 more variable: play_success <lgl>
In order to calculate the success rate for each team and unit combination, we first need to gather the unit information into key-value pairs using the gather
function.
%>%
success gather(key = "team_unit", value = "team", offense:defense)
#> # A tibble: 68,298 × 10
#> game_id play_id play_type down to_go gained needed play_success team_unit
#> <int> <int> <chr> <int> <int> <int> <dbl> <lgl> <chr>
#> 1 4257 697182 PASS 1 10 11 4.5 TRUE offense
#> 2 4257 697183 PASS 1 10 0 4.5 FALSE offense
#> 3 4257 697184 PASS 2 10 0 6 FALSE offense
#> 4 4257 697185 PASS 3 10 12 10 TRUE offense
#> 5 4257 697186 PASS 1 10 5 4.5 TRUE offense
#> 6 4257 697187 RUSH 2 5 13 3 TRUE offense
#> 7 4257 697188 RUSH 1 10 5 4.5 TRUE offense
#> 8 4257 697189 RUSH 2 5 0 3 FALSE offense
#> 9 4257 697190 RUSH 1 10 6 4.5 TRUE offense
#> 10 4257 697191 RUSH 2 4 11 2.4 TRUE offense
#> # … with 68,288 more rows, and 1 more variable: team <chr>
When using the gather
function, the column names of the collapsed columns become the “key”, and the values are the “value”. In the function, we chose to name the column of keys team_unit
and the column of values team
. We then specified which columns should be collapsed. Each column can be named individually (i.e., c(offense, defense)
), or you can specify all columns in a range (i.e., offense:defense
).
Next, as I noted last time, play success is calculated such that it measures whether the play was a success for the offense. This means that if a play was successful for an offense it was unsuccessful for the defense, and vice versa. Therefore, we need to reverse the decision of whether or not a play was successful when we calculate defensive success rates.
%>%
success gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
== "defense" ~ !play_success,
team_unit TRUE ~ play_success
)
)#> # A tibble: 68,298 × 10
#> game_id play_id play_type down to_go gained needed play_success team_unit
#> <int> <int> <chr> <int> <int> <int> <dbl> <lgl> <chr>
#> 1 4257 697182 PASS 1 10 11 4.5 TRUE offense
#> 2 4257 697183 PASS 1 10 0 4.5 FALSE offense
#> 3 4257 697184 PASS 2 10 0 6 FALSE offense
#> 4 4257 697185 PASS 3 10 12 10 TRUE offense
#> 5 4257 697186 PASS 1 10 5 4.5 TRUE offense
#> 6 4257 697187 RUSH 2 5 13 3 TRUE offense
#> 7 4257 697188 RUSH 1 10 5 4.5 TRUE offense
#> 8 4257 697189 RUSH 2 5 0 3 FALSE offense
#> 9 4257 697190 RUSH 1 10 6 4.5 TRUE offense
#> 10 4257 697191 RUSH 2 4 11 2.4 TRUE offense
#> # … with 68,288 more rows, and 1 more variable: team <chr>
Here, we’re saying if the team_unit
for the observation is equal to defense, play_success
is equal to the opposite of the current value of play success. If play_success
is not equal to defense, then the value is unchanged. We can now group by team
and team_unit
to calculate the success rate of each offense and defense.
%>%
success gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
== "defense" ~ !play_success,
team_unit TRUE ~ play_success
)%>%
) group_by(team, team_unit) %>%
summarize(success_rate = mean(play_success, na.rm = TRUE))
#> # A tibble: 64 × 3
#> # Groups: team [32]
#> team team_unit success_rate
#> <chr> <chr> <dbl>
#> 1 ARI defense 0.579
#> 2 ARI offense 0.458
#> 3 ATL defense 0.512
#> 4 ATL offense 0.500
#> 5 BAL defense 0.580
#> 6 BAL offense 0.410
#> 7 BUF defense 0.553
#> 8 BUF offense 0.464
#> 9 CAR defense 0.542
#> 10 CAR offense 0.420
#> # … with 54 more rows
Finally, we want to spread the data frame back out so there is only one row per team. This is where the spread
function comes into play.
%>%
success gather(key = "team_unit", value = "team", offense:defense) %>%
mutate(
play_success = case_when(
== "defense" ~ !play_success,
team_unit TRUE ~ play_success
)%>%
) group_by(team, team_unit) %>%
summarize(success_rate = mean(play_success, na.rm = TRUE)) %>%
spread(key = team_unit, value = success_rate)
#> # A tibble: 32 × 3
#> # Groups: team [32]
#> team defense offense
#> <chr> <dbl> <dbl>
#> 1 ARI 0.579 0.458
#> 2 ATL 0.512 0.500
#> 3 BAL 0.580 0.410
#> 4 BUF 0.553 0.464
#> 5 CAR 0.542 0.420
#> 6 CHI 0.550 0.461
#> 7 CIN 0.558 0.464
#> 8 CLE 0.546 0.409
#> 9 DAL 0.521 0.511
#> 10 DEN 0.594 0.423
#> # … with 22 more rows
The spread
function takes the key-value pairs and spread them out into their own columns. The “key” column becomes the column names, so a new column will be created for each value in the “key”. We have only two values (offense and defense), meaning that these are the two new columns that are created. Each column is populated with the values associated with that observation in the “value” column specified in the call to the spread
function. Now we’re back to one row per team, so it’s easier to see the success rate of each unit for all of the teams.
Conclusion
This post shows how the tidyr package, which is main the data cleaning package in the tidyverse, can be used to quickly reshape data in order to facilitate and perform analyses. In the next post, I’m going to demonstrate how the ggplot2 package can be used to visualize these results. For more tidyr resources, check out:
Acknowledgments
Featured photo by Thomas Park on Unsplash.