Welcome to the first in a series of blog posts where I’ll be using sports data to demonstrate the power of the tidyverse tools in the context of sports analytics. The tidyverse is a suite of packages developed mainly by Hadley Wickham, with contributions from over 100 other people in the R community. The goal of the tidyverse is to provide easy-to-use R packages for a data science workflow that all follow a consistent philosophy and API. In each post in this series, I’ll focus on one package within the tidyverse in order to demonstrate how each of its major pieces works. In all of the posts, I’ll be exploring the play-by-play data for every game of the 2016 NFL season. This data comes from Armchair Analysis, which is paywalled. However, for a one-time flat rate, you can have access to their historical database with yearly updates forever.
This post will focus on the dplyr package, which is used for data manipulation.
dplyr
Before we get into how dplyr works, let’s first load in our data.
library(tidyverse)
nfl_pbp <- read_rds("data/nfl_pbp_2016.rds")
nfl_pbp
#> # A tibble: 44,385 × 30
#> gid pid off def type dseq len qtr min sec ptso ptsd
#> <int> <int> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 4257 697181 DEN CAR KOFF 0 6 1 15 0 0 0
#> 2 4257 697182 DEN CAR PASS 1 30 1 15 0 0 0
#> 3 4257 697183 DEN CAR PASS 2 4 1 14 17 0 0
#> 4 4257 697184 DEN CAR PASS 3 5 1 14 13 0 0
#> 5 4257 697185 DEN CAR PASS 4 26 1 14 8 0 0
#> 6 4257 697186 DEN CAR PASS 5 41 1 13 42 0 0
#> 7 4257 697187 DEN CAR RUSH 6 38 1 13 1 0 0
#> 8 4257 697188 DEN CAR RUSH 7 36 1 12 23 0 0
#> 9 4257 697189 DEN CAR RUSH 8 10 1 11 47 0 0
#> 10 4257 697190 CAR DEN RUSH 1 36 1 11 37 0 0
#> # … with 44,375 more rows, and 18 more variables: timo <int>, timd <int>,
#> # dwn <int>, ytg <int>, yfog <int>, zone <int>, fd <int>, sg <int>, nh <int>,
#> # pts <int>, tck <int>, sk <int>, pen <int>, ints <int>, fum <int>,
#> # saf <int>, blk <int>, yds <int>
Here we can see that there were a total of 44,385 plays in the 2016 NFL season, and we have 30 variables for each play. The dplyr package provides a way to manipulate this data so that we can easily answer research questions. There are five main functions in dplyr that take care of the majority of data manipulation tasks:
- arrange: sort the data by a variable or set of variables;
- filter: filter observations based on one or more criteria;
- mutate: create new variables as functions of existing variables;
- select: choose variables by name, or rename variables;
- summarize: calculate summary statistics according to one or more grouping variables.
To demonstrate how these functions work, let’s calculate the rate of successful plays for each team. For the NFL, we’ll consider a play successful if the team gains a certain percentage of the yards to go, where that percentage varies by down. Specifically, the yards needed for the play to be successful are as follows:
- 1st down: 45 percent of the needed yards;
- 2nd down: 60 percent of the needed yards;
- 3rd and 4th down: 100 percent of the needed yards
These values come from Bob Carroll, Pete Palmer, and John Thorn in The Hidden Game of Football. In this definition, each play is independent. Say, for example, that a team starts with a 1st down and 10 yards to go. In order for the play to be successful, they need to gain 4.5 yards. They gain 6, so this play is successful. Now it is 2nd and 4. For this play to be successful, they need 60 percent of the 4 yards remaining, meaning they need to gain 2.4 yards. Instead they lose 1 yard, so this play is unsuccessful and it is now 3rd and 5. On third down they need all 5 yards (meaning they need to make the new set of downs) in order for the play to be successful. The calculations continue in this way for every play in the data set.
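As a quick check on the arithmetic above, here is a minimal sketch in base R. The `success_share` lookup vector and the three-row `drive` data frame are made up for this illustration and are not part of the Armchair Analysis data. The third-down result is not given in the example, so it is left as `NA`.

```r
# Hypothetical lookup: share of the yards to go needed for success,
# indexed by down (1st, 2nd, 3rd, 4th)
success_share <- c(0.45, 0.60, 1.00, 1.00)

# The three plays from the example: 1st and 10, 2nd and 4, 3rd and 5.
# The outcome of the third-down play is not stated, so gained is NA there.
drive <- data.frame(
  down   = c(1, 2, 3),
  to_go  = c(10, 4, 5),
  gained = c(6, -1, NA)
)

# Yards needed on each play, then whether the play met that threshold
drive$needed <- drive$to_go * success_share[drive$down]
drive$play_success <- drive$gained >= drive$needed
drive
```

The `needed` column should come out as 4.5, 2.4, and 5, matching the walk-through, with the first play marked successful and the second unsuccessful.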
Using dplyr
First I’ll pull out the needed variables using the select function.
success <- select(nfl_pbp, game_id = gid, play_id = pid, offense = off,
                  defense = def, play_type = type, down = dwn, to_go = ytg, gained = yds)
success
#> # A tibble: 44,385 × 8
#> game_id play_id offense defense play_type down to_go gained
#> <int> <int> <chr> <chr> <chr> <int> <int> <int>
#> 1 4257 697181 DEN CAR KOFF 0 0 NA
#> 2 4257 697182 DEN CAR PASS 1 10 11
#> 3 4257 697183 DEN CAR PASS 1 10 0
#> 4 4257 697184 DEN CAR PASS 2 10 0
#> 5 4257 697185 DEN CAR PASS 3 10 12
#> 6 4257 697186 DEN CAR PASS 1 10 5
#> 7 4257 697187 DEN CAR RUSH 2 5 13
#> 8 4257 697188 DEN CAR RUSH 1 10 5
#> 9 4257 697189 DEN CAR RUSH 2 5 0
#> 10 4257 697190 CAR DEN RUSH 1 10 6
#> # … with 44,375 more rows
Notice that in the select statement, we not only chose the variables we wanted, but we also gave them new names that more clearly convey the information included in each variable. For example, we chose to keep the unique game identifier, but we renamed it from gid to game_id.
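To make the choose-and-rename behavior concrete, here is a small sketch on a toy tibble. The `toy` data and its values are invented for illustration. It also shows dplyr's rename function, which is not used in this post, as a point of contrast: rename changes names but keeps every column.

```r
library(dplyr)

# Toy stand-in for the play-by-play data (values invented for illustration)
toy <- tibble(gid = c(1L, 1L), pid = c(10L, 11L), qtr = c(1L, 1L))

# select() keeps only the named columns, renaming them as it goes
picked <- select(toy, game_id = gid, play_id = pid)
names(picked)

# rename() changes the name but keeps all of the other columns
renamed <- rename(toy, game_id = gid)
names(renamed)
```

Here `picked` drops the `qtr` column, while `renamed` still has it.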
Next, we need to filter out plays that don’t fall into our operational definition of success. For example, under the definition above, how would we determine if a kickoff was successful? Since only passing or rushing plays have criteria for success, we will use the filter function to select only passing or rushing plays.
success <- filter(success, play_type %in% c("PASS", "RUSH"))
success
#> # A tibble: 34,149 × 8
#> game_id play_id offense defense play_type down to_go gained
#> <int> <int> <chr> <chr> <chr> <int> <int> <int>
#> 1 4257 697182 DEN CAR PASS 1 10 11
#> 2 4257 697183 DEN CAR PASS 1 10 0
#> 3 4257 697184 DEN CAR PASS 2 10 0
#> 4 4257 697185 DEN CAR PASS 3 10 12
#> 5 4257 697186 DEN CAR PASS 1 10 5
#> 6 4257 697187 DEN CAR RUSH 2 5 13
#> 7 4257 697188 DEN CAR RUSH 1 10 5
#> 8 4257 697189 DEN CAR RUSH 2 5 0
#> 9 4257 697190 CAR DEN RUSH 1 10 6
#> 10 4257 697191 CAR DEN RUSH 2 4 11
#> # … with 34,139 more rows
This leaves us with 34,149 plays. The next step is to calculate the number of yards needed in order for the play to be deemed successful. This can be done using the mutate function along with the case_when function. The case_when function acts like a series of if statements, evaluated in order, to determine the proper condition.
success <- mutate(success, needed = case_when(
  down == 1 ~ to_go * 0.45,
  down == 2 ~ to_go * 0.60,
  TRUE ~ to_go * 1.00
))
success
#> # A tibble: 34,149 × 9
#> game_id play_id offense defense play_type down to_go gained needed
#> <int> <int> <chr> <chr> <chr> <int> <int> <int> <dbl>
#> 1 4257 697182 DEN CAR PASS 1 10 11 4.5
#> 2 4257 697183 DEN CAR PASS 1 10 0 4.5
#> 3 4257 697184 DEN CAR PASS 2 10 0 6
#> 4 4257 697185 DEN CAR PASS 3 10 12 10
#> 5 4257 697186 DEN CAR PASS 1 10 5 4.5
#> 6 4257 697187 DEN CAR RUSH 2 5 13 3
#> 7 4257 697188 DEN CAR RUSH 1 10 5 4.5
#> 8 4257 697189 DEN CAR RUSH 2 5 0 3
#> 9 4257 697190 CAR DEN RUSH 1 10 6 4.5
#> 10 4257 697191 CAR DEN RUSH 2 4 11 2.4
#> # … with 34,139 more rows
In this chunk, I’m saying that if the down is equal to 1, the yards needed are 45 percent of the yards to go. If the down isn’t equal to 1, we move to the second condition: if the down is equal to 2, the yards needed are 60 percent of the yards to go. Finally, if the down isn’t equal to 2 either, we fall through to the final condition, which says that for all other downs the yards needed are 100 percent of the yards to go. We can then determine which plays were a success by using mutate again to check whether the yards gained are greater than or equal to the yards needed.
success <- mutate(success, play_success = case_when(
  gained >= needed ~ TRUE,
  gained < needed ~ FALSE
))
success
#> # A tibble: 34,149 × 10
#> game_id play_id offense defense play_type down to_go gained needed
#> <int> <int> <chr> <chr> <chr> <int> <int> <int> <dbl>
#> 1 4257 697182 DEN CAR PASS 1 10 11 4.5
#> 2 4257 697183 DEN CAR PASS 1 10 0 4.5
#> 3 4257 697184 DEN CAR PASS 2 10 0 6
#> 4 4257 697185 DEN CAR PASS 3 10 12 10
#> 5 4257 697186 DEN CAR PASS 1 10 5 4.5
#> 6 4257 697187 DEN CAR RUSH 2 5 13 3
#> 7 4257 697188 DEN CAR RUSH 1 10 5 4.5
#> 8 4257 697189 DEN CAR RUSH 2 5 0 3
#> 9 4257 697190 CAR DEN RUSH 1 10 6 4.5
#> 10 4257 697191 CAR DEN RUSH 2 4 11 2.4
#> # … with 34,139 more rows, and 1 more variable: play_success <lgl>
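Because the two conditions are exact complements, the comparison on its own should produce the same logical column. The sketch below checks this on an invented three-row tibble, where the `toy` data and the `via_case_when` and `via_compare` column names are made up for illustration. It also shows that a missing value for yards gained leads to `NA` either way, which is the situation `na.rm = TRUE` guards against when averaging.

```r
library(dplyr)

# Invented rows: a success, a failure, and a play with missing yardage
toy <- tibble(gained = c(11L, 0L, NA), needed = c(4.5, 4.5, 6))

toy <- mutate(toy,
  # The two-branch case_when from the post
  via_case_when = case_when(
    gained >= needed ~ TRUE,
    gained < needed ~ FALSE
  ),
  # The bare comparison, which is already a logical vector
  via_compare = gained >= needed
)
toy
```

Both columns agree row by row, so the choice between them is mostly a matter of readability.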
Now that we’ve calculated whether each play was a success, we can calculate the overall success rate for each team. To do this, we need to group our data by team. This can be accomplished using the group_by function. This function doesn’t change the values in the data, but it records the grouping so that other functions operate within each of the groups. Specifically, we will use the summarize function to calculate the rate of successful plays within each group.
success <- group_by(success, team = offense)
success <- summarize(success, success_rate = mean(play_success, na.rm = TRUE))
success
#> # A tibble: 32 × 2
#> team success_rate
#> <chr> <dbl>
#> 1 ARI 0.458
#> 2 ATL 0.500
#> 3 BAL 0.410
#> 4 BUF 0.464
#> 5 CAR 0.420
#> 6 CHI 0.461
#> 7 CIN 0.464
#> 8 CLE 0.409
#> 9 DAL 0.511
#> 10 DEN 0.423
#> # … with 22 more rows
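The reason mean works here is that R treats TRUE as 1 and FALSE as 0, so the mean of a logical column is the proportion of TRUE values. Here is a minimal sketch on invented data. The team codes "AAA" and "BBB" are placeholders, and the extra `plays` column is an addition for illustration only.

```r
library(dplyr)

# Invented play-level rows for two placeholder teams
toy <- tibble(
  offense = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  play_success = c(TRUE, FALSE, TRUE, FALSE, NA)
)

grouped <- group_by(toy, team = offense)
rates <- summarize(grouped,
  plays = n(),                                     # rows per team, NA included
  success_rate = mean(play_success, na.rm = TRUE)  # share of TRUE among non-missing
)
rates
```

Team AAA succeeds on two of three plays. Team BBB has one unsuccessful play and one missing value, which `na.rm = TRUE` drops before averaging.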
Finally, I will use the arrange function to sort this data set by the success rate, allowing us to see which teams were the most successful on average. Note that I use the desc function to sort success rate in descending order, so that the teams with the highest success rates are first. By default, the arrange function sorts in ascending order.
success <- arrange(success, desc(success_rate))
success
#> # A tibble: 32 × 2
#> team success_rate
#> <chr> <dbl>
#> 1 NO 0.518
#> 2 DAL 0.511
#> 3 ATL 0.500
#> 4 WAS 0.478
#> 5 GB 0.478
#> 6 IND 0.471
#> 7 TEN 0.469
#> 8 BUF 0.464
#> 9 CIN 0.464
#> 10 PIT 0.461
#> # … with 22 more rows
One consideration from these results is that I grouped by the offensive team. Thus, this analysis tells us how often the teams’ offenses were successful. To evaluate a team overall, you would also want to examine defensive success rates. However, these results do seem to make intuitive sense, as the top three teams (New Orleans, Dallas, and Atlanta), all had very good offenses last season.
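A defensive version of the same calculation could group by the defending team instead. The sketch below runs on invented data, with placeholder team codes and a hypothetical `plays` tibble. To run it on the real data you would need a copy of the play-level table from before the summarize step, since `success` was overwritten with the team summary.

```r
library(dplyr)

# Invented play-level rows; defense and play_success mirror the columns built earlier
plays <- tibble(
  defense = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  play_success = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

by_defense <- group_by(plays, team = defense)
allowed <- summarize(by_defense, success_allowed = mean(play_success, na.rm = TRUE))

# Ascending sort: defenses that allow the fewest successful plays come first
allowed <- arrange(allowed, success_allowed)
allowed
```

Here a lower rate is better, so the default ascending sort puts the strongest defenses at the top.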
Conclusion
This post shows how the dplyr package, which is the main data manipulation package in the tidyverse, can be used to quickly manipulate and analyze data with just a few lines of code. In the upcoming posts I’ll be using the tidyr package to help tidy data and ggplot2 to visualize data. I’ll also be introducing the pipe operator in the next post, which will make our code more readable. For more dplyr resources, check out:
Acknowledgments
Featured photo by Dave Adamson on Unsplash.