This is the first of what I hope will be on an ongoing series with the world class hoops heads at Nylon Calculus exploring learning Data Science through NBA Basketball. This tutorial will teach you how to recreate the results of Savvas Tjortjoglou’s fantastic piece on using Python to extract historic NBA draft data except we will be using R. I've also decided to add a few new things to mix it up a little bit as well. Finally, if you have yet to read Savvas's piece I highly recommend you do so.
This and any future tutorials are not intended to be construed as:
As I detailed in my Sports Illustrated Hackathon keynote presentation, it doesn’t matter which programming language you pick so long as you master it to and use your skills to solve problems.
Any of the major programming languages {R, Python, JavaScript, C, ect..} with some expertise empower you to do the truly amazing things that programming languages can do! If you are a true amateur I’d advocate taking some time to learn about the major languages and then deciding which you think is best suited for you to learn.
That said, this and future tutorials are intended to destroy some misconceptions about the language that I love, R. There are myths you may come across that imply R is inferior to Python for web-scraping, that it’s syntax doesn’t make sense and that the language is too hard to learn.
None of these things are true and I hope by the end of this post, whatever your existing programming language of choice is, learned or intended to be learned, you won’t disagree with this statement.
In order to complete today’s tutorial you are going to ensure your computer is locked and load with a few things.
The final step before we are ready for action is to ensure you have the necessary R packages installed and loaded. I am using the development version of many of these packages, so a word caution there.
For those who need to install the packages, I’ve referenced the github repos in the comments, just copy and paste the code starting at devtools::
until the end the line into the console or in a script that you can run. For the non development packages just copy and paste starting from install.packages
and do the same as before.
packages <- #These are the packages we need
c(
'devtools', # install.packages(devtools)
'dplyr', # devtools::install_github('hadley/dplyr')
'magrittr', # devtools::install_github('smbache/magrittr')
'rvest', # devtools::install_github('hadley/rvest')
'data.table', # install.packages(data.table)
'lubridate', # devtools::install_github(hadley/lubridate)
'DataCombine', #devtools::install_github('christophergandrud/DataCombine')
'stringr', # devtools::install_github('hadley/stringr')
'readr', # devtools::install_github('hadley/readr')
'formattable', #devtools::install_github('renkun-ken/formattable')
'tidyr' # devtools::install_github('hadley/tidyr')
)
lapply(packages, library, character.only = T)
In order to mine for NBA draft data gold we need to know a few things, the most important of which being what years has the draft has existed?
To answer this question we navigate our browsers here http://www.basketball-reference.com/draft/
. It looks like the draft has existed since 1947 if you include the NBA’s precursor league the BAA and 1950 if you don’t.
We are almost ready to start mining we just need to pick a draft year to test. Given that it was the year this distinguished author’s birth, let’s go with 1983. From the main draft index page click on the 1983 Draft
link.
Look at all that beautiful data just waiting for us to suck in, I am giddy.
However, before we get ahead of ourselves thinking we’ve unlocked the magic formula to bringing data we long to explore into R we must find one more key input. This important missing piece is the CSS identifier of the data we want from the page, in this case the draft table. Think of this piece of information as unique identifier for the the html data on the page we want.
Remember that SelectorGadget I mentioned earlier? It’s time to use it to extract the CSS identifier we need.
Just like the 1978 Cars song said, we’ve got Just What We Needed it’s time to go data mining.
During this next portion of the tutorial we are going to write code that will port the NBA draft data into a data frame
. We are going to use the excellent rvest
package accomplish much of this. Essentially our code is going to:
However, before we complete those steps I want to enlighten our code. By that I mean I want to teach our code to be aware of the context of the data we are collecting. This will make our code self adjusting so we don’t have to update our code over and over again as new drafts occur {assuming the HTML structure of the page where the data lives doesn’t change, a far from fail proof assumption}.
Specifically, we are going to teach our code to allow us to only search for draft years that are completed. We are also going to teach our code how to differentiate between a BAA and NBA draft while simultaneously telling us which one we picked. We also want our code to tell us if we are searching for an ineligible draft year and stop what it is doing if we happen to make this mistake.
One major aside, you are going to see me use alot of the symbol %>%
, it is a called a pipe chaining operator, think of it like the term then. I’m a huge fan of chaining, it makes code flow more smoothly and cuts down on the amount of code you need to write. If you are on a Mac you can have R write a pipe easily by pressing command+shift+M
, Windows users I believe there is similar syntax as well but I don't know it off the top of my head.
Finally, if you are here to try to learn some R, I urge you to try to follow along with my comments by looking for #
and in case you are wondering <-
is how you assign thing in R and %<>%
is a beautiful combination of %>%
and<-
much simultaneously takes something, transforms it and assigns it.
draft_year <- # our chosen draft year
1983
year.first_draft <- # remember from earlier 1st BAA draft
1947
current_month <- # we're gonna to teach the code to know if the draft has
Sys.Date() %>%
month %>%
as.numeric
if (current_month > 6) {
year.most_recent_draft <- # the draft is at the end of June so if its July we've passed the draft!
Sys.Date() %>% # tells us what today is
year %>% # extracts the year
as.numeric()
} else {
year.most_recent_draft <-
Sys.Date() %>%
year %>%
as.numeric() - 1 #if it's earlier than July we take last year's draft
}
if (!draft_year %in% year.first_draft:year.most_recent_draft) {
stop.message <-
"Not a valid draft year boss!! Drafts can only be between " %>%
paste0(year.first_draft, ' and ', year.most_recent_draft)
stop(stop.message)
}
if (draft_year < 1950) {
base <- # remember what we learned earlier?
'http://www.basketball-reference.com/draft/BAA_'
id.league <-
'BAA' # based on the date R will know hte league
} else {
base <- # well if it's not the BAA it's the NBA, DUHHH
'http://www.basketball-reference.com/draft/NBA_'
id.league <-
'NBA'
}
url.draft_year <-
base %>%
paste0(draft_year, '.html') # creates the url where the data lives
url.draft_year # our url with the data and it should work for any draft year!
## [1] "http://www.basketball-reference.com/draft/NBA_1983.html"
Fantastic, this code based upon our selected draft year creates the URL where the draft data lives. Now that we have this we can move on to the fun part
I am going to write some fairly advanced data cleaning code that will use R’s magic to infer player’s corresponding draft round so we can add it as a variable in our data frame, unfortunately Basketball-Reference doesn’t provide that information to us in an easily usable way.
page <- # Get the html from the page into R
url.draft_year %>%
read_html
raw_data <-
page %>%
html_nodes('#stats') %>% # Remember this is the css id from earlier
html_table(header = F, fill = F) %>% # This function reads the table
data.frame %>% # Data is not ts not a data.frame this puts it into tht form
tbl_df # This converts it into a super dope special type of data fram tbl_df
headers <- # Get the parent header rows so we can append them into the column names and sove the duplicate name problem
raw_data %>%
slice(1) %>% # Takes the 1st row where this header info
unlist %>% # Returns a the row as list we dont want that
as.character %>% # We want to explicitly define this vector as a character
tolower %>% # Gonna be part of our title, I perfer lower case titles
str_replace('\\ ', '_') # Column names SHOULD NOT contain spaces, use the snake!!
# Time to get the actual column items
columns <-
raw_data %>%
slice(2) %>% # This information lives in the second row
tolower %>%
str_replace('%', '_pct') %>% # % should also never be a column header this gets rid of them
str_replace('/', '_per_') # / should ALSO never be a column header, this removes them
name.df <- # This creates a data frame that will contain column names
data_frame(header = headers, column = columns) %>% # data_frame is a special faster data frame from dplyr that I reccomend you use whever possible
mutate( # Mutate adds variables to data frames
header = ifelse(header == '', NA, header), # No headers for blank items, they happen to be identifiers about the draft player versus statistics
header = ifelse(header %like% 'round|territorial_picks', NA, header) # If we see these words we want to exclude that row
) %>%
FillDown('header') %>% # Fills down the keys
mutate(name.column = ifelse(header %>% is.na, column, paste(header, column, sep = '.')) # if the field has a header we want to join the items together if not just keep the item as the name, we know which those are by where the NAs live
)
names(raw_data) <- # change the names to our new clean names!
name.df$name.column
## Use R skills to magically figure out which round the player was taken in
round_rows <- # find out the row number where we get indication of the round
'Round' %>% # that is the word we need to look for
grep(raw_data$player) # we want to return the row numbers in the 4th column where we find that word, the round headers live in this column
round_df <- #create data frame where the rounds start
data_frame(round = paste0('Round ', 1:length(round_rows)), # the number of times we see the word indicates how many rounds that year had with some slight exceptions
id.row = round_rows + 2 # the players names start 2 rows below where we find the word
)
## This is for later but trust me from exploring, there were other ways players ended up on teams that weren't considered drafts, there were specific words that Basketball-Reference uses to delinate them and those are the 2 words we looking for
if ('Other|Territorial Picks' %>% grep(raw_data$player) %>% #looking for their existence, if they exist the length of the results won't be zero
length > 0) {
other_rows <-
'Other|Territorial Picks' %>% grep(raw_data$player)
other_rounds.df <-
data_frame(round = 'Other', #create a special name so we will get an NA for the round when we magically extract the round later
id.row = other_rows + 2)
round_df %<>%
bind_rows(other_rounds.df)
}
raw_data %<>%
mutate(id.row = 1:nrow(.)) %>% # We need to create a temporary id to merge the rows in the draft data frame to this, here we do this
left_join(round_df, by = 'id.row') %>% # A left join keeps the first table and merges any matching table from the right table, in this case the right table is the round data, frame
select(round, everything()) %>% # cleans up order
FillDown('round') %>% # Fills the round by the boundries
mutate(id.round = round %>% extract_numeric) %>% # We want to extract the numeric round if it exists this does that
select(id.round, everything()) %>% # cleans up the order again
select(-c(id.row, round)) # we dont need these columns, using - tells R to remove these columns
## Last step, remove the rows we don't need
raw_data %<>% ## remove the rows we don't need
slice(-1) %>% # First row the way we import the data contains the items, remember
dplyr::filter(!rk == 'Rk', !player %like% 'Round|Other Picks') # In these tables we know how to find "bad" rows by looking for "Rk" in the rk column and the word Round or Other Picks in the player column filters out fields that don't contain data because they contain those 3 words
raw_data %<>%
select(-rk) %>% # We dont need the rank column its meaningless
rename( #A little more name cleaning
id.pick = pk, # better name, helps us remember that the pick is an id which it is
id.bref.team = tm, # this column corresponds with the basketball reference
totals.years_played = yrs # Better name, shows us how many years someone played
) %>%
arrange(id.pick) #orders by pick!
numeric_columns <- # Tell R which fields are numeric and then convert them to numeric fields!
raw_data %>%
select(-c(id.bref.team, college, player)) %>% # Selects the non numeric fields
names # Returns the names of the columns that we want to convert
raw_data %<>%
mutate_each_(funs(as.numeric), # this is where we tell R what function we want to use on our selected columns, here its as.numeric
vars = numeric_columns # These are the columns from earlier
)
raw_data %>%
glimpse # This function gives us a snapshot of our data frame!
## Observations: 226
## Variables: 22
## $ id.round (dbl) 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, ...
## $ id.pick (dbl) 1, 10, 100, 101, 102, 103, 104, 105, 106, ...
## $ id.bref.team (chr) "HOU", "WSB", "UTA", "DET", "DAL", "WSB", ...
## $ player (chr) "Ralph Sampson", "Jeff Malone", "Matt Clar...
## $ college (chr) "University of Virginia", "Mississippi Sta...
## $ totals.years_played (dbl) 9, 13, NA, 1, 1, NA, NA, NA, NA, NA, NA, N...
## $ totals.g (dbl) 456, 905, NA, 7, 1, NA, NA, NA, NA, NA, NA...
## $ totals.mp (dbl) 13591, 29660, NA, 28, 16, NA, NA, NA, NA, ...
## $ totals.pts (dbl) 7039, 17231, NA, 12, 3, NA, NA, NA, NA, NA...
## $ totals.trb (dbl) 4011, 2364, NA, 3, 5, NA, NA, NA, NA, NA, ...
## $ totals.ast (dbl) 1038, 2154, NA, 1, 0, NA, NA, NA, NA, NA, ...
## $ shooting.fg_pct (dbl) 0.486, 0.484, NA, 0.462, 0.333, NA, NA, NA...
## $ shooting.3p_pct (dbl) 0.172, 0.268, NA, NA, NA, NA, NA, NA, NA, ...
## $ shooting.ft_pct (dbl) 0.661, 0.871, NA, NA, 0.500, NA, NA, NA, N...
## $ per_game.mp (dbl) 29.8, 32.8, NA, 4.0, 16.0, NA, NA, NA, NA,...
## $ per_game.pts (dbl) 15.4, 19.0, NA, 1.7, 3.0, NA, NA, NA, NA, ...
## $ per_game.trb (dbl) 8.8, 2.6, NA, 0.4, 5.0, NA, NA, NA, NA, NA...
## $ per_game.ast (dbl) 2.3, 2.4, NA, 0.1, 0.0, NA, NA, NA, NA, NA...
## $ advanced.ws (dbl) 20.1, 54.2, NA, 0.0, 0.0, NA, NA, NA, NA, ...
## $ advanced.ws_per_48 (dbl) 0.071, 0.088, NA, -0.046, -0.003, NA, NA, ...
## $ advanced.bpm (dbl) 0.0, -2.1, NA, -6.9, -12.1, NA, NA, NA, NA...
## $ advanced.vorp (dbl) 6.8, -0.9, NA, 0.0, 0.0, NA, NA, NA, NA, N...
Booyah, just like that we have the data from the table for the 1983 draft.
Believe it or not we are almost done and the hardest parts are behind us. All that is left to do is import the other valuable data from the draft page that being the player’s BREF ID, if it exists, which in turn gives us the player’s BREF profile URL. In addition to that, we want to resolve the BREF team IDs to the actual draft team.
The player IDs can be found the same way we brought in the table data, with their CSS identifiers. In this case they are an attribute
of the 4th table data column.
As for the resolving the team IDs that is more complicated process which I’ve taken the liberty of already doing separately. We are going to just read in that file which contains all the BREF team IDs, the team names, current team ids, and whether the franchise is still in existence. After collecting and importing that information we will merge it the existing data frame and we will be finished!
The final step is a personal preference, though completely unnecessary to the process of learning to be an R coding master. I like to Messagify my code, especially when functions are built around the code, and that’s coming next folks. Messages show you that your code is done running and can be tailored to, make you laugh, feel good about your work, or as the case here, learn a potentially random new fact. When we build our functions I will show you how to make Messagifying optional
player <-
page %>%
html_nodes('td:nth-child(4) a') %>% # This is the column where the player IDs live
html_text %>% # this takes the html output and returns text
str_trim # To be safe this trims the code in case there are any unceassary white spaces
url.bref.player <- # This creates the player URL
page %>%
html_nodes('td:nth-child(4) a') %>%
html_attr('href') %>% # This function pulls in the html attribute, in this case the stem
paste0('http://www.basketball-reference.com', .) # We need to append the base URL!
stem.player <- # Here we are going to extract out from the steam the exact player ID
page %>%
html_nodes('td:nth-child(4) a') %>%
html_attr('href') %>%
str_replace_all('/players/|.html', '') # eliminates the unnecassary words to help us get at the clean ID
players_urls <- # This will create a data frame with the information we want
data_frame(player, url.bref.player, stem.player) %>%
separate(stem.player, c('letter.first', 'id.bref.player'), # Separates the remaining 2 parts by its delimiter the / we only want the second column which contains the id
sep = '/') %>%
select(-letter.first) # removes the unneeded column
## Resolve the team ids
teams_ids <-
'https://asbcllc.com/data/nba/bref/nba_teams_ids.csv' %>%
read_csv # imports my team data
data <- #create a new data frame which will be our final data frame with all the information
raw_data %>%
left_join(players_urls, by = 'player') %>% # joins are existing data with the players who have BREF profiles
mutate(id.league, # add in the league id from earlier
year.draft = draft_year, # add in the draft year from earlier, I like to use periods to separate columns, total personal preference
url.bref.draft = url.draft_year # this tells us the URL where we got our data from
)
data %<>%
left_join(teams_ids, by = 'id.bref.team') %>% # Joins our data frame with the team data which will merge in the missing team data
select( #reorder the data
id.league,
year.draft,
id.round,
id.pick:player,
id.bref.player,
id.bref.team,
team,
id.bref.current_team,
current_team,
everything() #select everything else in the data frame
) %>%
arrange(id.pick)
### Time to Messagify!!
players <-
data %>%
nrow #counts the number of players in that draft
random_player <- # I want our message to return information about a random player from that draft
data %>%
arrange(desc(totals.pts)) %>% # arranges the data with the top scorers first
mutate(rank.total_points = 1:nrow(.)) %>% # adds a ranking field
dplyr::filter(totals.pts > 0 & totals.years_played > 1) %>% #filters for players who played over 1 season and scored points
dplyr::filter(!is.na(id.bref.player)) %>% # filters out players who don't have a BREF profile page
sample_n(1) # returns 1 random row with a player that fits the criteria!
# Print the message
"Congratulations you pulled in data for all " %>%
paste0(
players,
' players from the ',
draft_year,
' Draft\nHave you heard of the #',
random_player$id.pick,
' pick, ',
random_player$player,
'?\nHe played ',
random_player$totals.years_played,
' seasons & ranked #',
random_player$rank.total_points,
' in his draft class for total points scored!'
) %>%
message
## Congratulations you pulled in data for all 226 players from the 1983 Draft
## Have you heard of the #33 pick, Dirk Minniefield?
## He played 3 seasons & ranked #33 in his draft class for total points scored!
The moment of truth is here, we’ve completed the steps needed to bring the cleaned and resolved draft data into R, now we have to make sure it worked!
To do that we are going to select the a few columns, arrange by top career total point scorers, and look at the top 5 names.
data %>%
select(id.round, id.pick, player, team, totals.years_played, totals.pts) %>%
mutate(totals.pts = totals.pts %>% comma(digits = 0)) %>% # formats to add a comma
head %>% # Takes the top 5 players
format_table(align = 'c') # This creates the beautiful SVG table you are looking at
id.round | id.pick | player | team | totals.years_played | totals.pts |
---|---|---|---|---|---|
1 | 1 | Ralph Sampson | Houston Rockets | 9 | 7,039 |
1 | 2 | Steve Stipanovich | Indiana Pacers | 5 | 5,323 |
1 | 3 | Rodney McCray | Houston Rockets | 10 | 9,014 |
1 | 4 | Byron Scott | San Diego Clippers | 14 | 15,097 |
1 | 5 | Sidney Green | Chicago Bulls | 10 | 5,080 |
1 | 6 | Russell Cross | Golden State Warriors | 1 | 166 |
Now that we know our code is fleek and works like Draymond Green in the paint, it’s time to take our code and turn it into a function we can use whenever we feel like it. As I mentioned earlier, I am going to add in a few lines of code that will make returning a message optional. I am going to rewrite the code from earlier in a consolidated manner so you can just skip to the bottom if you want.
get_nba_year_draft_data <-
function(draft_year = 1983, # this is the default year if you run the function with no year sepcified
return_message = T # this specifies the default behavior is to return our message
) {
options(warn = -1) # Turns off some annoying warnings
#Make Smart Draft Years
year.first_draft <-
1947
current_month <-
Sys.Date() %>%
month %>%
as.numeric
if (current_month > 6) {
year.most_recent_draft <-
Sys.Date() %>%
year %>%
as.numeric()
} else {
year.most_recent_draft <-
Sys.Date() %>%
year %>%
as.numeric() - 1
}
if (!draft_year %in% year.first_draft:year.most_recent_draft) {
stop.message <-
"Not a valid draft year boss!! Drafts can only be between " %>%
paste0(year.first_draft, ' and ', year.most_recent_draft)
stop(stop.message)
}
if (draft_year < 1950) {
base <-
'http://www.basketball-reference.com/draft/BAA_'
id.league <-
'BAA'
} else {
base <-
'http://www.basketball-reference.com/draft/NBA_'
id.league <-
'NBA'
}
url.draft_year <-
base %>%
paste0(draft_year, '.html')
page <-
url.draft_year %>%
read_html
raw_data <-
page %>%
html_nodes('#stats') %>%
html_table(header = F, fill = F) %>%
data.frame %>%
tbl_df
headers <-
raw_data %>%
slice(1) %>%
unlist %>%
as.character %>%
tolower %>%
str_replace('\\ ', '_')
#Time to get the column items
columns <-
raw_data %>%
slice(2) %>%
tolower %>%
str_replace('%', '_pct') %>%
str_replace('/', '_per_')
name.df <-
data_frame(header = headers, column = columns) %>%
mutate(
header = ifelse(header == '', NA, header),
header = ifelse(header %like% 'round|territorial_picks', NA, header)
) %>%
FillDown('header') %>%
mutate(name.column = ifelse(header %>% is.na, column,
paste(header, column, sep = '.'))
)
names(raw_data) <-
name.df$name.column
## Magically figure out which round the player was taken in
round_rows <-
'Round' %>%
grep(raw_data$player)
round_df <-
data_frame(round = paste0('Round ', 1:length(round_rows)),
id.row = round_rows + 2
)
if ('Other|Territorial Picks' %>% grep(raw_data$player) %>% length > 0) {
other_rows <-
'Other|Territorial Picks' %>% grep(raw_data$player)
other_rounds.df <-
data_frame(round = 'Other',
id.row = other_rows + 2)
round_df %<>%
bind_rows(other_rounds.df)
}
raw_data %<>%
mutate(id.row = 1:nrow(.)) %>%
left_join(round_df, by = 'id.row') %>%
select(round, everything()) %>%
FillDown('round') %>%
mutate(id.round = round %>% extract_numeric) %>%
select(id.round, everything()) %>%
select(-c(id.row, round))
raw_data %<>%
slice(-1) %>%
dplyr::filter(!rk == 'Rk', !player %like% 'Round|Other Picks')
raw_data %<>%
select(-rk) %>%
rename(
id.pick = pk,
id.bref.team = tm,
totals.years_played = yrs
) %>%
mutate(id.pick = id.pick %>% as.numeric) %>%
arrange(id.pick)
numeric_columns <-
raw_data %>%
select(-c(id.bref.team, college, player)) %>%
names
raw_data %<>%
mutate_each_(funs(as.numeric),
vars = numeric_columns
)
## Player ID Extraction
player <-
page %>%
html_nodes('td:nth-child(4) a') %>%
html_text
url.bref.player <-
page %>%
html_nodes('td:nth-child(4) a') %>%
html_attr('href') %>%
paste0('http://www.basketball-reference.com', .)
stem.player <-
page %>%
html_nodes('td:nth-child(4) a') %>%
html_attr('href') %>%
str_replace_all('/players/|.html', '')
players_urls <-
data_frame(player, url.bref.player, stem.player) %>%
separate(stem.player, c('letter.first', 'id.bref.player'),
sep = '/') %>%
select(-letter.first)
## resolve with team
teams_ids <-
'https://asbcllc.com/data/nba/bref/nba_teams_ids.csv' %>%
read_csv
data <-
raw_data %>%
left_join(players_urls, by = 'player') %>%
mutate(id.league,
year.draft = draft_year,
url.bref.draft = url.draft_year)
data %<>%
left_join(teams_ids, by = 'id.bref.team') %>%
select(
id.league,
year.draft,
id.round,
id.pick:player,
id.bref.player,
id.bref.team,
team,
id.bref.current_team,
current_team,
everything()
) %>%
arrange(id.pick)
if (
return_message == T # this will only run our message if we tell it to in our function
) {
players <-
data %>% nrow
random_player <-
data %>%
dplyr::filter(totals.pts > 0 & totals.years_played > 1) %>%
arrange(desc(totals.pts)) %>%
mutate(rank.total_points = 1:nrow(.)) %>%
dplyr::filter(!is.na(id.bref.player)) %>%
sample_n(1)
"Congratulations you pulled in data for all " %>%
paste0(
players,
' players from the ',
draft_year,
' Draft\nHave you heard of the #',
random_player$id.pick,
' pick, ',
random_player$player,
'?\nHe played ',
random_player$totals.years_played,
' seasons & ranked #',
random_player$rank.total_points,
' in his draft class for total points scored!'
) %>%
message
}
return(data) # returns the final data frame, this is what we see!
}
Lets put our function to work and make sure we can accurately pull in the data from 2013 Draft.
data <-
get_nba_year_draft_data(draft_year = 2013,return_message = T)
## Congratulations you pulled in data for all 60 players from the 2013 Draft
## Have you heard of the #34 pick, Isaiah Canaan?
## He played 2 seasons & ranked #22 in his draft class for total points scored!
Let’s repeat what we did earlier, this time we are going to add a few more columns and look at the first 15 picks.
data %>%
select(id.round, id.pick, player, team, totals.years_played, totals.pts, per_game.pts, per_game.trb, per_game.ast, advanced.ws) %>%
mutate(totals.pts = totals.pts %>% comma(digits = 0)) %>% #some formating
head(15) %>% # Takes the top 15
format_table(align = 'c')
id.round | id.pick | player | team | totals.years_played | totals.pts | per_game.pts | per_game.trb | per_game.ast | advanced.ws |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | Anthony Bennett | Cleveland Cavailers | 2 | 515 | 4.7 | 3.4 | 0.6 | -0.1 |
1 | 2 | Victor Oladipo | Orlando Magic | 2 | 2,398 | 15.8 | 4.2 | 4.1 | 4.8 |
1 | 3 | Otto Porter | Washington Wizards | 2 | 523 | 4.7 | 2.5 | 0.7 | 2.7 |
1 | 4 | Cody Zeller | Charlotte Hornets | 2 | 962 | 6.7 | 5.0 | 1.3 | 6.5 |
1 | 5 | Alex Len | Phoenix Suns | 2 | 518 | 4.7 | 5.0 | 0.3 | 3.6 |
1 | 6 | Nerlens Noel | New Orleans Hornets | 1 | 744 | 9.9 | 8.1 | 1.7 | 4.0 |
1 | 7 | Ben McLemore | Sacramento Kings | 2 | 1,716 | 10.5 | 2.9 | 1.4 | 3.3 |
1 | 8 | Kentavious Caldwell-Pope | Detroit Pistons | 2 | 1,513 | 9.3 | 2.5 | 1.0 | 4.6 |
1 | 9 | Trey Burke | Minnesota Timberwolves | 2 | 1,868 | 12.8 | 2.8 | 5.0 | 3.3 |
1 | 10 | C.J. McCollum | Portland Trailblazers | 2 | 625 | 6.3 | 1.4 | 0.9 | 2.0 |
1 | 11 | Michael Carter-Williams | Philadelphia 76ers | 2 | 2,133 | 15.7 | 5.8 | 6.5 | 2.1 |
1 | 12 | Steven Adams | Oklahoma City Thunder | 2 | 802 | 5.3 | 5.7 | 0.7 | 7.0 |
1 | 13 | Kelly Olynyk | Dallas Mavericks | 2 | 1,263 | 9.4 | 5.0 | 1.6 | 6.5 |
1 | 14 | Shabazz Muhammad | Utah Jazz | 2 | 655 | 8.7 | 2.8 | 0.7 | 2.3 |
1 | 15 | Giannis Antetokounmpo | Milwaukee Bucks | 2 | 1,555 | 9.8 | 5.6 | 2.3 | 7.4 |
We now have demonstrated that our function works and will provide us for draft data for any valid year we give it. We can take this function and integrate it into another function that will let us pull in the complete NBA/BAA draft history or the specific draft years we want. We already put in the hard work creating the first function, this second function only has a few missing pieces before it will be up and running.
We want our function to let us explicitly ask us whether we want the BAA data. We want our code to know what to do if we don’t identify an explicit range of drafts. In this case we want our code to return all the eligible drafts if the range is undefined. Finally Messagify this function to return a random fact about a player in the chosen range.
Let’s Do It!! It’s Only Going to Take a Minute or 2! Don’t Forget to Test the Function On Years You Are Interested in Seeing!!
get_all_nba_draft_data <- function(year_start = NA, year_end = NA,
include_baa = T, return_message = T) {
# This makes sure the first function is in your environment
if ('get_nba_year_draft_data' %in% ls() ){
#
}
if (
include_baa == T # asks us if we want BAA data, assumes we do
) {
year.first_draft <-
1947 # first BAA draft year
} else {
year.first_draft <-
1950 # first NBA draft year
}
if (year_start %>% is.na) {
year_start <-
year.first_draft # takes the first BAA or NBA draft year if we don't give our code a start year
}
current_month <-
Sys.Date() %>%
month %>%
as.numeric
if (current_month > 6) {
year.most_recent_draft <-
Sys.Date() %>%
year %>%
as.numeric()
} else{
year.most_recent_draft <-
Sys.Date() %>%
year %>%
as.numeric() - 1 # Takes the most recent draft if we don't identify an explicit end to the draft range
}
if (year_end %>% is.na) {
year_end <-
year.most_recent_draft #If we don't select an end draft year overwrite with the most recent
}
draft_years <-
year_start:year_end #create a numeric vector of the draft years we want
## Loop through the selected draft years, get the data and append it to the master data frame
all_data <-
data_frame() # form an empty data_frame that will act as the master data frame
for (year in draft_years){
data <-
get_nba_year_draft_data(draft_year = year, return_message = F) # we dont want to see a bunch of messages so lets turn it off for this
all_data %<>% # takes the master data frame
bind_rows(data) # binds the rows of the year's data frame to that
}
if (return_message == T) {
players <-
all_data %>%
nrow
## We want this message to show us some random facts about one of the top 1000 scorers
random_player <-
all_data %>%
dplyr::filter(totals.pts > 0) %>% #people who have scored
arrange(desc(totals.pts)) %>% #sort by points
mutate(rank.total_points = 1:nrow(.)) %>%
dplyr::filter(!is.na(id.bref.player)) %>%
slice(1:1000) %>% #take the top 1000s players by points
sample_n(1) #take a sample of 1
"Congratulations you pulled in data for " %>%
paste0(
players,
' players from the ',
year_start,
' to ',
year_end,
' drafts\nHave you heard of the #',
random_player$id.pick,
' pick in the ',
random_player$year.draft,
' Draft, ',
random_player$player,
'?\nHe played ',
random_player$totals.years_played,
' seasons & ranks #',
random_player$rank.total_points,
' all time in total points scored during your selected draft eras!'
) %>%
message
}
return(all_data) #returns the data
}
The function should be loaded into your environment. Let’s test it out. Since the function is smart and makes assumptions if information isn’t given, you don’t need to input any assumptions, doing this will return the full NBA draft history inclusive of the BAA drafts.
Oh yea, be patient this function might take a little while.
all_data <-
get_all_nba_draft_data()
## Congratulations you pulled in data for 7789 players from the 1947 to 2015 drafts
## Have you heard of the #4 pick in the 1985 Draft, Xavier McDaniel?
## He played 12 seasons & ranks #170 all time in total points scored during your selected draft eras!
all_data %>%
select(year.draft, id.round, id.pick, player, team, totals.years_played, totals.pts, per_game.pts, per_game.trb, per_game.ast, advanced.ws) %>%
arrange(desc(totals.pts)) %>%
mutate(totals.pts = totals.pts %>% comma(digits = 0)) %>% #some formating
head(15) %>% #lets do the top 15 picks
format_table(align = 'c')
year.draft | id.round | id.pick | player | team | totals.years_played | totals.pts | per_game.pts | per_game.trb | per_game.ast | advanced.ws |
---|---|---|---|---|---|---|---|---|---|---|
1969 | 1 | 1 | Kareem Abdul-Jabbar | Milwaukee Bucks | 20 | 38,387 | 24.6 | 11.2 | 3.6 | 273.4 |
1985 | 1 | 13 | Karl Malone | Utah Jazz | 19 | 36,928 | 25.0 | 10.1 | 3.6 | 234.6 |
1996 | 1 | 13 | Kobe Bryant | Charlotte Hornets | 19 | 32,482 | 25.4 | 5.3 | 4.8 | 173.1 |
1984 | 1 | 3 | Michael Jordan | Chicago Bulls | 15 | 32,292 | 30.1 | 6.2 | 5.3 | 214.0 |
1959 | NA | NA | Wilt Chamberlain | Philadelphia Warriors | 14 | 31,419 | 30.1 | 22.9 | 4.4 | 247.3 |
1992 | 1 | 1 | Shaquille O’Neal | Orlando Magic | 19 | 28,596 | 23.7 | 10.9 | 2.5 | 181.7 |
1998 | 1 | 9 | Dirk Nowitzki | Milwaukee Bucks | 17 | 28,119 | 22.2 | 7.9 | 2.6 | 192.0 |
1968 | 1 | 1 | Elvin Hayes | San Diego Rockets | 16 | 27,313 | 21.0 | 12.5 | 1.8 | 120.8 |
1984 | 1 | 1 | Hakeem Olajuwon | Houston Rockets | 18 | 26,946 | 21.8 | 11.1 | 2.5 | 162.8 |
1960 | 1 | 1 | Oscar Robertson | Cincinnati Royals | 14 | 26,710 | 25.7 | 7.5 | 9.5 | 189.2 |
1982 | 1 | 3 | Dominique Wilkins | Utah Jazz | 15 | 26,668 | 24.8 | 6.7 | 2.5 | 117.5 |
1962 | 1 | 7 | John Havlicek | Boston Celtics | 16 | 26,395 | 20.8 | 6.3 | 4.8 | 131.7 |
1997 | 1 | 1 | Tim Duncan | San Antonio Spurs | 18 | 25,974 | 19.5 | 11.0 | 3.1 | 201.2 |
1995 | 1 | 5 | Kevin Garnett | Minnesota Timberwolves | 20 | 25,949 | 18.2 | 10.2 | 3.8 | 190.4 |
1998 | 1 | 10 | Paul Pierce | Boston Celtics | 17 | 25,899 | 20.7 | 5.8 | 3.7 | 149.1 |
Now that we have the entire draft history in a data frame we can write it to disk so we have it locally.
all_data %>%
write_csv(path = 'data/nba_baa_drafts_1947_2015.csv') #pick where ever you want to save it this is just my choice.
We did it ladies and gentleman. We now have a function that can pull in any and all NBA or BAA drafts we give it.
Today’s tutorial ends here but we now have refined NBA draft data gold ready for all sorts of interesting data analysis and visualization, things R also excels at.
Check back soon and for a tutorial that will teach how how to visualize and analyze this dataset.
I hope you enjoyed this and recognize that R can do anything that Python does with ease and grace.
Also, if you ever hear someone say R isn’t good at web-scraping you can shake your head and laugh!
You can find the code for the functions here and I urge you to download or fork the repo and use the code whenever you need it!
As always please don’t hesitate to reach out to me on twitter with any questions, comments and concerns!