If you are like me and love sports, data, and visualization [which will come in the form of GIF's and JPG's today] this post is for you. In this tutorial we will learn how to harness power of R to build a function that gives us access to data from Basketball Reference.


Arm Yourself

In order to partake in this adventure we will need these powerful weapons in your arsenal. First make sure you have R and RStudio installed. Then if you don’t already use Firefox or Chrome pick a browser horse and install it. Finally, launch your chosen browse and install the fantastic SelectorGadget widget. Fire up , crack your knuckles and finish any other pre-game routines, it’s game time.


Party Time

Now that we are ready, we must decide what exactly to explore. Being a huge fan of THE and still being on cloud 9 from our recent beatdown of the despised New York Knickerbockers I’ve decided to explore NBA team data from my favorite NBA data site Basketball Reference.


After some navigation through the links on the home page we find what we want, the Seasons Index page. Let’s look at the 2013-2014 season. Looks like the page is essentially a bunch of tables, fantastic news for data scraping. Let’s try to pull in the Team Stats table. Use the Selector widget and navigate to the bottom of the Team Stats and click such that an orange box surrounds the table. A pop up with the text #team should appear.

Booya, that’s the CSS Selector for the table. With this information we know what is needed to bring this data into R. Navigate back into RStudio, fire up a new R Script file because its data scraping time.


Ready, Aim, Fire

Once in the script we need to load the packages needed to bring us to victory. I am about to start embedding code but before I do here a few quick notes. As a self taught coder I like to code left to right and assign using ->, something for some strange reason, is frowned upon by the authority. Most people code right to left and assign using <- but it’s a free country, and I like to code how I read but you can do it however you like, there is no wrong answer just being able to do things or not.

Next point, there is a new craze hitting R that all the cool kids are using, it’s called piping and it is used with the symbols %>% or %>>%, think of it like the word THEN, I prefer the %>>% syntax from the fantastic pipeR package. Finally anytime you see # it is a comment that the code will skip, please pardon my jibberish but I hope you can understand most of my comments.

Load the Necessary Packages

c('rvest','dplyr','pipeR', 'knitr') -> packages # you don't need knitr but I do to make this post
#dplyr or pipeR use the install.packages function to install them, install.packages('dplyr') and install.packages('pipeR')
#If you don't have rvest do the following - install devtools, install.packages('devtools')
#Load devtools using library(devtools) and then install rvest by using install_github('hadley/rvest')
lapply(packages, library, character.only = T) #loops through the packages and loads them
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:utils':
##
##     history
##
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
##     filter
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
## [[1]]
## [1] "rvest"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
## [7] "methods"   "base"
##
## [[2]]
## [1] "dplyr"     "rvest"     "stats"     "graphics"  "grDevices" "utils"
## [7] "datasets"  "methods"   "base"
##
## [[3]]
##  [1] "pipeR"     "dplyr"     "rvest"     "stats"     "graphics"
##  [6] "grDevices" "utils"     "datasets"  "methods"   "base"
##
## [[4]]
##  [1] "knitr"     "pipeR"     "dplyr"     "rvest"     "stats"
##  [6] "graphics"  "grDevices" "utils"     "datasets"  "methods"
## [11] "base"

Start Scraping

Step 1 lets get the table into R. Since we want scale this to other tables we are going to turn off the headers and see if we can find the row with the column names. In Basketball-References all the stats tables usually start with a column called Rk or rank.

It looks like the headers are in the first row [but in some cases its the second]. Let’s extract out that row and place it into a character vector. Next we use that vector to name our Data Frame [what r calls the table with all the data in it]. Since R is a case sensitive language it’s good practice to use lower case headers and we will do that here. Also R hates things likes spaces or characters like % or / in headers so let’s replace any of those with a period. Finally we want to skip the row with the headers and in order to prepare ourselves for the function we will soon write, we want to find the row the headers are in and then take the data from the row after that until the end.

'http://www.basketball-reference.com/leagues/NBA_2014.html' -> url
'#team' -> css_page
url %>>%
    html %>>%
    html_nodes(css_page) %>>%
    html_table(header = F) %>>%
    data.frame() %>>%
    tbl_df() -> total_table
total_table %>>%
    filter(X.1 == 'Rk') %>>% as.character -> names
'Rk' %>>% grep(x = total_table$X.1) -> row_of_header #find where rank is
names %>>% tolower -> names(total_table)
names(total_table) %>>% (gsub('\\%|/','\\.',.)) -> names(total_table)
(row_of_header + 1) %>>% (total_table[.:nrow(total_table),]) -> total_table #skip that row and go to the end row and go to the end
total_table %>>% head
## Source: local data frame [6 x 26]
##
##   rk                    team  g    mp   fg  fga  fg.  3p  3pa  3p.   2p
## 1  1   Los Angeles Clippers* 82 19755 3208 6761 .474 693 1966 .352 2515
## 2  2        Houston Rockets* 82 19830 3118 6603 .472 779 2179 .358 2339
## 3  3  Minnesota Timberwolves 82 19855 3189 7175 .444 600 1757 .341 2589
## 4  4 Portland Trail Blazers* 82 19855 3207 7134 .450 770 2071 .372 2437
## 5  5  Oklahoma City Thunder* 82 19805 3194 6782 .471 664 1839 .361 2530
## 6  6      San Antonio Spurs* 82 19755 3326 6844 .486 698 1757 .397 2628
## Variables not shown: 2pa (chr), 2p. (chr), ft (chr), fta (chr), ft. (chr),
##   orb (chr), drb (chr), trb (chr), ast (chr), stl (chr), blk (chr), tov
##   (chr), pf (chr), pts (chr), pts.g (chr)

Booya It’s In R

Now you will see a Data Frame in your workspace called total_table, we already found and replaced the headers but there are still a few more quick data cleaning chores remaining before you have a clean dataset. It looks like there are asterisks at the end of some of the team’s that indicate whether the team made the playoffs, sorry Knicks fans. Let’s add a logical column for teams that made the playoffs and then remove the astericks. Next let’s remove the rank column and get rid of the row with the league averages since that is not an actual team and we can easily recalculate summary statistics when we need them.

We Did It, Look at That Clean Data Frame

team g mp fg fga fg. 3p 3pa 3p. 2p 2pa 2p. ft fta ft. orb drb trb ast stl blk tov pf pts pts.g playoff_team
Los Angeles Clippers 82 19755 3208 6761 .474 693 1966 .352 2515 4795 .525 1741 2384 .730 858 2668 3526 2016 703 397 1136 1767 8850 107.9 TRUE
Houston Rockets 82 19830 3118 6603 .472 779 2179 .358 2339 4424 .529 1814 2549 .712 920 2797 3717 1755 621 461 1323 1676 8829 107.7 TRUE
Minnesota Timberwolves 82 19855 3189 7175 .444 600 1757 .341 2589 5418 .478 1790 2301 .778 1024 2644 3668 1963 718 297 1142 1504 8768 106.9 FALSE
Portland Trail Blazers 82 19855 3207 7134 .450 770 2071 .372 2437 5063 .481 1569 1926 .815 1022 2786 3808 1904 454 387 1125 1576 8753 106.7 TRUE
Oklahoma City Thunder 82 19805 3194 6782 .471 664 1839 .361 2530 4943 .512 1653 2052 .806 887 2781 3668 1794 678 501 1256 1858 8705 106.2 TRUE
San Antonio Spurs 82 19755 3326 6844 .486 698 1757 .397 2628 5087 .517 1289 1642 .785 762 2786 3548 2064 604 420 1180 1495 8639 105.4 TRUE
Phoenix Suns 82 19755 3172 6845 .463 765 2055 .372 2407 4790 .503 1520 2004 .758 928 2601 3529 1563 688 374 1258 1798 8629 105.2 FALSE
Dallas Mavericks 82 19830 3249 6858 .474 721 1877 .384 2528 4981 .508 1378 1733 .795 840 2514 3354 1935 704 356 1110 1636 8597 104.8 TRUE
Denver Nuggets 82 19755 3147 7042 .447 702 1959 .358 2445 5083 .481 1563 2154 .726 1009 2717 3726 1838 615 459 1305 1890 8559 104.4 FALSE
Golden State Warriors 82 19830 3236 7005 .462 774 2037 .380 2462 4968 .496 1303 1731 .753 896 2819 3715 1912 642 407 1247 1784 8549 104.3 TRUE
Los Angeles Lakers 82 19705 3139 6980 .450 774 2032 .381 2365 4948 .478 1390 1835 .757 745 2620 3365 2006 611 446 1239 1627 8442 103.0 FALSE
Miami Heat 82 19880 3142 6272 .501 665 1829 .364 2477 4443 .558 1431 1884 .760 627 2397 3024 1847 732 367 1212 1596 8380 102.2 TRUE
Toronto Raptors 82 19955 2992 6718 .445 713 1917 .372 2279 4801 .475 1608 2055 .782 935 2552 3487 1737 577 343 1159 1882 8305 101.3 TRUE
Atlanta Hawks 82 19830 3061 6688 .458 768 2116 .363 2293 4572 .502 1392 1782 .781 713 2565 3278 2041 680 326 1251 1577 8282 101.0 TRUE
Detroit Pistons 82 19780 3182 7124 .447 507 1580 .321 2675 5544 .483 1415 2111 .670 1196 2525 3721 1714 687 395 1193 1666 8286 101.0 FALSE
Washington Wizards 82 20055 3177 6920 .459 647 1704 .380 2530 5216 .485 1253 1715 .731 886 2573 3459 1909 668 377 1204 1675 8254 100.7 TRUE
Sacramento Kings 82 19830 3026 6766 .447 491 1475 .333 2535 5291 .479 1698 2237 .759 990 2656 3646 1547 587 318 1249 1849 8241 100.5 FALSE
New Orleans Pelicans 82 19855 3101 6761 .459 486 1303 .373 2615 5458 .479 1489 1936 .769 933 2486 3419 1745 647 523 1129 1857 8177 99.7 FALSE
Philadelphia 76ers 82 19855 3108 7150 .435 577 1847 .312 2531 5303 .477 1362 1918 .710 949 2556 3505 1791 765 330 1384 1844 8155 99.5 FALSE
New York Knicks 82 19855 3027 6739 .449 759 2038 .372 2268 4701 .482 1271 1670 .761 870 2437 3307 1641 631 367 1063 1815 8084 98.6 FALSE
Brooklyn Nets 82 19880 2931 6391 .459 709 1922 .369 2222 4469 .497 1508 2002 .753 721 2407 3128 1714 705 311 1191 1777 8079 98.5 TRUE
Cleveland Cavaliers 82 19930 3036 6955 .437 584 1640 .356 2452 5315 .461 1398 1861 .751 989 2629 3618 1738 579 304 1163 1640 8054 98.2 FALSE
Charlotte Bobcats 82 19905 2976 6730 .442 516 1471 .351 2460 5259 .468 1474 2000 .737 776 2724 3500 1778 499 421 1010 1493 7942 96.9 TRUE
Indiana Pacers 82 19780 2949 6573 .449 550 1542 .357 2399 5031 .477 1485 1907 .779 834 2831 3665 1651 550 446 1237 1675 7933 96.7 TRUE
Orlando Magic 82 19955 3022 6784 .445 563 1596 .353 2459 5188 .474 1307 1714 .763 794 2654 3448 1726 630 350 1222 1678 7914 96.5 FALSE
Boston Celtics 82 19730 2996 6883 .435 575 1729 .333 2421 5154 .470 1325 1706 .777 980 2505 3485 1726 584 343 1261 1743 7892 96.2 FALSE
Memphis Grizzlies 82 19805 3122 6723 .464 405 1147 .353 2717 5576 .487 1235 1666 .741 950 2526 3476 1792 631 375 1124 1568 7884 96.1 TRUE
Milwaukee Bucks 82 19880 2952 6737 .438 548 1553 .353 2404 5184 .464 1377 1843 .747 971 2399 3370 1760 541 403 1238 1713 7829 95.5 FALSE
Utah Jazz 82 19780 2951 6652 .444 543 1577 .344 2408 5075 .474 1346 1803 .747 904 2477 3381 1664 570 366 1200 1699 7791 95.0 FALSE
Chicago Bulls 82 19930 2843 6577 .432 508 1459 .348 2335 5118 .456 1486 1908 .779 937 2683 3620 1860 594 424 1223 1565 7680 93.7 TRUE

Save the File to a CSV {If You So Choose}

total_table %>>% write.csv('Desktop/2014_team_data.csv', row.names = F) 

That Was Easy, Let’s Kick This Up and Turn Our Code Into a Function

Now that we know how to pull the table from page let’s teach R how to do this as a function with inputs. If we accomplish this we can do cool things like loop through all the seasons since 1951 or pull in different tables from the page. We will become masters of all NBA team data with just a few inputs, an internet connection and R.

The key to making this work is understanding the structure of the URL. Thankfully our friends at Sports Reference make it easy for us. The URL structure consist of 3 things, the base, the league and the year end of the season. We can easily teach our function to create this by pasting the 3 inputs together to form a URL.

We know how to scrape the table once we have a URL from earlier. We also want to extract out the team ID’s which we can get from the URL’s on the page mirroring how we scrape a table except looking for XML tags with //a and a hyperlink. This process requires some data cleaning as well to extract out just the team id.

Since we can scale this function we want add to the Data Frame the season and table name from the page [team, opponent, and misc are in every season but there are more you can use if you look through various years] as this function will find whichever table you ask it to. Finally we want to give ourself the ability to timestamp if we want to keep track of changes in the data for the current season or run a Sports Reference to automatically scrape the data each day of the season.

So here’s our beloved getBREFTeamStatTable function.

getBREFTeamStatTable <- function(season_end = 2015, table_name = 'team', date = T){
    c('rvest','dplyr','pipeR') -> packages
    lapply(packages, library, character.only = T)
    'http://www.basketball-reference.com/leagues/' -> base
    'NBA' -> league
    '#' %>>% paste0(table_name) -> css_page
    css_page %>>% paste0(" , ", css_page,' a') -> css_id
    table_name %>>% tolower -> table_name
    table_name %>>% paste('stats', sep = "_") -> table
    base %>>% paste0(league,'_',season_end,".html") -> url
    url %>>% ## get table
        html %>>%
        html_nodes(css_page) %>>%
        html_table(header = F) %>>% data.frame() %>>% tbl_df() -> df

    if(df$X.1[1] == 'Rk'){
        df %>>%
            filter(X.1 == "Rk") %>>% as.character -> names
        'Rk' %>>% grep(x = df$X.1) -> row_of_header #find where rank is
        (row_of_header + 1) %>>% (df[.:nrow(df),]) -> df #skip that row and go to the end
        names %>>% tolower-> names(df)} else{
            df %>>%
                filter(X.1 == "Rk") %>>% as.character -> names
            'Rk' %>>% grep(x = df$X.1) -> row_of_header #find where rank is
            (row_of_header + 1) %>>% (df[.:nrow(df),]) -> df #skip that row and go to the end
            names %>>% tolower-> names(df)
        }
    names(df) %>>% (gsub('\\%|/','\\.',.)) -> names(df)
    NULL -> df$rk
    c('team','arena') -> table_name_character
    df[,!(df %>>% names) %in% table_name_character] %>>%
        apply(2, function(x) gsub('\\,','',x) %>>% as.numeric(x))  ->
        df[,!(df %>>% names) %in% table_name_character] #get rid of commas and make numeric
    df$team %>>% grepl(pattern = '\\*') -> df$playoff_team
    df$team %>>% (gsub('\\*','',.)) -> df$team
    df %>>% nrow() -1  -> rows
    df[1:rows,] -> df
    (season_end-1) %>>% paste0("-",season_end) -> season
    ##Grab Team Ids
    url %>>% ## get table
        html %>>%
        html_nodes(css_id) %>>%
        html_attrs() %>>% unlist %>>% as.character -> stems
    stems[3:length(stems)] -> stems #skip the 1st 2 rows since they are labels
    stems %>>% (gsub('\\/|.html|teams','',.)) %>>% #strip out the text
        (gsub(season_end,'',.)) -> bref_team_id #strip out the year to get the team id
    data.frame(season,table_name = table, bref_team_id, df) -> df  #combine into 1 df
    if(date == T){
        Sys.time() -> df$scrape_time #add scrape time if you want it
    }
    return(df)
}

Let’s test it out. Note I auto-populated the function so you can have this year’s team data ready to go without entering parameters. Let’s try it, fingers crossed!

getBREFTeamStatTable() -> team2015
team2015 %>>% kable('html', table.attr='id="team2015"')
season table_name bref_team_id team g mp fg fga fg. X3p X3pa X3p. X2p X2pa X2p. ft fta ft. orb drb trb ast stl blk tov pf pts pts.g playoff_team scrape_time
2014-2015 team_stats GSW Golden State Warriors 5 1200 197 400 0.493 47 122 0.385 150 278 0.540 95 118 0.805 45 176 221 124 51 30 108 115 536 107.2 FALSE 2014-11-09 13:57:02
2014-2015 team_stats DAL Dallas Mavericks 6 1440 250 512 0.488 50 139 0.360 200 373 0.536 89 111 0.802 61 172 233 128 44 30 65 116 639 106.5 FALSE 2014-11-09 13:57:02
2014-2015 team_stats BOS Boston Celtics 6 1440 252 532 0.474 46 150 0.307 206 382 0.539 88 111 0.793 78 188 266 150 47 28 93 133 638 106.3 FALSE 2014-11-09 13:57:02
2014-2015 team_stats SAC Sacramento Kings 6 1490 207 475 0.436 23 83 0.277 184 392 0.469 196 239 0.820 74 221 295 106 43 28 104 157 633 105.5 FALSE 2014-11-09 13:57:02
2014-2015 team_stats TOR Toronto Raptors 6 1440 216 500 0.432 40 132 0.303 176 368 0.478 160 207 0.773 68 170 238 110 47 28 59 140 632 105.3 FALSE 2014-11-09 13:57:02
2014-2015 team_stats BRK Brooklyn Nets 5 1200 199 407 0.489 47 113 0.416 152 294 0.517 79 101 0.782 44 169 213 98 32 24 72 111 524 104.8 FALSE 2014-11-09 13:57:02
2014-2015 team_stats CHI Chicago Bulls 7 1705 261 548 0.476 61 151 0.404 200 397 0.504 147 193 0.762 66 225 291 159 39 46 105 143 730 104.3 FALSE 2014-11-09 13:57:02
2014-2015 team_stats ATL Atlanta Hawks 5 1250 189 415 0.455 50 122 0.410 139 293 0.474 90 123 0.732 42 153 195 124 48 27 74 119 518 103.6 FALSE 2014-11-09 13:57:02
2014-2015 team_stats LAC Los Angeles Clippers 6 1440 220 490 0.449 50 148 0.338 170 342 0.497 130 164 0.793 49 157 206 136 52 24 70 128 620 103.3 FALSE 2014-11-09 13:57:02
2014-2015 team_stats PHO Phoenix Suns 6 1490 231 514 0.449 49 151 0.325 182 363 0.501 108 136 0.794 54 199 253 115 39 33 80 150 619 103.2 FALSE 2014-11-09 13:57:02
2014-2015 team_stats LAL Los Angeles Lakers 5 1200 183 423 0.433 24 73 0.329 159 350 0.454 120 158 0.759 67 133 200 97 37 19 76 137 510 102.0 FALSE 2014-11-09 13:57:02
2014-2015 team_stats HOU Houston Rockets 7 1680 234 527 0.444 91 230 0.396 143 297 0.481 154 215 0.716 78 228 306 140 64 34 135 160 713 101.9 FALSE 2014-11-09 13:57:02
2014-2015 team_stats MIA Miami Heat 6 1440 217 463 0.469 51 139 0.367 166 324 0.512 125 165 0.758 57 175 232 136 52 17 88 131 610 101.7 FALSE 2014-11-09 13:57:02
2014-2015 team_stats POR Portland Trail Blazers 6 1440 223 494 0.451 57 162 0.352 166 332 0.500 98 124 0.790 70 210 280 134 30 32 86 131 601 100.2 FALSE 2014-11-09 13:57:02
2014-2015 team_stats UTA Utah Jazz 6 1440 220 474 0.464 50 155 0.323 170 319 0.533