If you are planning to use the following procedure to scrape data from a Sports Reference page, please read the Disclaimer at the end of this article first.

Pre-Match Conference:

What did we normally do?

The traditional way of scraping data with rvest is as follows:
  1. Grab the page URL and table-id
    You can get table ids by Inspecting Element or by using SelectorGadget
    By Inspecting Element:
    1. Right click on a table and select Inspect Element on the ensuing context menu
    2. The data on which you made the right click gets highlighted in the Developer tab
    3. Scroll upwards until you come across the <table> tag with either a class name (say .wikitable) or an id (say #stats_shooting) associated with it
    4. The corresponding table id here is "#stats_passing_squads"
    By using SelectorGadget:
    Drag this SelectorGadget JS link to your bookmarks and you are good to continue.
    1. Click on SelectorGadget from your bookmarks.
    2. Hover over the table's boundary until you find the "div table" label appearing on the orange marker.
    3. At that instant, if you perform a left click, the marker turns red with the id of the table appearing on the SelectorGadget toolbar on your screen.
    4. The corresponding table id here is "#stats_keeper_squads"
  2. And then (har)rvest it
    require("dplyr")

    stats_passing_squads = xml2::read_html("https://fbref.com/en/comps/22/passing/Major-League-Soccer-Stats") %>% rvest::html_nodes("#stats_passing_squads") %>% rvest::html_table()

    stats_passing_squads = stats_passing_squads[[1]]

What's special about FBref tables, and why won't rvest alone help?

  • The site is structured in such a way that all tables except the first (even the 1st sometimes), are rendered dynamically. I managed to capture the way the tables load.
  • Notice how the 1st table loads quickly and how the remainder take their time to load.
  • I have annotated the table ids on the tables for the sake of clarity.

So when we try to scrape such tables, this happens:
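The original output is not reproduced here. As a rough sketch of the likely failure, the snippet below uses a made-up inline page (the HTML, the team name and the "#stats_standard_squads" id are assumptions for illustration, not FBref's real markup). One table is present in the raw HTML and one is only an empty placeholder, which is roughly what rvest sees when a table is filled in later by the browser.

```r
# Hypothetical page: the first table is in the raw HTML, the second is only
# an empty placeholder that the browser would fill in later.
raw_html <- '
<html><body>
  <table id="stats_standard_squads">
    <tr><th>Squad</th><th>Goals</th></tr>
    <tr><td>Team A</td><td>10</td></tr>
  </table>
  <div id="div_stats_passing_squads"></div>
</body></html>'

page <- xml2::read_html(raw_html)

# The static table is found and parsed as usual
static_nodes  <- rvest::html_nodes(page, "#stats_standard_squads")
static_tables <- rvest::html_table(static_nodes)
print(static_tables[[1]])

# The dynamic table is missing from the raw HTML, so the selector matches
# nothing and html_table() returns an empty list
dynamic_nodes  <- rvest::html_nodes(page, "#stats_passing_squads")
dynamic_tables <- rvest::html_table(dynamic_nodes)
print(length(dynamic_tables))

# Taking the first element of that empty list is the step that fails
result <- tryCatch(dynamic_tables[[1]], error = function(e) conditionMessage(e))
print(result)
```

In other words, read_html() and html_nodes() do not complain at all; the error only shows up at the [[1]] step, which can make the cause hard to spot.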


What Next? :(

The issue is that we can see the tables and their ids with the naked eye but can't use them in RStudio. So we are going to use RSelenium, which helps us in this matter.

How does RSelenium help us? Let's get started

But before getting started, there is a catch.

We can't just enter the URL in a browser and ask RSelenium to do its job. We need to install Selenium on our computer so that RSelenium can communicate with the real Selenium. For that we will use containers (Docker) rather than executables, which are tiresome and follow a different procedure on each OS.
Hold your breath. These are all steep learning curves, but they are one-time investments.

Getting Started:

Now with all dependencies installed, let's get into the action
  1. Start Docker for Desktop on your OS. (It really takes time to load. Watch Haaland's goals meanwhile.)
  2. Run the following command in RStudio's Terminal (not Console): docker run -d -p 4445:4444 selenium/standalone-chrome
    (-p 4445:4444 maps port 4445 on your machine to Selenium's port 4444 inside the container, which is why we connect to port 4445 in the next step.)
    To confirm that it worked, run docker ps; the output should list the Selenium container along with its container id.
    If Terminal is not visible in your panes, then use ALT + SHIFT + R to open it.
  3. Remember the anonymous browser that I mentioned earlier? We are about to invoke it now:
    remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",port = 4445L,browserName = "chrome")
    remDr$open()

    remDr$navigate("https://fbref.com/en/comps/22/passing/Major-League-Soccer-Stats")
    remDr$screenshot(display = TRUE)

    Magnifique !

  4. The rest of the code involves rvest where we will be dumping the html source code into read_html()
    require("dplyr")
    stats_passing_squads = xml2::read_html(remDr$getPageSource()[[1]]) %>% rvest::html_nodes("#stats_passing_squads") %>% rvest::html_table()
    stats_passing_squads = stats_passing_squads[[1]]

    Großartig !
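Step 2 above checks the container by eye with docker ps. The same check can be sketched in R by reading the container id out of that output. The helper name selenium_container_ids and the canned docker ps lines below are made up for illustration; only the column layout of docker ps and the image name come from the steps above.

```r
# Extract container ids from the lines printed by "docker ps".
# `ps_lines` is a character vector such as system("docker ps", intern = TRUE);
# `image` is the image name to look for (hypothetical helper, not part of RSelenium).
selenium_container_ids <- function(ps_lines, image = "selenium/standalone-chrome") {
  rows <- ps_lines[-1]                              # drop the header line
  rows <- rows[grepl(image, rows, fixed = TRUE)]    # keep only rows for our image
  # The container id is the first whitespace-separated field of each row
  vapply(strsplit(trimws(rows), "\\s+"), function(fields) fields[1], character(1))
}

# Canned output in the usual "docker ps" layout (values are made up)
sample_ps <- c(
  "CONTAINER ID   IMAGE                        COMMAND       CREATED         STATUS         PORTS                    NAMES",
  "f59930f56e38   selenium/standalone-chrome   \"/opt/bin\"    5 seconds ago   Up 4 seconds   0.0.0.0:4445->4444/tcp   some_name",
  "0a1b2c3d4e5f   postgres:15                  \"entrypoint\"  2 hours ago     Up 2 hours     5432/tcp                 db"
)

ids <- selenium_container_ids(sample_ps)
print(ids)

# With only the header line (no containers running) the result is empty
print(length(selenium_container_ids(sample_ps[1])))
```

With Docker running, the live equivalent would be selenium_container_ids(system("docker ps", intern = TRUE)). Filtering by image name rather than taking the first row avoids picking up an unrelated container.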


You - "So Pranav. You are asking me to follow the above steps every time?"
Me - "No, absolutely not. Here comes the most awaited part."

We can create a function to automate all these steps (including running Docker from Console). Here's how I do it.

FBref data at one click of a button

getFBrefStats = function(url, id) {
  require(RSelenium)
  require(dplyr)

  # Returns the id of the first container listed by "docker ps": the first
  # column of the second output line (the first line is the header).
  # Returns NA if no container is running.
  latestContainerId = function() {
    t = system("docker ps", intern = TRUE)
    strsplit(t[2], split = " ")[[1]][1]
  }

  # For some unspecified reason we start and stop the docker container once initially.
  # Similar to warming up the bike's engine before shifting the gears.
  system("docker run -d -p 4445:4444 selenium/standalone-chrome")
  container_id = latestContainerId()
  if (!is.na(container_id)) {
    system(paste("docker stop", container_id))
  }

  # Start the container from R to avoid starting docker in Terminal
  system("docker run -d -p 4445:4444 selenium/standalone-chrome")
  Sys.sleep(3)
  remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")

  # Page navigation might crash sometimes in RSelenium, in which case we have
  # to start the process again. So keep retrying until the page has loaded.
  loaded = FALSE
  while (!loaded) {
    loaded = tryCatch({
      # Entering our URL gets the browser to navigate to the page
      remDr$open()
      remDr$navigate(as.character(url))
      # remDr$screenshot(display = TRUE) # This will take a screenshot and display it in the RStudio viewer
      TRUE
    }, error = function(e) {
      # Closing can fail too if the session never opened, so ignore that error
      try(remDr$close(), silent = TRUE)
      Sys.sleep(2)
      print("slept 2 seconds")
      FALSE
    })
  }

  # Scraping required stats
  data <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
    rvest::html_nodes(id) %>%
    rvest::html_table()
  data = data[[1]]

  remDr$close()
  remove(remDr)

  # Automating the following steps:
  # 1. run "docker ps" in Terminal and get the container ID from the output
  # 2. now run "docker stop container_id" e.g. docker stop f59930f56e38
  system(paste("docker stop", latestContainerId()))

  return(data)
}
Test Drive:
All you have to do is start "Docker for Desktop" and call our function in RStudio.
bundesliga_players_fbref_shooting = getFBrefStats("https://fbref.com/en/comps/20/shooting/Bundesliga-Stats","#stats_shooting")
head(bundesliga_players_fbref_shooting)
Full Time!
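The retry loop inside getFBrefStats is meant to keep trying until navigation succeeds, which can hang if Docker is not running at all. A variation that gives up after a fixed number of attempts could look like the sketch below. The names retry and flaky_navigate are hypothetical, and flaky_navigate is only a stand-in for remDr$open() followed by remDr$navigate(), so the snippet runs without Selenium.

```r
# Call `action()` until it succeeds or `max_attempts` is reached.
# Returns the value of `action()`; stops with the last error message otherwise.
retry <- function(action, max_attempts = 5, wait_seconds = 2) {
  last_error <- NULL
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = action()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    last_error <- result$error
    message("Attempt ", attempt, " failed: ", conditionMessage(last_error))
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  stop("Giving up after ", max_attempts, " attempts: ", conditionMessage(last_error))
}

# Stand-in for the navigation step: fails twice, then succeeds
calls <- 0
flaky_navigate <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("navigation crashed")
  "page loaded"
}

outcome <- retry(flaky_navigate, max_attempts = 5, wait_seconds = 0)
print(outcome)  # "page loaded"
print(calls)    # 3
```

In the real function, the action would wrap the remDr calls, and a small wait between attempts also keeps the request rate closer to what the Disclaimer asks for.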

Post-Match Conference:

Where can you make full use of such automation?

If you wish to see me provide such an example for our community, just drop a message at @npranav10

Also a shout-out to Eliot McKinley for encouraging me to write this article.

Reference:

Callum Taylor : Using RSelenium and Docker To Webscrape In R - Using The WHO Snake Database
Apart from borrowing a few snippets from the above piece, I did manage to bring some automation into the procedure for gathering data.
If you feel there is another way to achieve this, don't hesitate to contact me.

Disclaimer:

Sports Reference LLC says : "Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials."