Thursday, 2 August 2018

Testing an R Package That Uses Web Scraping

I'm building an R package that scrapes data from various sports sites using rvest, and I'm wondering what the best way is to test a web scraping function. An example is posted below.

library(rvest)
library(stringr)
library(dplyr)

access_page <- function(url, search){
  # Search for the player name on Basketball Reference
  s <- html_session(url)
  f <- html_form(s)[[1]] %>%
    set_values(search = search)
  s <- submit_form(s, f)$url %>%
    html_session()

  # Check whether the search went directly to a player page or to a search page:
  # a search page URL ends in '=', a player page URL ends in 'html'
  if(str_sub(s$url, -1) == "="){
    # follow_link() throws an error when no link matches, so wrap it in try()
    result <- try(follow_link(s, search), silent = TRUE)
    if(inherits(result, "try-error")){
      stop(paste0("No ", search, " in database."))
    }
    s <- result
    # warning("Multiple players returned. Selecting first on search page.")
  }
  s
}

nba_player <- function(player, advanced = FALSE){
  url <- "https://www.basketball-reference.com/"
  s <- access_page(url, player)

  # Data cleaning
  if(!advanced){
    df <- s %>%
      read_html() %>%
      html_table(fill = TRUE) %>%
      as.data.frame() %>%
      as_tibble() %>%
      dplyr::filter(Season != "Career")
  } else {
    # Collect the text of all HTML comment nodes and parse it as HTML
    df <- s %>% read_html() %>%
      html_nodes(xpath = "//comment()") %>%
      html_text() %>%
      paste(collapse = "") %>%
      read_html() %>%
      html_table(fill = TRUE) %>%
      .[[4]] %>%
      # names(.) rather than names(df): df does not exist yet inside the pipe
      .[, grep("[a-zA-Z]", names(.))]
  }
  #Checks if player has team stats too (only if he's been traded)
  if(length(grep("season", df$Season))> 0){
    df <- df %>%
      .[-grep("season", .$Season), ]
  }
  names(df)[11:21] <- c("FG%", "3PM", "3PA", "3P%", "2PM", "2PA", "2P%", "eFG%", "FT", "FTA", "FT%")
  df$Names <- rep(player, nrow(df))
  df
}

I am using testthat to test the package, but I am not sure what the best practice is for testing these functions. Scraping a website more than 10 times every time I run the tests feels excessive. What is the most appropriate way to test web scraping functions in general? Should I just save an offline HTML file to test against?
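
To make the offline idea concrete, here is a minimal sketch of what such a test might look like. It assumes the parsing step is split out from the network call into a helper; parse_player_table() is a hypothetical name, not part of the package above, and the small inline table is made-up stand-in data rather than a real saved page from the site.

```r
library(rvest)
library(testthat)

# Hypothetical helper: the parsing half of nba_player(), separated from
# the network call so it can run against any parsed HTML document.
parse_player_table <- function(page) {
  tables <- html_table(page)
  df <- as.data.frame(tables[[1]])
  # Drop the career summary row, as nba_player() does
  df[df$Season != "Career", , drop = FALSE]
}

# Stand-in for a saved page (made-up values). In a package this would be
# an HTML file kept alongside the tests; that location is an assumption.
fixture <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body><table>",
  "<tr><th>Season</th><th>PTS</th></tr>",
  "<tr><td>2016-17</td><td>25.3</td></tr>",
  "<tr><td>2017-18</td><td>26.4</td></tr>",
  "<tr><td>Career</td><td>25.9</td></tr>",
  "</table></body></html>"
), fixture)

test_that("career row is dropped from the parsed table", {
  df <- parse_player_table(read_html(fixture))
  expect_equal(nrow(df), 2)
  expect_false("Career" %in% df$Season)
})
```

With this split, only access_page() would still need the live site, so the bulk of the tests could run without any network requests.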
