library("knitr")
On Monday we experimented with several packages that “wrapped” APIs. That is, they handled the creation of the request and the formatting of the output. Today we’re going to look at (part of) what these functions were doing.
First we’re going to examine the structure of API requests via the Open Movie Database (OMDb). OMDb is very similar to IMDb, except it has a nice, simple API. We can go to the website, input some search parameters, and obtain both the request URL and the response to it.
EXERCISE determine the shape of an API request:
Let’s experiment with different values of the title and year fields. Notice the pattern in the request. For example, for Title = Interstellar and Year = 2014, we get:
http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml
Try pasting this link into the browser. Also experiment with json and xml as values of the r parameter.
How can we create this request in R?
request <- paste0("http://www.omdbapi.com/?t=", "Interstellar", "&", "y=", "2014", "&", "plot=", "short", "&", "r=", "xml")
request
## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"
It works, but it’s a bit ungainly. Let’s try to abstract that into a function:
omdb <- function(Title, Year, Plot, Format){
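  baseurl <- "http://www.omdbapi.com/?"
  params <- c("t=", "y=", "plot=", "r=")
  values <- c(Title, Year, Plot, Format)
  # pair each parameter name with its value, e.g. "t=Interstellar"
  param_values <- Map(paste0, params, values)
  # join the pairs with "&" and append them to the base URL
  args <- paste0(param_values, collapse = "&")
  paste0(baseurl, args)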
}
omdb("Interstellar", "2014", "short", "xml")
## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"
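One caveat with pasting values straight into a URL: characters such as spaces are not valid in a query string and need to be percent-encoded. The following is a minimal sketch of one way to handle that, using base R's utils::URLencode(). The function name omdb_encoded and its default arguments are hypothetical, chosen only for illustration.

```r
# Hypothetical variant of omdb() that percent-encodes each value.
# The name `omdb_encoded` and the defaults are assumptions for this sketch.
omdb_encoded <- function(title, year, plot = "short", format = "json") {
  baseurl <- "http://www.omdbapi.com/?"
  # named character vector: names are the query parameters
  values <- c(t = title, y = as.character(year), plot = plot, r = format)
  # reserved = TRUE also encodes characters like "&" and "=" inside a value
  encoded <- vapply(values, utils::URLencode, character(1), reserved = TRUE)
  paste0(baseurl, paste0(names(encoded), "=", encoded, collapse = "&"))
}

omdb_encoded("The Matrix", 1999)
```

The space in the title becomes %20, so the request stays a valid URL.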
Now we have a handy function that returns the API query. We can paste in the link, but we can also obtain data from within R:
request_interstellar <- omdb("Interstellar", "2014", "short", "xml")
answer_xml <- RCurl::getURL(request_interstellar)
answer_xml
## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root response=\"True\"><movie title=\"Interstellar\" year=\"2014\" rated=\"PG-13\" released=\"07 Nov 2014\" runtime=\"169 min\" genre=\"Adventure, Sci-Fi\" director=\"Christopher Nolan\" writer=\"Jonathan Nolan, Christopher Nolan\" actors=\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\" plot=\"A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor.\" language=\"English\" country=\"USA, UK\" awards=\"1 nomination.\" poster=\"http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg\" metascore=\"73\" imdbRating=\"9.1\" imdbVotes=\"114,391\" imdbID=\"tt0816692\" type=\"movie\"/></root>"
request_interstellar <- omdb("Interstellar", "2014", "short", "json")
answer_json <- RCurl::getURL(request_interstellar)
answer_json
## [1] "{\"Title\":\"Interstellar\",\"Year\":\"2014\",\"Rated\":\"PG-13\",\"Released\":\"07 Nov 2014\",\"Runtime\":\"169 min\",\"Genre\":\"Adventure, Sci-Fi\",\"Director\":\"Christopher Nolan\",\"Writer\":\"Jonathan Nolan, Christopher Nolan\",\"Actors\":\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\",\"Plot\":\"A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor.\",\"Language\":\"English\",\"Country\":\"USA, UK\",\"Awards\":\"1 nomination.\",\"Poster\":\"http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg\",\"Metascore\":\"73\",\"imdbRating\":\"9.1\",\"imdbVotes\":\"114,391\",\"imdbID\":\"tt0816692\",\"Type\":\"movie\",\"Response\":\"True\"}"
We get a form of data that is obviously structured. What is it?
These are the two common languages of web services: JavaScript Object Notation and eXtensible Markup Language.
Here’s an example of JSON, from this wonderful site:
{
"crust": "original",
"toppings": ["cheese", "pepperoni", "garlic"],
"status": "cooking",
"customer": {
"name": "Brian",
"phone": "573-111-1111"
}
}
And here is XML:
<order>
<crust>original</crust>
<toppings>
<topping>cheese</topping>
<topping>pepperoni</topping>
<topping>garlic</topping>
</toppings>
<status>cooking</status>
</order>
You can see that both of these data structures are quite easy to read. They are “self-describing”. In other words, they tell you how they are meant to be read.
There are easy means of taking these data types and creating R objects. You’ve already met the function fromJSON in the jsonlite package, thanks to Bernard:
library("jsonlite")
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
fromJSON(answer_json)
## $Title
## [1] "Interstellar"
##
## $Year
## [1] "2014"
##
## $Rated
## [1] "PG-13"
##
## $Released
## [1] "07 Nov 2014"
##
## $Runtime
## [1] "169 min"
##
## $Genre
## [1] "Adventure, Sci-Fi"
##
## $Director
## [1] "Christopher Nolan"
##
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
##
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
##
## $Plot
## [1] "A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor."
##
## $Language
## [1] "English"
##
## $Country
## [1] "USA, UK"
##
## $Awards
## [1] "1 nomination."
##
## $Poster
## [1] "http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
##
## $Metascore
## [1] "73"
##
## $imdbRating
## [1] "9.1"
##
## $imdbVotes
## [1] "114,391"
##
## $imdbID
## [1] "tt0816692"
##
## $Type
## [1] "movie"
##
## $Response
## [1] "True"
The output is a named list! A familiar and friendly R structure. Because data frames are lists, and because this list has no nested lists-within-lists, we can coerce it very simply:
answer_list <- fromJSON(answer_json)
kable(data.frame(answer_list))
Title | Year | Rated | Released | Runtime | Genre | Director | Writer | Actors | Plot | Language | Country | Awards | Poster | Metascore | imdbRating | imdbVotes | imdbID | Type | Response |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Interstellar | 2014 | PG-13 | 07 Nov 2014 | 169 min | Adventure, Sci-Fi | Christopher Nolan | Jonathan Nolan, Christopher Nolan | Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow | A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor. | English | USA, UK | 1 nomination. | http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg | 73 | 9.1 | 114,391 | tt0816692 | movie | True |
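By contrast, the pizza order shown earlier does contain nesting: an array of toppings and a customer object. As a small sketch of how jsonlite represents that, the example below parses the same JSON from a string. The variable names pizza_json and pizza are only illustrative.

```r
library(jsonlite)

# The pizza order from earlier, as a single JSON string
pizza_json <- '{
  "crust": "original",
  "toppings": ["cheese", "pepperoni", "garlic"],
  "status": "cooking",
  "customer": {"name": "Brian", "phone": "573-111-1111"}
}'

pizza <- fromJSON(pizza_json)

pizza$toppings        # the JSON array becomes a character vector
pizza$customer$name   # the nested object becomes a nested named list
str(pizza)            # overview of the whole structure
```

Nested elements are reached with repeated $ (or [[ ]]), just like any other R list.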
A similar process exists for XML formats:
library(XML)
ans_xml_parsed <- xmlParse(answer_xml)
ans_xml_parsed
## <?xml version="1.0" encoding="UTF-8"?>
## <root response="True">
## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor." language="English" country="USA, UK" awards="1 nomination." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="73" imdbRating="9.1" imdbVotes="114,391" imdbID="tt0816692" type="movie"/>
## </root>
##
Not exactly the response we were hoping for! This shows us some of the XML document’s structure: a <root> node with a single child, <movie>.
From Nolan and Lang 2014:
The xmlRoot() function returns an object of class XMLInternalElementNode. This is a regular XML node and not specific to the root node, i.e., all XML nodes will appear in R with this class or a more specific class. An object of class XMLInternalElementNode has four fields: name, attributes, children and value, which we access with the methods xmlName(), xmlAttrs(), xmlChildren(), and xmlValue().
field | method |
---|---|
name | xmlName() |
attributes | xmlAttrs() |
children | xmlChildren() |
value | xmlValue() |
ans_xml_parsed_root <- xmlRoot(ans_xml_parsed)[["movie"]] # could also use [[1]]
ans_xml_parsed_root
## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor." language="English" country="USA, UK" awards="1 nomination." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="73" imdbRating="9.1" imdbVotes="114,391" imdbID="tt0816692" type="movie"/>
ans_xml_attrs <- xmlAttrs(ans_xml_parsed_root)
ans_xml_attrs
## title
## "Interstellar"
## year
## "2014"
## rated
## "PG-13"
## released
## "07 Nov 2014"
## runtime
## "169 min"
## genre
## "Adventure, Sci-Fi"
## director
## "Christopher Nolan"
## writer
## "Jonathan Nolan, Christopher Nolan"
## actors
## "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
## plot
## "A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor."
## language
## "English"
## country
## "USA, UK"
## awards
## "1 nomination."
## poster
## "http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
## metascore
## "73"
## imdbRating
## "9.1"
## imdbVotes
## "114,391"
## imdbID
## "tt0816692"
## type
## "movie"
kable(data.frame(t(ans_xml_attrs)))
title | year | rated | released | runtime | genre | director | writer | actors | plot | language | country | awards | poster | metascore | imdbRating | imdbVotes | imdbID | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Interstellar | 2014 | PG-13 | 07 Nov 2014 | 169 min | Adventure, Sci-Fi | Christopher Nolan | Jonathan Nolan, Christopher Nolan | Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow | A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor. | English | USA, UK | 1 nomination. | http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg | 73 | 9.1 | 114,391 | tt0816692 | movie |
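The OMDb response stores everything in attributes, so xmlAttrs() did all the work. When the data lives in child elements instead, as in the pizza order XML shown earlier, the other accessors come into play. Here is a minimal sketch, assuming the XML is supplied as a string; the object names are illustrative.

```r
library(XML)

# The pizza order from earlier, written on one line to avoid whitespace nodes
order_xml <- paste0(
  "<order><crust>original</crust>",
  "<toppings><topping>cheese</topping><topping>pepperoni</topping>",
  "<topping>garlic</topping></toppings>",
  "<status>cooking</status></order>"
)

order_doc  <- xmlParse(order_xml, asText = TRUE)
order_root <- xmlRoot(order_doc)

xmlName(order_root)                  # name of the root node
names(xmlChildren(order_root))       # names of its child nodes
xmlValue(order_root[["status"]])     # text content of one child

# apply xmlValue() to every <topping> child of <toppings>
toppings <- unname(xmlSApply(order_root[["toppings"]], xmlValue))
toppings
```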
XPATH, the scandalously shallow introduction:
There is a special syntax for querying the structure of XML documents, called XPATH, which is an essential skill if you are doing extensive work with XML documents.
movienode <- getNodeSet(ans_xml_parsed, "//movie")
movienode
## [[1]]
## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor." language="English" country="USA, UK" awards="1 nomination." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="73" imdbRating="9.1" imdbVotes="114,391" imdbID="tt0816692" type="movie"/>
##
## attr(,"class")
## [1] "XMLNodeSet"
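getNodeSet() returns the matching nodes themselves. Often we want a value from each match instead, and xpathSApply() from the same XML package applies a function to every node an XPATH expression finds. The sketch below uses a shortened, hand-typed copy of the OMDb response rather than a live request, so the document string is an assumption for illustration.

```r
library(XML)

# Shortened stand-in for the OMDb XML response (only two attributes kept)
mini_doc <- xmlParse(
  '<root response="True"><movie title="Interstellar" year="2014"/></root>',
  asText = TRUE
)

# "//movie" matches every <movie> node anywhere in the document;
# xmlGetAttr() then pulls one attribute out of each match
movie_title <- xpathSApply(mini_doc, "//movie", xmlGetAttr, "title")
movie_year  <- xpathSApply(mini_doc, "//movie", xmlGetAttr, "year")

movie_title
movie_year
```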
httr
httr is yet another star in the hadleyverse, this one designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:
* GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
* POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
* PUT: update an existing resource. The payload may contain the updated data for the resource.
* DELETE: delete an existing resource.
Source: HTTP made really easy
HTTP is the foundation for APIs; understanding how it works is the key to interacting with all the diverse APIs out there. An excellent beginning resource for APIs (including HTTP basics) is this simple guide.
httr also facilitates a variety of authentication protocols.
#devtools::install_github("hadley/httr", build_vignettes = TRUE, dependencies = TRUE)
install.packages("httr")
httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()). They have more informative outputs than simply using RCurl::getURL(), and come with some nice convenience functions for working with the output:
library(httr)
interstellar_json <- omdb("Interstellar", "2014", "short", "json")
response_json <- GET(interstellar_json)
content(response_json, as = "parsed", type = "application/json")
## $Title
## [1] "Interstellar"
##
## $Year
## [1] "2014"
##
## $Rated
## [1] "PG-13"
##
## $Released
## [1] "07 Nov 2014"
##
## $Runtime
## [1] "169 min"
##
## $Genre
## [1] "Adventure, Sci-Fi"
##
## $Director
## [1] "Christopher Nolan"
##
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
##
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
##
## $Plot
## [1] "A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor."
##
## $Language
## [1] "English"
##
## $Country
## [1] "USA, UK"
##
## $Awards
## [1] "1 nomination."
##
## $Poster
## [1] "http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
##
## $Metascore
## [1] "73"
##
## $imdbRating
## [1] "9.1"
##
## $imdbVotes
## [1] "114,391"
##
## $imdbID
## [1] "tt0816692"
##
## $Type
## [1] "movie"
##
## $Response
## [1] "True"
interstellar_xml <- omdb("Interstellar", "2014", "short", "xml")
response_xml <- GET(interstellar_xml)
content(response_xml, as = "parsed")
## <?xml version="1.0" encoding="UTF-8"?>
## <root response="True">
## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A group of explorers use a newly discovered wormhole to surpass the limitations on human space travel and conquer an interstellar endeavor." language="English" country="USA, UK" awards="1 nomination." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="73" imdbRating="9.1" imdbVotes="114,391" imdbID="tt0816692" type="movie"/>
## </root>
##
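httr can also take over the job of our hand-written omdb() function. As a hedged sketch: modify_url() assembles a URL from a named list of query parameters, and parse_url() splits one back into its parts. No request is sent here, so this runs offline; the variable names are illustrative.

```r
library(httr)

# Build the request URL from a named list instead of pasting strings
request_url <- modify_url(
  "http://www.omdbapi.com/",
  query = list(t = "Interstellar", y = "2014", plot = "short", r = "json")
)
request_url

# parse_url() recovers the pieces, including the query parameters
parsed <- parse_url(request_url)
parsed$hostname
parsed$query$t

# GET() accepts the same `query` argument directly (not run here):
# GET("http://www.omdbapi.com/", query = list(t = "Interstellar", y = "2014"))
```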
In addition, httr gives us access to lots of useful information about the quality of our response. For example, the header:
headers(response_xml)
## $`cache-control`
## [1] "private, max-age=909"
##
## $`content-type`
## [1] "text/xml; charset=utf-8"
##
## $expires
## [1] "Wed, 26 Nov 2014 17:26:14 GMT"
##
## $`last-modified`
## [1] "Wed, 26 Nov 2014 16:26:14 GMT"
##
## $vary
## [1] "*"
##
## $server
## [1] "Microsoft-IIS/7.5"
##
## $`x-aspnet-version`
## [1] "4.0.30319"
##
## $`x-powered-by`
## [1] "ASP.NET"
##
## $`access-control-allow-origin`
## [1] "*"
##
## $date
## [1] "Wed, 26 Nov 2014 17:11:04 GMT"
##
## $`content-length`
## [1] "731"
##
## attr(,"class")
## [1] "insensitive" "list"
And also a handy means to extract specifically the HTTP status code:
status_code(response_xml)
## [1] 200
It is very good to learn your HTTP status codes.
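As a memory aid, the first digit of a status code tells you its general class. The helper below is a hypothetical sketch (status_class is not part of httr) that maps a code to that class using base R only. For real requests, httr provides http_error() and stop_for_status() to react to failed responses.

```r
# Hypothetical helper: classify an HTTP status code by its first digit.
# `status_class` is an illustrative name, not an httr function.
status_class <- function(code) {
  switch(
    as.character(code %/% 100),
    "1" = "informational",
    "2" = "success",
    "3" = "redirection",
    "4" = "client error",
    "5" = "server error",
    "unknown"   # fallback for anything outside 100-599
  )
}

status_class(200)
status_class(404)
status_class(503)
```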
The documentation for httr includes a vignette of “best practices for writing an API package”, which is useful for when you want to bring your favourite web resource into the world of R!
What if data is present on a website, but isn’t provided in an API at all? It is possible to grab that information too. How easy that is to do depends a lot on the quality of the website that we are using.
HTML is a structured way of displaying information. It is very similar in structure to XML (in fact, many modern HTML sites are actually XHTML5, which is also valid XML).
Two pieces of equipment
rvest: devtools::install_github("hadley/rvest")
Before we go any further, let’s play a game together!
library(rvest)
Let’s make a simple HTML table and then parse it!
library(gapminder)
knitr::kable(head(gapminder))
(after uncommenting this code chunk, which uses the options echo=FALSE, results='asis')
html("file:///home/andrew/Documents/projects/GapminderHead/GapminderHead.html") %>%
html_table
## [[1]]
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.80 8425333 779.4
## 2 Afghanistan Asia 1957 30.33 9240934 820.9
## 3 Afghanistan Asia 1962 32.00 10267083 853.1
## 4 Afghanistan Asia 1967 34.02 11537966 836.2
## 5 Afghanistan Asia 1972 36.09 13079460 740.0
## 6 Afghanistan Asia 1977 38.44 14880372 786.1
(Note that this is also possible with the XML package.)
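The example above reads from a local file that you may not have. The following sketch parses a tiny HTML table supplied as a string instead, so it runs anywhere. It assumes a more recent rvest, where the html() function used above has been renamed read_html(); the table contents are the first two rows of the output shown above, cut down to two columns.

```r
library(rvest)

# A tiny HTML table as a string (two columns of the gapminder head)
page <- read_html(
  "<table>
     <tr><th>country</th><th>year</th></tr>
     <tr><td>Afghanistan</td><td>1952</td></tr>
     <tr><td>Afghanistan</td><td>1957</td></tr>
   </table>"
)

# html_table() on a whole page returns a list, one element per <table>
tables <- html_table(page)
tables[[1]]
```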
getting Star Trek species data from MemoryAlpha
library("magrittr")
library("dplyr")
library("tidyr")
character_data <- function(chname){
  paste0("http://en.memory-alpha.org/wiki/", chname) %>%
    html %>%
    html_nodes(".wiki-sidebar") %>%
    html_table(header = FALSE) %>%
    extract2(1) %>%
    set_colnames(c("trait", "value")) %>%
    mutate(trait = gsub(":", "", trait)) %>%
    filter(trait %in% c("Gender","Species","Affiliation","Rank")) %>%
    mutate(name = chname) %>%
    spread(trait, value)
}
character_data("Worf")
## name Affiliation Gender Rank
## 1 Worf Federation StarfleetHouse of Martok Male Lieutenant Commander
## Species
## 1 Klingon
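The scraping half of character_data() depends on the live page, but the reshaping half can be tried offline. The sketch below applies the same cleaning steps to a small hand-made data frame standing in for the scraped sidebar. The sidebar object, and in particular its "Status" row, are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hand-made stand-in for the scraped sidebar table (values are illustrative)
sidebar <- data.frame(
  trait = c("Gender:", "Species:", "Affiliation:", "Rank:", "Status:"),
  value = c("Male", "Klingon", "Federation Starfleet",
            "Lieutenant Commander", "Active"),
  stringsAsFactors = FALSE
)

worf <- sidebar %>%
  mutate(trait = gsub(":", "", trait)) %>%                  # drop trailing colons
  filter(trait %in% c("Gender", "Species", "Affiliation", "Rank")) %>%
  mutate(name = "Worf") %>%                                 # label the character
  spread(trait, value)                                      # one column per trait

worf
```

The result has one row per character and one column per trait, which makes it easy to bind the rows for several characters together.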
MOST IMPORTANT: confirm that there is NO rOpenSci package and NO API before you spend hours scraping (the API was right here).
First go to this website about Airports. Follow the link to get your API key (you will need to click the link in a confirmation email).
All the airports of the planet:
https://airport.api.aero/airport/?user_key={yourkey}
https://airport.api.aero/airport/match/toronto?user_key={yourkey}
https://airport.api.aero/airport/distance/YVR/LAX?user_key={yourkey}
Do you need just the US airports? This API does that and is free.
An even simpler API serves very simple data about the airports of the world:
# jsonlite::fromJSON() takes the URL as its first argument (there is no `file` argument, unlike rjson)
fromJSON("http://airportcode.riobard.com/search?q=Toronto&fmt=JSON")
fromJSON("http://airportcode.riobard.com/airport/YVR?fmt=json")
It is perfectly possible to combine these into a handy data.frame. One way might be:
library(jsonlite)
tdot_data <- fromJSON("http://airportcode.riobard.com/search?q=Toronto&fmt=JSON")
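When a response is a JSON array of similar objects, jsonlite simplifies it to a data frame by default, which is often all the combining that is needed. The sketch below shows this on an inline string. The field names and records are assumptions made up for illustration and may not match what the airport API actually returns.

```r
library(jsonlite)

# Made-up JSON array of records; the real API's fields may differ
airports_json <- '[
  {"code": "AAA", "name": "First Example Airport"},
  {"code": "BBB", "name": "Second Example Airport"}
]'

airports <- fromJSON(airports_json)

class(airports)   # an array of objects is simplified to a data frame
airports$code
```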
gameday
Does anybody remember this lovely function?
gday <- function(team="canucks") {
url <- paste0("http://live.nhle.com/GameData/GCScoreboard/", Sys.Date(), ".jsonp")
grepl(team, RCurl::getURL(url), ignore.case=TRUE)
}
Here is the httr version:
library(httr)
req <- GET("http://live.nhle.com/GameData/GCScoreboard/2014-11-24.jsonp")
jsonp <- content(req, "text")
json <- gsub('([a-zA-Z_0-9\\.]*\\()|(\\);?$)', "", jsonp, perl = TRUE)
data <- fromJSON(json)
data$games %>%
kable
##
##
## |atcommon |canationalbroadcasts |ata |rl | atsog|bs |htcommon | id|atn | hts|atc |htn |usnationalbroadcasts |gcl |hta | ats|htc | htsog|bsc | gs|gcll |
## |:--------|:--------------------|:---|:----|-----:|:--------|:---------|---------:|:------------|---:|:------|:------------|:--------------------|:----|:---|---:|:------|-----:|:-----|--:|:----|
## |PENGUINS |TVA, SN |PIT |TRUE | 33|FINAL OT |BRUINS | 2.014e+09|PITTSBURGH | 2|winner |BOSTON |NBCSN |TRUE |BOS | 3| | 29|final | 5|TRUE |
## |FLYERS | |PHI |TRUE | 21|FINAL SO |ISLANDERS | 2.014e+09|PHILADELPHIA | 1| |NY ISLANDERS | |TRUE |NYI | 0|winner | 46|final | 5|TRUE |
## |SENATORS | |OTT |TRUE | 26|FINAL |RED WINGS | 2.014e+09|OTTAWA | 4| |DETROIT | |TRUE |DET | 3|winner | 43|final | 5|TRUE |
## |WILD | |MIN |TRUE | 39|FINAL |PANTHERS | 2.014e+09|MINNESOTA | 1|winner |FLORIDA | |TRUE |FLA | 4| | 30|final | 5|TRUE |
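The gsub() step above is what turns JSONP into plain JSON: JSONP wraps the JSON payload in a JavaScript function call, and the regular expression strips the leading callback name and the trailing parenthesis. Here is a small offline sketch of that step. The callback name loadScoreboard and the shortened payload are assumptions for illustration.

```r
library(jsonlite)

# A made-up, shortened JSONP string; the callback name is an assumption
jsonp <- 'loadScoreboard({"games":[{"atn":"OTTAWA","htn":"DETROIT"}]})'

# Remove "callbackName(" at the front and ")" or ");" at the end
json <- gsub('([a-zA-Z_0-9\\.]*\\()|(\\);?$)', "", jsonp, perl = TRUE)
json

games <- fromJSON(json)$games
games
```

Note that the first half of the pattern would also remove any opening parenthesis inside the payload itself, so this approach is only safe when the data contains none.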