Webscraping in R

William Marble
August 11, 2016

Introduction

Oftentimes there is valuable information online that is not nicely formatted for statistical analysis
Some organizations make their data easily downloadable, but this isn't always the case
Other times, the creators of the data weren't intending to create a database for researchers
This tutorial will show you how to wrangle semi-structured data from the internet into a spreadsheet-like format

Two political science examples

Nielson and Simmons (2015)

Nielson and Simmons are interested in studying the effect of signing treaties on countries' standing in the international community
There might be both tangible and intangible benefits
Measure intangible benefits using European Union press releases

Grimmer (2013)

Grimmer is interested in how members of Congress communicate with constituents
Focus on position taking or credit claiming?
Uses a large corpus of press releases from members of Congress

The authors of the press releases didn't intend to create a database for researchers
Nevertheless, in both cases they unintentionally created valuable data for social scientists
By using “webscraping” tools, the researchers were able to give structure to the data that allowed them to generate new insights

Overview of the rest of the tutorial

Follow along by downloading the notes and code:

Notes: http://stanford.edu/~wpmarble/webscraping_tutorial/webscraping_tutorial.pdf

Code: http://stanford.edu/~wpmarble/webscraping_tutorial/code.R

Workflow for webscraping
Brief primer on HTML
Tools for webscraping
Simple example
In-depth example
Very Short Intro to APIs (time permitting)

When is webscraping useful?

Data structured in a consistent manner across many web pages
When there's enough data that it's too much to do by hand

When is webscraping not useful?

When you don't know how to navigate to the web pages you want
When the data aren't structured in a consistent manner across web pages
When writing the code would take longer than doing it by hand (!!!)

Webscraping workflow

1. Identify information on the internet that you want to use
2. Figure out how to automatically navigate to the web pages. Is there a consistent URL format? E.g., http://url.com/year/month/day.html
3. Figure out how to extract the information you want using features of the web page (HTML markers and/or text markers)
4. Write a script to extract, format, and save the information you want using the markers you identified
5. Loop through all the web pages from step 2, applying the script to each of them
6. Do some awesome analysis on your newly unlocked data!
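
A minimal sketch of steps 2, 4, and 5 of the workflow above. The URL pattern is the placeholder from step 2, not a real site, and scrape_page() is a hypothetical helper written for this illustration. To keep the sketch runnable offline, the pages are small inline HTML strings; in real use each one would come from read_html(url).

```r
library(rvest)

# Step 2: build one URL per day, assuming the placeholder pattern
# http://url.com/year/month/day.html (not a real site).
dates = seq(as.Date("2016-08-01"), as.Date("2016-08-03"), by = "day")
urls = format(dates, "http://url.com/%Y/%m/%d.html")

# Step 4: a hypothetical extractor that turns one parsed page into one row.
scrape_page = function(page) {
  data.frame(
    heading = html_text(html_node(page, "h1")),
    n_paragraphs = length(html_nodes(page, "p")),
    stringsAsFactors = FALSE
  )
}

# Stand-ins for downloaded pages (assumption: made-up HTML for illustration).
# In real use: pages = lapply(urls, read_html)
pages = list(
  read_html("<html><body><h1>Day 1</h1><p>a</p><p>b</p></body></html>"),
  read_html("<html><body><h1>Day 2</h1><p>c</p></body></html>")
)

# Step 5: apply the extractor to every page and stack the rows.
results = do.call(rbind, lapply(pages, scrape_page))
results
```

Writing the extractor as a function of a single page keeps step 4 testable on one page before the loop in step 5 is run over all of them.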

Primer on HTML

HTML tells browsers how to display information. Understanding the basics is important for webscraping.

An example of what a website looks like under the hood: http://stanford.edu/~wpmarble/webscraping_tutorial/html/silly_webpage.txt

Elements are surrounded by tags that tell web browsers what they are: <element>some text</element>
There is sometimes extra information in the opening tag, like “class” or “id”: <p class="someclass"> paragraph </p>
Elements are nested
In Chrome, right click on a page, then click “View Page Source” to see its HTML
Use Chrome extension SelectorGadget to select elements you want
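
To make the points above concrete, here is a small sketch that parses a made-up page (not the tutorial's silly_webpage) and inspects its nesting, class, and id. The HTML string and the names in it are assumptions for illustration; the rvest functions are introduced in the next section.

```r
library(rvest)

# A made-up page with nested elements, an id attribute, and a class attribute.
html = '<html><body>
  <div id="main">
    <p class="someclass">paragraph <a href="http://example.com">a link</a></p>
  </div>
</body></html>'
page = read_html(html)

div = html_node(page, "div")
div_id = html_attr(div, "id")                       # the id of the <div>
div_children = html_name(html_children(div))        # elements nested directly inside the <div>
p_class = html_attr(html_node(page, "p"), "class")  # the class of the <p>

div_id
div_children
p_class
```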

rvest, the webscraping package for R

Read in a webpage using read_html()

library(rvest)  # makes read_html(), html_nodes(), and html_text() available
my_webpage = read_html("http://stanford.edu/~wpmarble/webscraping_tutorial/html/silly_webpage.html")
my_webpage

{xml_document}
<html>
[1] <head>\n    <title>This is the title of the webpage</title>\n  </head>
[2] <body>\n    <h1>This is a heading</h1>  \n    <p class="notThisOne"> ...

Select “nodes” with html_nodes(), passing it a CSS selector (here "p", the element name that appears inside the angle brackets of the tag)

all_paragraphs = html_nodes(my_webpage, "p")
all_paragraphs

{xml_nodeset (3)}
[1] <p class="notThisOne">This is a paragraph</p>
[2] <p class="thisOne">This is another paragraph with a different class! ...
[3] <p class="divGraf"> \n        This is a paragraph inside a division, ...
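
CSS selectors can be more specific than an element name. The sketch below shows a few common forms on a made-up page that only loosely imitates the output above; its exact contents are an assumption. In rvest 1.0 and later, html_elements() is the newer name for html_nodes(), and both work.

```r
library(rvest)

# Made-up stand-in page (assumption: not the actual silly_webpage.html).
page = read_html('<html><body>
  <h1>This is a heading</h1>
  <p class="notThisOne">This is a paragraph</p>
  <p class="thisOne">This is another paragraph</p>
  <div><p class="divGraf">Inside a division</p></div>
</body></html>')

by_tag_and_class = html_nodes(page, "p.thisOne")  # <p> elements with class "thisOne"
by_class = html_nodes(page, ".thisOne")           # any element with class "thisOne"
nested = html_nodes(page, "div p")                # <p> elements anywhere inside a <div>
either = html_nodes(page, "h1, p")                # all <h1> and all <p> elements

html_text(by_tag_and_class)
html_text(nested)
length(either)
```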

Extract elements of the nodes using html_text(), html_attrs(), html_name(), etc.

p_text = html_text(all_paragraphs)
p_text

[1] "This is a paragraph"                                                                     
[2] "This is another paragraph with a different class!"                                       
[3] " \n        This is a paragraph inside a division, along with a \n        a link.\n      "

Much more capability, but these are the main commands you need. Plenty of examples to come.
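
As a brief sketch of the other extractors named above, the following pulls an element name, attributes, and whitespace-trimmed text from a made-up fragment. The HTML and the link target are assumptions for illustration.

```r
library(rvest)

# Made-up fragment with a paragraph that contains a link.
page = read_html('<div><p class="divGraf">
    A paragraph with <a href="http://example.com/page.html">a link</a>.
  </p></div>')

graf = html_node(page, "p")
link = html_node(page, "a")

graf_text = html_text(graf, trim = TRUE)  # trim = TRUE drops leading/trailing whitespace
graf_attrs = html_attrs(graf)             # all attributes, as a named character vector
link_name = html_name(link)               # the element name
link_href = html_attr(link, "href")       # a single attribute by name

graf_text
graf_attrs
link_name
link_href
```

The href attribute is often what you need in order to navigate from an index page to the individual pages you want to scrape.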

Regular expressions

Often you'll see a pattern in text that you can use to pull out what you want
Regular expressions (or regex) are a language that allows you to precisely specify these patterns
A full treatment is beyond the scope of this presentation, but here are a few useful R commands

grep(pattern, string) takes a string vector and returns the indices of the elements that match the pattern

mystring = c("this is", "a string", "vector", "this")
grep("this", mystring)

[1] 1 4
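
Two variations on the grep() call above that may be useful, shown on the same vector: value = TRUE returns the matching elements instead of their indices, and the anchors ^ and $ restrict the match to the start and end of each element.

```r
mystring = c("this is", "a string", "vector", "this")

# Return the matching elements themselves rather than their positions.
matches = grep("this", mystring, value = TRUE)

# ^ and $ anchor the pattern, so only elements that are exactly "this" match.
exact = grep("^this$", mystring)

matches
exact
```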

grepl(pattern, string) takes a string vector as an input and returns a logical vector that says whether each element of the vector matches the pattern

mystring = c("this is", "a string", "vector", "this")
grepl("this", mystring)

[1]  TRUE FALSE FALSE  TRUE
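
Because grepl() returns one logical value per element, it works well for filtering rows of a data frame. A small sketch, using made-up press release titles (the data are invented for illustration):

```r
# Invented example data: one row per press release.
releases = data.frame(
  title = c("Senator announces grant",
            "Statement on treaty vote",
            "Grant awarded to local school"),
  stringsAsFactors = FALSE
)

# TRUE for titles that mention "grant", ignoring case.
is_grant = grepl("grant", releases$title, ignore.case = TRUE)

grant_releases = releases[is_grant, , drop = FALSE]
grant_releases
sum(is_grant)  # number of matching titles
```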

gsub(pattern, replacement, string) finds all the instances of pattern in string and replaces them with replacement

mystring = c("this is", "a string", "vector", "this")
gsub(pattern="is", replacement="WTF", mystring)

[1] "thWTF WTF" "a string"  "vector"    "thWTF"
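
The output above shows that "is" is replaced wherever it appears, including inside "this". The sketch below shows one way to match whole words only, and one way to use gsub() to tidy the stray whitespace that scraped text often contains (as in the html_text() output earlier). The choice of perl = TRUE is an assumption made here so that \b is interpreted as a word boundary.

```r
# \\b marks a word boundary, so only the standalone word "is" is replaced.
whole_word = gsub("\\bis\\b", "WAS", "this is", perl = TRUE)

# Collapse runs of whitespace (spaces, newlines) to one space, then trim the ends.
raw = " \n        This is a paragraph inside a division, along with a \n        a link.\n      "
clean = trimws(gsub("\\s+", " ", raw))

whole_word
clean
```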

Examples

Notes: http://stanford.edu/~wpmarble/webscraping_tutorial/webscraping_tutorial.pdf

Code: http://stanford.edu/~wpmarble/webscraping_tutorial/code.R

Questions/comments: wpmarble@stanford.edu!