William Marble
August 11, 2016
Nielson and Simmons (2015)
Grimmer (2013)
Follow along by downloading notes and code: http://stanford.edu/~wpmarble/webscraping_tutorial/webscraping_tutorial.pdf http://stanford.edu/~wpmarble/webscraping_tutorial/code.R
Identify information on the internet that you want to use
Figure out how to automatically navigate to the web pages. Is there a consistent URL format? E.g., http://url.com/year/month/day.html
Figure out how to extract the information you want using features on the website (HTML markers and/or text markers)
Write a script to extract, format, and save the information you want using the flags you identified
Loop through all the websites from step 2, applying the script to each of them
Do some awesome analysis on your newly unlocked data!
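Steps 2 and 5 can be sketched in base R. This is a minimal illustration only: the date range is made up, the URL pattern is the placeholder format from step 2, and scrape_page() is a hypothetical stand-in for whatever extraction script you write in step 4.

```r
# Step 2: build one URL per day, assuming the year/month/day.html format.
# The date range is hypothetical.
dates <- seq(as.Date("2016-08-09"), as.Date("2016-08-11"), by = "day")
urls <- sprintf("http://url.com/%s.html", format(dates, "%Y/%m/%d"))

# Step 4 placeholder: a real version would download and parse the page.
# This hypothetical stub only records the URL, so the sketch runs offline.
scrape_page <- function(url) {
  data.frame(url = url, stringsAsFactors = FALSE)
}

# Step 5: apply the script to every URL and stack the results.
results <- do.call(rbind, lapply(urls, scrape_page))
results
```

Returning one small data frame per page and binding them at the end keeps each page's result separate until the loop finishes.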
HTML tells browsers how to display information. Understanding the basics is important for webscraping.
An example of what a website looks like under the hood: http://stanford.edu/~wpmarble/webscraping_tutorial/html/silly_webpage.txt
Read in a webpage using read_html()
library(rvest)  # makes read_html(), html_nodes(), and html_text() available
my_webpage = read_html("http://stanford.edu/~wpmarble/webscraping_tutorial/html/silly_webpage.html")
my_webpage
{xml_document}
<html>
[1] <head>\n <title>This is the title of the webpage</title>\n </head>
[2] <body>\n <h1>This is a heading</h1> \n <p class="notThisOne"> ...
Select “nodes” with html_nodes() and a CSS selector (e.g., a tag name, the thing inside the angle brackets)
all_paragraphs = html_nodes(my_webpage, "p")
all_paragraphs
{xml_nodeset (3)}
[1] <p class="notThisOne">This is a paragraph</p>
[2] <p class="thisOne">This is another paragraph with a different class! ...
[3] <p class="divGraf"> \n This is a paragraph inside a division, ...
Extract elements of the nodes using html_text(), html_attrs(), html_name(), etc.
p_text = html_text(all_paragraphs)
p_text
[1] "This is a paragraph"
[2] "This is another paragraph with a different class!"
[3] " \n This is a paragraph inside a division, along with a \n a link.\n "
Much more capability, but these are the main commands you need. Plenty of examples to come.
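The three commands above also work with more specific CSS selectors. The sketch below assumes the rvest package is installed and uses a small inline HTML string as a stand-in; it is not the tutorial's actual file, and the link address is made up. It selects a paragraph by class and pulls attributes with html_attr().

```r
library(rvest)

# Hypothetical stand-in page, loosely modeled on the example output above.
html_string <- '<html><body>
  <h1>This is a heading</h1>
  <p class="notThisOne">This is a paragraph</p>
  <p class="thisOne">This is another paragraph with a different class!</p>
  <div><p class="divGraf">A paragraph with <a href="http://example.com">a link</a>.</p></div>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(html_string)

# "p.thisOne" matches only <p> elements with class "thisOne".
target_text <- html_text(html_nodes(page, "p.thisOne"))

# html_attr() extracts one named attribute from each selected node.
classes <- html_attr(html_nodes(page, "p"), "class")
links <- html_attr(html_nodes(page, "a"), "href")
```

Selecting on a class rather than a bare tag name is usually the easiest way to skip the elements you do not want.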
grep(pattern, string) takes a string vector and returns a vector of the indices of the elements that match the pattern
mystring = c("this is", "a string", "vector", "this")
grep("this", mystring)
[1] 1 4
grepl(pattern, string) takes a string vector as an input and returns a logical vector that says whether each element matches the pattern
mystring = c("this is", "a string", "vector", "this")
grepl("this", mystring)
[1] TRUE FALSE FALSE TRUE
gsub(pattern, replacement, string) finds all the instances of pattern in string and replaces them with replacement
mystring = c("this is", "a string", "vector", "this")
gsub(pattern="is", replacement="WTF", mystring)
[1] "thWTF WTF" "a string" "vector" "thWTF"
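In a scraping workflow these three functions are often combined to tidy and filter extracted text. A minimal sketch in base R, using a made-up vector of scraped strings (the "Posted on" and "Comments" markers are assumptions for illustration, not taken from any real page):

```r
# Hypothetical text as it might come back from html_text()
scraped <- c("  Posted on   August 11, 2016 \n",
             "\n Comments (3) ",
             "Posted on August 10, 2016")

# Collapse runs of whitespace (spaces, tabs, newlines) to a single space,
# then strip the single leading or trailing space that may remain.
clean <- gsub("\\s+", " ", scraped)
clean <- gsub("^ | $", "", clean)

# grepl() flags the elements that start with the text marker "Posted on".
is_date <- grepl("^Posted on", clean)

# gsub() removes the marker, leaving only the date strings.
date_strings <- gsub("^Posted on ", "", clean[is_date])

# grep() gives the position of the comments line.
comment_index <- grep("Comments", clean)
```

Cleaning whitespace before matching means the patterns do not have to account for stray newlines left over from the HTML.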