An introduction to ggplot2

William Marble
February 19, 2016

What is ggplot2?

ggplot2 is a graphics package for R based on the “Grammar of Graphics” (Leland Wilkinson, 1999/2000)
Created by Hadley Wickham, world's foremost R guru
The idea is “to take the good parts of base and lattice graphics and none of the bad parts”
Most of the learning curve for ggplot comes from learning to think about data visualization the way Wickham wants you to

Benefits of ggplot2

The good:

The “Grammar of Graphics” gives a systematic way of thinking about lots of different types of graphics in a unified framework
Simple to create complex graphics that convey a lot of information
Easy to add “layers” to a plot without much extra code
Lots of out-of-the-box functions for different types of graphics
Excellent documentation (http://docs.ggplot2.org/current/)

Drawbacks of ggplot2

The bad:

Learning curve
Data must be structured in a particular way – may require extra pre-processing of the data
Not as customizable as base graphics
Strange default settings (grid lines, grey background, weird colors)

Installation

# load ggplot, install if not already installed
if (!require(ggplot2)){
  install.packages("ggplot2")
  require(ggplot2)
}
# and for fun themes
if (!require(ggthemes)){
  install.packages("ggthemes")
  require(ggthemes)
}

Running example

Congressional district-level data; unit of observation is district-year

dwnom1 = Congress member's first-dimension DW-NOMINATE score (voteview.com)
median = ideology of median donor within district (Bonica)
gini = estimated gini coefficient within district
party = representative's party

ideol.data = read.csv(file="http://stanford.edu/~wpmarble/data/rep_data.csv")
print(head(ideol.data[, c("year", "dwnom1", "median", "gini", "party")]), digits=2)

  year dwnom1 median gini      party
1 1984  0.234  0.439 0.40 Republican
2 1984  0.354  0.584 0.43 Republican
3 1984  0.343  0.344 0.44 Republican
4 1984 -0.036  0.153 0.42   Democrat
5 1984 -0.202  0.318 0.42   Democrat
6 1984 -0.156  0.023 0.41   Democrat

ggplot(ideol.data, aes(x = dwnom1)) + geom_histogram()

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point()

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point() + stat_smooth(method="lm")

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point() + stat_smooth(method="loess")

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point() + scale_color_manual(values = c("blue", "red"))

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3) + stat_smooth(method = "lm") + 
  scale_color_manual(values = c("blue", "red"))

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3) + stat_smooth(method = "loess") +
  scale_color_manual(values = c("blue", "red"))

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3) + stat_smooth(method = "loess") + 
  scale_color_manual(name="Party", values = c("blue", "red")) +
  scale_x_continuous(breaks = c(-1, 0, 1)) + 
  facet_grid(~decade) + labs(x = "Median donor DIME score", y = "NOMINATE score") + 
  ggtitle("District representation, by party and decade") + 
  theme_bw() + 
  theme(legend.position = c(.1, .8), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())

Understanding the grammar of graphics

The reason it's so easy to add all those features is the way ggplot makes you think about data
Take data, transform them using a stat, and map them onto aesthetics, which are interpretable using scales
- data must be stored as data frames
- stat could be any transformation of the data – including keeping it the same (identity), binning the data, taking the mean, interquartile range, etc.
- aesthetics could be position (x and y coordinates), colo(u)r, fill, size, opacity, line type, shape, etc.; they are ways to represent variables
- scales are usually given by axes and legends
Data are represented visually using a geometry; these geom_ functions do most of the work
- scatter plot = geom_point
- line chart = geom_line
- histogram = geom_histogram
- kernel density = geom_density
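
To make the pieces above concrete, here is a minimal sketch that spells out each part of the grammar in one plot. It assumes the built-in mtcars data set as a stand-in for the talk's data, and it uses the lower-level layer() function only to show the stat and geom explicitly; in practice geom_point() does the same thing.

```r
library(ggplot2)

# mtcars ships with R; it stands in for ideol.data here (an assumption).
p_grammar <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data + aesthetics
  layer(
    geom = "point",        # geometry: draw one point per row
    stat = "identity",     # stat: leave the data unchanged
    position = "identity"  # no position adjustment
  ) +
  scale_colour_discrete(name = "Cylinders")  # scale: controls the colour legend

p_grammar
```

Swapping geom = "point" for another geometry, or stat = "identity" for another stat, changes the picture without touching the data or the aesthetic mapping.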

Building blocks

Every plot begins with a call to the function ggplot().

ggplot(data = NULL, mapping = aes(), ...)

This initial call to ggplot() does not draw any data yet (in current versions of ggplot2, printing it shows only an empty panel). Rather, you're telling ggplot what data frame to look at, and how the variables in that data frame map onto aesthetics.

We can then add to it using the + operator to tell it to draw geoms or change other features.
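
A small sketch of this two-step build, assuming the built-in mtcars data in place of the talk's data. The object names are illustrative only.

```r
library(ggplot2)

# Step 1: data and aesthetic mapping only.
base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))
length(base_plot$layers)     # number of layers so far

# Step 2: add a geom with `+`; the original object is left unchanged.
with_points <- base_plot + geom_point()
length(with_points$layers)   # number of layers after adding the geom

with_points
```

Because `+` returns a new plot object, the same base plot can be reused with different geoms.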

Scatter plot

Example – I want to create a scatter plot of district median and representative's ideology. What might the aesthetic(s) be? What would the geom(s) be?

aesthetics:
1. position on the x axis
2. position on the y axis
geoms:
1. points

Scatter plot

ggplot(data = ideol.data, mapping = aes(x = median, y = dwnom1))

Scatter plot

ggplot(data = ideol.data, mapping = aes(x = median, y = dwnom1)) + geom_point()

Histogram

Want to see distribution of members' ideology.

aesthetics? x
stat? bin
geoms? bar

geom_histogram wraps this up into one function for you.

Histogram

Using geom_histogram()

ggplot(ideol.data, aes(x = dwnom1)) + geom_histogram()
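
Called with no arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a binwidth yourself. A minimal sketch of setting it explicitly, assuming the built-in mtcars data rather than the talk's data:

```r
library(ggplot2)

# Each bar spans 2 units of mpg; the value 2 is chosen only for illustration.
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

hist_plot
```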

Histogram

Using geom_bar(stat="bin")

ggplot(ideol.data, aes(x = dwnom1)) + geom_bar(stat="bin")

Kernel density

Similarly, geom_density() gives (almost) the same output as geom_line(stat="density")

ggplot(ideol.data, aes(x = dwnom1)) + geom_density()

ggplot(ideol.data, aes(x = dwnom1)) + geom_line(stat="density")

Boxplot (with categorical variable)

What does income inequality look like in districts represented by Democrats vs. Republicans?

ggplot(ideol.data[!is.na(ideol.data$party),], aes(x = party, y = gini)) + geom_boxplot()

Note: ggplot knows what to do when I give x a categorical variable

So what?

Everything so far is easy enough to do with base graphics
So why bother with ggplot?
The power of ggplot lies in the ease with which you can add new variables
Especially useful for analyzing subgroups

Subgroup analysis

Want to recreate the scatter plot, but highlight the Democrats and Republicans. Only thing we need to do is add colour=party to our aes() command:

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point() + scale_color_manual(values = c("blue", "red"))

Subgroup analysis

What if we have another variable? Could map that onto the opacity of the points. Say, polarization of the district's donors (the sd variable in the code below).

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party, alpha = sd)) + geom_point() + scale_color_manual(values = c("blue", "red"))

Faceting

Possibly the most useful feature of ggplot in subgroup analysis is faceting. Easily split up the plots by some categorical variable. The syntax is facet_wrap(~variable) or facet_grid(~variable)

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party, alpha = sd)) + geom_point() + scale_color_manual(values = c("blue", "red")) + facet_grid(~decade)

Faceting

ggplot(ideol.data[!is.na(ideol.data$party),], aes(x = party, y = gini)) + geom_boxplot() + facet_wrap(~decade)

Can also facet by more than one variable (e.g., ~party+decade) but not recommended
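
If you do need two faceting variables, facet_grid() also accepts a rows ~ columns formula, which lays the panels out as a grid. A minimal sketch, assuming the built-in mtcars data (am and cyl are columns of that data set, not of the talk's data):

```r
library(ggplot2)

# One row of panels per value of am, one column per value of cyl.
facet_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)

facet_plot
```

With two levels of am and three of cyl this gives six panels, which is about as many as stay readable on one slide.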

Smoothing

Easy to add a smoother like a regression line, loess line, or any arbitrary function of x. The function is stat_smooth(method, formula = y ~ x, se = TRUE). The other aesthetics carry over (e.g., subgroup by colour). By default it also draws a confidence band around the fit (95%, set by the level argument); use se = FALSE to drop it.

Examples

local polynomial regression (loess; R's loess() fits local quadratics by default): stat_smooth(method = "loess", formula = y ~ x)
first-order linear regression: stat_smooth(method = "lm", formula = y ~ x)
second-order (quadratic) linear regression: stat_smooth(method = "lm", formula = y ~ x + I(x^2))
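
One way to convince yourself that stat_smooth(method = "lm") is just lm() under the hood is to compare the drawn curve with a model fit by hand. This is a sketch on the built-in mtcars data (an assumption, not the talk's data); layer_data() extracts the values ggplot computed for a layer.

```r
library(ggplot2)

smooth_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE)

# The same quadratic model, fit directly with lm().
quad_fit <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Layer 2 is the smoother; its x values are the grid ggplot predicted on.
smooth_data <- layer_data(smooth_plot, 2)
direct_pred <- predict(quad_fit, newdata = data.frame(wt = smooth_data$x))

# Compare the curve ggplot drew with the hand-fit predictions.
all.equal(smooth_data$y, unname(direct_pred))
```

Inside the formula argument, x and y refer to the aesthetics, not to the column names in the data frame.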

Smoothing

loess

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + geom_point(alpha = .3, size = .8) + stat_smooth(method = "loess") + scale_color_manual(values = c("blue", "red"))

Smoothing

linear regression

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + geom_point(alpha = .3, size = .8) + stat_smooth(method = "lm") + scale_color_manual(values = c("blue", "red"))

Smoothing

linear regression with quadratic term (and get rid of the confidence bands)

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red"))

Arbitrary curve

Can use another ggplot function, stat_function(), that will plot an arbitrary function. Works like curve() in base graphics. E.g., define a function $f(x) = \ln(|x|) + x^2$.

arbitraryFunction = function(x) log(abs(x)) + x^2
ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + geom_point(alpha = .3, size = .8) + stat_function(fun = arbitraryFunction) + scale_color_manual(values = c("blue", "red"))
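
stat_function() does not need real data at all: it only needs an x range to evaluate the function over. A minimal sketch, assuming a two-row dummy data frame whose only job is to set that range (0.1 to 2 is an arbitrary choice that avoids x = 0, where the log is undefined):

```r
library(ggplot2)

arbitraryFunction <- function(x) log(abs(x)) + x^2

# The data frame only supplies the limits of the x axis.
curve_plot <- ggplot(data.frame(x = c(0.1, 2)), aes(x = x)) +
  stat_function(fun = arbitraryFunction, n = 200)  # evaluate at 200 points

curve_data <- layer_data(curve_plot)
head(curve_data[, c("x", "y")])

curve_plot
```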

Any questions so far?

Making the plot look pretty

So far, we've covered the nuts and bolts, but these plots still don't look that great

Ugly default background
No title
Axis titles
Axis labels
Legend placement

Themes

Luckily, there are lots of pre-packaged themes. I like theme_bw():

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme_bw()

Themes

theme_classic()

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme_classic()

Themes

theme_dark()

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme_dark()

Themes

Extensions from the ggthemes library: theme_economist()

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme_economist()

Themes

Extensions from the ggthemes library: theme_fivethirtyeight()

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + 
  geom_point(alpha = .3, size = .8) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme_fivethirtyeight()

Scales

Scales change the way that aesthetics are presented. Usually the command is scale_X_TYPE, where “X” is the aesthetic (e.g., color or size or line type) and “TYPE” is generally one of “continuous” or “discrete”, based on the type of variable being mapped to that aesthetic, or “manual” for more customization.

For instance, I've been using scale_colour_manual() to change the color of the points. Can also use this to change the name of the legend.

ggplot(ideol.data, aes(x = median, y = dwnom1, colour = party)) + geom_point() + theme_bw() + scale_colour_manual(name="New Legend\nName", values=c("purple", "orange"))
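
An unnamed values vector is matched to the factor levels in order, so c("blue", "red") only works because "Democrat" sorts before "Republican". Naming the values makes the pairing explicit. A minimal sketch on simulated data (the toy data frame below is made up and only borrows the talk's column names):

```r
library(ggplot2)

# Simulated stand-in for ideol.data: 20 Republicans, 10 Democrats.
set.seed(1)
toy <- data.frame(
  median = rnorm(30),
  dwnom1 = rnorm(30),
  party  = rep(c("Republican", "Democrat"), times = c(20, 10))
)

party_plot <- ggplot(toy, aes(x = median, y = dwnom1, colour = party)) +
  geom_point() +
  scale_colour_manual(
    name   = "Party",
    values = c("Democrat" = "blue", "Republican" = "red")  # matched by name
  )

party_plot
```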

Scales

Use scale_x_continuous or scale_y_continuous to change where the axis breaks (tick marks and their labels) fall:

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point() + theme_bw() + scale_x_continuous(breaks = c(-pi/3, 0, 1)) + scale_y_continuous(breaks = c(-pi/6, 0, pi/6, 1))
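
The same scale functions take a labels argument if the text at each break should differ from the number itself. A minimal sketch, assuming the built-in mtcars data (wt is recorded in thousands of pounds); the break positions and label text are illustrative choices:

```r
library(ggplot2)

axis_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_bw() +
  # breaks and labels must have the same length
  scale_x_continuous(breaks = c(2, 3, 4, 5),
                     labels = c("2,000 lb", "3,000 lb", "4,000 lb", "5,000 lb")) +
  scale_y_continuous(breaks = seq(10, 35, by = 5))

axis_plot
```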

Axis titles

labs(x = "label", y = "label")

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point() + theme_bw() + labs(x = expression(beta), y = "y label")

Title

ggtitle()

ggplot(ideol.data, aes(x = median, y = dwnom1)) + geom_point() + theme_bw() + ggtitle("Awesome title")

Changing legend position

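It's a suboption of the theme() command, namely legend.position = c(x,y), where x and y go from 0 to 1. (From ggplot2 3.5.0 on, the numeric form is deprecated in favour of legend.position = "inside" together with legend.position.inside = c(x,y).)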

ggplot(ideol.data, aes(x = median, y = dwnom1, colour=party)) + geom_point() + theme_bw() + theme(legend.position=c(.5,.5))

ggplot(ideol.data, aes(x = median, y = dwnom1, colour=party)) + geom_point() + theme_bw() + theme(legend.position=c(0, .9))

Changing legend position

Can also use keywords such as 'left', 'right', 'top', or 'bottom'

ggplot(ideol.data, aes(x = median, y = dwnom1, colour=party)) + geom_point() + theme_bw() + theme(legend.position='bottom')

Changing font sizes

The syntax is a little tricky. You change text in the theme() command. The idea is that any text is an instance of element_text(), and that different types of text on the plot inherit options from higher-level text:

text
- title
  - axis.title
    - axis.title.x
    - axis.title.y
  - legend.title
- axis.text
  - axis.text.x
  - axis.text.y

and so on. Type ?theme to see more.

element_text() includes options for font family, color, size, justification, angle, margin, etc.
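
To see the inheritance at work, ggplot2's calc_element() resolves what a given text element ends up looking like under a theme. A minimal sketch, assuming theme_bw() as the starting point; the size of 20 is arbitrary:

```r
library(ggplot2)

# Set the size once, on the high-level `title` element.
big_titles <- theme_bw() + theme(title = element_text(size = 20))

# Elements below `title` in the hierarchy pick the size up...
calc_element("axis.title.x", big_titles)$size
calc_element("legend.title", big_titles)$size

# ...while axis tick labels inherit from `text` and are unaffected.
calc_element("axis.text.x", big_titles)$size
```

Setting an option on the most general element that should change, and overriding only the exceptions, keeps theme() calls short.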

Changing font sizes

ggplot(ideol.data, aes(x = median, y = dwnom1, colour=party)) + geom_point() + theme_bw() + ggtitle("An awesome title") + theme(title = element_text(size = 20))

Changing font sizes

ggplot(ideol.data, aes(x = median, y = dwnom1, colour=party)) + geom_point() + theme_bw() + ggtitle("An awesome title") + theme(axis.title.x = element_text(size = 20, colour="blue"))

Questions?

A challenge-winning plot

Hans's power analysis plot:

Feel the Power

A challenge-winning plot

data <- na.omit(read.csv("http://stanford.edu/~wpmarble/data/olken_data.csv"))
head(data)

  treat_invite pct_missing head_edu   mosques  pct_poor total_budget
1            0  0.38527447        6 0.9083831 0.4001222     40.56500
2            1 -0.09574836       14 1.0666667 0.1856149     69.32150
3            1  0.14771932       12 0.7117438 0.4000000     41.10650
4            1 -0.18259122        9 0.9489917 0.4379366     17.06200
5            0 -0.29304767        9 1.6233766 0.3126954     72.08600
6            1 -0.09736358        9 0.7381890 0.4076087     69.83308

A challenge-winning plot

ate <- mean(data$pct_missing[data$treat_invite==1]) - mean(data$pct_missing[data$treat_invite==0])

p <- mean(data$treat_invite)
n1 <- 250
n2 <- 500
var1 <- var(data$pct_missing[data$treat_invite==1])
var0 <- var(data$pct_missing[data$treat_invite==0])

var.power1 <- (var1 / (p*n1)) + (var0 / ((1-p)*n1))
var.power2 <- (var1 / (p*n2)) + (var0 / ((1-p)*n2))

delta <- seq(from=-.2, to=.2, by=.001)

# Power calculation
power1 <- pnorm(-1.96 - delta/sqrt(var.power1)) + 1 -pnorm(1.96 - delta/sqrt(var.power1))
power2 <- pnorm(-1.96 - delta/sqrt(var.power2)) + 1 -pnorm(1.96 - delta/sqrt(var.power2))

power <- data.frame(delta, power1, power2)
colnames(power) <- c("delta","power1", "power2")
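
The power calculation above can be wrapped in a small function, which also makes it easy to solve for the minimum detectable effect (the effect size at which power reaches 0.8). This is a sketch, not the code used for the original plot: the function name and the standard error of 0.045 are hypothetical and are not computed from the Olken data.

```r
# Power of a two-sided 5%-level z-test for a true effect `delta`
# when the estimator has standard error `se`.
power_two_sided <- function(delta, se, crit = 1.96) {
  pnorm(-crit - delta / se) + 1 - pnorm(crit - delta / se)
}

# Hypothetical standard error, for illustration only.
se_example <- 0.045

# Minimum detectable effect: the positive delta where power crosses 0.8.
mde <- uniroot(function(d) power_two_sided(d, se_example) - 0.8,
               interval = c(0, 10 * se_example), tol = 1e-10)$root

power_two_sided(0, se_example)  # at delta = 0, power equals the size of the test
mde / se_example                # the MDE is roughly 2.8 standard errors
```

The curve is symmetric in delta, so the same calculation gives the negative bound.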

A challenge-winning plot

head(power)

   delta    power1    power2
1 -0.200 0.9926039 0.9999897
2 -0.199 0.9921419 0.9999882
3 -0.198 0.9916547 0.9999865
4 -0.197 0.9911411 0.9999845
5 -0.196 0.9906001 0.9999822
6 -0.195 0.9900304 0.9999796

A challenge-winning plot

Ideas on how to go from the data to the plot? How should we map the variables to aesthetics?

A challenge-winning plot

Let's start with the basics:

power.plot = ggplot(power, aes(x = delta))

A challenge-winning plot

Add some geoms:

power.plot = ggplot(power, aes(x = delta))
power.plot = power.plot + 
  geom_line(aes(y = power1, colour = "N = 250")) + 
  geom_line(aes(y = power2, colour = "N = 500"), lty = 3)
power.plot

A challenge-winning plot

Some titles would be good:

power.plot = power.plot + ggtitle("Power Analysis") + 
  labs(x = expression(delta), y = "Power")
power.plot

A challenge-winning plot

Let's get rid of the ugly grey background, make the axis titles bold, and adjust the main title a bit:

power.plot = power.plot + theme_bw() +  
  theme(axis.title=element_text(size=13, face="bold")) +
  theme(plot.title = element_text(lineheight=.8, face="bold", 
                                  size=16, hjust=.5)) 
power.plot

Now we're getting somewhere! What's up with the legend, though? Let's fix that.

A challenge-winning plot

Let's fix the legend

power.plot = power.plot + 
  scale_color_manual(values=c("N = 250"="gray50", "N = 500"="gray30"), 
                     name="Sample Size") + 
  guides(colour = guide_legend(override.aes = list(size=c(1.2,1.2), 
                                                   linetype=c(1,3)))) +
  theme(legend.title=element_text(face="bold", size=11), 
        legend.text = element_text(size = 10), 
        legend.position="bottom")

A challenge-winning plot

power.plot

A challenge-winning plot

We're almost there. Let's add the power = .8 line using geom_hline()

power.plot = power.plot + geom_hline(yintercept=.8, colour="black", 
                                     lty=2, size=.4)
power.plot

A challenge-winning plot

Almost done! Next come the rectangles showing the minimum detectable effect (a little tricky). Hans used annotate(geom, xmin, xmax, ymin, ymax, …). The geom argument tells ggplot what type of geometry you're using to annotate the plot – in this case "rect" – and the other options are specific to the "rect" geometry. The ymin value, 0.04999579, is the power at delta = 0, i.e. 2 * pnorm(-1.96).

power.plot = power.plot + 
  annotate("rect", xmin = -0.128, xmax = 0.128, ymin = 0.04999579, 
           ymax = 0.8, alpha = .1) +
  annotate("rect", xmin = -0.091, xmax = 0.091, ymin = 0.04999579, 
           ymax = 0.8, alpha = .1)

A challenge-winning plot

power.plot

A challenge-winning plot

Finally, the last piece is adding the text annotation (“Power = 0.8”). Again we use annotate(), but this time with the “text” geom:

power.plot = power.plot + 
  annotate("text", x = 0.18, y = 0.78, label = "Power = 0.8", size=4)

A challenge-winning plot

power.plot

Alternative way to draw the lines

Remember the power data look like this:

head(power)

   delta    power1    power2
1 -0.200 0.9926039 0.9999897
2 -0.199 0.9921419 0.9999882
3 -0.198 0.9916547 0.9999865
4 -0.197 0.9911411 0.9999845
5 -0.196 0.9906001 0.9999822
6 -0.195 0.9900304 0.9999796

Alternative way to draw the lines

ggplot is often simpler if we store our data in long format, because then we could draw the lines using aes(lty=category) in our main ggplot call. Reshape to demonstrate:

library(reshape)
power2 = melt(power, id.vars = "delta", 
              measure.vars = c("power1", "power2"),
              variable_name = "power")
power2[c(1:4, 800:802), ]

     delta  power     value
1   -0.200 power1 0.9926039
2   -0.199 power1 0.9921419
3   -0.198 power1 0.9916547
4   -0.197 power1 0.9911411
800  0.198 power2 0.9999865
801  0.199 power2 0.9999882
802  0.200 power2 0.9999897

Alternative way to draw the lines

ggplot(power2, aes(x = delta, y = value, lty = power)) + geom_line()
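
If you would rather not load the reshape package, the same wide-to-long step can be sketched in base R with stack(). The data frame below is a toy version of power: its two standard errors (0.045 and 0.032) are made up, and the column names values and ind are the ones stack() produces.

```r
library(ggplot2)

# Toy wide-format data in the same shape as `power` (illustrative values).
delta <- seq(from = -0.2, to = 0.2, by = 0.001)
pw <- function(se) pnorm(-1.96 - delta / se) + 1 - pnorm(1.96 - delta / se)
power_wide <- data.frame(delta, power1 = pw(0.045), power2 = pw(0.032))

# stack() puts the two power columns on top of each other:
# `values` holds the numbers and `ind` records the source column.
power_long <- cbind(
  delta = rep(power_wide$delta, times = 2),
  stack(power_wide[c("power1", "power2")])
)

long_plot <- ggplot(power_long, aes(x = delta, y = values, linetype = ind)) +
  geom_line()

long_plot
```

With the data in long format, the legend comes for free and adding a third sample size means adding rows, not another geom_line() call.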

Summing up

ggplot2 is a pretty easy way to create pretty complex graphics
The idea is to think about how variables map onto aesthetics
Lots of pre-packaged functions that will do the heavy lifting for you (e.g., geom_line, geom_histogram, etc.)
Can add a smoother using stat_smooth(method = c("loess", "lm", "gam"), formula, …)
Control the look of the plot using themes
Control the meaning of the variables (e.g., the axes and the colors) using scales
Can build the plot incrementally instead of all in one command