Should students and researchers interested in data science use R, the free software environment for statistical analysis and plotting? The simple answer is no. There are many open-source alternatives to R, such as Python. Many people also use spreadsheet applications such as Microsoft Excel, Google Sheets, or Apple’s Numbers, as well as more specialized statistical software (e.g., Stata or SPSS).
In this post, I am not making a case for a particular program or workflow. My goal is to encourage students to think about the “reproducibility” and the “replicability” of their research – to borrow Patrick Schloss’ words. While we could achieve these standards using any of these programs, I think R offers users many benefits, including access to thousands of free packages, among them the mighty ggplot2.
As somebody who has used Excel or Google Sheets extensively to produce graphs, I have learned – the hard way – that using these applications is more time-consuming than analyzing data or producing data visualizations in R. This does not mean that the process is easy at first. Learning how to wrangle data in R or plot with ggplot2 requires lots of practice. But I believe that as our skills improve, the workflow becomes streamlined and easier to reproduce and replicate.
The goal of this tutorial is to introduce ggplot2’s main features and to encourage you to learn how to integrate this package into your workflow. To understand how ggplot2 works, we need to familiarize ourselves with its layering system.
If you have not downloaded R or RStudio, consider doing so. In the last part of this tutorial, there are instructions to access the gapminder dataset in R, and you will be able to use the code in this post to recreate the graphs. Even if you are inclined to use other software programs, I encourage you to read the second part of this post, as its insights apply to the way we draw graphs in general.
The Power of ggplot2
ggplot2 is a plotting package developed for R by Hadley Wickham, based on Leland Wilkinson’s 1999 book, The Grammar of Graphics. It is part of the tidyverse, a set of packages influenced by Wickham’s philosophy on the process of tidying or preparing data for analysis and plotting.
Before we begin this tutorial, let’s use one of my datasets on who speaks at the United Nations General Assembly’s yearly General Debate to answer the following question: how many women leaders have participated in these events from 1970 to 2021? Before we can plot these data, we should first look at the structure of the dataset. Here is a snippet of the data for the first four years.
Using the following code, we can produce the following graph.
ggplot(gender_count, aes(x = year, y = n, fill = gender_lower)) +
  geom_area(stat = "identity", position = "stack")
This is ggplot2’s most basic graph. With the same data and with a few more lines of code, we can transform this graph.
It looks more polished and it includes some important elements, such as the title of the graph. Here is the code for the second graph.
# theme_clean() comes from the ggthemes package and scale_fill_nejm() from the ggsci package
library(ggplot2)
library(ggthemes)
library(ggsci)
ggplot(gender_count, aes(x = year, y = n, fill = gender_lower)) +
  geom_area(stat = "identity", position = "stack") +
  theme_clean() +
  scale_fill_nejm() +
  labs(fill = "Gender:") +
  theme(text = element_text(family = "sans")) +
  theme(legend.position = "bottom",
        legend.title = element_text(size = 8, face = "bold"),
        legend.text = element_text(size = 8),
        legend.key.size = unit(.75, "line")) +
  theme(plot.title = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 8)) +
  theme(plot.background = element_rect(fill = "transparent"),
        legend.background = element_rect(fill = "transparent"),
        panel.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent")) +
  labs(title = "Comparing the Number of Male and Female Leaders\nAddressing the UN General Debate From 1970 to 2021",
       x = "", y = "Number of Speakers")
If you find this code confusing or even intimidating, rest assured that after completing this tutorial you will understand how these lines of code produced different parts of the second graph. And more importantly, you will be able to start creating your own graphs using R and ggplot2.
The “Grammar of Graphics”
To understand ggplot2, we must first understand Wilkinson’s grammar of graphics. To put it simply, graphs are made up of layers, each defined by its own set of rules. If we understand a graph’s core elements, we can then learn how to manipulate each layer to produce different types of graphs with a certain look.
The first layer is the data layer. Data piped into ggplot2 should be structured in a tidy format, where each column is a variable and each row is an observation. If you look at the table with the data used to plot the two area graphs, you can see the structure of the data corresponds to this format. We could have organized these data differently, but ggplot2 works best with tidy data formats. In a future tutorial, we will learn how to use the tidyverse’s dplyr and tidyr to transform wide-format data into tidy data formats. For now, it is important to recognize that ggplot2 works best with this type of data structure.
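As a small preview of what “tidy” means in practice, here is a minimal sketch. The counts are made-up placeholder values, not figures from the UN dataset, and the name wide_counts is hypothetical; the column names year, gender_lower, and n mirror the ones used in the area-chart code above. It assumes the tidyr package is installed.

```r
library(tidyr)

# Wide layout: one row per year, one column per gender (placeholder values).
wide_counts <- data.frame(
  year   = c(2001, 2002),
  female = c(3, 4),
  male   = c(20, 21)
)

# Tidy layout: one row per year-gender observation.
tidy_counts <- pivot_longer(
  wide_counts,
  cols      = c(female, male),
  names_to  = "gender_lower",
  values_to = "n"
)

tidy_counts
```

In the tidy version, gender is a single variable that can be handed to an aesthetic such as fill, which is not possible when each gender sits in its own column.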
The second layer is the aesthetics, known in the code as the “aes”. It takes the following information and commands.
- Inputs for the x and y axes, where x usually represents the independent variable and y the dependent variable.
- Color, which specifies the color used by a specific geometry (e.g., line, bar, or point).
- Fill, which is used to color the inside of a geometric object.
- Shape, specifying the figure used to plot a point in a graph (e.g., square or triangle).
- Linetype, the type of line used in a “geom_line” (e.g., solid or dashed).
- Size, which sets how large points and text are drawn (in recent versions of ggplot2, line thickness is set with linewidth instead).
- Alpha, which sets the transparency of a geometry.
- Group, which tells ggplot2 which observations belong together, for example which rows form a single line in a line graph.
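The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched as follows. This example is an illustration rather than part of the original analysis: it uses the mtcars dataset that ships with R, and the object names cars and p_aes are hypothetical.

```r
library(ggplot2)

# cyl is converted to a factor so ggplot2 treats it as a discrete variable.
cars <- transform(mtcars, cyl = factor(cyl))

# Mapped aesthetics go inside aes(): color and shape vary with cyl.
# Fixed aesthetics go outside aes(): every point gets size 3 and alpha 0.6.
p_aes <- ggplot(cars, aes(x = wt, y = mpg, color = cyl, shape = cyl)) +
  geom_point(size = 3, alpha = 0.6)

p_aes
```

Anything inside aes() gets a legend because it carries information from the data; fixed values outside aes() do not.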
The third layer sets the geometric object used to graph the data. We can choose from many geometric objects. Here are some of the most popular options.
- geom_point(): scatterplots.
- geom_line(): lines connecting points in order of increasing x value.
- geom_path(): lines connecting points in the order they appear in the data.
- geom_boxplot(): box and whisker plots, usually split by a categorical variable.
- geom_bar(): bar charts.
- geom_area(): area charts.
- geom_histogram(): histograms.
- geom_density(): density plots.
- geom_smooth(): adds a smoothed trend line, such as a regression line, to a scatterplot.
- geom_dumbbell(): dumbbell plots (this one comes from the ggalt package rather than ggplot2 itself).
We can even use other geometric objects to produce choropleth maps.
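Because the geometry is its own layer, the same data and aesthetics can be reused with different geoms. The sketch below is illustrative only: it uses the mpg dataset bundled with ggplot2, and the object names are hypothetical.

```r
library(ggplot2)

# mpg ships with ggplot2: displ is engine size, hwy is highway miles per gallon.
base <- ggplot(mpg, aes(x = displ, y = hwy))

p_points <- base + geom_point()                 # scatterplot
p_smooth <- base + geom_smooth(method = "lm")   # fitted line with a confidence band

# A boxplot needs a categorical x variable, so it gets its own aesthetics.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()

p_points
```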
The fourth layer allows us to split categorical variables into different facets or panels. ggplot2 arranges these plots side by side. The fifth layer is the statistics layer, where the package uses different statistical functions (e.g., means, standard deviations, or counts) to produce new values which can then be plotted in the graph. The sixth layer works on elements connected to the graph’s coordinates. These features are a bit more advanced, but this layer allows us to create circular or polar graphs, establish the size of the graph, and even locate specific elements, such as text labels, in specific parts of a graph. In this tutorial, we will not be using this layer.
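A compact way to see these three layers at work is to put one of each in a single plot. This is a sketch under assumptions: it uses the mpg dataset bundled with ggplot2 rather than the data from this post, and the name p_layers is hypothetical.

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = mean, geom = "bar") +  # statistics: mean of hwy for each class
  facet_wrap(~ drv) +                       # facets: one panel per drive type
  coord_flip()                              # coordinates: swap the x and y axes

p_layers
```

Removing any one of the three lines still gives a valid plot, which is a good way to see what each layer contributes.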
The last layer includes all the theme elements. This layer controls the look of the graph and it adds a graph’s most essential elements: the graph’s title, axes labels, legends, tick marks, grids, font types and size, the color of the background, and so forth.
To connect these layers, ggplot2 requires the user to use the “+”. The next figure shows all the elements of the second graph displayed above. It is worth noting all the layers and how they are connected by the “+”.
As noted above, we can produce a basic graph with just two lines, but to get maximum control over the “look” of our graphs, we can manipulate a graph’s multiple layers.
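Since each “+” simply adds a layer to an existing plot, a graph can also be stored in an object and built up one step at a time. A minimal sketch, using the mpg dataset bundled with ggplot2 and hypothetical labels:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy))  # data + aesthetics: an empty canvas
p <- p + geom_point()                      # geometry
p <- p + labs(title = "Engine Size and Highway Mileage",
              x = "Engine displacement (litres)",
              y = "Highway miles per gallon")  # labels
p <- p + theme_minimal()                   # theme

p  # printing the object draws the plot
```

Printing p after each step is a handy way to check what a given layer changed.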
Learning by Doing:
Now that we understand the philosophy that informs ggplot2, let’s open R and load the tidyverse and the gapminder package, which includes a dataset we will use to plot our graphs. Below you will find the code to reproduce these graphs. Feel free to cut and paste it into your RStudio environment.
Step 1: I am assuming you have already downloaded R and RStudio to your computer. If not, follow these instructions. Open your RStudio, open a new project, and install the tidyverse and gapminder packages.
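One way to handle the installation from the R console is sketched below. It only installs what is missing, and it assumes your R session can reach a CRAN mirror; the variable names are hypothetical.

```r
# Packages used in this tutorial.
pkgs <- c("tidyverse", "gapminder")

# Check which ones are not yet installed on this machine.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

# Install only the missing packages (this step needs an internet connection).
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Installation is only needed once per computer; loading the packages with library() is needed in every new session.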
Step 2: Now you can load your packages.
library(tidyverse)
library(gapminder)
Step 3: Get familiarized with the data.
When you print the first rows of the dataset, for example with head(gapminder), you should see a table of your data’s first rows and the names of each column.
country      continent  year   lifeExp  pop       gdpPercap
<fctr>       <fctr>     <int>  <dbl>    <dbl>     <dbl>
Afghanistan  Asia       1952   28.801    8425333  779.4453
Afghanistan  Asia       1957   30.332    9240934  820.8530
Afghanistan  Asia       1962   31.997   10267083  853.1007
Afghanistan  Asia       1967   34.020   11537966  836.1971
Afghanistan  Asia       1972   36.088   13079460  739.9811
Afghanistan  Asia       1977   38.438   14880372  786.1134
It should be noted that this dataframe is structured according to tidy principles.
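A few base R functions are commonly used for this kind of first look. A minimal sketch, assuming the gapminder package is installed:

```r
library(gapminder)

head(gapminder)      # the first six rows
str(gapminder)       # column names, types, and a preview of the values
summary(gapminder)   # basic summary statistics for each column
```

Each row is one country in one year and each column is one variable, which is the tidy structure ggplot2 expects.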
Step 4: Let’s plot the relationship between GDP per capita (gdpPercap) and life expectancy (lifeExp). Here is the first line of the code.
ggplot(data=gapminder, aes(x = gdpPercap, y = lifeExp))
Notice that we entered two of ggplot2’s layers: data and aesthetics, or the aes. On its own, this line only draws an empty canvas with the two axes. In order to see the relationship between the X and Y variables, we need to add a geom layer. Let’s do that and remember to add “+” to connect the layers.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()
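A few very high “gdpPercap” values stretch the x axis and squeeze most of the points against the left edge of the plot. Let’s spread them out with a simple log transformation of the x axis. Strictly speaking, scale_x_log10() is a scale rather than a stat layer: it changes how the axis is drawn, and the data themselves are left untouched.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()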
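The same log transformation can be reached in two ways, and the difference is mostly in how the axis is labeled. The comparison below is a sketch with hypothetical object names; it assumes the gapminder package is installed.

```r
library(ggplot2)
library(gapminder)

# Option 1: transform the axis. Tick labels stay in the original units.
p_scale <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Option 2: transform the variable. Tick labels show log10 values (3 means 1,000).
p_var <- ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point()

p_scale
```

The points land in the same positions either way; transforming the scale usually reads better because the axis still shows GDP per capita in its original units.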
Could we do something else? We could color the dots by continent. This is an aes layer and we can put it in the geom_point function or in the corresponding aes layer after the X and Y variables. The result will be the same.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10()
## OR:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
We could also fit a line to get a sense of what type of relationship exists between the two variables. Here we are adding a stat layer: geom_smooth() computes the fitted line from the data. Note that the reference to “lm” corresponds to a linear model.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10() +
  geom_smooth(method = "lm")
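Where the color mapping sits starts to matter once a plot has more than one geom. The sketch below, with hypothetical object names, contrasts the two placements; it assumes the gapminder package is installed.

```r
library(ggplot2)
library(gapminder)

# Color mapped only inside geom_point(): geom_smooth() sees a single group
# and fits one line through all the points.
p_one <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10() +
  geom_smooth(method = "lm")

# Color mapped in ggplot(): every layer inherits it, so geom_smooth()
# fits a separate line for each continent.
p_five <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = "lm")

p_five
```

Aesthetics placed in ggplot() apply to all layers, while aesthetics placed in a geom apply to that geom only.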
We have not used the facet layer yet. In this graph, we will use a “geom_line” rather than “geom_point”. Each line represents a country and each facet a continent.
ggplot(gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent) +
  scale_color_manual(values = continent_colors)
Of course, we can add theme layers to make these graphs look prettier. We will do that below. Let’s start from the following code, which we used above.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
Let’s add two extra lines.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "My Title", x = "GDP Per Capita", y = "Life Expectancy") +
  theme_minimal()
The labs command added the graph’s title and changed the labels of the X and Y axes. We changed the “look” of the graph with the theme_ command. In this case, we used theme_minimal(), which ships with the ggplot2 library. The package actually includes 10 themes. Just for the sake of example, let’s revise this code and replace theme_minimal() with theme_classic().
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "My Title", x = "GDP Per Capita", y = "Life Expectancy") +
  theme_classic()
If you want other looks, you can download the ggthemes package, which includes many more theme_ functions and color palettes.
So far, we have plotted graphs using two or more variables. ggplot2 can also produce histograms and density plots in order to carry out univariate analyses.
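As an illustration of what ggthemes adds, the sketch below swaps in one of its themes and one of its palettes. The choice of theme_economist() and scale_color_colorblind() is mine, not something this post depends on, and the code assumes ggthemes and gapminder are installed.

```r
library(ggplot2)
library(gapminder)
library(ggthemes)  # install.packages("ggthemes") if it is not installed yet

p_themed <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "My Title", x = "GDP Per Capita", y = "Life Expectancy") +
  theme_economist() +        # a theme from ggthemes
  scale_color_colorblind()   # a color-blind-friendly palette from ggthemes

p_themed
```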
ggplot(gapminder, aes(lifeExp))+ geom_histogram()
ggplot(gapminder, aes(lifeExp)) +
  geom_density(aes(fill = continent), alpha = 0.25) +
  theme_light()
In the last code snippet, notice that we added: “alpha = 0.25”. This sets the transparency of the density plots, allowing us to understand the distribution of the life expectancy data by continent.
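Histograms have a similar knob worth knowing about: the width of the bins. Called without arguments, geom_histogram() falls back on 30 bins and prints a message suggesting you pick a better value. A minimal sketch, where the 5-year bin width is an arbitrary choice for illustration:

```r
library(ggplot2)
library(gapminder)

# Each bar covers 5 years of life expectancy; the white outline separates the bars.
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white")

p_hist
```

Trying a few bin widths is worthwhile, since very wide bins can hide features of the distribution and very narrow ones can exaggerate noise.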
After reading this tutorial and playing around with R, ggplot2, and gapminder, I hope you are less intimidated by coding in R and using ggplot2’s graphing functions. I will be adding more tutorials in the near future on how to use tidy principles to restructure existing datasets. I will also explain how we can use ggplot2 to produce maps and other types of visuals.
Want to learn more about coding in R and plotting with ggplot2?
There are many channels on YouTube that teach viewers how to use ggplot2. I personally enjoy Pat Schloss’ Riffomonas Project. While it is not recommended for beginners, he explains clearly how to use the theme layers to produce very complex graphs. For beginners, I highly recommend Greg Martin’s R Programming 101. His videos cover everything from using the tidyverse to graphing with ggplot2.
About the author:
Carlos L. Yordán is an Associate Professor of International Relations at Drew University. He is also the director of the Semester on the United Nations.