The World Bank’s DataBank currently hosts over 80 databases that include “time series data on a multitude of topics for many countries around the world.” The World Development Indicators (WDI) database is probably the DataBank’s most popular, and a crucial resource for policy experts and academics.
There are multiple ways to access the WDI. One way is to visit the World Bank’s data portal which allows users to not only gain access to the WDI’s multiple indicators but also segment the data by years, countries, and so forth. The portal allows users to create different types of graphs, which can be downloaded and pasted into their documents. In addition, users can download the data associated with their queries in many different formats. Then these data can be analyzed in Google Sheets, Microsoft Excel, or more specialized statistical software (e.g., Stata). For those interested in this way of accessing and downloading the World Bank’s data, I highly recommend an easy-to-follow tutorial developed by Western Michigan University’s Global Studies Progam.
In this post, I will use R‘s WDI package, developed by Vincent Arel-Bundock, to download, wrangle, analyze and plot the World Bank’s data. Is this technique easier than using the World Bank’s data portal or using a spreadsheet application to analyze and graph these data? It depends on users’ level of comfort with R and the tidyverse’s many packages for data manipulation and graphing. I think that learning how to use R to access these data is less time-consuming and the research will be easier to reproduce and replicate.
Working with the WDI Package:
If you have not downloaded and installed R and RStudio on your computer, follow these instructions. If you have done so already, open RStudio, create a new project, and if you have not done so already install the tidyverse, WDI, and the scales packages. The last of these packages work with ggplot2 (which is part of the tidyverse) to transform numbers in scientific notations into other values or formats.
install.packages ("WDI")
install.packages ("tidyverse")
install.packages ("scales")
Once the installations are complete, you will then load the libraries.
library (WDI)
library (tidyverse)
library (scales)
One of the challenges of working with the World Bank’s WDI is that it “contains 1,400 time-series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.” Luckily, Arel-Bundock’s WDI package includes an easy-to-use search function that produces the indicator’s ID and its name.
Example 1: No Data Wrangling
If you are doing research on military issues, you can use this search function to see if the WDI includes any associated indicators. Using the function is straightforward.
WDIsearch ("military")
Here is the output.
indicator
[1,] "MS.MIL.XPND.CD"
[2,] "MS.MIL.XPND.CN"
[3,] "MS.MIL.XPND.GD.ZS"
[4,] "MS.MIL.XPND.GN.ZS"
[5,] "MS.MIL.XPND.ZS"
[6,] "VC.PKP.TOTL.UN"
name
[1,] "Military expenditure (current USD)"
[2,] "Military expenditure (current LCU)"
[3,] "Military expenditure (% of GDP)"
[4,] "Military expenditure (% of GNI)"
[5,] "Military expenditure (% of general government expenditure)"
[6,] "Presence of peace keepers (number of troops, police, and military observers in mandate)"
The sequence between the two lists is the same. Thus, the first indicator corresponds to the first name and so forth.
To access the data for the first indicator, we use the following function.
df_mil <- WDI(
country = "all",
indicator = "MS.MIL.XPND.CD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en")
head(df_mil)
We can use the head function to look at the output’s first rows and get familiarized with the way the data has been structured.
iso2 country MS.MIL.XPND.CD year
<chr> <chr> <dbl> <int>
1 ZH Africa Eastern and Southern 10576660181 2020
2 ZH Africa Eastern and Southern 11708245732 2019
3 ZH Africa Eastern and Southern 12265438298 2018
4 ZH Africa Eastern and Southern 16158643503 2017
5 ZH Africa Eastern and Southern 13659413872 2016
6 ZH Africa Eastern and Southern 15383318098 2015
6 rows
The ISO2 is a two-letter country code, developed by the International Organization for Standardization. For example, “US” is the code for the United States. In this case, “ZH” does not represent a country, but one of the World Bank’s “country groups” for states in East and Southern Africa. The military expenditure is listed under the MS.MIL.XZPND.CD column, which is the ID of the time series. It is important to highlight that <dbl> stands for double and this basically a numeric variable that can read numbers with decimals. The <int> is short for integer, which is also numerical.
If you are looking for military expenditure data for the United States, you can do so easily.
df_us_mil <- WDI(
country = "US",
indicator = "MS.MIL.XPND.CD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en")
If you only want to access data for the UN Security Council’s permanent members (P5) after the end of the Cold War, you would specify the following information in the WDI function.
df_p5_mil <- WDI(
country = c("US", "FR", "RU", "CN", "GB"),
indicator = "MS.MIL.XPND.CD",
start = 1992,
end = 2020,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en"
)
“GB” stands for the United Kingdom. Also, note that we have delimited the search from 1992-2020 as this represents the years following the Cold War.
If we want to plot these data with a line graph, it is pretty simple with ggplot2. But before doing this, it is important to make sure the data is structured according to tidy data principles. As I noted in the tutorial on graphing in R with ggplot2, to use this package, the data needs to be in a tidy format. To look at our data we can use the head function, which will display the top rows of the data set and the column names.
head (df_p5_mil)
Here is the output.
iso2 country MS.MIL.XPND.CD year
<chr> <chr> <dbl> <int>
1 CN China 252304223741 2020
2 CN China 240332555458 2019
3 CN China 232530595985 2018
4 CN China 210443034741 2017
5 CN China 198538361428 2016
6 CN China 196538827231 2015
This dataframe is structured using the following tidy data principles:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Hadley Wickham (2014)
Given the structure of our dataframe, we do not need to restructure the data. Let’s use ggplot2 to generate a quick line graph.
ggplot (df_p5_mil, aes (x=year, y= MS.MIL.XPND.CD, color = country))+
geom_line (size = 1)+
theme_minimal ()
This is a good starting point but the graph needs some extra elements.
ggplot (df_p5_mil, aes (x=year, y= MS.MIL.XPND.CD, color = country))+
geom_line (size = 1)+
theme_minimal()+
scale_y_continuous (labels =unit_format(unit = "", scale = 1e-9, prefix = "$"))+
labs (title = "The P5's Military Expenditures (1992-2020)",
x = "",
y= "US Dollars (Billions)")
What did we add? We changed the scales from the scientific notation “8e+11 to “$800”. In order to make sure our scale was in the billions of US Dollars we added a new label to the y axis, from “MS.MIL.XPND.CD” to”US Dollars (Billions)”. In addition, we removed the x-axis label from year to “” in the labs function. We can add other theme layers to change the size of the font, the font family, the color palette, and so forth. But this graph works for now.
This graph informs us that the US’ military expenditures are much, much higher than its competitors. Since the early 2000s, China’s expenditures have also increased.
Example 2: Learning to Tidy our Data
Let’s look at what data the World Bank’s WDI has for literacy.
WDIsearch ("literacy")
I am not going to show the output as it includes more than 190 indicators. After scanning the list, I decided to work with “SE.ADT.ILIT.FE.ZS” and “SE.ADT.ILIT.MA.ZS”. The first measures the literacy rate for adult (e.g., 15 and older) women and the other for adult men.
We can download these two datasets at the same time. Note the line for “indicator”. This is the way to pull more than one indicator. Also, I limited our search for observations from 1980 to 2020.
litr <- WDI(
country = "all",
indicator = c("SE.ADT.LITR.FE.ZS","SE.ADT.LITR.MA.ZS"),
start = 1980,
end = 2020,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en"
)
I want to produce a graph that compares literacy rates among adult men and women. But we before can we do so, we need to first look at the structure of our data.
head (litr)
Here is a summary of the data.
iso2c country year SE.ADT.LITR.FE.ZS SE.ADT.LITR.MA.ZS
<chr> <chr> <int> <dbl> <dbl>
1 1A Arab World 1980 31.94252 60.02993
2 1A Arab World 1981 32.59526 60.47446
3 1A Arab World 1982 33.40364 61.00201
4 1A Arab World 1983 33.85635 60.96328
5 1A Arab World 1984 34.84444 61.69899
6 1A Arab World 1985 35.72112 62.49827
This does not correspond to a tidy data structure. In order to build a plot that compares two values, we will have to use the tidyr package to restructure the dataframe. Also, let’s change the name of the variables for the two literacy rates as well, using the dplyr package. These two packages are part of the tidyverse, so we don’t have to upload them individually.
df_litr <- litr%>%
rename (male = SE.ADT.LITR.MA.ZS)%>%
rename (female = SE.ADT.LITR.FE.ZS)
Now we can use tidyr’s pivot_longer function to tidy our dataframe.
tidy_litr <-pivot_longer (df_litr, cols = c(female, male), names_to = "gender", values_to = "literacy_rate")
head (tidy_litr)
Here is the output. Notice how the pivot_longer function created two new variables called “gender” and “literacy_rate”. This is a great function!
ISO2c country year gender literacy_rate
<chr> <chr> <int> <chr> <dbl>
1A Arab World 1980 female 31.94252
1A Arab World 1980 male 60.02993
1A Arab World 1981 female 32.59526
1A Arab World 1981 male 60.47446
1A Arab World 1982 female 33.40364
1A Arab World 1982 male 61.00201
Looking at the data, we can already tell there is a discrepancy between adult men and women regarding their rate of literacy. Let’s see whether Arab women have been able to make some gains since 1980.
1a_df <- tidy_litr %>%
filter (country == "Arab World") %>%
ggplot (aes (x= year, y = literacy_rate, color= gender))+
geom_line(size = 1)+
theme_minimal()+
labs (title= "Comparing the Literacy Rate Among Adult Men and Women\nin the Arab World (1980-2020)",
x="",
y = "Literacy Rate (%)")
ia_df
Before we see the line graph, it is important to note that we are using dplyr’s filter function to only extract the rows connected to the Arab World. The use of this symbol ” %>%”, known as pipe operator, allows us to chain various functions. Usually, in ggplot2, we are supposed to enter the dataframe before the aes layer. But because we already entered the dataframe at the start of the code chunk, we don’t need to do so in the ggplot2 function.
While Arab women have made gains, men have higher rates of literacy. How does this trend compare to the world average or other regions of the world? Let’s rework the last dataframe and use the ggplot2‘s facet_wrap function to display multiple graphs side by side.
facet_litr_graph <- tidy_litr %>%
filter (country == c("Sub-Saharan Africa","Arab World", "South Asia", "East Asia & Pacific", "Latin America & Caribbean", "Europe & Central Asia", "World")) %>%
ggplot (aes (x= year, y = literacy_rate, color= gender))+
geom_line( size = 1)+
facet_wrap (~ country, scales = "free")+
scale_y_continuous(limits = c(20, 100))+
theme_minimal()+
labs (title= "Comparing the Literacy Rate of Adult Men and Women in Different Regions of\nthe World (1980-2020)",
x="Year",
y = "Literacy Rate (%)")
facet_litr_graph
The facet graph is pasted below. Let’s highlight a few things. In terms of the scale_y_continous function, we used this to set the limits of our scale. Thus, all the graphs start at 20 and continue to 100. Also, in the facet_wrap function, we tell ggplot2 to display the graphs by “country”, which in this case captures a few of the World Bank’s “country groups”.
By faceting these graphs, we can see that in terms of the literacy rate for adult women in the Arab world is trailing adult women in East Asia and the Pacific, Europe and Central Asia, Latin America, and the Caribbean. Their performance is stronger than Sub-Saharan Africa and closer to South Asia’s. In addition, Arab women have some ways to go before they can catch up or meet the world’s average for literacy.
Concluding Thoughts:
The World Bank’s DataBank is probably one of the best repositories of economics and development datasets in the world. Its collection of world development indicators (WDI) is one of the most important resources for students, scholars, and policy experts working in the disciplines of political science, international relations, and economics. While users can access and download these data through the World Bank’s data portal and analyze the data using spreadsheet applications or more specialized statistical software packages, in this post, I explain why we should use Arel-Bundock‘s WDI package for R to download, analyze and graph these data.
Using this package is not too difficult. The more challenging part of this workflow is the restructuring of the data into a tidy format. But once we understand how to do this, we can repurpose our code to execute other analyzes using the WDI data, saving us some time, while also helping us produce research that is reproducible and replicable.
Want to learn more about the WDI package, tidyr or dplyr?
For another tutorial on how to use the WDI package in R, read Steven V. Miller’s blog post.
For more information on dplyr‘s basic functions, I recommend Data Slice’s YouTube channel and his video titled: “Dplyr Essentials.”
On some R basics, including dplyr functions, see Greg Martin’s R Programming 101 and more specifically his video: “R programming for beginners. Manipulate data using the tidyverse: select, filter and mutate“.
About the author:
Carlos L. Yordán is an Associate Professor of International Relations at Drew University. He is also the director of the Semester on the United Nations.