Data Wrangling and Graphing World Bank Data in R with the WDI Package

The World Bank’s DataBank currently hosts over 80 databases that include “time series data on a multitude of topics for many countries around the world.” The World Development Indicators (WDI) database is probably the DataBank’s most popular, and a crucial resource for policy experts and academics.

There are multiple ways to access the WDI. One way is to visit the World Bank’s data portal which allows users to not only gain access to the WDI’s multiple indicators but also segment the data by years, countries, and so forth. The portal allows users to create different types of graphs, which can be downloaded and pasted into their documents. In addition, users can download the data associated with their queries in many different formats. Then these data can be analyzed in Google Sheets, Microsoft Excel, or more specialized statistical software (e.g., Stata). For those interested in this way of accessing and downloading the World Bank’s data, I highly recommend an easy-to-follow tutorial developed by Western Michigan University’s Global Studies Progam.

In this post, I will use R‘s WDI package, developed by Vincent Arel-Bundock, to download, wrangle, analyze and plot the World Bank’s data. Is this technique easier than using the World Bank’s data portal or using a spreadsheet application to analyze and graph these data? It depends on users’ level of comfort with R and the tidyverse’s many packages for data manipulation and graphing. I think that learning how to use R to access these data is less time-consuming and the research will be easier to reproduce and replicate.

Working with the WDI Package:

If you have not downloaded and installed R and RStudio on your computer, follow these instructions. If you have done so already, open RStudio, create a new project, and if you have not done so already install the tidyverse, WDI, and the scales packages. The last of these packages work with ggplot2 (which is part of the tidyverse) to transform numbers in scientific notations into other values or formats.

install.packages ("WDI")
install.packages ("tidyverse")
install.packages ("scales")

Once the installations are complete, you will then load the libraries.

library (WDI)
library (tidyverse)
library (scales)

One of the challenges of working with the World Bank’s WDI is that it “contains 1,400 time-series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.” Luckily, Arel-Bundock’s WDI package includes an easy-to-use search function that produces the indicator’s ID and its name.

Example 1: No Data Wrangling

If you are doing research on military issues, you can use this search function to see if the WDI includes any associated indicators. Using the function is straightforward.

WDIsearch ("military")

Here is the output.

indicator          
[1,] "MS.MIL.XPND.CD"   
[2,] "MS.MIL.XPND.CN"   
[3,] "MS.MIL.XPND.GD.ZS"
[4,] "MS.MIL.XPND.GN.ZS"
[5,] "MS.MIL.XPND.ZS"   
[6,] "VC.PKP.TOTL.UN"   
     name                                                                                     
[1,] "Military expenditure (current USD)"                                                     
[2,] "Military expenditure (current LCU)"                                                     
[3,] "Military expenditure (% of GDP)"                                                        
[4,] "Military expenditure (% of GNI)"                                                        
[5,] "Military expenditure (% of general government expenditure)"                             
[6,] "Presence of peace keepers (number of troops, police, and military observers in mandate)"

The sequence between the two lists is the same. Thus, the first indicator corresponds to the first name and so forth.

To access the data for the first indicator, we use the following function.

df_mil <- WDI(
  country = "all",
  indicator = "MS.MIL.XPND.CD",
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en")

head(df_mil)

We can use the head function to look at the output’s first rows and get familiarized with the way the data has been structured.


       iso2    country                      MS.MIL.XPND.CD      year
        <chr>   <chr>                             <dbl>         <int>
1	ZH	Africa Eastern and Southern	10576660181	2020
2	ZH	Africa Eastern and Southern	11708245732	2019
3	ZH	Africa Eastern and Southern	12265438298	2018
4	ZH	Africa Eastern and Southern	16158643503	2017
5	ZH	Africa Eastern and Southern	13659413872	2016
6	ZH	Africa Eastern and Southern	15383318098	2015
6 rows

The ISO2 is a two-letter country code, developed by the International Organization for Standardization. For example, “US” is the code for the United States. In this case, “ZH” does not represent a country, but one of the World Bank’s “country groups” for states in East and Southern Africa. The military expenditure is listed under the MS.MIL.XZPND.CD column, which is the ID of the time series. It is important to highlight that <dbl> stands for double and this basically a numeric variable that can read numbers with decimals. The <int> is short for integer, which is also numerical.

If you are looking for military expenditure data for the United States, you can do so easily.

df_us_mil <- WDI(
country = "US",
indicator = "MS.MIL.XPND.CD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en")

If you only want to access data for the UN Security Council’s permanent members (P5) after the end of the Cold War, you would specify the following information in the WDI function.

df_p5_mil <- WDI(
  country = c("US", "FR", "RU", "CN", "GB"), 
  indicator = "MS.MIL.XPND.CD",
  start = 1992,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
)

“GB” stands for the United Kingdom. Also, note that we have delimited the search from 1992-2020 as this represents the years following the Cold War.

If we want to plot these data with a line graph, it is pretty simple with ggplot2. But before doing this, it is important to make sure the data is structured according to tidy data principles. As I noted in the tutorial on graphing in R with ggplot2, to use this package, the data needs to be in a tidy format. To look at our data we can use the head function, which will display the top rows of the data set and the column names.

head (df_p5_mil)

Here is the output.

       iso2     country MS.MIL.XPND.CD   year
       <chr>    <chr>   <dbl>           <int>
1	CN	China	252304223741	2020
2	CN	China	240332555458	2019
3	CN	China	232530595985	2018
4	CN	China	210443034741	2017
5	CN	China	198538361428	2016
6	CN	China	196538827231	2015

This dataframe is structured using the following tidy data principles:

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table.

Hadley Wickham (2014)

Given the structure of our dataframe, we do not need to restructure the data. Let’s use ggplot2 to generate a quick line graph.

ggplot (df_p5_mil, aes (x=year, y= MS.MIL.XPND.CD, color = country))+
  geom_line (size = 1)+
  theme_minimal ()

This is a good starting point but the graph needs some extra elements.

ggplot (df_p5_mil, aes (x=year, y= MS.MIL.XPND.CD, color = country))+
  geom_line (size = 1)+
  theme_minimal()+
  scale_y_continuous (labels =unit_format(unit = "", scale = 1e-9, prefix = "$"))+
  labs (title = "The P5's Military Expenditures (1992-2020)",
        x = "",
        y= "US Dollars (Billions)")

What did we add? We changed the scales from the scientific notation “8e+11 to “$800”. In order to make sure our scale was in the billions of US Dollars we added a new label to the y axis, from “MS.MIL.XPND.CD” to”US Dollars (Billions)”. In addition, we removed the x-axis label from year to “” in the labs function. We can add other theme layers to change the size of the font, the font family, the color palette, and so forth. But this graph works for now.

This graph informs us that the US’ military expenditures are much, much higher than its competitors. Since the early 2000s, China’s expenditures have also increased.

Example 2: Learning to Tidy our Data

Let’s look at what data the World Bank’s WDI has for literacy.

WDIsearch ("literacy")

I am not going to show the output as it includes more than 190 indicators. After scanning the list, I decided to work with “SE.ADT.ILIT.FE.ZS” and “SE.ADT.ILIT.MA.ZS”. The first measures the literacy rate for adult (e.g., 15 and older) women and the other for adult men.

We can download these two datasets at the same time. Note the line for “indicator”. This is the way to pull more than one indicator. Also, I limited our search for observations from 1980 to 2020.

litr <- WDI(
  country = "all",
  indicator = c("SE.ADT.LITR.FE.ZS","SE.ADT.LITR.MA.ZS"),
  start = 1980,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
) 

I want to produce a graph that compares literacy rates among adult men and women. But we before can we do so, we need to first look at the structure of our data.

head (litr)

Here is a summary of the data.


       iso2c      country       year  SE.ADT.LITR.FE.ZS SE.ADT.LITR.MA.ZS
       <chr>       <chr>        <int>   <dbl>           <dbl>
1	1A	Arab World	1980	31.94252	60.02993
2	1A	Arab World	1981	32.59526	60.47446
3	1A	Arab World	1982	33.40364	61.00201
4	1A	Arab World	1983	33.85635	60.96328
5	1A	Arab World	1984	34.84444	61.69899
6	1A	Arab World	1985	35.72112	62.49827

This does not correspond to a tidy data structure. In order to build a plot that compares two values, we will have to use the tidyr package to restructure the dataframe. Also, let’s change the name of the variables for the two literacy rates as well, using the dplyr package. These two packages are part of the tidyverse, so we don’t have to upload them individually.

df_litr <- litr%>%
  rename (male = SE.ADT.LITR.MA.ZS)%>%
  rename (female = SE.ADT.LITR.FE.ZS)

Now we can use tidyr’s pivot_longer function to tidy our dataframe.

tidy_litr <-pivot_longer (df_litr, cols = c(female, male), names_to = "gender", values_to = "literacy_rate")

head (tidy_litr)

Here is the output. Notice how the pivot_longer function created two new variables called “gender” and “literacy_rate”. This is a great function!


ISO2c    country        year    gender  literacy_rate
<chr>    <chr>          <int>   <chr>   <dbl>
1A	Arab World	1980	female	31.94252
1A	Arab World	1980	male	60.02993
1A	Arab World	1981	female	32.59526
1A	Arab World	1981	male	60.47446
1A	Arab World	1982	female	33.40364
1A	Arab World	1982	male	61.00201

Looking at the data, we can already tell there is a discrepancy between adult men and women regarding their rate of literacy. Let’s see whether Arab women have been able to make some gains since 1980.

1a_df <- tidy_litr %>%
  filter (country == "Arab World") %>%
  ggplot (aes (x= year, y = literacy_rate, color= gender))+
  geom_line(size = 1)+
  theme_minimal()+
  labs (title= "Comparing the Literacy Rate Among Adult Men and Women\nin the Arab World (1980-2020)",
        x="",
        y = "Literacy Rate (%)")

ia_df

Before we see the line graph, it is important to note that we are using dplyr’s filter function to only extract the rows connected to the Arab World. The use of this symbol ” %>%”, known as pipe operator, allows us to chain various functions. Usually, in ggplot2, we are supposed to enter the dataframe before the aes layer. But because we already entered the dataframe at the start of the code chunk, we don’t need to do so in the ggplot2 function.

While Arab women have made gains, men have higher rates of literacy. How does this trend compare to the world average or other regions of the world? Let’s rework the last dataframe and use the ggplot2‘s facet_wrap function to display multiple graphs side by side.

facet_litr_graph <- tidy_litr %>%
  filter (country == c("Sub-Saharan Africa","Arab World", "South Asia", "East Asia & Pacific", "Latin America & Caribbean", "Europe & Central Asia", "World")) %>%
  ggplot (aes (x= year, y = literacy_rate, color= gender))+
  geom_line( size = 1)+
  facet_wrap (~ country, scales = "free")+
  scale_y_continuous(limits = c(20, 100))+
  theme_minimal()+
  labs (title= "Comparing the Literacy Rate of Adult Men and Women in Different Regions of\nthe World (1980-2020)",
        x="Year",
        y = "Literacy Rate (%)")

facet_litr_graph

The facet graph is pasted below. Let’s highlight a few things. In terms of the scale_y_continous function, we used this to set the limits of our scale. Thus, all the graphs start at 20 and continue to 100. Also, in the facet_wrap function, we tell ggplot2 to display the graphs by “country”, which in this case captures a few of the World Bank’s “country groups”.

By faceting these graphs, we can see that in terms of the literacy rate for adult women in the Arab world is trailing adult women in East Asia and the Pacific, Europe and Central Asia, Latin America, and the Caribbean. Their performance is stronger than Sub-Saharan Africa and closer to South Asia’s. In addition, Arab women have some ways to go before they can catch up or meet the world’s average for literacy.

Concluding Thoughts:

The World Bank’s DataBank is probably one of the best repositories of economics and development datasets in the world. Its collection of world development indicators (WDI) is one of the most important resources for students, scholars, and policy experts working in the disciplines of political science, international relations, and economics. While users can access and download these data through the World Bank’s data portal and analyze the data using spreadsheet applications or more specialized statistical software packages, in this post, I explain why we should use Arel-Bundock‘s WDI package for R to download, analyze and graph these data.

Using this package is not too difficult. The more challenging part of this workflow is the restructuring of the data into a tidy format. But once we understand how to do this, we can repurpose our code to execute other analyzes using the WDI data, saving us some time, while also helping us produce research that is reproducible and replicable.

Want to learn more about the WDI package, tidyr or dplyr?

For another tutorial on how to use the WDI package in R, read Steven V. Miller’s blog post.

For more information on dplyr‘s basic functions, I recommend Data Slice’s YouTube channel and his video titled: “Dplyr Essentials.”

On some R basics, including dplyr functions, see Greg Martin’s R Programming 101 and more specifically his video: “R programming for beginners. Manipulate data using the tidyverse: select, filter and mutate“.

About the author:

Carlos L. Yordán is an Associate Professor of International Relations at Drew University. He is also the director of the Semester on the United Nations.

css.php