Introduction to Tidyverse

RMarkdown file to accompany the tidyverse + gapminder workshop presentation

Janani Ravi https://jravilab.github.io (R-Ladies East Lansing | PDI @MSU) https://rladies-eastlansing.github.io
2022-02-24

About me

I am a computational biologist, an Asst. Professor in the department of Pathobiology & Diagnostic Investigation at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.

I’m also the founder & co-organizer of the R-Ladies East Lansing group on campus. We conduct R-related workshops & meetups regularly! So, do check out our upcoming events on Meetup.

Part 1: Getting Started w/ readr

You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.

Installation and set-up

Install RStudio

Running RStudio locally? Download RStudio

Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version

Trouble with local installation? Login & start using RStudio Cloud right away!

New to RStudio IDE? Use Help Page #1 & Page #2

Install R

… if you haven’t already! The RStudio startup message shows the version of R installed locally, e.g., R v4.0.5.

Install tidyverse & other datasets

install.packages("tidyverse") # for data wrangling
install.packages("gapminder") # sample dataset

Trouble installing tidyverse? Install the core packages individually:

install.packages("readr")  # Importing data files
# install.packages("readxl") # Importing excel files
install.packages("tidyr")  # Tidy Data
install.packages("dplyr")  # Data manipulation
install.packages("ggplot2")  # Data Visualization (w/ Grammar of Graphics)

Loading packages

library(tidyverse)
# OR load the individual packages:
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)

library(gapminder)

Some useful cheatsheets

Cheatsheets @RStudio

You can also access all relevant R/RStudio/Slack cheatsheets on our GitHub repo.

Data import

library(tidyverse)
library(readxl)  # read_excel() lives in readxl, which library(tidyverse) does not attach
read_csv(file="my_data.csv",
         col_names=TRUE)    # comma-separated values, as exported from Excel/spreadsheets
read_delim(file="my_data.txt", col_names=TRUE,
           delim="//")  # any delimiter
# Other useful packages
# readxl by Jenny Bryan
read_excel(path="path/to/excel.xls",
          sheet=1,
          range="A1:D50",
          col_names=TRUE)
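To try the import functions without an existing file, the sketch below first writes a tiny CSV to a temporary location and then reads it back. The file contents and the names `tmp_csv` and `toy` are made up for illustration.

```r
library(readr)

# Hypothetical toy file: a header plus three rows, written to a temporary path
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "India,2007,64.7",
             "Japan,2007,82.6",
             "Norway,2007,80.2"),
           tmp_csv)

# col_types spells out the expected type of each column:
# c = character, i = integer, d = double
toy <- read_csv(file = tmp_csv, col_names = TRUE, col_types = "cid")
toy
```

Stating `col_types` explicitly is optional, but it makes the import fail loudly if a column does not look the way you expect.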

Loading existing datasets

Gapminder

We will work with the Gapminder dataset by Hans Rosling.

Unveiling the beauty of statistics for a fact based world view. Gapminder.org

Tools to generate their trademark bubble charts

Snapshot of their data

Knowing your data

# gapminder::gapminder
str(gapminder)    # Structure of the dataframe
gapminder       # Data is in a cleaned-up 'tibble' format by default
head(gapminder)    # Shows the top few observations (rows) of your data frame
glimpse(gapminder)  # Info-dense summary of the data
View(gapminder)    # View data in a visual GUI-based spreadsheet-like format

Running the code step by step

library(tidyverse)
── Attaching packages ─────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.8
✓ tidyr   1.2.0     ✓ stringr 1.4.0
✓ readr   2.1.2     ✓ forcats 0.5.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(gapminder)
str(gapminder)    # Structure of the dataframe
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gapminder       # Data is in a cleaned-up 'tibble' format by default
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows
head(gapminder)    # Shows the top few observations (rows) of your data frame
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
glimpse(gapminder)  # Info-dense summary of the data
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanist…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, …
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, …
View(gapminder)    # View data in a visual GUI-based spreadsheet-like format
library(knitr)
kable(head(gapminder))
country      continent  year  lifeExp       pop  gdpPercap
Afghanistan  Asia       1952   28.801   8425333   779.4453
Afghanistan  Asia       1957   30.332   9240934   820.8530
Afghanistan  Asia       1962   31.997  10267083   853.1007
Afghanistan  Asia       1967   34.020  11537966   836.1971
Afghanistan  Asia       1972   36.088  13079460   739.9811
Afghanistan  Asia       1977   38.438  14880372   786.1134

Part 2: Reshaping data with tidyr

# gather()  # Gather COLUMNS -> ROWS
# spread()  # Spread ROWS -> COLUMNS
pivot_longer()  # wide -> long
pivot_wider()   # long -> wide
separate()      # Separate 1 COLUMN -> many COLUMNS
unite()          # Unite several COLUMNS -> 1 COLUMN

Data preparation

# We'll use the R built-in USArrests data set (datasets package). We start by subsetting a small dataset
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Georgia      17.4     211       60 25.8
Maryland     11.3     300       67 27.8
New Jersey    7.4     159       89 18.8
# Row names are states, so let's use the function bind_cols() to add a column named "state" in the data. This will make the data tidy and the analysis easier.
my_data <- bind_cols(state = rownames(my_data),
                     my_data)
my_data
                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8
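As an alternative to `bind_cols()`, the tibble package has a helper that moves row names into a regular column in one step. The sketch below assumes the same four-state subset; `my_data_tbl` is a name chosen for this example.

```r
library(tibble)

my_data <- USArrests[c(1, 10, 20, 30), ]

# Move the row names into a regular column called "state";
# the result no longer carries the state names as row names
my_data_tbl <- rownames_to_column(my_data, var = "state")
my_data_tbl
```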

Gather

Gather columns into key-value pairs. Wide -> Long

# Gather all columns except the column state
my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2
        state arrest_attribute arrest_estimate
1     Alabama           Murder            13.2
2     Georgia           Murder            17.4
3    Maryland           Murder            11.3
4  New Jersey           Murder             7.4
5     Alabama          Assault           236.0
6     Georgia          Assault           211.0
7    Maryland          Assault           300.0
8  New Jersey          Assault           159.0
9     Alabama         UrbanPop            58.0
10    Georgia         UrbanPop            60.0
11   Maryland         UrbanPop            67.0
12 New Jersey         UrbanPop            89.0
13    Alabama             Rape            21.2
14    Georgia             Rape            25.8
15   Maryland             Rape            27.8
16 New Jersey             Rape            18.8
# Gather only Murder and Assault columns
my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder, Assault)
my_data2
       state UrbanPop Rape arrest_attribute arrest_estimate
1    Alabama       58 21.2           Murder            13.2
2    Georgia       60 25.8           Murder            17.4
3   Maryland       67 27.8           Murder            11.3
4 New Jersey       89 18.8           Murder             7.4
5    Alabama       58 21.2          Assault           236.0
6    Georgia       60 25.8          Assault           211.0
7   Maryland       67 27.8          Assault           300.0
8 New Jersey       89 18.8          Assault           159.0
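The same wide-to-long reshaping can be written with `pivot_longer()`, the newer function listed at the start of this part. This is a minimal sketch on the same four-state subset; `my_long` is a name chosen for this example.

```r
library(tidyr)

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Wide -> long: every column except state becomes a name/value pair
my_long <- pivot_longer(my_data,
                        cols = -state,
                        names_to = "arrest_attribute",
                        values_to = "arrest_estimate")
my_long
```

Note that `pivot_longer()` returns the rows state by state, whereas `gather()` returns them attribute by attribute, so the row order differs even though the content is the same.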

Spread

Spread a key-value pair across multiple columns: Long -> Wide

# Spread "my_data2" to turn back to the original data:
my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
)
my_data3
       state UrbanPop Rape Assault Murder
1    Alabama       58 21.2     236   13.2
2    Georgia       60 25.8     211   17.4
3   Maryland       67 27.8     300   11.3
4 New Jersey       89 18.8     159    7.4
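The long-to-wide step has a `pivot_wider()` counterpart as well. The sketch below rebuilds a long table from the Murder and Assault columns and then widens it again; `my_long` and `my_wide` are names chosen for this example.

```r
library(tidyr)

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Long version of the Murder and Assault columns only
my_long <- pivot_longer(my_data,
                        cols = c(Murder, Assault),
                        names_to = "arrest_attribute",
                        values_to = "arrest_estimate")

# Long -> wide: one new column per value of arrest_attribute
my_wide <- pivot_wider(my_long,
                       names_from = arrest_attribute,
                       values_from = arrest_estimate)
my_wide
```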

Unite

Unite multiple columns into one

# The R code below uses the data set "my_data" and unites the columns Murder and Assault
my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4
                state Murder_Assault UrbanPop Rape
Alabama       Alabama       13.2_236       58 21.2
Georgia       Georgia       17.4_211       60 25.8
Maryland     Maryland       11.3_300       67 27.8
New Jersey New Jersey        7.4_159       89 18.8

Separate

Separate one column into multiple columns

separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")
                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8
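One detail worth knowing: `separate()` splits text, so the new columns are character vectors unless you ask for a type conversion. A small sketch using the `convert` argument follows; `my_united` and `my_separated` are names chosen for this example.

```r
library(tidyr)

my_data <- USArrests[c(1, 10, 20, 30), ]
my_united <- unite(my_data, col = "Murder_Assault", Murder, Assault, sep = "_")

# convert = TRUE re-guesses the column types after splitting;
# without it, Murder and Assault would both be character
my_separated <- separate(my_united,
                         col = "Murder_Assault",
                         into = c("Murder", "Assault"),
                         sep = "_",
                         convert = TRUE)
str(my_separated)
```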

Part 3: Data wrangling with dplyr

filter()  # PICK observations by their values | ROWS
select()  # PICK variables by their names | COLUMNS
mutate()  # CREATE new variables w/ functions of existing variables | COLUMNS
transmute()  # COMPUTE 1 or more COLUMNS but drop original columns
arrange()  # REORDER the ROWS
summarize()  # COLLAPSE many values to a single SUMMARY
group_by()  # GROUP data into rows with the same value of variable (COLUMN)

Filter

Return rows with matching conditions

head(gapminder)  # Snapshot of the dataframe
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
# Now, filter by year and look at only the data from the year 1962
filter(gapminder, year==1962)
# A tibble: 142 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1962    32.0 10267083      853.
 2 Albania     Europe     1962    64.8  1728137     2313.
 3 Algeria     Africa     1962    48.3 11000948     2551.
 4 Angola      Africa     1962    34    4826015     4269.
 5 Argentina   Americas   1962    65.1 21283783     7133.
 6 Australia   Oceania    1962    70.9 10794968    12217.
 7 Austria     Europe     1962    69.5  7129864    10751.
 8 Bahrain     Asia       1962    56.9   171863    12753.
 9 Bangladesh  Asia       1962    41.2 56839289      686.
10 Belgium     Europe     1962    70.2  9218400    10991.
# … with 132 more rows
# Can be rewritten using "Piping" %>%
gapminder %>%  # Pipe ('then') operator to serially connect operations
  filter(year==1962)
# A tibble: 142 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1962    32.0 10267083      853.
 2 Albania     Europe     1962    64.8  1728137     2313.
 3 Algeria     Africa     1962    48.3 11000948     2551.
 4 Angola      Africa     1962    34    4826015     4269.
 5 Argentina   Americas   1962    65.1 21283783     7133.
 6 Australia   Oceania    1962    70.9 10794968    12217.
 7 Austria     Europe     1962    69.5  7129864    10751.
 8 Bahrain     Asia       1962    56.9   171863    12753.
 9 Bangladesh  Asia       1962    41.2 56839289      686.
10 Belgium     Europe     1962    70.2  9218400    10991.
# … with 132 more rows
# Filter for China in 2002
gapminder %>%
  filter(year==2002,
         country=="China")
# A tibble: 1 × 6
  country continent  year lifeExp        pop gdpPercap
  <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
1 China   Asia       2002    72.0 1280400000     3119.
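Conditions inside `filter()` are not limited to `==`. The sketch below matches several values with `%in%` and adds a numeric comparison; `two_countries` is a name chosen for this example.

```r
library(dplyr)
library(gapminder)

# Rows for either of two countries, from 2002 onwards.
# Conditions separated by commas must all be TRUE (logical AND).
two_countries <- gapminder %>%
  filter(country %in% c("India", "China"),
         year >= 2002)
two_countries
```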

Select

Select/rename variables by name

gapminder %>%
  select(year, country, lifeExp)
# A tibble: 1,704 × 3
    year country     lifeExp
   <int> <fct>         <dbl>
 1  1952 Afghanistan    28.8
 2  1957 Afghanistan    30.3
 3  1962 Afghanistan    32.0
 4  1967 Afghanistan    34.0
 5  1972 Afghanistan    36.1
 6  1977 Afghanistan    38.4
 7  1982 Afghanistan    39.9
 8  1987 Afghanistan    40.8
 9  1992 Afghanistan    41.7
10  1997 Afghanistan    41.8
# … with 1,694 more rows
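Since `select()` can also drop and rename columns, here is a short sketch of both, along with `rename()` for the case where all columns should be kept. The new column names are made up for illustration.

```r
library(dplyr)
library(gapminder)

# Drop a column with a minus sign
gapminder %>%
  select(-pop)

# Rename while selecting: new_name = old_name
life_tbl <- gapminder %>%
  select(country, year, life_expectancy = lifeExp)
life_tbl

# rename() keeps every column and changes only the names you list
renamed <- gapminder %>%
  rename(gdp_per_capita = gdpPercap)
names(renamed)
```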

Arrange

Arrange rows by variables

head(gapminder)  # Snapshot of the dataframe
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
# Arrange/Sort by Life Expectancy
arrange(gapminder, lifeExp)  # ascending order
# A tibble: 1,704 × 6
   country      continent  year lifeExp     pop gdpPercap
   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
 1 Rwanda       Africa     1992    23.6 7290203      737.
 2 Afghanistan  Asia       1952    28.8 8425333      779.
 3 Gambia       Africa     1952    30    284320      485.
 4 Angola       Africa     1952    30.0 4232095     3521.
 5 Sierra Leone Africa     1952    30.3 2143249      880.
 6 Afghanistan  Asia       1957    30.3 9240934      821.
 7 Cambodia     Asia       1977    31.2 6978607      525.
 8 Mozambique   Africa     1952    31.3 6446316      469.
 9 Sierra Leone Africa     1957    31.6 2295678     1004.
10 Burkina Faso Africa     1952    32.0 4469979      543.
# … with 1,694 more rows
arrange(gapminder, -lifeExp)  # descending order
# A tibble: 1,704 × 6
   country          continent  year lifeExp    pop gdpPercap
   <fct>            <fct>     <int>   <dbl>  <int>     <dbl>
 1 Japan            Asia       2007    82.6 1.27e8    31656.
 2 Hong Kong, China Asia       2007    82.2 6.98e6    39725.
 3 Japan            Asia       2002    82   1.27e8    28605.
 4 Iceland          Europe     2007    81.8 3.02e5    36181.
 5 Switzerland      Europe     2007    81.7 7.55e6    37506.
 6 Hong Kong, China Asia       2002    81.5 6.76e6    30209.
 7 Australia        Oceania    2007    81.2 2.04e7    34435.
 8 Spain            Europe     2007    80.9 4.04e7    28821.
 9 Sweden           Europe     2007    80.9 9.03e6    33860.
10 Israel           Asia       2007    80.7 6.43e6    25523.
# … with 1,694 more rows
# Want to rewrite using piping?
gapminder %>%  # Pipe ('then') operator to serially connect operations
  arrange(lifeExp)
# A tibble: 1,704 × 6
   country      continent  year lifeExp     pop gdpPercap
   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
 1 Rwanda       Africa     1992    23.6 7290203      737.
 2 Afghanistan  Asia       1952    28.8 8425333      779.
 3 Gambia       Africa     1952    30    284320      485.
 4 Angola       Africa     1952    30.0 4232095     3521.
 5 Sierra Leone Africa     1952    30.3 2143249      880.
 6 Afghanistan  Asia       1957    30.3 9240934      821.
 7 Cambodia     Asia       1977    31.2 6978607      525.
 8 Mozambique   Africa     1952    31.3 6446316      469.
 9 Sierra Leone Africa     1957    31.6 2295678     1004.
10 Burkina Faso Africa     1952    32.0 4469979      543.
# … with 1,694 more rows
# Combining two verbs
gapminder %>%
  filter(year==2007) %>%
  arrange(desc(gdpPercap))
# A tibble: 142 × 6
   country          continent  year lifeExp    pop gdpPercap
   <fct>            <fct>     <int>   <dbl>  <int>     <dbl>
 1 Norway           Europe     2007    80.2 4.63e6    49357.
 2 Kuwait           Asia       2007    77.6 2.51e6    47307.
 3 Singapore        Asia       2007    80.0 4.55e6    47143.
 4 United States    Americas   2007    78.2 3.01e8    42952.
 5 Ireland          Europe     2007    78.9 4.11e6    40676.
 6 Hong Kong, China Asia       2007    82.2 6.98e6    39725.
 7 Switzerland      Europe     2007    81.7 7.55e6    37506.
 8 Netherlands      Europe     2007    79.8 1.66e7    36798.
 9 Canada           Americas   2007    80.7 3.34e7    36319.
10 Iceland          Europe     2007    81.8 3.02e5    36181.
# … with 132 more rows
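`arrange()` accepts more than one sorting variable; later variables break ties in the earlier ones. A minimal sketch, with `by_continent` as a name chosen for this example:

```r
library(dplyr)
library(gapminder)

# Sort by continent first (in the order of the factor levels),
# then by descending life expectancy within each continent
by_continent <- gapminder %>%
  filter(year == 2007) %>%
  arrange(continent, desc(lifeExp))
by_continent
```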

Mutate

Mutate: Adds new variables; keeps existing variables
Transmute: Adds new variables; drops existing variables

# library(tidyverse)
# library(gapminder)
# Changing existing variables
gapminder %>%
  mutate(pop=pop/1000000)
# A tibble: 1,704 × 6
   country     continent  year lifeExp   pop gdpPercap
   <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8.43      779.
 2 Afghanistan Asia       1957    30.3  9.24      821.
 3 Afghanistan Asia       1962    32.0 10.3       853.
 4 Afghanistan Asia       1967    34.0 11.5       836.
 5 Afghanistan Asia       1972    36.1 13.1       740.
 6 Afghanistan Asia       1977    38.4 14.9       786.
 7 Afghanistan Asia       1982    39.9 12.9       978.
 8 Afghanistan Asia       1987    40.8 13.9       852.
 9 Afghanistan Asia       1992    41.7 16.3       649.
10 Afghanistan Asia       1997    41.8 22.2       635.
# … with 1,694 more rows
# Use mutate to change lifeExp to be in months
gapminder %>%
  mutate(lifeExp = lifeExp * 12)
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    346.  8425333      779.
 2 Afghanistan Asia       1957    364.  9240934      821.
 3 Afghanistan Asia       1962    384. 10267083      853.
 4 Afghanistan Asia       1967    408. 11537966      836.
 5 Afghanistan Asia       1972    433. 13079460      740.
 6 Afghanistan Asia       1977    461. 14880372      786.
 7 Afghanistan Asia       1982    478. 12881816      978.
 8 Afghanistan Asia       1987    490. 13867957      852.
 9 Afghanistan Asia       1992    500. 16317921      649.
10 Afghanistan Asia       1997    501. 22227415      635.
# … with 1,694 more rows
# Adding new variables
gapminder %>%
  mutate(grossgdp = pop * gdpPercap)
# A tibble: 1,704 × 7
   country continent  year lifeExp    pop gdpPercap grossgdp
   <fct>   <fct>     <int>   <dbl>  <int>     <dbl>    <dbl>
 1 Afghan… Asia       1952    28.8 8.43e6      779.  6.57e 9
 2 Afghan… Asia       1957    30.3 9.24e6      821.  7.59e 9
 3 Afghan… Asia       1962    32.0 1.03e7      853.  8.76e 9
 4 Afghan… Asia       1967    34.0 1.15e7      836.  9.65e 9
 5 Afghan… Asia       1972    36.1 1.31e7      740.  9.68e 9
 6 Afghan… Asia       1977    38.4 1.49e7      786.  1.17e10
 7 Afghan… Asia       1982    39.9 1.29e7      978.  1.26e10
 8 Afghan… Asia       1987    40.8 1.39e7      852.  1.18e10
 9 Afghan… Asia       1992    41.7 1.63e7      649.  1.06e10
10 Afghan… Asia       1997    41.8 2.22e7      635.  1.41e10
# … with 1,694 more rows
# Combining 3 verbs
gapminder %>%
  mutate(grossgdp = pop * gdpPercap) %>%
  filter(year==2007) %>%
  arrange(desc(grossgdp))
# A tibble: 142 × 7
   country continent  year lifeExp    pop gdpPercap grossgdp
   <fct>   <fct>     <int>   <dbl>  <int>     <dbl>    <dbl>
 1 United… Americas   2007    78.2 3.01e8    42952.  1.29e13
 2 China   Asia       2007    73.0 1.32e9     4959.  6.54e12
 3 Japan   Asia       2007    82.6 1.27e8    31656.  4.04e12
 4 India   Asia       2007    64.7 1.11e9     2452.  2.72e12
 5 Germany Europe     2007    79.4 8.24e7    32170.  2.65e12
 6 United… Europe     2007    79.4 6.08e7    33203.  2.02e12
 7 France  Europe     2007    80.7 6.11e7    30470.  1.86e12
 8 Brazil  Americas   2007    72.4 1.90e8     9066.  1.72e12
 9 Italy   Europe     2007    80.5 5.81e7    28570.  1.66e12
10 Mexico  Americas   2007    76.2 1.09e8    11978.  1.30e12
# … with 132 more rows
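`transmute()` is described above but not shown, so here is a minimal sketch of it. Only the columns named in the call survive; `gdp_only` is a name chosen for this example.

```r
library(dplyr)
library(gapminder)

# Only the columns named inside transmute() are kept:
# country and year are carried over, grossgdp is computed
gdp_only <- gapminder %>%
  transmute(country, year,
            grossgdp = pop * gdpPercap)
gdp_only
```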

Group_by & Summarize

Summarize: Reduces multiple values down to a single value
Group by: Groups the data by one or more variables

# Finding mean life exp across all years all continents
gapminder %>%
  summarize(meanLifeExp = mean(lifeExp))
# A tibble: 1 × 1
  meanLifeExp
        <dbl>
1        59.5
# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
  medianLifeExp
          <dbl>
1          60.7
# Avg life Exp and total pop in 2007
gapminder %>%
  filter(year==2007) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)))
# A tibble: 1 × 2
  meanLifeExp   totalPop
        <dbl>      <dbl>
1        67.0 6251013179
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year==1957) %>%
  summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
  medianLifeExp
          <dbl>
1          48.4
# Avg life Exp and total pop in each year
gapminder %>%
  group_by(year) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop =  sum(as.numeric(pop)))
# A tibble: 12 × 3
    year meanLifeExp   totalPop
   <int>       <dbl>      <dbl>
 1  1952        49.1 2406957150
 2  1957        51.5 2664404580
 3  1962        53.6 2899782974
 4  1967        55.7 3217478384
 5  1972        57.6 3576977158
 6  1977        59.6 3930045807
 7  1982        61.5 4289436840
 8  1987        63.2 4691477418
 9  1992        64.2 5110710260
10  1997        65.0 5515204472
11  2002        65.7 5886977579
12  2007        67.0 6251013179
# Avg life Exp and total pop in each year and continent
gapminder %>%
  group_by(year,continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)))
`summarise()` has grouped output by 'year'. You can
override using the `.groups` argument.
# A tibble: 60 × 4
# Groups:   year [12]
    year continent meanLifeExp   totalPop
   <int> <fct>           <dbl>      <dbl>
 1  1952 Africa           39.1  237640501
 2  1952 Americas         53.3  345152446
 3  1952 Asia             46.3 1395357351
 4  1952 Europe           64.4  418120846
 5  1952 Oceania          69.3   10686006
 6  1957 Africa           41.3  264837738
 7  1957 Americas         56.0  386953916
 8  1957 Asia             49.3 1562780599
 9  1957 Europe           66.7  437890351
10  1957 Oceania          70.3   11941976
# … with 50 more rows
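The message about grouped output can be avoided by setting the `.groups` argument that it mentions. The sketch below also counts the rows in each group with `n()`; the column name `nCountries` is made up for illustration.

```r
library(dplyr)
library(gapminder)

by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)),
            nCountries = n(),     # number of rows (countries) in each group
            .groups = "drop")     # return an ungrouped tibble, no message
by_year_continent
```

Dropping the grouping is usually the safer default, because later verbs such as `mutate()` or `arrange()` then act on the whole table rather than within each year.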

Part 4: Visualizing tidy data with ggplot

Recap of ggplot2

Creating a plot w/ Grammar of Graphics

# Add the size aesthetic to represent a country's gdpPercap
gapminder_1952 <- gapminder %>%
  filter(year==1952)

ggplot(gapminder_1952,
       aes(x = pop, y = lifeExp,
           color = continent, size = gdpPercap)) +
  geom_point() +
  scale_x_log10()
# Instead of showing all continents in one panel, faceting draws one small plot per continent (5 here)
gapminder_2007 <- gapminder %>%
  filter(year==2007)

ggplot(data=gapminder_2007,
       aes(x=gdpPercap,y=lifeExp)) +
  geom_point() + 
  scale_x_log10() + 
  facet_wrap(~continent)
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(data=gapminder,
       aes(x=gdpPercap,y=lifeExp,
           color=continent, size = pop)) +
  geom_point() + 
  scale_x_log10() + 
  facet_wrap(~year)
by_year <- gapminder %>%
  group_by(year) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)))

by_year_continent <- gapminder %>%
  group_by(year,continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop =sum(as.numeric(pop)))
`summarise()` has grouped output by 'year'. You can
override using the `.groups` argument.
# Visualizing population over time
ggplot(data=by_year,
       aes(x=year,y=totalPop)) +
  geom_point()
# Visualizing population over time, starting at zero, for each continent
ggplot(data=by_year_continent,
       aes(x=year,y=totalPop,color=continent)) +
  geom_point() +
  expand_limits(y=0)
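For trends over time, connecting the points with lines and labelling the axes often reads better. This is one possible variation on the plot above, not part of the original workshop code; `pop_lines` is a name chosen for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)), .groups = "drop")

# One line per continent; labs() sets readable axis and legend titles
pop_lines <- ggplot(by_year_continent,
                    aes(x = year, y = totalPop, color = continent)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +
  labs(x = "Year", y = "Total population", color = "Continent")
pop_lines
```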

gganimate

gganimate: A Grammar of Animated Graphics

library(tidyverse)
library(gapminder)
static_plot <- ggplot(gapminder,
                     aes(gdpPercap, lifeExp,
                         size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() + theme_minimal() +
  facet_wrap(~continent)
static_plot

library(gganimate)  # provides transition_time() and ease_aes() used below
No renderer backend detected. gganimate will default to writing frames to separate files
Consider installing:
- the `gifski` package for gif output
- the `av` package for video output
and restarting the R session
animated_plot <- ggplot(gapminder,
                        aes(gdpPercap, lifeExp,
                            size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() + theme_minimal() +
  facet_wrap(~continent) +
  # Here comes the gganimate specific bits
  labs(title = 'Year: {frame_time}', # labels
       x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) + # the dynamic variable
  ease_aes('linear')
animated_plot
Warning: No renderer available. Please install the gifski,
av, or magick package to create animated output
NULL
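The warning above means no renderer was installed, so nothing was animated. Assuming the gganimate and gifski packages are installed, a sketch like the following could render and save a GIF. The helper name `render_gapminder_gif()` and its defaults are made up for this example, and the rendering is wrapped in a function so that nothing slow runs until you call it.

```r
library(ggplot2)
library(gapminder)

# Hypothetical helper: builds the animated plot, renders it, and saves a GIF
render_gapminder_gif <- function(out_file, nframes = 50, fps = 10) {
  # Both packages are assumptions of this sketch; stop early if missing
  for (pkg in c("gganimate", "gifski")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Please install the '", pkg, "' package first")
    }
  }
  p <- ggplot(gapminder,
              aes(gdpPercap, lifeExp, size = pop, colour = country)) +
    geom_point(alpha = 0.7, show.legend = FALSE) +
    scale_colour_manual(values = country_colors) +
    scale_x_log10() +
    labs(title = "Year: {frame_time}") +
    gganimate::transition_time(year)
  # animate() renders the frames; gifski_renderer() stitches them into a GIF
  anim <- gganimate::animate(p, nframes = nframes, fps = fps,
                             renderer = gganimate::gifski_renderer())
  gganimate::anim_save(out_file, animation = anim)
  invisible(out_file)
}
```

A call such as `render_gapminder_gif(file.path(tempdir(), "gapminder.gif"))` would then write the file; fewer frames render faster but look less smooth.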

Part 5: Export & Wrap-up

Saving your plots

ggsave

Save a ggplot (or other grid object) with sensible defaults

library(tidyverse)
library(here)  # here() builds paths relative to the project root
# Store the file name
plot1 <- "gapminder_static_plot.png"

# Store the directory to save into (it should already exist)
my_full_path <- here("gapminder")

# Save the plot as a PNG file ...
ggsave(filename=plot1,
       plot=static_plot,
       device="png",
       path=my_full_path,
       dpi=300)
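To control the size of the saved image, `ggsave()` also takes `width`, `height`, and `units`. The sketch below saves a small plot into the session's temporary directory so it can be run anywhere; the plot, the file name, and the dimensions are chosen only for illustration.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(subset(gapminder, year == 2007),
            aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Hypothetical output location inside the temporary directory
out_png <- file.path(tempdir(), "gapminder_2007.png")

# width and height are given in `units`; dpi sets the resolution
ggsave(filename = out_png, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)
file.exists(out_png)
```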

Saving your data files

write_delim

Write a data frame to a delimited file

library(tidyverse)
# Store the file name
filename <- "my_new_data.txt"

# Store your absolute/relative path
my_full_path <- file.path("~/GitHub", "workshop-tidyverse")
my_path <- file.path(my_full_path, filename)  # directory + file name

# To save as a tab-delimited text file ...
# (since readr 1.4.0 the argument is called 'file'; 'path' is deprecated)
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
          file=my_path, # Absolute path recommended.
          # However, you can directly use 'filename' here
          # if you are saving the file in the same directory
          # as your code.
          col_names=TRUE) # if you want the column names to be
# saved in the first row, recommended

# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
          file=my_path,
          col_names=TRUE)
# Or save it with any other delimiter
# choose wisely, pick a single-character delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
            file=my_path,
            col_names=TRUE,
            delim="|")
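A quick way to check an export is to read the file straight back in. This sketch writes the 2007 rows of gapminder to a temporary file and re-imports them. The file name is made up, and it assumes readr 2.0 or later for the `show_col_types` argument.

```r
library(readr)
library(dplyr)
library(gapminder)

gap_2007 <- gapminder %>%
  filter(year == 2007)

# Hypothetical output file in the session's temporary directory
out_tsv <- file.path(tempdir(), "gapminder_2007.tsv")
write_tsv(x = gap_2007, file = out_tsv)

# Read it back; factor columns return as plain character columns
gap_back <- read_tsv(out_tsv, show_col_types = FALSE)
dim(gap_back)
```

Text files do not store R-specific types, which is why `country` and `continent` are no longer factors after the round trip.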

What you learnt today!

Option            Description

Part 1: Getting Started
install.packages  Download and install packages from CRAN-like repositories or from local files
library           Library and require load and attach add-on packages

tidyverse > readr/readxl
read_delim        Read a delimited file (incl csv, tsv) into a tibble
read_csv          read_csv() and read_tsv() are special cases of the general read_delim()
read_excel        Read xls and xlsx files

Data snapshot
str               Compactly Display the Structure of an Arbitrary R Object
head              Return the First or Last Part of an Object
glimpse           Get a glimpse of your data
View              Invoke a Data Viewer
kable             Create tables in LaTeX, HTML, Markdown and reStructuredText
paged_table       Create a table in HTML with support for paging rows and columns

Part 2: tidyverse > tidyr
pivot_longer      Gather Columns Into Key-Value Pairs (COLS -> ROWS)
pivot_wider       Spread a key-value pair across multiple columns
separate          Separate one column into multiple columns
unite             Unite multiple columns into one

Part 3: tidyverse > dplyr
filter            Return rows with matching conditions
select            Select/rename variables by name
mutate            Add new variables
transmute         Adds new variables; drops existing variables
arrange           Arrange rows by variables
summarise         Reduces multiple values down to a single value
group_by          Group by one or more variables
join              Join two tbls together: left_join, right_join, inner_join
bind              Efficiently bind multiple data frames by row and column: bind_rows, bind_cols
setops            Set operations: intersect, union, setdiff, setequal

Part 4: tidyverse > ggplot
ggplot            Create a new ggplot
gganimate         gganimate: A Grammar of Animated Graphics

Part 5: Export & Wrap-up
tidyverse > readr
ggsave            Save a ggplot (or other grid object) with sensible defaults
write_delim       Write a data frame to a delimited file
write_tsv         write_delim customized for tab-separated values
write_csv         write_delim customized for comma-separated values
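The join, bind, and set-operation entries in the table are not demonstrated in the code above, so here is a brief sketch of each. The lookup table `capitals`, including its fictional "Atlantis" row, is invented for this example.

```r
library(dplyr)
library(gapminder)

# Hypothetical lookup table; "Atlantis" is deliberately not in gapminder
capitals <- tibble(country = c("India", "Japan", "Atlantis"),
                   capital = c("New Delhi", "Tokyo", "(fictional)"))

gap_2007 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(country = as.character(country)) %>%  # match the key type in capitals
  select(country, lifeExp)

# left_join keeps every row of the left table; unmatched rows get NA
joined_left <- left_join(gap_2007, capitals, by = "country")

# inner_join keeps only the rows present in both tables
joined_inner <- inner_join(gap_2007, capitals, by = "country")

# bind_rows stacks data frames that share column names
stacked <- bind_rows(head(gap_2007, 2), tail(gap_2007, 2))

# setdiff: values in the first vector that are missing from the second
setdiff(capitals$country, gap_2007$country)
```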


Credits

Arjun Krishnan and I co-developed the content for this workshop.

Acknowledgements

Contact

Additional resources

Some awesome open-source books