RMarkdown file to accompany the tidyverse + gapminder workshop presentation
I am a computational biologist, an Asst. Professor in the department of Pathobiology & Diagnostic Investigation at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.
I’m also the founder & co-organizer of the R-Ladies East Lansing group on campus. We conduct R-related workshops & meetups regularly! So, do check out our upcoming events on Meetup.
You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.
Running RStudio locally? Download RStudio
Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version
Trouble with local installation? Login & start using RStudio Cloud right away!
New to RStudio IDE? Use Help Page #1 & Page #2
… if you haven’t already! The RStudio startup message should specify your current local version of R, e.g., R v4.0.5
install.packages("tidyverse") # for data wrangling
install.packages("gapminder") # sample dataset
Trouble with installing tidyverse? Install the individual packages one at a time with install.packages("PACKAGENAME"). See the tidyverse suite of packages here.
install.packages("readr") # Importing data files
# install.packages("readxl") # Importing excel files
install.packages("tidyr") # Tidy Data
install.packages("dplyr") # Data manipulation
install.packages("ggplot2") # Data Visualization (w/ Grammar of Graphics)
You can also access all relevant R/RStudio/Slack cheatsheets on our GitHub repo.
library(tidyverse)
read_csv(file="my_data.csv",
col_names=TRUE) # comma-separated values, as exported from excel/spreadsheets
read_delim(file="my_data.txt", col_names=TRUE,
delim="//") # any delimiter
# Other useful packages
# readxl by Jenny Bryan
library(readxl) # not attached by library(tidyverse), so load it separately
read_excel(path="path/to/excel.xls",
sheet=1,
range="A1:D50",
col_names=TRUE)
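A minimal, self-contained sketch of the import step, assuming only the readr package. The file contents and the names `tmp_csv` and `tiny` are made up for illustration; with your own data you would pass the path to your file instead.

```r
library(readr)

# Hypothetical three-line CSV written to a temporary file for illustration
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "Afghanistan,1952,28.801",
             "Albania,1952,55.23"),
           tmp_csv)

# col_names = TRUE treats the first line as the header;
# col_types spells out the expected type of each column
tiny <- read_csv(file = tmp_csv,
                 col_names = TRUE,
                 col_types = cols(country = col_character(),
                                  year    = col_integer(),
                                  lifeExp = col_double()))
tiny
```

Declaring `col_types` is optional, but it makes the import fail loudly when a column does not contain what you expect.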
We will work with the Gapminder dataset by Hans Rosling.
Unveiling the beauty of statistics for a fact based world view. Gapminder.org
Tools to generate their trademark bubble charts
library(gapminder) # attaches the dataset; gapminder::gapminder also works without attaching
str(gapminder) # Structure of the dataframe
gapminder # Data is in a cleaned up 'tibble' format by default
head(gapminder) # Shows the top few observations (rows) of your data frame
glimpse(gapminder) # Info-dense summary of the data
View(gapminder) # View data in a visual GUI-based spreadsheet-like format
── Attaching packages ─────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.6 ✓ dplyr 1.0.8
✓ tidyr 1.2.0 ✓ stringr 1.4.0
✓ readr 2.1.2 ✓ forcats 0.5.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gapminder # Data is in a cleaned up 'tibble' format by default
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
head(gapminder) # Shows the top few observations (rows) of your data frame
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
glimpse(gapminder) # Info-dense summary of the data
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanist…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, …
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, …
View(gapminder) # View data in a visual GUI-based spreadsheet-like format
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
library(rmarkdown)
paged_table(gapminder)
# gather() # Gather COLUMNS -> ROWS
# spread() # Spread ROWS -> COLUMNS
pivot_longer() # wide -> long
pivot_wider() # long -> wide
separate() # Separate 1 COLUMN -> many COLUMNS
unite() # Unite several COLUMNS -> 1 COLUMN
# We'll use the R built-in USArrests data set (datasets package). We start by subsetting a small dataset
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Georgia 17.4 211 60 25.8
Maryland 11.3 300 67 27.8
New Jersey 7.4 159 89 18.8
# Row names are states, so let's use the function bind_cols() to add a column named "state" in the data. This will make the data tidy and the analysis easier.
my_data <- bind_cols(state = rownames(my_data),
my_data)
my_data
state Murder Assault UrbanPop Rape
Alabama Alabama 13.2 236 58 21.2
Georgia Georgia 17.4 211 60 25.8
Maryland Maryland 11.3 300 67 27.8
New Jersey New Jersey 7.4 159 89 18.8
Gather columns into key-value pairs. Wide -> Long
# Gather all columns except the column state
my_data2 <- gather(my_data,
key = "arrest_attribute",
value = "arrest_estimate",
-state)
my_data2
state arrest_attribute arrest_estimate
1 Alabama Murder 13.2
2 Georgia Murder 17.4
3 Maryland Murder 11.3
4 New Jersey Murder 7.4
5 Alabama Assault 236.0
6 Georgia Assault 211.0
7 Maryland Assault 300.0
8 New Jersey Assault 159.0
9 Alabama UrbanPop 58.0
10 Georgia UrbanPop 60.0
11 Maryland UrbanPop 67.0
12 New Jersey UrbanPop 89.0
13 Alabama Rape 21.2
14 Georgia Rape 25.8
15 Maryland Rape 27.8
16 New Jersey Rape 18.8
# Gather only Murder and Assault columns
my_data2 <- gather(my_data,
key = "arrest_attribute",
value = "arrest_estimate",
Murder, Assault)
my_data2
state UrbanPop Rape arrest_attribute arrest_estimate
1 Alabama 58 21.2 Murder 13.2
2 Georgia 60 25.8 Murder 17.4
3 Maryland 67 27.8 Murder 11.3
4 New Jersey 89 18.8 Murder 7.4
5 Alabama 58 21.2 Assault 236.0
6 Georgia 60 25.8 Assault 211.0
7 Maryland 67 27.8 Assault 300.0
8 New Jersey 89 18.8 Assault 159.0
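The verb list above names pivot_longer() as the newer counterpart of gather(). Here is a sketch of the same wide -> long reshaping with pivot_longer(), assuming tidyr 1.0.0 or later. The example table is rebuilt with base R so the block runs on its own, and `my_long` is a name chosen for illustration.

```r
library(tidyr)

# Rebuild the small example table used above (state names as a regular column)
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Wide -> long: every column except `state` becomes a name/value pair
my_long <- pivot_longer(my_data,
                        cols = -state,
                        names_to = "arrest_attribute",
                        values_to = "arrest_estimate")
my_long
```

The result has the same 16 name/value pairs as the gather() output, but pivot_longer() returns a tibble and orders the rows state by state rather than attribute by attribute.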
Spread a key-value pair across multiple columns: Long -> Wide
# Spread "my_data2" to turn back to the original data:
my_data3 <- spread(my_data2,
key = "arrest_attribute",
value = "arrest_estimate"
)
my_data3
state UrbanPop Rape Assault Murder
1 Alabama 58 21.2 236 13.2
2 Georgia 60 25.8 211 17.4
3 Maryland 67 27.8 300 11.3
4 New Jersey 89 18.8 159 7.4
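For comparison, a sketch of the same long -> wide step with pivot_wider(), the newer counterpart of spread(), again assuming tidyr 1.0.0 or later. The names `my_long` and `my_wide` are illustrative.

```r
library(tidyr)

# Rebuild the small example table (state names as a regular column)
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Long table holding only the Murder and Assault measurements
my_long <- pivot_longer(my_data,
                        cols = c(Murder, Assault),
                        names_to = "arrest_attribute",
                        values_to = "arrest_estimate")

# Long -> wide: one new column per distinct value of arrest_attribute
my_wide <- pivot_wider(my_long,
                       names_from = arrest_attribute,
                       values_from = arrest_estimate)
my_wide
```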
Unite multiple columns into one
# The R code below uses the data set "my_data" and unites the columns Murder and Assault
my_data4 <- unite(my_data,
col = "Murder_Assault",
Murder, Assault,
sep = "_")
my_data4
state Murder_Assault UrbanPop Rape
Alabama Alabama 13.2_236 58 21.2
Georgia Georgia 17.4_211 60 25.8
Maryland Maryland 11.3_300 67 27.8
New Jersey New Jersey 7.4_159 89 18.8
Separate one column into multiple columns
state Murder Assault UrbanPop Rape
Alabama Alabama 13.2 236 58 21.2
Georgia Georgia 17.4 211 60 25.8
Maryland Maryland 11.3 300 67 27.8
New Jersey New Jersey 7.4 159 89 18.8
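The call that produced the table above is not shown. A plausible sketch, assuming it splits the united column of `my_data4` back apart at the "_" separator, is given below; `my_data5` is a hypothetical name.

```r
library(tidyr)

# Rebuild my_data and my_data4 so the block runs on its own
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)
my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")

# Split the united column back into two columns at the "_" separator;
# convert = TRUE turns the resulting text back into numbers
my_data5 <- separate(my_data4,
                     col = "Murder_Assault",
                     into = c("Murder", "Assault"),
                     sep = "_",
                     convert = TRUE)
my_data5
```

Without `convert = TRUE` the two new columns would stay as character strings, because unite() pasted the numbers together as text.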
filter() # PICK observations by their values | ROWS
select() # PICK variables by their names | COLUMNS
mutate() # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange() # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by() # GROUP data into rows with the same value of variable (COLUMN)
Return rows with matching conditions
head(gapminder) # Snapshot of the dataframe
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
# Now, filter by year and look at only the data from the year 1962
filter(gapminder, year==1962)
# A tibble: 142 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1962 32.0 10267083 853.
2 Albania Europe 1962 64.8 1728137 2313.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Angola Africa 1962 34 4826015 4269.
5 Argentina Americas 1962 65.1 21283783 7133.
6 Australia Oceania 1962 70.9 10794968 12217.
7 Austria Europe 1962 69.5 7129864 10751.
8 Bahrain Asia 1962 56.9 171863 12753.
9 Bangladesh Asia 1962 41.2 56839289 686.
10 Belgium Europe 1962 70.2 9218400 10991.
# … with 132 more rows
# Can be rewritten using "Piping" %>%
gapminder %>% # Pipe ('then') operator to serially connect operations
filter(year==1962)
# A tibble: 142 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1962 32.0 10267083 853.
2 Albania Europe 1962 64.8 1728137 2313.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Angola Africa 1962 34 4826015 4269.
5 Argentina Americas 1962 65.1 21283783 7133.
6 Australia Oceania 1962 70.9 10794968 12217.
7 Austria Europe 1962 69.5 7129864 10751.
8 Bahrain Asia 1962 56.9 171863 12753.
9 Bangladesh Asia 1962 41.2 56839289 686.
10 Belgium Europe 1962 70.2 9218400 10991.
# … with 132 more rows
# A tibble: 1 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 2002 72.0 1280400000 3119.
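The one-row result above looks like a filter on two conditions at once, although the call itself is not shown. A sketch that should reproduce it, assuming the conditions were country and year (`china_2002` is an illustrative name):

```r
library(dplyr)
library(gapminder)

# Conditions separated by a comma are combined with AND
china_2002 <- gapminder %>%
  filter(country == "China", year == 2002)
china_2002
```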
Select/rename variables by name
# A tibble: 1,704 × 3
year country lifeExp
<int> <fct> <dbl>
1 1952 Afghanistan 28.8
2 1957 Afghanistan 30.3
3 1962 Afghanistan 32.0
4 1967 Afghanistan 34.0
5 1972 Afghanistan 36.1
6 1977 Afghanistan 38.4
7 1982 Afghanistan 39.9
8 1987 Afghanistan 40.8
9 1992 Afghanistan 41.7
10 1997 Afghanistan 41.8
# … with 1,694 more rows
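The three-column table above matches what select() returns when the columns are listed in the order year, country, lifeExp. The original call is not shown, so treat this as a sketch; `selected` is an illustrative name.

```r
library(dplyr)
library(gapminder)

# Keep only the named columns, in the order they are listed
selected <- gapminder %>%
  select(year, country, lifeExp)
selected
```

select() can also rename while selecting, for example `select(gapminder, year, country, life_expectancy = lifeExp)`.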
Arrange rows by variables
head(gapminder) # Snapshot of the dataframe
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
# Arrange/Sort by Life Expectancy
arrange(gapminder, lifeExp) # ascending order
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Rwanda Africa 1992 23.6 7290203 737.
2 Afghanistan Asia 1952 28.8 8425333 779.
3 Gambia Africa 1952 30 284320 485.
4 Angola Africa 1952 30.0 4232095 3521.
5 Sierra Leone Africa 1952 30.3 2143249 880.
6 Afghanistan Asia 1957 30.3 9240934 821.
7 Cambodia Asia 1977 31.2 6978607 525.
8 Mozambique Africa 1952 31.3 6446316 469.
9 Sierra Leone Africa 1957 31.6 2295678 1004.
10 Burkina Faso Africa 1952 32.0 4469979 543.
# … with 1,694 more rows
arrange(gapminder, -lifeExp) # descending order
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Japan Asia 2007 82.6 1.27e8 31656.
2 Hong Kong, China Asia 2007 82.2 6.98e6 39725.
3 Japan Asia 2002 82 1.27e8 28605.
4 Iceland Europe 2007 81.8 3.02e5 36181.
5 Switzerland Europe 2007 81.7 7.55e6 37506.
6 Hong Kong, China Asia 2002 81.5 6.76e6 30209.
7 Australia Oceania 2007 81.2 2.04e7 34435.
8 Spain Europe 2007 80.9 4.04e7 28821.
9 Sweden Europe 2007 80.9 9.03e6 33860.
10 Israel Asia 2007 80.7 6.43e6 25523.
# … with 1,694 more rows
# Want to rewrite using piping?
gapminder %>% # Pipe ('then') operator to serially connect operations
arrange(lifeExp)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Rwanda Africa 1992 23.6 7290203 737.
2 Afghanistan Asia 1952 28.8 8425333 779.
3 Gambia Africa 1952 30 284320 485.
4 Angola Africa 1952 30.0 4232095 3521.
5 Sierra Leone Africa 1952 30.3 2143249 880.
6 Afghanistan Asia 1957 30.3 9240934 821.
7 Cambodia Asia 1977 31.2 6978607 525.
8 Mozambique Africa 1952 31.3 6446316 469.
9 Sierra Leone Africa 1957 31.6 2295678 1004.
10 Burkina Faso Africa 1952 32.0 4469979 543.
# … with 1,694 more rows
# A tibble: 142 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Norway Europe 2007 80.2 4.63e6 49357.
2 Kuwait Asia 2007 77.6 2.51e6 47307.
3 Singapore Asia 2007 80.0 4.55e6 47143.
4 United States Americas 2007 78.2 3.01e8 42952.
5 Ireland Europe 2007 78.9 4.11e6 40676.
6 Hong Kong, China Asia 2007 82.2 6.98e6 39725.
7 Switzerland Europe 2007 81.7 7.55e6 37506.
8 Netherlands Europe 2007 79.8 1.66e7 36798.
9 Canada Americas 2007 80.7 3.34e7 36319.
10 Iceland Europe 2007 81.8 3.02e5 36181.
# … with 132 more rows
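The last table above (142 rows, all from 2007, highest gdpPercap first) is consistent with chaining filter() and arrange(). The call is not shown, so the following is a sketch under that assumption; `gdp_2007` is an illustrative name.

```r
library(dplyr)
library(gapminder)

# Keep the 2007 rows, then sort by GDP per capita, largest first
gdp_2007 <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))
gdp_2007
```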
Mutate: Adds new variables; keeps existing variables
Transmute: Adds new variables; drops existing variables
# library(tidyverse)
# library(gapminder)
# Changing existing variables
gapminder %>%
mutate(pop=pop/1000000)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8.43 779.
2 Afghanistan Asia 1957 30.3 9.24 821.
3 Afghanistan Asia 1962 32.0 10.3 853.
4 Afghanistan Asia 1967 34.0 11.5 836.
5 Afghanistan Asia 1972 36.1 13.1 740.
6 Afghanistan Asia 1977 38.4 14.9 786.
7 Afghanistan Asia 1982 39.9 12.9 978.
8 Afghanistan Asia 1987 40.8 13.9 852.
9 Afghanistan Asia 1992 41.7 16.3 649.
10 Afghanistan Asia 1997 41.8 22.2 635.
# … with 1,694 more rows
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 346. 8425333 779.
2 Afghanistan Asia 1957 364. 9240934 821.
3 Afghanistan Asia 1962 384. 10267083 853.
4 Afghanistan Asia 1967 408. 11537966 836.
5 Afghanistan Asia 1972 433. 13079460 740.
6 Afghanistan Asia 1977 461. 14880372 786.
7 Afghanistan Asia 1982 478. 12881816 978.
8 Afghanistan Asia 1987 490. 13867957 852.
9 Afghanistan Asia 1992 500. 16317921 649.
10 Afghanistan Asia 1997 501. 22227415 635.
# … with 1,694 more rows
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap grossgdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghan… Asia 1952 28.8 8.43e6 779. 6.57e 9
2 Afghan… Asia 1957 30.3 9.24e6 821. 7.59e 9
3 Afghan… Asia 1962 32.0 1.03e7 853. 8.76e 9
4 Afghan… Asia 1967 34.0 1.15e7 836. 9.65e 9
5 Afghan… Asia 1972 36.1 1.31e7 740. 9.68e 9
6 Afghan… Asia 1977 38.4 1.49e7 786. 1.17e10
7 Afghan… Asia 1982 39.9 1.29e7 978. 1.26e10
8 Afghan… Asia 1987 40.8 1.39e7 852. 1.18e10
9 Afghan… Asia 1992 41.7 1.63e7 649. 1.06e10
10 Afghan… Asia 1997 41.8 2.22e7 635. 1.41e10
# … with 1,694 more rows
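The two tables above appear without their code. Judging from the printed values (346 is roughly 12 × 28.8, and the extra column is named grossgdp), they likely come from calls like the ones sketched below. A transmute() variant is included to show the contrast. The object names are illustrative.

```r
library(dplyr)
library(gapminder)

# Overwrite an existing column: life expectancy in months instead of years
life_months <- gapminder %>%
  mutate(lifeExp = 12 * lifeExp)

# Add a new column: total GDP = population x GDP per capita
with_gdp <- gapminder %>%
  mutate(grossgdp = pop * gdpPercap)

# transmute() keeps only the columns named in the call
only_gdp <- gapminder %>%
  transmute(country, year, grossgdp = pop * gdpPercap)

with_gdp
only_gdp
```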
# Combining 3 verbs
gapminder %>%
mutate(grossgdp = pop * gdpPercap) %>%
filter(year==2007) %>%
arrange(desc(grossgdp))
# A tibble: 142 × 7
country continent year lifeExp pop gdpPercap grossgdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 United… Americas 2007 78.2 3.01e8 42952. 1.29e13
2 China Asia 2007 73.0 1.32e9 4959. 6.54e12
3 Japan Asia 2007 82.6 1.27e8 31656. 4.04e12
4 India Asia 2007 64.7 1.11e9 2452. 2.72e12
5 Germany Europe 2007 79.4 8.24e7 32170. 2.65e12
6 United… Europe 2007 79.4 6.08e7 33203. 2.02e12
7 France Europe 2007 80.7 6.11e7 30470. 1.86e12
8 Brazil Americas 2007 72.4 1.90e8 9066. 1.72e12
9 Italy Europe 2007 80.5 5.81e7 28570. 1.66e12
10 Mexico Americas 2007 76.2 1.09e8 11978. 1.30e12
# … with 132 more rows
Summarize: Reduces multiple values down to a single value
Group by: Group by one or more variables
# Finding mean life exp across all years all continents
gapminder %>%
summarize(meanLifeExp = mean(lifeExp))
# A tibble: 1 × 1
meanLifeExp
<dbl>
1 59.5
# Summarize to find the median life expectancy
gapminder %>%
summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
medianLifeExp
<dbl>
1 60.7
# Avg life Exp and total pop in 2007
gapminder %>%
filter(year==2007) %>%
summarize(meanLifeExp = mean(lifeExp),
totalPop = sum(as.numeric(pop)))
# A tibble: 1 × 2
meanLifeExp totalPop
<dbl> <dbl>
1 67.0 6251013179
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
filter(year==1957) %>%
summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 × 1
medianLifeExp
<dbl>
1 48.4
# Avg life Exp and total pop in each year
gapminder %>%
group_by(year) %>%
summarize(meanLifeExp = mean(lifeExp),
totalPop = sum(as.numeric(pop)))
# A tibble: 12 × 3
year meanLifeExp totalPop
<int> <dbl> <dbl>
1 1952 49.1 2406957150
2 1957 51.5 2664404580
3 1962 53.6 2899782974
4 1967 55.7 3217478384
5 1972 57.6 3576977158
6 1977 59.6 3930045807
7 1982 61.5 4289436840
8 1987 63.2 4691477418
9 1992 64.2 5110710260
10 1997 65.0 5515204472
11 2002 65.7 5886977579
12 2007 67.0 6251013179
# Avg life Exp and total pop in each year and continent
gapminder %>%
group_by(year,continent) %>%
summarize(meanLifeExp = mean(lifeExp),
totalPop = sum(as.numeric(pop)))
`summarise()` has grouped output by 'year'. You can
override using the `.groups` argument.
# A tibble: 60 × 4
# Groups: year [12]
year continent meanLifeExp totalPop
<int> <fct> <dbl> <dbl>
1 1952 Africa 39.1 237640501
2 1952 Americas 53.3 345152446
3 1952 Asia 46.3 1395357351
4 1952 Europe 64.4 418120846
5 1952 Oceania 69.3 10686006
6 1957 Africa 41.3 264837738
7 1957 Americas 56.0 386953916
8 1957 Asia 49.3 1562780599
9 1957 Europe 66.7 437890351
10 1957 Oceania 70.3 11941976
# … with 50 more rows
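The message about grouped output appears because summarize() removes only the last grouping variable, so the result above is still grouped by year. A sketch of one way to silence the message and get an ungrouped result, assuming dplyr 1.0.0 or later, where the `.groups` argument is available:

```r
library(dplyr)
library(gapminder)

by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)),
            .groups = "drop")   # return a plain, ungrouped tibble
by_year_continent
```

Use `.groups = "drop_last"` instead if you want to keep the default behaviour but make it explicit.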
Creating a plot w/ Grammar of Graphics
# Add the size aesthetic to represent a country's gdpPercap
gapminder_1952 <- gapminder %>%
filter(year==1952)
ggplot(gapminder_1952,
aes(x = pop, y = lifeExp,
color = continent, size = gdpPercap)) +
geom_point() +
scale_x_log10()
# Instead of showing all continents in a single panel, use faceting to split the plot into 5 panels, one per continent
gapminder_2007 <- gapminder %>%
filter(year==2007)
ggplot(data=gapminder_2007,
aes(x=gdpPercap,y=lifeExp)) +
geom_point() +
scale_x_log10() +
facet_wrap(~continent)
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(data=gapminder,
aes(x=gdpPercap,y=lifeExp,
color=continent, size = pop)) +
geom_point() +
scale_x_log10() +
facet_wrap(~year)
by_year <- gapminder %>%
group_by(year) %>%
summarize(meanLifeExp = mean(lifeExp),
totalPop = sum(as.numeric(pop)))
by_year_continent <- gapminder %>%
group_by(year,continent) %>%
summarize(meanLifeExp = mean(lifeExp),
totalPop =sum(as.numeric(pop)))
`summarise()` has grouped output by 'year'. You can
override using the `.groups` argument.
# Visualizing population over time
ggplot(data=by_year,
aes(x=year,y=totalPop)) +
geom_point()
# Visualizing population over time, starting at zero, for each continent
ggplot(data=by_year_continent,
aes(x=year,y=totalPop,color=continent)) +
geom_point() +
expand_limits(y=0)
gganimate: A Grammar of Animated Graphics
library(tidyverse)
library(gapminder)
library(gganimate) # provides transition_time() and ease_aes() used below
static_plot <- ggplot(gapminder,
aes(gdpPercap, lifeExp,
size = pop, colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10() + theme_minimal() +
facet_wrap(~continent)
static_plot
No renderer backend detected. gganimate will default to writing frames to separate files
Consider installing:
- the `gifski` package for gif output
- the `av` package for video output
and restarting the R session
animated_plot <- ggplot(gapminder,
aes(gdpPercap, lifeExp,
size = pop, colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10() + theme_minimal() +
facet_wrap(~continent) +
# Here comes the gganimate specific bits
labs(title = 'Year: {frame_time}', # labels
x = 'GDP per capita', y = 'life expectancy') +
transition_time(year) + # the dynamic variable
ease_aes('linear')
animated_plot
Warning: No renderer available. Please install the gifski,
av, or magick package to create animated output
NULL
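The warning above means no renderer was installed, so nothing was animated. A sketch of how the animation could be rendered and saved once a renderer is available, assuming the gifski package has been installed with install.packages("gifski"). The output file name, the frame count and the frame rate are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)
library(gganimate)

# A trimmed-down version of the animated plot from this section
anim <- ggplot(gapminder,
               aes(gdpPercap, lifeExp,
                   size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_x_log10() +
  labs(title = "Year: {frame_time}") +
  transition_time(year)

# Render only if the gifski renderer is installed
out_gif <- file.path(tempdir(), "gapminder.gif")
if (requireNamespace("gifski", quietly = TRUE)) {
  rendered <- animate(anim, nframes = 24, fps = 4,
                      renderer = gifski_renderer())
  anim_save(out_gif, animation = rendered)
}
```

A small `nframes` keeps rendering quick while you experiment; raise it for a smoother final animation.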
Save a ggplot (or other grid object) with sensible defaults
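No ggsave() call is shown here, so the following is a minimal sketch of saving a plot to disk. The plot, the file name and the dimensions are illustrative choices, and the file is written to a temporary directory so that the example leaves nothing behind.

```r
library(ggplot2)
library(gapminder)

# A small scatter plot of the 2007 data to have something to save
p <- ggplot(subset(gapminder, year == 2007),
            aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()

# The file extension decides the output format (png here)
out_png <- file.path(tempdir(), "gapminder_2007.png")
ggsave(filename = out_png, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)
```

If `plot` is omitted, ggsave() saves the last plot that was displayed.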
Write a data frame to a delimited file
library(tidyverse)
# Save your file name
filename <- "my_new_data.txt"
# Save your absolute/relative path
my_full_path <- file.path("~/GitHub", "workshop-tidyverse")
# Full path to the output file (directory + file name)
my_path <- file.path(my_full_path, filename)
# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
file=my_path, # Absolute path recommended.
# However, you can directly use 'filename' here
# if you are saving the file in the same directory
# as your code.
# ('file' replaces the deprecated 'path' argument, readr >= 1.4.0)
col_names=TRUE) # if you want the column names to be
# saved in the first row, recommended
# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
file=my_path,
col_names=TRUE)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
file=my_path,
col_names=TRUE,
delim="|") # write_delim() expects a single-character delimiter
Option | Description |
---|---|
Part 1: Getting Started | |
install.packages | Download and install packages from CRAN-like repositories or from local files |
library | Library and require load and attach add-on packages |
tidyverse > readr/readxl | |
read_delim | Read a delimited file (incl csv, tsv) into a tibble |
read_csv | read_csv() and read_tsv() are special cases of the general read_delim() |
read_excel | Read xls and xlsx files |
Data snapshot | |
str | Compactly Display the Structure of an Arbitrary R Object |
head | Return the First or Last Part of an Object |
glimpse | Get a glimpse of your data |
View | Invoke a Data Viewer |
kable | Create tables in LaTeX, HTML, Markdown and reStructuredText |
paged_table | Create a table in HTML with support for paging rows and columns |
Part 2: tidyverse > tidyr | |
pivot_longer | Gather Columns Into Key-Value Pairs (COLS -> ROWS) |
pivot_wider | Spread a key-value pair across multiple columns |
separate | Separate one column into multiple columns |
unite | Unite multiple columns into one |
Part 3: tidyverse > dplyr | |
filter | Return rows with matching conditions |
select | Select/rename variables by name |
mutate | Add new variables |
transmute | Adds new variables; drops existing variables |
arrange | Arrange rows by variables |
summarise | Reduces multiple values down to a single value |
group_by | Group by one or more variables |
join | Join two tbls together: left_join, right_join, inner_join |
bind | Efficiently bind multiple data frames by row and column: bind_rows, bind_cols |
setops | Set operations: intersect, union, setdiff, setequal |
Part 4: tidyverse > ggplot | |
ggplot | Create a new ggplot |
gganimate | gganimate: A Grammar of Animated Graphics |
Part 5: Export & Wrap-up | |
tidyverse > readr | |
ggsave | Save a ggplot (or other grid object) with sensible defaults |
write_delim | Write a data frame to a delimited file |
write_tsv | write_delim customized for tab-separated values |
write_csv | write_delim customized for comma-separated values |
Arjun Krishnan and I co-developed the content for this workshop.