RMarkdown file to accompany the Intro to R/Tidyverse workshop (using processed transcriptomic data)
I am a computational biologist, an Asst. Professor in the departments of Pathobiology & Diagnostic Investigation and Microbiology & Molecular Genetics at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.
I’m also the founder & co-organizer of the R-Ladies East Lansing, Women+ Data Science, and AsiaR. We conduct R and data science related meetups & workshops regularly! So, do check out our upcoming events on Meetup.
readr
You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.
Running RStudio locally? Download RStudio
Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version
Trouble with local installation? Login & start using RStudio Cloud right away!
New to RStudio IDE? Use Help Page #1 & Page #2
… if you haven’t already! The RStudio startup message should
specify your current local version of R. For e.g.,
R v4.1.2
install.packages("tidyverse") # for data wrangling
install.packages("here") # to set paths relative to your current project.
# install.packages("gapminder") # sample dataset
Explore the RLEL
workshops for more examples and sample codes for Tidy Data and
DataViz [e.g., >>
presentations/20181105-workshop-tidydata
uses the
gapminder
& USArrests
datasets].
Trouble installing tidyverse?
install.packages("PACKAGENAME")
tidyverse
suite of packages here# If tidyverse installation fails, install individual constituent packages this way...
install.packages("readr") # Importing data files
install.packages("tidyr") # Tidy Data
install.packages("dplyr") # Data manipulation
install.packages("ggplot2") # Data Visualization (w/ Grammar of Graphics)
install.packages("readxl") # Importing excel files
── Attaching packages ─────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
# OR load the individual packages, if you have trouble installing/loading `tidyverse`
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(here) # https://github.com/jennybc/here_here
here() starts at /Users/jananiravi/GitHub/workshop-tidyverse
# library(gapminder) # useful dataset for data wrangling, visualization
Cheatsheets @RStudio | Our cheatsheets repo
More resources towards the end of the document.
read_csv
, write_csv
read_tsv
, write_tsv
read_delim
, write_delim
here::here
here::here() # where you opened your RStudio session/Project
[1] "/Users/jananiravi/GitHub/workshop-tidyverse"
# Downloading sample transcriptomics dataset from NCBI's FTP site
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE69nnn/GSE69360/suppl/GSE69360_RNAseq.counts.txt.gz"
gse69360 <- read_tsv(url) # to read tab-delimitted files
Rows: 57820 Columns: 25
── Column specification ────────────────────────────────────
Delimiter: "\t"
chr (5): Geneid, Chr, Start, End, Strand
dbl (20): Length, AA_Colon, AA_Heart, AA_Kidney, AA_Live...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
?read_tsv # to check defaults
# Comma-separated values, as exported from excel/spreadsheets
read_csv(file="path/to/my_data.csv", col_names=T)
# Other atypical delimitters
read_delim(file="path/to/my_data.txt", col_names=T, delim="//")
# Other useful packages
# readxl by Jenny Bryan
read_excel(path="path/to/excel.xls",
sheet=1,
range="A1:D50",
col_names=T)
## Tip: Always open .Rproj and use relative paths with here()
## Example with here()
read_tsv(file=here("transcriptomics/GSE69360_RNAseq.counts.txt"), col_names=T)
Dataset details:
str(gse69360) # Structure of the dataframe
gse69360 # Data is in a cleaned up 'tibble' format by default
head(gse69360) # Shows the top few observations (rows) of your dataframe
glimpse(gse69360) # Info-dense summary of the data
View(head(gse69360, 100)) # View data in a visual GUI-based spreadsheet-like format
colnames(gse69360) # Column names
nrow(gse69360) # No. of rows
ncol(gse69360) # No. of columns
gse69360[1:5,7:10] # Subsetting a dataframe
## saving the data file
write_tsv(gse69360[1:100,7:12], "gse_subset.txt")
Geneid | Chr | Start | End | Strand | Length | AA_Colon | AA_Heart | AA_Kidney | AA_Liver | AA_Lung | AA_Stomach | AF_Colon | AF_Stomach | BA_Colon | BA_Heart | BA_Kidney | BA_Liver | BA_Lung | BA_Stomach | BF_Colon | BF_Stomach | OA_Stomach1 | OA_Stomach2 | OA_Stomach3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000223972.4 | chr1;chr1;chr1;chr1 | 11869;12595;12975;13221 | 12227;12721;13052;14412 | +;+;+;+ | 1756 | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 2 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 3 | 3 | 1 | 0 |
ENSG00000227232.4 | chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 | 14363;14970;15796;16607;16854;17233;17498;17602;17915;18268;24734;29321;29534 | 14829;15038;15947;16765;17055;17368;17504;17742;18061;18379;24891;29370;29806 | -;-;-;-;-;-;-;-;-;-;-;-;- | 2073 | 36 | 21 | 32 | 28 | 17 | 17 | 102 | 41 | 10 | 24 | 22 | 6 | 49 | 2 | 153 | 157 | 38 | 53 | 18 |
ENSG00000243485.2 | chr1;chr1;chr1 | 29554;30267;30976 | 30039;30667;31109 | +;+;+ | 1021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSG00000237613.2 | chr1;chr1;chr1 | 34554;35245;35721 | 35174;35481;36081 | -;-;- | 1219 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSG00000268020.2 | chr1;chr1 | 52473;54830 | 53312;54936 | +;+ | 947 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSG00000240361.1 | chr1 | 62948 | 63887 | + | 940 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
library(rmarkdown)
paged_table(gse69360)
tidyr
## legacy tidyverse functions
gather() # Gather COLUMNS -> ROWS
spread() # Spread ROWS -> COLUMNS
## newer tidyverse functions
pivot_longer() # Pivot data from wide to long
pivot_wider() # Pivot data from long to wide
separate() # Separate 1 COLUMN -> many COLUMNS
unite() # Unite several COLUMNS -> 1 COLUMN
gather
: Gather columns into key-value pairs. Wide
-> Longspread
: Spread a key-value pair across multiple
columns: Long -> Wide# Gather all columns except 'Geneid'
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
# gather(-Geneid, key="Sample", value="Counts") # wide -> long format ## legacy
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts")
# A tibble: 1,098,580 × 3
Geneid Sample Counts
<chr> <chr> <dbl>
1 ENSG00000223972.4 AA_Colon 1
2 ENSG00000223972.4 AA_Heart 0
3 ENSG00000223972.4 AA_Kidney 0
4 ENSG00000223972.4 AA_Liver 0
5 ENSG00000223972.4 AA_Lung 0
6 ENSG00000223972.4 AA_Stomach 0
7 ENSG00000223972.4 AF_Colon 7
8 ENSG00000223972.4 AF_Stomach 2
9 ENSG00000223972.4 BA_Colon 0
10 ENSG00000223972.4 BA_Heart 0
# … with 1,098,570 more rows
# Gather, then Spread --> Back to original data
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
# gather(-Geneid, key="Sample", value="Counts") %>%
# spread(key = "Sample", value = "Counts") # spread to turn back to the original data!
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
pivot_wider(names_from="Sample", values_from="Counts")
# A tibble: 57,820 × 20
Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG0000022… 1 0 0 0 0
2 ENSG0000022… 36 21 32 28 17
3 ENSG0000024… 0 0 0 0 0
4 ENSG0000023… 0 0 0 0 0
5 ENSG0000026… 0 0 0 0 0
6 ENSG0000024… 0 0 0 0 0
7 ENSG0000018… 0 0 0 0 0
8 ENSG0000023… 3 0 1 2 1
9 ENSG0000023… 0 0 0 0 0
10 ENSG0000023… 0 0 0 0 0
# … with 57,810 more rows, and 14 more variables:
# AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
# BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
# BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
# BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
# OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
unite
: Unite multiple columns into oneseparate
: Separate one column into multiple
columnsgse69360 %>%
select(Geneid, matches("[AF]_")) %>% # selecting only Counts columns
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>% # wide -> long
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") # separate logically
# A tibble: 1,098,580 × 4
Geneid Source_Stage Tissue Counts
<chr> <chr> <chr> <dbl>
1 ENSG00000223972.4 AA Colon 1
2 ENSG00000223972.4 AA Heart 0
3 ENSG00000223972.4 AA Kidney 0
4 ENSG00000223972.4 AA Liver 0
5 ENSG00000223972.4 AA Lung 0
6 ENSG00000223972.4 AA Stomach 0
7 ENSG00000223972.4 AF Colon 7
8 ENSG00000223972.4 AF Stomach 2
9 ENSG00000223972.4 BA Colon 0
10 ENSG00000223972.4 BA Heart 0
# … with 1,098,570 more rows
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
separate(Source_Stage, into=c("Source", "Stage"), sep=1) # separate by char position
# A tibble: 1,098,580 × 5
Geneid Source Stage Tissue Counts
<chr> <chr> <chr> <chr> <dbl>
1 ENSG00000223972.4 A A Colon 1
2 ENSG00000223972.4 A A Heart 0
3 ENSG00000223972.4 A A Kidney 0
4 ENSG00000223972.4 A A Liver 0
5 ENSG00000223972.4 A A Lung 0
6 ENSG00000223972.4 A A Stomach 0
7 ENSG00000223972.4 A F Colon 7
8 ENSG00000223972.4 A F Stomach 2
9 ENSG00000223972.4 B A Colon 0
10 ENSG00000223972.4 B A Heart 0
# … with 1,098,570 more rows
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
separate(Source_Stage, into=c("Source", "Stage"), sep=1) %>%
unite(Stage_Tissue, Stage, Tissue) # combining a different set of columns
# A tibble: 1,098,580 × 4
Geneid Source Stage_Tissue Counts
<chr> <chr> <chr> <dbl>
1 ENSG00000223972.4 A A_Colon 1
2 ENSG00000223972.4 A A_Heart 0
3 ENSG00000223972.4 A A_Kidney 0
4 ENSG00000223972.4 A A_Liver 0
5 ENSG00000223972.4 A A_Lung 0
6 ENSG00000223972.4 A A_Stomach 0
7 ENSG00000223972.4 A F_Colon 7
8 ENSG00000223972.4 A F_Stomach 2
9 ENSG00000223972.4 B A_Colon 0
10 ENSG00000223972.4 B A_Heart 0
# … with 1,098,570 more rows
dplyr
conflicted::conflict_prefer(name="filter", winner="dplyr")
filter() # PICK observations by their values | ROWS
select() # PICK variables by their names | COLUMNS
mutate() # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange() # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by() # GROUP data into rows with the same value of variable (COLUMN)
filter
: Return rows with matching
conditionshead(gse69360) # Snapshot of the dataframe
# A tibble: 6 × 25
Geneid Chr Start End Strand Length AA_Colon AA_Heart
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ENSG000… chr1… 1186… 1222… +;+;+… 1756 1 0
2 ENSG000… chr1… 1436… 1482… -;-;-… 2073 36 21
3 ENSG000… chr1… 2955… 3003… +;+;+ 1021 0 0
4 ENSG000… chr1… 3455… 3517… -;-;- 1219 0 0
5 ENSG000… chr1… 5247… 5331… +;+ 947 0 0
6 ENSG000… chr1 62948 63887 + 940 0 0
# … with 17 more variables: AA_Kidney <dbl>,
# AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
# AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
# BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
# BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
# BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
# OA_Stomach3 <dbl>
# Now, filter by condition
filter(gse69360, Length<=50)
# A tibble: 124 × 25
Geneid Chr Start End Strand Length AA_Colon AA_Heart
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ENSG00… chr1… 3829… 3829… -;- 45 1 0
2 ENSG00… chr1 9521… 9521… - 41 7 1
3 ENSG00… chr1 1477… 1477… + 35 0 1
4 ENSG00… chr1… 1545… 1545… +;+ 21 0 0
5 ENSG00… chr1 2360… 2360… - 50 1 2
6 ENSG00… chr2… 1160… 1160… +;+ 36 1 0
7 ENSG00… chr2 8916… 8916… - 38 183 1
8 ENSG00… chr2 8916… 8916… - 37 98 2
9 ENSG00… chr2 8916… 8916… - 38 80 0
10 ENSG00… chr2 8916… 8916… - 38 0 0
# … with 114 more rows, and 17 more variables:
# AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
# AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
# BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
# BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
# BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
# OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Can be rewritten using "Piping" %>%
gse69360 %>% # Pipe ('then') operator to serially connect operations
filter(Length <= 50)
# A tibble: 124 × 25
Geneid Chr Start End Strand Length AA_Colon AA_Heart
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ENSG00… chr1… 3829… 3829… -;- 45 1 0
2 ENSG00… chr1 9521… 9521… - 41 7 1
3 ENSG00… chr1 1477… 1477… + 35 0 1
4 ENSG00… chr1… 1545… 1545… +;+ 21 0 0
5 ENSG00… chr1 2360… 2360… - 50 1 2
6 ENSG00… chr2… 1160… 1160… +;+ 36 1 0
7 ENSG00… chr2 8916… 8916… - 38 183 1
8 ENSG00… chr2 8916… 8916… - 37 98 2
9 ENSG00… chr2 8916… 8916… - 38 80 0
10 ENSG00… chr2 8916… 8916… - 38 0 0
# … with 114 more rows, and 17 more variables:
# AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
# AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
# BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
# BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
# BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
# OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# A tibble: 542 × 25
Geneid Chr Start End Strand Length AA_Colon AA_Heart
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ENSGR0… chrY… 1204… 1205… +;+ 259 0 0
2 ENSGR0… chrY… 1429… 1430… +;+;+… 6191 0 0
3 ENSGR0… chrY… 1700… 1701… -;-;-… 2363 0 0
4 ENSGR0… chrY… 2317… 2319… +;+ 428 0 0
5 ENSGR0… chrY… 2446… 2496… -;-;-… 9015 0 0
6 ENSGR0… chrY 3753… 3754… - 101 0 0
7 ENSGR0… chrY 4345… 4348… - 328 0 0
8 ENSGR0… chrY 4559… 4560… - 117 0 0
9 ENSGR0… chrY… 5350… 5353… +;+;+… 4384 0 0
10 ENSGR0… chrY… 9009… 9011… +;+ 588 0 0
# … with 532 more rows, and 17 more variables:
# AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
# AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
# BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
# BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
# BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
# OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# A tibble: 1 × 25
Geneid Chr Start End Strand Length AA_Colon AA_Heart
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ENSG000… chrY… 5306… 5306… -;- 42 1 0
# … with 17 more variables: AA_Kidney <dbl>,
# AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
# AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
# BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
# BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
# BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
# OA_Stomach3 <dbl>
select
: Select/rename variables/columns by
name# A tibble: 57,820 × 5
Geneid AF_Colon AF_Stomach BF_Colon BF_Stomach
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000223972.4 7 2 0 3
2 ENSG00000227232.4 102 41 153 157
3 ENSG00000243485.2 0 0 0 0
4 ENSG00000237613.2 0 0 0 0
5 ENSG00000268020.2 0 0 0 0
6 ENSG00000240361.1 0 0 0 0
7 ENSG00000186092.4 0 0 0 0
8 ENSG00000238009.2 4 0 5 4
9 ENSG00000239945.1 0 0 0 0
10 ENSG00000233750.3 2 3 3 1
# … with 57,810 more rows
# A tibble: 57,820 × 20
Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG0000022… 1 0 0 0 0
2 ENSG0000022… 36 21 32 28 17
3 ENSG0000024… 0 0 0 0 0
4 ENSG0000023… 0 0 0 0 0
5 ENSG0000026… 0 0 0 0 0
6 ENSG0000024… 0 0 0 0 0
7 ENSG0000018… 0 0 0 0 0
8 ENSG0000023… 3 0 1 2 1
9 ENSG0000023… 0 0 0 0 0
10 ENSG0000023… 0 0 0 0 0
# … with 57,810 more rows, and 14 more variables:
# AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
# BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
# BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
# BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
# OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# A tibble: 57,820 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <chr> <chr> <chr> <dbl>
1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+… 1756
2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-… 2073
3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+ 1021
4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;- 1219
5 ENSG00000268020.2 chr1;chr1 5247… 5331… +;+ 947
6 ENSG00000240361.1 chr1 62948 63887 + 940
7 ENSG00000186092.4 chr1 69091 70008 + 918
8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-… 3569
9 ENSG00000239945.1 chr1;chr1 8955… 9005… -;- 1319
10 ENSG00000233750.3 chr1 1310… 1348… + 3812
# … with 57,810 more rows
# Select then Filter
gse69360 %>%
select(Geneid, Chr, Length, matches("[AF]_")) %>%
filter(grepl("chrY", Chr) | Length <= 100)
# A tibble: 4,671 × 22
Geneid Chr Length AA_Colon AA_Heart AA_Kidney AA_Liver
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00… chr1… 90 0 0 0 0
2 ENSG00… chr1… 57 0 0 0 0
3 ENSG00… chr1 95 5 0 2 0
4 ENSG00… chr1 90 0 0 0 0
5 ENSG00… chr1 83 7 0 1 2
6 ENSG00… chr1 96 0 0 0 0
7 ENSG00… chr1 70 0 0 0 0
8 ENSG00… chr1 73 0 0 0 0
9 ENSG00… chr1 89 0 0 0 0
10 ENSG00… chr1 70 0 0 0 0
# … with 4,661 more rows, and 15 more variables:
# AA_Lung <dbl>, AA_Stomach <dbl>, AF_Colon <dbl>,
# AF_Stomach <dbl>, BA_Colon <dbl>, BA_Heart <dbl>,
# BA_Kidney <dbl>, BA_Liver <dbl>, BA_Lung <dbl>,
# BA_Stomach <dbl>, BF_Colon <dbl>, BF_Stomach <dbl>,
# OA_Stomach1 <dbl>, OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
mutate
: Adds new variables; keeps existing
variablestransmute
: Adds new variables; drops existing
variables# Excluding columns matching a condition
gse69360 %>%
select(-matches("[AF]_")) %>%
head(., 10) #%>% View()
# A tibble: 10 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <chr> <chr> <chr> <dbl>
1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+… 1756
2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-… 2073
3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+ 1021
4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;- 1219
5 ENSG00000268020.2 chr1;chr1 5247… 5331… +;+ 947
6 ENSG00000240361.1 chr1 62948 63887 + 940
7 ENSG00000186092.4 chr1 69091 70008 + 918
8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-… 3569
9 ENSG00000239945.1 chr1;chr1 8955… 9005… -;- 1319
10 ENSG00000233750.3 chr1 1310… 1348… + 3812
# Storing gene location information in a separate dataframe
gene_loc <- gse69360 %>% # saving output to a variable
select(-matches("[AF]_")) %>% # select columns
mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) %>% # remove isoform no.
mutate(Chr = gsub(";.*$", "", gse69360$Chr)) %>% # keep the first element for Chr
mutate(Start = as.numeric(gsub(";.*$", "", gse69360$Start))) %>% # "" for Start
mutate(End = as.numeric(gsub(";.*$", "", gse69360$End))) %>% # "" for End
mutate(Strand = gsub(";.*$", "", gse69360$Strand)) # "" for Strand
# Check to see if you have what you expected!
gene_loc
# A tibble: 57,820 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 ENSG00000223972 chr1 11869 12227 + 1756
2 ENSG00000227232 chr1 14363 14829 - 2073
3 ENSG00000243485 chr1 29554 30039 + 1021
4 ENSG00000237613 chr1 34554 35174 - 1219
5 ENSG00000268020 chr1 52473 53312 + 947
6 ENSG00000240361 chr1 62948 63887 + 940
7 ENSG00000186092 chr1 69091 70008 + 918
8 ENSG00000238009 chr1 89295 91629 - 3569
9 ENSG00000239945 chr1 89551 90050 - 1319
10 ENSG00000233750 chr1 131025 134836 + 3812
# … with 57,810 more rows
head(gene_loc, 10)
# A tibble: 10 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 ENSG00000223972 chr1 11869 12227 + 1756
2 ENSG00000227232 chr1 14363 14829 - 2073
3 ENSG00000243485 chr1 29554 30039 + 1021
4 ENSG00000237613 chr1 34554 35174 - 1219
5 ENSG00000268020 chr1 52473 53312 + 947
6 ENSG00000240361 chr1 62948 63887 + 940
7 ENSG00000186092 chr1 69091 70008 + 918
8 ENSG00000238009 chr1 89295 91629 - 3569
9 ENSG00000239945 chr1 89551 90050 - 1319
10 ENSG00000233750 chr1 131025 134836 + 3812
# Creating new variables
gene_loc %>%
mutate(kbStart = Start/1000, # creates new variables/columns
kbEnd = End/1000,
kbLength = Length/1000)
# A tibble: 57,820 × 9
Geneid Chr Start End Strand Length kbStart kbEnd
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 ENSG0000… chr1 11869 12227 + 1756 11.9 12.2
2 ENSG0000… chr1 14363 14829 - 2073 14.4 14.8
3 ENSG0000… chr1 29554 30039 + 1021 29.6 30.0
4 ENSG0000… chr1 34554 35174 - 1219 34.6 35.2
5 ENSG0000… chr1 52473 53312 + 947 52.5 53.3
6 ENSG0000… chr1 62948 63887 + 940 62.9 63.9
7 ENSG0000… chr1 69091 70008 + 918 69.1 70.0
8 ENSG0000… chr1 89295 91629 - 3569 89.3 91.6
9 ENSG0000… chr1 89551 90050 - 1319 89.6 90.0
10 ENSG0000… chr1 131025 134836 + 3812 131. 135.
# … with 57,810 more rows, and 1 more variable:
# kbLength <dbl>
# Creating new variables & dropping old ones
gene_loc %>%
transmute(kbStart = Start/1000, # drops original columns
kbEnd = End/1000,
kbLength = Length/1000)
# A tibble: 57,820 × 3
kbStart kbEnd kbLength
<dbl> <dbl> <dbl>
1 11.9 12.2 1.76
2 14.4 14.8 2.07
3 29.6 30.0 1.02
4 34.6 35.2 1.22
5 52.5 53.3 0.947
6 62.9 63.9 0.94
7 69.1 70.0 0.918
8 89.3 91.6 3.57
9 89.6 90.0 1.32
10 131. 135. 3.81
# … with 57,810 more rows
distinct
: Pick unique entriesarrange
: Arrange rows by variables# A tibble: 25 × 1
Chr
<chr>
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
# … with 15 more rows
# A tibble: 2 × 1
Strand
<chr>
1 +
2 -
# A tibble: 50 × 2
Chr Strand
<chr> <chr>
1 chr1 +
2 chr1 -
3 chr2 -
4 chr2 +
5 chr3 +
6 chr3 -
7 chr4 -
8 chr4 +
9 chr5 +
10 chr5 -
# … with 40 more rows
# A tibble: 57,820 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 ENSGR0000228572 chrY 120410 120513 + 259
2 ENSGR0000182378 chrY 142989 143061 + 6191
3 ENSGR0000178605 chrY 170025 170137 - 2363
4 ENSGR0000226179 chrY 231725 231983 + 428
5 ENSGR0000167393 chrY 244698 249631 - 9015
6 ENSGR0000266731 chrY 375316 375416 - 101
7 ENSGR0000234958 chrY 434510 434837 - 328
8 ENSGR0000229232 chrY 455971 456087 - 117
9 ENSGR0000185960 chrY 535079 535337 + 4384
10 ENSGR0000237531 chrY 900956 901162 + 588
# … with 57,810 more rows
# A tibble: 57,820 × 6
Geneid Chr Start End Strand Length
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 ENSG00000268141 chr1 154585067 154585080 + 21
2 ENSG00000224335 chr1 147706573 147706607 + 35
3 ENSG00000263526 chr1 95211416 95211456 - 41
4 ENSG00000230610 chr1 38292232 38292262 - 45
5 ENSG00000252056 chr1 236026688 236026737 - 50
6 ENSG00000238705 chr1 26881033 26881084 + 52
7 ENSG00000264650 chr1 45499641 45499692 - 52
8 ENSG00000265769 chr1 52366592 52366643 - 52
9 ENSG00000251795 chr1 93303575 93303627 + 53
10 ENSG00000238310 chr1 109750416 109750470 - 55
# … with 57,810 more rows
# arrange(Chr, -Length) # to reverse sort by 'numeric' Length
summarize
: Reduces multiple values down to a single
valuegroup_by
: Combine entries by one or more
variables# Combine by a variable, then calculate summary statistics for each group
gene_loc %>%
group_by(Chr) %>% # combine rows by Chr
summarize(numGenes = n(), # then summarise, number of genes/Chr
startoffirstGene = min(Start)) # min to get the first Start location
# A tibble: 25 × 3
Chr numGenes startoffirstGene
<chr> <int> <dbl>
1 chr1 5363 11869
2 chr10 2260 90652
3 chr11 3208 75780
4 chr12 2818 67607
5 chr13 1217 19041312
6 chr14 2244 19110203
7 chr15 2080 20083769
8 chr16 2343 61553
9 chr17 2903 4961
10 chr18 1127 11103
# … with 15 more rows
# Example to show you can use all math/stat functions to summarize data groups
gene_loc %>%
arrange(Length) %>%
group_by(Chr, Strand) %>%
summarize(numGenes = n(),
smallestGene = first(Geneid),
minLength = min(Length),
firstqLength = quantile(Length, 0.25),
medianLength = median(Length),
iqrLength = IQR(Length),
thirdqLength = quantile(Length, 0.75),
maxLength = max(Length),
longestGene = last(Geneid))
`summarise()` has grouped output by 'Chr'. You can override
using the `.groups` argument.
# A tibble: 50 × 11
# Groups: Chr [25]
Chr Strand numGenes smallestGene minLength firstqLength
<chr> <chr> <int> <chr> <dbl> <dbl>
1 chr1 - 2651 ENSG0000026… 41 354.
2 chr1 + 2712 ENSG0000026… 21 394.
3 chr10 - 1109 ENSG0000026… 52 322
4 chr10 + 1151 ENSG0000027… 27 356
5 chr11 - 1605 ENSG0000026… 30 477
6 chr11 + 1603 ENSG0000025… 37 476
7 chr12 - 1415 ENSG0000025… 28 375
8 chr12 + 1403 ENSG0000026… 30 402
9 chr13 - 631 ENSG0000021… 48 297
10 chr13 + 586 ENSG0000026… 57 371.
# … with 40 more rows, and 5 more variables:
# medianLength <dbl>, iqrLength <dbl>,
# thirdqLength <dbl>, maxLength <dbl>, longestGene <chr>
Let’s combine everything from above to tidy the full GSE69360 dataset.
Dataset details:
# Extracting just the expression values & cleaning it up
View(head(gse69360, 50))
gene_counts <- gse69360 %>%
select(-Chr, -Start, -End, -Strand, -Length) %>% # another way to select just the expression data
rename(OA_Stomach = OA_Stomach1) %>% # rename couple of columns
mutate(OA_Stomach2 = NULL, OA_Stomach3 = NULL) %>% # remove a couple of columns
mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) # cleanup data a specific column
logcpm <- gene_counts %>%
select(-Geneid) %>%
mutate_all(function(x) { log2((x*(1e+6)/sum(x)) + 1) } ) # convert counts in each sample to counts-per-million
summary(logcpm)
AA_Colon AA_Heart AA_Kidney
Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 0.06883 Median : 0.0000 Median : 0.0000
Mean : 1.10187 Mean : 0.8103 Mean : 0.8785
3rd Qu.: 1.33045 3rd Qu.: 0.5151 3rd Qu.: 0.7181
Max. :18.41311 Max. :18.7478 Max. :18.8061
AA_Liver AA_Lung AA_Stomach
Min. : 0.00000 Min. : 0.000 Min. : 0.00000
1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 0.00000
Median : 0.05563 Median : 0.000 Median : 0.06964
Mean : 0.98788 Mean : 1.049 Mean : 0.93120
3rd Qu.: 0.95867 3rd Qu.: 1.123 3rd Qu.: 0.91851
Max. :17.85244 Max. :17.722 Max. :17.85986
AF_Colon AF_Stomach BA_Colon
Min. : 0.0000 Min. : 0.0000 Min. : 0.00000
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000
Median : 0.1995 Median : 0.0000 Median : 0.08114
Mean : 1.2903 Mean : 0.7748 Mean : 0.98586
3rd Qu.: 1.8951 3rd Qu.: 0.8105 3rd Qu.: 1.10906
Max. :17.4620 Max. :18.3586 Max. :18.75742
BA_Heart BA_Kidney BA_Liver
Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
Median : 0.04537 Median : 0.05856 Median : 0.0000
Mean : 0.80833 Mean : 0.96659 Mean : 0.8194
3rd Qu.: 0.68444 3rd Qu.: 1.11092 3rd Qu.: 0.5974
Max. :18.92846 Max. :18.87617 Max. :17.7612
BA_Lung BA_Stomach BF_Colon
Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
1st Qu.: 0.1643 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 0.4456 Median : 0.0000 Median : 0.1793
Mean : 1.3873 Mean : 0.7528 Mean : 1.3716
3rd Qu.: 1.9392 3rd Qu.: 0.6261 3rd Qu.: 1.9775
Max. :17.5495 Max. :18.3249 Max. :15.8531
BF_Stomach OA_Stomach
Min. : 0.0000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 0.3085 Median : 0.1003
Mean : 1.5164 Mean : 1.0800
3rd Qu.: 2.4221 3rd Qu.: 1.3692
Max. :15.7256 Max. :17.9600
gene_logcpm <- gene_counts %>% # a new dataframe with logcpm values
select(Geneid) %>%
bind_cols(logcpm) %>% # bind the logcpm matrix to the geneids
pivot_longer(-Geneid, names_to="Sample", values_to="Logcpm") %>% # convert to tidy data
separate(Sample, # cleanup complex variables
into = c("Source", "Stage", "Tissue"),
sep = c(1,2),
remove = F) %>% # keep original variable
mutate(Tissue = gsub("^_", "", Tissue),
Stage = ifelse(Stage == "A", "Adult", "Fetus"))
View(head(gene_logcpm, 50))
# Plotting the distribution of gene-expression in each sample
gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot(outlier.size = 0.2, outlier.shape = 0.2) +
scale_y_continuous(limits = c(0, 1)) +
coord_flip() +
theme_minimal()
Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).
ggplot
Creating a plot w/ Grammar of Graphics
- Recap and continuation of dplyr
- Basics of plotting data with ggplot2:
data
,aes
,geom
- Customization: Colors, labels, and legends
ggplot
, factor
, aes
geom_bar
, geom_histogram
facet_wrap
scale_x_log10
, labs
,
coord_flip
, theme
,
theme_minimal
gene_loc %>% # data
ggplot(aes(x = Chr)) + # aesthetics: what to plot?
geom_bar() # geometry: how to plot?
gene_loc$Chr <- factor(gene_loc$Chr,
levels = paste("chr",
c((1:22), "X", "Y", "M"),
sep=""))
plot_chr_numgenes <- gene_loc %>%
ggplot(aes(x = Chr)) +
geom_bar()
plot_chr_numgenes
plot_chr_numgenes +
coord_flip() +
theme_minimal()
plot_chr_numgenes +
labs(title = "No. genes per chromosome",
x = "Chromosome",
y = "No. of genes") +
theme_minimal() +
coord_flip()
gene_loc %>%
ggplot(aes(x = Length)) +
geom_histogram(color = "white") +
scale_x_log10() +
theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
plot_chr_genelength <- gene_loc %>%
ggplot(aes(x = Length, fill = Chr)) +
geom_histogram(color = "white") +
scale_x_log10() +
theme_minimal() +
facet_wrap(~Chr, scales = "free_y")
plot_chr_genelength
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
plot_chr_genelength +
theme(legend.position = "none") +
labs(x = "Gene length (log-scale)",
y = "No. of genes")
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
geom_point
geom_abline
, geom_vline
,
geom_hline
geom_smooth
, geom_text_repel
gene_loc %>%
ggplot(aes(x = End-Start, y = Length)) +
geom_point()
plot_strend_length <- gene_loc %>%
ggplot(aes(x = End-Start, y = Length)) +
geom_point(alpha = 0.1, size = 0.5, color = "grey", fill = "grey")
plot_strend_length
plot_strend_length <- plot_strend_length +
scale_x_log10("End-Start") +
scale_y_log10("Gene length") +
theme_minimal()
plot_strend_length
Warning: Transformation introduced infinite values in
continuous x-axis
plot_strend_length +
geom_abline(intercept = 0, slope = 1, col = "red") +
geom_hline(yintercept = 500, color = "blue") +
geom_vline(xintercept = 1000, color = "orange")
Warning: Transformation introduced infinite values in
continuous x-axis
# A tibble: 25 × 3
Chr meanLength numGenes
<fct> <dbl> <int>
1 chr1 2258. 5363
2 chr2 2304. 4047
3 chr3 2382. 3101
4 chr4 2109. 2563
5 chr5 2188. 2859
6 chr6 2124. 2905
7 chr7 2164. 2876
8 chr8 2036. 2386
9 chr9 2123. 2323
10 chr10 2160. 2260
# … with 15 more rows
gene_loc %>%
group_by(Chr) %>%
summarize(meanLength = mean(Length), numGenes = n()) %>%
ggplot(aes(x = numGenes, y = meanLength)) +
geom_point()
# install.packages("ggrepel", dependencies=T)
library(ggrepel)
gene_loc %>%
group_by(Chr) %>%
summarize(meanLength = mean(Length), numGenes = n()) %>%
ggplot(aes(x = numGenes, y = meanLength)) +
geom_point() +
geom_smooth(color = "lightblue", alpha = 0.1) +
labs(x = "No. of genes", y = "Mean gene length") +
geom_text_repel(aes(label = Chr), color="red", segment.color="grey80") +
theme_minimal()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
geom_boxplot
, geom_violin
scale_y_continuous
gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot() +
coord_flip() +
theme_minimal()
gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_violin() +
scale_y_continuous(limits = c(0, 0.5)) +
coord_flip() +
theme_minimal()
Warning: Removed 320317 rows containing non-finite values
(stat_ydensity).
# Plotting the distribution of gene-expression in each sample
plot_sample_bxp <- gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
scale_y_continuous(limits = c(0, 1)) +
coord_flip() +
theme_minimal()
plot_sample_bxp
Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).
# Plotting scatterplot of 2 sets of samples
plot_ffcolon_scatter <- gene_logcpm %>%
filter(Sample == "AF_Colon" | Sample == "BF_Colon") %>%
select(Geneid, Sample, Logcpm) %>%
pivot_wider(names_from = Sample, values_from = Logcpm) %>%
ggplot(aes(x = AF_Colon, y = BF_Colon)) +
geom_point(alpha = 0.1, size = 0.5) +
geom_smooth(method=lm) +
theme_minimal()
plot_ffcolon_scatter
`geom_smooth()` using formula 'y ~ x'
# Finding genes with high variance across samples
num_totgenes <- gene_logcpm %>%
distinct(Geneid) %>%
nrow()
highvar_genes <- gene_logcpm %>%
group_by(Geneid) %>%
summarize(iqr = IQR(Logcpm)) %>%
top_n((ceiling(num_totgenes*0.05)), iqr) %>%
pull(Geneid)
length(highvar_genes)
[1] 2891
# Plotting expression of high-var Y chr genes across samples
chry_highvar_genes <- gene_loc %>%
filter(Chr == "chrY" & Geneid %in% highvar_genes) %>%
pull(Geneid)
gene_logcpm %>%
filter(Geneid %in% chry_highvar_genes)
# A tibble: 238 × 6
Geneid Sample Source Stage Tissue Logcpm
<chr> <chr> <chr> <chr> <chr> <dbl>
1 ENSG00000129824 AA_Colon A Adult Colon 6.12
2 ENSG00000129824 AA_Heart A Adult Heart 0
3 ENSG00000129824 AA_Kidney A Adult Kidney 0.106
4 ENSG00000129824 AA_Liver A Adult Liver 1.75
5 ENSG00000129824 AA_Lung A Adult Lung 0
6 ENSG00000129824 AA_Stomach A Adult Stomach 0
7 ENSG00000129824 AF_Colon A Fetus Colon 0.0352
8 ENSG00000129824 AF_Stomach A Fetus Stomach 4.51
9 ENSG00000129824 BA_Colon B Adult Colon 6.05
10 ENSG00000129824 BA_Heart B Adult Heart 4.81
# … with 228 more rows
plot_chry_highvar_boxplot <- gene_logcpm %>%
filter(Geneid %in% chry_highvar_genes) %>%
ggplot(aes(x = reorder(Sample, Logcpm, FUN = median),
y = Logcpm,
color = Sample)) +
geom_boxplot() +
coord_flip() +
theme_minimal() +
theme(legend.position = "none")
plot_chry_highvar_boxplot
RMarkdown
Save a ggplot (or other grid object) with sensible defaults
library(tidyverse)
# Save your file name
plot1 <- "chr_highvar.png"
# Save your absolute/relative path
my_full_path <- here("transcriptomics")
# To save as a tab-delimited text file ...
ggsave(filename=plot1,
plot=plot_chry_highvar_boxplot,
device="png",
path=my_full_path,
dpi=600)
Saving 6.5 x 4 in image
Write a dataframe to a delimited file
library(tidyverse)
# Save your file name
filename <- "my_new_data.txt"
# Save your absolute/relative path
my_full_path <- here("data")
# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
path=paste(my_full_path, filename, "/"), # Absolute path recommended.
# However, you can directly use 'filename' here
# if you are saving the file in the same directory
# as your code.
col_names=T) # if you want the column names to be
# saved in the first row, recommended
# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
path=my_path,
col_names=T)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
path=my_path,
col_names=T,
delim="---")
Option | Description |
---|---|
Part 1 | Getting Started |
install.packages |
Download and install packages from CRAN-like repositories or from local files |
library |
Library and require load and attach add-on packages |
package::function |
To run a function w/o loading the package |
Import | tidyverse > readr &
readxl |
read_delim |
Read a delimited file (incl csv, tsv) into a tibble |
read_csv, read_tsv |
read_csv() and read_tsv() are special cases of the general read_delim() |
read_excel |
Read xls and xlsx files |
here |
Part of here package. To set paths relative to your
current project directory |
Data snapshot | |
str |
Compactly Display the Structure of an Arbitrary R Object |
head |
Return the First or Last Part of an Object |
glimpse |
Get a glimpse of your data |
View |
Invoke a Data Viewer (pop-up, requires XQuartz/X11) |
kable |
Create tables in LaTeX, HTML, Markdown and reStructuredText |
paged_table |
Create a table in HTML with support for paging rows and columns |
Part 2 | tidyverse > tidyr |
pivot_longer |
Gather Columns Into Key-Value Pairs (COLS -> ROWS) |
pivot_wider |
Spread a key-value pair across multiple columns |
separate |
Separate one column into multiple column |
unite |
Unite multiple columns into one |
Part 3 | tidyverse > dplyr |
filter |
Return rows with matching conditions |
select |
Select/rename variables by name |
mutate |
Add new variables |
transmute |
Add new variables & drops existing variables |
distinct |
Get unique entries |
arrange |
Arrange rows by variables |
summarise |
Reduces multiple values down to a single value |
group_by |
Group by one or more variables |
join |
Join two dataframes together: left_join ,
right_join , inner_join |
bind |
Efficiently bind multiple dataframes by row and column:
bind_rows , bind_cols |
setops |
Set operations: intersect , union ,
setdiff , setequal |
Part 4 | tidyverse > ggplot |
ggplot |
Create a new ggplot |
Aesthetics | |
aes |
Specify details on what to plot |
factor |
Specify a variable to be ordered & categorical |
facet_wrap |
Split into multiple sub-plots based on a categorical variable |
Geometries | |
geom_bar |
Create a barplot |
geom_histogram |
Create a histogram |
geom_point |
Create a scatter plot |
geom_boxplot |
Create a boxplot |
geom_violin |
Create a violin plot |
geom_abline, geom_hline, geom_vline |
Add slanted, horizontal, or vertical lines |
geom_smooth |
Add a smooth curve that fits the data points |
Plot customization | |
theme, theme_minimal, theme_bw |
Adjust themes to minimal, black/white, etc |
scale_x_log10 |
Change x/y axes to log scale |
scale_y_continuous |
Change x/y axes to continuous & set limits |
coord_flip |
Flip x & y coordinates |
labs |
Specify axes labels and plot titles |
Part 5 | Export & Wrap-up |
Export | tidyverse > readr |
ggsave |
Save a ggplot (or other grid object) with sensible defaults |
write_delim |
Write a dataframe to a delimited file |
write_tsv |
write_delim customized for tab-separated values |
write_csv |
write_delim customized for comma-separated values |
Arjun Krishnan and I co-developed the content for this workshop.
For attribution, please cite this work as
Ravi & Krishnan (2022, May 28). Introduction to Tidyverse | using transcriptomic data. Retrieved from https://github.com/jananiravi/workshop-tidyverse
BibTeX citation
@misc{ravi2022introduction, author = {Ravi, Janani and Krishnan, Arjun}, title = {Introduction to Tidyverse | using transcriptomic data}, url = {https://github.com/jananiravi/workshop-tidyverse}, year = {2022} }