Introduction to Tidyverse | using transcriptomic data

RMarkdown file to accompany the Intro to R/Tidyverse workshop (using processed transcriptomic data)

Janani Ravi https://jravilab.github.io (PDI, MMG, MSU | R-Ladies East Lansing)https://github.com/rladies-eastlansing , Arjun Krishnan https://thekrishnanlab.org (CMSE & BMB, MSU)https://cmse.msu.edu/
2022-05-28

About me

I am a computational biologist, an Asst. Professor in the departments of Pathobiology & Diagnostic Investigation and Microbiology & Molecular Genetics at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.

I’m also the founder & co-organizer of the R-Ladies East Lansing, Women+ Data Science, and AsiaR. We conduct R and data science related meetups & workshops regularly! So, do check out our upcoming events on Meetup.

Part 1: Getting Started w/ readr

You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.

Installation and set-up

Install RStudio

Running RStudio locally? Download RStudio

Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version

Trouble with local installation? Login & start using RStudio Cloud right away!

New to RStudio IDE? Use Help Page #1 & Page #2

Install R

… if you haven’t already! The RStudio startup message should specify your current local version of R. For e.g., R v4.1.2

Install packages

install.packages("tidyverse")   # for data wrangling
install.packages("here")        # to set paths relative to your current project.
# install.packages("gapminder") # sample dataset

Explore the RLEL workshops for more examples and sample codes for Tidy Data and DataViz [e.g., >> presentations/20181105-workshop-tidydata uses the gapminder & USArrests datasets].

Trouble installing tidyverse?

# If tidyverse installation fails, install individual constituent packages this way...
install.packages("readr")    # Importing data files
install.packages("tidyr")    # Tidy Data
install.packages("dplyr")    # Data manipulation
install.packages("ggplot2")  # Data Visualization (w/ Grammar of Graphics)
install.packages("readxl")   # Importing excel files

Loading packages

── Attaching packages ─────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# OR load the individual packages, if you have trouble installing/loading `tidyverse`
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(here) # https://github.com/jennybc/here_here
here() starts at /Users/jananiravi/GitHub/workshop-tidyverse
# library(gapminder) # useful dataset for data wrangling, visualization

Some useful cheatsheets

Cheatsheets @RStudio | Our cheatsheets repo

More resources towards the end of the document.

Data import

here::here() # where you opened your RStudio session/Project
[1] "/Users/jananiravi/GitHub/workshop-tidyverse"
# Downloading sample transcriptomics dataset from NCBI's FTP site
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE69nnn/GSE69360/suppl/GSE69360_RNAseq.counts.txt.gz"
gse69360 <- read_tsv(url) # to read tab-delimitted files
Rows: 57820 Columns: 25
── Column specification ────────────────────────────────────
Delimiter: "\t"
chr  (5): Geneid, Chr, Start, End, Strand
dbl (20): Length, AA_Colon, AA_Heart, AA_Kidney, AA_Live...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
?read_tsv # to check defaults
# Comma-separated values, as exported from excel/spreadsheets
read_csv(file="path/to/my_data.csv", col_names=T)
# Other atypical delimitters
read_delim(file="path/to/my_data.txt", col_names=T, delim="//")

# Other useful packages
# readxl by Jenny Bryan
read_excel(path="path/to/excel.xls",
           sheet=1,
           range="A1:D50",
           col_names=T)

## Tip: Always open .Rproj and use relative paths with here()
## Example with here()
read_tsv(file=here("transcriptomics/GSE69360_RNAseq.counts.txt"), col_names=T)

Knowing your data

Dataset details:

str(gse69360)              # Structure of the dataframe
gse69360                   # Data is in a cleaned up 'tibble' format by default
head(gse69360)             # Shows the top few observations (rows) of your dataframe
glimpse(gse69360)          # Info-dense summary of the data
View(head(gse69360, 100))  # View data in a visual GUI-based spreadsheet-like format

colnames(gse69360)         # Column names
nrow(gse69360)             # No. of rows
ncol(gse69360)             # No. of columns

gse69360[1:5,7:10]         # Subsetting a dataframe

## saving the data file
write_tsv(gse69360[1:100,7:12], "gse_subset.txt")
library(knitr)
kable(head(gse69360))
Geneid Chr Start End Strand Length AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung AA_Stomach AF_Colon AF_Stomach BA_Colon BA_Heart BA_Kidney BA_Liver BA_Lung BA_Stomach BF_Colon BF_Stomach OA_Stomach1 OA_Stomach2 OA_Stomach3
ENSG00000223972.4 chr1;chr1;chr1;chr1 11869;12595;12975;13221 12227;12721;13052;14412 +;+;+;+ 1756 1 0 0 0 0 0 7 2 0 0 0 0 7 0 0 3 3 1 0
ENSG00000227232.4 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 14363;14970;15796;16607;16854;17233;17498;17602;17915;18268;24734;29321;29534 14829;15038;15947;16765;17055;17368;17504;17742;18061;18379;24891;29370;29806 -;-;-;-;-;-;-;-;-;-;-;-;- 2073 36 21 32 28 17 17 102 41 10 24 22 6 49 2 153 157 38 53 18
ENSG00000243485.2 chr1;chr1;chr1 29554;30267;30976 30039;30667;31109 +;+;+ 1021 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000237613.2 chr1;chr1;chr1 34554;35245;35721 35174;35481;36081 -;-;- 1219 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020.2 chr1;chr1 52473;54830 53312;54936 +;+ 947 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
ENSG00000240361.1 chr1 62948 63887 + 940 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Part 2: Reshaping data w/ tidyr

## legacy tidyverse functions
gather()   # Gather COLUMNS -> ROWS
spread()   # Spread ROWS -> COLUMNS

## newer tidyverse functions
pivot_longer()  # Pivot data from wide to long
pivot_wider()   # Pivot data from long to wide

separate() # Separate 1 COLUMN -> many COLUMNS
unite()    # Unite several COLUMNS -> 1 COLUMN

Gather, Spread

# Gather all columns except 'Geneid'
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  # gather(-Geneid, key="Sample", value="Counts") # wide -> long format  ## legacy
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts")
# A tibble: 1,098,580 × 3
   Geneid            Sample     Counts
   <chr>             <chr>       <dbl>
 1 ENSG00000223972.4 AA_Colon        1
 2 ENSG00000223972.4 AA_Heart        0
 3 ENSG00000223972.4 AA_Kidney       0
 4 ENSG00000223972.4 AA_Liver        0
 5 ENSG00000223972.4 AA_Lung         0
 6 ENSG00000223972.4 AA_Stomach      0
 7 ENSG00000223972.4 AF_Colon        7
 8 ENSG00000223972.4 AF_Stomach      2
 9 ENSG00000223972.4 BA_Colon        0
10 ENSG00000223972.4 BA_Heart        0
# … with 1,098,570 more rows
# Gather, then Spread --> Back to original data
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  # gather(-Geneid, key="Sample", value="Counts") %>%
  # spread(key = "Sample", value = "Counts")     # spread to turn back to the original data!
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  pivot_wider(names_from="Sample", values_from="Counts")
# A tibble: 57,820 × 20
   Geneid       AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
   <chr>           <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 ENSG0000022…        1        0         0        0       0
 2 ENSG0000022…       36       21        32       28      17
 3 ENSG0000024…        0        0         0        0       0
 4 ENSG0000023…        0        0         0        0       0
 5 ENSG0000026…        0        0         0        0       0
 6 ENSG0000024…        0        0         0        0       0
 7 ENSG0000018…        0        0         0        0       0
 8 ENSG0000023…        3        0         1        2       1
 9 ENSG0000023…        0        0         0        0       0
10 ENSG0000023…        0        0         0        0       0
# … with 57,810 more rows, and 14 more variables:
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

Unite, Separate

gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%               # selecting only Counts columns
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>% # wide -> long
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") # separate logically
# A tibble: 1,098,580 × 4
   Geneid            Source_Stage Tissue  Counts
   <chr>             <chr>        <chr>    <dbl>
 1 ENSG00000223972.4 AA           Colon        1
 2 ENSG00000223972.4 AA           Heart        0
 3 ENSG00000223972.4 AA           Kidney       0
 4 ENSG00000223972.4 AA           Liver        0
 5 ENSG00000223972.4 AA           Lung         0
 6 ENSG00000223972.4 AA           Stomach      0
 7 ENSG00000223972.4 AF           Colon        7
 8 ENSG00000223972.4 AF           Stomach      2
 9 ENSG00000223972.4 BA           Colon        0
10 ENSG00000223972.4 BA           Heart        0
# … with 1,098,570 more rows
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) # separate by char position
# A tibble: 1,098,580 × 5
   Geneid            Source Stage Tissue  Counts
   <chr>             <chr>  <chr> <chr>    <dbl>
 1 ENSG00000223972.4 A      A     Colon        1
 2 ENSG00000223972.4 A      A     Heart        0
 3 ENSG00000223972.4 A      A     Kidney       0
 4 ENSG00000223972.4 A      A     Liver        0
 5 ENSG00000223972.4 A      A     Lung         0
 6 ENSG00000223972.4 A      A     Stomach      0
 7 ENSG00000223972.4 A      F     Colon        7
 8 ENSG00000223972.4 A      F     Stomach      2
 9 ENSG00000223972.4 B      A     Colon        0
10 ENSG00000223972.4 B      A     Heart        0
# … with 1,098,570 more rows
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) %>%
  unite(Stage_Tissue, Stage, Tissue) # combining a different set of columns
# A tibble: 1,098,580 × 4
   Geneid            Source Stage_Tissue Counts
   <chr>             <chr>  <chr>         <dbl>
 1 ENSG00000223972.4 A      A_Colon           1
 2 ENSG00000223972.4 A      A_Heart           0
 3 ENSG00000223972.4 A      A_Kidney          0
 4 ENSG00000223972.4 A      A_Liver           0
 5 ENSG00000223972.4 A      A_Lung            0
 6 ENSG00000223972.4 A      A_Stomach         0
 7 ENSG00000223972.4 A      F_Colon           7
 8 ENSG00000223972.4 A      F_Stomach         2
 9 ENSG00000223972.4 B      A_Colon           0
10 ENSG00000223972.4 B      A_Heart           0
# … with 1,098,570 more rows

Part 3: Data wranging with dplyr

conflicted::conflict_prefer(name="filter", winner="dplyr")
filter()    # PICK observations by their values | ROWS
select()    # PICK variables by their names | COLUMNS
mutate()    # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange()   # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by()  # GROUP data into rows with the same value of variable (COLUMN)

Filter

head(gse69360) # Snapshot of the dataframe
# A tibble: 6 × 25
  Geneid   Chr   Start End   Strand Length AA_Colon AA_Heart
  <chr>    <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
1 ENSG000… chr1… 1186… 1222… +;+;+…   1756        1        0
2 ENSG000… chr1… 1436… 1482… -;-;-…   2073       36       21
3 ENSG000… chr1… 2955… 3003… +;+;+    1021        0        0
4 ENSG000… chr1… 3455… 3517… -;-;-    1219        0        0
5 ENSG000… chr1… 5247… 5331… +;+       947        0        0
6 ENSG000… chr1  62948 63887 +         940        0        0
# … with 17 more variables: AA_Kidney <dbl>,
#   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
#   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
#   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
#   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
#   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
#   OA_Stomach3 <dbl>
# Now, filter by condition
filter(gse69360, Length<=50)
# A tibble: 124 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSG00… chr1… 3829… 3829… -;-        45        1        0
 2 ENSG00… chr1  9521… 9521… -          41        7        1
 3 ENSG00… chr1  1477… 1477… +          35        0        1
 4 ENSG00… chr1… 1545… 1545… +;+        21        0        0
 5 ENSG00… chr1  2360… 2360… -          50        1        2
 6 ENSG00… chr2… 1160… 1160… +;+        36        1        0
 7 ENSG00… chr2  8916… 8916… -          38      183        1
 8 ENSG00… chr2  8916… 8916… -          37       98        2
 9 ENSG00… chr2  8916… 8916… -          38       80        0
10 ENSG00… chr2  8916… 8916… -          38        0        0
# … with 114 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Can be rewritten using "Piping" %>%
gse69360 %>%   # Pipe ('then') operator to serially connect operations
  filter(Length <= 50)
# A tibble: 124 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSG00… chr1… 3829… 3829… -;-        45        1        0
 2 ENSG00… chr1  9521… 9521… -          41        7        1
 3 ENSG00… chr1  1477… 1477… +          35        0        1
 4 ENSG00… chr1… 1545… 1545… +;+        21        0        0
 5 ENSG00… chr1  2360… 2360… -          50        1        2
 6 ENSG00… chr2… 1160… 1160… +;+        36        1        0
 7 ENSG00… chr2  8916… 8916… -          38      183        1
 8 ENSG00… chr2  8916… 8916… -          37       98        2
 9 ENSG00… chr2  8916… 8916… -          38       80        0
10 ENSG00… chr2  8916… 8916… -          38        0        0
# … with 114 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Filtering using regex/substring match
gse69360 %>%
  filter(grepl("chrY", Chr))
# A tibble: 542 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSGR0… chrY… 1204… 1205… +;+       259        0        0
 2 ENSGR0… chrY… 1429… 1430… +;+;+…   6191        0        0
 3 ENSGR0… chrY… 1700… 1701… -;-;-…   2363        0        0
 4 ENSGR0… chrY… 2317… 2319… +;+       428        0        0
 5 ENSGR0… chrY… 2446… 2496… -;-;-…   9015        0        0
 6 ENSGR0… chrY  3753… 3754… -         101        0        0
 7 ENSGR0… chrY  4345… 4348… -         328        0        0
 8 ENSGR0… chrY  4559… 4560… -         117        0        0
 9 ENSGR0… chrY… 5350… 5353… +;+;+…   4384        0        0
10 ENSGR0… chrY… 9009… 9011… +;+       588        0        0
# … with 532 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Two filters at a time
gse69360 %>%
  filter(Length <= 50 & grepl("chrY", Chr))
# A tibble: 1 × 25
  Geneid   Chr   Start End   Strand Length AA_Colon AA_Heart
  <chr>    <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
1 ENSG000… chrY… 5306… 5306… -;-        42        1        0
# … with 17 more variables: AA_Kidney <dbl>,
#   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
#   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
#   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
#   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
#   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
#   OA_Stomach3 <dbl>

Select

# Selecting columns that match a pattern
gse69360 %>%
  select(Geneid, matches(".F_"))
# A tibble: 57,820 × 5
   Geneid            AF_Colon AF_Stomach BF_Colon BF_Stomach
   <chr>                <dbl>      <dbl>    <dbl>      <dbl>
 1 ENSG00000223972.4        7          2        0          3
 2 ENSG00000227232.4      102         41      153        157
 3 ENSG00000243485.2        0          0        0          0
 4 ENSG00000237613.2        0          0        0          0
 5 ENSG00000268020.2        0          0        0          0
 6 ENSG00000240361.1        0          0        0          0
 7 ENSG00000186092.4        0          0        0          0
 8 ENSG00000238009.2        4          0        5          4
 9 ENSG00000239945.1        0          0        0          0
10 ENSG00000233750.3        2          3        3          1
# … with 57,810 more rows
# Excluding specific columns
gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length)
# A tibble: 57,820 × 20
   Geneid       AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
   <chr>           <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 ENSG0000022…        1        0         0        0       0
 2 ENSG0000022…       36       21        32       28      17
 3 ENSG0000024…        0        0         0        0       0
 4 ENSG0000023…        0        0         0        0       0
 5 ENSG0000026…        0        0         0        0       0
 6 ENSG0000024…        0        0         0        0       0
 7 ENSG0000018…        0        0         0        0       0
 8 ENSG0000023…        3        0         1        2       1
 9 ENSG0000023…        0        0         0        0       0
10 ENSG0000023…        0        0         0        0       0
# … with 57,810 more rows, and 14 more variables:
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Excluding columns matching a pattern
gse69360 %>%
  select(-matches("[AF]_"))
# A tibble: 57,820 × 6
   Geneid            Chr           Start End   Strand Length
   <chr>             <chr>         <chr> <chr> <chr>   <dbl>
 1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+…   1756
 2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-…   2073
 3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+    1021
 4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;-    1219
 5 ENSG00000268020.2 chr1;chr1     5247… 5331… +;+       947
 6 ENSG00000240361.1 chr1          62948 63887 +         940
 7 ENSG00000186092.4 chr1          69091 70008 +         918
 8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-…   3569
 9 ENSG00000239945.1 chr1;chr1     8955… 9005… -;-      1319
10 ENSG00000233750.3 chr1          1310… 1348… +        3812
# … with 57,810 more rows
# Select then Filter
gse69360 %>%
  select(Geneid, Chr, Length, matches("[AF]_")) %>%
  filter(grepl("chrY", Chr) | Length <= 100)
# A tibble: 4,671 × 22
   Geneid  Chr   Length AA_Colon AA_Heart AA_Kidney AA_Liver
   <chr>   <chr>  <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
 1 ENSG00… chr1…     90        0        0         0        0
 2 ENSG00… chr1…     57        0        0         0        0
 3 ENSG00… chr1      95        5        0         2        0
 4 ENSG00… chr1      90        0        0         0        0
 5 ENSG00… chr1      83        7        0         1        2
 6 ENSG00… chr1      96        0        0         0        0
 7 ENSG00… chr1      70        0        0         0        0
 8 ENSG00… chr1      73        0        0         0        0
 9 ENSG00… chr1      89        0        0         0        0
10 ENSG00… chr1      70        0        0         0        0
# … with 4,661 more rows, and 15 more variables:
#   AA_Lung <dbl>, AA_Stomach <dbl>, AF_Colon <dbl>,
#   AF_Stomach <dbl>, BA_Colon <dbl>, BA_Heart <dbl>,
#   BA_Kidney <dbl>, BA_Liver <dbl>, BA_Lung <dbl>,
#   BA_Stomach <dbl>, BF_Colon <dbl>, BF_Stomach <dbl>,
#   OA_Stomach1 <dbl>, OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

Mutate

# Excluding columns matching a condition
gse69360 %>%
  select(-matches("[AF]_")) %>%
  head(., 10) #%>% View()
# A tibble: 10 × 6
   Geneid            Chr           Start End   Strand Length
   <chr>             <chr>         <chr> <chr> <chr>   <dbl>
 1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+…   1756
 2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-…   2073
 3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+    1021
 4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;-    1219
 5 ENSG00000268020.2 chr1;chr1     5247… 5331… +;+       947
 6 ENSG00000240361.1 chr1          62948 63887 +         940
 7 ENSG00000186092.4 chr1          69091 70008 +         918
 8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-…   3569
 9 ENSG00000239945.1 chr1;chr1     8955… 9005… -;-      1319
10 ENSG00000233750.3 chr1          1310… 1348… +        3812
# Storing gene location information in a separate dataframe
gene_loc <- gse69360 %>%                                            # saving output to a variable
  select(-matches("[AF]_")) %>%                                     # select columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) %>%               # remove isoform no.
  mutate(Chr = gsub(";.*$", "", gse69360$Chr)) %>%                  # keep the first element for Chr
  mutate(Start = as.numeric(gsub(";.*$", "", gse69360$Start))) %>%  # "" for Start
  mutate(End = as.numeric(gsub(";.*$", "", gse69360$End))) %>%      # "" for End
  mutate(Strand = gsub(";.*$", "", gse69360$Strand))                # "" for Strand

# Check to see if you have what you expected!
gene_loc
# A tibble: 57,820 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSG00000223972 chr1   11869  12227 +        1756
 2 ENSG00000227232 chr1   14363  14829 -        2073
 3 ENSG00000243485 chr1   29554  30039 +        1021
 4 ENSG00000237613 chr1   34554  35174 -        1219
 5 ENSG00000268020 chr1   52473  53312 +         947
 6 ENSG00000240361 chr1   62948  63887 +         940
 7 ENSG00000186092 chr1   69091  70008 +         918
 8 ENSG00000238009 chr1   89295  91629 -        3569
 9 ENSG00000239945 chr1   89551  90050 -        1319
10 ENSG00000233750 chr1  131025 134836 +        3812
# … with 57,810 more rows
head(gene_loc, 10)
# A tibble: 10 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSG00000223972 chr1   11869  12227 +        1756
 2 ENSG00000227232 chr1   14363  14829 -        2073
 3 ENSG00000243485 chr1   29554  30039 +        1021
 4 ENSG00000237613 chr1   34554  35174 -        1219
 5 ENSG00000268020 chr1   52473  53312 +         947
 6 ENSG00000240361 chr1   62948  63887 +         940
 7 ENSG00000186092 chr1   69091  70008 +         918
 8 ENSG00000238009 chr1   89295  91629 -        3569
 9 ENSG00000239945 chr1   89551  90050 -        1319
10 ENSG00000233750 chr1  131025 134836 +        3812
# Creating new variables
gene_loc %>%
  mutate(kbStart = Start/1000,     # creates new variables/columns
         kbEnd = End/1000,
         kbLength = Length/1000)
# A tibble: 57,820 × 9
   Geneid    Chr    Start    End Strand Length kbStart kbEnd
   <chr>     <chr>  <dbl>  <dbl> <chr>   <dbl>   <dbl> <dbl>
 1 ENSG0000… chr1   11869  12227 +        1756    11.9  12.2
 2 ENSG0000… chr1   14363  14829 -        2073    14.4  14.8
 3 ENSG0000… chr1   29554  30039 +        1021    29.6  30.0
 4 ENSG0000… chr1   34554  35174 -        1219    34.6  35.2
 5 ENSG0000… chr1   52473  53312 +         947    52.5  53.3
 6 ENSG0000… chr1   62948  63887 +         940    62.9  63.9
 7 ENSG0000… chr1   69091  70008 +         918    69.1  70.0
 8 ENSG0000… chr1   89295  91629 -        3569    89.3  91.6
 9 ENSG0000… chr1   89551  90050 -        1319    89.6  90.0
10 ENSG0000… chr1  131025 134836 +        3812   131.  135. 
# … with 57,810 more rows, and 1 more variable:
#   kbLength <dbl>
# Creating new variables & dropping old ones
gene_loc %>%
  transmute(kbStart = Start/1000,  # drops original columns
            kbEnd = End/1000,
            kbLength = Length/1000)
# A tibble: 57,820 × 3
   kbStart kbEnd kbLength
     <dbl> <dbl>    <dbl>
 1    11.9  12.2    1.76 
 2    14.4  14.8    2.07 
 3    29.6  30.0    1.02 
 4    34.6  35.2    1.22 
 5    52.5  53.3    0.947
 6    62.9  63.9    0.94 
 7    69.1  70.0    0.918
 8    89.3  91.6    3.57 
 9    89.6  90.0    1.32 
10   131.  135.     3.81 
# … with 57,810 more rows

Distinct & Arrange

# Pick only the unique entries in a column
gene_loc %>%
  distinct(Chr)
# A tibble: 25 × 1
   Chr  
   <chr>
 1 chr1 
 2 chr2 
 3 chr3 
 4 chr4 
 5 chr5 
 6 chr6 
 7 chr7 
 8 chr8 
 9 chr9 
10 chr10
# … with 15 more rows
gene_loc %>%
  distinct(Strand)
# A tibble: 2 × 1
  Strand
  <chr> 
1 +     
2 -     
# Pick unique combinations
gene_loc %>%
  distinct(Chr, Strand)
# A tibble: 50 × 2
   Chr   Strand
   <chr> <chr> 
 1 chr1  +     
 2 chr1  -     
 3 chr2  -     
 4 chr2  +     
 5 chr3  +     
 6 chr3  -     
 7 chr4  -     
 8 chr4  +     
 9 chr5  +     
10 chr5  -     
# … with 40 more rows
# Then sort aka arrange your data
gene_loc %>%
  arrange(desc(Chr))    # sort in descending order
# A tibble: 57,820 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSGR0000228572 chrY  120410 120513 +         259
 2 ENSGR0000182378 chrY  142989 143061 +        6191
 3 ENSGR0000178605 chrY  170025 170137 -        2363
 4 ENSGR0000226179 chrY  231725 231983 +         428
 5 ENSGR0000167393 chrY  244698 249631 -        9015
 6 ENSGR0000266731 chrY  375316 375416 -         101
 7 ENSGR0000234958 chrY  434510 434837 -         328
 8 ENSGR0000229232 chrY  455971 456087 -         117
 9 ENSGR0000185960 chrY  535079 535337 +        4384
10 ENSGR0000237531 chrY  900956 901162 +         588
# … with 57,810 more rows
gene_loc %>%
  arrange(Chr, Length)  # sort by Chr, then Length
# A tibble: 57,820 × 6
   Geneid          Chr       Start       End Strand Length
   <chr>           <chr>     <dbl>     <dbl> <chr>   <dbl>
 1 ENSG00000268141 chr1  154585067 154585080 +          21
 2 ENSG00000224335 chr1  147706573 147706607 +          35
 3 ENSG00000263526 chr1   95211416  95211456 -          41
 4 ENSG00000230610 chr1   38292232  38292262 -          45
 5 ENSG00000252056 chr1  236026688 236026737 -          50
 6 ENSG00000238705 chr1   26881033  26881084 +          52
 7 ENSG00000264650 chr1   45499641  45499692 -          52
 8 ENSG00000265769 chr1   52366592  52366643 -          52
 9 ENSG00000251795 chr1   93303575  93303627 +          53
10 ENSG00000238310 chr1  109750416 109750470 -          55
# … with 57,810 more rows
# arrange(Chr, -Length) # to reverse sort by 'numeric' Length

Group_by & Summarize

# Combine by a variable, then calculate summary statistics for each group
gene_loc %>%
  group_by(Chr) %>%                         # combine rows by Chr
  summarize(numGenes = n(),                 # then summarise, number of genes/Chr
            startoffirstGene = min(Start))  # min to get the first Start location
# A tibble: 25 × 3
   Chr   numGenes startoffirstGene
   <chr>    <int>            <dbl>
 1 chr1      5363            11869
 2 chr10     2260            90652
 3 chr11     3208            75780
 4 chr12     2818            67607
 5 chr13     1217         19041312
 6 chr14     2244         19110203
 7 chr15     2080         20083769
 8 chr16     2343            61553
 9 chr17     2903             4961
10 chr18     1127            11103
# … with 15 more rows
# Example to show you can use all math/stat functions to summarize data groups
gene_loc %>%
  arrange(Length) %>%
  group_by(Chr, Strand) %>%
  summarize(numGenes = n(),
            smallestGene = first(Geneid),
            minLength = min(Length),
            firstqLength = quantile(Length, 0.25),
            medianLength = median(Length),
            iqrLength = IQR(Length),
            thirdqLength = quantile(Length, 0.75),
            maxLength = max(Length),
            longestGene = last(Geneid))
`summarise()` has grouped output by 'Chr'. You can override
using the `.groups` argument.
# A tibble: 50 × 11
# Groups:   Chr [25]
   Chr   Strand numGenes smallestGene minLength firstqLength
   <chr> <chr>     <int> <chr>            <dbl>        <dbl>
 1 chr1  -          2651 ENSG0000026…        41         354.
 2 chr1  +          2712 ENSG0000026…        21         394.
 3 chr10 -          1109 ENSG0000026…        52         322 
 4 chr10 +          1151 ENSG0000027…        27         356 
 5 chr11 -          1605 ENSG0000026…        30         477 
 6 chr11 +          1603 ENSG0000025…        37         476 
 7 chr12 -          1415 ENSG0000025…        28         375 
 8 chr12 +          1403 ENSG0000026…        30         402 
 9 chr13 -           631 ENSG0000021…        48         297 
10 chr13 +           586 ENSG0000026…        57         371.
# … with 40 more rows, and 5 more variables:
#   medianLength <dbl>, iqrLength <dbl>,
#   thirdqLength <dbl>, maxLength <dbl>, longestGene <chr>
# Renaming chromosomes & declaring them as factors w/ an intrinsic order
# gene_loc$Chr <- factor(gene_loc$Chr,
#                        levels = paste("chr",
#                                       c((1:22), "X", "Y", "M"),
#                                       sep=""))

# Saving your data locally
gene_loc %>%
  write_tsv(here("transcriptomics/GSE69360.gene-locations.txt"))

More data wrangling

Let’s combine everything from above to tidy the full GSE69360 dataset.

Dataset details:

# Extracting just the expression values & cleaning it up
View(head(gse69360, 50))

gene_counts <- gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length) %>%          # another way to select just the expression data
  rename(OA_Stomach = OA_Stomach1) %>%                      # rename couple of columns
  mutate(OA_Stomach2 = NULL, OA_Stomach3 = NULL) %>%        # remove a couple of columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid))           # cleanup data a specific column

logcpm <- gene_counts %>%
  select(-Geneid) %>%
  mutate_all(function(x) { log2((x*(1e+6)/sum(x)) + 1) } )  # convert counts in each sample to counts-per-million

summary(logcpm)
    AA_Colon           AA_Heart         AA_Kidney      
 Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.06883   Median : 0.0000   Median : 0.0000  
 Mean   : 1.10187   Mean   : 0.8103   Mean   : 0.8785  
 3rd Qu.: 1.33045   3rd Qu.: 0.5151   3rd Qu.: 0.7181  
 Max.   :18.41311   Max.   :18.7478   Max.   :18.8061  
    AA_Liver           AA_Lung         AA_Stomach      
 Min.   : 0.00000   Min.   : 0.000   Min.   : 0.00000  
 1st Qu.: 0.00000   1st Qu.: 0.000   1st Qu.: 0.00000  
 Median : 0.05563   Median : 0.000   Median : 0.06964  
 Mean   : 0.98788   Mean   : 1.049   Mean   : 0.93120  
 3rd Qu.: 0.95867   3rd Qu.: 1.123   3rd Qu.: 0.91851  
 Max.   :17.85244   Max.   :17.722   Max.   :17.85986  
    AF_Colon         AF_Stomach         BA_Colon       
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000  
 Median : 0.1995   Median : 0.0000   Median : 0.08114  
 Mean   : 1.2903   Mean   : 0.7748   Mean   : 0.98586  
 3rd Qu.: 1.8951   3rd Qu.: 0.8105   3rd Qu.: 1.10906  
 Max.   :17.4620   Max.   :18.3586   Max.   :18.75742  
    BA_Heart          BA_Kidney           BA_Liver      
 Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
 1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0000  
 Median : 0.04537   Median : 0.05856   Median : 0.0000  
 Mean   : 0.80833   Mean   : 0.96659   Mean   : 0.8194  
 3rd Qu.: 0.68444   3rd Qu.: 1.11092   3rd Qu.: 0.5974  
 Max.   :18.92846   Max.   :18.87617   Max.   :17.7612  
    BA_Lung          BA_Stomach         BF_Colon      
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.1643   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.4456   Median : 0.0000   Median : 0.1793  
 Mean   : 1.3873   Mean   : 0.7528   Mean   : 1.3716  
 3rd Qu.: 1.9392   3rd Qu.: 0.6261   3rd Qu.: 1.9775  
 Max.   :17.5495   Max.   :18.3249   Max.   :15.8531  
   BF_Stomach        OA_Stomach     
 Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.3085   Median : 0.1003  
 Mean   : 1.5164   Mean   : 1.0800  
 3rd Qu.: 2.4221   3rd Qu.: 1.3692  
 Max.   :15.7256   Max.   :17.9600  
gene_logcpm <- gene_counts %>%                              # a new dataframe with logcpm values
  select(Geneid) %>%
  bind_cols(logcpm) %>%                                     # bind the logcpm matrix to the geneids
  pivot_longer(-Geneid, names_to="Sample", values_to="Logcpm") %>%     # convert to tidy data
  separate(Sample,                                          # cleanup complex variables
           into = c("Source", "Stage", "Tissue"),
           sep = c(1,2),
           remove = F) %>%                                  # keep original variable
  mutate(Tissue = gsub("^_", "", Tissue),
         Stage = ifelse(Stage == "A", "Adult", "Fetus"))

View(head(gene_logcpm, 50))

# Plotting the distribution of gene-expression in each sample
gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.shape = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()
Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).


Part 4: Visualizing tidy data w/ ggplot

Basics of ggplot2

Creating a plot w/ Grammar of Graphics

  1. Recap and continuation of dplyr
  2. Basics of plotting data with ggplot2: data, aes, geom
  3. Customization: Colors, labels, and legends

Barplots & Histograms

gene_loc %>%                                                      # data
  ggplot(aes(x = Chr)) +                                          # aesthetics: what to plot?
  geom_bar()                                                      # geometry: how to plot?
gene_loc$Chr <- factor(gene_loc$Chr,
                       levels = paste("chr",
                                      c((1:22), "X", "Y", "M"),
                                      sep=""))

plot_chr_numgenes <- gene_loc %>%
  ggplot(aes(x = Chr)) +
  geom_bar()
plot_chr_numgenes
plot_chr_numgenes +
  coord_flip() +
  theme_minimal()
plot_chr_numgenes +
  labs(title = "No. genes per chromosome",
       x = "Chromosome",
       y = "No. of genes") +
  theme_minimal() +
  coord_flip()
gene_loc %>%
  ggplot(aes(x = Length)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
plot_chr_genelength <- gene_loc %>%
  ggplot(aes(x = Length, fill = Chr)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal() +
  facet_wrap(~Chr, scales = "free_y")
plot_chr_genelength
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
plot_chr_genelength +
  theme(legend.position = "none") +
  labs(x = "Gene length (log-scale)",
       y = "No. of genes")
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

Scatter plots

gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point()
plot_strend_length <- gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point(alpha = 0.1, size = 0.5, color = "grey", fill = "grey")
plot_strend_length
plot_strend_length <- plot_strend_length +
  scale_x_log10("End-Start") +
  scale_y_log10("Gene length") +
  theme_minimal()
plot_strend_length
Warning: Transformation introduced infinite values in
continuous x-axis
plot_strend_length +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  geom_hline(yintercept = 500, color = "blue") +
  geom_vline(xintercept = 1000, color = "orange")
Warning: Transformation introduced infinite values in
continuous x-axis
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n())
# A tibble: 25 × 3
   Chr   meanLength numGenes
   <fct>      <dbl>    <int>
 1 chr1       2258.     5363
 2 chr2       2304.     4047
 3 chr3       2382.     3101
 4 chr4       2109.     2563
 5 chr5       2188.     2859
 6 chr6       2124.     2905
 7 chr7       2164.     2876
 8 chr8       2036.     2386
 9 chr9       2123.     2323
10 chr10      2160.     2260
# … with 15 more rows
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point()
# install.packages("ggrepel", dependencies=T)
library(ggrepel)
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point() +
  geom_smooth(color = "lightblue", alpha = 0.1) +
  labs(x = "No. of genes", y = "Mean gene length") +
  geom_text_repel(aes(label = Chr), color="red", segment.color="grey80") +
  theme_minimal()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Boxplots & Violin plots

gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal()
gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_violin() +
  scale_y_continuous(limits = c(0, 0.5)) +
  coord_flip() +
  theme_minimal()
Warning: Removed 320317 rows containing non-finite values
(stat_ydensity).
# Plotting the distribution of gene-expression in each sample
plot_sample_bxp <- gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()
plot_sample_bxp
Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).
# Plotting scatterplot of 2 sets of samples
plot_ffcolon_scatter <- gene_logcpm %>%
  filter(Sample == "AF_Colon" | Sample == "BF_Colon") %>%
  select(Geneid, Sample, Logcpm) %>%
  pivot_wider(names_from = Sample, values_from = Logcpm) %>%
  ggplot(aes(x = AF_Colon, y = BF_Colon)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method=lm) +
  theme_minimal()
plot_ffcolon_scatter
`geom_smooth()` using formula 'y ~ x'

Some data sleuthing

# Finding genes with high variance across samples
num_totgenes <- gene_logcpm %>%
  distinct(Geneid) %>%
  nrow()

highvar_genes <- gene_logcpm %>%
  group_by(Geneid) %>%
  summarize(iqr = IQR(Logcpm)) %>%
  top_n((ceiling(num_totgenes*0.05)), iqr) %>%
  pull(Geneid)

length(highvar_genes)
[1] 2891
# Plotting expression of high-var Y chr genes across samples
chry_highvar_genes <- gene_loc %>%
  filter(Chr == "chrY" & Geneid %in% highvar_genes) %>%
  pull(Geneid)

gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes)
# A tibble: 238 × 6
   Geneid          Sample     Source Stage Tissue  Logcpm
   <chr>           <chr>      <chr>  <chr> <chr>    <dbl>
 1 ENSG00000129824 AA_Colon   A      Adult Colon   6.12  
 2 ENSG00000129824 AA_Heart   A      Adult Heart   0     
 3 ENSG00000129824 AA_Kidney  A      Adult Kidney  0.106 
 4 ENSG00000129824 AA_Liver   A      Adult Liver   1.75  
 5 ENSG00000129824 AA_Lung    A      Adult Lung    0     
 6 ENSG00000129824 AA_Stomach A      Adult Stomach 0     
 7 ENSG00000129824 AF_Colon   A      Fetus Colon   0.0352
 8 ENSG00000129824 AF_Stomach A      Fetus Stomach 4.51  
 9 ENSG00000129824 BA_Colon   B      Adult Colon   6.05  
10 ENSG00000129824 BA_Heart   B      Adult Heart   4.81  
# … with 228 more rows
plot_chry_highvar_boxplot <- gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes) %>%
  ggplot(aes(x = reorder(Sample, Logcpm, FUN = median),
             y = Logcpm,
             color = Sample)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")
plot_chry_highvar_boxplot


Part 5: Export & Wrap-up w/ RMarkdown

Saving your plots

ggsave

Save a ggplot (or other grid object) with sensible defaults

library(tidyverse)
# Save your file name
plot1 <- "chr_highvar.png"

# Save your absolute/relative path
my_full_path <- here("transcriptomics")

# To save as a tab-delimited text file ...
ggsave(filename=plot1,
       plot=plot_chry_highvar_boxplot,
       device="png",
       path=my_full_path,
       dpi=600)
Saving 6.5 x 4 in image

Saving your data files

write_delim

Write a dataframe to a delimited file

library(tidyverse)
# Save your file name
filename <- "my_new_data.txt"

# Save your absolute/relative path
my_full_path <- here("data")

# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
          path=paste(my_full_path, filename, "/"), # Absolute path recommended.
          # However, you can directly use 'filename' here
          # if you are saving the file in the same directory
          # as your code.
          col_names=T) # if you want the column names to be
# saved in the first row, recommended

# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
          path=my_path,
          col_names=T)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
            path=my_path,
            col_names=T,
            delim="---")

What you learnt today!

Option Description
Part 1 Getting Started
install.packages Download and install packages from CRAN-like repositories or from local files
library Library and require load and attach add-on packages
package::function To run a function w/o loading the package
Import tidyverse > readr & readxl
read_delim Read a delimited file (incl csv, tsv) into a tibble
read_csv, read_tsv read_csv() and read_tsv() are special cases of the general read_delim()
read_excel Read xls and xlsx files
here Part of here package. To set paths relative to your current project directory
Data snapshot
str Compactly Display the Structure of an Arbitrary R Object
head Return the First or Last Part of an Object
glimpse Get a glimpse of your data
View Invoke a Data Viewer (pop-up, requires XQuartz/X11)
kable Create tables in LaTeX, HTML, Markdown and reStructuredText
paged_table Create a table in HTML with support for paging rows and columns
Part 2 tidyverse > tidyr
pivot_longer Gather Columns Into Key-Value Pairs (COLS -> ROWS)
pivot_wider Spread a key-value pair across multiple columns
separate Separate one column into multiple column
unite Unite multiple columns into one
Part 3 tidyverse > dplyr
filter Return rows with matching conditions
select Select/rename variables by name
mutate Add new variables
transmute Add new variables & drops existing variables
distinct Get unique entries
arrange Arrange rows by variables
summarise Reduces multiple values down to a single value
group_by Group by one or more variables
join Join two dataframes together: left_join, right_join, inner_join
bind Efficiently bind multiple dataframes by row and column: bind_rows, bind_cols
setops Set operations: intersect, union, setdiff, setequal
Part 4 tidyverse > ggplot
ggplot Create a new ggplot
Aesthetics
aes Specify details on what to plot
factor Specify a variable to be ordered & categorical
facet_wrap Split into multiple sub-plots based on a categorical variable
Geometries
geom_bar Create a barplot
geom_histogram Create a histogram
geom_point Create a scatter plot
geom_boxplot Create a boxplot
geom_violin Create a violin plot
geom_abline, geom_hline, geom_vline Add slanted, horizontal, or vertical lines
geom_smooth Add a smooth curve that fits the data points
Plot customization
theme, theme_minimal, theme_bw Adjust themes to minimal, black/white, etc
scale_x_log10 Change x/y axes to log scale
scale_y_continuous Change x/y axes to continuous & set limits
coord_flip Flip x & y coordinates
labs Specify axes labels and plot titles
Part 5 Export & Wrap-up
Export tidyverse > readr
ggsave Save a ggplot (or other grid object) with sensible defaults
write_delim Write a dataframe to a delimited file
write_tsv write_delim customized for tab-separated values
write_csv write_delim customized for comma-separated values

Credits

Arjun Krishnan and I co-developed the content for this workshop.

Acknowledgements

Contact

Additional resources

Some awesome open-source books

Citation

For attribution, please cite this work as

Ravi & Krishnan (2022, May 28). Introduction to Tidyverse | using transcriptomic data. Retrieved from https://github.com/jananiravi/workshop-tidyverse

BibTeX citation

@misc{ravi2022introduction,
  author = {Ravi, Janani and Krishnan, Arjun},
  title = {Introduction to Tidyverse | using transcriptomic data},
  url = {https://github.com/jananiravi/workshop-tidyverse},
  year = {2022}
}