Introduction to Tidyverse | using transcriptomic data

Janani Ravi; Arjun Krishnan

About me

I am a computational biologist, an Asst. Professor in the departments of Pathobiology & Diagnostic Investigation and Microbiology & Molecular Genetics at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.

I’m also the founder & co-organizer of the R-Ladies East Lansing, Women+ Data Science, and AsiaR. We conduct R and data science related meetups & workshops regularly! So, do check out our upcoming events on Meetup.

Part 1: Getting Started w/ `readr`

You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.

Installation and set-up

Install RStudio

Running RStudio locally? Download RStudio

Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version

Trouble with local installation? Login & start using RStudio Cloud right away!

New to RStudio IDE? Use Help Page #1 & Page #2

Install R

… if you haven’t already! The RStudio startup message should specify your current local version of R. For e.g., R v4.1.2

Install packages

Install: once
Load: each time

install.packages("tidyverse")   # for data wrangling
install.packages("here")        # to set paths relative to your current project.
# install.packages("gapminder") # sample dataset

Explore the RLEL workshops for more examples and sample codes for Tidy Data and DataViz [e.g., >> presentations/20181105-workshop-tidydata uses the gapminder & USArrests datasets].

Trouble installing tidyverse?

Check your R/RStudio versions
For the time-being install the individual packages using install.packages("PACKAGENAME")
Check out the whole tidyverse suite of packages here

# If tidyverse installation fails, install individual constituent packages this way...
install.packages("readr")    # Importing data files
install.packages("tidyr")    # Tidy Data
install.packages("dplyr")    # Data manipulation
install.packages("ggplot2")  # Data Visualization (w/ Grammar of Graphics)
install.packages("readxl")   # Importing excel files

Loading packages

library(tidyverse)

── Attaching packages ─────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

# OR load the individual packages, if you have trouble installing/loading `tidyverse`
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(here) # https://github.com/jennybc/here_here

here() starts at /Users/jananiravi/GitHub/workshop-tidyverse

# library(gapminder) # useful dataset for data wrangling, visualization

Some useful cheatsheets

Cheatsheets @RStudio | Our cheatsheets repo

More resources towards the end of the document.

Data import

read_csv, write_csv
read_tsv, write_tsv
read_delim, write_delim
here::here

here::here() # where you opened your RStudio session/Project

[1] "/Users/jananiravi/GitHub/workshop-tidyverse"

# Downloading sample transcriptomics dataset from NCBI's FTP site
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE69nnn/GSE69360/suppl/GSE69360_RNAseq.counts.txt.gz"
gse69360 <- read_tsv(url) # to read tab-delimitted files

Rows: 57820 Columns: 25
── Column specification ────────────────────────────────────
Delimiter: "\t"
chr  (5): Geneid, Chr, Start, End, Strand
dbl (20): Length, AA_Colon, AA_Heart, AA_Kidney, AA_Live...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

?read_tsv # to check defaults

# Comma-separated values, as exported from excel/spreadsheets
read_csv(file="path/to/my_data.csv", col_names=T)
# Other atypical delimitters
read_delim(file="path/to/my_data.txt", col_names=T, delim="//")

# Other useful packages
# readxl by Jenny Bryan
read_excel(path="path/to/excel.xls",
           sheet=1,
           range="A1:D50",
           col_names=T)

## Tip: Always open .Rproj and use relative paths with here()
## Example with here()
read_tsv(file=here("transcriptomics/GSE69360_RNAseq.counts.txt"), col_names=T)

Knowing your data

Dataset details:

AA: Agilent Adult; AF: Agilent Fetus
BA: BioChain Adult; BF: BioChain Fetus
OA: OriGene Adult
Tissues: Colon, Heart, Kidney, Liver, Lung, Stomach

str(gse69360)              # Structure of the dataframe
gse69360                   # Data is in a cleaned up 'tibble' format by default
head(gse69360)             # Shows the top few observations (rows) of your dataframe
glimpse(gse69360)          # Info-dense summary of the data
View(head(gse69360, 100))  # View data in a visual GUI-based spreadsheet-like format

colnames(gse69360)         # Column names
nrow(gse69360)             # No. of rows
ncol(gse69360)             # No. of columns

gse69360[1:5,7:10]         # Subsetting a dataframe

## saving the data file
write_tsv(gse69360[1:100,7:12], "gse_subset.txt")

library(knitr)
kable(head(gse69360))

Geneid	Chr	Start	End	Strand	Length	AA_Colon	AA_Heart	AA_Kidney	AA_Liver	AA_Lung	AA_Stomach	AF_Colon	AF_Stomach	BA_Colon	BA_Heart	BA_Kidney	BA_Liver	BA_Lung	BA_Stomach	BF_Colon	BF_Stomach	OA_Stomach1	OA_Stomach2	OA_Stomach3
ENSG00000223972.4	chr1;chr1;chr1;chr1	11869;12595;12975;13221	12227;12721;13052;14412	+;+;+;+	1756	1	0	0	0	0	0	7	2	0	0	0	0	7	0	0	3	3	1	0
ENSG00000227232.4	chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1	14363;14970;15796;16607;16854;17233;17498;17602;17915;18268;24734;29321;29534	14829;15038;15947;16765;17055;17368;17504;17742;18061;18379;24891;29370;29806	-;-;-;-;-;-;-;-;-;-;-;-;-	2073	36	21	32	28	17	17	102	41	10	24	22	6	49	2	153	157	38	53	18
ENSG00000243485.2	chr1;chr1;chr1	29554;30267;30976	30039;30667;31109	+;+;+	1021	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000237613.2	chr1;chr1;chr1	34554;35245;35721	35174;35481;36081	-;-;-	1219	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000268020.2	chr1;chr1	52473;54830	53312;54936	+;+	947	0	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0	0
ENSG00000240361.1	chr1	62948	63887	+	940	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

library(rmarkdown)
paged_table(gse69360)

Part 2: Reshaping data w/ `tidyr`

## legacy tidyverse functions
gather()   # Gather COLUMNS -> ROWS
spread()   # Spread ROWS -> COLUMNS

## newer tidyverse functions
pivot_longer()  # Pivot data from wide to long
pivot_wider()   # Pivot data from long to wide

separate() # Separate 1 COLUMN -> many COLUMNS
unite()    # Unite several COLUMNS -> 1 COLUMN

Gather, Spread

gather: Gather columns into key-value pairs. Wide -> Long
spread: Spread a key-value pair across multiple columns: Long -> Wide

# Gather all columns except 'Geneid'
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  # gather(-Geneid, key="Sample", value="Counts") # wide -> long format  ## legacy
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts")

# A tibble: 1,098,580 × 3
   Geneid            Sample     Counts
   <chr>             <chr>       <dbl>
 1 ENSG00000223972.4 AA_Colon        1
 2 ENSG00000223972.4 AA_Heart        0
 3 ENSG00000223972.4 AA_Kidney       0
 4 ENSG00000223972.4 AA_Liver        0
 5 ENSG00000223972.4 AA_Lung         0
 6 ENSG00000223972.4 AA_Stomach      0
 7 ENSG00000223972.4 AF_Colon        7
 8 ENSG00000223972.4 AF_Stomach      2
 9 ENSG00000223972.4 BA_Colon        0
10 ENSG00000223972.4 BA_Heart        0
# … with 1,098,570 more rows

# Gather, then Spread --> Back to original data
gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  # gather(-Geneid, key="Sample", value="Counts") %>%
  # spread(key = "Sample", value = "Counts")     # spread to turn back to the original data!
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  pivot_wider(names_from="Sample", values_from="Counts")

# A tibble: 57,820 × 20
   Geneid       AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
   <chr>           <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 ENSG0000022…        1        0         0        0       0
 2 ENSG0000022…       36       21        32       28      17
 3 ENSG0000024…        0        0         0        0       0
 4 ENSG0000023…        0        0         0        0       0
 5 ENSG0000026…        0        0         0        0       0
 6 ENSG0000024…        0        0         0        0       0
 7 ENSG0000018…        0        0         0        0       0
 8 ENSG0000023…        3        0         1        2       1
 9 ENSG0000023…        0        0         0        0       0
10 ENSG0000023…        0        0         0        0       0
# … with 57,810 more rows, and 14 more variables:
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

Unite, Separate

unite: Unite multiple columns into one
separate: Separate one column into multiple columns

gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%               # selecting only Counts columns
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>% # wide -> long
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") # separate logically

# A tibble: 1,098,580 × 4
   Geneid            Source_Stage Tissue  Counts
   <chr>             <chr>        <chr>    <dbl>
 1 ENSG00000223972.4 AA           Colon        1
 2 ENSG00000223972.4 AA           Heart        0
 3 ENSG00000223972.4 AA           Kidney       0
 4 ENSG00000223972.4 AA           Liver        0
 5 ENSG00000223972.4 AA           Lung         0
 6 ENSG00000223972.4 AA           Stomach      0
 7 ENSG00000223972.4 AF           Colon        7
 8 ENSG00000223972.4 AF           Stomach      2
 9 ENSG00000223972.4 BA           Colon        0
10 ENSG00000223972.4 BA           Heart        0
# … with 1,098,570 more rows

gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) # separate by char position

# A tibble: 1,098,580 × 5
   Geneid            Source Stage Tissue  Counts
   <chr>             <chr>  <chr> <chr>    <dbl>
 1 ENSG00000223972.4 A      A     Colon        1
 2 ENSG00000223972.4 A      A     Heart        0
 3 ENSG00000223972.4 A      A     Kidney       0
 4 ENSG00000223972.4 A      A     Liver        0
 5 ENSG00000223972.4 A      A     Lung         0
 6 ENSG00000223972.4 A      A     Stomach      0
 7 ENSG00000223972.4 A      F     Colon        7
 8 ENSG00000223972.4 A      F     Stomach      2
 9 ENSG00000223972.4 B      A     Colon        0
10 ENSG00000223972.4 B      A     Heart        0
# … with 1,098,570 more rows

gse69360 %>%
  select(Geneid, matches("[AF]_")) %>%
  pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
  separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
  separate(Source_Stage, into=c("Source", "Stage"), sep=1) %>%
  unite(Stage_Tissue, Stage, Tissue) # combining a different set of columns

# A tibble: 1,098,580 × 4
   Geneid            Source Stage_Tissue Counts
   <chr>             <chr>  <chr>         <dbl>
 1 ENSG00000223972.4 A      A_Colon           1
 2 ENSG00000223972.4 A      A_Heart           0
 3 ENSG00000223972.4 A      A_Kidney          0
 4 ENSG00000223972.4 A      A_Liver           0
 5 ENSG00000223972.4 A      A_Lung            0
 6 ENSG00000223972.4 A      A_Stomach         0
 7 ENSG00000223972.4 A      F_Colon           7
 8 ENSG00000223972.4 A      F_Stomach         2
 9 ENSG00000223972.4 B      A_Colon           0
10 ENSG00000223972.4 B      A_Heart           0
# … with 1,098,570 more rows

Part 3: Data wranging with `dplyr`

conflicted::conflict_prefer(name="filter", winner="dplyr")
filter()    # PICK observations by their values | ROWS
select()    # PICK variables by their names | COLUMNS
mutate()    # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange()   # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by()  # GROUP data into rows with the same value of variable (COLUMN)

Filter

filter: Return rows with matching conditions

head(gse69360) # Snapshot of the dataframe

# A tibble: 6 × 25
  Geneid   Chr   Start End   Strand Length AA_Colon AA_Heart
  <chr>    <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
1 ENSG000… chr1… 1186… 1222… +;+;+…   1756        1        0
2 ENSG000… chr1… 1436… 1482… -;-;-…   2073       36       21
3 ENSG000… chr1… 2955… 3003… +;+;+    1021        0        0
4 ENSG000… chr1… 3455… 3517… -;-;-    1219        0        0
5 ENSG000… chr1… 5247… 5331… +;+       947        0        0
6 ENSG000… chr1  62948 63887 +         940        0        0
# … with 17 more variables: AA_Kidney <dbl>,
#   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
#   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
#   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
#   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
#   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
#   OA_Stomach3 <dbl>

# Now, filter by condition
filter(gse69360, Length<=50)

# A tibble: 124 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSG00… chr1… 3829… 3829… -;-        45        1        0
 2 ENSG00… chr1  9521… 9521… -          41        7        1
 3 ENSG00… chr1  1477… 1477… +          35        0        1
 4 ENSG00… chr1… 1545… 1545… +;+        21        0        0
 5 ENSG00… chr1  2360… 2360… -          50        1        2
 6 ENSG00… chr2… 1160… 1160… +;+        36        1        0
 7 ENSG00… chr2  8916… 8916… -          38      183        1
 8 ENSG00… chr2  8916… 8916… -          37       98        2
 9 ENSG00… chr2  8916… 8916… -          38       80        0
10 ENSG00… chr2  8916… 8916… -          38        0        0
# … with 114 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

# Can be rewritten using "Piping" %>%
gse69360 %>%   # Pipe ('then') operator to serially connect operations
  filter(Length <= 50)

# A tibble: 124 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSG00… chr1… 3829… 3829… -;-        45        1        0
 2 ENSG00… chr1  9521… 9521… -          41        7        1
 3 ENSG00… chr1  1477… 1477… +          35        0        1
 4 ENSG00… chr1… 1545… 1545… +;+        21        0        0
 5 ENSG00… chr1  2360… 2360… -          50        1        2
 6 ENSG00… chr2… 1160… 1160… +;+        36        1        0
 7 ENSG00… chr2  8916… 8916… -          38      183        1
 8 ENSG00… chr2  8916… 8916… -          37       98        2
 9 ENSG00… chr2  8916… 8916… -          38       80        0
10 ENSG00… chr2  8916… 8916… -          38        0        0
# … with 114 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

# Filtering using regex/substring match
gse69360 %>%
  filter(grepl("chrY", Chr))

# A tibble: 542 × 25
   Geneid  Chr   Start End   Strand Length AA_Colon AA_Heart
   <chr>   <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
 1 ENSGR0… chrY… 1204… 1205… +;+       259        0        0
 2 ENSGR0… chrY… 1429… 1430… +;+;+…   6191        0        0
 3 ENSGR0… chrY… 1700… 1701… -;-;-…   2363        0        0
 4 ENSGR0… chrY… 2317… 2319… +;+       428        0        0
 5 ENSGR0… chrY… 2446… 2496… -;-;-…   9015        0        0
 6 ENSGR0… chrY  3753… 3754… -         101        0        0
 7 ENSGR0… chrY  4345… 4348… -         328        0        0
 8 ENSGR0… chrY  4559… 4560… -         117        0        0
 9 ENSGR0… chrY… 5350… 5353… +;+;+…   4384        0        0
10 ENSGR0… chrY… 9009… 9011… +;+       588        0        0
# … with 532 more rows, and 17 more variables:
#   AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

# Two filters at a time
gse69360 %>%
  filter(Length <= 50 & grepl("chrY", Chr))

# A tibble: 1 × 25
  Geneid   Chr   Start End   Strand Length AA_Colon AA_Heart
  <chr>    <chr> <chr> <chr> <chr>   <dbl>    <dbl>    <dbl>
1 ENSG000… chrY… 5306… 5306… -;-        42        1        0
# … with 17 more variables: AA_Kidney <dbl>,
#   AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
#   AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
#   BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
#   BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
#   BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
#   OA_Stomach3 <dbl>

Select

select: Select/rename variables/columns by name

# Selecting columns that match a pattern
gse69360 %>%
  select(Geneid, matches(".F_"))

# A tibble: 57,820 × 5
   Geneid            AF_Colon AF_Stomach BF_Colon BF_Stomach
   <chr>                <dbl>      <dbl>    <dbl>      <dbl>
 1 ENSG00000223972.4        7          2        0          3
 2 ENSG00000227232.4      102         41      153        157
 3 ENSG00000243485.2        0          0        0          0
 4 ENSG00000237613.2        0          0        0          0
 5 ENSG00000268020.2        0          0        0          0
 6 ENSG00000240361.1        0          0        0          0
 7 ENSG00000186092.4        0          0        0          0
 8 ENSG00000238009.2        4          0        5          4
 9 ENSG00000239945.1        0          0        0          0
10 ENSG00000233750.3        2          3        3          1
# … with 57,810 more rows

# Excluding specific columns
gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length)

# A tibble: 57,820 × 20
   Geneid       AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
   <chr>           <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 ENSG0000022…        1        0         0        0       0
 2 ENSG0000022…       36       21        32       28      17
 3 ENSG0000024…        0        0         0        0       0
 4 ENSG0000023…        0        0         0        0       0
 5 ENSG0000026…        0        0         0        0       0
 6 ENSG0000024…        0        0         0        0       0
 7 ENSG0000018…        0        0         0        0       0
 8 ENSG0000023…        3        0         1        2       1
 9 ENSG0000023…        0        0         0        0       0
10 ENSG0000023…        0        0         0        0       0
# … with 57,810 more rows, and 14 more variables:
#   AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
#   BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
#   BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
#   BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
#   OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

# Excluding columns matching a pattern
gse69360 %>%
  select(-matches("[AF]_"))

# A tibble: 57,820 × 6
   Geneid            Chr           Start End   Strand Length
   <chr>             <chr>         <chr> <chr> <chr>   <dbl>
 1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+…   1756
 2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-…   2073
 3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+    1021
 4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;-    1219
 5 ENSG00000268020.2 chr1;chr1     5247… 5331… +;+       947
 6 ENSG00000240361.1 chr1          62948 63887 +         940
 7 ENSG00000186092.4 chr1          69091 70008 +         918
 8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-…   3569
 9 ENSG00000239945.1 chr1;chr1     8955… 9005… -;-      1319
10 ENSG00000233750.3 chr1          1310… 1348… +        3812
# … with 57,810 more rows

# Select then Filter
gse69360 %>%
  select(Geneid, Chr, Length, matches("[AF]_")) %>%
  filter(grepl("chrY", Chr) | Length <= 100)

# A tibble: 4,671 × 22
   Geneid  Chr   Length AA_Colon AA_Heart AA_Kidney AA_Liver
   <chr>   <chr>  <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
 1 ENSG00… chr1…     90        0        0         0        0
 2 ENSG00… chr1…     57        0        0         0        0
 3 ENSG00… chr1      95        5        0         2        0
 4 ENSG00… chr1      90        0        0         0        0
 5 ENSG00… chr1      83        7        0         1        2
 6 ENSG00… chr1      96        0        0         0        0
 7 ENSG00… chr1      70        0        0         0        0
 8 ENSG00… chr1      73        0        0         0        0
 9 ENSG00… chr1      89        0        0         0        0
10 ENSG00… chr1      70        0        0         0        0
# … with 4,661 more rows, and 15 more variables:
#   AA_Lung <dbl>, AA_Stomach <dbl>, AF_Colon <dbl>,
#   AF_Stomach <dbl>, BA_Colon <dbl>, BA_Heart <dbl>,
#   BA_Kidney <dbl>, BA_Liver <dbl>, BA_Lung <dbl>,
#   BA_Stomach <dbl>, BF_Colon <dbl>, BF_Stomach <dbl>,
#   OA_Stomach1 <dbl>, OA_Stomach2 <dbl>, OA_Stomach3 <dbl>

Mutate

mutate: Adds new variables; keeps existing variables
transmute: Adds new variables; drops existing variables

# Excluding columns matching a condition
gse69360 %>%
  select(-matches("[AF]_")) %>%
  head(., 10) #%>% View()

# A tibble: 10 × 6
   Geneid            Chr           Start End   Strand Length
   <chr>             <chr>         <chr> <chr> <chr>   <dbl>
 1 ENSG00000223972.4 chr1;chr1;ch… 1186… 1222… +;+;+…   1756
 2 ENSG00000227232.4 chr1;chr1;ch… 1436… 1482… -;-;-…   2073
 3 ENSG00000243485.2 chr1;chr1;ch… 2955… 3003… +;+;+    1021
 4 ENSG00000237613.2 chr1;chr1;ch… 3455… 3517… -;-;-    1219
 5 ENSG00000268020.2 chr1;chr1     5247… 5331… +;+       947
 6 ENSG00000240361.1 chr1          62948 63887 +         940
 7 ENSG00000186092.4 chr1          69091 70008 +         918
 8 ENSG00000238009.2 chr1;chr1;ch… 8929… 9162… -;-;-…   3569
 9 ENSG00000239945.1 chr1;chr1     8955… 9005… -;-      1319
10 ENSG00000233750.3 chr1          1310… 1348… +        3812

# Storing gene location information in a separate dataframe
gene_loc <- gse69360 %>%                                            # saving output to a variable
  select(-matches("[AF]_")) %>%                                     # select columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) %>%               # remove isoform no.
  mutate(Chr = gsub(";.*$", "", gse69360$Chr)) %>%                  # keep the first element for Chr
  mutate(Start = as.numeric(gsub(";.*$", "", gse69360$Start))) %>%  # "" for Start
  mutate(End = as.numeric(gsub(";.*$", "", gse69360$End))) %>%      # "" for End
  mutate(Strand = gsub(";.*$", "", gse69360$Strand))                # "" for Strand

# Check to see if you have what you expected!
gene_loc

# A tibble: 57,820 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSG00000223972 chr1   11869  12227 +        1756
 2 ENSG00000227232 chr1   14363  14829 -        2073
 3 ENSG00000243485 chr1   29554  30039 +        1021
 4 ENSG00000237613 chr1   34554  35174 -        1219
 5 ENSG00000268020 chr1   52473  53312 +         947
 6 ENSG00000240361 chr1   62948  63887 +         940
 7 ENSG00000186092 chr1   69091  70008 +         918
 8 ENSG00000238009 chr1   89295  91629 -        3569
 9 ENSG00000239945 chr1   89551  90050 -        1319
10 ENSG00000233750 chr1  131025 134836 +        3812
# … with 57,810 more rows

head(gene_loc, 10)

# A tibble: 10 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSG00000223972 chr1   11869  12227 +        1756
 2 ENSG00000227232 chr1   14363  14829 -        2073
 3 ENSG00000243485 chr1   29554  30039 +        1021
 4 ENSG00000237613 chr1   34554  35174 -        1219
 5 ENSG00000268020 chr1   52473  53312 +         947
 6 ENSG00000240361 chr1   62948  63887 +         940
 7 ENSG00000186092 chr1   69091  70008 +         918
 8 ENSG00000238009 chr1   89295  91629 -        3569
 9 ENSG00000239945 chr1   89551  90050 -        1319
10 ENSG00000233750 chr1  131025 134836 +        3812

# Creating new variables
gene_loc %>%
  mutate(kbStart = Start/1000,     # creates new variables/columns
         kbEnd = End/1000,
         kbLength = Length/1000)

# A tibble: 57,820 × 9
   Geneid    Chr    Start    End Strand Length kbStart kbEnd
   <chr>     <chr>  <dbl>  <dbl> <chr>   <dbl>   <dbl> <dbl>
 1 ENSG0000… chr1   11869  12227 +        1756    11.9  12.2
 2 ENSG0000… chr1   14363  14829 -        2073    14.4  14.8
 3 ENSG0000… chr1   29554  30039 +        1021    29.6  30.0
 4 ENSG0000… chr1   34554  35174 -        1219    34.6  35.2
 5 ENSG0000… chr1   52473  53312 +         947    52.5  53.3
 6 ENSG0000… chr1   62948  63887 +         940    62.9  63.9
 7 ENSG0000… chr1   69091  70008 +         918    69.1  70.0
 8 ENSG0000… chr1   89295  91629 -        3569    89.3  91.6
 9 ENSG0000… chr1   89551  90050 -        1319    89.6  90.0
10 ENSG0000… chr1  131025 134836 +        3812   131.  135. 
# … with 57,810 more rows, and 1 more variable:
#   kbLength <dbl>

# Creating new variables & dropping old ones
gene_loc %>%
  transmute(kbStart = Start/1000,  # drops original columns
            kbEnd = End/1000,
            kbLength = Length/1000)

# A tibble: 57,820 × 3
   kbStart kbEnd kbLength
     <dbl> <dbl>    <dbl>
 1    11.9  12.2    1.76 
 2    14.4  14.8    2.07 
 3    29.6  30.0    1.02 
 4    34.6  35.2    1.22 
 5    52.5  53.3    0.947
 6    62.9  63.9    0.94 
 7    69.1  70.0    0.918
 8    89.3  91.6    3.57 
 9    89.6  90.0    1.32 
10   131.  135.     3.81 
# … with 57,810 more rows

Distinct & Arrange

distinct: Pick unique entries
arrange: Arrange rows by variables

# Pick only the unique entries in a column
gene_loc %>%
  distinct(Chr)

# A tibble: 25 × 1
   Chr  
   <chr>
 1 chr1 
 2 chr2 
 3 chr3 
 4 chr4 
 5 chr5 
 6 chr6 
 7 chr7 
 8 chr8 
 9 chr9 
10 chr10
# … with 15 more rows

gene_loc %>%
  distinct(Strand)

# A tibble: 2 × 1
  Strand
  <chr> 
1 +     
2 -

# Pick unique combinations
gene_loc %>%
  distinct(Chr, Strand)

# A tibble: 50 × 2
   Chr   Strand
   <chr> <chr> 
 1 chr1  +     
 2 chr1  -     
 3 chr2  -     
 4 chr2  +     
 5 chr3  +     
 6 chr3  -     
 7 chr4  -     
 8 chr4  +     
 9 chr5  +     
10 chr5  -     
# … with 40 more rows

# Then sort aka arrange your data
gene_loc %>%
  arrange(desc(Chr))    # sort in descending order

# A tibble: 57,820 × 6
   Geneid          Chr    Start    End Strand Length
   <chr>           <chr>  <dbl>  <dbl> <chr>   <dbl>
 1 ENSGR0000228572 chrY  120410 120513 +         259
 2 ENSGR0000182378 chrY  142989 143061 +        6191
 3 ENSGR0000178605 chrY  170025 170137 -        2363
 4 ENSGR0000226179 chrY  231725 231983 +         428
 5 ENSGR0000167393 chrY  244698 249631 -        9015
 6 ENSGR0000266731 chrY  375316 375416 -         101
 7 ENSGR0000234958 chrY  434510 434837 -         328
 8 ENSGR0000229232 chrY  455971 456087 -         117
 9 ENSGR0000185960 chrY  535079 535337 +        4384
10 ENSGR0000237531 chrY  900956 901162 +         588
# … with 57,810 more rows

gene_loc %>%
  arrange(Chr, Length)  # sort by Chr, then Length

# A tibble: 57,820 × 6
   Geneid          Chr       Start       End Strand Length
   <chr>           <chr>     <dbl>     <dbl> <chr>   <dbl>
 1 ENSG00000268141 chr1  154585067 154585080 +          21
 2 ENSG00000224335 chr1  147706573 147706607 +          35
 3 ENSG00000263526 chr1   95211416  95211456 -          41
 4 ENSG00000230610 chr1   38292232  38292262 -          45
 5 ENSG00000252056 chr1  236026688 236026737 -          50
 6 ENSG00000238705 chr1   26881033  26881084 +          52
 7 ENSG00000264650 chr1   45499641  45499692 -          52
 8 ENSG00000265769 chr1   52366592  52366643 -          52
 9 ENSG00000251795 chr1   93303575  93303627 +          53
10 ENSG00000238310 chr1  109750416 109750470 -          55
# … with 57,810 more rows

# arrange(Chr, -Length) # to reverse sort by 'numeric' Length

Group_by & Summarize

summarize: Reduces multiple values down to a single value
group_by: Combine entries by one or more variables

# Combine by a variable, then calculate summary statistics for each group
gene_loc %>%
  group_by(Chr) %>%                         # combine rows by Chr
  summarize(numGenes = n(),                 # then summarise, number of genes/Chr
            startoffirstGene = min(Start))  # min to get the first Start location

# A tibble: 25 × 3
   Chr   numGenes startoffirstGene
   <chr>    <int>            <dbl>
 1 chr1      5363            11869
 2 chr10     2260            90652
 3 chr11     3208            75780
 4 chr12     2818            67607
 5 chr13     1217         19041312
 6 chr14     2244         19110203
 7 chr15     2080         20083769
 8 chr16     2343            61553
 9 chr17     2903             4961
10 chr18     1127            11103
# … with 15 more rows

# Example to show you can use all math/stat functions to summarize data groups
gene_loc %>%
  arrange(Length) %>%
  group_by(Chr, Strand) %>%
  summarize(numGenes = n(),
            smallestGene = first(Geneid),
            minLength = min(Length),
            firstqLength = quantile(Length, 0.25),
            medianLength = median(Length),
            iqrLength = IQR(Length),
            thirdqLength = quantile(Length, 0.75),
            maxLength = max(Length),
            longestGene = last(Geneid))

`summarise()` has grouped output by 'Chr'. You can override
using the `.groups` argument.

# A tibble: 50 × 11
# Groups:   Chr [25]
   Chr   Strand numGenes smallestGene minLength firstqLength
   <chr> <chr>     <int> <chr>            <dbl>        <dbl>
 1 chr1  -          2651 ENSG0000026…        41         354.
 2 chr1  +          2712 ENSG0000026…        21         394.
 3 chr10 -          1109 ENSG0000026…        52         322 
 4 chr10 +          1151 ENSG0000027…        27         356 
 5 chr11 -          1605 ENSG0000026…        30         477 
 6 chr11 +          1603 ENSG0000025…        37         476 
 7 chr12 -          1415 ENSG0000025…        28         375 
 8 chr12 +          1403 ENSG0000026…        30         402 
 9 chr13 -           631 ENSG0000021…        48         297 
10 chr13 +           586 ENSG0000026…        57         371.
# … with 40 more rows, and 5 more variables:
#   medianLength <dbl>, iqrLength <dbl>,
#   thirdqLength <dbl>, maxLength <dbl>, longestGene <chr>

# Renaming chromosomes & declaring them as factors w/ an intrinsic order
# gene_loc$Chr <- factor(gene_loc$Chr,
#                        levels = paste("chr",
#                                       c((1:22), "X", "Y", "M"),
#                                       sep=""))

# Saving your data locally
gene_loc %>%
  write_tsv(here("transcriptomics/GSE69360.gene-locations.txt"))

More data wrangling

Let’s combine everything from above to tidy the full GSE69360 dataset.

Dataset details:

AA: Agilent Adult; AF: Agilent Fetus
BA: BioChain Adult; BF: BioChain Fetus
OA: OriGene Adult
Tissues: Colon, Heart, Kidney, Liver, Lung, Stomach

# Extracting just the expression values & cleaning it up
View(head(gse69360, 50))

gene_counts <- gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length) %>%          # another way to select just the expression data
  rename(OA_Stomach = OA_Stomach1) %>%                      # rename couple of columns
  mutate(OA_Stomach2 = NULL, OA_Stomach3 = NULL) %>%        # remove a couple of columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid))           # cleanup data a specific column

logcpm <- gene_counts %>%
  select(-Geneid) %>%
  mutate_all(function(x) { log2((x*(1e+6)/sum(x)) + 1) } )  # convert counts in each sample to counts-per-million

summary(logcpm)

    AA_Colon           AA_Heart         AA_Kidney      
 Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.06883   Median : 0.0000   Median : 0.0000  
 Mean   : 1.10187   Mean   : 0.8103   Mean   : 0.8785  
 3rd Qu.: 1.33045   3rd Qu.: 0.5151   3rd Qu.: 0.7181  
 Max.   :18.41311   Max.   :18.7478   Max.   :18.8061  
    AA_Liver           AA_Lung         AA_Stomach      
 Min.   : 0.00000   Min.   : 0.000   Min.   : 0.00000  
 1st Qu.: 0.00000   1st Qu.: 0.000   1st Qu.: 0.00000  
 Median : 0.05563   Median : 0.000   Median : 0.06964  
 Mean   : 0.98788   Mean   : 1.049   Mean   : 0.93120  
 3rd Qu.: 0.95867   3rd Qu.: 1.123   3rd Qu.: 0.91851  
 Max.   :17.85244   Max.   :17.722   Max.   :17.85986  
    AF_Colon         AF_Stomach         BA_Colon       
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000  
 Median : 0.1995   Median : 0.0000   Median : 0.08114  
 Mean   : 1.2903   Mean   : 0.7748   Mean   : 0.98586  
 3rd Qu.: 1.8951   3rd Qu.: 0.8105   3rd Qu.: 1.10906  
 Max.   :17.4620   Max.   :18.3586   Max.   :18.75742  
    BA_Heart          BA_Kidney           BA_Liver      
 Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
 1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0000  
 Median : 0.04537   Median : 0.05856   Median : 0.0000  
 Mean   : 0.80833   Mean   : 0.96659   Mean   : 0.8194  
 3rd Qu.: 0.68444   3rd Qu.: 1.11092   3rd Qu.: 0.5974  
 Max.   :18.92846   Max.   :18.87617   Max.   :17.7612  
    BA_Lung          BA_Stomach         BF_Colon      
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.1643   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.4456   Median : 0.0000   Median : 0.1793  
 Mean   : 1.3873   Mean   : 0.7528   Mean   : 1.3716  
 3rd Qu.: 1.9392   3rd Qu.: 0.6261   3rd Qu.: 1.9775  
 Max.   :17.5495   Max.   :18.3249   Max.   :15.8531  
   BF_Stomach        OA_Stomach     
 Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.3085   Median : 0.1003  
 Mean   : 1.5164   Mean   : 1.0800  
 3rd Qu.: 2.4221   3rd Qu.: 1.3692  
 Max.   :15.7256   Max.   :17.9600

gene_logcpm <- gene_counts %>%                              # a new dataframe with logcpm values
  select(Geneid) %>%
  bind_cols(logcpm) %>%                                     # bind the logcpm matrix to the geneids
  pivot_longer(-Geneid, names_to="Sample", values_to="Logcpm") %>%     # convert to tidy data
  separate(Sample,                                          # cleanup complex variables
           into = c("Source", "Stage", "Tissue"),
           sep = c(1,2),
           remove = F) %>%                                  # keep original variable
  mutate(Tissue = gsub("^_", "", Tissue),
         Stage = ifelse(Stage == "A", "Adult", "Fetus"))

View(head(gene_logcpm, 50))

# Plotting the distribution of gene-expression in each sample
gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.shape = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()

Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).

Part 4: Visualizing tidy data w/ `ggplot`

Basics of ggplot2

Creating a plot w/ Grammar of Graphics

Recap and continuation of dplyr

Basics of plotting data with ggplot2: data, aes, geom

Customization: Colors, labels, and legends

Barplots & Histograms

ggplot, factor, aes
geom_bar, geom_histogram
facet_wrap
scale_x_log10, labs, coord_flip, theme, theme_minimal

gene_loc %>%                                                      # data
  ggplot(aes(x = Chr)) +                                          # aesthetics: what to plot?
  geom_bar()                                                      # geometry: how to plot?

gene_loc$Chr <- factor(gene_loc$Chr,
                       levels = paste("chr",
                                      c((1:22), "X", "Y", "M"),
                                      sep=""))

plot_chr_numgenes <- gene_loc %>%
  ggplot(aes(x = Chr)) +
  geom_bar()
plot_chr_numgenes

plot_chr_numgenes +
  coord_flip() +
  theme_minimal()

plot_chr_numgenes +
  labs(title = "No. genes per chromosome",
       x = "Chromosome",
       y = "No. of genes") +
  theme_minimal() +
  coord_flip()

gene_loc %>%
  ggplot(aes(x = Length)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal()

`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

plot_chr_genelength <- gene_loc %>%
  ggplot(aes(x = Length, fill = Chr)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  theme_minimal() +
  facet_wrap(~Chr, scales = "free_y")
plot_chr_genelength

`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

plot_chr_genelength +
  theme(legend.position = "none") +
  labs(x = "Gene length (log-scale)",
       y = "No. of genes")

`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

Scatter plots

geom_point
geom_abline, geom_vline, geom_hline
geom_smooth, geom_text_repel

gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point()

plot_strend_length <- gene_loc %>%
  ggplot(aes(x = End-Start, y = Length)) +
  geom_point(alpha = 0.1, size = 0.5, color = "grey", fill = "grey")
plot_strend_length

plot_strend_length <- plot_strend_length +
  scale_x_log10("End-Start") +
  scale_y_log10("Gene length") +
  theme_minimal()
plot_strend_length

Warning: Transformation introduced infinite values in
continuous x-axis

plot_strend_length +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  geom_hline(yintercept = 500, color = "blue") +
  geom_vline(xintercept = 1000, color = "orange")

Warning: Transformation introduced infinite values in
continuous x-axis

gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n())

# A tibble: 25 × 3
   Chr   meanLength numGenes
   <fct>      <dbl>    <int>
 1 chr1       2258.     5363
 2 chr2       2304.     4047
 3 chr3       2382.     3101
 4 chr4       2109.     2563
 5 chr5       2188.     2859
 6 chr6       2124.     2905
 7 chr7       2164.     2876
 8 chr8       2036.     2386
 9 chr9       2123.     2323
10 chr10      2160.     2260
# … with 15 more rows

gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point()

# install.packages("ggrepel", dependencies=T)
library(ggrepel)
gene_loc %>%
  group_by(Chr) %>%
  summarize(meanLength = mean(Length), numGenes = n()) %>%
  ggplot(aes(x = numGenes, y = meanLength)) +
  geom_point() +
  geom_smooth(color = "lightblue", alpha = 0.1) +
  labs(x = "No. of genes", y = "Mean gene length") +
  geom_text_repel(aes(label = Chr), color="red", segment.color="grey80") +
  theme_minimal()

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Boxplots & Violin plots

geom_boxplot, geom_violin
scale_y_continuous

gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal()

gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_violin() +
  scale_y_continuous(limits = c(0, 0.5)) +
  coord_flip() +
  theme_minimal()

Warning: Removed 320317 rows containing non-finite values
(stat_ydensity).

# Plotting the distribution of gene-expression in each sample
plot_sample_bxp <- gene_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
  geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_minimal()
plot_sample_bxp

Warning: Removed 259347 rows containing non-finite values
(stat_boxplot).

# Plotting scatterplot of 2 sets of samples
plot_ffcolon_scatter <- gene_logcpm %>%
  filter(Sample == "AF_Colon" | Sample == "BF_Colon") %>%
  select(Geneid, Sample, Logcpm) %>%
  pivot_wider(names_from = Sample, values_from = Logcpm) %>%
  ggplot(aes(x = AF_Colon, y = BF_Colon)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method=lm) +
  theme_minimal()
plot_ffcolon_scatter

`geom_smooth()` using formula 'y ~ x'

Some data sleuthing

# Finding genes with high variance across samples
num_totgenes <- gene_logcpm %>%
  distinct(Geneid) %>%
  nrow()

highvar_genes <- gene_logcpm %>%
  group_by(Geneid) %>%
  summarize(iqr = IQR(Logcpm)) %>%
  top_n((ceiling(num_totgenes*0.05)), iqr) %>%
  pull(Geneid)

length(highvar_genes)

[1] 2891

# Plotting expression of high-var Y chr genes across samples
chry_highvar_genes <- gene_loc %>%
  filter(Chr == "chrY" & Geneid %in% highvar_genes) %>%
  pull(Geneid)

gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes)

# A tibble: 238 × 6
   Geneid          Sample     Source Stage Tissue  Logcpm
   <chr>           <chr>      <chr>  <chr> <chr>    <dbl>
 1 ENSG00000129824 AA_Colon   A      Adult Colon   6.12  
 2 ENSG00000129824 AA_Heart   A      Adult Heart   0     
 3 ENSG00000129824 AA_Kidney  A      Adult Kidney  0.106 
 4 ENSG00000129824 AA_Liver   A      Adult Liver   1.75  
 5 ENSG00000129824 AA_Lung    A      Adult Lung    0     
 6 ENSG00000129824 AA_Stomach A      Adult Stomach 0     
 7 ENSG00000129824 AF_Colon   A      Fetus Colon   0.0352
 8 ENSG00000129824 AF_Stomach A      Fetus Stomach 4.51  
 9 ENSG00000129824 BA_Colon   B      Adult Colon   6.05  
10 ENSG00000129824 BA_Heart   B      Adult Heart   4.81  
# … with 228 more rows

plot_chry_highvar_boxplot <- gene_logcpm %>%
  filter(Geneid %in% chry_highvar_genes) %>%
  ggplot(aes(x = reorder(Sample, Logcpm, FUN = median),
             y = Logcpm,
             color = Sample)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")
plot_chry_highvar_boxplot

Part 5: Export & Wrap-up w/ `RMarkdown`

Saving your plots

ggsave

Save a ggplot (or other grid object) with sensible defaults

library(tidyverse)
# Save your file name
plot1 <- "chr_highvar.png"

# Save your absolute/relative path
my_full_path <- here("transcriptomics")

# To save as a tab-delimited text file ...
ggsave(filename=plot1,
       plot=plot_chry_highvar_boxplot,
       device="png",
       path=my_full_path,
       dpi=600)

Saving 6.5 x 4 in image

Saving your data files

write_delim

Write a dataframe to a delimited file

library(tidyverse)
# Save your file name
filename <- "my_new_data.txt"

# Save your absolute/relative path
my_full_path <- here("data")

# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
          path=paste(my_full_path, filename, "/"), # Absolute path recommended.
          # However, you can directly use 'filename' here
          # if you are saving the file in the same directory
          # as your code.
          col_names=T) # if you want the column names to be
# saved in the first row, recommended

# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
          path=my_path,
          col_names=T)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
            path=my_path,
            col_names=T,
            delim="---")

What you learnt today!

Option	Description
Part 1	Getting Started
`install.packages`	Download and install packages from CRAN-like repositories or from local files
`library`	Library and require load and attach add-on packages
`package::function`	To run a function w/o loading the package
Import	`tidyverse` > `readr` & `readxl`
`read_delim`	Read a delimited file (incl csv, tsv) into a tibble
`read_csv, read_tsv`	read_csv() and read_tsv() are special cases of the general read_delim()
`read_excel`	Read xls and xlsx files
`here`	Part of `here` package. To set paths relative to your current project directory
Data snapshot
`str`	Compactly Display the Structure of an Arbitrary R Object
`head`	Return the First or Last Part of an Object
`glimpse`	Get a glimpse of your data
`View`	Invoke a Data Viewer (pop-up, requires XQuartz/X11)
`kable`	Create tables in LaTeX, HTML, Markdown and reStructuredText
`paged_table`	Create a table in HTML with support for paging rows and columns
Part 2	`tidyverse` > `tidyr`
`pivot_longer`	Gather Columns Into Key-Value Pairs (COLS -> ROWS)
`pivot_wider`	Spread a key-value pair across multiple columns
`separate`	Separate one column into multiple column
`unite`	Unite multiple columns into one
Part 3	`tidyverse` > `dplyr`
`filter`	Return rows with matching conditions
`select`	Select/rename variables by name
`mutate`	Add new variables
`transmute`	Add new variables & drops existing variables
`distinct`	Get unique entries
`arrange`	Arrange rows by variables
`summarise`	Reduces multiple values down to a single value
`group_by`	Group by one or more variables
`join`	Join two dataframes together: `left_join`, `right_join`, `inner_join`
`bind`	Efficiently bind multiple dataframes by row and column: `bind_rows`, `bind_cols`
`setops`	Set operations: `intersect`, `union`, `setdiff`, `setequal`
Part 4	`tidyverse` > `ggplot`
`ggplot`	Create a new ggplot
Aesthetics
`aes`	Specify details on what to plot
`factor`	Specify a variable to be ordered & categorical
`facet_wrap`	Split into multiple sub-plots based on a categorical variable
Geometries
`geom_bar`	Create a barplot
`geom_histogram`	Create a histogram
`geom_point`	Create a scatter plot
`geom_boxplot`	Create a boxplot
`geom_violin`	Create a violin plot
`geom_abline, geom_hline, geom_vline`	Add slanted, horizontal, or vertical lines
`geom_smooth`	Add a smooth curve that fits the data points
Plot customization
`theme, theme_minimal, theme_bw`	Adjust themes to minimal, black/white, etc
`scale_x_log10`	Change x/y axes to log scale
`scale_y_continuous`	Change x/y axes to continuous & set limits
`coord_flip`	Flip x & y coordinates
`labs`	Specify axes labels and plot titles
Part 5	Export & Wrap-up
Export	`tidyverse` > `readr`
`ggsave`	Save a ggplot (or other grid object) with sensible defaults
`write_delim`	Write a dataframe to a delimited file
`write_tsv`	write_delim customized for tab-separated values
`write_csv`	write_delim customized for comma-separated values

Credits

Arjun Krishnan and I co-developed the content for this workshop.

Acknowledgements

Krishnan Lab | JRaviLab
R-Ladies East Lansing
Hosts: R-Ladies Chennai | R-Ladies Bangalore | R-Ladies Tunis | MSU | ISCB Academy

Contact

Webpage: jravilab.github.io | thekrishnanlab.org
Email: janani@msu.edu | arjun@msu.edu
Twitter: @janani137 | @compbiologist
GitHub: @jananiravi | @arjunkrish

Additional resources

You can access all relevant material pertaining to this workshop here.
Other related R-Ladies East Lansing workshops & useful cheatsheets.
Computational biology gists
Data-to-viz.com and R Graph Gallery

Some awesome open-source books

Hands-On Programming with R: Grolemund #HOPR https://rstudio-education.github.io/hopr
R for Data Science: Wickham & Grolemund #R4DS https://r4ds.had.co.nz
R Programming for Data Science: Peng https://leanpub.com/rprogramming
Learning Statistics with R: Navarro https://learningstatisticswithr.com/book

Introduction to Tidyverse | using transcriptomic data

About me

Part 1: Getting Started w/ `readr`

Installation and set-up

Install RStudio

Install R

Install packages

Loading packages

Some useful cheatsheets

Data import

Knowing your data

Part 2: Reshaping data w/ `tidyr`

Gather, Spread

Unite, Separate

Part 3: Data wranging with `dplyr`

Filter

Select

Mutate

Distinct & Arrange

Group_by & Summarize

More data wrangling

Part 4: Visualizing tidy data w/ `ggplot`

Basics of ggplot2

Barplots & Histograms

Scatter plots

Boxplots & Violin plots

Some data sleuthing

Part 5: Export & Wrap-up w/ `RMarkdown`

Saving your plots

ggsave

Saving your data files

write_delim

What you learnt today!

Credits

Acknowledgements

Contact

Additional resources

Some awesome open-source books

Other helpful links

Citation

Introduction to Tidyverse | using transcriptomic data

About me

Part 1: Getting Started w/ readr

Installation and set-up

Install RStudio

Install R

Install packages

Loading packages

Some useful cheatsheets

Data import

Knowing your data

Part 2: Reshaping data w/ tidyr

Gather, Spread

Unite, Separate

Part 3: Data wranging with dplyr

Filter

Select

Mutate

Distinct & Arrange

Group_by & Summarize

More data wrangling

Part 4: Visualizing tidy data w/ ggplot

Basics of ggplot2

Barplots & Histograms

Scatter plots

Boxplots & Violin plots

Some data sleuthing

Part 5: Export & Wrap-up w/ RMarkdown

Saving your plots

ggsave

Saving your data files

write_delim

What you learnt today!

Credits

Acknowledgements

Contact

Additional resources

Some awesome open-source books

Other helpful links

Citation

Part 1: Getting Started w/ `readr`

Part 2: Reshaping data w/ `tidyr`

Part 3: Data wranging with `dplyr`

Part 4: Visualizing tidy data w/ `ggplot`

Part 5: Export & Wrap-up w/ `RMarkdown`