ML for Microbial Genomics

Author

Janani Ravi | jravilab.github.io

Published

August 2, 2023

ML for AMR

This is a companion repo & webpage for the ‘ML for Microbial Genomics’ workshop, first presented at the MLHD 2023 conference @ICTS! You can access the material here: https://jananiravi.github.io/2023-mlhd and slides here.

Overview

This session will cover ideas, concepts, and insights needed to get started with building machine learning models in R with high-dimensional data, such as microbial genomics data. No prior knowledge of ML is required.

Acknowledgments

  • JRaviLab: Jacob Krol, Ethan Wolfe, Evan Brenner, Keenan Manpearl, Joseph Burke, Vignesh Sridhar, Jill Bilodeaux (contributed to the antimicrobial resistance project)
  • Arjun Krishnan (for the tidymodels qmd primer)
  • R-Ladies, esp. R-Ladies East Lansing, R-Ladies Aurora; R/Bioconductor; rOpenSci (for all things R!)
  • tidymodels resources by Julia Silge et al. | https://tidymodels.org

Install and load packages

To use the code in this document, you will need to install the following packages: glmnet, ranger, tidymodels, tidyverse, and vip.
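
If any of these are missing, you can install them all in one go:

# Install once; subsequent sessions only need the library() calls below
install.packages(c("tidyverse", "tidymodels", "glmnet", "vip", "ranger"))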

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
✔ broom        1.0.5     ✔ rsample      1.1.1
✔ dials        1.2.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.0     ✔ yardstick    1.2.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(glmnet) # for LR
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7
library(vip)    # to extract important features

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(ranger) # for RF

Explore your data

Here, we will use microbial genomics data (e.g., gene presence/absence across multiple microbial genomes) wrangled and processed from the BV-BRC to predict the antibiotic resistance phenotype of each sample (genome) based on the presence/absence of genes in that sample.

To make the dataset usable on your local desktop machine, we have pre-processed the data (using custom scripts that use NCBI/BV-BRC data and metadata, the NCBI and BV-BRC CLIs, Prokka for genome annotation, and Roary/CD-HIT for constructing the gene presence/absence matrix and gene clusters that serve as ML features). For this workshop, we have selected a subset of ~900 genomes from Staphylococcus aureus, and limited the data to ~2,300 genes after filtering out core (present in >95% of genomes) and unique (present in <5% of genomes) genes.

The data is contained in the file data/staph_penicillin_pangenome.csv, with samples (genomes) along the rows and genes along the columns. To get started, let’s read this data into R using the readr::read_delim function. The file also carries relevant metadata about the genomes and drugs.

Read in the data file

# Can be set to read csv/tsv: any feature matrix file with metadata
# e.g., gpa-feature-matrix.tsv
gpa_featmat <- read_delim("data/staph_penicillin_pangenome.csv",
                          delim = ",", col_names = T)
Rows: 920 Columns: 2328
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr    (4): antibiotic, amr_pheno, drug_class, assembly_accession
dbl (2324): s_no, genome_id, prmA, hisC_1, araB, yqeN, tagH_2, tet(38), lrgB...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data exploration

Let’s print the tibble to examine it quickly.

gpa_featmat
# A tibble: 920 × 2,328
    s_no genome_id antibiotic amr_pheno   drug_class assembly_accession  prmA
   <dbl>     <dbl> <chr>      <chr>       <chr>      <chr>              <dbl>
 1     0     1280. penicillin Susceptible penicillin GCA_024925485.1        1
 2     1     1280. penicillin Susceptible penicillin GCA_024925485.1        1
 3     2     1280. penicillin Susceptible penicillin GCA_024972975.1        1
 4     3     1280. penicillin Susceptible penicillin GCA_024972975.1        1
 5     4     1280. penicillin Resistant   penicillin GCA_025232045.1        1
 6     5     1280. penicillin Resistant   penicillin GCA_025232045.1        1
 7     6    46170. penicillin Resistant   penicillin GCA_002204575.1        1
 8     7     1280. penicillin Susceptible penicillin GCA_002089075.2        1
 9     8     1280. penicillin Resistant   penicillin GCA_002089095.2        1
10     9     1280. penicillin Resistant   penicillin GCA_002097595.2        1
# ℹ 910 more rows
# ℹ 2,321 more variables: hisC_1 <dbl>, araB <dbl>, yqeN <dbl>, tagH_2 <dbl>,
#   `tet(38)` <dbl>, lrgB <dbl>, cmtB <dbl>, scmP_2 <dbl>, est_2 <dbl>,
#   glcB <dbl>, ponA <dbl>, clpX <dbl>, yiiM <dbl>, thiN <dbl>, ilvE <dbl>,
#   ydcV <dbl>, menH <dbl>, relA <dbl>, yicL <dbl>, rho <dbl>, guaA <dbl>,
#   hemB <dbl>, hemA <dbl>, glpQ_1 <dbl>, suhB <dbl>, tatC2 <dbl>, groL <dbl>,
#   glpK <dbl>, frdA <dbl>, yycI <dbl>, pepA_1 <dbl>, feuC <dbl>, miaA <dbl>, …
dim(gpa_featmat)
[1]  920 2328

Then, let’s examine the amr_pheno column of this data frame, which tells us the antimicrobial resistance (AMR) phenotype (resistant/susceptible) of each sample (i.e., each row/genome) for a given drug. We can tabulate the number and fraction of genomes per phenotype easily using the count and mutate functions from dplyr.

gpa_featmat %>%
  count(amr_pheno) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  amr_pheno       n  prop
  <chr>       <int> <dbl>
1 Resistant     481 0.523
2 Susceptible   439 0.477

Before we proceed, let’s also get a sense of the values in this feature matrix. Since there are thousands of genes, we’ll compute how many genomes each gene is present in and visualize the distribution of these gene frequencies as a binned bar plot (a histogram-like view).

gpa_sum <- gpa_featmat |>
  select(7:last_col()) |> # gene columns start at column 7 (after the metadata)
  summarize(across(where(is.numeric), sum)) # no. of genomes carrying each gene

gpa_sum_long <- gpa_sum |> 
  pivot_longer(cols = everything(), names_to = "gene")


ggplot(gpa_sum_long, aes(value)) +
  # geom_histogram(bins=10) +
  geom_bar() +
  scale_x_binned() +
  theme_minimal() +
  xlab("Genes present in X genomes") +
  ylab("N Genes with X frequency")

Feature matrices → ML

To keep the problem simple, we pick one drug of interest (penicillin) and define the problem as classifying whether a genome is resistant or susceptible to this antibiotic.

pos_pheno <- "Resistant"

Then, we need to turn the amr_pheno variable into a binary indicator of whether the genome is resistant or not, and finally convert that variable into a factor so that the model treats it as a class label for partitioning the samples.

Set up the feature matrix and labels for the ML model

gpa_featmat_pheno <-
  gpa_featmat %>%
  mutate(amr_pheno = ifelse(amr_pheno==pos_pheno,
                            "Resistant", "Susceptible")) %>%
  mutate(across(where(is.character), as.factor))

A critical quantity to be fully aware of when setting up an ML problem is class balance, i.e., the relative sizes of the positive ("Resistant") and negative ("Susceptible") classes.

gpa_featmat_pheno %>% 
  count(amr_pheno) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  amr_pheno       n  prop
  <fct>       <int> <dbl>
1 Resistant     481 0.523
2 Susceptible   439 0.477

We can see that, in our dataset, ~52% of the samples are “Resistant”, so the two classes are nearly balanced here. In many real datasets, one class is far rarer; referred to as class imbalance, that scenario is extremely common in biomedicine and needs careful attention when analyzing and interpreting results.

Data splitting

If we train an AMR classification ML model on the data from all samples, we cannot easily tell how good the model is. So, let’s reserve 25% of the samples for a test set, which we will hold out until the end of the project, at which point there should only be one or two models under serious consideration. The test set will be used as an unbiased source for measuring final model performance.

This is also the first step where we need to pay attention to class balance. We use stratified random sampling so that both splits contain nearly identical proportions of positive and negative samples.

# The function `initial_split()` takes the original data and saves the information on how to make the partitions.
set.seed(123)
splits <- initial_split(data = gpa_featmat_pheno,
                        strata = amr_pheno)
# Within initial_split, you can specify proportion using "prop" and
# grouping/datasets to go into the same set using "group"

# The `training()` and `testing()` functions return the actual datasets.
gpa_other <- training(splits)
gpa_test  <- testing(splits)

Let’s check if we indeed did achieve stratified data splits.

# other set proportions by AMR pheno
gpa_other %>%
  count(amr_pheno) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  amr_pheno       n  prop
  <fct>       <int> <dbl>
1 Resistant     360 0.522
2 Susceptible   329 0.478
# test set proportions by R/S ratio
gpa_test %>%
  count(amr_pheno) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  amr_pheno       n  prop
  <fct>       <int> <dbl>
1 Resistant     121 0.524
2 Susceptible   110 0.476

What about the gpa_other split, i.e., everything that is not the test set? This split will be used to create two new datasets:

  1. The set held out for the purpose of measuring performance, called the validation set, and
  2. The remaining data used to fit the model, called the training set.

We’ll use the validation_split function to allocate 20% of the gpa_other samples to the validation set and the remaining 80% to the training set. Note that this function too has the strata argument. Do you see why we need it here?

set.seed(234)
gpa_val <- validation_split(data = gpa_other,
                            strata = amr_pheno, # maintain original data split
                            prop = 0.80) # 80% training; 20% validation
gpa_val
# Validation Set Split (0.8/0.2)  using stratification 
# A tibble: 1 × 2
  splits            id        
  <list>            <chr>     
1 <split [551/138]> validation

Training ML models in R: Penalized logistic regression

Since our outcome variable amr_pheno is categorical, logistic regression is a good first model to try. Let’s use a model that can perform feature selection during training. The glmnet R package fits a generalized linear model via penalized maximum likelihood. This method of estimating the logistic regression slope parameters adds a penalty to the fitting process so that the coefficients of less relevant predictors are driven towards zero. One of the glmnet penalization methods, called the lasso, can set predictor slopes exactly to zero if a large enough penalty is used.
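
Concretely, for the lasso (mixture = 1 below), glmnet minimizes the penalized negative log-likelihood, written here for a binary outcome $y_i \in \{0, 1\}$ with predicted probability $p_i$:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big] \;+\; \lambda \sum_{j=1}^{p} |\beta_j|,
\qquad p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}
$$

Larger values of the penalty $\lambda$ push more coefficients $\beta_j$ exactly to zero.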

Build the model

To specify a penalized logistic regression model that uses a feature selection penalty, we will use the parsnip package (part of tidymodels), which provides a tidy, unified interface to models, letting us try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

Here, let’s use it with the glmnet engine:

# Build logistic regression model
lr_model <- 
  logistic_reg(penalty = tune(), # strength of regularization/penalty
               mixture = 1) %>% # specifies a pure lasso model
  set_engine("glmnet") # set to generalized linear models

We’ll set the penalty argument to tune() as a placeholder for now. This is a model hyperparameter that we will tune to find the best value for making predictions with our data. Setting mixture to a value of 1 means that the glmnet model will potentially remove irrelevant predictors and choose a simpler model: the penalty added to the loss is the sum of the absolute values of the beta coefficients (the L1 norm).

You can try mixture = 0 for L2 (ridge) regression, or values between 0 and 1 for an elastic net combining L1 and L2.
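
For instance, an elastic-net variant of the same spec would only change the mixture argument (the 0.5 here is just an illustrative choice):

# Elastic net: blend L1 and L2 penalties (mixture strictly between 0 and 1)
lr_model_enet <-
  logistic_reg(penalty = tune(), mixture = 0.5) %>% # 0.5 = equal L1/L2 blend
  set_engine("glmnet")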

Create the recipe

Next, we’re going to use the recipes package to build dplyr-like, pipeable sequences of feature engineering steps to get our data ready for modeling. Recipes are built as a series of pre-processing steps, such as:

  • converting qualitative predictors to indicator variables (also known as dummy variables),

  • transforming data to be on a different scale (e.g., taking the logarithm of a variable),

  • transforming whole groups of predictors together,

  • extracting key features from raw variables (e.g., getting the day of the week out of a date variable),

and so on. Here, we’re using it to set up the outcome variable as a function of gene presence and then do two things:

  • step_zv() removes predictors that contain only a single unique value (e.g., all zeros). This matters because, for penalized models, the predictors should be centered and scaled, and a zero-variance column cannot be scaled (its standard deviation is zero).

  • step_normalize() centers and scales numeric variables.

lr_recipe <- 
  recipe(amr_pheno ~ ., data = gpa_other) %>% # specify data + labels
  update_role(c(s_no, genome_id, assembly_accession, # genome attributes
                antibiotic, drug_class), # drug attributes
              new_role = "Supplementary") %>% # tag metadata not used for ML
  step_zv(all_predictors()) %>% # remove predictors with only one value
  # step_nzv(all_predictors()) # for near-zero variance
  step_normalize(all_predictors()) # normalize all predictors

Try with step_nzv instead of only step_zv.
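
A minimal sketch of that variant, assuming the same roles as above (step_nzv’s default thresholds decide what counts as near-zero variance):

lr_recipe_nzv <-
  recipe(amr_pheno ~ ., data = gpa_other) %>%
  update_role(c(s_no, genome_id, assembly_accession,
                antibiotic, drug_class),
              new_role = "Supplementary") %>%
  step_nzv(all_predictors()) %>% # drops near-constant, not just constant, predictors
  step_normalize(all_predictors())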

Create the workflow

Let’s bundle the model and recipe into a single workflow() object to make management of the R objects easier:

# Standard model recipe for LR | uses our recipe definition from above
lr_workflow <- workflow() %>% 
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

Create the grid for tuning

Before we fit this model, we need to set up a grid of penalty values to tune.

# Try values from 0.0001 to 0.1 to penalize for complex models;
# Minimizing no. of features with non-zero coefficients
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 10))
lr_reg_grid
# A tibble: 10 × 1
    penalty
      <dbl>
 1 0.0001  
 2 0.000215
 3 0.000464
 4 0.001   
 5 0.00215 
 6 0.00464 
 7 0.01    
 8 0.0215  
 9 0.0464  
10 0.1     
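
The same grid can also be built with dials (loaded with tidymodels); note that penalty() works on a log10 scale:

# Equivalent grid via dials; the range is in log10 units
lr_reg_grid_alt <- grid_regular(penalty(range = c(-4, -1)), levels = 10)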

Train and tune the model

The tune::tune_grid() function will help us train these 10 penalized logistic regression models and save the validation set predictions (via the call to control_grid()) so that diagnostic information is available after fitting. To quantify how well each model performs on the validation set, let’s first use the area under the ROC curve across the range of hyperparameters.

lr_res <- 
  lr_workflow %>% 
  tune_grid(resamples = gpa_val, # using validation split
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
#metrics = metric_set(pr_auc)) # if you want to optimize for AUPRC instead

Tune the model with cross-validation instead?

lr_res_cv <- 
  lr_workflow %>% 
  tune_grid(resamples = vfold_cv(gpa_other), # new CV line
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
#metrics = metric_set(pr_auc)) # if you want to optimize for AUPRC instead

Evaluation metrics

A plot of the area under the ROC curve against the range of penalty values will help us guess which value is best for the problem/dataset at hand.

lr_plot <- lr_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the ROC Curve") +
  #ylab("Area under the PR Curve") +
  scale_x_log10(labels = scales::label_number()) +
  theme_bw()

lr_plot

What is your interpretation of this plot? Write it here.

We can also tabulate these results to help pick the “best” hyperparameter.

top_models <-
  lr_res %>% 
  show_best("roc_auc", n = 10) %>% 
  arrange(penalty) 
top_models
# A tibble: 10 × 7
    penalty .metric .estimator  mean     n std_err .config              
      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1 0.0001   roc_auc binary     0.996     1      NA Preprocessor1_Model01
 2 0.000215 roc_auc binary     0.996     1      NA Preprocessor1_Model02
 3 0.000464 roc_auc binary     0.996     1      NA Preprocessor1_Model03
 4 0.001    roc_auc binary     0.996     1      NA Preprocessor1_Model04
 5 0.00215  roc_auc binary     0.996     1      NA Preprocessor1_Model05
 6 0.00464  roc_auc binary     0.997     1      NA Preprocessor1_Model06
 7 0.01     roc_auc binary     0.997     1      NA Preprocessor1_Model07
 8 0.0215   roc_auc binary     0.996     1      NA Preprocessor1_Model08
 9 0.0464   roc_auc binary     0.996     1      NA Preprocessor1_Model09
10 0.1      roc_auc binary     0.979     1      NA Preprocessor1_Model10

Let’s select the best value and visualize the validation set ROC curve. Why are we picking the 6th value instead of the 1st even though they have nearly identical performance metrics?

lr_best <- lr_res %>% 
  collect_metrics() %>% 
  arrange(penalty, mean) %>% 
  slice(6)

# Alternatively, you can just use
lr_best <- lr_res |> 
  select_best(metric = "roc_auc")
lr_best
# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1 0.00464 Preprocessor1_Model06
lr_roc <- lr_res %>% 
  collect_predictions(parameters = lr_best) %>% 
  roc_curve(amr_pheno, .pred_Resistant) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_roc)

## Alternatively ... 
# Select the best LR model
final_lr_model <- finalize_workflow(lr_workflow, lr_best)
# Fit the data
lr_fit <- final_lr_model %>% fit(data = gpa_other)
# Save predictions
lr_aug <- augment(lr_fit, gpa_test)
# Calculate AUROC
auroc <- lr_aug %>% roc_auc(truth = amr_pheno, .pred_Resistant) %>%
      select(.estimate) %>% as.numeric()
print(paste("AUROC:", auroc))
[1] "AUROC: 0.993313298271975"

The area under the ROC curve has a nice property: it can be interpreted as a probability (the chance that a randomly chosen positive sample is ranked above a randomly chosen negative sample) and has a close connection to a statistical test (the Mann-Whitney U test).
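
As a quick, optional sanity check of that interpretation (a sketch using the lr_aug test-set predictions from above), the normalized Mann-Whitney U statistic should reproduce the AUROC, up to tie handling:

# AUROC = P(a random Resistant genome scores higher than a random Susceptible one)
pos_scores <- lr_aug %>% filter(amr_pheno == "Resistant") %>% pull(.pred_Resistant)
neg_scores <- lr_aug %>% filter(amr_pheno == "Susceptible") %>% pull(.pred_Resistant)
u_stat <- unname(wilcox.test(pos_scores, neg_scores)$statistic) # Mann-Whitney U
u_stat / (length(pos_scores) * length(neg_scores)) # should match roc_auc() above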

Selecting the top features

# Extract top 10 genes
n_top_genes <- 10
top_genes <- lr_fit %>% extract_fit_parsnip() %>%
  vip::vi() %>% slice(1:n_top_genes) %>%
  select(1) %>% pull()
print(top_genes)
 [1] "group_7632"  "group_7634"  "blaZ"        "blaR1"       "ugpQ"       
 [6] "blaR1-2"     "bin3"        "group_11618" "cadC"        "group_5831" 

When you have imbalanced classes

However, AUROC is not sensitive to class imbalance: it can come out high even if the model makes many mistakes on the minority positive class (which is typically the class of biomedical interest) while getting most of the majority negative class right.

So, the final analysis we’re going to do is to evaluate performance with another metric, the area under the precision-recall curve (AUPRC), which is more sensitive to the minority positive class: it focuses on the fraction of top positive predictions that are correct (precision) and the fraction of positive samples that are correctly predicted (recall).

lr_res_pr <- lr_workflow %>% 
  tune_grid(resamples = gpa_val,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(pr_auc))
lr_plot_pr <- lr_res_pr %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the PR Curve") +
  scale_x_log10(labels = scales::label_number()) +
  theme_bw()

lr_plot_pr

lr_best_pr <- lr_res_pr %>% 
  collect_metrics() %>% 
  arrange(penalty) %>% 
  slice(6)

# Alternatively, you can just use
lr_best_pr <- lr_res_pr |> 
  select_best(metric = "pr_auc")
lr_best_pr
# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1 0.00464 Preprocessor1_Model06
lr_pr <- lr_res_pr %>% 
  collect_predictions(parameters = lr_best_pr) %>% 
  pr_curve(amr_pheno, .pred_Resistant) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_pr)

Retrieving your top features

# Select best LR model
best_lr_model_pr <- select_best(lr_res_pr, metric = "pr_auc")
final_lr_model_pr <- finalize_workflow(lr_workflow, best_lr_model_pr)

# Fit the data
lr_fit_pr <- final_lr_model_pr %>%
  fit(data = gpa_other)

# Save predictions
lr_aug_pr <- augment(lr_fit_pr, gpa_test)

# Get AUPRC
auprc <- lr_aug_pr %>%
  pr_auc(truth = amr_pheno, .pred_Resistant) %>%
  select(.estimate) %>% as.numeric()
print(paste("AUPRC:", auprc))
[1] "AUPRC: 0.99443515870738"
## Extract top 10 genes
n_top_genes <- 10

top_genes_pr <- lr_fit_pr |> 
 extract_fit_parsnip() |>
 vip::vi() |>
 slice(1:n_top_genes) |>
 select(1) |>
 pull()
print(top_genes_pr)
 [1] "group_7632"  "group_7634"  "blaZ"        "blaR1"       "ugpQ"       
 [6] "blaR1-2"     "bin3"        "group_11618" "cadC"        "group_5831" 

Predicting AMR with Random Forest

# Setting a seed enables our analysis to be reproducible when random numbers are used.
set.seed(569)

rf_splits <- initial_split(gpa_featmat_pheno,
                           #prop = train_test_split,
                           strata = amr_pheno)

# Create separate data frames for the training and testing sets.
gpa_train <- training(rf_splits)
gpa_test <- testing(rf_splits)

set.seed(234)
# Validation split (not used below; we tune the RF with cross-validation instead)
gpa_val <- validation_split(data = gpa_train,
                            strata = amr_pheno, # maintain original data split
                            prop = 0.80) # 80% training; 20% validation
gpa_val
# Validation Set Split (0.8/0.2)  using stratification 
# A tibble: 1 × 2
  splits            id        
  <list>            <chr>     
1 <split [551/138]> validation
# Create recipe
rf_recipe <- recipe(amr_pheno ~ ., data = gpa_train) %>%
  # To keep these columns but not use them as predictors or outcome
  update_role(c(s_no, genome_id, assembly_accession, # genome attributes
                antibiotic, drug_class), # drug attributes
              new_role = "Supplementary") %>%
  step_zv(all_predictors()) %>% # remove predictors with only one value
  # step_nzv(all_predictors()) # for near-zero variance
  step_normalize(all_predictors()) # normalize all predictors

# Build random forest model; flag mtry and min_n as tunable
num_trees <- 1000
rf_model <- rand_forest(mtry = tune(), min_n = tune(), trees = num_trees) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Create workflow
rf_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(rf_recipe)

# Specify the hyperparameter tuning grid
# mtry = no. of predictors sampled at each split; illustrative values
# (~0.2%, 2%, and 20% of the ~2,300 genes)
rf_grid <- crossing(mtry = c(5, 50, 500), min_n = c(2, 6, 12))

# Tune the model using cross-validation;
# try 9 different hyperparameter sets; use AUROC as the evaluation metric.
rf_res <- tune_grid(rf_workflow,
                    resamples = vfold_cv(gpa_train),
                    grid = rf_grid,
                    control = control_grid(save_pred = TRUE),
                    metrics = metric_set(roc_auc))
rf_best <- rf_res |>
  select_best(metric = "roc_auc")

# Plot AUROC
rf_roc <- rf_res %>%
  collect_predictions(parameters = rf_best) %>%
  roc_curve(amr_pheno, .pred_Resistant) %>%
  mutate(model = "Random Forest")
autoplot(rf_roc)

# Select best RF model
best_rf_model <- select_best(rf_res, metric = "roc_auc")
final_rf_model <- finalize_workflow(rf_workflow, best_rf_model)

# Fit the data
rf_fit <- final_rf_model %>% fit(data = gpa_train)

# Save predictions
rf_aug <- augment(rf_fit, gpa_test)

# Get AUROC
auroc <- rf_aug %>%
  roc_auc(truth = amr_pheno, .pred_Resistant) %>%
  select(.estimate) %>%
  as.numeric()
print(paste("AUROC:", auroc))
[1] "AUROC: 0.989857250187829"
# Extract top 10 genes
n_top_genes <- 10
top_genes_rf <- rf_fit %>%
  extract_fit_parsnip() %>%
  vip::vi() %>% slice(1:n_top_genes) %>%
  select(1) %>% pull()
print(top_genes_rf)
 [1] "blaZ"        "blaR1"       "hin"         "blaI"        "blaR1-2"    
 [6] "cadC"        "group_2301"  "group_7632"  "group_7634"  "group_12234"

Too many features?

Try dimensionality reduction with SVD/PCA → retrieve the top PCs → find the features that contribute most to the top PCs, as sketched below.
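
One way to sketch this with recipes is step_pca on the normalized predictors. Here, num_comp = 50 is an arbitrary starting point (it could itself be tune()d):

lr_recipe_pca <-
  recipe(amr_pheno ~ ., data = gpa_other) %>%
  update_role(c(s_no, genome_id, assembly_accession, # genome attributes
                antibiotic, drug_class), # drug attributes
              new_role = "Supplementary") %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 50) # keep the top 50 PCs as features

# Which genes load most heavily on the retained PCs?
lr_recipe_pca %>% prep() %>% tidy(number = 3) # 3 = position of step_pca above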

Recap & Conclusions

  • Reproducible docs & code with qmd/rmd

  • Basic data cleanup to get it ready for ML models

  • tidymodels

  • Building recipes and workflows

  • Calculating AUROC and AUPRC

  • Train-validate-test splits to optimize for best hyperparameters

  • Picking the best model based on high AUROC/AUPRC (preferring simpler, more strongly penalized models among near-ties)

  • Plotting AUROC/AUPRC

  • Logistic regression with L1 (lasso) regularization (and L2 ridge)

  • Random Forest models

How to contact us