In this vignette, we will demonstrate how to use the HDAnalyzeR package to build classification models. We will start by loading the HDAnalyzeR and dplyr packages, widening the example dataset, and loading the metadata.

library(HDAnalyzeR)
library(dplyr) 

wide_data <- widen_data(example_data)
metadata <- example_metadata

First, let’s start with a LASSO regularized regression model via do_rreg(). Exactly as in the previous vignette with the differential expression functions, we have to state the case and control group(s).

res <- do_rreg(wide_data, metadata, case = "AML", control = c("CLL", "MYEL", "GLIOM"))

res$hypopt_res$hypopt_vis

res$testfit_res$metrics
#> $accuracy
#> [1] 0.68
#> 
#> $sensitivity
#> [1] 0.73
#> 
#> $specificity
#> [1] 0.62
#> 
#> $auc
#> [1] 0.67
#> 
#> $conf_matrix
#>           Truth
#> Prediction  0  1
#>          0 11  5
#>          1  4  8
#> 
#> $roc_curve

res$var_imp_res$var_imp_plot

res$boxplot_res

We can change several parameters of do_rreg(): for example, the number of cross-validation folds, the number of grid points for the hyperparameter optimization, the number of cores to use, the feature correlation threshold, and even the type of regularized regression. For more information, please refer to the documentation.

res <- do_rreg(wide_data, 
               metadata, 
               case = "AML", 
               control = c("CLL", "MYEL", "GLIOM"), 
               cv_sets = 8, 
               grid_size = 15, 
               ncores = 2, 
               cor_threshold = 0.7, 
               type = "elnet",
               palette = "cancers12")

res$testfit_res$metrics
#> $accuracy
#> [1] 0.61
#> 
#> $sensitivity
#> [1] 0.6
#> 
#> $specificity
#> [1] 0.62
#> 
#> $auc
#> [1] 0.69
#> 
#> $conf_matrix
#>           Truth
#> Prediction 0 1
#>          0 9 5
#>          1 6 8
#> 
#> $roc_curve

res$var_imp_res$var_imp_plot

res$boxplot_res

We can classify on a different variable, such as Sex, and even use a different algorithm, such as random forest via do_rf(). Do not forget to change the exclude_cols parameter so that it excludes the Disease column instead of the default Sex.

res <- do_rf(wide_data, 
             metadata, 
             variable = "Sex", 
             case = "F", 
             control = "M", 
             exclude_cols = "Disease", 
             palette = "sex_hpa",
             points = FALSE)

res$testfit_res$metrics
#> $accuracy
#> [1] 0.55
#> 
#> $sensitivity
#> [1] 0.18
#> 
#> $specificity
#> [1] 0.92
#> 
#> $auc
#> [1] 0.74
#> 
#> $conf_matrix
#>           Truth
#> Prediction  0  1
#>          0 17  7
#>          1 75 85
#> 
#> $roc_curve

res$var_imp_res$var_imp_plot

res$boxplot_res

If our data have a single predictor, we can use do_lreg() instead of do_rreg() to perform a logistic regression, while random forest can still be used exactly as with multiple predictors. Let’s also change the ratio of the training and testing sets.

one_pred_data <- wide_data |> 
  select(DAid, ANGPTL2) |> 
  na.omit()

res <- do_lreg(one_pred_data, 
               metadata, 
               case = "AML",
               control = c("CLL", "MYEL", "GLIOM"),
               ratio = 0.7,
               palette = "cancers12")

res$metrics
#> $accuracy
#> [1] 0.57
#> 
#> $sensitivity
#> [1] 0.53
#> 
#> $specificity
#> [1] 0.6
#> 
#> $auc
#> [1] 0.61
#> 
#> $conf_matrix
#>           Truth
#> Prediction 0 1
#>          0 8 6
#>          1 7 9
#> 
#> $roc_curve

res$boxplot_res
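
As noted above, random forest needs no dedicated single-predictor function; do_rf() can be applied to the same data directly. A minimal sketch, assuming the same case and control groups as before (res_rf is a hypothetical name; output not shown here):

# hypothetical call: random forest on the single ANGPTL2 predictor
res_rf <- do_rf(one_pred_data, 
                metadata, 
                case = "AML", 
                control = c("CLL", "MYEL", "GLIOM"), 
                palette = "cancers12")

res_rf$testfit_res$metrics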

We can also perform multiclass predictions with all available classes in the data via do_rreg_multi() and do_rf_multi(). The parameters are similar to those of the binary classification functions, but in this case there are no case and control arguments.

res <- do_rreg_multi(wide_data, 
                     metadata, 
                     type = "elnet",
                     palette = "cancers12")

res$auc
#> # A tibble: 12 × 2
#>    Disease   AUC
#>    <chr>   <dbl>
#>  1 AML     0.917
#>  2 BRC     0.694
#>  3 CLL     0.929
#>  4 CRC     0.608
#>  5 CVX     0.738
#>  6 ENDC    0.712
#>  7 GLIOM   0.955
#>  8 LUNGC   0.848
#>  9 LYMPH   0.625
#> 10 MYEL    0.839
#> 11 OVC     0.792
#> 12 PRC     0.357

res$auc_barplot

res$roc_curve

res$var_imp_res$var_imp_plot
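
The random forest counterpart, do_rf_multi(), is called in the same way. A minimal sketch, assuming it returns the same kind of per-class AUC results and plots as do_rreg_multi() (output not shown here):

# hypothetical call mirroring do_rreg_multi(); no case/control arguments are needed
res_rf_multi <- do_rf_multi(wide_data, 
                            metadata, 
                            palette = "cancers12")

res_rf_multi$auc_barplot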

Finally, let’s use plot_features_summary() to summarize our model results. We can create models with different cases and controls and compare them, as we did in the Get Started vignette. However, we can also compare different models. Let’s run two different models for two different cases and summarize them.

ridge_aml <- do_rreg(wide_data, 
                     metadata, 
                     case = "AML", 
                     control = c("CLL", "MYEL", "GLIOM"), 
                     type = "ridge")

rf_aml <- do_rf(wide_data, metadata, case = "AML", control = c("CLL", "MYEL", "GLIOM"))

ridge_gliom <- do_rreg(wide_data, 
                       metadata, 
                       case = "GLIOM", 
                       control = c("CLL", "MYEL", "AML"), 
                       type = "ridge")

rf_gliom <- do_rf(wide_data, metadata, case = "GLIOM", control = c("CLL", "MYEL", "AML"))

res <- plot_features_summary(list("Ridge_AML" = ridge_aml, 
                                  "RF_AML" = rf_aml, 
                                  "Ridge_GLIOM" = ridge_gliom, 
                                  "RF_GLIOM" = rf_gliom))

res$features_barplot

res$upset_plot_features

res$metrics_lineplot

From the last plot, we can quickly see that the ridge regression model works better for glioma classification, while the random forest model works better for AML classification.
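
If we prefer exact numbers to the line plot, the same comparison can be pulled directly from the individual model objects. A minimal sketch, assuming each result stores its test-set AUC under testfit_res$metrics$auc as shown earlier (models is a hypothetical helper list):

# collect the four fitted models in a named list
models <- list("Ridge_AML" = ridge_aml, 
               "RF_AML" = rf_aml, 
               "Ridge_GLIOM" = ridge_gliom, 
               "RF_GLIOM" = rf_gliom)

# extract the test-set AUC of each model into a named vector
sapply(models, function(m) m$testfit_res$metrics$auc)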

📓 Remember that this is a dummy dataset with synthetic values, and the results in this guide should not be interpreted as real findings. The purpose of this vignette is to show you how to use the package and its functions.