Regularized classification model pipeline

do_rreg() runs the regularized classification model pipeline. It splits the data into training and test sets, creates class-balanced case-control groups, and performs hyperparameter optimization. It then fits the best model, evaluates it on the test set, and plots the feature variable importance.

Usage

do_rreg(
  olink_data,
  metadata,
  variable = "Disease",
  case,
  control,
  wide = TRUE,
  strata = TRUE,
  balance_groups = TRUE,
  only_female = NULL,
  only_male = NULL,
  exclude_cols = "Sex",
  ratio = 0.75,
  type = "lasso",
  cor_threshold = 0.9,
  cv_sets = 5,
  grid_size = 10,
  ncores = 4,
  hypopt_vis = TRUE,
  palette = NULL,
  vline = TRUE,
  subtitle = c("accuracy", "sensitivity", "specificity", "auc", "features",
    "top-features", "mixture"),
  varimp_yaxis_names = FALSE,
  nfeatures = 9,
  points = TRUE,
  boxplot_xaxis_names = FALSE,
  seed = 123
)

Arguments

olink_data: Olink data.
metadata: Metadata.
variable: The variable to predict. Default is "Disease".
case: The case group.
control: The control groups.
wide: Whether the data is wide format. Default is TRUE.
strata: Whether to stratify the data. Default is TRUE.
balance_groups: Whether to balance the groups. Default is TRUE.
only_female: Vector of diseases that are female-specific. Default is NULL.
only_male: Vector of diseases that are male-specific. Default is NULL.
exclude_cols: Columns to exclude from the data before the model is tuned. Default is "Sex".
ratio: Proportion of the data assigned to the training set; the remainder forms the test set. Default is 0.75.
type: Type of regularization. Default is "lasso". Other options are "ridge" and "elnet".
cor_threshold: Threshold of absolute correlation values. The minimum number of features is removed so that all remaining pairwise absolute correlations are below this value. Default is 0.9.
cv_sets: Number of cross-validation sets. Default is 5.
grid_size: Size of the hyperparameter optimization grid. Default is 10.
ncores: Number of cores to use for parallel processing. Default is 4.
hypopt_vis: Whether to visualize hyperparameter optimization results. Default is TRUE.
palette: The color palette for the plot. If it is a character, it should be one of the palettes from get_hpa_palettes(). Default is NULL.
vline: Whether to add a vertical line at 50% importance. Default is TRUE.
subtitle: Vector of subtitle elements to include in the plot. Default is a character vector with all available elements.
varimp_yaxis_names: Whether to add y-axis names to the variable importance plot. Default is FALSE.
nfeatures: Number of top features to include in the boxplot. Default is 9.
points: Whether to add points to the boxplot. Default is TRUE.
boxplot_xaxis_names: Whether to add x-axis names to the boxplot. Default is FALSE.
seed: Seed for reproducibility. Default is 123.
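
The three type options can be read as special cases of the elastic-net penalty used by glmnet, the engine that appears in the example output. In parsnip's convention, mixture = 1 is a pure lasso penalty, mixture = 0 is a pure ridge penalty, and values in between blend the two. The base R sketch below only illustrates that relationship; elnet_penalty is a hypothetical helper and not part of the package, and how do_rreg() sets mixture internally for each type is an assumption here.

```r
# Hypothetical helper (not part of the package): the glmnet-style
# elastic-net penalty for a coefficient vector `beta`.
elnet_penalty <- function(beta, penalty, mixture) {
  penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
}

beta <- c(0.5, -1, 0)

lasso_pen <- elnet_penalty(beta, penalty = 0.1, mixture = 1)    # L1 term only
ridge_pen <- elnet_penalty(beta, penalty = 0.1, mixture = 0)    # L2 term only
elnet_pen <- elnet_penalty(beta, penalty = 0.1, mixture = 0.5)  # blend of both
```

With type = "elnet" both penalty and mixture are tuned, which is why the example output reports a selected mixture value.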

Value

A list with results for each disease. The list contains:

hypopt_res: Hyperparameter optimization results.
finalfit_res: Final model fitting results.
testfit_res: Test model fitting results.
var_imp_res: Variable importance results.
boxplot_res: Boxplots of the top features (see nfeatures).

Details

If the data contain missing values, KNN imputation will be applied. To skip the feature correlation filter, set cor_threshold to 1.
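
To make the effect of cor_threshold concrete, here is a minimal base R sketch of a greedy correlation filter. It is a simplified illustration and not the algorithm behind step_corr(), the recipe step shown in the example output; drop_correlated is a hypothetical helper and the toy data are made up.

```r
# Hypothetical helper (not the package's implementation): drop features one
# at a time until every pairwise absolute correlation is below `threshold`.
drop_correlated <- function(x, threshold = 0.9) {
  keep <- colnames(x)
  repeat {
    if (length(keep) < 2) break
    cm <- abs(cor(x[, keep, drop = FALSE], use = "pairwise.complete.obs"))
    diag(cm) <- 0
    if (max(cm, na.rm = TRUE) < threshold) break
    # Among features that take part in an offending pair, drop the one with
    # the largest mean absolute correlation to all other features.
    offending <- which(apply(cm >= threshold, 2, any, na.rm = TRUE))
    worst <- names(which.max(colMeans(cm[, offending, drop = FALSE], na.rm = TRUE)))
    keep <- setdiff(keep, worst)
  }
  keep
}

# Toy data: `b` is almost a copy of `a`, while `c` is independent noise.
set.seed(123)
a <- rnorm(50)
x <- cbind(a = a, b = a + rnorm(50, sd = 0.01), c = rnorm(50))

kept <- drop_correlated(x, threshold = 0.9)    # one of a/b is removed
kept_all <- drop_correlated(x, threshold = 1)  # nothing is removed
```

With a threshold of 1, only perfectly correlated features could be removed, which is why that value effectively switches the filter off.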

Examples

do_rreg(example_data,
        example_metadata,
        case = "AML",
        control = c("CLL", "MYEL"),
        balance_groups = TRUE,
        wide = FALSE,
        type = "elnet",
        palette = "cancers12",
        cv_sets = 5,
        grid_size = 10,
        ncores = 1)
#> Joining with `by = join_by(DAid)`
#> Sets and groups are ready. Model fitting is starting...
#> Classification model for AML as case is starting...
#> $hypopt_res
#> $hypopt_res$elnet_tune
#> # Tuning results
#> # 5-fold cross-validation using stratification 
#> # A tibble: 5 × 5
#>   splits          id    .metrics          .notes           .predictions      
#>   <list>          <chr> <list>            <list>           <list>            
#> 1 <split [59/16]> Fold1 <tibble [10 × 6]> <tibble [0 × 3]> <tibble [160 × 7]>
#> 2 <split [59/16]> Fold2 <tibble [10 × 6]> <tibble [0 × 3]> <tibble [160 × 7]>
#> 3 <split [60/15]> Fold3 <tibble [10 × 6]> <tibble [0 × 3]> <tibble [150 × 7]>
#> 4 <split [61/14]> Fold4 <tibble [10 × 6]> <tibble [0 × 3]> <tibble [140 × 7]>
#> 5 <split [61/14]> Fold5 <tibble [10 × 6]> <tibble [0 × 3]> <tibble [140 × 7]>
#> 
#> $hypopt_res$elnet_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> • step_normalize()
#> • step_nzv()
#> • step_corr()
#> • step_impute_knn()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Main Arguments:
#>   penalty = tune::tune()
#>   mixture = tune::tune()
#> 
#> Computational engine: glmnet 
#> 
#> 
#> $hypopt_res$train_set
#> # A tibble: 75 × 102
#>    DAid    AARSD1   ABL1  ACAA1   ACAN    ACE2  ACOX1    ACP5    ACP6  ACTA2
#>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
#>  1 DA00003  NA    NA     NA      0.989 NA       0.330  1.37   NA      NA    
#>  2 DA00004   3.41  3.38   1.69  NA      1.52   NA      0.841   0.582   1.70 
#>  3 DA00005   5.01  5.05   0.128  0.401 -0.933  -0.584  0.0265  1.16    2.73 
#>  4 DA00007  NA    NA      3.96   0.682  3.14    2.62   1.47    2.25    2.01 
#>  5 DA00008   2.78  0.812 -0.552  0.982 -0.101  -0.304  0.376  -0.826   1.52 
#>  6 DA00009   4.39  3.34  -0.452 -0.868  0.395   1.71   1.49   -0.0285  0.200
#>  7 DA00010   1.83  1.21  -0.912 -1.04  -0.0918 -0.304  1.69    0.0920  2.04 
#>  8 DA00011   3.48  4.96   3.50  -0.338  4.48    1.26   2.18    1.62    1.79 
#>  9 DA00012   4.31  0.710 -1.44  -0.218 -0.469  -0.361 -0.0714 -1.30    2.86 
#> 10 DA00013   1.31  2.52   1.11   0.997  4.56   -1.35   0.833   2.33    3.57 
#> # ℹ 65 more rows
#> # ℹ 92 more variables: ACTN4 <dbl>, ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>,
#> #   ADAM15 <dbl>, ADAM23 <dbl>, ADAM8 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>,
#> #   ADAMTS16 <dbl>, ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>,
#> #   ADGRG1 <dbl>, ADGRG2 <dbl>, ADH4 <dbl>, ADM <dbl>, AGER <dbl>, AGR2 <dbl>,
#> #   AGR3 <dbl>, AGRN <dbl>, AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>,
#> #   AIF1 <dbl>, AIFM1 <dbl>, AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, …
#> 
#> $hypopt_res$test_set
#> # A tibble: 27 × 102
#>    DAid    AARSD1       ABL1  ACAA1    ACAN  ACE2   ACOX1   ACP5  ACP6 ACTA2
#>    <chr>    <dbl>      <dbl>  <dbl>   <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>
#>  1 DA00001   3.39  2.76       1.71   0.0333 1.76  -0.919   1.54  2.15  2.81 
#>  2 DA00002   1.42  1.25      -0.816 -0.459  0.826 -0.902   0.647 1.30  0.798
#>  3 DA00006   6.83  1.18      -1.74  -0.156  1.53  -0.721   0.620 0.527 0.772
#>  4 DA00016   1.79  1.36       0.106 -0.372  3.40  -1.19    1.77  1.07  2.00 
#>  5 DA00022   7.07  5.67       3.68  -0.458  3.09   0.690   0.649 2.17  1.83 
#>  6 DA00023   2.92 -0.0000706  0.602  1.59   0.198  1.61    0.283 2.35  2.11 
#>  7 DA00034   3.45  2.91       1.31   0.423  0.647  1.40    0.691 0.720 1.95 
#>  8 DA00035   4.39  3.31       0.454  0.290  2.68   0.116  -1.32  0.945 2.14 
#>  9 DA00038   2.23  1.42       0.484  1.72   1.46   0.0747  1.82  0.109 4.27 
#> 10 DA00039   4.26  0.572     -1.97  -0.433  0.208  0.790  -0.236 1.52  0.652
#> # ℹ 17 more rows
#> # ℹ 92 more variables: ACTN4 <dbl>, ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>,
#> #   ADAM15 <dbl>, ADAM23 <dbl>, ADAM8 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>,
#> #   ADAMTS16 <dbl>, ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>,
#> #   ADGRG1 <dbl>, ADGRG2 <dbl>, ADH4 <dbl>, ADM <dbl>, AGER <dbl>, AGR2 <dbl>,
#> #   AGR3 <dbl>, AGRN <dbl>, AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>,
#> #   AIF1 <dbl>, AIFM1 <dbl>, AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, …
#> 
#> $hypopt_res$hypopt_vis

#> 
#> 
#> $finalfit_res
#> $finalfit_res$final
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> • step_normalize()
#> • step_nzv()
#> • step_corr()
#> • step_impute_knn()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial",      alpha = ~0.261111111111111) 
#> 
#>     Df  %Dev  Lambda
#> 1    0  0.00 0.97210
#> 2    3  0.87 0.92790
#> 3    3  1.92 0.88570
#> 4    3  2.95 0.84540
#> 5    3  3.96 0.80700
#> 6    4  5.16 0.77030
#> 7    4  6.42 0.73530
#> 8    5  7.66 0.70190
#> 9    8  9.15 0.67000
#> 10  10 11.01 0.63950
#> 11  11 12.97 0.61050
#> 12  12 14.94 0.58270
#> 13  12 16.86 0.55620
#> 14  14 18.80 0.53100
#> 15  14 20.69 0.50680
#> 16  14 22.51 0.48380
#> 17  14 24.27 0.46180
#> 18  15 25.98 0.44080
#> 19  15 27.67 0.42080
#> 20  16 29.31 0.40170
#> 21  18 30.99 0.38340
#> 22  19 32.65 0.36600
#> 23  20 34.29 0.34930
#> 24  20 35.94 0.33350
#> 25  22 37.56 0.31830
#> 26  22 39.18 0.30380
#> 27  24 40.80 0.29000
#> 28  24 42.39 0.27680
#> 29  25 43.95 0.26430
#> 30  26 45.47 0.25230
#> 31  27 46.96 0.24080
#> 32  27 48.40 0.22980
#> 33  28 49.80 0.21940
#> 34  30 51.17 0.20940
#> 35  32 52.53 0.19990
#> 36  35 53.90 0.19080
#> 37  35 55.28 0.18210
#> 38  35 56.62 0.17390
#> 39  36 57.91 0.16600
#> 40  36 59.17 0.15840
#> 41  37 60.38 0.15120
#> 42  37 61.56 0.14430
#> 43  37 62.70 0.13780
#> 44  38 63.82 0.13150
#> 45  40 64.91 0.12550
#> 46  40 66.01 0.11980
#> 
#> ...
#> and 54 more lines.
#> 
#> $finalfit_res$best
#> # A tibble: 1 × 2
#>   penalty mixture
#>     <dbl>   <dbl>
#> 1  0.0774   0.261
#> 
#> $finalfit_res$final_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> • step_normalize()
#> • step_nzv()
#> • step_corr()
#> • step_impute_knn()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Main Arguments:
#>   penalty = 0.0774263682681128
#>   mixture = 0.261111111111111
#> 
#> Computational engine: glmnet 
#> 
#> 
#> 
#> $testfit_res
#> $testfit_res$metrics
#> $testfit_res$metrics$accuracy
#> [1] 0.78
#> 
#> $testfit_res$metrics$sensitivity
#> [1] 0.71
#> 
#> $testfit_res$metrics$specificity
#> [1] 0.85
#> 
#> $testfit_res$metrics$auc
#> [1] 0.85
#> 
#> $testfit_res$metrics$conf_matrix
#>           Truth
#> Prediction  0  1
#>          0 10  2
#>          1  4 11
#> 
#> $testfit_res$metrics$roc_curve

#> 
#> 
#> $testfit_res$mixture
#> [1] 0.2611111
#> 
#> 
#> $var_imp_res
#> $var_imp_res$features
#> # A tibble: 69 × 4
#>    Variable Importance Sign  Scaled_Importance
#>    <fct>         <dbl> <chr>             <dbl>
#>  1 AZU1          0.773 POS               100  
#>  2 ATP5PO        0.724 NEG                93.6
#>  3 ANGPTL2       0.671 POS                86.8
#>  4 ADA           0.669 POS                86.5
#>  5 ANG           0.624 NEG                80.6
#>  6 ARID4B        0.598 NEG                77.3
#>  7 ATOX1         0.567 NEG                73.3
#>  8 ACP6          0.567 NEG                73.3
#>  9 APP           0.527 NEG                68.2
#> 10 ADAM8         0.527 NEG                68.1
#> # ℹ 59 more rows
#> 
#> $var_imp_res$var_imp_plot

#> 
#> 
#> $boxplot_res
#> Warning: Removed 69 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
#> Warning: Removed 13 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
#> Warning: Removed 56 rows containing missing values or values outside the scale range
#> (`geom_point()`).
#> Warning: Removed 13 rows containing missing values or values outside the scale range
#> (`geom_point()`).

#>
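
The test metrics printed under $testfit_res$metrics can be recomputed from the confusion matrix in the same output. The base R sketch below does this by hand. The printed sensitivity and specificity are consistent with treating level 0 as the event class; that is an inference from the numbers, not something this page states.

```r
# Confusion matrix copied from $testfit_res$metrics$conf_matrix above
# (rows are predictions, columns are the true classes).
cm <- matrix(
  c(10, 4,   # Truth = 0: predicted 0, predicted 1
    2, 11),  # Truth = 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1"))
)

# Share of all samples that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Assumption: level "0" is the event class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)
```

Rounded to two decimals, these reproduce the accuracy, sensitivity, and specificity values shown in the output.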