hd_model_lr() runs the logistic regression model pipeline. It creates class-balanced case-control groups for the train set, fits and evaluates the model, and plots the feature importance and model performance.

Usage

hd_model_lr(
  dat,
  variable = "Disease",
  case,
  control = NULL,
  balance_groups = TRUE,
  cor_threshold = 0.9,
  palette = NULL,
  plot_y_labels = TRUE,
  verbose = TRUE,
  plot_title = c("accuracy", "sensitivity", "specificity", "auc", "features",
    "top-features"),
  seed = 123
)

Arguments

dat

An hd_model object or a list containing the train and test data.

variable

The name of the metadata variable containing the case and control groups. Default is "Disease".

case

The case class.

control

The control groups. If NULL, it will be set to all other unique values of the variable that are not the case. Default is NULL.

balance_groups

Whether to balance the case and control groups in the train set. Default is TRUE.

cor_threshold

Threshold for absolute correlation values. The minimum number of features is removed so that all remaining pairwise absolute correlations are below this value. Default is 0.9.

palette

The color palette for the classes. If it is a character, it should be one of the palettes from hd_palettes(). Default is NULL.

plot_y_labels

Whether to show y-axis labels in the feature importance plot. Default is TRUE.

verbose

Whether to print progress messages. Default is TRUE.

plot_title

Vector of title elements to include in the plot. It should be a subset of c("accuracy", "sensitivity", "specificity", "auc", "features", "top-features"). Default is all of them.

seed

Seed for reproducibility. Default is 123.
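
The control default and the balance_groups option can be illustrated with a small base R sketch. The toy `meta` data frame, the sample counts, and the downsampling strategy are assumptions made for illustration; the package may balance the groups differently internally.

```r
# Toy metadata; sample IDs and group sizes are hypothetical.
set.seed(123)
meta <- data.frame(
  DAid = sprintf("S%02d", 1:30),
  Disease = rep(c("AML", "CLL", "MYEL"), times = c(6, 14, 10))
)

case <- "AML"
control <- NULL

# With control = NULL, every other value of the variable acts as a control.
if (is.null(control)) {
  control <- setdiff(unique(meta$Disease), case)
}

# Assumed balancing strategy: downsample controls to the number of cases.
case_rows <- which(meta$Disease == case)
control_rows <- which(meta$Disease %in% control)
keep_controls <- sample(control_rows, size = length(case_rows))

# Recode the outcome as 1 (case) versus 0 (control).
balanced <- meta[c(case_rows, keep_controls), ]
balanced$Class <- ifelse(balanced$Disease == case, 1, 0)
table(balanced$Class)
```

Setting the seed before sampling keeps the selected controls reproducible, which is the role the seed argument plays in the pipeline.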

Value

A model object containing the train and test data, the metrics, the ROC curve, the selected features and their importance.

Details

This model is best suited to data with a small number of features. For high-dimensional data, use hd_model_rreg() instead, as it is more robust. Numeric predictors are normalized and nominal predictors are one-hot encoded. Missing values are imputed with KNN (k = 5). If fewer than 3 features are selected, the feature importance plot is not generated.
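
The preprocessing described above can be approximated with the recipes package. This is a sketch under stated assumptions, not the package's internal code: the toy data and the column `ADA_dup` are hypothetical, and the step order simply mirrors the workflow printed in the Examples output. Note that step_dummy() with its default settings creates one indicator column fewer than the number of levels.

```r
library(recipes)

# Toy data: two classes, numeric predictors, and one nominal predictor.
set.seed(123)
n <- 40
toy <- data.frame(
  Disease = factor(rep(c("0", "1"), each = n / 2)),
  ADA = rnorm(n),
  ACAA1 = rnorm(n),
  Age = round(runif(n, 40, 90)),
  Sex = factor(rep(c("F", "M"), times = n / 2))
)
toy$ADA_dup <- toy$ADA + rnorm(n, sd = 0.01)  # hypothetical near-duplicate of ADA
toy$ACAA1[c(3, 17)] <- NA                     # missing values to impute

rec <- recipe(Disease ~ ., data = toy) |>
  step_dummy(all_nominal_predictors()) |>                   # encode Sex as Sex_M
  step_nzv(all_predictors()) |>                             # drop near-zero-variance columns
  step_normalize(all_numeric_predictors()) |>               # center and scale
  step_corr(all_numeric_predictors(), threshold = 0.9) |>   # plays the role of cor_threshold
  step_impute_knn(all_predictors(), neighbors = 5)          # KNN imputation, k = 5

# Estimate the steps on the toy data and return the processed data.
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

In this sketch only one of `ADA` and `ADA_dup` survives the correlation filter, and the two missing `ACAA1` values are filled in by the imputation step.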

Logistic regression models do not support multiclass classification, so the case argument is always required. If multiclass classification is needed, use hd_model_rreg() instead. This function uses the "glm" engine. Because this is a classification model, a continuous variable cannot be used as the outcome.
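
The "glm" engine corresponds to base R's stats::glm() with a binomial family. The sketch below fits such a model directly on simulated data to show what a binary case-control fit looks like. The data, the coefficient values, and the 0.5 probability cutoff are assumptions for illustration and are not taken from the package.

```r
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)

# Simulate a binary outcome whose log-odds depend on x1 and x2.
p <- plogis(-0.5 + 1.5 * x1 - 1 * x2)
toy <- data.frame(
  Class = factor(rbinom(n, size = 1, prob = p), levels = c(0, 1)),
  x1 = x1,
  x2 = x2
)

# Logistic regression: the second factor level ("1") is modeled as the case.
fit <- glm(Class ~ x1 + x2, data = toy, family = binomial())

# Predicted probability of the case class, then a hard call at 0.5.
prob_case <- predict(fit, newdata = toy, type = "response")
pred <- factor(as.integer(prob_case > 0.5), levels = c(0, 1))

table(Prediction = pred, Truth = toy$Class)
coef(fit)
```

Because the outcome must be a two-level factor, a multiclass or continuous outcome does not fit this setup, which is why the case argument is required.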

Examples

# Initialize an HDAnalyzeR object with only a subset of the predictors
hd_object <- hd_initialize(
  example_data |> dplyr::filter(Assay %in% c("ADA", "AARSD1", "ACAA1", "ACAN1", "ACOX1")),
  example_metadata
)

# Split the data into training and test sets
hd_split <- hd_split_data(
  hd_object,
  metadata_cols = c("Age", "Sex"),  # Include metadata columns
  variable = "Disease"
)
#> Warning: Too little data to stratify.
#>  Resampling will be unstratified.

# Run the logistic regression model pipeline
hd_model_lr(hd_split,
            variable = "Disease",
            case = "AML",
            palette = "cancers12")
#> The groups in the train set are balanced. If you do not want to balance the groups, set `balance_groups = FALSE`.
#> Tuning logistic regression model...
#> Evaluating the model...
#> Generating visualizations...
#> $train_data
#> # A tibble: 76 × 8
#>    DAid    Disease AARSD1  ACAA1  ACOX1    ADA   Age Sex  
#>    <chr>   <fct>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <chr>
#>  1 DA00003 1        NA    NA      0.330  0.952    61 F    
#>  2 DA00004 1         3.41  1.69  NA      2.69     54 M    
#>  3 DA00005 1         5.01  0.128 -0.584  3.75     57 F    
#>  4 DA00006 1         6.83 -1.74  -0.721  2.03     86 M    
#>  5 DA00007 1        NA     3.96   2.62   3.99     85 F    
#>  6 DA00008 1         2.78 -0.552 -0.304  2.83     88 F    
#>  7 DA00010 1         1.83 -0.912 -0.304 -0.448    48 M    
#>  8 DA00011 1         3.48  3.50   1.26   2.42     54 F    
#>  9 DA00012 1         4.31 -1.44  -0.361  0.725    78 F    
#> 10 DA00013 1         1.31  1.11  -1.35   1.13     81 M    
#> # ℹ 66 more rows
#> 
#> $test_data
#> # A tibble: 147 × 8
#>    DAid    Disease AARSD1  ACAA1   ACOX1    ADA   Age Sex  
#>    <chr>   <fct>    <dbl>  <dbl>   <dbl>  <dbl> <dbl> <chr>
#>  1 DA00001 1        3.39   1.71  -0.919  5.39      42 F    
#>  2 DA00002 1        1.42  -0.816 -0.902  0.0114    69 M    
#>  3 DA00009 1        4.39  -0.452  1.71   3.61      80 M    
#>  4 DA00015 1        3.31  NA      0.687  4.11      47 M    
#>  5 DA00017 1        1.46  -2.73   0.0234 1.58      44 M    
#>  6 DA00018 1        2.62   0.537  0.290  1.86      75 M    
#>  7 DA00028 1        2.47  -0.486 NA      3.97      78 F    
#>  8 DA00032 1        3.62  -1.34   1.53   2.96      62 M    
#>  9 DA00035 1        4.39   0.454  0.116  2.82      59 F    
#> 10 DA00044 1        0.964  1.55   0.164  0.836     72 F    
#> # ℹ 137 more rows
#> 
#> $model_type
#> [1] "binary_class"
#> 
#> $final_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 5 Recipe Steps
#> 
#> • step_dummy()
#> • step_nzv()
#> • step_normalize()
#> • step_corr()
#> • step_impute_knn()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm 
#> 
#> 
#> $metrics
#> $metrics$accuracy
#> [1] 0.7142857
#> 
#> $metrics$sensitivity
#> [1] 0.8333333
#> 
#> $metrics$specificity
#> [1] 0.7037037
#> 
#> $metrics$auc
#> [1] 0.8753086
#> 
#> $metrics$confusion_matrix
#>           Truth
#> Prediction  0  1
#>          0 95  2
#>          1 40 10
#> 
#> 
#> $roc_curve

#> 
#> $probability_plot

#> 
#> $features
#> # A tibble: 5 × 4
#>   Feature Importance Sign  Scaled_Importance
#>   <fct>        <dbl> <chr>             <dbl>
#> 1 ADA          3.71  POS              1     
#> 2 Sex_M        2.00  POS              0.502 
#> 3 Age          1.78  NEG              0.440 
#> 4 ACAA1        1.16  NEG              0.259 
#> 5 ACOX1        0.563 NEG              0.0855
#> 
#> $feat_imp_plot

#> 
#> attr(,"class")
#> [1] "hd_model"
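
As a cross-check, the accuracy, sensitivity, and specificity reported above can be recomputed from the printed confusion matrix, treating class 1 as the case. This is plain arithmetic on the printed counts, not the package's internal code.

```r
# Confusion matrix from the output above: rows are predictions, columns are truth.
cm <- matrix(
  c(95, 40, 2, 10),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1"))
)

tp <- cm["1", "1"]  # cases predicted as cases
tn <- cm["0", "0"]  # controls predicted as controls
fp <- cm["1", "0"]  # controls predicted as cases
fn <- cm["0", "1"]  # cases predicted as controls

accuracy <- (tp + tn) / sum(cm)  # 105 / 147
sensitivity <- tp / (tp + fn)    # 10 / 12
specificity <- tn / (tn + fp)    # 95 / 135

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 7)
#>    accuracy sensitivity specificity 
#>   0.7142857   0.8333333   0.7037037 
```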