pkgdown/extra.css

Skip to contents

HDAnalyzeR is an R package developed to streamline and enhance proteomics analysis, particularly for biomarker selection from blood plasma samples. It is developed by and optimized to be used from Human Disease Blood Atlas group internally with Olink proteomics data. This vignette will guide you through the essential steps to use the package, from data loading and quality control to data cleaning, dimensionality reduction, and biomarker identification. Let’s get started by loading the package!

This document introduces you to HDAnalyzeR’s basic set of tools, and shows you how to analyze and identify biomarkers in a dataset of cancer blood plasma samples.

Loading the Data

First, we load the package’s example_data and example_metadata.

head(example_data)
#>      DAid    Sample  OlinkID UniProt  Assay           Panel        NPX
#> 1 DA00001 AML_syn_1 OID21311  Q9BTE6 AARSD1        Oncology  3.3903461
#> 2 DA00001 AML_syn_1 OID21280  P00519   ABL1        Oncology  2.7588517
#> 3 DA00001 AML_syn_1 OID21269  P09110  ACAA1        Oncology  1.7070090
#> 4 DA00001 AML_syn_1 OID20159  P16112   ACAN Cardiometabolic  0.0332709
#> 5 DA00001 AML_syn_1 OID20105  Q9BYF1   ACE2 Cardiometabolic  1.7553590
#> 6 DA00001 AML_syn_1 OID20124  Q15067  ACOX1 Cardiometabolic -0.9192835
#>   Assay_Warning QC_Warning PlateID
#> 1          PASS       PASS  Run001
#> 2          PASS       PASS  Run001
#> 3          PASS       PASS  Run001
#> 4          PASS       PASS  Run001
#> 5          PASS       PASS  Run001
#> 6          PASS       PASS  Run001
head(example_metadata)
#>      DAid    Sample Disease   Stage Grade Sex Age  BMI Cohort
#> 1 DA00001 AML_syn_1     AML       2  <NA>   F  42 22.7   UCAN
#> 2 DA00002 AML_syn_2     AML Unknown  <NA>   M  69 33.1   UCAN
#> 3 DA00003 AML_syn_3     AML       2  <NA>   F  61 26.2   UCAN
#> 4 DA00004 AML_syn_4     AML Unknown  <NA>   M  54 28.1   UCAN
#> 5 DA00005 AML_syn_5     AML       2  <NA>   F  57 21.4   UCAN
#> 6 DA00006 AML_syn_6     AML Unknown  <NA>   M  86 33.9   UCAN

📓 In real-world scenarios, you would load your own data and metadata files instead of using the example dataset. In order to run the package without issues, make sure that your data include the following columns: DAid, Assay and NPX, while your metadata include the following columns: DAid and Disease. Also, if you have a Sex column in your metadata, the data should be encoded as M and F.

Quality Control (QC)

Data QC

qc_summary_data() provides a comprehensive summary of the input dataset. It will check the column types, calculate the percentage of NAs in each column and row and plot their distributions, perform normality tests for all the different Assays, calculate protein-protein correlations, and create a heatmap of these correlations. Users can also specify the threshold for reporting protein-protein correlations.

qc_data <- qc_summary_data(example_data, wide = FALSE, threshold = 0.7)

#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 100"
#> [1] "--------------------------------------"
#> [1] "character : 1"
#> [1] "numeric : 100"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 91 × 2
#>    column   na_percentage
#>    <chr>            <dbl>
#>  1 ACE2               6.1
#>  2 ACTA2              6.1
#>  3 ACTN4              6.1
#>  4 ADAM15             6.1
#>  5 ADAMTS16           6.1
#>  6 ADH4               6.1
#>  7 AKR1C4             6.1
#>  8 AMBN               6.1
#>  9 AMN                6.1
#> 10 AOC1               6.1
#> # ℹ 81 more rows
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 144 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00450          57.4
#>  2 DA00482          53.5
#>  3 DA00542          53.5
#>  4 DA00003          50.5
#>  5 DA00463          46.5
#>  6 DA00116          43.6
#>  7 DA00475          42.6
#>  8 DA00578          42.6
#>  9 DA00443          41.6
#> 10 DA00476          35.6
#> # ℹ 134 more rows
#> [1] "--------------------------------------"
#> [1] "Normality test results:"
#> # A tibble: 100 × 4
#>    Protein    p_value adj.P.Val is_normal
#>    <chr>        <dbl>     <dbl> <lgl>    
#>  1 ARID4B    2.00e-21  1.64e-19 FALSE    
#>  2 ARTN      4.91e-21  1.64e-19 FALSE    
#>  3 ATF2      4.01e-21  1.64e-19 FALSE    
#>  4 AZU1      6.02e-20  1.51e-18 FALSE    
#>  5 APBB1IP   1.64e-16  3.27e-15 FALSE    
#>  6 ADA       2.81e-15  4.69e-14 FALSE    
#>  7 ADCYAP1R1 5.75e-15  8.21e-14 FALSE    
#>  8 AOC1      2.17e-14  2.71e-13 FALSE    
#>  9 AREG      7.47e-14  8.30e-13 FALSE    
#> 10 ADGRG1    1.39e-12  1.39e-11 FALSE    
#> # ℹ 90 more rows
#> [1] "--------------------------------------"
#> [1] "Protein-protein correlations above 0.7:"
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
#> [1] "--------------------------------------"
#> [1] "Correlation heatmap:"
#> [1] "--------------------------------------"

qc_data$na_col_dist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_data$na_row_dist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Metadata QC

qc_summary_metadata() summarizes quality control results of the metadata dataframe. It checks the column types, calculates the percentage of NAs in each column and row exactly as qc_summary_data(), and creates summary visualizations for key metadata variables such as Sex, Age, and BMI.

qc_metadata <- qc_summary_metadata(example_metadata, disease_palette = "cancers12")
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 8"
#> [1] "--------------------------------------"
#> [1] "character : 7"
#> [1] "numeric : 2"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 1 × 2
#>   column na_percentage
#>   <chr>          <dbl>
#> 1 Grade           91.5
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 536 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00001          11.1
#>  2 DA00002          11.1
#>  3 DA00003          11.1
#>  4 DA00004          11.1
#>  5 DA00005          11.1
#>  6 DA00006          11.1
#>  7 DA00007          11.1
#>  8 DA00008          11.1
#>  9 DA00009          11.1
#> 10 DA00010          11.1
#> # ℹ 526 more rows
#> [1] "--------------------------------------"
#> Sex contains:
#> # A tibble: 19 × 3
#>    Disease Sex       n
#>    <chr>   <chr> <int>
#>  1 AML     F        23
#>  2 AML     M        27
#>  3 BRC     F        50
#>  4 CLL     F        21
#>  5 CLL     M        27
#>  6 CRC     F        28
#>  7 CRC     M        22
#>  8 CVX     F        50
#>  9 ENDC    F        50
#> 10 GLIOM   F        24
#> 11 GLIOM   M        26
#> 12 LUNGC   F        33
#> 13 LUNGC   M        17
#> 14 LYMPH   F        22
#> 15 LYMPH   M        28
#> 16 MYEL    F        15
#> 17 MYEL    M        23
#> 18 OVC     F        50
#> 19 PRC     M        50
qc_metadata$barplot_Sex

qc_metadata$distplot_Age
#> Picking joint bandwidth of 6.06

Data Cleaning

Data Cleaning

As we saw from the QC results, the data contains NAs and other issues that need to be addressed. clean_data() preprocesses the dataset by filtering out rows based on specified criteria. In this case we will keep only the data which Assay_Warning is “PASS” and only DAid, Assay and NPX columns.

clean_data <- clean_data(example_data, 
                         keep_cols = c("DAid", "Assay", "NPX"),
                         filter_assay_warning = TRUE)
head(clean_data)
#> # A tibble: 6 × 3
#>   DAid    Assay      NPX
#>   <chr>   <chr>    <dbl>
#> 1 DA00001 AARSD1  3.39  
#> 2 DA00001 ABL1    2.76  
#> 3 DA00001 ACAA1   1.71  
#> 4 DA00001 ACAN    0.0333
#> 5 DA00001 ACE2    1.76  
#> 6 DA00001 ACOX1  -0.919

Metadata Cleaning

In our case, clean_metadata() preprocesses the metadata just by keeping only the specified columns.

clean_metadata <- clean_metadata(example_metadata, 
                                 keep_cols = c("DAid", "Disease", "Sex", "Age"))
head(clean_metadata)
#> # A tibble: 6 × 4
#>   DAid    Disease Sex     Age
#>   <chr>   <chr>   <chr> <dbl>
#> 1 DA00001 AML     F        42
#> 2 DA00002 AML     M        69
#> 3 DA00003 AML     F        61
#> 4 DA00004 AML     M        54
#> 5 DA00005 AML     F        57
#> 6 DA00006 AML     M        86

Data Transformation

Once the data is cleaned, we recommend transforming it into a tidy format (wide format) if it’s not already in that form. widen_data() will create for us the wide Olink dataset.

wide_data <- widen_data(clean_data)
head(wide_data)
#> # A tibble: 6 × 101
#>   DAid    AARSD1  ABL1  ACAA1    ACAN   ACE2  ACOX1   ACP5   ACP6  ACTA2   ACTN4
#>   <chr>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1 DA00001   3.39  2.76  1.71   0.0333  1.76  -0.919 1.54    2.15   2.81   0.742 
#> 2 DA00002   1.42  1.25 NA     -0.459   0.826 -0.902 0.647   1.30   0.798 -0.0659
#> 3 DA00003  NA    NA    NA      0.989  NA      0.330 1.37   NA     NA     NA     
#> 4 DA00004   3.41  3.38  1.69  NA       1.52  NA     0.841   0.582  1.70   0.108 
#> 5 DA00005   5.01  5.05  0.128  0.401  -0.933 -0.584 0.0265  1.16   2.73   0.350 
#> 6 DA00006   6.83  1.18 -1.74  -0.156   1.53  -0.721 0.620   0.527  0.772 NA     
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> #   ADAM23 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> #   ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> #   ADGRG2 <dbl>, ADH4 <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>, AGRN <dbl>,
#> #   AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>, AIFM1 <dbl>,
#> #   AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>, AKT3 <dbl>,
#> #   ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, AMBN <dbl>, …

📓 While HDAnalyzeR can work with long format data, most functions will transform it into a wide format, which can slightly slow down the pipeline. Thus, starting with tidy data is advisable.

Imputation and Dimensionality Reduction

Next, we will impute missing values using K-nearest neighbors (KNN) with 3 neighbors via impute_knn().

imputed_data <- impute_knn(wide_data, 
                           k = 3,
                           exclude_cols = c("DAid"),
                           show_na_percentage = FALSE)
head(imputed_data)
#> # A tibble: 6 × 101
#>   DAid    AARSD1  ABL1   ACAA1    ACAN    ACE2  ACOX1   ACP5  ACP6 ACTA2   ACTN4
#>   <chr>    <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 DA00001   3.39  2.76  1.71    0.0333  1.76   -0.919 1.54   2.15  2.81   0.742 
#> 2 DA00002   1.42  1.25 -0.0721 -0.459   0.826  -0.902 0.647  1.30  0.798 -0.0659
#> 3 DA00003   3.80  2.82  2.35    0.989  -0.0218  0.330 1.37   0.561 1.34   0.737 
#> 4 DA00004   3.41  3.38  1.69    0.262   1.52    1.86  0.841  0.582 1.70   0.108 
#> 5 DA00005   5.01  5.05  0.128   0.401  -0.933  -0.584 0.0265 1.16  2.73   0.350 
#> 6 DA00006   6.83  1.18 -1.74   -0.156   1.53   -0.721 0.620  0.527 0.772  0.229 
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> #   ADAM23 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> #   ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> #   ADGRG2 <dbl>, ADH4 <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>, AGRN <dbl>,
#> #   AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>, AIFM1 <dbl>,
#> #   AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>, AKT3 <dbl>,
#> #   ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, AMBN <dbl>, …

After imputation, we will run Principal Component Analysis (PCA) via do_pca() and Uniform Manifold Approximation and Projection (UMAP) via do_umap() to check for outliers, batch effects, and other potential issues.

pca_res <- do_pca(imputed_data, 
                  clean_metadata,
                  color = "Sex",
                  palette = "sex_hpa",
                  impute = FALSE,
                  pcs = 6)

pca_res$pca_plot

pca_res$loadings_plot

pca_res$variance_plot

umap_res <- do_umap(imputed_data, 
                    clean_metadata,
                    color = "Disease",
                    palette = "cancers12",
                    impute = FALSE)

umap_res$umap_plot

We will perform another QC check to ensure that everything is as expected after cleaning and imputing the data. We will not showcase the output again for brevity.

qc_data <- qc_summary_data(imputed_data, wide = TRUE, threshold = 0.7, report = FALSE)

qc_metadata <- qc_summary_metadata(clean_metadata, disease_palette = "cancers12", report = FALSE)

Biomarker Identification

Differential Expression Analysis

We will run a differential expression analysis to identify potential biomarkers. We will use do_limma() so that we will be able to correct also for Age. This method will help us pinpoint proteins that are significantly different between conditions. We will present only the results for AML (Acute Myeloid Leukemia).

de_res_aml <- do_limma(imputed_data, 
                       clean_metadata,
                       case = "AML",
                       control = c("BRC", "PRC"),
                       correct = c("Sex", "Age"),
                       correct_type = c("factor", "numeric"),
                       only_female = "BRC",
                       only_male = "PRC")
#> Comparing AML with BRC, PRC.
#> Warning in do_limma_de(join_data, variable, case, control, correct,
#> correct_type, : 436 rows were removed because they contain NAs in Disease or
#> Sex, Age!

de_res_aml$de_results
#> # A tibble: 100 × 11
#>    Assay    logFC   CI.L   CI.R AveExpr     t  P.Value adj.P.Val     B Disease
#>    <chr>    <dbl>  <dbl>  <dbl>   <dbl> <dbl>    <dbl>     <dbl> <dbl> <chr>  
#>  1 ADA      1.50   1.20   1.81  1.25     9.75 7.95e-18  7.95e-16 29.8  AML    
#>  2 APEX1    1.84   1.43   2.24  0.583    8.97 8.98e-16  4.49e-14 25.2  AML    
#>  3 AZU1     1.69   1.29   2.08  0.537    8.46 1.81e-14  6.04e-13 22.2  AML    
#>  4 APBB1IP  1.19   0.840  1.53  0.00561  6.77 2.54e-10  6.35e- 9 12.8  AML    
#>  5 ANGPT1  -1.66  -2.18  -1.13  1.38    -6.23 4.11e- 9  8.22e- 8 10.1  AML    
#>  6 ARTN     0.921  0.615  1.23  0.648    5.95 1.74e- 8  2.89e- 7  8.67 AML    
#>  7 ACTA2    0.711  0.458  0.964 1.52     5.55 1.21e- 7  1.73e- 6  6.77 AML    
#>  8 ADGRG1   1.30   0.827  1.77  1.80     5.44 2.08e- 7  2.60e- 6  6.25 AML    
#>  9 ADAMTS8 -0.809 -1.11  -0.506 0.420   -5.27 4.55e- 7  5.06e- 6  5.47 AML    
#> 10 ANGPT2   0.741  0.459  1.02  1.06     5.18 6.73e- 7  6.73e- 6  5.10 AML    
#> # ℹ 90 more rows
#> # ℹ 1 more variable: sig <chr>

de_res_aml$volcano_plot
#> Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

de_res_brc <- do_limma(imputed_data, 
                       clean_metadata,
                       case = "BRC",
                       control = c("AML", "PRC"),
                       correct = c("Sex", "Age"),
                       correct_type = c("factor", "numeric"),
                       only_female = "BRC",
                       only_male = "PRC")
de_res_prc <- do_limma(imputed_data, 
                       clean_metadata,
                       case = "PRC",
                       control = c("BRC", "AML"),
                       correct = c("Sex", "Age"),
                       correct_type = c("factor", "numeric"),
                       only_female = "BRC",
                       only_male = "PRC")

We can also summarize the results via plot_de_summary(). In order to use this function we need to store the results in a list.

de_res <- list("AML" = de_res_aml, 
               "BRC" = de_res_brc, 
               "PRC" = de_res_prc)

de_summary <- plot_de_summary(de_res, disease_palette = "cancers12")
de_summary$de_barplot

de_summary$upset_plot_up

de_summary$upset_plot_down

Lasso Machine Learning Classification Model

In addition to differential expression analysis, we will use a Lasso machine learning classification model to identify significant features. This model will help us understand which proteins are most predictive of the conditions being studied. Once again, we will present only the results for AML.

lasso_res_aml <- do_rreg(imputed_data, 
                         clean_metadata, 
                         case = "AML",
                         control = c("BRC", "PRC"),
                         only_female = "BRC",
                         only_male = "PRC",
                         exclude_cols = c("Sex", "Age"),
                         type = "lasso",
                         palette = "cancers12",
                         subtitle = c("accuracy", 
                                      "sensitivity", 
                                      "specificity", 
                                      "auc", 
                                      "features",
                                      "top-features"),
                         nfeatures = 12,
                         points = FALSE)
#> Joining with `by = join_by(DAid)`
#> Sets and groups are ready. Model fitting is starting...
#> Classification model for AML as case is starting...

lasso_res_aml$hypopt_res$hypopt_vis


lasso_res_aml$testfit_res$metrics
#> $accuracy
#> [1] 0.74
#> 
#> $sensitivity
#> [1] 0.71
#> 
#> $specificity
#> [1] 0.77
#> 
#> $auc
#> [1] 0.87
#> 
#> $conf_matrix
#>           Truth
#> Prediction  0  1
#>          0 10  3
#>          1  4 10
#> 
#> $roc_curve


lasso_res_aml$var_imp_res$features
#> # A tibble: 23 × 4
#>    Variable Importance Sign  Scaled_Importance
#>    <fct>         <dbl> <chr>             <dbl>
#>  1 ADGRG1        2.23  POS               100  
#>  2 APEX1         1.88  POS                84.3
#>  3 ABL1          1.26  POS                56.8
#>  4 ANGPT1        1.15  NEG                51.6
#>  5 ALDH1A1       1.11  NEG                49.7
#>  6 ANXA11        1.07  NEG                48.3
#>  7 ADM           1.04  POS                46.8
#>  8 ADGRG2        0.988 NEG                44.4
#>  9 AGR3          0.890 NEG                40.0
#> 10 AHCY          0.654 POS                29.4
#> # ℹ 13 more rows

lasso_res_aml$var_imp_res$var_imp_plot


lasso_res_aml$boxplot_res

lasso_res_brc <- do_rreg(imputed_data, 
                         clean_metadata, 
                         case = "BRC",
                         control = c("AML", "PRC"),
                         only_female = "BRC",
                         only_male = "PRC",
                         exclude_cols = c("Sex", "Age"),
                         type = "lasso",
                         palette = "cancers12",
                         subtitle = c("accuracy", 
                                      "sensitivity", 
                                      "specificity", 
                                      "auc", 
                                      "features",
                                      "top-features"),
                         nfeatures = 12,
                         points = FALSE)
lasso_res_prc <- do_rreg(imputed_data, 
                         clean_metadata, 
                         case = "PRC",
                         control = c("BRC", "AML"),
                         only_female = "BRC",
                         only_male = "PRC",
                         exclude_cols = c("Sex", "Age"),
                         type = "lasso",
                         palette = "cancers12",
                         subtitle = c("accuracy", 
                                      "sensitivity", 
                                      "specificity", 
                                      "auc", 
                                      "features",
                                      "top-features"),
                         nfeatures = 12,
                         points = FALSE)

We can get a visual summary of the results via plot_features_summary() too. In order to use this function we need to store the results in a list.

lasso_res <- list("AML" = lasso_res_aml, 
                  "BRC" = lasso_res_brc, 
                  "PRC" = lasso_res_prc)

features_summary <- plot_features_summary(lasso_res, case_palette = "cancers12")
features_summary$metrics_lineplot

features_summary$features_barplot

features_summary$upset_plot_features

One step further

The final step involves performing an Over-Representation Analysis for the up-regulated proteins from differential expression, that are also identified as features by the ML model. In this example, we will use the Gene Ontology (GO) database and show the results only for AML.

# Extract the proteins identified by both DE and Lasso
de_proteins <- de_res_aml$de_results |> 
  dplyr::filter(sig == "significant up") |> 
  dplyr::pull(Assay)

lasso_proteins <- lasso_res_aml$var_imp_res$features |> 
  dplyr::filter(Scaled_Importance > 0) |> 
  dplyr::pull(Variable)

intersect_proteins <- intersect(de_proteins, lasso_proteins)

# Perform ORA with GO database and visualize results
enrichment <- do_ora(intersect_proteins, database = "GO")
#> No background provided. When working with Olink data it is recommended to use background.
#> 
#> 
#> 'select()' returned 1:1 mapping between keys and columns
plot_ora(enrichment, intersect_proteins, ncateg = 5)
#> 'select()' returned 1:1 mapping between keys and columns
#> $dotplot

#> 
#> $barplot

#> 
#> $cnetplot

📓 Remember that these data are a dummy-dataset with fake data and the results in this guide should not be interpreted as real results. The purpose of this vignette is to show you how to use the package and its functions.