HDAnalyzeR is an R package developed to streamline and enhance proteomics analysis, particularly for biomarker selection from blood plasma samples. It is developed by and optimized to be used from Human Disease Blood Atlas group internally with Olink proteomics data. This vignette will guide you through the essential steps to use the package, from data loading and quality control to data cleaning, dimensionality reduction, and biomarker identification. Let’s get started by loading the package!
This document introduces you to HDAnalyzeR’s basic set of tools, and shows you how to analyze and identify biomarkers in a dataset of cancer blood plasma samples.
Loading the Data
First, we load the package’s example_data
and
example_metadata
.
head(example_data)
#> DAid Sample OlinkID UniProt Assay Panel NPX
#> 1 DA00001 AML_syn_1 OID21311 Q9BTE6 AARSD1 Oncology 3.3903461
#> 2 DA00001 AML_syn_1 OID21280 P00519 ABL1 Oncology 2.7588517
#> 3 DA00001 AML_syn_1 OID21269 P09110 ACAA1 Oncology 1.7070090
#> 4 DA00001 AML_syn_1 OID20159 P16112 ACAN Cardiometabolic 0.0332709
#> 5 DA00001 AML_syn_1 OID20105 Q9BYF1 ACE2 Cardiometabolic 1.7553590
#> 6 DA00001 AML_syn_1 OID20124 Q15067 ACOX1 Cardiometabolic -0.9192835
#> Assay_Warning QC_Warning PlateID
#> 1 PASS PASS Run001
#> 2 PASS PASS Run001
#> 3 PASS PASS Run001
#> 4 PASS PASS Run001
#> 5 PASS PASS Run001
#> 6 PASS PASS Run001
head(example_metadata)
#> DAid Sample Disease Stage Grade Sex Age BMI Cohort
#> 1 DA00001 AML_syn_1 AML 2 <NA> F 42 22.7 UCAN
#> 2 DA00002 AML_syn_2 AML Unknown <NA> M 69 33.1 UCAN
#> 3 DA00003 AML_syn_3 AML 2 <NA> F 61 26.2 UCAN
#> 4 DA00004 AML_syn_4 AML Unknown <NA> M 54 28.1 UCAN
#> 5 DA00005 AML_syn_5 AML 2 <NA> F 57 21.4 UCAN
#> 6 DA00006 AML_syn_6 AML Unknown <NA> M 86 33.9 UCAN
📓 In real-world scenarios, you would load your own data and metadata files instead of using the example dataset. In order to run the package without issues, make sure that your data include the following columns: DAid, Assay and NPX, while your metadata include the following columns: DAid and Disease. Also, if you have a Sex column in your metadata, the data should be encoded as M and F.
Quality Control (QC)
Data QC
qc_summary_data()
provides a comprehensive summary of
the input dataset. It will check the column types, calculate the
percentage of NAs in each column and row and plot their distributions,
perform normality tests for all the different Assays, calculate
protein-protein correlations, and create a heatmap of these
correlations. Users can also specify the threshold for reporting
protein-protein correlations.
qc_data <- qc_summary_data(example_data, wide = FALSE, threshold = 0.7)
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 100"
#> [1] "--------------------------------------"
#> [1] "character : 1"
#> [1] "numeric : 100"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 91 × 2
#> column na_percentage
#> <chr> <dbl>
#> 1 ACE2 6.1
#> 2 ACTA2 6.1
#> 3 ACTN4 6.1
#> 4 ADAM15 6.1
#> 5 ADAMTS16 6.1
#> 6 ADH4 6.1
#> 7 AKR1C4 6.1
#> 8 AMBN 6.1
#> 9 AMN 6.1
#> 10 AOC1 6.1
#> # ℹ 81 more rows
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 144 × 2
#> DAid na_percentage
#> <chr> <dbl>
#> 1 DA00450 57.4
#> 2 DA00482 53.5
#> 3 DA00542 53.5
#> 4 DA00003 50.5
#> 5 DA00463 46.5
#> 6 DA00116 43.6
#> 7 DA00475 42.6
#> 8 DA00578 42.6
#> 9 DA00443 41.6
#> 10 DA00476 35.6
#> # ℹ 134 more rows
#> [1] "--------------------------------------"
#> [1] "Normality test results:"
#> # A tibble: 100 × 4
#> Protein p_value adj.P.Val is_normal
#> <chr> <dbl> <dbl> <lgl>
#> 1 ARID4B 2.00e-21 1.64e-19 FALSE
#> 2 ARTN 4.91e-21 1.64e-19 FALSE
#> 3 ATF2 4.01e-21 1.64e-19 FALSE
#> 4 AZU1 6.02e-20 1.51e-18 FALSE
#> 5 APBB1IP 1.64e-16 3.27e-15 FALSE
#> 6 ADA 2.81e-15 4.69e-14 FALSE
#> 7 ADCYAP1R1 5.75e-15 8.21e-14 FALSE
#> 8 AOC1 2.17e-14 2.71e-13 FALSE
#> 9 AREG 7.47e-14 8.30e-13 FALSE
#> 10 ADGRG1 1.39e-12 1.39e-11 FALSE
#> # ℹ 90 more rows
#> [1] "--------------------------------------"
#> [1] "Protein-protein correlations above 0.7:"
#> Protein1 Protein2 Correlation
#> 1 ATP5IF1 AIFM1 0.76
#> 2 AXIN1 ARHGEF12 0.76
#> 3 AIFM1 ATP5IF1 0.76
#> 4 ARHGEF12 AXIN1 0.76
#> 5 ARHGEF12 AIFM1 0.71
#> 6 AIFM1 ARHGEF12 0.71
#> [1] "--------------------------------------"
#> [1] "Correlation heatmap:"
#> [1] "--------------------------------------"
qc_data$na_col_dist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qc_data$na_row_dist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Metadata QC
qc_summary_metadata()
summarizes quality control results
of the metadata dataframe. It checks the column types, calculates the
percentage of NAs in each column and row exactly as
qc_summary_data()
, and creates summary visualizations for
key metadata variables such as Sex, Age, and BMI.
qc_metadata <- qc_summary_metadata(example_metadata, disease_palette = "cancers12")
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 8"
#> [1] "--------------------------------------"
#> [1] "character : 7"
#> [1] "numeric : 2"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 1 × 2
#> column na_percentage
#> <chr> <dbl>
#> 1 Grade 91.5
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 536 × 2
#> DAid na_percentage
#> <chr> <dbl>
#> 1 DA00001 11.1
#> 2 DA00002 11.1
#> 3 DA00003 11.1
#> 4 DA00004 11.1
#> 5 DA00005 11.1
#> 6 DA00006 11.1
#> 7 DA00007 11.1
#> 8 DA00008 11.1
#> 9 DA00009 11.1
#> 10 DA00010 11.1
#> # ℹ 526 more rows
#> [1] "--------------------------------------"
#> Sex contains:
#> # A tibble: 19 × 3
#> Disease Sex n
#> <chr> <chr> <int>
#> 1 AML F 23
#> 2 AML M 27
#> 3 BRC F 50
#> 4 CLL F 21
#> 5 CLL M 27
#> 6 CRC F 28
#> 7 CRC M 22
#> 8 CVX F 50
#> 9 ENDC F 50
#> 10 GLIOM F 24
#> 11 GLIOM M 26
#> 12 LUNGC F 33
#> 13 LUNGC M 17
#> 14 LYMPH F 22
#> 15 LYMPH M 28
#> 16 MYEL F 15
#> 17 MYEL M 23
#> 18 OVC F 50
#> 19 PRC M 50
qc_metadata$barplot_Sex
qc_metadata$distplot_Age
#> Picking joint bandwidth of 6.06
Data Cleaning
Data Cleaning
As we saw from the QC results, the data contains NAs and other issues
that need to be addressed. clean_data()
preprocesses the
dataset by filtering out rows based on specified criteria. In this case
we will keep only the data which Assay_Warning is “PASS” and only DAid,
Assay and NPX columns.
clean_data <- clean_data(example_data,
keep_cols = c("DAid", "Assay", "NPX"),
filter_assay_warning = TRUE)
head(clean_data)
#> # A tibble: 6 × 3
#> DAid Assay NPX
#> <chr> <chr> <dbl>
#> 1 DA00001 AARSD1 3.39
#> 2 DA00001 ABL1 2.76
#> 3 DA00001 ACAA1 1.71
#> 4 DA00001 ACAN 0.0333
#> 5 DA00001 ACE2 1.76
#> 6 DA00001 ACOX1 -0.919
Metadata Cleaning
In our case, clean_metadata()
preprocesses the metadata
just by keeping only the specified columns.
clean_metadata <- clean_metadata(example_metadata,
keep_cols = c("DAid", "Disease", "Sex", "Age"))
head(clean_metadata)
#> # A tibble: 6 × 4
#> DAid Disease Sex Age
#> <chr> <chr> <chr> <dbl>
#> 1 DA00001 AML F 42
#> 2 DA00002 AML M 69
#> 3 DA00003 AML F 61
#> 4 DA00004 AML M 54
#> 5 DA00005 AML F 57
#> 6 DA00006 AML M 86
Data Transformation
Once the data is cleaned, we recommend transforming it into a tidy
format (wide format) if it’s not already in that form.
widen_data()
will create for us the wide Olink dataset.
wide_data <- widen_data(clean_data)
head(wide_data)
#> # A tibble: 6 × 101
#> DAid AARSD1 ABL1 ACAA1 ACAN ACE2 ACOX1 ACP5 ACP6 ACTA2 ACTN4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 DA00001 3.39 2.76 1.71 0.0333 1.76 -0.919 1.54 2.15 2.81 0.742
#> 2 DA00002 1.42 1.25 NA -0.459 0.826 -0.902 0.647 1.30 0.798 -0.0659
#> 3 DA00003 NA NA NA 0.989 NA 0.330 1.37 NA NA NA
#> 4 DA00004 3.41 3.38 1.69 NA 1.52 NA 0.841 0.582 1.70 0.108
#> 5 DA00005 5.01 5.05 0.128 0.401 -0.933 -0.584 0.0265 1.16 2.73 0.350
#> 6 DA00006 6.83 1.18 -1.74 -0.156 1.53 -0.721 0.620 0.527 0.772 NA
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> # ADAM23 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> # ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> # ADGRG2 <dbl>, ADH4 <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>, AGRN <dbl>,
#> # AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>, AIFM1 <dbl>,
#> # AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>, AKT3 <dbl>,
#> # ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, AMBN <dbl>, …
📓 While HDAnalyzeR can work with long format data, most functions will transform it into a wide format, which can slightly slow down the pipeline. Thus, starting with tidy data is advisable.
Imputation and Dimensionality Reduction
Next, we will impute missing values using K-nearest neighbors (KNN)
with 3 neighbors via impute_knn()
.
imputed_data <- impute_knn(wide_data,
k = 3,
exclude_cols = c("DAid"),
show_na_percentage = FALSE)
head(imputed_data)
#> # A tibble: 6 × 101
#> DAid AARSD1 ABL1 ACAA1 ACAN ACE2 ACOX1 ACP5 ACP6 ACTA2 ACTN4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 DA00001 3.39 2.76 1.71 0.0333 1.76 -0.919 1.54 2.15 2.81 0.742
#> 2 DA00002 1.42 1.25 -0.0721 -0.459 0.826 -0.902 0.647 1.30 0.798 -0.0659
#> 3 DA00003 3.80 2.82 2.35 0.989 -0.0218 0.330 1.37 0.561 1.34 0.737
#> 4 DA00004 3.41 3.38 1.69 0.262 1.52 1.86 0.841 0.582 1.70 0.108
#> 5 DA00005 5.01 5.05 0.128 0.401 -0.933 -0.584 0.0265 1.16 2.73 0.350
#> 6 DA00006 6.83 1.18 -1.74 -0.156 1.53 -0.721 0.620 0.527 0.772 0.229
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> # ADAM23 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> # ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> # ADGRG2 <dbl>, ADH4 <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>, AGRN <dbl>,
#> # AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>, AIFM1 <dbl>,
#> # AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>, AKT3 <dbl>,
#> # ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, AMBN <dbl>, …
After imputation, we will run Principal Component Analysis (PCA) via
do_pca()
and Uniform Manifold Approximation and Projection
(UMAP) via do_umap()
to check for outliers, batch effects,
and other potential issues.
pca_res <- do_pca(imputed_data,
clean_metadata,
color = "Sex",
palette = "sex_hpa",
impute = FALSE,
pcs = 6)
pca_res$pca_plot
pca_res$loadings_plot
pca_res$variance_plot
umap_res <- do_umap(imputed_data,
clean_metadata,
color = "Disease",
palette = "cancers12",
impute = FALSE)
umap_res$umap_plot
We will perform another QC check to ensure that everything is as expected after cleaning and imputing the data. We will not showcase the output again for brevity.
qc_data <- qc_summary_data(imputed_data, wide = TRUE, threshold = 0.7, report = FALSE)
qc_metadata <- qc_summary_metadata(clean_metadata, disease_palette = "cancers12", report = FALSE)
Biomarker Identification
Differential Expression Analysis
We will run a differential expression analysis to identify potential
biomarkers. We will use do_limma()
so that we will be able
to correct also for Age. This method will help us pinpoint proteins that
are significantly different between conditions. We will present only the
results for AML (Acute Myeloid Leukemia).
de_res_aml <- do_limma(imputed_data,
clean_metadata,
case = "AML",
control = c("BRC", "PRC"),
correct = c("Sex", "Age"),
correct_type = c("factor", "numeric"),
only_female = "BRC",
only_male = "PRC")
#> Comparing AML with BRC, PRC.
#> Warning in do_limma_de(join_data, variable, case, control, correct,
#> correct_type, : 436 rows were removed because they contain NAs in Disease or
#> Sex, Age!
de_res_aml$de_results
#> # A tibble: 100 × 11
#> Assay logFC CI.L CI.R AveExpr t P.Value adj.P.Val B Disease
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 ADA 1.50 1.20 1.81 1.25 9.75 7.95e-18 7.95e-16 29.8 AML
#> 2 APEX1 1.84 1.43 2.24 0.583 8.97 8.98e-16 4.49e-14 25.2 AML
#> 3 AZU1 1.69 1.29 2.08 0.537 8.46 1.81e-14 6.04e-13 22.2 AML
#> 4 APBB1IP 1.19 0.840 1.53 0.00561 6.77 2.54e-10 6.35e- 9 12.8 AML
#> 5 ANGPT1 -1.66 -2.18 -1.13 1.38 -6.23 4.11e- 9 8.22e- 8 10.1 AML
#> 6 ARTN 0.921 0.615 1.23 0.648 5.95 1.74e- 8 2.89e- 7 8.67 AML
#> 7 ACTA2 0.711 0.458 0.964 1.52 5.55 1.21e- 7 1.73e- 6 6.77 AML
#> 8 ADGRG1 1.30 0.827 1.77 1.80 5.44 2.08e- 7 2.60e- 6 6.25 AML
#> 9 ADAMTS8 -0.809 -1.11 -0.506 0.420 -5.27 4.55e- 7 5.06e- 6 5.47 AML
#> 10 ANGPT2 0.741 0.459 1.02 1.06 5.18 6.73e- 7 6.73e- 6 5.10 AML
#> # ℹ 90 more rows
#> # ℹ 1 more variable: sig <chr>
de_res_aml$volcano_plot
#> Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
de_res_brc <- do_limma(imputed_data,
clean_metadata,
case = "BRC",
control = c("AML", "PRC"),
correct = c("Sex", "Age"),
correct_type = c("factor", "numeric"),
only_female = "BRC",
only_male = "PRC")
de_res_prc <- do_limma(imputed_data,
clean_metadata,
case = "PRC",
control = c("BRC", "AML"),
correct = c("Sex", "Age"),
correct_type = c("factor", "numeric"),
only_female = "BRC",
only_male = "PRC")
We can also summarize the results via plot_de_summary()
.
In order to use this function we need to store the results in a
list.
de_res <- list("AML" = de_res_aml,
"BRC" = de_res_brc,
"PRC" = de_res_prc)
de_summary <- plot_de_summary(de_res, disease_palette = "cancers12")
de_summary$de_barplot
de_summary$upset_plot_up
de_summary$upset_plot_down
Lasso Machine Learning Classification Model
In addition to differential expression analysis, we will use a Lasso machine learning classification model to identify significant features. This model will help us understand which proteins are most predictive of the conditions being studied. Once again, we will present only the results for AML.
lasso_res_aml <- do_rreg(imputed_data,
clean_metadata,
case = "AML",
control = c("BRC", "PRC"),
only_female = "BRC",
only_male = "PRC",
exclude_cols = c("Sex", "Age"),
type = "lasso",
palette = "cancers12",
subtitle = c("accuracy",
"sensitivity",
"specificity",
"auc",
"features",
"top-features"),
nfeatures = 12,
points = FALSE)
#> Joining with `by = join_by(DAid)`
#> Sets and groups are ready. Model fitting is starting...
#> Classification model for AML as case is starting...
lasso_res_aml$hypopt_res$hypopt_vis
lasso_res_aml$testfit_res$metrics
#> $accuracy
#> [1] 0.74
#>
#> $sensitivity
#> [1] 0.71
#>
#> $specificity
#> [1] 0.77
#>
#> $auc
#> [1] 0.87
#>
#> $conf_matrix
#> Truth
#> Prediction 0 1
#> 0 10 3
#> 1 4 10
#>
#> $roc_curve
lasso_res_aml$var_imp_res$features
#> # A tibble: 23 × 4
#> Variable Importance Sign Scaled_Importance
#> <fct> <dbl> <chr> <dbl>
#> 1 ADGRG1 2.23 POS 100
#> 2 APEX1 1.88 POS 84.3
#> 3 ABL1 1.26 POS 56.8
#> 4 ANGPT1 1.15 NEG 51.6
#> 5 ALDH1A1 1.11 NEG 49.7
#> 6 ANXA11 1.07 NEG 48.3
#> 7 ADM 1.04 POS 46.8
#> 8 ADGRG2 0.988 NEG 44.4
#> 9 AGR3 0.890 NEG 40.0
#> 10 AHCY 0.654 POS 29.4
#> # ℹ 13 more rows
lasso_res_aml$var_imp_res$var_imp_plot
lasso_res_aml$boxplot_res
lasso_res_brc <- do_rreg(imputed_data,
clean_metadata,
case = "BRC",
control = c("AML", "PRC"),
only_female = "BRC",
only_male = "PRC",
exclude_cols = c("Sex", "Age"),
type = "lasso",
palette = "cancers12",
subtitle = c("accuracy",
"sensitivity",
"specificity",
"auc",
"features",
"top-features"),
nfeatures = 12,
points = FALSE)
lasso_res_prc <- do_rreg(imputed_data,
clean_metadata,
case = "PRC",
control = c("BRC", "AML"),
only_female = "BRC",
only_male = "PRC",
exclude_cols = c("Sex", "Age"),
type = "lasso",
palette = "cancers12",
subtitle = c("accuracy",
"sensitivity",
"specificity",
"auc",
"features",
"top-features"),
nfeatures = 12,
points = FALSE)
We can get a visual summary of the results via
plot_features_summary()
too. In order to use this function
we need to store the results in a list.
lasso_res <- list("AML" = lasso_res_aml,
"BRC" = lasso_res_brc,
"PRC" = lasso_res_prc)
features_summary <- plot_features_summary(lasso_res, case_palette = "cancers12")
features_summary$metrics_lineplot
features_summary$features_barplot
features_summary$upset_plot_features
One step further
The final step involves performing an Over-Representation Analysis for the up-regulated proteins from differential expression, that are also identified as features by the ML model. In this example, we will use the Gene Ontology (GO) database and show the results only for AML.
# Extract the proteins identified by both DE and Lasso
de_proteins <- de_res_aml$de_results |>
dplyr::filter(sig == "significant up") |>
dplyr::pull(Assay)
lasso_proteins <- lasso_res_aml$var_imp_res$features |>
dplyr::filter(Scaled_Importance > 0) |>
dplyr::pull(Variable)
intersect_proteins <- intersect(de_proteins, lasso_proteins)
# Perform ORA with GO database and visualize results
enrichment <- do_ora(intersect_proteins, database = "GO")
#> No background provided. When working with Olink data it is recommended to use background.
#>
#>
#> 'select()' returned 1:1 mapping between keys and columns
plot_ora(enrichment, intersect_proteins, ncateg = 5)
#> 'select()' returned 1:1 mapping between keys and columns
#> $dotplot
#>
#> $barplot
#>
#> $cnetplot
📓 Remember that these data are a dummy-dataset with fake data and the results in this guide should not be interpreted as real results. The purpose of this vignette is to show you how to use the package and its functions.