What is HDAnalyzeR?
HDAnalyzeR is an R package developed to streamline and enhance proteomics analysis, particularly for biomarker discovery from blood plasma samples. It is developed by the Human Disease Blood Atlas group. This vignette will guide you through the essential steps to use the package from data loading to biomarker identification. Let’s get started by loading the package!
How to get Help?
If you need help with the package, you can use the ?
operator to access the documentation for a specific function. For
example, to access the documentation for the
hd_initialize()
function, you can run:
?hd_initialize
All package functions can be found in the HDAnalyzeR Function Reference page. For whole example workflows you can visit any of the package articles.
This document introduces you to HDAnalyzeR’s basic set of tools, and shows you how to analyze and identify biomarkers in a artificial dataset of cancer blood plasma samples.
Loading the Data
First, we load the package’s example_data
and
example_metadata
.
head(example_data)
#> DAid Sample OlinkID UniProt Assay Panel NPX
#> 1 DA00001 AML_syn_1 OID21311 Q9BTE6 AARSD1 Oncology 3.3903461
#> 2 DA00001 AML_syn_1 OID21280 P00519 ABL1 Oncology 2.7588517
#> 3 DA00001 AML_syn_1 OID21269 P09110 ACAA1 Oncology 1.7070090
#> 4 DA00001 AML_syn_1 OID20159 P16112 ACAN Cardiometabolic 0.0332709
#> 5 DA00001 AML_syn_1 OID20105 Q9BYF1 ACE2 Cardiometabolic 1.7553590
#> 6 DA00001 AML_syn_1 OID20124 Q15067 ACOX1 Cardiometabolic -0.9192835
#> Assay_Warning QC_Warning PlateID
#> 1 PASS PASS Run001
#> 2 PASS PASS Run001
#> 3 PASS PASS Run001
#> 4 PASS PASS Run001
#> 5 PASS PASS Run001
#> 6 PASS PASS Run001
head(example_metadata)
#> DAid Sample Disease Stage Grade Sex Age BMI Cohort
#> 1 DA00001 AML_syn_1 AML 2 <NA> F 42 22.7 UCAN
#> 2 DA00002 AML_syn_2 AML Unknown <NA> M 69 33.1 UCAN
#> 3 DA00003 AML_syn_3 AML 2 <NA> F 61 26.2 UCAN
#> 4 DA00004 AML_syn_4 AML Unknown <NA> M 54 28.1 UCAN
#> 5 DA00005 AML_syn_5 AML 2 <NA> F 57 21.4 UCAN
#> 6 DA00006 AML_syn_6 AML Unknown <NA> M 86 33.9 UCAN
We will initialize the HDAnalyzeR object with the data and metadata.
The hd_initialize()
function will create an object that
will be used throughout the analysis. We will specify the sample ID,
variable name, and value name in order to correctly load and widen the
data. In this case, the data is in long format, so we set
is_wide = FALSE
.
hd_obj <- hd_initialize(dat = example_data,
metadata = example_metadata,
is_wide = FALSE,
sample_id = "DAid",
var_name = "Assay",
value_name = "NPX")
# See the data transformed into a wide format!
head(hd_obj$data)
#> # A tibble: 6 × 101
#> DAid AARSD1 ABL1 ACAA1 ACAN ACE2 ACOX1 ACP5 ACP6 ACTA2 ACTN4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 DA00001 3.39 2.76 1.71 0.0333 1.76 -0.919 1.54 2.15 2.81 0.742
#> 2 DA00002 1.42 1.25 -0.816 -0.459 0.826 -0.902 0.647 1.30 0.798 -0.0659
#> 3 DA00003 NA NA NA 0.989 NA 0.330 1.37 NA NA NA
#> 4 DA00004 3.41 3.38 1.69 NA 1.52 NA 0.841 0.582 1.70 0.108
#> 5 DA00005 5.01 5.05 0.128 0.401 -0.933 -0.584 0.0265 1.16 2.73 0.350
#> 6 DA00006 6.83 1.18 -1.74 -0.156 1.53 -0.721 0.620 0.527 0.772 NA
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> # ADAM23 <dbl>, ADAM8 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> # ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> # ADGRG2 <dbl>, ADH4 <dbl>, ADM <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>,
#> # AGRN <dbl>, AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>,
#> # AIFM1 <dbl>, AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>,
#> # AKT3 <dbl>, ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, …
📓 In real-world scenarios, you would load your own data and metadata files instead of using the example dataset.
⚠️ It is strongly recommended to use
hd_initialize()
function to initialize an HDAnalyzeR object before starting any analysis. This initialization step ensures that your input data and metadata are correctly formatted and compatible with the package. While it is possible to use HDAnalyzeR functions on data outside of an HDAnalyzeR object, users must ensure that the data adheres to the required structure: the first column must contain sample IDs, followed by numeric columns in wide format. Additionally, the metadata must include the same set of sample IDs. Proper initialization or careful adherence to these requirements is crucial for accurate and efficient analysis.
Quality Control (QC)
hd_qc_summary()
provides a comprehensive summary of the
input data and metadata.
Regarding the data, it will check the column types, calculate the percentage of NAs in each column and row and plot their distributions, calculate variable-variable (in this case protein-protein) correlations, and create a heatmap of these correlations. Users can also specify the threshold for reporting variable-variable correlations.
Regarding the metadata, it will check the column types and calculate the percentage of NAs in each column and row exactly as before. It will also create summary visualizations for the metadata variables like “Sex” and “Age”.
Let’s run the QC summary and take a look at some results!
qc_data <- hd_qc_summary(hd_obj,
variable = "Disease",
palette = list(Disease = "cancers12", Sex = "sex"),
cor_threshold = 0.7,
verbose = FALSE)
qc_data$data_summary$na_col_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qc_data$data_summary$cor_heatmap
qc_data$metadata_summary$Sex
qc_data$metadata_summary$Age
#> Picking joint bandwidth of 6.06
Imputation and Dimensionality Reduction
As we saw from the QC results, the data contains NAs and other issues
that need to be addressed. We will impute missing values using K-nearest
neighbors (KNN) with 5 neighbors via hd_impute_knn()
.
hd_obj <- hd_impute_knn(hd_obj, k = 5, verbose = FALSE)
head(hd_obj$data)
#> # A tibble: 6 × 101
#> DAid AARSD1 ABL1 ACAA1 ACAN ACE2 ACOX1 ACP5 ACP6 ACTA2 ACTN4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 DA00001 3.39 2.76 1.71 0.0333 1.76 -0.919 1.54 2.15 2.81 0.742
#> 2 DA00002 1.42 1.25 -0.816 -0.459 0.826 -0.902 0.647 1.30 0.798 -0.0659
#> 3 DA00003 3.70 2.85 2.27 0.989 0.184 0.330 1.37 0.876 1.33 0.570
#> 4 DA00004 3.41 3.38 1.69 0.415 1.52 1.23 0.841 0.582 1.70 0.108
#> 5 DA00005 5.01 5.05 0.128 0.401 -0.933 -0.584 0.0265 1.16 2.73 0.350
#> 6 DA00006 6.83 1.18 -1.74 -0.156 1.53 -0.721 0.620 0.527 0.772 0.446
#> # ℹ 90 more variables: ACY1 <dbl>, ADA <dbl>, ADA2 <dbl>, ADAM15 <dbl>,
#> # ADAM23 <dbl>, ADAM8 <dbl>, ADAMTS13 <dbl>, ADAMTS15 <dbl>, ADAMTS16 <dbl>,
#> # ADAMTS8 <dbl>, ADCYAP1R1 <dbl>, ADGRE2 <dbl>, ADGRE5 <dbl>, ADGRG1 <dbl>,
#> # ADGRG2 <dbl>, ADH4 <dbl>, ADM <dbl>, AGER <dbl>, AGR2 <dbl>, AGR3 <dbl>,
#> # AGRN <dbl>, AGRP <dbl>, AGXT <dbl>, AHCY <dbl>, AHSP <dbl>, AIF1 <dbl>,
#> # AIFM1 <dbl>, AK1 <dbl>, AKR1B1 <dbl>, AKR1C4 <dbl>, AKT1S1 <dbl>,
#> # AKT3 <dbl>, ALCAM <dbl>, ALDH1A1 <dbl>, ALDH3A1 <dbl>, ALPP <dbl>, …
After imputation, we will run Principal Component Analysis (PCA) via
hd_auto_pca()
and Uniform Manifold Approximation and
Projection (UMAP) via hd_auto_umap()
to check for outliers,
batch effects, and other potential issues. These are wrapper functions
that will run the necessary steps to perform PCA and UMAP, including
scaling the data and coloring the plots by metadata variables.
pca_res <- hd_auto_pca(hd_obj,
components = 10,
plot_color = "Disease",
plot_palette = "cancers12")
head(pca_res$pca_res)
#> # A tibble: 6 × 11
#> DAid PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 DA00… -3.69 -4.32 -2.30 -3.15 -2.53 -2.78 -2.77 0.603 -3.20 -0.532
#> 2 DA00… 4.17 -2.70 2.10 -0.472 -4.47 -1.96 -1.04 1.51 -0.229 -0.0614
#> 3 DA00… -3.31 4.74 1.41 0.920 -0.588 0.316 -0.103 -0.441 1.36 0.201
#> 4 DA00… -4.78 0.447 1.44 0.134 -1.15 -0.284 0.279 3.34 0.0452 -0.866
#> 5 DA00… -5.04 -3.69 0.820 -5.73 -0.697 -3.87 -1.01 1.96 -0.970 -1.12
#> 6 DA00… 0.429 0.0714 -1.86 -7.80 0.809 -2.89 -0.733 0.381 -0.617 1.32
pca_res$pca_plot
pca_res$pca_loadings_plot
pca_res$pca_variance_plot
Let’s color UMAP by Sex instead of Disease to see if there are any sex related differences.
umap_res <- hd_auto_umap(hd_obj,
plot_color = "Sex",
plot_palette = "sex")
head(umap_res$umap_res)
#> # A tibble: 6 × 3
#> DAid UMAP1 UMAP2
#> <fct> <dbl> <dbl>
#> 1 DA00001 2.26 1.14
#> 2 DA00002 -0.917 -2.46
#> 3 DA00003 0.394 2.49
#> 4 DA00004 0.658 2.74
#> 5 DA00005 2.31 1.29
#> 6 DA00006 -1.83 -0.203
umap_res$umap_plot
Biomarker Identification
Differential Expression Analysis
We will run a differential expression analysis to identify potential
biomarkers. We will use hd_de_limma()
so that we will be
able to correct also for Sex and Age. This method will help us pinpoint
proteins that are significantly different between conditions. We will
run the analysis 3 times for different case classes against all other
classes each time. We will also run hd_plot_volcano()
to
visualize the results. We will present only the results for Acute
Myeloid Leukemia (AML).
de_res_aml <- hd_de_limma(hd_obj,
variable = "Disease",
case = "AML",
correct = c("Sex", "Age")) |>
hd_plot_volcano(logfc_lim = 1)
head(de_res_aml$de_res)
#> # A tibble: 6 × 10
#> Feature logFC CI.L CI.R AveExpr t P.Value adj.P.Val B Disease
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 AZU1 1.54 1.23 1.84 0.226 9.97 9.51e-22 9.51e-20 38.5 AML
#> 2 ADA 1.40 1.12 1.68 0.924 9.75 6.37e-21 3.19e-19 36.6 AML
#> 3 APBB1IP 1.08 0.799 1.35 -0.253 7.62 1.04e-13 3.45e-12 20.3 AML
#> 4 APEX1 1.45 1.07 1.82 0.515 7.56 1.50e-13 3.74e-12 19.9 AML
#> 5 ANGPT1 -1.63 -2.09 -1.16 1.72 -6.87 1.65e-11 3.29e-10 15.3 AML
#> 6 ARTN 0.826 0.572 1.08 0.423 6.38 3.60e-10 6.00e- 9 12.3 AML
de_res_aml$volcano_plot
de_res_crc <- hd_de_limma(hd_obj,
variable = "Disease",
case = "CRC",
correct = c("Sex", "Age"))
de_res_lungc <- hd_de_limma(hd_obj,
variable = "Disease",
case = "LUNGC",
correct = c("Sex", "Age"))
We can also summarize the results via
hd_plot_de_summary()
. In order to use this function we need
to store all the results in a list.
de_res <- list("AML" = de_res_aml,
"CRC" = de_res_crc,
"LUNGC" = de_res_lungc)
de_summary <- hd_plot_de_summary(de_res, class_palette = "cancers12")
de_summary$de_barplot
de_summary$upset_plot_up
de_summary$upset_plot_down
Machine Learning Classification Model
In addition to differential expression analysis, we will use a regularized machine learning classification model (LASSO) to identify significant features. This model will help us understand which protein features are most predictive of the conditions being studied.
We will start by creating the data split with
hd_split_data()
so we don’t have any data leakage between
train and test data and then we will run the model with
hd_model_rreg()
. Once again, we will run the analysis 3
times but we will present only the results for AML.
hd_split <- hd_split_data(hd_obj, variable = "Disease")
lasso_res_aml <- hd_model_rreg(hd_split,
variable = "Disease",
case = "AML",
mixture = 1,
grid_size = 5,
palette = "cancers12",
verbose = FALSE)
lasso_res_aml$metrics
#> $accuracy
#> [1] 0.7959184
#>
#> $sensitivity
#> [1] 0.9166667
#>
#> $specificity
#> [1] 0.7851852
#>
#> $auc
#> [1] 0.9351852
#>
#> $confusion_matrix
#> Truth
#> Prediction 0 1
#> 0 106 1
#> 1 29 11
lasso_res_aml$roc_curve
lasso_res_aml$probability_plot
lasso_res_aml$feat_imp_plot
lasso_res_crc <- hd_model_rreg(hd_split,
variable = "Disease",
case = "CRC",
mixture = 1,
grid_size = 5,
palette = "cancers12",
verbose = FALSE)
lasso_res_lungc <- hd_model_rreg(hd_split,
variable = "Disease",
case = "LUNGC",
mixture = 1,
grid_size = 5,
palette = "cancers12",
verbose = FALSE)
We can get a visual summary of the results via
hd_plot_model_summary()
. In order to use this function we
need to store the results in a list as before.
lasso_res <- list("AML" = lasso_res_aml,
"CRC" = lasso_res_crc,
"LUNGC" = lasso_res_lungc)
features_summary <- hd_plot_model_summary(lasso_res, class_palette = "cancers12")
features_summary$metrics_barplot
features_summary$features_barplot
features_summary$upset_plot_features
One step further
The final step involves performing a pathway enrichment analysis to
get more biological information about the features. In this example, we
will run an Over-Representation Analysis for the up-regulated proteins
from differential expression, that are also identified as features by
the ML model. We will use the Gene Ontology (GO) database and show the
results only for AML. We will use hd_ora()
and
hd_plot_ora()
to perform the analysis and visualize its
results.
# Extract the proteins identified by both DE and Lasso
de_proteins <- de_res_aml$de_res |>
dplyr::filter(adj.P.Val < 0.05) |>
dplyr::pull(Feature)
lasso_proteins <- lasso_res_aml$features |>
dplyr::filter(Scaled_Importance > 0) |>
dplyr::pull(Feature)
intersect_proteins <- intersect(de_proteins, lasso_proteins)
# Perform ORA with GO database and visualize results
enrichment <- hd_ora(intersect_proteins,
database = "GO",
ontology = "BP",
pval_lim = 0.2) # We will use a higher p-value threshold for this example
enrichment_plots <- hd_plot_ora(enrichment)
enrichment_plots$dotplot
enrichment_plots$treeplot
enrichment_plots$cnetplot
📓 Remember that these data are a dummy-dataset with artificial data and the results in this guide should not be interpreted as real results. The purpose of this vignette is to show you how to use the package and its functions.