Differential Expression Analysis
Source:vignettes/differential_expression.Rmd
differential_expression.Rmd
This vignette will guide you through the differential expression analysis of your data. We will load HDAnalyzeR and ggplot2, widen the data and load the metadata we are going to use.
library(HDAnalyzeR)
library(ggplot2)
wide_data <- widen_data(example_data)
metadata <- example_metadata
We will start by running a simple differential expression analysis
using the do_limma()
function. In this function we have to
state the group that will be the case, as well as the control(s). We
will also keep the default correction for both Sex
and
Age
variables. We get the warning because the metadata are
not pre-filtered, but it is safe to ignore it in this case as it is
something that we do on purpose.
do_limma(wide_data, metadata, case = "AML", control = "CLL")
#> Comparing AML with CLL.
#> Warning in do_limma_de(join_data, variable, case, control, correct,
#> correct_type, : 488 rows were removed because they contain NAs in Disease or
#> Sex, Age!
#> $de_results
#> # A tibble: 100 × 11
#> Assay logFC CI.L CI.R AveExpr t P.Value adj.P.Val B Disease
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 ADA 1.42 0.955 1.89 1.56 6.04 2.53e-8 2.53e-6 8.73 AML
#> 2 ADAM8 -1.23 -1.67 -0.795 1.74 -5.59 1.94e-7 9.71e-6 6.76 AML
#> 3 AZU1 1.92 1.20 2.64 0.777 5.30 6.82e-7 2.27e-5 5.55 AML
#> 4 ARID4B -1.38 -1.91 -0.847 1.85 -5.15 1.34e-6 3.35e-5 4.91 AML
#> 5 ARTN 1.08 0.597 1.55 0.804 4.46 2.28e-5 3.90e-4 2.21 AML
#> 6 ANGPT1 -1.71 -2.47 -0.947 0.992 -4.45 2.34e-5 3.90e-4 2.20 AML
#> 7 ACP6 -0.725 -1.06 -0.390 1.20 -4.30 3.99e-5 5.70e-4 1.67 AML
#> 8 ACAN -0.701 -1.04 -0.364 0.556 -4.13 7.65e-5 8.24e-4 1.06 AML
#> 9 ADAMTS8 -0.785 -1.16 -0.406 0.239 -4.12 8.12e-5 8.24e-4 1.02 AML
#> 10 ADGRG2 -0.659 -0.977 -0.341 0.00861 -4.11 8.24e-5 8.24e-4 0.983 AML
#> # ℹ 90 more rows
#> # ℹ 1 more variable: sig <chr>
#>
#> $volcano_plot
We are able to state more control groups if we want to. We can also change the correction for the variables as well as both the p-value and logFC significance thresholds.
do_limma(wide_data,
metadata,
case = "AML",
control = c("CLL", "MYEL", "GLIOM"),
correct = "Sex",
pval_lim = 0.01,
logfc_lim = 1)
#> Comparing AML with CLL, MYEL, GLIOM.
#> Warning in do_limma_de(join_data, variable, case, control, correct,
#> correct_type, : 400 rows were removed because they contain NAs in Disease or
#> Sex!
#> $de_results
#> # A tibble: 100 × 11
#> Assay logFC CI.L CI.R AveExpr t P.Value adj.P.Val B Disease
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 ADA 1.37 1.02 1.72 1.26 7.76 5.48e-13 5.48e-11 19.0 AML
#> 2 ANGPT1 -1.77 -2.32 -1.23 1.46 -6.40 1.27e- 9 4.89e- 8 11.4 AML
#> 3 AZU1 1.54 1.06 2.01 0.592 6.36 1.47e- 9 4.89e- 8 11.3 AML
#> 4 ANGPT2 0.815 0.522 1.11 1.02 5.48 1.38e- 7 3.45e- 6 6.82 AML
#> 5 APP -0.872 -1.19 -0.553 1.05 -5.40 2.05e- 7 4.11e- 6 6.45 AML
#> 6 ADGRG1 1.17 0.637 1.71 1.86 4.32 2.59e- 5 4.31e- 4 1.83 AML
#> 7 ARHGEF12 -1.14 -1.71 -0.577 3.41 -3.98 9.84e- 5 1.41e- 3 0.537 AML
#> 8 AMY2B -0.588 -0.884 -0.293 0.250 -3.93 1.20e- 4 1.50e- 3 0.336 AML
#> 9 ACAN -0.551 -0.831 -0.271 0.617 -3.89 1.42e- 4 1.58e- 3 0.186 AML
#> 10 AGR2 -1.05 -1.59 -0.499 1.84 -3.78 2.20e- 4 2.20e- 3 -0.207 AML
#> # ℹ 90 more rows
#> # ℹ 1 more variable: sig <chr>
#>
#> $volcano_plot
We can customize the volcano plot further by adding a subtitle and removing the number of significant proteins. We can also remove the title via ggplot2 because now it seems to be redundant.
res <- do_limma(wide_data,
metadata,
case = "AML",
control = c("CLL", "MYEL", "GLIOM"),
correct = "Sex",
pval_lim = 0.01,
logfc_lim = 1,
report_nproteins = FALSE,
subtitle = "AML vs CLL, MYEL and GLIOM")
#> Comparing AML with CLL, MYEL, GLIOM.
#> Warning in do_limma_de(join_data, variable, case, control, correct,
#> correct_type, : 400 rows were removed because they contain NAs in Disease or
#> Sex!
res$volcano_plot + ggplot2::labs(title = NULL)
Let’s move to another method. We will use the do_ttest()
that performs a t-test for each protein. This function works in a
similar way with do_limma()
but it cannot correct for other
variables like Sex
and Age
.
do_ttest(wide_data, metadata, case = "AML", control = "CLL")
#> $de_results
#> # A tibble: 100 × 6
#> Assay P.Value logFC Disease adj.P.Val sig
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 ADAM8 0.00000107 -1.36 AML 0.0000769 significant down
#> 2 AZU1 0.00000154 1.93 AML 0.0000769 significant up
#> 3 ADA 0.00000259 1.41 AML 0.0000863 significant up
#> 4 ARID4B 0.00000809 -1.53 AML 0.000202 significant down
#> 5 ANGPT1 0.0000121 -1.74 AML 0.000225 significant down
#> 6 ARTN 0.0000135 1.32 AML 0.000225 significant up
#> 7 ACAN 0.0000272 -0.679 AML 0.000388 significant down
#> 8 ADGRG2 0.0000504 -0.641 AML 0.000630 significant down
#> 9 ADAMTS8 0.0000969 -0.758 AML 0.00108 significant down
#> 10 ACP6 0.000130 -0.814 AML 0.00130 significant down
#> # ℹ 90 more rows
#>
#> $volcano_plot
If we have diseases that are gender specific, we can specify them and
only their gender will be used in the analysis. This is also available
in the do_limma()
function, but in that case we should not
forget to remove the Sex
correction.
do_ttest(wide_data,
metadata,
case = "BRC",
control = c("AML", "CLL", "PRC"),
only_female = "BRC",
only_male = "PRC")
#> $de_results
#> # A tibble: 100 × 6
#> Assay P.Value logFC Disease adj.P.Val sig
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 APEX1 0.00000000102 -2.24 BRC 0.000000102 significant down
#> 2 ARID4B 0.00000257 -1.30 BRC 0.000129 significant down
#> 3 B4GALT1 0.00000401 -0.832 BRC 0.000134 significant down
#> 4 ABL1 0.0000737 -1.34 BRC 0.00184 significant down
#> 5 ARHGAP25 0.000161 -1.04 BRC 0.00321 significant down
#> 6 APBB1IP 0.000197 -1.24 BRC 0.00328 significant down
#> 7 ADA2 0.00107 -0.624 BRC 0.0129 significant down
#> 8 ANGPT2 0.00104 -0.562 BRC 0.0129 significant down
#> 9 AZU1 0.00116 -1.24 BRC 0.0129 significant down
#> 10 ADAM8 0.00223 -0.634 BRC 0.0223 significant down
#> # ℹ 90 more rows
#>
#> $volcano_plot
We could run differential expression against another categorical
variable like Sex
.
do_ttest(wide_data,
metadata,
variable = "Sex",
case = "F",
control = "M")
#> $de_results
#> # A tibble: 100 × 6
#> Assay P.Value logFC Sex adj.P.Val sig
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 ALPP 8.42e-10 0.983 F 0.0000000842 significant up
#> 2 ART3 3.05e- 7 -0.356 F 0.0000152 significant down
#> 3 B4GALT1 8.86e- 7 -0.349 F 0.0000295 significant down
#> 4 ACY1 3.25e- 6 -0.479 F 0.0000812 significant down
#> 5 ACE2 9.82e- 6 -0.448 F 0.000164 significant down
#> 6 ANGPTL3 9.64e- 6 0.296 F 0.000164 significant up
#> 7 ADGRG2 5.02e- 5 0.304 F 0.000680 significant up
#> 8 ATXN10 5.44e- 5 -0.430 F 0.000680 significant down
#> 9 APBB1IP 7.95e- 5 -0.422 F 0.000884 significant down
#> 10 AXIN1 1.68e- 4 -0.473 F 0.00168 significant down
#> # ℹ 90 more rows
#>
#> $volcano_plot
Moreover, we can also perform Differential Expression Analysis
against a continuous variable such as Age
via
do_limma_continuous()
. We can also correct for categorical
and other continuous variables.
do_limma_continuous(wide_data,
metadata,
variable = "Age",
correct = "Sex",
pval_lim = 0.001)
#> $de_results
#> # A tibble: 100 × 9
#> Assay logFC as.factor.Sex.F as.factor.Sex.M AveExpr F P.Value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ADAMTS15 -0.000719 3.09 2.92 2.99 1874. 2.10e-291
#> 2 AARSD1 0.000327 2.96 3.25 3.13 1608. 1.29e-274
#> 3 AKT1S1 0.00154 3.28 3.46 3.47 1478. 3.54e-265
#> 4 ATG4A -0.00157 2.56 2.71 2.55 1138. 2.26e-238
#> 5 ATOX1 -0.00166 3.02 3.18 2.97 1061. 1.13e-232
#> 6 ADM 0.00536 1.53 1.47 1.87 954. 7.63e-224
#> 7 AK1 -0.00373 2.51 2.66 2.34 786. 3.94e-202
#> 8 AKR1B1 -0.000171 2.28 2.33 2.29 783. 2.64e-200
#> 9 ATP5IF1 -0.00321 3.66 4.02 3.60 740. 1.11e-196
#> 10 ARHGEF12 -0.00163 3.19 3.56 3.26 683. 2.06e-187
#> # ℹ 90 more rows
#> # ℹ 2 more variables: adj.P.Val <dbl>, sig <chr>
#>
#> $volcano_plot
#> Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
As a last step, we can summarize the results via
plot_de_summary()
. Let’s first run a differential
expression analysis for 4 different cases (1 vs 3).
res_aml <- do_limma(wide_data, metadata, case = "AML", control = c("CLL", "MYEL", "GLIOM"))
res_cll <- do_limma(wide_data, metadata, case = "CLL", control = c("AML", "MYEL", "GLIOM"))
res_myel <- do_limma(wide_data, metadata, case = "MYEL", control = c("AML", "CLL", "GLIOM"))
res_gliom <- do_limma(wide_data, metadata, case = "GLIOM", control = c("AML", "CLL", "MYEL"))
de_summary_res <- plot_de_summary(list("AML" = res_aml, "CLL" = res_cll, "MYEL" = res_myel, "GLIOM" = res_gliom),
disease_palette = "cancers12")
de_summary_res$de_barplot
de_summary_res$upset_plot_up
de_summary_res$upset_plot_down