Summarize quality control information — hd_qc

hd_qc_summary() summarizes the quality control results of the input data and metadata. It returns general information about the datasets, missing value information, protein-protein correlations, and metadata summary visualizations.

Usage

hd_qc_summary(
  dat,
  metadata = NULL,
  variable,
  palette = NULL,
  unique_threshold = 5,
  cor_threshold = 0.8,
  cor_method = "pearson",
  verbose = TRUE
)

Arguments

dat: An HDAnalyzeR object or a dataset in wide format and sample ID as its first column.
metadata: A dataset containing the metadata information with the sample ID as the first column. If a HDAnalyzeR object is provided, this parameter is not needed.
variable: The name of the metadata variable (column) containing the different classes (for example the column that contains your case and control groups).
palette: A list of color palettes for the plots. The names of the list should match the column names in the metadata. Default is NULL.
unique_threshold: The threshold to consider a numeric variable as categorical. Default is 5.
cor_threshold: The threshold to consider a protein-protein correlation as high. Default is 0.8.
cor_method: The method to calculate the correlation. Default is "pearson". Other options are "spearman" and "kendall".
verbose: Whether to print the summary. Default is TRUE.

Value

A list containing the qc summary of data and metadata.

Examples

# Create the HDAnalyzeR object providing the data and metadata
hd_object <- hd_initialize(example_data,
                           example_metadata |> dplyr::select(-Sample))

# Run the quality control summary
qc_res <- hd_qc_summary(hd_object,
                        variable = "Disease",
                        palette = list(Disease = "cancers12", Sex = "sex"),
                        cor_threshold = 0.7,
                        verbose = TRUE)
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 101"
#> [1] "--------------------------------------"
#> [1] "categorical : 1"
#> [1] "continuous : 100"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 91 × 2
#>    column   na_percentage
#>    <chr>            <dbl>
#>  1 ACE2               6.1
#>  2 ACTA2              6.1
#>  3 ACTN4              6.1
#>  4 ADAM15             6.1
#>  5 ADAMTS16           6.1
#>  6 ADH4               6.1
#>  7 AKR1C4             6.1
#>  8 AMBN               6.1
#>  9 AMN                6.1
#> 10 AOC1               6.1
#> # ℹ 81 more rows
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 144 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00450          57.4
#>  2 DA00482          53.5
#>  3 DA00542          53.5
#>  4 DA00003          50.5
#>  5 DA00463          46.5
#>  6 DA00116          43.6
#>  7 DA00475          42.6
#>  8 DA00578          42.6
#>  9 DA00443          41.6
#> 10 DA00476          35.6
#> # ℹ 134 more rows
#> [1] "--------------------------------------"
#> [1] "Protein-protein correlations above 0.7:"
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
#> [1] "--------------------------------------"
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 8"
#> [1] "--------------------------------------"
#> [1] "categorical : 6"
#> [1] "continuous : 2"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 1 × 2
#>   column na_percentage
#>   <chr>          <dbl>
#> 1 Grade           91.5
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 536 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00001          12.5
#>  2 DA00002          12.5
#>  3 DA00003          12.5
#>  4 DA00004          12.5
#>  5 DA00005          12.5
#>  6 DA00006          12.5
#>  7 DA00007          12.5
#>  8 DA00008          12.5
#>  9 DA00009          12.5
#> 10 DA00010          12.5
#> # ℹ 526 more rows
#> [1] "--------------------------------------"

# Data summary -------------------------------------------------------------
qc_res$data_summary$na_col_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$data_summary$na_row_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$data_summary$cor_results
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
qc_res$data_summary$cor_heatmap


# Metadata summary ---------------------------------------------------------
qc_res$metadata_summary$na_col_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$metadata_summary$na_row_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$metadata_summary$Age
#> Picking joint bandwidth of 6.06

qc_res$metadata_summary$Sex

qc_res$metadata_summary$BMI
#> Picking joint bandwidth of 1.77

qc_res$metadata_summary$Stage

qc_res$metadata_summary$Grade

qc_res$metadata_summary$Cohort