hd_qc_summary() produces a quality control (QC) summary of the input data and metadata. It returns general information about the datasets, missing value information, protein-protein correlations, and metadata summary visualizations.

Usage

hd_qc_summary(
  dat,
  metadata = NULL,
  variable,
  palette = NULL,
  unique_threshold = 5,
  cor_threshold = 0.8,
  cor_method = "pearson",
  verbose = TRUE
)

Arguments

dat

An HDAnalyzeR object or a dataset in wide format with the sample ID as its first column.

metadata

A dataset containing the metadata information with the sample ID as the first column. If an HDAnalyzeR object is provided, this parameter is not needed.

variable

The name of the metadata variable (column) containing the different classes (for example the column that contains your case and control groups).

palette

A list of color palettes for the plots. The names of the list should match the column names in the metadata. Default is NULL.

unique_threshold

The threshold on the number of unique values used to decide whether a numeric variable is treated as categorical. Default is 5.

cor_threshold

The threshold to consider a protein-protein correlation as high. Default is 0.8.

cor_method

The method used to calculate the correlation. Default is "pearson". Other options are "spearman" and "kendall".

verbose

Whether to print the summary. Default is TRUE.
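
To make unique_threshold, cor_threshold, and cor_method more concrete, here is a minimal base R sketch on toy data. It is not the package's implementation: the toy tables, the `<=` comparison for unique_threshold, and the use of the raw (not absolute) correlation are assumptions made for illustration only.

```r
# Toy inputs; every name and value below is made up for illustration.
set.seed(1)
n <- 50
base <- rnorm(n)
toy_data <- data.frame(
  DAid = sprintf("S%02d", seq_len(n)),
  P1 = base,
  P2 = base + rnorm(n, sd = 0.1),  # strongly correlated with P1 by construction
  P3 = rnorm(n)                    # independent of P1 and P2
)
toy_meta <- data.frame(
  DAid  = toy_data$DAid,
  Age   = round(runif(n, 20, 80)),        # numeric with many distinct values
  Stage = sample(1:3, n, replace = TRUE)  # numeric with at most 3 distinct values
)

# unique_threshold: a numeric column with few distinct values is treated as categorical.
# ASSUMPTION: "at most unique_threshold distinct values"; the package may use "<".
unique_threshold <- 5
n_unique <- vapply(
  toy_meta[, -1],
  function(x) length(unique(stats::na.omit(x))),
  integer(1)
)
var_type <- ifelse(n_unique <= unique_threshold, "categorical", "continuous")
var_type

# cor_method / cor_threshold: pairwise correlations, then keep the high ones.
cor_threshold <- 0.8
cor_method <- "pearson"  # stats::cor() also accepts "spearman" and "kendall"
cor_mat <- stats::cor(toy_data[, -1], use = "pairwise.complete.obs", method = cor_method)

# Long format without the diagonal (self-correlations).
cor_long <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_long) <- c("Protein1", "Protein2", "Correlation")
cor_long <- cor_long[cor_long$Protein1 != cor_long$Protein2, ]

# ASSUMPTION: pairs are kept when the raw correlation exceeds the threshold.
high_cor <- cor_long[cor_long$Correlation > cor_threshold, ]
high_cor$Correlation <- round(high_cor$Correlation, 2)
high_cor
```

Because the full symmetric matrix is converted to long format, each high-correlation pair appears in both orders, which matches the shape of the correlation table printed in the examples.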

Value

A list containing the QC summaries of the data and the metadata (accessed in the examples below as data_summary and metadata_summary).

Examples

# Create the HDAnalyzeR object providing the data and metadata
hd_object <- hd_initialize(example_data,
                           example_metadata |> dplyr::select(-Sample))

# Run the quality control summary
qc_res <- hd_qc_summary(hd_object,
                        variable = "Disease",
                        palette = list(Disease = "cancers12", Sex = "sex"),
                        cor_threshold = 0.7,
                        verbose = TRUE)
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 101"
#> [1] "--------------------------------------"
#> [1] "categorical : 1"
#> [1] "continuous : 100"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 91 × 2
#>    column   na_percentage
#>    <chr>            <dbl>
#>  1 ACE2               6.1
#>  2 ACTA2              6.1
#>  3 ACTN4              6.1
#>  4 ADAM15             6.1
#>  5 ADAMTS16           6.1
#>  6 ADH4               6.1
#>  7 AKR1C4             6.1
#>  8 AMBN               6.1
#>  9 AMN                6.1
#> 10 AOC1               6.1
#> # ℹ 81 more rows
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 144 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00450          57.4
#>  2 DA00482          53.5
#>  3 DA00542          53.5
#>  4 DA00003          50.5
#>  5 DA00463          46.5
#>  6 DA00116          43.6
#>  7 DA00475          42.6
#>  8 DA00578          42.6
#>  9 DA00443          41.6
#> 10 DA00476          35.6
#> # ℹ 134 more rows
#> [1] "--------------------------------------"
#> [1] "Protein-protein correlations above 0.7:"
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
#> [1] "--------------------------------------"
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 8"
#> [1] "--------------------------------------"
#> [1] "categorical : 6"
#> [1] "continuous : 2"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 1 × 2
#>   column na_percentage
#>   <chr>          <dbl>
#> 1 Grade           91.5
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 536 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00001          12.5
#>  2 DA00002          12.5
#>  3 DA00003          12.5
#>  4 DA00004          12.5
#>  5 DA00005          12.5
#>  6 DA00006          12.5
#>  7 DA00007          12.5
#>  8 DA00008          12.5
#>  9 DA00009          12.5
#> 10 DA00010          12.5
#> # ℹ 526 more rows
#> [1] "--------------------------------------"

# Data summary -------------------------------------------------------------
qc_res$data_summary$na_col_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$data_summary$na_row_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$data_summary$cor_results
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
qc_res$data_summary$cor_heatmap


# Metadata summary ---------------------------------------------------------
qc_res$metadata_summary$na_col_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$metadata_summary$na_row_hist
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qc_res$metadata_summary$Age
#> Picking joint bandwidth of 6.06

qc_res$metadata_summary$Sex

qc_res$metadata_summary$BMI
#> Picking joint bandwidth of 1.77

qc_res$metadata_summary$Stage

qc_res$metadata_summary$Grade

qc_res$metadata_summary$Cohort
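
The NA percentages printed in the summary above can be approximated with a few lines of base R. The sketch below uses a made-up toy table and is not the package's own code. It assumes the row percentage is computed over all columns, including the ID column, which is consistent with the metadata output above (one missing value out of 8 variables gives 12.5), and that only columns and rows with at least one missing value are listed.

```r
# Toy table with some missing values; names and values are made up.
toy <- data.frame(
  DAid = c("S1", "S2", "S3", "S4"),
  P1   = c(1.2, NA, 0.4, 2.0),
  P2   = c(NA, NA, 1.1, 0.3),
  P3   = c(0.5, 0.7, 0.9, 1.0)
)

# Percentage of missing values per column; keep only columns that have any.
na_col_pct <- colMeans(is.na(toy)) * 100
na_col <- data.frame(
  column = names(na_col_pct),
  na_percentage = round(unname(na_col_pct), 1)
)
na_col <- na_col[na_col$na_percentage > 0, ]

# Percentage of missing values per row, over all columns (ID column included);
# keep only rows that have any, sorted from most to least missing.
na_row <- data.frame(
  DAid = toy$DAid,
  na_percentage = round(unname(rowMeans(is.na(toy))) * 100, 1)
)
na_row <- na_row[na_row$na_percentage > 0, ]
na_row <- na_row[order(-na_row$na_percentage), ]

na_col
na_row
```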