qc_summary_data() summarizes quality control results for the input dataset. It handles both long and wide data frames. The function checks the column types, calculates the percentage of NAs in each column and row, performs a normality test, calculates the protein-protein correlations, and creates a heatmap of the correlations. The user can set the threshold above which protein-protein correlations are reported.

Usage

qc_summary_data(
  df,
  wide = TRUE,
  threshold = 0.8,
  cor_method = "pearson",
  report = TRUE
)

Arguments

df

The input dataset.

wide

Whether the input dataset is in wide format. Default is TRUE.

threshold

The threshold above which protein-protein correlations are reported. Default is 0.8.

cor_method

The correlation method used for the protein-protein correlations. Default is "pearson".

report

Whether to print the summary. Default is TRUE.

Value

A list containing the following elements:

  • na_percentage_col: A tibble with the column names and the percentage of NAs in each column.

  • na_percentage_row: A tibble with the DAids and the percentage of NAs in each row.

  • normality_results: A tibble with the protein names, p-values, adjusted p-values, and normality status.

  • cor_matrix: A matrix of protein-protein correlations.

  • cor_results: A tibble with the filtered protein pairs and their correlation values.

  • heatmap: A heatmap of protein-protein correlations.
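The relationship between cor_matrix and cor_results can be illustrated with a minimal base R sketch. This is not the function's internal code: the toy data, the use of pairwise-complete observations, and the choice to filter on the signed (rather than absolute) correlation are all assumptions made for illustration.

```r
# Sketch only: toy data standing in for a wide protein table (hypothetical names).
set.seed(1)
x <- rnorm(50)
toy <- data.frame(
  p1 = x,
  p2 = x + rnorm(50, sd = 0.1),  # strongly correlated with p1
  p3 = rnorm(50)                 # independent of p1 and p2
)

# Full protein-protein correlation matrix
cor_matrix <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Reshape the matrix to long format: one row per ordered protein pair
cor_long <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
names(cor_long) <- c("Protein1", "Protein2", "Correlation")

# Keep pairs above the threshold, dropping each protein paired with itself.
# Assumption: the filter uses the signed correlation, not its absolute value.
threshold <- 0.8
keep <- cor_long$Protein1 != cor_long$Protein2 & cor_long$Correlation > threshold
cor_results <- cor_long[keep, ]
cor_results$Correlation <- round(cor_results$Correlation, 2)
cor_results
```

Because the matrix is symmetric, each retained pair appears twice (once in each order), which matches the duplicated pairs visible in the example output further down.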

Details

The correlation method defaults to Pearson (see cor_method), and the normality test is Shapiro-Wilk. If the wide dataset contains more than 5000 rows, a random sample of 5000 rows is used to assess normality, because R's shapiro.test() accepts at most 5000 observations.
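The sampling step and the per-protein test can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy data frame is hypothetical, and both the Benjamini-Hochberg adjustment and the 0.05 cutoff are assumptions, since the page does not name the adjustment method or the significance level.

```r
# Sketch only: hypothetical wide data with 6000 samples and 3 proteins.
set.seed(123)
wide_df <- data.frame(
  prot_a = rnorm(6000),
  prot_b = rexp(6000),
  prot_c = runif(6000)
)

# shapiro.test() accepts between 3 and 5000 non-missing values,
# so larger datasets are down-sampled to 5000 random rows first.
max_n <- 5000
if (nrow(wide_df) > max_n) {
  wide_df <- wide_df[sample(nrow(wide_df), max_n), , drop = FALSE]
}

# One Shapiro-Wilk p-value per protein column
p_values <- vapply(wide_df, function(x) shapiro.test(x)$p.value, numeric(1))

normality <- data.frame(
  Protein = names(p_values),
  p_value = unname(p_values),
  # Assumption: Benjamini-Hochberg adjustment; the method is not stated in the source
  adj.P.Val = p.adjust(unname(p_values), method = "BH")
)

# Assumption: a protein is flagged as normal when the adjusted p-value exceeds 0.05
normality$is_normal <- normality$adj.P.Val > 0.05
normality
```

With thousands of observations the Shapiro-Wilk test detects even small departures from normality, so most proteins being flagged as non-normal, as in the example output, is not unusual.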

Examples

qc_res <- qc_summary_data(example_data, wide = FALSE, threshold = 0.7)
#> [1] "Summary:"
#> [1] "Note: In case of long output, only the first 10 rows are shown. To see the rest display the object with view()"
#> [1] "Number of samples: 586"
#> [1] "Number of variables: 100"
#> [1] "--------------------------------------"
#> [1] "character : 1"
#> [1] "numeric : 100"
#> [1] "--------------------------------------"
#> [1] "NA percentage in each column:"
#> # A tibble: 91 × 2
#>    column   na_percentage
#>    <chr>            <dbl>
#>  1 ACE2               6.1
#>  2 ACTA2              6.1
#>  3 ACTN4              6.1
#>  4 ADAM15             6.1
#>  5 ADAMTS16           6.1
#>  6 ADH4               6.1
#>  7 AKR1C4             6.1
#>  8 AMBN               6.1
#>  9 AMN                6.1
#> 10 AOC1               6.1
#> # ℹ 81 more rows
#> [1] "--------------------------------------"
#> [1] "NA percentage in each row:"
#> # A tibble: 144 × 2
#>    DAid    na_percentage
#>    <chr>           <dbl>
#>  1 DA00450          57.4
#>  2 DA00482          53.5
#>  3 DA00542          53.5
#>  4 DA00003          50.5
#>  5 DA00463          46.5
#>  6 DA00116          43.6
#>  7 DA00475          42.6
#>  8 DA00578          42.6
#>  9 DA00443          41.6
#> 10 DA00476          35.6
#> # ℹ 134 more rows
#> [1] "--------------------------------------"
#> [1] "Normality test results:"
#> # A tibble: 100 × 4
#>    Protein    p_value adj.P.Val is_normal
#>    <chr>        <dbl>     <dbl> <lgl>    
#>  1 ARID4B    2.00e-21  1.64e-19 FALSE    
#>  2 ARTN      4.91e-21  1.64e-19 FALSE    
#>  3 ATF2      4.01e-21  1.64e-19 FALSE    
#>  4 AZU1      6.02e-20  1.51e-18 FALSE    
#>  5 APBB1IP   1.64e-16  3.27e-15 FALSE    
#>  6 ADA       2.81e-15  4.69e-14 FALSE    
#>  7 ADCYAP1R1 5.75e-15  8.21e-14 FALSE    
#>  8 AOC1      2.17e-14  2.71e-13 FALSE    
#>  9 AREG      7.47e-14  8.30e-13 FALSE    
#> 10 ADGRG1    1.39e-12  1.39e-11 FALSE    
#> # ℹ 90 more rows
#> [1] "--------------------------------------"
#> [1] "Protein-protein correlations above 0.7:"
#>   Protein1 Protein2 Correlation
#> 1  ATP5IF1    AIFM1        0.76
#> 2    AXIN1 ARHGEF12        0.76
#> 3    AIFM1  ATP5IF1        0.76
#> 4 ARHGEF12    AXIN1        0.76
#> 5 ARHGEF12    AIFM1        0.71
#> 6    AIFM1 ARHGEF12        0.71
#> [1] "--------------------------------------"
#> [1] "Correlation heatmap:"
#> [1] "--------------------------------------"

qc_res$heatmap