14  Data Quality Checks and Parsing

14.1 Aim

Add robust QC before agreement or impact analyses: completeness reporting, safer parsing of HER2 strings, NA-aware molecular typing, and traceable logs.

Note for Pathologist: This is our internal checklist. We confirm that every case has a valid ID, that every score lies between 0 and 100, and that no “Test” or “Training” cases are accidentally included. If this document is clean, the results in the other documents rest on sound input data.

14.2 Completeness and QC Summary

# A tibble: 4 × 11
  file                n_rows non_na_bx_no non_na_er_percent non_na_er_percent_ai
  <chr>                <int>        <int>             <int>                <int>
1 aiforia breast - C…    310          310               308                  306
2 aiforia breast - F…    311          311               311                  307
3 aiforia breast - F…    312          312               312                  312
4 aiforia breast - M…    317          317               317                  314
# ℹ 6 more variables: non_na_pr_percent <int>, non_na_pr_percent_ai <int>,
#   non_na_her2 <int>, non_na_her2_ai <int>, non_na_ki67_sectra <int>,
#   non_na_ki67_ai <int>

Note for Pathologist: The table above shows how many non-missing values each variable has in each raw data file. If a column has fewer non-missing entries than the file has rows, values are missing; for example, the first file has 310 rows but only 308 non-missing er_percent values, so 2 are missing. Some missingness is expected for certain fields, but large amounts of missing data could bias the analysis.
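
The completeness counts described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code that produced the table. The helper names (is_missing, count_non_missing), the set of strings treated as missing, and the three example rows are assumptions.

```python
def is_missing(value):
    """Treat None, NaN, empty strings and the literal 'NA' as missing (assumed convention)."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only value unequal to itself
        return True
    return isinstance(value, str) and value.strip() in {"", "NA"}


def count_non_missing(rows, columns):
    """Return the row count plus one non_na_<column> count per requested column."""
    summary = {"n_rows": len(rows)}
    for col in columns:
        summary[f"non_na_{col}"] = sum(1 for row in rows if not is_missing(row.get(col)))
    return summary


# Hypothetical rows standing in for one raw data file.
rows = [
    {"bx_no": "B1", "er_percent": "90", "er_percent_ai": "88"},
    {"bx_no": "B2", "er_percent": "NA", "er_percent_ai": ""},
    {"bx_no": "B3", "er_percent": "5", "er_percent_ai": None},
]
summary = count_non_missing(rows, ["bx_no", "er_percent", "er_percent_ai"])
print(summary)
```

Comparing each non_na_ count with n_rows gives the number of missing values per column, which is how the table above is read.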

14.3 HER2 Parsing Helper
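
One possible shape for such a helper is sketched below in Python, for illustration only. It assumes HER2 is recorded as free text containing an IHC score of 0, 1+, 2+ or 3+; the function name parse_her2 and the regular expression are hypothetical, and the patterns that actually occur in the raw files may need additional rules.

```python
import re

# A single digit 0-3, optionally followed by "+", that is not part of a longer
# number, a decimal, or a word such as "HER2" (assumed format, not confirmed by the data).
_HER2_SCORE = re.compile(r"(?<![\w.])([0-3])\s*\+?(?!\.?\d)")


def parse_her2(raw):
    """Return the HER2 score as an int in 0..3, or None when it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text or text.upper() == "NA":
        return None
    match = _HER2_SCORE.search(text)
    return int(match.group(1)) if match else None


for example in ["3+", " 2 + ", "HER2 3+", "0", "12", "negative", "NA", None]:
    print(repr(example), "->", parse_her2(example))
```

Returning None for anything unparseable, rather than guessing a score, keeps unparsed strings visible as missing values in the later QC counts.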

14.4 Molecular Subtype with Explicit NA Handling
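
The idea of explicit NA handling can be sketched as follows, in Python and for illustration only. The cutoffs (hormone receptor positive at 1% or more, Ki-67 high at 20% or more), the rule that only HER2 3+ counts as positive, the subtype labels, and the function name molecular_subtype are all assumptions and may differ from the definitions used in this study. The point of the sketch is that a missing or equivocal input yields None instead of being silently treated as negative.

```python
def molecular_subtype(er, pr, her2, ki67, hr_cutoff=1.0, ki67_cutoff=20.0):
    """Assign a surrogate subtype, or None when a required input is missing.

    er, pr, ki67 are percentages (0-100); her2 is an IHC score (0-3).
    All cutoffs are illustrative assumptions.
    """
    if er is None or pr is None or her2 is None:
        return None  # never default a missing marker to "negative"
    if her2 == 2:
        return None  # equivocal without ISH; left unclassified in this sketch
    hr_positive = er >= hr_cutoff or pr >= hr_cutoff
    her2_positive = her2 == 3
    if not hr_positive:
        return "HER2-enriched" if her2_positive else "Triple-negative"
    if her2_positive:
        return "Luminal B (HER2+)"
    if ki67 is None:
        return None  # Ki-67 is needed to separate Luminal A from Luminal B
    return "Luminal B (HER2-)" if ki67 >= ki67_cutoff else "Luminal A"


print(molecular_subtype(90, 80, 0, 5))
print(molecular_subtype(None, 80, 0, 5))
```

Cases that return None can then be counted and reported separately, so the size of the unclassified group is visible in the QC output.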

14.5 Flagging Potential Issues

# A tibble: 1 × 5
  total er_in_range pr_in_range ki67_in_range her2_valid
  <int>       <int>       <int>         <int>      <int>
1  1184        1175        1159          1162       1073
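
Counts of this kind can be produced by a check along the following lines. This is a minimal Python sketch for illustration, not the code behind the table. It assumes percentages are valid between 0 and 100, HER2 is valid when it is an integer score from 0 to 3, and missing values count as not valid; the function names and example rows are hypothetical.

```python
def in_range(value, low=0.0, high=100.0):
    """True when value is present and lies within [low, high]; NaN and None fail."""
    return value is not None and low <= value <= high


def qc_flags(rows):
    """Count, per check, how many rows pass (missing values count as failures)."""
    return {
        "total": len(rows),
        "er_in_range": sum(in_range(r.get("er_percent")) for r in rows),
        "pr_in_range": sum(in_range(r.get("pr_percent")) for r in rows),
        "ki67_in_range": sum(in_range(r.get("ki67_sectra")) for r in rows),
        "her2_valid": sum(r.get("her2") in {0, 1, 2, 3} for r in rows),
    }


# Hypothetical cases: one clean, one with an impossible ER value and missing PR,
# one with a negative Ki-67 and missing HER2.
cases = [
    {"er_percent": 90, "pr_percent": 80, "ki67_sectra": 15, "her2": 3},
    {"er_percent": 120, "pr_percent": None, "ki67_sectra": 40, "her2": 1},
    {"er_percent": 0, "pr_percent": 5, "ki67_sectra": -1, "her2": None},
]
flags = qc_flags(cases)
print(flags)
```

The difference between total and each count is the number of rows that need review for that variable.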

14.6 Logging

  • Print qc_counts and qc_flags in the render log to document completeness.
  • Consider writing a CSV of flagged rows to data/processed/qc_flags.csv for traceability.
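
Writing flagged rows to a CSV could look like the sketch below, in Python and for illustration only. The function name write_flagged_rows, the column names, and the example rows are assumptions. The example writes under a temporary directory so that running it does not touch the project; in the report the target would be data/processed/qc_flags.csv.

```python
import csv
import tempfile
from pathlib import Path


def write_flagged_rows(rows, path, fieldnames):
    """Write flagged rows to a CSV file, creating parent folders; return the row count."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical flagged rows: case ID plus the reason it was flagged.
flagged = [
    {"bx_no": "B2", "issue": "er_percent out of range"},
    {"bx_no": "B3", "issue": "her2 not parseable"},
]
out_path = Path(tempfile.mkdtemp()) / "data" / "processed" / "qc_flags.csv"
n_written = write_flagged_rows(flagged, out_path, ["bx_no", "issue"])
print(f"wrote {n_written} flagged rows to {out_path}")
```

Recording the reason alongside the case ID means each flagged row can be traced back to the check that raised it.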