# A tibble: 4 × 11
  file                n_rows non_na_bx_no non_na_er_percent non_na_er_percent_ai
  <chr>                <int>        <int>             <int>                <int>
1 aiforia breast - C…    310          310               308                  306
2 aiforia breast - F…    311          311               311                  307
3 aiforia breast - F…    312          312               312                  312
4 aiforia breast - M…    317          317               317                  314
# ℹ 6 more variables: non_na_pr_percent <int>, non_na_pr_percent_ai <int>,
#   non_na_her2 <int>, non_na_her2_ai <int>, non_na_ki67_sectra <int>,
#   non_na_ki67_ai <int>
14 Data Quality Checks and Parsing
14.1 Aim
Add robust QC before agreement or impact analyses: completeness reporting, safer parsing of HER2 strings, NA-aware molecular typing, and traceable logs.
Note for Pathologist: This is our internal checklist. We check that every case has a valid ID, that every percentage score lies between 0 and 100, and that no “Test” or “Training” cases are accidentally included. If this document is clean, the data feeding the other documents has passed these checks.
14.2 Completeness and QC Summary
Note for Pathologist: The table above shows how many non-missing values exist for each variable in each raw data file. If a column has fewer non-missing entries than total rows, some data points are missing. For example, the first file has 310 rows but only 308 non-missing er_percent values and 306 non-missing er_percent_ai values. Some missingness is expected for certain fields, but large amounts of missing data could bias the analysis.
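The code that produced the completeness table is not shown in this rendering. As a minimal sketch of how such counts could be computed in base R, the hypothetical helper `count_non_na()` below tallies rows and non-missing values per column for one data frame, and the results are stacked across files. The helper name and the toy data are assumptions for illustration, not the project's actual code or data.

```r
# Hypothetical helper (not the project's actual code): one summary row per file,
# with the row count and the number of non-missing values in every column.
count_non_na <- function(df, file_label) {
  counts <- vapply(df, function(col) sum(!is.na(col)), integer(1))
  names(counts) <- paste0("non_na_", names(counts))
  cbind(
    data.frame(file = file_label, n_rows = nrow(df), stringsAsFactors = FALSE),
    as.data.frame(as.list(counts))
  )
}

# Toy data standing in for a raw file (illustrative values only).
toy <- data.frame(
  bx_no         = c("B1", "B2", "B3"),
  er_percent    = c(90, NA, 10),
  er_percent_ai = c(88, NA, NA)
)

# Stack the per-file summaries into one table.
qc_counts_toy <- do.call(rbind, list(
  count_non_na(toy, "toy_file_1"),
  count_non_na(toy[1:2, ], "toy_file_2")
))
print(qc_counts_toy)
```

Any column where the non-missing count is below `n_rows` has missing values in that file.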
14.3 HER2 Parsing Helper
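The helper itself is not visible in this rendering. One way a conservative HER2 parser could look is sketched below, assuming HER2 is recorded as free-text IHC scores such as "0", "1+", "2 +" or "3+". The function name `parse_her2()` and the accepted formats are assumptions. The sketch accepts only strings that consist entirely of a score from 0 to 3 with an optional plus sign, and returns NA for everything else, so unexpected strings are never silently coerced into a score.

```r
# Hypothetical strict parser (an illustration, not the project's helper).
# Returns an integer score 0-3, or NA when the string is not a clean IHC score.
parse_her2 <- function(x) {
  x <- as.character(x)
  # Whole string must be: optional spaces, one digit 0-3, optional "+", optional spaces.
  pattern <- "^\\s*([0-3])\\s*\\+?\\s*$"
  score <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
  as.integer(score)
}

raw_her2 <- c("0", "1+", "2 +", " 3+", "neg", "", NA, "4+", "20")
parsed_her2 <- parse_her2(raw_her2)
print(data.frame(raw = raw_her2, parsed = parsed_her2))
```

Anchoring the pattern at both ends is the design choice that matters here: a value such as "20" or "4+" yields NA instead of being read as a score of 2 or rejected only later.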
14.4 Molecular Subtype with Explicit NA Handling
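The subtyping code is hidden in this rendering, so the following is only a sketch of what NA-aware assignment could look like. Every threshold and label is an assumption made for illustration: hormone receptor positivity at 1% or more for ER or PR, a Ki-67 cutoff of 20%, HER2 positive at IHC 3+, and HER2 2+ treated as unresolved (NA) because it would need in situ hybridization. The function name `assign_subtype()` is hypothetical. The point of the sketch is that a case receives a label only when every input needed for that label is known.

```r
# Hypothetical NA-aware subtype assignment; cutoffs and labels are assumptions.
assign_subtype <- function(er, pr, her2, ki67, hr_cutoff = 1, ki67_cutoff = 20) {
  # R uses three-valued logic: NA | TRUE is TRUE, but NA | FALSE stays NA,
  # so HR status is unknown unless one marker is positive or both are known negative.
  hr_pos    <- (er >= hr_cutoff) | (pr >= hr_cutoff)
  her2_pos  <- ifelse(her2 == 2L, NA, her2 == 3L)  # 2+ treated as unresolved
  ki67_high <- ki67 >= ki67_cutoff

  subtype <- rep(NA_character_, length(er))
  # which() drops NA conditions, so undecidable cases keep their NA label.
  subtype[which(hr_pos & !her2_pos & !ki67_high)] <- "Luminal A-like"
  subtype[which(hr_pos & !her2_pos & ki67_high)]  <- "Luminal B-like (HER2-)"
  subtype[which(hr_pos & her2_pos)]               <- "Luminal B-like (HER2+)"
  subtype[which(!hr_pos & her2_pos)]              <- "HER2-positive (non-luminal)"
  subtype[which(!hr_pos & !her2_pos)]             <- "Triple-negative"
  subtype
}

subtypes_toy <- assign_subtype(
  er   = c(90, 80, 95, 0, 0, NA, 90, NA),
  pr   = c(50, 10, 60, 0, 0, 0, 50, 40),
  her2 = c(0L, 1L, 3L, 3L, 0L, 0L, 2L, 0L),
  ki67 = c(5, 40, 30, 50, 70, 10, 10, NA)
)
print(subtypes_toy)
```

In the toy input, the last three cases stay NA for three different reasons: ER is missing while PR is negative, HER2 is 2+, and Ki-67 is missing for an HR-positive, HER2-negative case.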
14.5 Flagging Potential Issues
# A tibble: 1 × 5
  total er_in_range pr_in_range ki67_in_range her2_valid
  <int>       <int>       <int>         <int>      <int>
1  1184        1175        1159          1162       1073
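A flag summary with these columns could be computed along the following lines. This is a sketch under stated assumptions rather than the hidden original: the input column names (`er_percent`, `pr_percent`, `ki67_sectra`, `her2`) are guessed from the completeness table, a missing value is counted as not in range, and a valid HER2 value is taken to be an integer score from 0 to 3. The helper names `in_range()` and `flag_summary()` are hypothetical.

```r
# Hypothetical range check: TRUE only for non-missing values within [lo, hi].
in_range <- function(x, lo = 0, hi = 100) {
  !is.na(x) & x >= lo & x <= hi
}

# Hypothetical one-row summary of how many cases pass each check.
flag_summary <- function(df) {
  data.frame(
    total         = nrow(df),
    er_in_range   = sum(in_range(df$er_percent)),
    pr_in_range   = sum(in_range(df$pr_percent)),
    ki67_in_range = sum(in_range(df$ki67_sectra)),
    her2_valid    = sum(df$her2 %in% 0:3)  # NA and out-of-scale scores count as invalid
  )
}

# Toy data with deliberate problems (illustrative values only).
toy_cases <- data.frame(
  er_percent  = c(50, 101, NA, 0),
  pr_percent  = c(-1, 20, 30, 100),
  ki67_sectra = c(10, 20, 30, 150),
  her2        = c(0L, 3L, NA, 5L)
)

qc_flags_toy <- flag_summary(toy_cases)
print(qc_flags_toy)
```

Under these assumptions, the gap between `total` and each count is the number of cases that are either missing or out of range for that marker.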
14.6 Logging
- Print qc_counts and qc_flags in the render log to document completeness.
- Consider writing a CSV of flagged rows to data/processed/qc_flags.csv for traceability.
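The two logging steps above can be sketched as follows. The helper `write_qc_flags()`, the flag column name and the toy data are assumptions for illustration. The example writes to a temporary directory so that it runs anywhere; in the project the path would be data/processed/qc_flags.csv.

```r
# Hypothetical helper: keep rows that fail a logical check and write them to CSV.
write_qc_flags <- function(df, flag_col, path) {
  flagged <- df[!df[[flag_col]], , drop = FALSE]
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  utils::write.csv(flagged, path, row.names = FALSE)
  # message() output is captured in the render log by default.
  message("QC: wrote ", nrow(flagged), " flagged row(s) to ", path)
  invisible(flagged)
}

# Toy data (illustrative values only); the flag marks in-range ER percentages.
toy_log <- data.frame(bx_no = c("B1", "B2", "B3"), er_percent = c(50, 101, NA))
toy_log$er_in_range <- !is.na(toy_log$er_percent) &
  toy_log$er_percent >= 0 & toy_log$er_percent <= 100

print(toy_log)  # printed tables also appear in the rendered output

out_path <- file.path(tempdir(), "qc_flags.csv")
flagged_toy <- write_qc_flags(toy_log, flag_col = "er_in_range", path = out_path)
```

Keeping the case identifier (`bx_no` here) in the CSV is what makes the log traceable: each flagged row can be matched back to its source record.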