Objective
Import and merge data from 4 pathologists, standardize column names, and prepare for analysis.
Note for Pathologist: This document is a background data-processing step. It consolidates the raw Excel files from the individual pathologists, standardizes the terminology (e.g., converting “Score 3” to the numeric value 3), and restricts the analysis to cases that were evaluated by all participating pathologists (the “common cohort”), so that every comparison in the study is made on the same set of cases.
Import Data
We expect 4 Excel files in data/raw.
[1] "Processing: aiforia breast - CI.xlsx"
[1] "File: aiforia breast - CI.xlsx"
[1] "BxNo" "ER%" "ER%-AI" "PR%" "PR%-AI"
[6] "HER2" "HER2-AI" "Ki67 Sectra" "Ki67-AI" "Comment"
[11] "Pathologist"
[1] "Processing: aiforia breast - FA.xlsx"
[1] "File: aiforia breast - FA.xlsx"
[1] "BxNo" "ER%" "ER%-AI" "PR%" "PR%-AI"
[6] "HER2" "HER2-AI" "Ki67 Sectra" "Ki67-AI" "Comment"
[11] "Pathologist"
[1] "Processing: aiforia breast - FGS.xlsx"
[1] "File: aiforia breast - FGS.xlsx"
[1] "BxNo" "ER%" "ER%-AI" "PR%" "PR%-AI"
[6] "HER2" "HER2-AI" "Ki67 Sectra" "Ki67-AI" "Comment"
[11] "Pathologist"
[1] "Processing: aiforia breast - MO.xlsx"
[1] "File: aiforia breast - MO.xlsx"
[1] "BxNo" "ER%" "ER%-AI" "PR%" "PR%-AI"
[6] "HER2" "HER2-AI" "Ki67 Sectra" "Ki67-AI" "Comment"
[11] "Pathologist"
[1] "Here: /Users/serdarbalci/Documents/GitHub/aiforiabreast"
[1] "BxNo" "ER%" "ER%-AI" "PR%" "PR%-AI"
[6] "HER2" "HER2-AI" "Ki67 Sectra" "Ki67-AI" "Comment"
[11] "Pathologist"
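The log above suggests that each workbook is read in turn, tagged with the pathologist's initials, and then stacked into one table. The following is a minimal base R sketch of that pattern, not the report's actual code. It assumes the initials are the token after " - " in the file name; `pathologist_from_file` and `combine_readers` are hypothetical helper names, and small in-memory data frames with made-up values stand in for the Excel sheets.

```r
# Hypothetical helper: pull the reader initials out of a file name such as
# "aiforia breast - CI.xlsx" (assumes the initials follow the last " - ").
pathologist_from_file <- function(path) {
  sub("^.* - ([^.]+)\\.xlsx$", "\\1", basename(path))
}

# Hypothetical helper: tag each sheet with its reader, then stack the sheets.
# `sheets` is a named list of data frames keyed by file name.
combine_readers <- function(sheets) {
  tagged <- lapply(names(sheets), function(f) {
    df <- sheets[[f]]
    df$Pathologist <- pathologist_from_file(f)
    df
  })
  do.call(rbind, tagged)
}

# Toy stand-ins for two of the workbooks (values are made up).
sheets <- list(
  "aiforia breast - CI.xlsx" = data.frame(BxNo = c("A1", "A2"),
                                          HER2 = c("Score 3", "Score 0")),
  "aiforia breast - FA.xlsx" = data.frame(BxNo = c("A1", "A2"),
                                          HER2 = c("Score 2", "Score 0"))
)

combined <- combine_readers(sheets)
print(table(combined$Pathologist))
```

Stacking with `rbind` only works when every sheet has identical column names, which is why the per-file column listings above are worth checking before merging.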
Load Biopsy Types
Load biopsy type information from a separate Excel file and prepare for merging.
[1] "Loaded biopsy types for 819 cases"
Excision  Tru-cut     <NA>
     450      364        5
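As a rough illustration of how such a tally can be produced, the base R sketch below recodes vacuum biopsies as Tru-cut (the recoding named in the data dictionary) and counts the types with missing values kept visible. The column names `case_id` and `biopsy_type` and all values are assumptions made for the example; the layout of the real biopsy-type file is not shown in this report.

```r
# Toy biopsy-type lookup (made-up values; the real file covers 819 cases).
biopsy <- data.frame(
  case_id = c("A1", "A2", "A3", "A4"),
  biopsy_type = c("Excision", "Tru-cut", "Vacuum", NA),
  stringsAsFactors = FALSE
)

# Recode vacuum biopsies as Tru-cut; which() skips the missing entries.
biopsy$biopsy_type[which(biopsy$biopsy_type == "Vacuum")] <- "Tru-cut"

# useNA = "ifany" adds an <NA> column to the tally when types are missing.
counts <- table(biopsy$biopsy_type, useNA = "ifany")
print(counts)
```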
Standardize Data
Define the expected column names and clean the data.
[1] "Columns after clean_names:"
[1] "bx_no" "er_percent" "er_percent_ai" "pr_percent"
[5] "pr_percent_ai" "her2" "her2_ai" "ki67_sectra"
[9] "ki67_ai" "comment" "pathologist"
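The renaming shown above can be approximated in base R as follows. This is a sketch under stated assumptions: `to_snake` is a hypothetical stand-in that only imitates the kind of snake_case cleaning the `clean_names` output suggests, and `parse_score` is a hypothetical helper for the “Score 3” to 3 conversion mentioned in the note for pathologists. Neither is taken from the report's code.

```r
# Hypothetical approximation of snake_case name cleaning.
to_snake <- function(x) {
  x <- gsub("%", "_percent", x, fixed = TRUE)   # "ER%"  -> "ER_percent"
  x <- gsub("([a-z])([A-Z])", "\\1_\\2", x)     # "BxNo" -> "Bx_No"
  x <- gsub("[^A-Za-z0-9]+", "_", x)            # spaces, hyphens -> "_"
  x <- gsub("^_+|_+$", "", x)                   # trim stray underscores
  tolower(x)
}

# Hypothetical helper: turn "Score 3" (or a bare "3") into the number 3.
# Anything that is not numeric after removing the prefix becomes NA.
parse_score <- function(x) {
  as.numeric(sub("^[[:space:]]*[Ss]core[[:space:]]*", "", as.character(x)))
}

raw_names <- c("BxNo", "ER%", "ER%-AI", "PR%", "PR%-AI", "HER2", "HER2-AI",
               "Ki67 Sectra", "Ki67-AI", "Comment", "Pathologist")
print(to_snake(raw_names))
print(parse_score(c("Score 3", "0", "Score 1")))
```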
[1] "Running filter: Common Cases Only (from /Users/serdarbalci/Documents/GitHub/aiforiabreast/output/exclusions/common_cases_list.csv )"
[1] "Filtered to common cohort. N observations: 1184"
[1] "Unique cases: 296"
[1] "Biopsy types merged. Cases with biopsy type: 1184"
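The counts above are internally consistent: 296 unique cases read by 4 pathologists give 296 × 4 = 1184 observations. The report filters against a saved list (common_cases_list.csv); the base R sketch below instead derives the common cohort directly from toy data and then attaches biopsy types with a left join. All values are made up, and the approach is an illustration rather than the report's implementation.

```r
# Toy long-format data: one row per case per pathologist (made-up values).
reads <- data.frame(
  bx_no = c("A1", "A1", "A1", "A2", "A2", "A3", "A3", "A3"),
  pathologist = c("CI", "FA", "MO", "CI", "FA", "CI", "FA", "MO"),
  stringsAsFactors = FALSE
)

# A case belongs to the common cohort when every pathologist has read it.
n_readers <- length(unique(reads$pathologist))
reads_per_case <- tapply(reads$pathologist, reads$bx_no,
                         function(p) length(unique(p)))
common_cases <- names(reads_per_case)[reads_per_case == n_readers]
cohort <- reads[reads$bx_no %in% common_cases, ]

# Left join: every cohort row is kept even if its biopsy type is missing.
biopsy <- data.frame(bx_no = c("A1", "A3"),
                     biopsy_type = c("Excision", "Tru-cut"),
                     stringsAsFactors = FALSE)
cohort <- merge(cohort, biopsy, by = "bx_no", all.x = TRUE)

print(nrow(cohort))                  # observations in the common cohort
print(length(unique(cohort$bx_no)))  # unique cases in the common cohort
```

In this toy example case A2 is dropped because only two of the three pathologists read it.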
Data Dictionary
| Variable | Description |
|---|---|
| case_id | Biopsy Number |
| Pathologist | Pathologist Identifier |
| er_pre | ER % (Pathologist only) |
| er_post | ER % (Pathologist + AI) |
| pr_pre | PR % (Pathologist only) |
| pr_post | PR % (Pathologist + AI) |
| her2_pre | HER2 Score (0-3) (Pathologist only) |
| her2_post | HER2 Score (0-3) (Pathologist + AI) |
| ki67_pre | Ki67 % (Pathologist only) |
| ki67_post | Ki67 % (Pathologist + AI) |
| er_pre_cat | ER Category (Negative, Low, Positive) |
| pr_pre_cat | PR Category (Negative, Low, Positive) |
| ki67_pre_cat20 | Ki67 Category (<20%, >=20%) |
| ki67_pre_cat30 | Ki67 Category (<30%, >=30%) |
| molecular_subtype_pre | Molecular Subtype (Pre-AI) |
| molecular_subtype_post | Molecular Subtype (Post-AI) |
| biopsy_type | Biopsy Type (Excision, Tru-cut); Vacuum recoded as Tru-cut |
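To make the derived variables concrete, the base R sketch below renames cleaned columns to the dictionary names and builds the categorical versions. Several parts are assumptions and are marked as such in the comments: the mapping from cleaned names to dictionary names (for example `ki67_sectra` to `ki67_pre`) is inferred from the column listings, and the ER/PR cut-offs (<1% Negative, 1-10% Low, >10% Positive) are a common convention that this report does not state. Only the Ki67 cut-offs of 20% and 30% come from the dictionary itself. All function names and data values are hypothetical.

```r
# ASSUMED mapping from cleaned column names to the dictionary names.
rename_map <- c(
  bx_no = "case_id",
  er_percent = "er_pre", er_percent_ai = "er_post",
  pr_percent = "pr_pre", pr_percent_ai = "pr_post",
  her2 = "her2_pre", her2_ai = "her2_post",
  ki67_sectra = "ki67_pre", ki67_ai = "ki67_post"
)

# Rename only the columns that appear in the map; leave the rest untouched.
rename_columns <- function(df, map) {
  hit <- names(df) %in% names(map)
  names(df)[hit] <- unname(map[names(df)[hit]])
  df
}

# ER/PR category. ASSUMED cut-offs: <1 Negative, 1-10 Low, >10 Positive.
categorise_receptor <- function(pct) {
  out <- ifelse(pct < 1, "Negative", ifelse(pct <= 10, "Low", "Positive"))
  factor(out, levels = c("Negative", "Low", "Positive"))
}

# Ki67 category at a given cut-off (20 or 30 in the data dictionary).
categorise_ki67 <- function(pct, cutoff) {
  low <- paste0("<", cutoff, "%")
  high <- paste0(">=", cutoff, "%")
  factor(ifelse(pct < cutoff, low, high), levels = c(low, high))
}

# Toy data (made-up values).
df <- data.frame(bx_no = c("A1", "A2", "A3"),
                 er_percent = c(0, 5, 90),
                 ki67_sectra = c(10, 20, 35))
df <- rename_columns(df, rename_map)
df$er_pre_cat <- categorise_receptor(df$er_pre)
df$ki67_pre_cat20 <- categorise_ki67(df$ki67_pre, 20)
df$ki67_pre_cat30 <- categorise_ki67(df$ki67_pre, 30)
print(df)
```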