3 Data Cleaning

3.1 Objective

Import and merge the scoring data from four pathologists, standardize column names, and prepare the merged table for analysis.

Note for Pathologist: This document describes a background data-processing step. It consolidates the raw Excel files from the individual pathologists, standardizes terminology (e.g., converting “Score 3” to the numeric value 3), and restricts the analysis to cases that were evaluated by all participating pathologists (the “common cohort”). Restricting to the common cohort means every pathologist is compared on the same set of cases.

3.2 Setup

3.3 Import Data

We expect four Excel files in data/raw, one per pathologist.

[1] "Processing: aiforia breast - CI.xlsx"
[1] "File: aiforia breast - CI.xlsx"
 [1] "BxNo"        "ER%"         "ER%-AI"      "PR%"         "PR%-AI"     
 [6] "HER2"        "HER2-AI"     "Ki67 Sectra" "Ki67-AI"     "Comment"    
[11] "Pathologist"
[1] "Processing: aiforia breast - FA.xlsx"
[1] "File: aiforia breast - FA.xlsx"
 [1] "BxNo"        "ER%"         "ER%-AI"      "PR%"         "PR%-AI"     
 [6] "HER2"        "HER2-AI"     "Ki67 Sectra" "Ki67-AI"     "Comment"    
[11] "Pathologist"
[1] "Processing: aiforia breast - FGS.xlsx"
[1] "File: aiforia breast - FGS.xlsx"
 [1] "BxNo"        "ER%"         "ER%-AI"      "PR%"         "PR%-AI"     
 [6] "HER2"        "HER2-AI"     "Ki67 Sectra" "Ki67-AI"     "Comment"    
[11] "Pathologist"
[1] "Processing: aiforia breast - MO.xlsx"
[1] "File: aiforia breast - MO.xlsx"
 [1] "BxNo"        "ER%"         "ER%-AI"      "PR%"         "PR%-AI"     
 [6] "HER2"        "HER2-AI"     "Ki67 Sectra" "Ki67-AI"     "Comment"    
[11] "Pathologist"
[1] "Here: /Users/serdarbalci/Documents/GitHub/aiforiabreast"
[1] "Files found: 4"
 [1] "BxNo"        "ER%"         "ER%-AI"      "PR%"         "PR%-AI"     
 [6] "HER2"        "HER2-AI"     "Ki67 Sectra" "Ki67-AI"     "Comment"    
[11] "Pathologist"
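
The import step logged above can be sketched roughly as follows. This is a minimal illustration, not the code that produced the log: the helper names `extract_pathologist` and `import_pathologist_files` are hypothetical, the file-name pattern is inferred from the log lines, and the sketch assumes the readxl package is installed and that the Pathologist column is derived from the file name.

```r
# Hypothetical helper: pull the pathologist initials out of a file name such as
# "aiforia breast - CI.xlsx". The naming pattern is inferred from the log above.
extract_pathologist <- function(path) {
  sub("^.* - ([A-Za-z]+)\\.xlsx$", "\\1", basename(path))
}

# Hypothetical importer: read every workbook in `raw_dir`, tag each row with
# the pathologist initials, and stack the rows into one data frame.
# Assumes all workbooks share the same columns, as the log suggests.
import_pathologist_files <- function(raw_dir = "data/raw") {
  files <- list.files(raw_dir, pattern = "\\.xlsx$", full.names = TRUE)
  sheets <- lapply(files, function(f) {
    d <- as.data.frame(readxl::read_excel(f))
    d$Pathologist <- extract_pathologist(f)
    d
  })
  do.call(rbind, sheets)
}

# The helper can be checked without touching any files.
initials <- extract_pathologist("data/raw/aiforia breast - CI.xlsx")
print(initials)
```

Calling `import_pathologist_files()` from the project root would then return one row per case and pathologist, provided the four workbooks are present in data/raw.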

3.4 Load Biopsy Types

Load biopsy type information from a separate Excel file and prepare it for merging with the pathologist data.

[1] "Loaded biopsy types for 819 cases"

Excision  Tru-cut     <NA> 
     450      364        5 
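
A tabulation like the one above, including the recoding of vacuum biopsies noted in the data dictionary, could look like the sketch below. The data frame is toy data and its column names are assumptions chosen for illustration.

```r
# Toy lookup table; the values and case identifiers are illustrative only.
biopsy <- data.frame(
  case_id = c("A1", "A2", "A3", "A4"),
  biopsy_type = c("Excision", "Vacuum", "Tru-cut", NA),
  stringsAsFactors = FALSE
)

# Fold vacuum biopsies into the Tru-cut group.
# which() skips the NA rows so that missing values stay missing.
biopsy$biopsy_type[which(biopsy$biopsy_type == "Vacuum")] <- "Tru-cut"

# useNA = "ifany" keeps cases without a biopsy type visible in the counts.
counts <- table(biopsy$biopsy_type, useNA = "ifany")
print(counts)
```

Keeping the missing category in the table makes it easy to see how many cases would lose their biopsy type after merging.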

3.5 Standardize Data

Define the expected column names and clean the data.
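
As a rough sketch of this step, the block below renames already-cleaned columns to the names used in the data dictionary and converts text scores such as "Score 3" to numbers. The mapping in `name_map` is an assumption inferred from the column lists in this document, and `rename_columns` and `parse_score` are hypothetical helpers, not the project's actual functions.

```r
# Assumed mapping from cleaned file columns to analysis column names.
name_map <- c(
  bx_no = "case_id",
  er_percent = "er_pre", er_percent_ai = "er_post",
  pr_percent = "pr_pre", pr_percent_ai = "pr_post",
  her2 = "her2_pre", her2_ai = "her2_post",
  ki67_sectra = "ki67_pre", ki67_ai = "ki67_post"
)

# Rename only the columns that appear in the mapping; leave the rest untouched.
rename_columns <- function(d, map) {
  hit <- names(d) %in% names(map)
  names(d)[hit] <- unname(map[names(d)[hit]])
  d
}

# Keep the first number found in each entry; entries without digits become NA.
parse_score <- function(x) {
  x <- as.character(x)
  out <- rep(NA_real_, length(x))
  pos <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  found <- !is.na(pos) & pos > -1
  out[found] <- as.numeric(regmatches(x, pos))
  out
}

# Toy input using the cleaned column names.
raw <- data.frame(
  bx_no = c("B1", "B2", "B3"),
  her2 = c("Score 3", "2", NA),
  comment = c("", "x", ""),
  stringsAsFactors = FALSE
)
clean <- rename_columns(raw, name_map)
clean$her2_pre <- parse_score(clean$her2_pre)
print(clean)
```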

[1] "Columns after clean_names:"
 [1] "bx_no"         "er_percent"    "er_percent_ai" "pr_percent"   
 [5] "pr_percent_ai" "her2"          "her2_ai"       "ki67_sectra"  
 [9] "ki67_ai"       "comment"       "pathologist"  
[1] "Running filter: Common Cases Only (from /Users/serdarbalci/Documents/GitHub/aiforiabreast/output/exclusions/common_cases_list.csv )"
[1] "Filtered to common cohort. N observations: 1184"
[1] "Unique cases: 296"
[1] "Biopsy types merged. Cases with biopsy type: 1184"
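
The common-cohort filter can be illustrated with toy data. The pipeline itself reads the case list from common_cases_list.csv; the sketch below instead derives the list directly from the scores, which is an assumption about how such a list could be built. Column names follow the data dictionary in lower case.

```r
# Toy scores: case B2 was not read by pathologist FGS.
scores <- data.frame(
  case_id = c("B1", "B1", "B1", "B1", "B2", "B2", "B2", "B3", "B3", "B3", "B3"),
  pathologist = c("CI", "FA", "FGS", "MO", "CI", "FA", "MO", "CI", "FA", "FGS", "MO"),
  stringsAsFactors = FALSE
)

# Count how many distinct pathologists evaluated each case.
n_pathologists <- length(unique(scores$pathologist))
readers_per_case <- tapply(
  scores$pathologist, scores$case_id,
  function(p) length(unique(p))
)

# Keep only the cases evaluated by every pathologist.
common_cases <- names(readers_per_case)[readers_per_case == n_pathologists]
common <- scores[scores$case_id %in% common_cases, ]

# Sanity check: with one row per case and pathologist, the row count equals
# the number of common cases times the number of pathologists.
stopifnot(nrow(common) == length(common_cases) * n_pathologists)
print(common_cases)
```

The same check applies to the log above: 296 unique cases read by 4 pathologists gives 296 × 4 = 1184 observations.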

3.6 Data Dictionary

Column                  Description
case_id                 Biopsy Number
Pathologist             Pathologist Identifier
er_pre                  ER % (Pathologist only)
er_post                 ER % (Pathologist + AI)
pr_pre                  PR % (Pathologist only)
pr_post                 PR % (Pathologist + AI)
her2_pre                HER2 Score (0-3) (Pathologist only)
her2_post               HER2 Score (0-3) (Pathologist + AI)
ki67_pre                Ki67 % (Pathologist only)
ki67_post               Ki67 % (Pathologist + AI)
er_pre_cat              ER Category (Negative, Low, Positive)
pr_pre_cat              PR Category (Negative, Low, Positive)
ki67_pre_cat20          Ki67 Category (<20%, >=20%)
ki67_pre_cat30          Ki67 Category (<30%, >=30%)
molecular_subtype_pre   Molecular Subtype (Pre-AI)
molecular_subtype_post  Molecular Subtype (Post-AI)
biopsy_type             Biopsy Type (Excision, Tru-cut); Vacuum recoded as Tru-cut
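
The two Ki67 categories in the dictionary can be derived from the percentage with `cut()`. This is a minimal sketch on made-up values; only the 20% and 30% cut-offs come from the dictionary, and how the project actually derives these columns is not shown in this document.

```r
# Made-up Ki67 percentages, including boundary values and a missing entry.
ki67_pre <- c(5, 19.9, 20, 29, 30, 85, NA)

# right = FALSE closes each interval on the left, so a value of exactly 20
# lands in ">=20%" and a value of exactly 30 lands in ">=30%".
ki67_pre_cat20 <- cut(
  ki67_pre, breaks = c(-Inf, 20, Inf),
  labels = c("<20%", ">=20%"), right = FALSE
)
ki67_pre_cat30 <- cut(
  ki67_pre, breaks = c(-Inf, 30, Inf),
  labels = c("<30%", ">=30%"), right = FALSE
)

print(data.frame(ki67_pre, ki67_pre_cat20, ki67_pre_cat30))
```

Missing percentages stay missing in both categories, so they are not silently assigned to the low group.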