2  Materials and Methods

2.1 Study Design

This retrospective, within-subjects study evaluated the impact of artificial intelligence (AI) assistance on interobserver agreement and systematic bias in breast cancer biomarker assessment. The study employed a two-phase design where each pathologist evaluated the same cases twice:

  1. Phase 1 (Pre-AI): Manual assessment of immunohistochemistry (IHC) without AI assistance
  2. Phase 2 (Post-AI): Repeat assessment of the same cases with AI decision support from Aiforia® Breast Cancer modules

The within-subjects design allows each pathologist to serve as their own control, eliminating between-pathologist variability and increasing statistical power to detect AI effects. For each case, pathologists first completed their assessment in Sectra PACS (Phase 1) and then reviewed the same case with Aiforia® AI assistance (Phase 2) in a sequential, case-by-case workflow.

Study Period: February 2024 – June 2024

Study Setting: Memorial Hospitals Group, Department of Pathology, Turkey (multi-site tertiary care academic medical center)

Primary Objective: Assess whether AI assistance improves interobserver agreement for breast cancer biomarker quantification (ER, PR, HER2, Ki-67)

Secondary Objectives:
1. Detect systematic bias introduced by AI (over- or underestimation)
2. Quantify clinical impact on molecular subtype classification
3. Analyze individual pathologist AI adoption patterns
4. Identify cases where AI increases rather than decreases disagreement

2.2 Study Population and Sampling

2.2.1 Case Selection

Inclusion Criteria:
- Consecutive invasive breast carcinoma cases
- Adequate tissue for all four biomarkers (ER, PR, HER2, Ki-67)
- Digital whole slide images available (scanned at 40× magnification)
- Cases from routine clinical workflow (representative sample)

Exclusion Criteria:
- In situ carcinoma only (no invasive component)
- Insufficient tissue for complete biomarker panel
- Technical failures in IHC staining (inadequate controls, staining artifacts)
- Cases with excessive crush artifact or poor fixation
- Neoadjuvant therapy-treated specimens (altered biomarker expression)

Sample Size:
- Target sample based on power analysis: 69 cases (see Section 2.3)
- Actual enrolled: 296 cases (>4× the target, to accommodate exclusions)
- Final analyzed cohort: 296 cases after quality control exclusions

2.2.2 Participants (Pathologists)

Number of Pathologists: 4 board-certified anatomic pathologists

Anonymization: Pathologists anonymized as Pathologist A, B, C, and D for analysis

Experience Level (if available):
- Years in practice: Range [X-Y years]
- Breast pathology subspecialty training: [N] pathologists
- Prior digital pathology experience: All pathologists (>2 years)
- Prior AI exposure: Minimal (introductory training only before study)

Recruitment: All pathologists routinely sign out breast cases using digital pathology at Memorial Hospitals Group and volunteered to participate in the study.

Training:
- Pre-study training session (2 hours) on Aiforia® Breast Cancer modules
- Practice cases (N=10) to familiarize with AI interface and interpretation
- No additional training between Phase 1 and Phase 2 to avoid learning effects

2.3 Sample Size and Statistical Power

A priori power analysis was performed to determine the required sample size for detecting clinically meaningful improvements in interobserver agreement (The jamovi project 2024; R Core Team 2024; Balci 2022; Rotondi 2022).

Design Parameters:
- Observers: 4 pathologists (AI treated as independent assessment tool, not fifth observer)
- Expected Effect Size:
  - Null hypothesis (H₀): κ₀ = 0.4 (moderate agreement)
  - Alternative hypothesis (H₁): κ₁ = 0.6 (substantial agreement)
  - Minimum detectable difference: Δκ = 0.2
- Statistical Parameters:
  - Type I error rate (α): 0.05 (two-tailed)
  - Statistical power (1−β): 0.80

Marker-Specific Calculations:

Sample size requirements varied by marker based on expected prevalence distributions:

  1. ER (3 categories: Negative, Low-positive 1-<10%, Positive ≥10%)
    • Expected prevalence: 20%, 10%, 70%
    • Required sample size: n = 51 cases
  2. PR (2 categories: Negative <10%, Positive ≥10%)
    • Expected prevalence: 40% negative, 60% positive
    • Required sample size: n = 58 cases
  3. HER2 (5 categories: 0, Low-expression, 1+, 2+, 3+)
    • Expected prevalence: 35%, 5%, 35%, 5%, 20%
    • Required sample size: n = 32 cases
  4. Ki-67 (2 categories: <30%, ≥30%)
    • Expected prevalence: 70% low, 30% high
    • Required sample size: n = 69 cases

Final Sample Size: 69 cases minimum (the maximum requirement across all four markers)

Actual Enrollment: 296 cases (300 assessed for eligibility, 4 excluded)

Note: With 296 cases analyzed (>4× the minimum 69-case requirement), the study is well-powered for all pre-specified hypotheses.

2.4 Immunohistochemistry Protocol

2.4.1 Tissue Processing

Fixation:
- Fixative: 10% neutral buffered formalin
- Fixation time: 6-72 hours (per ASCO/CAP guidelines)
- Cold ischemia time: <1 hour for surgical specimens

Embedding: Paraffin-embedded tissue blocks prepared using routine laboratory protocols

Sectioning:
- Section thickness: 4 μm
- Mounted on positively charged glass slides (for optimal adhesion)

2.4.2 IHC Staining

Platform: Dako-Omnis automated immunostainer (Agilent Technologies, Santa Clara, CA)

Antibody Clones:
- ER: Clone EP1 (DAKO), ready-to-use monoclonal rabbit anti-human
- PR: Clone PgR1294 (DAKO), ready-to-use monoclonal mouse anti-human
- HER2: c-erbB-2 oncoprotein, polyclonal rabbit anti-human (DAKO), ready-to-use
- Ki-67: Clone MIB1 (DAKO), ready-to-use monoclonal mouse anti-human

Staining Protocol (all markers):
1. Deparaffinization and rehydration (automated)
2. Heat-induced epitope retrieval (HIER):
- Buffer: EnVision FLEX Target Retrieval Solution, High pH (Dako)
- Temperature: 97°C
- Duration: 20 minutes
3. Primary antibody incubation:
- Duration: 20 minutes (ER, PR, Ki-67), 30 minutes (HER2)
- Temperature: Room temperature
4. Detection: EnVision FLEX+ detection system (horseradish peroxidase)
5. Chromogen: 3,3’-Diaminobenzidine (DAB)
6. Counterstain: Hematoxylin
7. Dehydration, clearing, and coverslipping (automated)

2.4.3 Quality Control

External Controls:
- ER: Breast carcinoma known positive (>90% nuclear staining)
- PR: Breast carcinoma known positive (>90% nuclear staining)
- HER2: Breast carcinoma known 3+ (circumferential membrane staining)
- Ki-67: Tonsil tissue (positive and negative control zones)

Internal Controls:
- Normal breast epithelium (when present): ER/PR nuclear staining expected
- Lymphocytes: Ki-67 positive (proliferating cells in germinal centers)
- Stromal cells: Negative for all markers

Acceptance Criteria:
- External controls show expected staining intensity and pattern
- Internal controls validate specificity
- No background staining or artifacts that impair interpretation
- Counterstain quality adequate for nuclear/cytoplasmic delineation

Failed Runs: Cases with inadequate controls or technical artifacts were repeated on fresh sections

2.4.4 Scoring Guidelines

ER and PR (Nuclear Staining):
- Evaluated using Allred score system (proportion + intensity)
- Positive threshold: ≥1% nuclear staining (per ASCO/CAP 2020 guidelines)
- Clinical cutoff: 10% (Low-positive: 1-9%, Positive: ≥10%)
- Reported as: Percentage of positive tumor nuclei (0-100%)

HER2 (Membrane Staining):
- Evaluated using ASCO/CAP 2018/2023 guidelines
- Scoring criteria:
  - 0: No staining, or incomplete, faint membrane staining in ≤10% of tumor cells
  - 1+ (Low-expression): Incomplete, faint membrane staining in >10% of tumor cells
  - 2+: Incomplete and/or weak-to-moderate circumferential membrane staining in >10% of tumor cells, OR complete, intense circumferential membrane staining in ≤10%
  - 3+: Complete, intense circumferential membrane staining in >10% of tumor cells
- HER2-low: Score 1+, or 2+ with negative ISH (per 2023 ESMO consensus for trastuzumab deruxtecan eligibility)
- Note: Score 2+ cases undergo reflex FISH testing (not part of this study)
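
The scoring rules above can be expressed as a small decision function. The sketch below is illustrative Python only (scoring in the study was visual and AI-assisted, not scripted), and the `pattern` labels are hypothetical simplifications of the staining descriptors:

```python
def her2_ihc_score(pattern: str, pct_stained: float) -> str:
    """Map a dominant membrane staining pattern and the percentage of stained
    tumor cells to an ASCO/CAP-style HER2 IHC score.

    pattern (hypothetical encoding): "complete_intense",
    "incomplete_weak_moderate", "incomplete_faint", or "none".
    """
    if pattern == "complete_intense":
        # >10% complete intense -> 3+; <=10% falls under the 2+ OR-clause
        return "3+" if pct_stained > 10 else "2+"
    if pattern == "incomplete_weak_moderate" and pct_stained > 10:
        return "2+"
    if pattern == "incomplete_faint" and pct_stained > 10:
        return "1+"
    return "0"
```

In practice a 2+ result would still trigger reflex ISH testing, as noted above.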

Ki-67 (Nuclear Staining):
- Method: Hot spot approach (areas of highest proliferation)
- Counting: Minimum 500 tumor cells in invasive front
- Reported as: Percentage of positive tumor nuclei (0-100%)
- Clinical cutoffs:
  - 30%: Luminal A vs Luminal B distinction (primary cutoff used in this study)
  - 20%: Alternative cutoff (examined in sensitivity analysis)

2.5 Aiforia® AI System

2.5.1 AI Platform

Software: Aiforia® Create platform (Aiforia Technologies Oy, Helsinki, Finland)

Version: [Specify version used, e.g., v5.2.1] (consistent across all assessments)

Access: Cloud-based web interface accessed via secure institutional login

Integration: Connected to Sectra PACS for seamless whole slide image access

2.5.2 AI Modules Used

Four dedicated breast cancer biomarker modules:

  1. Aiforia® Breast ER Module
    • Training: Deep learning model trained on [N] annotated breast cancer cases
    • Output: Percentage of ER-positive tumor nuclei
    • Validation: Previously validated against manual scoring (published data)
  2. Aiforia® Breast PR Module
    • Training: Similar architecture to ER module
    • Output: Percentage of PR-positive tumor nuclei
    • Validation: Previously validated against manual scoring
  3. Aiforia® Breast HER2 Module
    • Training: Trained to distinguish 0, 1+, 2+, 3+ based on membrane intensity and completeness
    • Output: HER2 score category (0, Low-expression, 1+, 2+, 3+)
    • Validation: Previously validated against FISH results as gold standard
  4. Aiforia® Breast Ki-67 Module
    • Training: Trained on hot spot regions with manual annotations
    • Output: Percentage of Ki-67 positive tumor nuclei
    • Algorithm: Automated hot spot detection + cell counting

2.5.3 AI Algorithm Specifications

Image Analysis Pipeline (consistent across all modules):

  1. Tissue Detection:
    • Automated identification of tissue vs background
    • Exclusion of glass, air, and debris
  2. Tumor Segmentation:
    • Deep learning-based tumor vs non-tumor classification
    • Automatic exclusion of: stroma, lymphocytes, necrosis, normal ducts, adipose tissue
    • Pathologist option to refine tumor regions (manual override)
  3. Cell Detection:
    • Nuclear segmentation using watershed algorithm
    • Minimum nuclear size: 25 μm² (excludes debris)
    • Maximum nuclear size: 500 μm² (excludes artifacts)
  4. Biomarker Quantification:
    • ER/PR/Ki-67: Positive vs negative nuclear classification based on DAB chromogen intensity
    • HER2: Membrane completeness and intensity scoring
    • Threshold: Internally calibrated (proprietary algorithm)
  5. Quality Control Flags:
    • Insufficient tumor area (<1 mm²)
    • Staining artifacts detected
    • Control tissue inadvertently included
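
Steps 3-4 of the pipeline (size-based nuclear filtering followed by chromogen-threshold classification) can be sketched as follows. This is illustrative Python, not Aiforia's proprietary implementation; the DAB intensity cutoff is a hypothetical placeholder, since the real threshold is internally calibrated:

```python
def percent_positive(nuclei, min_area=25.0, max_area=500.0, dab_cutoff=0.3):
    """nuclei: list of (area_um2, dab_intensity) tuples for detected nuclei.

    Filters out debris/artifacts by nuclear area (25-500 um^2, as in the
    pipeline above), then classifies each remaining nucleus as positive if
    its DAB intensity exceeds a (hypothetical) cutoff. Returns the
    percentage of positive tumor nuclei.
    """
    kept = [(a, d) for a, d in nuclei if min_area <= a <= max_area]
    if not kept:
        return 0.0  # would correspond to the "insufficient tumor" QC flag
    positives = sum(1 for _, d in kept if d > dab_cutoff)
    return 100.0 * positives / len(kept)
```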

Computational Specifications:
- Processing time: 2-5 minutes per slide (varies by tissue area)
- Resolution: Analysis performed on 40× equivalent magnification
- Output format: Annotated heatmaps + quantitative percentages

2.5.4 AI Decision Support Interface

Presentation to Pathologist:
- Visual: Heatmap overlay on whole slide image (color-coded by marker positivity)
- Quantitative: Percentage value displayed prominently
- Interactive: Pathologist can pan/zoom, review individual cells
- Override capability: Pathologist can include/exclude regions and recalculate

AI Transparency:
- Algorithm version and training data characteristics disclosed to pathologists
- Confidence intervals not provided (not standard feature)
- Pathologists informed that AI is a decision support tool, not a diagnostic device

2.6 Assessment Workflow

2.6.1 Digital Slide Preparation

Scanning:
- Scanner: Leica Aperio AT2 (Leica Biosystems, Wetzlar, Germany)
- Magnification: 40× (0.25 μm/pixel resolution)
- File format: Aperio SVS format
- Storage: Sectra IDS7 Image Management System (Sectra AB, Linköping, Sweden)

Image Quality Control:
- Automated checks: Focus quality, illumination uniformity
- Manual review: Confirm no scanning artifacts, complete tissue capture
- Re-scanning: Performed if quality issues identified

2.6.2 Phase 1: Pre-AI Assessment

Timing:
- Commenced after case received digital pathology sign-out assignment
- Each pathologist completed assessment within 48 hours of case assignment

Workflow:
1. Pathologist opens case in Sectra PACS (routine digital pathology viewer)
2. Reviews all four IHC stains (ER, PR, HER2, Ki-67) at various magnifications
3. Performs manual assessment using established guidelines (Section 2.4.4)
4. Records results in individual Google Sheet (pathologist-specific, blinded to others)
5. Submission deadline enforced to ensure independence

Data Recorded (for each marker):
- ER: Percentage (0-100%)
- PR: Percentage (0-100%)
- HER2: Category (0, Low-expression, 1+, 2+, 3+)
- Ki-67: Percentage (0-100%)
- Molecular subtype classification (derived from markers)
- Assessment time (optional, for workflow analysis)
- Comments on challenging features (optional)

Blinding:
- Pathologists blinded to each other’s assessments (separate Google Sheets)
- Pathologists blinded to AI results (AI not available in Phase 1)
- Pathologists blinded to final clinical sign-out (not yet finalized)

2.6.3 Phase 2: Post-AI Assessment

Timing:
- Performed sequentially after Phase 1 completion for each case
- No washout period between phases; pathologists proceeded directly from Sectra to Aiforia

Workflow:
1. After completing Phase 1 assessment in Sectra, pathologist opens the same case in Aiforia® Create platform (AI-enabled viewer)
2. Aiforia® automatically runs all four biomarker modules (2-5 minutes processing)
3. Pathologist reviews AI-generated results:
- Visual heatmap overlay on slide
- Quantitative percentages/categories
- Option to include/exclude regions (refine tumor segmentation)
4. Pathologist formulates final assessment incorporating AI as decision support
5. Records results in same Google Sheet (Post-AI columns)

Data Recorded (for each marker):
- Raw AI output (before any pathologist modifications)
- Final pathologist assessment (after reviewing AI)
- Whether pathologist agreed with AI or overrode recommendation
- Regions excluded/included by pathologist (if any)
- Assessment time (optional)
- Comments on AI performance (optional)

AI Integration Decision:
- Pathologists instructed: “Use AI as decision support, not replacement”
- Pathologists retained full autonomy to accept, modify, or reject AI suggestions
- No incentive to agree or disagree with AI (research context, not clinical workflow)

2.6.4 Sequential Workflow Rationale

Design choice: A sequential (Sectra first, then Aiforia) workflow without a washout period was used, reflecting the intended real-world clinical integration where AI serves as a second-read decision support tool.

Advantages:
- Mirrors the practical clinical scenario where pathologists review AI output after forming an initial impression
- Allows direct measurement of how AI modifies an existing assessment (the clinically relevant question)
- Eliminates logistic burden of re-scheduling assessments across multiple days

Limitation:
- Recall of Phase 1 scores may reduce the apparent magnitude of AI-induced changes, as pathologists may anchor to their initial assessment
- This conservative bias means observed AI effects likely represent a lower bound of the true AI influence

2.7 Data Collection and Management

2.7.1 Data Collection Instrument

Platform: Google Sheets (Google Workspace, cloud-based)

Structure:
- Master sheet: Case inventory with anonymized IDs
- Individual pathologist sheets (4 separate sheets, one per pathologist):
- Pre-AI assessment columns (Phase 1)
- Post-AI assessment columns (Phase 2)
- Each pathologist’s sheet password-protected and accessible only to that pathologist

Variables Collected (per case, per pathologist, per phase):

Demographic/Clinical:
- Case ID (anonymized: BxNo)
- Specimen type (Tru-cut biopsy, Excision, Vacuum-assisted biopsy). Note: Vacuum-assisted biopsies (coded “V”) were recoded as Tru-cut for analysis due to their similar tissue sampling characteristics, yielding two final categories: Excision and Tru-cut
- Tumor histologic type (Invasive ductal, Invasive lobular, Other)
- Tumor grade (Nottingham, if available)

Biomarker Assessments:
- ER percentage (0-100%)
- PR percentage (0-100%)
- HER2 score (0, Low-expression, 1+, 2+, 3+)
- Ki-67 percentage (0-100%)

Derived Variables:
- ER category (Negative <1%, Low-positive 1-9%, Positive ≥10%)
- PR category (Negative <10%, Positive ≥10%)
- HER2 status (Negative: 0/1+, Equivocal: 2+, Positive: 3+)
- Ki-67 category (<20%, ≥20%; <30%, ≥30%)
- Molecular subtype (HER2-positive, Luminal A, Luminal B, Hormone Weak-positive, Triple-negative)
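
The derived categories above follow mechanically from the recorded values. A minimal Python sketch (function name and the mapping of the "Low-expression" HER2 code to Negative status are our assumptions, not taken from the study database):

```python
def derive_categories(er_pct, pr_pct, her2_score, ki67_pct):
    """Map raw biomarker values to the study's derived categories."""
    if er_pct < 1:
        er_cat = "Negative"
    elif er_pct < 10:
        er_cat = "Low-positive"
    else:
        er_cat = "Positive"
    pr_cat = "Positive" if pr_pct >= 10 else "Negative"
    her2_status = {"0": "Negative", "Low-expression": "Negative",  # assumed
                   "1+": "Negative", "2+": "Equivocal",
                   "3+": "Positive"}[her2_score]
    ki67_cat_20 = "≥20%" if ki67_pct >= 20 else "<20%"
    ki67_cat_30 = "≥30%" if ki67_pct >= 30 else "<30%"
    return er_cat, pr_cat, her2_status, ki67_cat_20, ki67_cat_30
```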

Process Variables:
- Assessment timestamp (Phase 1 and Phase 2)
- Time spent (minutes, optional)
- AI agreement flag (Did pathologist agree with AI? Yes/No/Partial)
- Override regions (Were tumor boundaries modified? Yes/No)
- Comments/notes (free text)

2.7.2 Data Quality Assurance

Real-time Validation:
- Google Sheets formula-based range checks (e.g., percentages 0-100)
- Dropdown menus for categorical variables (prevents typos)
- Required field enforcement (cannot submit incomplete assessments)

Weekly Audits:
- Principal investigator (PI) reviews completed cases
- Checks for missing data, outliers, inconsistencies
- Flags cases for re-review if necessary

Inter-rater Reliability Checks:
- Periodic case review meetings during data collection (not after); no consensus scoring performed (see Section 2.8.2)
- Cases with extreme disagreement (>30% difference in ER/PR/Ki-67) discussed
- AI performance issues documented (e.g., systematic segmentation errors)

2.7.3 Data Export and Processing

Export:
- Weekly export from Google Sheets to CSV format
- Timestamp of export recorded for provenance

Data Processing Pipeline:
1. CSV import to R (tidyverse package)
2. Data cleaning: Missing value coding, outlier detection
3. Long format transformation (case-pathologist-marker-phase structure)
4. Derived variable calculation (categories, molecular subtypes)
5. Merged dataset creation (wide format for ICC/Kappa calculations)
6. RDS format storage for reproducible analysis
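
Step 3 of the pipeline (wide-to-long reshaping) is conceptually simple. The actual analysis used R's tidyverse; an equivalent Python sketch of the case-pathologist-marker-phase structure (field names are ours) is:

```python
def to_long(wide_rows, markers=("ER", "PR", "HER2", "Ki67")):
    """Reshape wide records (one row per case/pathologist/phase, one column
    per marker) into long rows (one row per case/pathologist/phase/marker)."""
    long_rows = []
    for row in wide_rows:
        for marker in markers:
            long_rows.append({
                "case": row["case"], "pathologist": row["pathologist"],
                "phase": row["phase"], "marker": marker, "value": row[marker],
            })
    return long_rows
```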

Code Repository: All data processing scripts available in /scripts/ directory

Reproducibility: Full pipeline executable from raw CSV to final analysis datasets

2.7.4 Data Security and Confidentiality

Anonymization:
- Cases identified by study ID (BxNo) only
- No patient names, medical record numbers, or dates of birth collected
- Link file (Study ID ↔ Medical Record Number) stored separately, access-restricted

Access Control:
- Google Sheets access restricted to study pathologists and PI
- Two-factor authentication required
- Audit log maintained (all access and edits tracked)

Data Storage:
- Primary: Google Workspace (HIPAA-compliant, encrypted at rest and in transit)
- Backup: Institutional secure research server (weekly backups)
- Long-term archive: De-identified data suitable for public repository (at publication)

2.8 Quality Assurance and Consensus Procedures

2.8.1 Pre-Study Calibration

Training Session (conducted before Phase 1):
- Duration: 2 hours
- Content: Review of ASCO/CAP guidelines, Aiforia® AI interface tutorial
- Practice cases: 10 representative cases (not included in study cohort)
- Inter-pathologist discussion to align on borderline cases

Goal: Establish baseline consistency without homogenizing individual styles (preserve real-world variability)

2.8.2 Ongoing Quality Monitoring

Case Review Meetings:
- Frequency: Bi-weekly during active data collection
- Participants: All four pathologists + PI
- Agenda:
- Review cases with high inter-pathologist disagreement (flagged by PI)
- Discuss AI performance issues (segmentation errors, unexpected results)
- Address technical issues (staining artifacts, scanning problems)
- No consensus scoring performed (preserve independence for analysis)

Purpose:
- Ensure data quality (not to force agreement)
- Identify systematic technical issues requiring correction
- Maintain pathologist engagement and compliance

2.8.3 Exclusion Criteria Applied During Study

Post-hoc Exclusions (after data collection):
- Cases with incomplete assessments (missing Phase 1 or Phase 2 from any pathologist)
- Cases with documented technical failures (AI segmentation error, staining failure)
- Cases with insufficient tumor (AI flags <1 mm² tumor area)

Transparency:
- Excluded cases listed in Appendix with reasons
- Exclusion rate reported (target <15%)
- Sensitivity analysis: Results with and without exclusions (if >10% excluded)

2.8.4 Outlier Handling

Definition of Outlier:
- ER/PR: Difference >30% from median of four pathologists
- Ki-67: Difference >20% from median
- HER2: Disagreement by >2 categories (e.g., 0 vs 3+)
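
The continuous-marker outlier definition above can be applied mechanically: flag any rater whose score deviates from the four-pathologist median by more than the marker-specific tolerance (30 for ER/PR, 20 for Ki-67). A Python sketch (the study's actual flagging was done during PI audits):

```python
from statistics import median

def flag_outliers(scores, tolerance):
    """Return one boolean per rater: True if that rater's score deviates
    from the median of all raters by more than the given tolerance."""
    med = median(scores)
    return [abs(s - med) > tolerance for s in scores]
```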

Procedure:
- Outliers flagged for review but not excluded (preserve real-world variability)
- PI and pathologist review case together to identify cause:
- Technical issue (wrong region scored, control tissue included) → Correction permitted
- Legitimate disagreement (ambiguous morphology, heterogeneity) → Retained as-is
- Decision documented in case notes

Rationale: Study aims to measure real-world variability; excluding outliers would artificially improve agreement

2.9 Statistical Analysis

2.9.1 Primary Outcomes

Primary Outcome: Change in interobserver agreement from Pre-AI to Post-AI, measured by:

  1. Intraclass Correlation Coefficient (ICC) for continuous variables (ER, PR, Ki-67):
    • Model: Two-way random effects, absolute agreement, single rater
    • Interpretation: ICC <0.50 poor, 0.50-0.75 moderate, 0.75-0.90 good, >0.90 excellent
    • Hypothesis: ICC_post > ICC_pre (AI improves agreement)
  2. Fleiss’ Kappa (κ) for categorical variables (HER2, Molecular subtypes):
    • Interpretation: κ <0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, >0.80 excellent
    • Hypothesis: κ_post > κ_pre (AI improves agreement)
    • Extension: Weighted Kappa for HER2 (quadratic weights accounting for ordinal structure)
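
The ICC specification in item 1 (two-way random effects, absolute agreement, single rater, i.e., ICC(A,1)) was computed in R with the irr package; as a cross-check, it can be derived from ANOVA mean squares. An illustrative Python sketch:

```python
import numpy as np

def icc_a1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: n_cases x k_raters array of scores."""
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()    # between cases
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note that a constant offset between raters lowers ICC(A,1) even when their rankings agree perfectly, which is what makes the absolute-agreement form sensitive to systematic bias.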

Minimum Clinically Important Difference (MCID):
- ICC/Kappa: Δ ≥ 0.10 (clinically meaningful improvement)
- ER/PR: Δ ≥ 5% (near therapeutic threshold of 10%)
- Ki-67: Δ ≥ 3% (near classification threshold of 20-30%)

2.9.2 Secondary Outcomes

  1. Systematic Bias: Paired t-test (Post-AI mean vs Pre-AI mean) for each marker
    • Null hypothesis: Mean difference = 0 (no systematic bias)
    • Alternative: Mean difference ≠ 0 (AI systematically shifts values)
    • Effect size: Cohen’s d for magnitude of bias
  2. Threshold Crossing Frequency:
    • N and % cases crossing clinical thresholds (ER/PR 10%, Ki-67 20%/30%, HER2 0→1+ or 2+→3+)
    • Clinical impact: Molecular subtype reclassification rate
  3. Individual Pathologist AI Adoption:
    • AI adoption index: Mean absolute change (|Post - Pre|) per pathologist
    • Consistency change: Within-pathologist variance Pre vs Post
    • Agreement with AI: Correlation between pathologist Post-AI and raw AI output
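
For secondary outcome 1 (systematic bias), the paired statistics reduce to a few lines. The study used R (stats::t.test); an equivalent Python sketch:

```python
from math import sqrt
from statistics import mean, stdev

def paired_bias(pre, post):
    """Paired t statistic and Cohen's d for Post-AI minus Pre-AI scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    cohens_d = d_mean / d_sd                     # standardized bias magnitude
    t_stat = d_mean / (d_sd / sqrt(len(diffs)))  # = cohens_d * sqrt(n)
    return t_stat, cohens_d
```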

2.9.3 Statistical Methods

Agreement Metrics:
- ICC: irr package in R, bootstrap confidence intervals (1000 replicates)
- Kappa: irr package, Fleiss’ Kappa for multi-rater, Cohen’s Kappa for pairwise
- Weighted Kappa: psych package, quadratic weights for ordinal HER2
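
Fleiss' kappa for the four raters (computed with the irr package in the actual analysis) follows the standard formula: mean observed per-case agreement corrected for chance agreement. A self-contained Python sketch:

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: n_cases x n_categories matrix; entry [i, j] is how many raters
    assigned case i to category j (each row sums to the number of raters,
    here 4)."""
    m = np.asarray(counts, dtype=float)
    n = m.shape[0]
    r = m[0].sum()                                    # raters per case
    p_j = m.sum(axis=0) / (n * r)                     # category prevalences
    p_i = ((m ** 2).sum(axis=1) - r) / (r * (r - 1))  # per-case agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()         # observed vs chance
    return (p_bar - p_e) / (1 - p_e)
```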

Bias Detection:
- Paired t-test: stats::t.test() with paired=TRUE
- Bland-Altman plots: Mean-difference vs mean plots for each marker
- Passing-Bablok regression: mcr package for proportional/constant bias detection (implemented in the supplementary materials and Ki-67 threshold analysis chapters)

Threshold Analysis:
- McNemar’s test: Categorical changes (pre vs post classification)
- Logistic regression: Predictors of threshold crossing
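
McNemar's test above depends only on the discordant pairs: cases classified positive Pre-AI but negative Post-AI, and vice versa. An exact binomial version as an illustrative Python sketch (the analysis itself was run in R):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant-pair counts:
    b = pre-positive/post-negative, c = pre-negative/post-positive."""
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of classification change
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)
```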

Mixed Effects Models:
- Model: lme4::lmer() for nested structure (pathologists within cases)
- Fixed effects: AI condition (Pre vs Post), Marker type
- Random effects: Case ID, Pathologist ID
- Purpose: Account for non-independence in repeated measures

Multiple Testing Correction:
- Primary hypotheses (N=3): No correction (pre-specified)
- H1: Ki-67 ICC improvement
- H2: HER2 Kappa improvement
- H3: Ki-67 systematic bias
- Exploratory analyses: Benjamini-Hochberg False Discovery Rate (FDR) correction, q<0.05
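
The Benjamini-Hochberg step-up procedure applied to exploratory analyses (p.adjust(method = "BH") in R) can be written out explicitly; an illustrative Python sketch:

```python
def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: returns which hypotheses are rejected
    while controlling the false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # find the largest rank k with p_(k) <= q * k / m
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_max = rank
    # reject all hypotheses up to and including rank k_max
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject
```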

Sensitivity Analyses:
1. Complete case analysis vs multiple imputation (for missing data)
2. Alternative thresholds (Ki-67 25% vs 30%)
3. Outlier exclusion (remove cases with extreme changes >30%)

2.9.4 Software and Packages

Primary Software:
- R version 4.3.0 or later (R Core Team, 2024)
- RStudio IDE version 2023.06.0 or later

Key R Packages:
- Data manipulation: tidyverse (1.3.2), dplyr, tidyr
- Agreement analysis: irr (0.84.1), psych (2.3.6)
- Mixed models: lme4 (1.1-34), lmerTest (3.1-3)
- Bias analysis: mcr (1.3.1), deming (1.4)
- Visualization: ggplot2 (3.4.2), patchwork (1.1.2), gt (0.9.0)
- Reproducibility: here (1.0.1), renv (0.17.3)

Reproducibility:
- All analyses conducted in Quarto documents (.qmd format)
- Complete analysis pipeline available in project GitHub repository
- renv lockfile ensures exact package versions

2.9.5 Sample Size Justification (Achieved)

Target: 69 cases (from power analysis, Section 2.3)

Enrolled: 296 cases (300 assessed for eligibility, 4 excluded)

Analyzed: 296 cases (common cohort assessed by all 4 pathologists)

Adequacy: With 296 cases (exceeding the 69-case minimum by >4×), the study is adequately powered for all primary and secondary analyses.

2.10 Molecular Subtype Classification

The following hierarchy was used to classify cases into molecular subtypes based on the combination of ER, PR, HER2, and Ki-67 results (Ivanova et al. 2024; Ahn et al. 2023):

2.10.1 Classification Algorithm

Step 1: HER2 Status Assessment
- HER2-Positive: HER2 IHC Score 2+ or 3+ (regardless of HR status)
  - Note: Score 2+ cases undergo reflex FISH testing in clinical practice
  - For this study: Score 2+ included in the HER2-positive category pending confirmatory results

If HER2 negative (Score 0 or 1+), proceed to Step 2.

Step 2: Hormone Receptor and Proliferation Assessment

  1. Luminal A (ER+ HER2- low proliferation):
    • HER2: 0 or 1+
    • ER: ≥10%
    • PR: ≥10%
    • Ki-67: <30%
    • Clinical implication: Endocrine therapy alone, chemotherapy generally not indicated
  2. Luminal B (ER+ HER2- high proliferation):
    • HER2: 0 or 1+
    • ER: ≥10%
    • Ki-67: ≥30%
    • Clinical implication: Endocrine therapy + chemotherapy recommended
  3. Hormone Weak-Positive (ER+ or PR+ but not Luminal A/B):
    • HER2: 0 or 1+
    • ER: >0 OR PR: >0
    • Does not meet criteria for Luminal A or Luminal B
    • Clinical implication: Consider endocrine therapy ± chemotherapy based on clinical features
  4. Triple-Negative (All receptors negative):
    • HER2: 0 or 1+
    • ER: 0%
    • PR: 0%
    • Clinical implication: Chemotherapy primary treatment (no targeted therapy available)
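
The hierarchy in Steps 1-2 is deterministic, so it can be expressed as a short function. A Python sketch of the rules above (HER2 passed as the IHC score string; the function name is ours):

```python
def molecular_subtype(er_pct, pr_pct, her2_score, ki67_pct):
    """Apply the study's hierarchical subtype rules in order."""
    if her2_score in ("2+", "3+"):        # Step 1: HER2 status first
        return "HER2-positive"
    # Step 2 applies only to HER2-negative cases (0 or 1+)
    if er_pct >= 10 and pr_pct >= 10 and ki67_pct < 30:
        return "Luminal A"
    if er_pct >= 10 and ki67_pct >= 30:
        return "Luminal B"
    if er_pct > 0 or pr_pct > 0:          # ER+/PR+ but not Luminal A/B
        return "Hormone Weak-positive"
    return "Triple-negative"
```

Note that a case with ER ≥10%, PR <10%, and Ki-67 <30% falls through to Hormone Weak-positive, which is what the "does not meet criteria for Luminal A or Luminal B" clause implies.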

2.10.2 Clinical Threshold Rationale

ER/PR 10% Cutoff:
- Based on ASCO/CAP 2020 guidelines
- Cases 1-9%: “Low-positive” (may benefit from endocrine therapy but response uncertain)
- Cases ≥10%: “Positive” (clear endocrine therapy indication)

Ki-67 Thresholds:
- 20%: Used by some institutions (more sensitive Luminal A/B distinction)
- 30%: Used in this study (per St. Gallen consensus, more specific for high proliferation)
- Note: No universal consensus; both cutoffs analyzed in sensitivity analysis (Chapter 14)

HER2-Low Category (Recent 2023 update):
- HER2 Score 1+ or 2+ (FISH-negative if tested)
- Clinical relevance: Eligibility for trastuzumab deruxtecan (T-DXd)
- Analyzed separately in Chapter 18 (HER2-low classification analysis)

2.10.3 Changes in Classification

Primary Analysis: Change in molecular subtype from Pre-AI to Post-AI
- Quantified as: N and % cases reclassified
- Direction: Luminal A→B, Luminal B→A, HER2-positive gain/loss, Triple-negative gain/loss

Clinical Impact: Treatment decision changes implied by reclassification (see Chapter 6 and 12: Clinical Impact)

2.11 Ethical Considerations

2.11.1 Institutional Review

Ethics Approval:
- Study reviewed and approved by Memorial Hospitals Group Institutional Review Board (IRB)
- Protocol number: [Insert IRB protocol number]
- Approval date: [Insert date]

Study Classification:
- Retrospective analysis of previously collected clinical data
- Minimal risk research (no patient contact, no intervention beyond standard care)
- AI used as research tool (all cases received standard pathology sign-out)

2.11.3 Clinical Care Safeguards

Non-interference with Clinical Care:
- Research assessments conducted in parallel with clinical sign-out
- Clinical pathology report generated by assigned pathologist (independent of research)
- AI recommendations in research context did not influence clinical sign-out
- Cases with research-identified discrepancies underwent clinical quality assurance review

Patient Safety:
- No patient treatment delayed or altered due to research participation
- All cases received standard-of-care pathology interpretation

2.11.4 Conflicts of Interest

Funding:
- [Specify funding source if any]
- No funding from Aiforia Technologies Oy

Aiforia® Relationship:
- Software provided under institutional license (no special consideration for research)
- No financial relationship between investigators and Aiforia Technologies

Pathologist Compensation:
- Pathologists volunteered participation (no additional compensation)
- Participation voluntary; could withdraw at any time without consequence

2.11.5 Data Sharing and Transparency

Data Availability:
- De-identified dataset available upon reasonable request after publication
- Analysis code publicly available on GitHub repository
- Raw slide images not shareable (institutional policy, patient privacy)

Reproducibility:
- Complete analysis pipeline (from raw data to published results) documented
- Quarto documents ensure computational reproducibility

2.12 References

All references cited in this Methods section are included in the master bibliography (see References chapter).


Note: This comprehensive Methods section provides sufficient detail for:
1. Replication by independent investigators
2. Critical appraisal by peer reviewers
3. Regulatory review (if applicable)
4. Integration into main manuscript with appropriate condensation for word limits (detailed methods can be moved to supplementary materials as needed)