19  Sensitivity Analyses

19.1 Introduction

This chapter presents comprehensive sensitivity analyses to assess the robustness of our primary findings. Sensitivity analyses test whether conclusions remain stable under alternative assumptions or analytical approaches, strengthening confidence in the generalizability of results.

Note for Pathologist: “Sensitivity Analysis” asks: “What if we are wrong about our assumptions?” For example, we assumed 30% is the best cutoff for Ki67. What if it’s 25%? Does the AI still help? If the results stay the same (“Robust”), we are confident. If they change wildly (“Sensitive”), we need to be careful.

Analyses included:

  1. Alternative Ki-67 Thresholds: Testing 25% cutoff (alternative guideline) vs. our primary 30% threshold for Luminal A/B classification
  2. Outlier Exclusion: Removing extreme Ki-67 changes (>30 percentage points) to assess if systematic bias is driven by aberrant cases
  3. Biopsy Type Stratification: Re-analysis restricted to excision specimens only (more homogeneous sample)
  4. Complete Case Analysis: HER2 analysis restricted to cases with no missing data

19.2 Setup

19.3 Analysis 1: Alternative Ki-67 Threshold (25% vs 30%)

19.3.1 Background

Current guidelines for Luminal A/B classification use Ki-67 thresholds ranging from 20-30% depending on the institution and guideline version:

  • St. Gallen 2013: 14% threshold
  • St. Gallen 2015: 20% threshold
  • ASCO/CAP recommendations: No specific cutoff recommended; institutional validation required
  • Our primary analysis: 30% threshold (commonly used in clinical practice)

This sensitivity analysis tests 25% as an alternative to assess whether our findings about AI-induced systematic bias in Ki-67 are robust to threshold choice.

19.3.2 Methods

We re-classify molecular subtypes using a 25% Ki-67 threshold instead of 30%:

  • Luminal A: ER+ and/or PR+, HER2-, Ki-67 < 25%
  • Luminal B: ER+ and/or PR+, HER2-, Ki-67 >= 25%

Then compare:

  1. N cases reclassified Pre-AI vs Post-AI
  2. Direction of reclassification (A->B vs B->A)
  3. Whether systematic upward bias persists
Molecular Subtype Reclassification: 30% vs 25% Ki-67 Threshold
Ki-67 Threshold N Total Assessments N Changed % Changed Luminal A->B Luminal B->A Net Shift (A->B)
30% 1037 66 6.4 30 2 28
25% 1037 77 7.4 38 5 33

19.3.3 Results: Threshold Comparison

19.3.3.1 Key Observations

**Impact of lowering threshold from 30% to 25%:**
- Luminal A->B reclassifications: +8 cases (6.4% -> 7.4%)
- Luminal B->A reclassifications: +3 cases
- Net shift toward Luminal B: +5 cases

**Interpretation**: Lowering the threshold does not substantially alter findings, suggesting systematic bias is robust to threshold choice.

19.3.4 Distribution Near Thresholds

Ki-67 Values Near Clinical Thresholds (+/-5%)
Threshold N Near Pre-AI N Near Post-AI N Crossing Up N Crossing Down
25% 228 258 165 12
30% 192 210 152 5

Clinical Impact: Cases within +/-5% of the threshold are at highest risk for reclassification. This analysis quantifies how many assessments fall in the “gray zone” where AI may shift treatment decisions.

Note for Pathologist: The threshold comparison shows whether lowering the Ki67 cutoff from 30% to 25% changes how many patients get reclassified. If reclassification rates change dramatically, it means the results are sensitive to the exact cutoff you use. If they remain similar, the findings are robust regardless of which guideline you follow.

19.4 Analysis 2: Outlier Exclusion (Extreme Ki-67 Changes)

19.4.1 Rationale

Systematic bias could be driven by a small number of extreme outliers rather than a true systematic trend. We test this by:

  1. Identifying cases with Ki-67 change > 30 percentage points (e.g., Pre-AI 20% -> Post-AI 55%)
  2. Excluding these outliers
  3. Re-calculating systematic bias metrics

Hypothesis: If findings persist after outlier exclusion, systematic bias is not driven by aberrant cases.

Impact of Outlier Exclusion on Ki-67 Systematic Bias
N Total N Outliers (|Delta|>30) % Outliers Mean Delta (All) Median Delta (All) Mean Delta (No Outliers) Median Delta (No Outliers)
1162 11 0.95 5.89 4 5.58 4

19.4.2 Visualization: Distribution With and Without Outliers

19.4.2.1 Interpretation

**Outlier exclusion analysis:**
- 0.9% of Ki-67 assessments are outliers (|Delta| > 30%)
- Mean Ki-67 change: 5.89% (all) vs 5.58% (no outliers)
- Removing outliers reduces mean change by 0.30 percentage points (5% reduction)

**Conclusion**: Systematic upward bias persists after outlier exclusion, indicating bias is not driven by extreme outliers.

19.5 Analysis 3: Biopsy Type Stratification

19.5.1 Rationale

Our primary analysis pools excision specimens and tru-cut biopsies. These differ in:

  • Sample size: Excisions larger, more representative
  • Fixation: May vary by specimen type
  • Tumor heterogeneity: Biopsies sample smaller areas

This sensitivity analysis restricts to excision specimens only to assess whether findings generalize to a more homogeneous sample type.

Ki-67 Systematic Bias by Biopsy Type
Biopsy Type N Mean Delta (%) SD Median Delta (%) Q1 Q3
Excision 682 5.80 7.06 4 1 10
Tru-cut 480 6.01 8.47 5 0 10

19.5.2 Excision-Only Analysis

Ki-67 Systematic Bias: Pooled vs Excision-Only
Analysis N Mean Delta (%) SD Median Delta (%) % Increasing
Pooled (All Types) 1162 5.89 7.67 4 75.99
Excision Only 682 5.80 7.06 4 77.13

**Interpretation**: Mean Ki-67 change in excision-only analysis (5.80%) is similar to the pooled estimate (5.89%), indicating findings are robust to specimen type.

19.6 Analysis 4: Complete Case Analysis (HER2)

19.6.1 Rationale

Our primary HER2 analysis includes cases with 7-8% missing data. While missing data mechanisms were assessed (MCAR test), we perform a complete case analysis to ensure findings are not biased by missing data patterns.

**HER2 Missing Data Summary:**
- Total assessments: 1184
- Complete cases: 1073 (90.6%)
- Missing: 111 (9.4%)

19.6.2 Agreement Metrics: Complete vs Full Sample

HER2 Agreement: Complete Case vs Full Sample
Analysis Phase Fleiss’ Kappa N Cases
Complete Cases Post-AI 0.762 203
Complete Cases Pre-AI 0.691 203
Full Sample Post-AI 0.726 226
Full Sample Pre-AI 0.671 229

19.6.2.1 Interpretation

**HER2 Complete Case Analysis:**
- Pre-AI Kappa: 0.691 (complete) vs 0.671 (full) -- Difference: 0.020
- Post-AI Kappa: 0.762 (complete) vs 0.726 (full) -- Difference: 0.036

**Conclusion**: Agreement metrics are nearly identical between complete case and full sample analyses, suggesting missing data does not substantially bias findings.

19.7 Summary of Sensitivity Analyses

19.7.1 Overview Table

Summary of Sensitivity Analyses
Analysis Parameter Tested Primary Finding Sensitivity Result Conclusion
Alternative Ki-67 Threshold 25% vs 30% cutoff Systematic +5.9% Ki-67 increase TBD from results above Robust/Sensitive
Outlier Exclusion Exclude |Delta| > 30% Systematic bias persists TBD from results above Not driven by outliers
Biopsy Type Excision only Pooled analysis valid TBD from results above Generalizes across types
Complete Case (HER2) No missing data Kappa improvement +0.058 TBD from results above Not biased by missingness

19.7.2 Key Takeaways

  1. Alternative Ki-67 Threshold (25% vs 30%):
    • Purpose: Test robustness to guideline variation
    • Result: [To be filled based on analysis above]
    • Implication: Findings are [robust/sensitive] to threshold choice
  2. Outlier Exclusion:
    • Purpose: Rule out influence of extreme outliers
    • Result: [To be filled based on analysis above]
    • Implication: Systematic bias is [not/partially] driven by aberrant cases
  3. Biopsy Type Stratification:
    • Purpose: Assess generalizability across specimen types
    • Result: [To be filled based on analysis above]
    • Implication: Findings [do/do not] generalize to excision specimens
  4. Complete Case Analysis (HER2):
    • Purpose: Rule out missing data bias
    • Result: [To be filled based on analysis above]
    • Implication: Missing data [does not/may] bias HER2 findings

19.7.3 Clinical Implications

For Pathologists:

  • If findings are robust across sensitivity analyses -> High confidence in primary results
  • If findings are sensitive to specific assumptions -> Caution in interpretation, additional validation needed

For Institutions Implementing AI:

  • Robust findings support broader adoption
  • Sensitive findings suggest need for local validation with institution-specific thresholds and workflows

For Guideline Development:

  • Sensitivity to Ki-67 threshold choice highlights need for standardized cutoffs
  • Outlier analysis informs quality control metrics for AI-assisted assessment

19.8 Recommendations

Based on sensitivity analyses:

  1. If findings are robust (minimal change across analyses):
    • Primary analysis results are generalizable
    • AI can be implemented with confidence
    • Monitoring protocols can use primary thresholds
  2. If findings are sensitive (substantial change with alternative assumptions):
    • Local validation required before implementation
    • Institution-specific threshold calibration needed
    • Enhanced quality control for borderline cases
  3. For Future Studies:
    • Prospective validation with predetermined Ki-67 thresholds
    • Multi-institutional studies to assess generalizability
    • Outcomes-based validation (correlation with recurrence risk)