19 Sensitivity Analyses

19.1 Introduction

This chapter reports sensitivity analyses that assess the robustness of our primary findings. A sensitivity analysis tests whether conclusions remain stable under alternative assumptions or analytical approaches; stable conclusions strengthen confidence in the results.

Note for Pathologist: A sensitivity analysis asks, “What if our assumptions are wrong?” For example, we used 30% as the Ki-67 cutoff. What if the cutoff is 25%? Does the AI still help? If the results stay the same (“robust”), we can be confident in them. If they change substantially (“sensitive”), they need to be interpreted with care.

Analyses included:

Alternative Ki-67 Thresholds: Testing a 25% cutoff (alternative guideline) vs. our primary 30% threshold for Luminal A/B classification
Outlier Exclusion: Removing extreme Ki-67 changes (>30 percentage points) to assess if systematic bias is driven by aberrant cases
Biopsy Type Stratification: Re-analysis restricted to excision specimens only (more homogeneous sample)
Complete Case Analysis: HER2 analysis restricted to cases with no missing data

19.2 Setup

19.3 Analysis 1: Alternative Ki-67 Threshold (25% vs 30%)

19.3.1 Background

Ki-67 thresholds used for Luminal A/B classification range from 14% to 30%, depending on the institution and guideline version:

St. Gallen 2013: 14% threshold
St. Gallen 2015: 20% threshold
ASCO/CAP recommendations: No specific cutoff recommended; institutional validation required
Our primary analysis: 30% threshold (commonly used in clinical practice)

This sensitivity analysis tests 25% as an alternative to assess whether our findings about AI-induced systematic bias in Ki-67 are robust to threshold choice.

19.3.2 Methods

We re-classify molecular subtypes using a 25% Ki-67 threshold instead of 30%:

Luminal A: ER+ and/or PR+, HER2-, Ki-67 < 25%
Luminal B: ER+ and/or PR+, HER2-, Ki-67 >= 25%

Then compare:

N cases reclassified Pre-AI vs Post-AI
Direction of reclassification (A->B vs B->A)
Whether systematic upward bias persists
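The reclassification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names, the record layout, and the toy values are all hypothetical, and it assumes hormone receptor and HER2 status are fixed for a case so that only Ki-67 differs between Pre-AI and Post-AI.

```python
from collections import Counter

def luminal_subtype(hr_positive, her2_positive, ki67, threshold):
    """Return 'Luminal A' or 'Luminal B'; None if the case is not HR+/HER2-."""
    if not hr_positive or her2_positive:
        return None
    return "Luminal A" if ki67 < threshold else "Luminal B"

def count_reclassifications(assessments, threshold):
    """Count A->B and B->A moves between paired Pre-AI and Post-AI Ki-67 scores."""
    moves = Counter()
    for a in assessments:
        pre = luminal_subtype(a["hr_pos"], a["her2_pos"], a["ki67_pre"], threshold)
        post = luminal_subtype(a["hr_pos"], a["her2_pos"], a["ki67_post"], threshold)
        if pre is None or post is None or pre == post:
            continue  # not luminal HER2-negative, or subtype unchanged
        moves["A->B" if pre == "Luminal A" else "B->A"] += 1
    return {
        "A->B": moves["A->B"],
        "B->A": moves["B->A"],
        "net_A_to_B": moves["A->B"] - moves["B->A"],
    }

# Made-up paired assessments for illustration only (not study data).
toy = [
    {"hr_pos": True, "her2_pos": False, "ki67_pre": 22, "ki67_post": 27},
    {"hr_pos": True, "her2_pos": False, "ki67_pre": 28, "ki67_post": 33},
    {"hr_pos": True, "her2_pos": False, "ki67_pre": 26, "ki67_post": 24},
    {"hr_pos": True, "her2_pos": False, "ki67_pre": 10, "ki67_post": 12},
    {"hr_pos": True, "her2_pos": True, "ki67_pre": 20, "ki67_post": 40},
]

for threshold in (30, 25):
    print(threshold, count_reclassifications(toy, threshold))
```

Running the same counting function at both thresholds keeps the comparison like-for-like: only the cutoff changes, not the data or the rule.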

Molecular Subtype Reclassification: 30% vs 25% Ki-67 Threshold
Ki-67 Threshold	N Total Assessments	N Changed	% Changed	Luminal A->B	Luminal B->A	Net Shift (A->B)
30%	1037	66	6.4	30	2	28
25%	1037	77	7.4	38	5	33

19.3.3 Results: Threshold Comparison

19.3.3.1 Key Observations

**Impact of lowering threshold from 30% to 25%:**

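- Total reclassified assessments: +11 (66 -> 77; 6.4% -> 7.4% of 1037)

- Luminal A->B reclassifications: +8 cases (30 -> 38)

- Luminal B->A reclassifications: +3 cases (2 -> 5)

- Net shift toward Luminal B: +5 cases (28 -> 33)


**Interpretation**: Under both thresholds, reclassification runs predominantly from Luminal A to Luminal B (net shift of 28 cases at 30% and 33 cases at 25%). Lowering the threshold does not substantially alter findings, suggesting systematic bias is robust to threshold choice.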
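As a quick arithmetic check, the percentages and net shifts can be recomputed from the counts in the reclassification table above. The dictionary layout and function name here are illustrative choices; the counts are copied from the table.

```python
# Counts copied from the reclassification table (30% vs 25% threshold).
table = {
    30: {"n_total": 1037, "n_changed": 66, "a_to_b": 30, "b_to_a": 2},
    25: {"n_total": 1037, "n_changed": 77, "a_to_b": 38, "b_to_a": 5},
}

def summarize(row):
    """Return (% of assessments changed, net shift toward Luminal B)."""
    pct_changed = round(100 * row["n_changed"] / row["n_total"], 1)
    net_shift = row["a_to_b"] - row["b_to_a"]
    return pct_changed, net_shift

pct30, net30 = summarize(table[30])
pct25, net25 = summarize(table[25])

print(f"30%: {pct30}% changed, net A->B shift {net30}")
print(f"25%: {pct25}% changed, net A->B shift {net25}")
print(f"Change in net shift when lowering the threshold: {net25 - net30}")
```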

19.3.4 Distribution Near Thresholds

Ki-67 Values Near Clinical Thresholds (+/-5 percentage points)
Threshold	N Near Pre-AI	N Near Post-AI	N Crossing Up	N Crossing Down
25%	228	258	165	12
30%	192	210	152	5

Clinical Impact: Cases within 5 percentage points of the threshold are at highest risk for reclassification. This analysis quantifies how many assessments fall in this “gray zone”, where an AI-induced shift in Ki-67 may change the subtype and therefore the treatment decision. At the 25% threshold, 228 Pre-AI and 258 Post-AI assessments fall in the zone; at 30%, 192 and 210.

Note for Pathologist: The threshold comparison shows whether lowering the Ki-67 cutoff from 30% to 25% changes how many patients get reclassified. If reclassification rates change dramatically, the results are sensitive to the exact cutoff used. If they remain similar, the findings are robust regardless of which guideline you follow.
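A gray-zone count of this kind might be computed as in the sketch below. The function name, the inclusive +/-5 window, and the convention that "below threshold" means Ki-67 < threshold (matching the Luminal A definition) are assumptions; the source does not state how boundary values were handled. The pairs are invented for illustration.

```python
def near_threshold_summary(pairs, threshold, margin=5):
    """Summarize paired (ki67_pre, ki67_post) scores around a threshold.

    A score is 'near' if it lies within +/- margin percentage points
    (inclusive, by assumption). A pair 'crosses up' if it moves from
    below the threshold to at-or-above it, and 'crosses down' in the
    opposite case.
    """
    lo, hi = threshold - margin, threshold + margin
    return {
        "near_pre": sum(lo <= pre <= hi for pre, _ in pairs),
        "near_post": sum(lo <= post <= hi for _, post in pairs),
        "crossing_up": sum(pre < threshold <= post for pre, post in pairs),
        "crossing_down": sum(post < threshold <= pre for pre, post in pairs),
    }

# Made-up paired Ki-67 scores (Pre-AI, Post-AI); not study data.
toy_pairs = [(22, 27), (28, 33), (26, 24), (10, 12), (31, 29), (40, 45)]

for threshold in (25, 30):
    print(threshold, near_threshold_summary(toy_pairs, threshold))
```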

19.4 Analysis 2: Outlier Exclusion (Extreme Ki-67 Changes)

19.4.1 Rationale

Systematic bias could be driven by a small number of extreme outliers rather than a true systematic trend. We test this by:

Identifying cases with Ki-67 change > 30 percentage points (e.g., Pre-AI 20% -> Post-AI 55%)
Excluding these outliers
Re-calculating systematic bias metrics

Hypothesis: If findings persist after outlier exclusion, systematic bias is not driven by aberrant cases.
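The three steps above can be sketched as follows, using only the Python standard library. The function name and the toy deltas are hypothetical; the 30-point cutoff is the one stated above, and a delta is taken to be Post-AI minus Pre-AI Ki-67 in percentage points.

```python
from statistics import mean, median

def bias_with_and_without_outliers(deltas, cutoff=30):
    """Compare mean/median Ki-67 change with and without extreme changes.

    An outlier is any assessment with |delta| > cutoff percentage points.
    """
    kept = [d for d in deltas if abs(d) <= cutoff]
    n_outliers = len(deltas) - len(kept)
    return {
        "n_total": len(deltas),
        "n_outliers": n_outliers,
        "pct_outliers": 100 * n_outliers / len(deltas),
        "mean_all": mean(deltas),
        "median_all": median(deltas),
        "mean_no_outliers": mean(kept),
        "median_no_outliers": median(kept),
    }

# Made-up Ki-67 changes for illustration (one extreme value of +35).
toy_deltas = [4, 5, 0, 10, -2, 35, 3, 6, 1, 8]

result = bias_with_and_without_outliers(toy_deltas)
for key, value in result.items():
    print(f"{key}: {value}")
```

Reporting the median alongside the mean is useful here because the median is largely insensitive to a handful of extreme values, so a stable median is itself evidence that outliers are not driving the shift.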

Impact of Outlier Exclusion on Ki-67 Systematic Bias
N Total	N Outliers (|Delta|>30)	% Outliers	Mean Delta (All)	Median Delta (All)	Mean Delta (No Outliers)	Median Delta (No Outliers)
1162	11	0.95	5.89	4	5.58	4

19.4.2 Visualization: Distribution With and Without Outliers

19.4.2.1 Interpretation

**Outlier exclusion analysis:**

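- 0.95% of Ki-67 assessments (11 of 1162) are outliers (|Delta| > 30 percentage points)

- Mean Ki-67 change: 5.89 (all) vs 5.58 (no outliers) percentage points; the median is 4 in both cases

- Removing outliers reduces the mean change by about 0.3 percentage points (roughly a 5% relative reduction)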


**Conclusion**: Systematic upward bias persists after outlier exclusion, indicating bias is not driven by extreme outliers.

19.5 Analysis 3: Biopsy Type Stratification

19.5.1 Rationale

Our primary analysis pools excision specimens and tru-cut biopsies. These differ in:

Sample size: Excisions larger, more representative
Fixation: May vary by specimen type
Tumor heterogeneity: Biopsies sample smaller areas

This sensitivity analysis restricts to excision specimens only to assess whether findings generalize to a more homogeneous sample type.
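A stratified summary of this kind could be produced as in the sketch below. The function names and toy records are hypothetical, "% increasing" is assumed to mean the share of assessments with a strictly positive change, and the quartile method (`method="inclusive"` in `statistics.quantiles`) is an illustrative choice that may differ from the one used for the reported tables.

```python
from statistics import mean, median, quantiles, stdev

def summarize_deltas(deltas):
    """Descriptive statistics for a list of Ki-67 changes (percentage points)."""
    q1, _, q3 = quantiles(deltas, n=4, method="inclusive")
    return {
        "n": len(deltas),
        "mean": mean(deltas),
        "sd": stdev(deltas),  # sample standard deviation
        "median": median(deltas),
        "q1": q1,
        "q3": q3,
        "pct_increasing": 100 * sum(d > 0 for d in deltas) / len(deltas),
    }

def summarize_by_type(records):
    """records: list of (specimen_type, delta). Returns per-type and pooled summaries."""
    groups = {}
    for specimen_type, delta in records:
        groups.setdefault(specimen_type, []).append(delta)
    summary = {name: summarize_deltas(values) for name, values in groups.items()}
    summary["Pooled"] = summarize_deltas([delta for _, delta in records])
    return summary

# Made-up records for illustration only (not study data).
toy_records = [
    ("Excision", 4), ("Excision", 1), ("Excision", 10), ("Excision", 0), ("Excision", 5),
    ("Tru-cut", 5), ("Tru-cut", -3), ("Tru-cut", 12), ("Tru-cut", 6),
]

summary = summarize_by_type(toy_records)
for name, stats in summary.items():
    print(name, stats)
```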

Ki-67 Systematic Bias by Biopsy Type
Biopsy Type	N	Mean Delta (%)	SD	Median Delta (%)	Q1	Q3
Excision	682	5.80	7.06	4	1	10
Tru-cut	480	6.01	8.47	5	0	10

19.5.2 Excision-Only Analysis

Ki-67 Systematic Bias: Pooled vs Excision-Only
Analysis	N	Mean Delta (%)	SD	Median Delta (%)	% Increasing
Pooled (All Types)	1162	5.89	7.67	4	75.99
Excision Only	682	5.80	7.06	4	77.13


**Interpretation**: Mean Ki-67 change in the excision-only analysis (5.80 percentage points) is close to both the pooled estimate (5.89) and the tru-cut estimate (6.01), indicating findings are robust to specimen type.

19.6 Analysis 4: Complete Case Analysis (HER2)

19.6.1 Rationale

Our primary HER2 analysis includes assessments with missing data (9.4% of assessments; see the summary below). Although the missing data mechanism was assessed (MCAR test), we perform a complete case analysis to check that findings are not biased by missing data patterns.

**HER2 Missing Data Summary:**

- Total assessments: 1184

- Complete cases: 1073 (90.6%)

- Missing: 111 (9.4%)
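The complete case restriction can be illustrated with a short sketch. The definition used here (a case is complete only if every rater provided a HER2 score) is an assumption, since the source says only "cases with no missing data"; the function names, the three-rater layout, and the toy scores are likewise hypothetical. The last lines recompute the percentages from the counts reported above.

```python
def complete_cases(ratings_by_case):
    """Keep only cases where no rater's HER2 score is missing (None)."""
    return {
        case: scores
        for case, scores in ratings_by_case.items()
        if all(score is not None for score in scores)
    }

def missingness(ratings_by_case):
    """Return (total assessments, missing assessments, % missing)."""
    scores = [s for row in ratings_by_case.values() for s in row]
    n_missing = sum(s is None for s in scores)
    return len(scores), n_missing, 100 * n_missing / len(scores)

# Made-up HER2 scores from three hypothetical raters; None marks a missing score.
toy_ratings = {
    "case1": ["0", "1+", "1+"],
    "case2": ["2+", None, "2+"],
    "case3": ["3+", "3+", "3+"],
    "case4": [None, None, "0"],
}

print(sorted(complete_cases(toy_ratings)))
print(missingness(toy_ratings))

# Percentages recomputed from the counts reported in the summary.
total, complete, missing = 1184, 1073, 111
pct_complete = round(100 * complete / total, 1)
pct_missing = round(100 * missing / total, 1)
print(pct_complete, pct_missing)
```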

19.6.2 Agreement Metrics: Complete vs Full Sample

HER2 Agreement: Complete Case vs Full Sample
Analysis	Phase	Fleiss’ Kappa	N Cases
Complete Cases	Post-AI	0.762	203
Complete Cases	Pre-AI	0.691	203
Full Sample	Post-AI	0.726	226
Full Sample	Pre-AI	0.671	229
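For readers unfamiliar with the statistic in this table, the following is a minimal sketch of the standard Fleiss' kappa formula, written from its textbook definition rather than taken from the study's code. It assumes every case has the same number of ratings; the function name, the three-rater layout, and the toy scores are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of category labels.

    Every case must have the same number of ratings. The result is
    undefined (division by zero) if all ratings fall in a single category.
    """
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every case needs the same number of ratings")
    n_cases = len(ratings)

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of rater pairs that agree on this case.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_cases
    expected = sum(
        (total / (n_cases * n_raters)) ** 2 for total in category_totals.values()
    )
    return (observed - expected) / (1 - expected)

# Made-up HER2 scores from three hypothetical raters (not study data).
toy = [
    ["0", "0", "0"],
    ["1+", "1+", "2+"],
    ["2+", "2+", "2+"],
    ["3+", "3+", "3+"],
    ["1+", "2+", "2+"],
]
print(round(fleiss_kappa(toy), 3))
```

Because the formula requires a fixed number of ratings per case, cases with a missing rating must be either dropped or handled by a variant of the statistic, which is one reason a complete case comparison is informative.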

19.6.2.1 Interpretation

**HER2 Complete Case Analysis:**

- Pre-AI Kappa: 0.691 (complete) vs 0.671 (full) -- Difference: 0.020

- Post-AI Kappa: 0.762 (complete) vs 0.726 (full) -- Difference: 0.036


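**Conclusion**: Agreement metrics are similar between the complete case and full sample analyses (differences of 0.020 to 0.036), and the Pre-AI to Post-AI improvement appears in both (+0.071 in complete cases, +0.055 in the full sample), suggesting missing data does not substantially bias findings.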

19.7 Summary of Sensitivity Analyses

19.7.1 Overview Table

Summary of Sensitivity Analyses
Analysis	Parameter Tested	Primary Finding	Sensitivity Result	Conclusion
Alternative Ki-67 Threshold	25% vs 30% cutoff	Systematic +5.9 percentage point Ki-67 increase	Net A->B shift: 28 cases (30%) vs 33 cases (25%)	Robust
Outlier Exclusion	Exclude |Delta| > 30 percentage points	Systematic bias persists	Mean Delta 5.89 -> 5.58 after excluding 11 outliers	Not driven by outliers
Biopsy Type	Excision only	Pooled analysis valid	Mean Delta 5.80 (excision only) vs 5.89 (pooled)	Generalizes across types
Complete Case (HER2)	No missing data	Kappa improvement +0.055 (0.671 -> 0.726)	Kappa improvement +0.071 (0.691 -> 0.762)	Not biased by missingness

19.7.2 Key Takeaways

Alternative Ki-67 Threshold (25% vs 30%):
- Purpose: Test robustness to guideline variation
- Result: 6.4% of assessments reclassified at 30% vs 7.4% at 25%; net shift toward Luminal B of 28 vs 33 cases
- Implication: Findings are robust to threshold choice
Outlier Exclusion:
- Purpose: Rule out influence of extreme outliers
- Result: Excluding 11 outliers (0.95%) lowers the mean change from 5.89 to 5.58 percentage points; the median stays at 4
- Implication: Systematic bias is not driven by aberrant cases
Biopsy Type Stratification:
- Purpose: Assess generalizability across specimen types
- Result: Mean change of 5.80 percentage points in excision specimens vs 5.89 pooled (tru-cut: 6.01)
- Implication: Findings generalize to excision specimens
Complete Case Analysis (HER2):
- Purpose: Rule out missing data bias
- Result: Kappa rises from 0.691 to 0.762 in complete cases vs 0.671 to 0.726 in the full sample
- Implication: Missing data does not appear to substantially bias HER2 findings

19.7.3 Clinical Implications

For Pathologists:

If findings are robust across sensitivity analyses -> High confidence in primary results
If findings are sensitive to specific assumptions -> Caution in interpretation, additional validation needed

For Institutions Implementing AI:

Robust findings support broader adoption
Sensitive findings suggest need for local validation with institution-specific thresholds and workflows

For Guideline Development:

Sensitivity to Ki-67 threshold choice highlights need for standardized cutoffs
Outlier analysis informs quality control metrics for AI-assisted assessment

19.8 Recommendations

Based on sensitivity analyses:

If findings are robust (minimal change across analyses):
- Primary analysis results are generalizable
- AI can be implemented with confidence
- Monitoring protocols can use primary thresholds
If findings are sensitive (substantial change with alternative assumptions):
- Local validation required before implementation
- Institution-specific threshold calibration needed
- Enhanced quality control for borderline cases
For Future Studies:
- Prospective validation with predetermined Ki-67 thresholds
- Multi-institutional studies to assess generalizability
- Outcomes-based validation (correlation with recurrence risk)