11 Systematic Bias Analysis – AI for Breast Cancer Analysis

11.1 Objective

Detect and quantify systematic biases in AI and pathologist assessments.

Note for Pathologist: Systematic bias means consistently erring in the same direction. For example, does the AI always underestimate Ki67 in high-grade tumors? Does it always score ER higher than humans? Detecting these biases lets us calibrate our expectations: if we know the AI runs high, we can adjust for that when interpreting its output.

11.2 Setup

11.3 Load Data

[1] "Filtered to common cohort. Merged N: 1184"

[1] "Before NA removal: 1184"

[1] "After NA removal: 1037"

11.4 Overall AI Bias

Does AI systematically over- or underestimate compared with pathologists’ initial assessments?

Overall AI Bias
Mean and median differences (Post-AI minus Pre-AI), n = 1037
Marker	Mean Bias¹	Median Bias	SD of Bias
ER	−0.76	0.00	6.39
PR	−1.23	0.00	7.62
Ki67	5.88	4.00	7.69
¹ Positive values = AI tends to give higher scores
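The summary statistics in this section can be computed from paired scores along the following lines. This is a minimal sketch, not the report's own code: the function name `bias_summary` and the example scores are hypothetical and are not study data.

```python
from statistics import mean, median, stdev


def bias_summary(pre, post):
    """Summarise paired differences, defined as post-AI minus pre-AI."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain the same number of cases")
    diffs = [b - a for a, b in zip(pre, post)]
    return {
        "mean": mean(diffs),      # positive = scores rose after AI input
        "median": median(diffs),
        "sd": stdev(diffs),       # sample SD (n - 1 denominator)
        "n": len(diffs),
    }


# Hypothetical Ki67 scores (percent positive cells), for illustration only
pre_ki67 = [10, 20, 30, 40, 50]
post_ki67 = [14, 26, 33, 45, 52]

summary = bias_summary(pre_ki67, post_ki67)
print(summary)
```

A mean that differs noticeably from the median, as for Ki67 in the table (5.88 vs 4.00), suggests a skewed distribution of changes.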

Statistical Tests for Systematic Bias
Paired t-tests: H0 = no difference between Pre-AI and Post-AI scores
Marker	Mean Difference	t-statistic	P-value	95% CI Lower	95% CI Upper
ER	−0.764	−3.846	0.0001	−1.153	−0.374
PR	−1.226	−5.176	<0.0001	−1.690	−0.761
Ki67	5.884	24.642	<0.0001	5.416	6.353
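As a plausibility check, the t-statistics and confidence intervals can be approximately reproduced from the reported mean difference, SD of bias, and n = 1037. The sketch below assumes a normal approximation to the t distribution, which is very close at 1036 degrees of freedom. The function name is hypothetical, and small discrepancies from the table come from using rounded inputs.

```python
import math
from statistics import NormalDist


def paired_t_from_summary(mean_diff, sd_diff, n, level=0.95):
    """Return (t statistic, CI lower, CI upper) from summary statistics.

    Assumption: the critical value is taken from the standard normal
    distribution rather than the t distribution.
    """
    se = sd_diff / math.sqrt(n)                 # standard error of the mean difference
    t_stat = mean_diff / se
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return t_stat, mean_diff - z * se, mean_diff + z * se


# Mean difference and SD of bias as reported in this chapter
reported = {"ER": (-0.764, 6.39), "PR": (-1.226, 7.62), "Ki67": (5.884, 7.69)}
results = {m: paired_t_from_summary(md, sd, 1037) for m, (md, sd) in reported.items()}

for marker, (t_stat, lo, hi) in results.items():
    print(f"{marker}: t = {t_stat:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```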

11.5 Bias by Score Range

Test if AI bias varies by the initial score level (e.g., more bias for low vs high scores).

ANOVA: Bias Variation by Score Range
Test if AI bias differs across score ranges
Marker	F-statistic	P-value
ER	10.72	<0.0001
PR	56.61	<0.0001
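The F-statistic in a one-way ANOVA compares the variation of mean bias between score ranges with the variation within each range. The sketch below shows the calculation on hypothetical data. The report does not state which score ranges were used, so the three groups here (low, mid, high) are an assumption for illustration only.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of cases around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical bias values (post minus pre) grouped by initial score range
bias_low = [1, 2, 3]
bias_mid = [2, 3, 4]
bias_high = [5, 6, 7]

f_stat = one_way_anova_f([bias_low, bias_mid, bias_high])
print(f"F = {f_stat:.2f}")
```

A large F, as reported for ER and PR, indicates that mean bias differs between score ranges by more than the within-range scatter would explain.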

11.6 Regression to the Mean

Test if AI shows regression to the mean (extreme values become more moderate).

Regression to the Mean Analysis
Relationship between initial score and change
Marker	Slope (β)¹	SE	t-value	P-value
ER	−0.0199	0.0053	−3.7706	0.00017
PR	−0.0627	0.0062	−10.1466	<0.00001
Ki67	−0.0296	0.0112	−2.6327	0.00860
¹ Negative slope = regression to mean (high values decrease, low values increase)
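The slope β can be read as the expected change in score per one-point increase in the initial score. A minimal sketch of the calculation, assuming an ordinary least-squares regression of change (post minus pre) on the pre-AI score, is shown here. The function name and data are hypothetical.

```python
def slope_of_change(pre, post):
    """OLS slope from regressing change (post - pre) on the pre-AI score."""
    changes = [b - a for a, b in zip(pre, post)]
    x_bar = sum(pre) / len(pre)
    y_bar = sum(changes) / len(changes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, changes))
    sxx = sum((x - x_bar) ** 2 for x in pre)
    return sxy / sxx


# Hypothetical scores: the lowest value rises and the highest value falls
pre_scores = [0, 50, 100]
post_scores = [5, 50, 95]

beta = slope_of_change(pre_scores, post_scores)
print(f"slope = {beta:.3f}")
```

One caveat when interpreting such slopes: because the initial score appears on both sides of the regression, random measurement error in that score alone tends to produce a negative slope, so a negative β does not by itself show that AI pulls scores toward the middle.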

11.7 Pathologist-Specific Bias Patterns

Compare systematic biases across pathologists.

Pathologist-Specific Bias
Mean change after AI by pathologist
Pathologist	ER Bias	PR Bias	Ki67 Bias
Pathologist 1	−1.83	−1.69	8.63
Pathologist 2	−1.38	−1.94	3.43
Pathologist 3	0.29	−1.03	5.62
Pathologist 4	−0.05	−0.16	6.00

11.8 HER2 Bias Patterns

Analyze directional bias in HER2 scoring.

HER2 Score Changes
Overall directional patterns
Change Direction	Count	Percentage (%)
Downgraded	56	5.4
No Change	938	90.5
Upgraded	43	4.1

HER2 Changes by Pathologist
Distribution of upgrades, downgrades, and no changes
Pathologist	Downgraded	No Change	Upgraded	Net Change	Total Cases
Pathologist 1	8	237	14	6	259
Pathologist 2	30	230	15	−15	275
Pathologist 3	12	231	9	−3	252
Pathologist 4	6	240	5	−1	251

Most Common HER2 Transitions
Top 10 score change patterns
Pre-AI Score	Post-AI Score	Direction	Count
1	1	No Change	384
0	0	No Change	293
2	2	No Change	137
3	3	No Change	124
2	1	Downgraded	44
0	1	Upgraded	24
1	2	Upgraded	16
1	0	Downgraded	10
2	3	Upgraded	3
3	2	Downgraded	2
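The directional totals can be cross-checked against the transition counts listed in this section. The sketch below tallies the ten listed transitions by direction. The counts are copied from the table, while the helper name `direction` is hypothetical.

```python
from collections import Counter

# (pre-AI score, post-AI score, count) as listed in the transitions table
transitions = [
    (1, 1, 384), (0, 0, 293), (2, 2, 137), (3, 3, 124),
    (2, 1, 44), (0, 1, 24), (1, 2, 16), (1, 0, 10),
    (2, 3, 3), (3, 2, 2),
]


def direction(pre, post):
    """Classify a HER2 score change by its direction."""
    if post > pre:
        return "Upgraded"
    if post < pre:
        return "Downgraded"
    return "No Change"


totals = Counter()
for pre, post, count in transitions:
    totals[direction(pre, post)] += count

n_cases = sum(totals.values())
percentages = {k: round(100 * v / n_cases, 1) for k, v in totals.items()}
net_change = totals["Upgraded"] - totals["Downgraded"]

print(dict(totals), percentages, net_change)
```

The ten listed transitions account for all 1037 cases, and every change moved by exactly one HER2 category.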

11.9 Threshold Effects

Examine if bias is stronger near clinically important thresholds.

AI Influence Near Clinical Thresholds
Mean absolute change within 5% of threshold vs away from threshold
Marker	Threshold	Near Threshold	Far from Threshold	Difference	Ratio¹
ER	1%	0.33	3.76	−3.43	0.09
ER	10%	4.05	3.09	0.96	1.31
PR	1%	0.54	7.40	−6.86	0.07
PR	10%	3.58	4.05	−0.47	0.88
Ki67	20%	7.81	6.60	1.21	1.18
Ki67	30%	8.61	6.55	2.06	1.32
¹ Ratio > 1 = More AI influence near thresholds
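The Difference and Ratio columns compare the mean absolute change for cases near a threshold with that for cases far from it. The sketch below illustrates one way to compute them. It assumes that "near" means the pre-AI score lies within 5 percentage points of the threshold, which is an interpretation of the table subtitle rather than something the report states. The function name and data are hypothetical.

```python
from statistics import mean


def threshold_effect(pre, post, threshold, window=5.0):
    """Compare mean absolute change near vs far from a clinical threshold.

    Assumption: a case is "near" when its pre-AI score is within `window`
    percentage points of `threshold`.
    """
    near, far = [], []
    for a, b in zip(pre, post):
        bucket = near if abs(a - threshold) <= window else far
        bucket.append(abs(b - a))
    if not near or not far:
        raise ValueError("need at least one case on each side of the window")
    near_mean, far_mean = mean(near), mean(far)
    return {
        "near": near_mean,
        "far": far_mean,
        "difference": near_mean - far_mean,
        "ratio": near_mean / far_mean,   # > 1 means larger changes near the threshold
    }


# Hypothetical Ki67 scores evaluated against the 20% threshold
pre_ki67 = [18, 22, 5, 60]
post_ki67 = [26, 16, 7, 62]

effect = threshold_effect(pre_ki67, post_ki67, threshold=20)
print(effect)
```

With a fixed window of this kind, a low threshold such as 1% captures mostly cases scored at or near zero, which may partly explain the very small ratios for the ER and PR 1% thresholds.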

11.10 Bland-Altman Analysis

Classic method for assessing agreement and detecting systematic bias.

Note: The Bland-Altman analysis below pools observations across all four pathologists. This approach captures overall pre-vs-post bias but conflates within-pathologist change with between-pathologist variation, potentially widening the limits of agreement. Per-pathologist Bland-Altman plots are available in the individual pathologist performance analysis (Chapter 7).

Bland-Altman Statistics
Mean difference and 95% limits of agreement
Marker	Mean Difference	SD of Differences	Upper LoA	Lower LoA
ER	−0.76	6.39	11.77	−13.30
PR	−1.23	7.62	13.72	−16.17
Ki67	5.88	7.69	20.96	−9.19
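The 95% limits of agreement follow the standard Bland-Altman definition, mean difference ± 1.96 × SD of the differences. The sketch below recomputes them from the rounded values in the table. The function name is hypothetical, and differences of about 0.01 from the reported limits come from rounding of the inputs.

```python
def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Return (lower, upper) 95% limits of agreement."""
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


# Mean difference and SD of differences as reported in the table
reported = {"ER": (-0.76, 6.39), "PR": (-1.23, 7.62), "Ki67": (5.88, 7.69)}
loa = {marker: limits_of_agreement(md, sd) for marker, (md, sd) in reported.items()}

for marker, (lower, upper) in loa.items():
    print(f"{marker}: {lower:.2f} to {upper:.2f}")
```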

11.11 Conclusion

11.11.1 Summary of Bias Patterns

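Overall Bias: After AI input, Ki67 scores rose by a mean of 5.88 points (P < 0.0001), while ER and PR scores fell slightly (−0.76 and −1.23; both P ≤ 0.0001)
Range-Dependent Bias: The change differed significantly across initial score ranges for ER (F = 10.72) and PR (F = 56.61)
Regression to the Mean: Slopes were negative for all three markers (ER −0.0199, PR −0.0627, Ki67 −0.0296), so extreme values tended to become more moderate after AI input
Pathologist-Specific Patterns: Mean Ki67 change ranged from 3.43 (Pathologist 2) to 8.63 (Pathologist 1); mean ER and PR changes stayed within 2 points for every pathologist
Threshold Effects: Mean absolute change was larger near the ER 10%, Ki67 20%, and Ki67 30% thresholds (ratios 1.18 to 1.32), slightly smaller near the PR 10% threshold (0.88), and much smaller near the 1% ER and PR thresholds (ratios below 0.1)
Bland-Altman Analysis: The pooled 95% limits of agreement spanned about 25 to 30 points (ER −13.30 to 11.77, PR −16.17 to 13.72, Ki67 −9.19 to 20.96)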

11.11.2 Clinical Implications

Awareness of systematic biases is crucial for appropriate AI implementation
Bias patterns should inform calibration and validation strategies
Individual pathologists may need targeted training to address specific biases
Cases near clinical thresholds require extra scrutiny