7 Discordance Analysis – AI for Breast Cancer Analysis

7.1 Objective

Identify the cases where pathologists disagree most, identify the cases where AI assistance increased divergence, and characterize the patterns of disagreement.

Note for Pathologist: This report “zooms in” on the disagreements. We identify specific cases where there was high variability among pathologists (the “Top Discordant Cases”). We also look at whether AI made these disagreements better (resolution) or worse (divergence) and which biomarker (ER, PR, Ki67, HER2) causes the most trouble.

7.2 Setup

7.3 Load Data

7.4 High Variance Cases

Identify cases with highest inter-pathologist disagreement.

Top 10 Most Discordant Cases (Pre-AI)
Cases with highest inter-pathologist variance
case_id	pathologist	er_pre	pr_pre	ki67_pre	her2_pre	comment
10257-25	Pathologist 1	70.0	40.0	39.0	3	NA
10257-25	Pathologist 2	70.0	70.0	50.0	3	NA
10257-25	Pathologist 3	80.0	40.0	40.0	3	NA
10257-25	Pathologist 4	30.0	30.0	39.0	3	NA
13471-25	Pathologist 1	95.0	10.0	3.0	0	NA
13471-25	Pathologist 2	95.0	40.0	3.0	1	NA
13471-25	Pathologist 3	90.0	60.0	25.0	0	NA
13472-25	Pathologist 1	70.0	50.0	24.0	1	NA
13472-25	Pathologist 3	90.0	60.0	25.0	0	AI failed to recognize the invasive areas in HER2
13472-25	Pathologist 4	95.0	90.0	14.0	1	NA
23259-25	Pathologist 2	100.0	50.0	3.0	0	AI does not recognize lobular carcinoma
23259-25	Pathologist 3	60.0	60.0	3.0	0	NA
23259-25	Pathologist 4	95.0	90.0	3.0	0	NA
27247-25	Pathologist 3	80.0	60.0	8.0	0	NA
27247-25	Pathologist 4	95.0	90.0	8.0	1	NA
31289-25	Pathologist 2	30.0	5.0	6.0	0	NA
31289-25	Pathologist 3	60.0	40.0	1.0	0	NA
33681-25	Pathologist 1	95.0	90.0	30.0	2	AI score 1
33681-25	Pathologist 2	50.0	30.0	35.0	1	Staining with artifacts
33681-25	Pathologist 3	90.0	90.0	30.0	1	NA
33681-25	Pathologist 4	95.0	95.0	35.0	2	In 2t3 there was an area with focal 2+ HER2 staining, so I scored it as 2.
34334-25	Pathologist 2	90.0	0.0	23.0	1	Staining with artifacts; very problematic for AI
34334-25	Pathologist 3	90.0	80.0	22.0	1	AI mistook the in situ component for HER2 score 3
34334-25	Pathologist 4	95.0	0.0	23.0	1	AI accepted the artifactual staining in PR as positive. AI gave HER2 score 3; a similar artifact is present there as well.
9347-25	Pathologist 1	75.0	40.0	32.0	2	HER2 AI score 1
9347-25	Pathologist 2	100.0	50.0	40.0	2	NA
9347-25	Pathologist 3	90.0	90.0	40.0	2	AI counted background stromal cells as ER/PR negative. It did not select the invasive areas in HER2.
9347-25	Pathologist 4	95.0	60.0	41.0	2	AI calculated ER at around 60%. AI assessed CERBB2 as negative (1+).
9628-25	Pathologist 1	40.0	0.0	4.0	0	NA
9628-25	Pathologist 3	50.0	0.0	10.0	0	Patient received neoadjuvant therapy; AI does not recognize the residual tumor
9628-25	Pathologist 4	95.0	0.0	0.0	0	AI does not recognize post-treatment tumor present as single cells. I could not assess Ki-67 because of the low number of tumor cells.

Note for Pathologist: The table above lists the 10 cases where pathologists disagreed the most before AI. These are inherently “difficult” cases. Review the actual values: if all four pathologists gave very different ER or Ki67 percentages, it may indicate tumor heterogeneity, staining issues, or differing assessment strategies.
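
One way such a ranking could be produced is sketched below. The report does not state how per-marker variances were combined into a single score, so the sum of sample variances over ER, PR, and Ki67 used here is an assumption, as are the names `case_discordance`, `scores`, and `top_cases`. The rows are a small subset copied from the table above.

```python
from collections import defaultdict
from statistics import variance

# (case_id, pathologist, er_pre, pr_pre, ki67_pre): subset of the table above
rows = [
    ("10257-25", "Pathologist 1", 70.0, 40.0, 39.0),
    ("10257-25", "Pathologist 2", 70.0, 70.0, 50.0),
    ("10257-25", "Pathologist 3", 80.0, 40.0, 40.0),
    ("10257-25", "Pathologist 4", 30.0, 30.0, 39.0),
    ("27247-25", "Pathologist 3", 80.0, 60.0, 8.0),
    ("27247-25", "Pathologist 4", 95.0, 90.0, 8.0),
]


def case_discordance(rows):
    """Per case: sum of per-marker sample variances across pathologists.

    Assumed scoring rule; the report's actual combination is not documented.
    """
    by_case = defaultdict(list)
    for case_id, _pathologist, *values in rows:
        by_case[case_id].append(values)

    scores = {}
    for case_id, readings in by_case.items():
        if len(readings) < 2:
            continue  # sample variance is undefined for a single reader
        per_marker = zip(*readings)  # transpose: one tuple of readings per marker
        scores[case_id] = sum(variance(marker) for marker in per_marker)
    return scores


scores = case_discordance(rows)
# Highest score first; keep at most the ten most discordant cases
top_cases = sorted(scores, key=scores.get, reverse=True)[:10]
```

Because the markers share a 0 to 100 scale, summing their variances is a simple choice; a different weighting would change the ranking.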

7.5 Cases Where AI Increased Disagreement

Identify problematic cases where AI made pathologists disagree more.

Number of cases where AI increased disagreement: 94

Percentage: 32.1 %

Cases Where AI Increased Disagreement
Top 10 cases with largest variance increase
case_id	pathologist	er_pre	er_post	pr_pre	pr_post	ki67_pre	ki67_post	her2_pre	her2_post	comment
11567-25	Pathologist 1	0.0	0.0	0.0	0.0	15.0	42.0	0	0	NA
11567-25	Pathologist 2	0.0	0.0	0.0	0.0	7.0	33.0	0	0	Spindle cell tumor
11567-25	Pathologist 3	0.0	0.0	0.0	0.0	10.0	10.0	0	0	NA
11567-25	Pathologist 4	0.0	0.0	0.0	0.0	15.0	42.0	0	0	NA
12545-25	Pathologist 1	95.0	96.0	2.0	8.0	8.0	51.0	0	0	NA
12545-25	Pathologist 3	90.0	90.0	5.0	5.0	8.0	10.0	0	0	NA
12545-25	Pathologist 4	95.0	95.0	5.0	5.0	8.0	20.0	0	0	NA
13412-25	Pathologist 1	90.0	94.0	95.0	85.0	4.0	28.0	0	0	NA
13412-25	Pathologist 2	90.0	95.0	90.0	85.0	7.0	23.0	0	0	NA
13412-25	Pathologist 3	90.0	90.0	90.0	90.0	8.0	0.0	0	0	NA
21208-25	Pathologist 1	95.0	95.0	0.0	0.0	1.0	1.0	3	3	NA
21208-25	Pathologist 2	95.0	95.0	0.0	0.0	1.0	1.0	3	3	NA
21208-25	Pathologist 3	80.0	60.0	0.0	10.0	1.0	0.0	3	3	NA
21208-25	Pathologist 4	95.0	95.0	0.0	0.0	1.0	1.0	3	3	NA
25969-25	Pathologist 2	10.0	20.0	60.0	73.0	2.0	4.0	0	0	NA
25969-25	Pathologist 3	20.0	20.0	50.0	40.0	4.0	4.0	0	0	NA
26410-25	Pathologist 1	0.0	0.0	0.0	0.0	27.0	23.0	0	0	NA
26410-25	Pathologist 2	0.0	0.0	0.0	0.0	34.0	50.0	0	0	NA
26410-25	Pathologist 3	0.0	0.0	0.0	0.0	34.0	50.0	0	0	NA
26410-25	Pathologist 4	0.0	0.0	0.0	0.0	34.0	50.0	0	0	NA
29466-25	Pathologist 1	95.0	94.0	50.0	40.0	27.0	68.0	3	3	NA
29466-25	Pathologist 3	90.0	90.0	60.0	40.0	27.0	40.0	3	3	NA
29466-25	Pathologist 4	95.0	95.0	50.0	40.0	27.0	38.0	3	3	NA
30707-25	Pathologist 2	80.0	80.0	10.0	10.0	20.0	20.0	1	1	NA
30707-25	Pathologist 3	90.0	50.0	30.0	20.0	5.0	10.0	1	1	NA
32224-25	Pathologist 1	95.0	82.0	95.0	94.0	9.0	10.0	0	0	NA
32224-25	Pathologist 2	100.0	95.0	100.0	95.0	10.0	10.0	0	0	NA
32224-25	Pathologist 3	90.0	90.0	90.0	60.0	9.0	12.0	0	0	NA
32224-25	Pathologist 4	95.0	95.0	90.0	90.0	9.0	10.0	0	0	NA
7007-25	Pathologist 1	90.0	61.0	0.0	2.0	1.0	3.0	1	1	NA
7007-25	Pathologist 2	90.0	90.0	0.0	0.0	0.0	0.0	1	1	Scattered tumor
7007-25	Pathologist 3	80.0	60.0	0.0	1.0	1.0	3.0	2	2	NA
7007-25	Pathologist 4	80.0	70.0	0.0	0.0	1.0	1.0	1	1	NA
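
A minimal sketch of how a case could be labeled as "increased" or "decreased", assuming disagreement is measured as the sample variance of the readings for one marker before and after AI. The Ki67 values for the first two cases come from the table above; `hypothetical-1` is an invented case added only to show the "decreased" branch, and the function names are illustrative.

```python
from collections import Counter
from statistics import variance


def variance_change(pre, post):
    """Post-AI minus pre-AI sample variance; positive means more disagreement."""
    return variance(post) - variance(pre)


def classify(delta, tol=1e-9):
    """Label the direction of the change, treating tiny differences as no change."""
    if delta > tol:
        return "increased"
    if delta < -tol:
        return "decreased"
    return "no change"


# case_id -> (ki67_pre readings, ki67_post readings)
ki67 = {
    "11567-25": ([15.0, 7.0, 10.0, 15.0], [42.0, 33.0, 10.0, 42.0]),
    "26410-25": ([27.0, 34.0, 34.0, 34.0], [23.0, 50.0, 50.0, 50.0]),
    "hypothetical-1": ([10.0, 20.0, 30.0], [20.0, 20.0, 20.0]),  # invented example
}

changes = {case: variance_change(pre, post) for case, (pre, post) in ki67.items()}
labels = {case: classify(delta) for case, delta in changes.items()}
tally = Counter(labels.values())
share_increased = 100 * tally["increased"] / len(labels)
```

The same tally over all cases would give counts and percentages of the kind reported in this section and the next.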

7.6 Cases Where AI Improved Agreement

Identify cases where AI successfully reduced disagreement.

Number of cases where AI decreased disagreement: 193

Percentage: 65.9 %

Note for Pathologist: The bar chart above categorizes cases by how much AI changed the spread of pathologist opinions. “Large Decrease” means AI helped the group converge significantly. “Large Increase” means AI made them disagree more. Ideally, most cases should be in the “Decrease” or “No Change” categories.

7.7 Marker-Specific Discordance Patterns

Which markers have the most disagreement?

Variance Change by Marker
Marker	Pre-AI Variance	Post-AI Variance	Absolute Change	% Change
ER	47.12	28.10	−19.02	−40.4
PR	74.31	29.99	−44.32	−59.6
Ki67	24.66	30.78	6.12	24.8

Note for Pathologist: The bar chart and table above show which markers drive the most disagreement. Higher variance means the pathologists' values for that marker were spread more widely. Compare the Pre-AI and Post-AI bars: if the Post-AI bar is shorter, AI reduced disagreement for that marker.
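
The change columns follow directly from the two variance columns. The short sketch below recomputes them from the table values; the names `marker_variance` and `summarize` are illustrative only.

```python
# marker -> (pre-AI variance, post-AI variance), copied from the table above
marker_variance = {
    "ER": (47.12, 28.10),
    "PR": (74.31, 29.99),
    "Ki67": (24.66, 30.78),
}


def summarize(pre, post):
    """Return (absolute change, percent change relative to the pre-AI value)."""
    absolute = post - pre
    percent = 100 * absolute / pre
    return round(absolute, 2), round(percent, 1)


summary = {marker: summarize(pre, post) for marker, (pre, post) in marker_variance.items()}

for marker, (absolute, percent) in summary.items():
    print(f"{marker}: {absolute:+.2f} ({percent:+.1f}%)")
```

A negative percent change means the spread of readings narrowed after AI; Ki67 is the only marker here with a positive change.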

7.8 HER2 Disagreement Analysis

Detailed analysis of HER2 scoring discordance.

Number of cases with HER2 disagreement (Pre-AI): 119

HER2 Score Distribution in Discordant Cases (Pre-AI)
First 20 cases showing number of pathologists per score
case_id	Score_0	Score_1	Score_2	Score_3
10183-25	1	3	0	0
10353-25	1	2	0	0
10454-25	0	2	2	0
10676-25	0	1	3	0
10719-25	0	0	3	1
10928-25	0	3	1	0
11286-25	1	3	0	0
11352-25	1	2	0	0
11630-25	0	1	3	0
11632-25	1	3	0	0
12030-25	1	3	0	0
12068-25	0	1	3	0
12343-25	2	1	0	0
12366-25	0	2	1	0
12440-25	1	2	0	0
13471-25	2	1	0	0
13472-25	1	2	0	0
13582-25	1	3	0	0
14029-25	0	3	1	0
14058-25	0	2	2	0

HER2 Disagreement Resolution
Outcome	Cases	Percentage (%)
AI created disagreement	12	4.1
AI resolved disagreement	38	13.0
Consistent throughout	162	55.3
Still discordant	81	27.6
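
The four outcomes can be derived from whether the pathologists' HER2 scores for a case were unanimous before and after AI. The sketch below assumes "disagreement" means any difference in score among the readers of a case; the function name `her2_outcome` is illustrative. The two example cases use HER2 values from the table in the section on cases where AI increased disagreement, and the percentages are recomputed from the counts above.

```python
def her2_outcome(pre_scores, post_scores):
    """Classify a case by whether readers were unanimous before and after AI."""
    pre_discordant = len(set(pre_scores)) > 1
    post_discordant = len(set(post_scores)) > 1
    if pre_discordant and post_discordant:
        return "Still discordant"
    if pre_discordant:
        return "AI resolved disagreement"
    if post_discordant:
        return "AI created disagreement"
    return "Consistent throughout"


# HER2 pre/post scores for two cases listed earlier in this report
outcome_7007 = her2_outcome([1, 1, 2, 1], [1, 1, 2, 1])
outcome_21208 = her2_outcome([3, 3, 3, 3], [3, 3, 3, 3])

# Recompute the percentage column from the case counts in the table above
counts = {
    "AI created disagreement": 12,
    "AI resolved disagreement": 38,
    "Consistent throughout": 162,
    "Still discordant": 81,
}
total = sum(counts.values())
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The resolved and still-discordant counts together (38 + 81) equal the 119 cases with pre-AI HER2 disagreement reported at the start of this section.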

7.9 Pairwise Pathologist Disagreement

Create a disagreement matrix showing which pathologists disagree most.

Note for Pathologist: These heatmaps show mean absolute differences between every pair of pathologists. Darker red cells mean larger disagreement. Compare the Pre-AI and Post-AI heatmaps for each marker: if cells become greener (lower values) after AI, it means AI helped those two pathologists converge.
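
A minimal sketch of how the values behind such a heatmap could be computed, assuming each cell is the mean absolute difference over the cases that both pathologists scored. The ER pre-AI readings are a subset taken from the Top 10 table earlier in this report; the `P1` to `P4` keys and the name `pairwise_mad` are shorthand introduced here.

```python
from itertools import combinations
from statistics import mean

# case_id -> {pathologist: er_pre}; a reader missing from a case is simply absent
er_pre = {
    "10257-25": {"P1": 70.0, "P2": 70.0, "P3": 80.0, "P4": 30.0},
    "13471-25": {"P1": 95.0, "P2": 95.0, "P3": 90.0},
    "33681-25": {"P1": 95.0, "P2": 50.0, "P3": 90.0, "P4": 95.0},
    "9347-25": {"P1": 75.0, "P2": 100.0, "P3": 90.0, "P4": 95.0},
}


def pairwise_mad(scores):
    """Mean absolute difference per reader pair, over cases both readers scored."""
    readers = sorted({reader for case in scores.values() for reader in case})
    result = {}
    for a, b in combinations(readers, 2):
        diffs = [abs(case[a] - case[b]) for case in scores.values() if a in case and b in case]
        if diffs:  # skip pairs with no shared cases
            result[(a, b)] = mean(diffs)
    return result


mad = pairwise_mad(er_pre)
for (a, b), value in mad.items():
    print(f"{a}-{b}: {value:.2f}")
```

Running the same function on the post-AI readings and subtracting the two results gives the per-pair change.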

7.10 Seniority and Pairwise Disagreement

Does the seniority gap between two pathologists predict how much they disagree? Wu et al. (2023) found that junior pathologists benefited most from AI, which may translate into larger pre-AI disagreement between senior–junior pairs and greater convergence post-AI.

Assumption: pathologists are numbered in order of seniority, from Pathologist 1 (most senior) to Pathologist 4 (most junior).

Pairwise Disagreement by Seniority Gap
Does experience difference predict disagreement magnitude?
Seniority Gap¹	Pair Type	N Pairs	Mean |Δ| Pre-AI	Mean |Δ| Post-AI	Change²
ER
1	Adjacent	3	6.41	4.52	−1.88
2	2-level gap	2	6.02	3.78	−2.24
3	Largest gap (P1–P4)	1	3.45	3.40	−0.05
PR
1	Adjacent	3	6.72	3.60	−3.13
2	2-level gap	2	6.29	3.12	−3.17
3	Largest gap (P1–P4)	1	4.94	3.51	−1.44
Ki67
1	Adjacent	3	4.96	5.12	0.16
2	2-level gap	2	3.74	4.99	1.25
3	Largest gap (P1–P4)	1	3.22	4.51	1.29
¹ Seniority gap: 1 = adjacent levels (e.g., P1–P2), 3 = maximum gap (P1–P4)
² Negative change (green) = AI reduced disagreement

7.10.1 Seniority Gap vs Disagreement

7.10.2 Per-Pair Detailed Analysis

Pairwise Disagreement: Detailed Breakdown
Ordered by seniority gap (largest gap first)
Pair	Pre-AI |Δ|	Post-AI |Δ|	Change	% Change	Gap	Category
ER
P1–P4	3.45	3.40	−0.05	−1.5	3	Largest gap
P2–P4	4.60	2.85	−1.75	−38.1	2	2-level gap
P1–P3	7.44	4.72	−2.72	−36.6	2	2-level gap
P1–P2	4.52	2.90	−1.62	−35.8	1	Adjacent
P3–P4	6.60	4.77	−1.84	−27.8	1	Adjacent
P2–P3	8.09	5.89	−2.20	−27.2	1	Adjacent
PR
P1–P4	4.94	3.51	−1.44	−29.1	3	Largest gap
P2–P4	6.22	3.56	−2.67	−42.8	2	2-level gap
P1–P3	6.36	2.68	−3.67	−57.8	2	2-level gap
P1–P2	6.07	2.57	−3.49	−57.6	1	Adjacent
P3–P4	6.12	4.00	−2.12	−34.7	1	Adjacent
P2–P3	7.98	4.22	−3.76	−47.2	1	Adjacent
Ki67
P1–P4	3.22	4.51	1.29	39.9	3	Largest gap
P2–P4	4.61	4.50	−0.11	−2.4	2	2-level gap
P1–P3	2.86	5.47	2.61	91.3	2	2-level gap
P1–P2	5.55	5.33	−0.22	−4.0	1	Adjacent
P3–P4	3.57	4.80	1.23	34.6	1	Adjacent
P2–P3	5.77	5.22	−0.54	−9.4	1	Adjacent
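
The per-pair values above roll up into the seniority-gap summary by grouping pairs on the distance between their seniority ranks. The sketch below does this for ER, assuming the gap for a pair (Pi, Pj) is |i − j| under the stated seniority ordering; `by_gap` and `gap_means` are illustrative names.

```python
from collections import defaultdict
from statistics import mean

# (senior index, junior index) -> (pre-AI |Δ|, post-AI |Δ|) for ER, from the table above
er_pairs = {
    (1, 4): (3.45, 3.40),
    (2, 4): (4.60, 2.85),
    (1, 3): (7.44, 4.72),
    (1, 2): (4.52, 2.90),
    (3, 4): (6.60, 4.77),
    (2, 3): (8.09, 5.89),
}


def by_gap(pairs):
    """Average pre- and post-AI disagreement for each seniority gap."""
    grouped = defaultdict(list)
    for (a, b), values in pairs.items():
        grouped[abs(a - b)].append(values)
    return {
        gap: (mean(pre for pre, _ in vals), mean(post for _, post in vals))
        for gap, vals in sorted(grouped.items())
    }


gap_means = by_gap(er_pairs)
for gap, (pre, post) in gap_means.items():
    print(f"gap {gap}: pre={pre:.2f} post={post:.2f} change={post - pre:+.2f}")
```

Because the inputs here are already rounded to two decimals, the recomputed means can differ from the summary table in the last digit.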

7.10.3 Spearman Correlation: Seniority Gap and Disagreement

Seniority Gap vs Disagreement: Spearman Correlations
ρ > 0: larger seniority gap → more disagreement
Marker	ρ (Pre-AI)	p (Pre-AI)	ρ (Post-AI)	p (Post-AI)	ρ (Change)¹	p (Change)
ER	−0.463	0.355	−0.432	0.392	0.309	0.552
PR	−0.309	0.552	−0.278	0.594	0.463	0.355
Ki67	−0.679	0.138	−0.370	0.470	0.679	0.138
¹ Negative ρ for Change indicates AI reduces disagreement more for senior–junior pairs
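
For readers who want to check the coefficients, the sketch below computes Spearman's ρ as the Pearson correlation of tie-averaged ranks, using only the standard library. It is applied to the six per-pair pre-AI values from the detailed breakdown table (ordered P1–P4, P2–P4, P1–P3, P1–P2, P3–P4, P2–P3). The function names are illustrative, and p-values are not computed here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


gaps = [3, 2, 2, 1, 1, 1]
er_pre = [3.45, 4.60, 7.44, 4.52, 6.60, 8.09]
ki67_pre = [3.22, 4.61, 2.86, 5.55, 3.57, 5.77]

rho_er = spearman_rho(gaps, er_pre)
rho_ki67 = spearman_rho(gaps, ki67_pre)
```

With these inputs the sketch gives ρ ≈ −0.463 for ER and ρ ≈ −0.679 for Ki67, matching the pre-AI column above. With only six pairs per marker and heavy ties in the gap variable, any such coefficient is very imprecise.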

7.11 Conclusion

7.11.1 Key Findings

High Variance Cases: These represent diagnostically challenging cases where even without AI, pathologists disagree substantially. These may require additional testing or expert consultation.
AI-Induced Disagreement: Cases where AI increased disagreement are concerning and warrant further investigation. These may represent:
- AI model errors or edge cases
- Cases where the AI suggestion is valid but contradicts established patterns
- Technical issues with slide preparation or scanning
AI-Resolved Disagreement: Cases where AI reduced variance demonstrate the value of AI as a standardization tool.
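Marker-Specific Patterns: Baseline disagreement differs by marker. In this dataset PR had the highest pre-AI variance (74.31) and Ki67 the lowest (24.66). AI reduced variance for ER and PR but increased it for Ki67 (24.66 to 30.78).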
Pairwise Analysis: Understanding which pathologist pairs disagree most can inform training needs and quality assurance processes.
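Seniority and Disagreement: The seniority gap analysis tests whether experience differences contribute to pairwise disagreement and whether AI helps bridge the experience gap. Wu et al. (2023) found that junior pathologists benefited most from AI, which could appear as larger AI-driven convergence in senior–junior pairs. Here, none of the Spearman correlations between seniority gap and disagreement reached significance (all p ≥ 0.138, six pairs per marker), so these data neither confirm nor rule out such an effect.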