8  Statistical Tests: Pre-AI vs Post-AI

8.1 Objective

Perform formal statistical tests to determine if AI significantly improves inter-observer agreement and changes marker values.

Note for Pathologist: This is the deep statistical dive. We use “bootstrap” resampling and “z-tests” to assess whether the improvement in agreement (ICC) is real or just luck. We use “paired t-tests” to see whether AI systematically pushes scores up or down (bias). We also check for “clinical significance”, because a tiny statistical change might not matter for patient care.

8.2 Setup

8.3 Load Data

8.4 Test if Agreement Metrics Changed

Bootstrap confidence intervals and permutation tests for ICC and Kappa values.

ICC with Bootstrap 95% Confidence Intervals
Marker  Condition  ICC    95% CI Lower  95% CI Upper  SE
ER      Pre-AI     0.962  0.949         0.972         0.006
ER      Post-AI    0.980  0.972         0.986         0.004
PR      Pre-AI     0.952  0.941         0.961         0.005
PR      Post-AI    0.974  0.959         0.984         0.007
Ki67    Pre-AI     0.939  0.919         0.955         0.009
Ki67    Post-AI    0.937  0.921         0.951         0.008

Note for Pathologist: ICC (Intraclass Correlation Coefficient) measures how well pathologists agree. Values closer to 1.0 mean near-perfect agreement. The bootstrap confidence intervals (CI) tell you the range of plausible ICC values. If the Pre-AI and Post-AI confidence intervals do not overlap, the change in agreement is statistically reliable.
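The percentile bootstrap behind these intervals can be sketched with a minimal case-resampling loop. This is an illustrative stand-in, not the chapter's actual code: the statistic here is a simple mean absolute inter-rater difference on hypothetical paired scores, whereas the chapter bootstraps a two-way random-effects ICC.

```python
import random
import statistics

def bootstrap_ci(cases, stat_fn, n_boot=2000, alpha=0.05, seed=42):
    """Case-resampling percentile bootstrap CI for any agreement statistic."""
    rng = random.Random(seed)
    n = len(cases)
    estimates = []
    for _ in range(n_boot):
        # resample whole cases with replacement, recompute the statistic
        sample = [cases[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat_fn(sample))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical paired scores for two raters; the stand-in statistic is the
# mean absolute inter-rater difference (always exactly 2 in this toy data).
cases = [(70 + i % 20, 72 + i % 20) for i in range(100)]
lo, hi = bootstrap_ci(cases, lambda s: statistics.mean(abs(a - b) for a, b in s))
```

Resampling whole cases (rather than individual ratings) preserves the within-case correlation structure, which is why the chapter's intervals are case-level bootstraps.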

8.4.1 Statistical Comparison of ICCs

Use z-test to compare pre vs post ICC values.

ICC Comparison: Pre-AI vs Post-AI
Z-test for difference in agreement
Marker  ICC Pre  ICC Post  SE Pre  SE Post  Z Statistic  P-value  Significant?  Sig.¹
ER      0.962    0.980     0.006   0.004          2.525    0.012  TRUE          *
PR      0.952    0.974     0.005   0.007          2.536    0.011  TRUE          *
KI67    0.939    0.937     0.009   0.008         −0.168    0.867  FALSE         ns
¹ *** p < 0.001, ** p < 0.01, * p < 0.05, ns = not significant
Note: This z-test treats pre-AI and post-AI ICCs as independent estimates. Because both are measured on the same cases, a paired bootstrap approach would provide more precise inference. The z-test is presented as an approximate comparison; bootstrap confidence intervals for ICC differences are reported in Chapter 2.
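As a sketch, the z statistic in the table can be reproduced from the tabulated ICCs and bootstrap standard errors. The inputs below are the rounded ER row, so the result differs slightly from the table's 2.525.

```python
import math

def icc_z_test(icc_pre, icc_post, se_pre, se_post):
    """Z-test for the difference of two ICCs, treating them as independent
    estimates (approximate, as noted above: both are measured on the same cases)."""
    z = (icc_post - icc_pre) / math.sqrt(se_pre ** 2 + se_post ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail
    return z, p

# ER row of the table above (rounded SEs shift z slightly vs the reported value)
z, p = icc_z_test(0.962, 0.980, 0.006, 0.004)
```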

8.5 Paired Tests for Marker Value Changes

Test if AI systematically changes the reported values.

Paired Tests: Pre-AI vs Post-AI Values
Testing if AI systematically changes marker values
Marker  N Pairs  Mean Pre  Mean Post  Mean Difference  t Statistic  t-test P-value  Wilcoxon P-value
ER      1175     71.68     71.04      −0.63            −3.40        0.0007          0.0016
PR      1159     31.60     30.48      −1.11            −4.93        <0.0001         <0.0001
KI67    1162     25.43     31.32       5.89            26.15        <0.0001         <0.0001

8.5.1 Interpretation of Paired Tests

  • Mean Difference (Post − Pre): Positive values indicate AI tends to give higher values; negative values indicate lower.
  • P-value < 0.05: Statistically significant systematic bias.
  • Wilcoxon test: More robust to outliers than t-test.

Methodological note: The paired t-test above pools all pathologist-case pairs as independent observations (N = pathologists × cases). Because observations are clustered within pathologists, standard errors may be underestimated. The linear mixed-effects model in Section 3 below, which accounts for this clustering structure, should be considered the primary analysis; the paired t-test is presented as a simplified secondary comparison.
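A minimal version of the paired t-test and Cohen's dz is sketched below on hypothetical toy data. Two assumptions to flag: the p-value uses a normal approximation to the t distribution (reasonable at this chapter's sample sizes of ~1,160 pairs, but not exact), and the Wilcoxon companion test is omitted.

```python
import math
import statistics

def paired_t(pre, post):
    """Paired t-test with Cohen's dz = mean(diff) / sd(diff).
    The two-sided p-value uses a normal approximation to the t distribution,
    which is adequate when the number of pairs is large."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    t = mean_d / (sd_d / math.sqrt(n))
    dz = mean_d / sd_d
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal approximation
    return t, p, dz

# Hypothetical toy data: a constant upward shift of ~5 plus small noise
pre = [20, 25, 30, 35, 40] * 20
post = [x + 5 + (i % 3 - 1) for i, x in enumerate(pre)]
t, p, dz = paired_t(pre, post)
```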

Effect Sizes (Paired Cohen's dz)
Magnitude of change from Pre-AI to Post-AI (dz = mean(diff)/sd(diff))
Marker  Cohen's dz  Interpretation
ER      −0.099      Negligible
PR      −0.145      Negligible
KI67     0.767      Medium

8.6 McNemar’s Test for Categorical Changes

Test if AI significantly changes categorical classifications.

Methodological note: McNemar’s test assumes independent matched pairs. Here, observations are pooled across pathologists (4 observations per case), violating independence. As with the paired t-test, the mixed-effects models in Section 4 should be considered the primary analysis; McNemar’s test is presented as a descriptive summary of classification changes.

McNemar's Test for HER2 Changes (rows = Pre-AI, columns = Post-AI):

               Negative/Low  Positive
  Negative/Low          738        16
  Positive               44       275

Chi-square: 12.15
P-value: 0.00049
Interpretation: Significant change in HER2 classification
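The chi-square and p-value above depend only on the two discordant cells. The sketch below assumes the continuity-corrected form of McNemar's test, which matches the reported 12.15.

```python
import math

def mcnemar(b, c):
    """McNemar's test with continuity correction on the two discordant
    cell counts b and c; p-value from the chi-square(df=1) tail."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) survival function
    return chi2, p

# Discordant cells of the HER2 table above: 16 shifts to Positive,
# 44 shifts to Negative/Low
chi2, p = mcnemar(16, 44)
```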

Molecular Subtype Changes:
Total cases: 1184
Changed: 204 (17.2%)

Molecular Subtype Transitions (rows = Pre-AI, columns = Post-AI)

Pre-AI Subtype         HER2 Positive  Hormone Weak Positive  Luminal A  Luminal B  Triple Negative
HER2 Positive                    275                      7         11         26                5
Hormone Weak Positive              5                    150         11         23                0
Luminal A                          7                     13        285         82                3
Luminal B                          5                      1          3        137                0
Triple Negative                    0                      2          0          0              133

8.7 Mixed Effects Models

Account for nested structure (cases within pathologists).

Mixed Effects Models: AI Impact on Marker Values
Accounting for case and pathologist effects
Marker  AI Effect (Post − Pre)  Standard Error  t-value  P-value
ER      −0.662                  0.241            −2.753  0.0060
PR      −1.031                  0.294            −3.507  0.0005
KI67     5.896                  0.233            25.353  <0.0001

Note for Pathologist: The mixed-effects model accounts for the fact that the same pathologist sees many cases and the same case is seen by multiple pathologists. The “AI Effect” column shows how much AI changes the average score for each marker, after controlling for these repeated measurements. A positive estimate means AI pushes values up on average; a negative estimate means AI pushes values down.

8.8 Variance Component Analysis

Decompose variance into case, pathologist, and residual components.

Variance Component Analysis
Percentage of total variance by source
Marker  Case (Pre)  Pathologist (Pre)  Residual (Pre)  Case (Post)  Pathologist (Post)  Residual (Post)
ER      96.2        0.6                3.2             98.0         0.3                 1.7
PR      94.8        0.1                5.1             97.5         0.0                 2.5
KI67    93.9        0.9                5.2             93.7         0.3                 6.0

8.8.1 Interpretation

  • Case Variance: True biological variation between cases
  • Pathologist Variance: Systematic differences between pathologists
  • Residual Variance: Random disagreement (within-pathologist variability)

Ideally, AI should:
- Increase case variance % (better discrimination between truly different cases)
- Decrease pathologist variance % (less systematic bias between observers)
- Decrease residual variance % (less random disagreement)
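For a balanced design (every pathologist scores every case exactly once), this decomposition can be sketched with method-of-moments estimates from a two-way ANOVA. The matrix below is hypothetical, and the chapter's actual percentages may have been produced differently (e.g., from mixed-model variance estimates); this is only a minimal illustration of the idea.

```python
import statistics

def variance_components(scores):
    """Method-of-moments variance components for a balanced cases x raters
    matrix (one score per cell). Returns (case, rater, residual) percentages."""
    n, k = len(scores), len(scores[0])
    grand = statistics.mean(x for row in scores for x in row)
    row_means = [statistics.mean(row) for row in scores]          # per case
    col_means = [statistics.mean(row[j] for row in scores) for j in range(k)]
    msr = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)  # cases
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)  # raters
    sse = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                               # residual
    var_case = max((msr - mse) / k, 0.0)
    var_rater = max((msc - mse) / n, 0.0)
    total = var_case + var_rater + mse
    return tuple(100 * v / total for v in (var_case, var_rater, mse))

# Hypothetical 4-case x 3-rater matrix: big case spread, tiny rater offsets,
# so nearly all variance is attributed to cases
scores = [[10, 11, 10], [50, 51, 50], [30, 31, 30], [70, 71, 70]]
case_pct, rater_pct, resid_pct = variance_components(scores)
```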

8.9 Power Analysis

The a priori power analysis (see Materials and Methods, Section 3) determined a minimum required sample size of 69 cases (κ₀=0.4 vs κ₁=0.6, α=0.05, power=0.80). With 296 cases in the final cohort, the study is adequately powered for all primary and secondary analyses.

8.10 Multiple Testing Correction

Given the large number of statistical tests performed across this analysis, we apply Benjamini-Hochberg correction to the exploratory analyses to control the false discovery rate (FDR).
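The Benjamini-Hochberg step-up adjustment can be sketched in a few lines. Feeding it the rounded raw p-values from the exploratory table approximately reproduces the adjusted column; small discrepancies (e.g., 0.0011 vs the table's 0.0010) come from rounding the inputs.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values): p * (m / rank),
    made monotone non-decreasing from the largest p downward."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Rounded raw p-values of the 11 exploratory tests, in ascending order
raw = [0.0, 0.0, 0.0, 0.0, 0.0005, 0.0007, 0.0016, 0.0060,
       0.0112, 0.0116, 0.8668]
adj = bh_adjust(raw)
```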

8.10.1 Pre-Specified Primary Hypotheses (No Correction Needed)

Based on study objectives, we pre-specify 3 primary hypotheses:

  1. H1: AI improves interobserver agreement for Ki-67 (measured by ICC change)
  2. H2: AI improves interobserver agreement for HER2 (measured by Kappa change)
  3. H3: AI systematically changes Ki-67 values (measured by paired t-test)

These confirmatory hypotheses are tested at α = 0.05 without correction.

8.10.2 All Other Tests (Exploratory - FDR Correction Applied)

Multiple Testing Correction Results
Benjamini-Hochberg FDR correction for exploratory analyses¹
Test Name            Marker  Raw P-value  Test Type                     Adjusted P-value (FDR)  Sig. (Raw)  Sig. (FDR)  Sig.²
Mixed_model_KI67     KI67    <0.0001      Mixed Effects                 <0.0001                 TRUE        TRUE        ***
Wilcoxon_KI67        KI67    <0.0001      Systematic Change (Wilcoxon)  <0.0001                 TRUE        TRUE        ***
Paired_t_PR          PR      <0.0001      Systematic Change (t-test)    <0.0001                 TRUE        TRUE        ***
Wilcoxon_PR          PR      <0.0001      Systematic Change (Wilcoxon)  <0.0001                 TRUE        TRUE        ***
Mixed_model_PR       PR      0.0005       Mixed Effects                 0.0010                  TRUE        TRUE        **
Paired_t_ER          ER      0.0007       Systematic Change (t-test)    0.0013                  TRUE        TRUE        **
Wilcoxon_ER          ER      0.0016       Systematic Change (Wilcoxon)  0.0025                  TRUE        TRUE        **
Mixed_model_ER       ER      0.0060       Mixed Effects                 0.0082                  TRUE        TRUE        **
ICC_pr_comparison    pr      0.0112       ICC Agreement                 0.0127                  TRUE        TRUE        *
ICC_er_comparison    er      0.0116       ICC Agreement                 0.0127                  TRUE        TRUE        *
ICC_ki67_comparison  ki67    0.8668       ICC Agreement                 0.8668                  FALSE       FALSE       ns
¹ Primary hypotheses (Ki67 ICC, HER2 Kappa, Ki67 paired t-test) not included; these are tested at α=0.05
² *** q < 0.001, ** q < 0.01, * q < 0.05, ns = not significant

8.10.3 Sensitivity Analysis: Impact of Multiple Testing Correction

Impact of Multiple Testing Correction:
Total exploratory tests: 11
Significant (raw α=0.05): 10
Significant (FDR q<0.05): 10
Tests losing significance: 0 (0%)

8.10.4 Reporting Standards

For Primary Hypotheses: Report raw p-values at α = 0.05 (pre-specified, confirmatory).

For Exploratory Analyses: Report both raw p-values and FDR-adjusted q-values. Interpret results conservatively using adjusted values.

In Manuscript:
- State pre-specified hypotheses clearly in Methods
- Report: “To control for multiple testing in exploratory analyses, we applied Benjamini-Hochberg False Discovery Rate correction (q < 0.05)”
- Present both raw and adjusted p-values in tables
- Interpret findings based on adjusted values for exploratory tests

8.11 Clinical vs Statistical Significance

Statistical significance does not always imply clinical importance. Here we define Minimum Clinically Important Differences (MCID) for each metric.

Minimum Clinically Important Differences (MCID)
Thresholds for interpreting clinical significance
Metric                  MCID Threshold  Clinical Rationale
ICC                     0.10            ICC change >0.10 represents a meaningful shift in reliability
Kappa                   0.10            Kappa change >0.10 represents a clinically noticeable agreement improvement
Mean Difference (ER)    5.00            ER change >5% may cross the therapeutic threshold (10% cutoff ± buffer)
Mean Difference (PR)    5.00            PR change >5% may cross the therapeutic threshold (10% cutoff ± buffer)
Mean Difference (Ki67)  3.00            Ki67 change >3% may affect subtype classification (near the 20% and 30% cutoffs)
Cohen's d               0.30            Effect size >0.30 represents small-to-medium practical impact
Clinical Significance Interpretation
Distinguishing statistical from clinical significance
Finding               Observed Value¹  MCID   Exceeds MCID?  Clinical Interpretation
ER ICC change          0.018           0.100  FALSE          Negligible (ceiling/floor effect)
PR ICC change          0.022           0.100  FALSE          Negligible (ceiling/floor effect)
Ki67 ICC change       −0.002           0.100  FALSE          Negligible (ceiling/floor effect)
ER mean difference     NA              5.000  FALSE          Statistically significant but clinically trivial
PR mean difference     NA              5.000  FALSE          Statistically significant but clinically trivial
Ki67 mean difference   NA              3.000  FALSE          Statistically significant but clinically trivial
¹ Positive values indicate AI increases the metric; negative values indicate a decrease

8.11.1 Key Insights: Statistical vs Clinical Significance

Summary of Statistical vs Clinical Significance:
ER:
  - ICC change: 0.018
  - Interpretation: Negligible (ceiling/floor effect)
PR:
  - ICC change: 0.022
  - Interpretation: Negligible (ceiling/floor effect)
Ki-67:
  - ICC change: −0.002
  - Mean systematic increase: 5.89%
  - Clinical impact: Changes near the 20% and 30% cutoffs may reclassify Luminal A/B subtypes

8.12 Conclusion

8.12.1 Summary of Statistical Findings

  1. Agreement Improvement:
    • ICC values with confidence intervals show whether AI significantly improved agreement
    • Z-tests directly compare pre vs post ICC values
    • Primary hypotheses (Ki67 ICC, HER2 Kappa): Tested at α=0.05 without correction
    • Exploratory tests: FDR correction applied (q<0.05)
  2. Systematic Changes:
    • Paired t-tests reveal if AI systematically shifts values up or down
    • Effect sizes quantify the magnitude of these changes
    • Clinical significance: Only changes exceeding MCID thresholds are clinically meaningful
  3. Categorical Changes:
    • McNemar’s test shows if classification changes are statistically significant
    • Subtype transition tables reveal patterns
    • Clinical impact: Assessed by proportion of cases crossing therapeutic thresholds
  4. Variance Decomposition:
    • Identifies whether disagreement is due to cases, pathologists, or randomness
    • Shows how AI affects each variance component
    • Ideal AI: Increases case variance %, decreases pathologist and residual variance %
  5. Statistical Power:
    • Ensures we have sufficient power to detect meaningful differences
    • Values > 0.80 are generally considered adequate
    • A priori power analysis: 69-case minimum required; 296 analyzed (see Materials and Methods, Section 3)
  6. Multiple Testing Correction:
    • Confirmatory analyses: 3 pre-specified primary hypotheses (no correction)
    • Exploratory analyses: Benjamini-Hochberg FDR correction applied
    • Reporting: Both raw and adjusted p-values presented
    • Sensitivity: Impact of correction quantified
  7. Clinical Significance:
    • MCID thresholds defined based on clinical relevance
    • ER/PR: Changes >5% may affect treatment decisions
    • Ki67: Changes >3% may reclassify molecular subtypes
    • ICC/Kappa: Changes >0.10 represent meaningful reliability shifts
    • Interpretation: Statistical significance ≠ clinical importance