8  Statistical Tests: Pre-AI vs Post-AI

8.1 Objective

Perform formal statistical tests to determine if AI significantly improves inter-observer agreement and changes marker values.

Note for Pathologist: This is the deep statistical dive. We use “bootstrap” resampling and “z-tests” to assess whether the improvement in agreement (ICC) is real or just luck. We use “paired t-tests” to see whether AI systematically pushes scores up or down (bias). We also check for “clinical significance”, because a tiny statistical change might not matter for patient care.

8.2 Setup

8.3 Load Data

8.4 Test if Agreement Metrics Changed

Bootstrap confidence intervals and permutation tests for ICC and Kappa values.

ICC with Bootstrap 95% Confidence Intervals
Marker  Condition  ICC    95% CI Lower  95% CI Upper  SE
ER      Pre-AI     0.962  0.949         0.972         0.006
ER      Post-AI    0.980  0.972         0.986         0.004
PR      Pre-AI     0.952  0.941         0.961         0.005
PR      Post-AI    0.974  0.959         0.984         0.007
Ki67    Pre-AI     0.939  0.919         0.955         0.009
Ki67    Post-AI    0.937  0.921         0.951         0.008

Note for Pathologist: ICC (Intraclass Correlation Coefficient) measures how well pathologists agree. Values closer to 1.0 mean near-perfect agreement. The bootstrap confidence intervals (CI) tell you the range of plausible ICC values. If the Pre-AI and Post-AI confidence intervals do not overlap, the change in agreement is statistically reliable.
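The percentile bootstrap behind these intervals can be sketched with a minimal case-resampling loop. This is an illustrative stand-in, not the chapter's actual code: the statistic here is a simple mean absolute inter-rater difference on hypothetical paired scores, whereas the chapter bootstraps a two-way random-effects ICC.

```python
import random
import statistics

def bootstrap_ci(cases, stat_fn, n_boot=2000, alpha=0.05, seed=42):
    """Case-resampling percentile bootstrap CI for any agreement statistic."""
    rng = random.Random(seed)
    n = len(cases)
    estimates = []
    for _ in range(n_boot):
        # resample whole cases with replacement, recompute the statistic
        sample = [cases[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat_fn(sample))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical paired scores for two raters; the stand-in statistic is the
# mean absolute inter-rater difference (always exactly 2 in this toy data).
cases = [(70 + i % 20, 72 + i % 20) for i in range(100)]
lo, hi = bootstrap_ci(cases, lambda s: statistics.mean(abs(a - b) for a, b in s))
```

Resampling whole cases (rather than individual ratings) preserves the within-case correlation structure, which is why the chapter's intervals are case-level bootstraps.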

8.4.1 Statistical Comparison of ICCs

Use z-test to compare pre vs post ICC values.

ICC Comparison: Pre-AI vs Post-AI
Z-test for difference in agreement
Marker  ICC Pre  ICC Post  SE Pre  SE Post  Z Statistic  P-value  Significant?  Sig.¹
ER      0.962    0.980     0.006   0.004          2.525    0.012  TRUE          *
PR      0.952    0.974     0.005   0.007          2.536    0.011  TRUE          *
KI67    0.939    0.937     0.009   0.008         −0.168    0.867  FALSE         ns
¹ *** p < 0.001, ** p < 0.01, * p < 0.05, ns = not significant
Note: This z-test treats pre-AI and post-AI ICCs as independent estimates. Because both are measured on the same cases, a paired bootstrap approach would provide more precise inference. The z-test is presented as an approximate comparison; bootstrap confidence intervals for ICC differences are reported in Chapter 2.
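As a sketch, the z statistic in the table can be reproduced from the tabulated ICCs and bootstrap standard errors. The inputs below are the rounded ER row, so the result differs slightly from the table's 2.525.

```python
import math

def icc_z_test(icc_pre, icc_post, se_pre, se_post):
    """Z-test for the difference of two ICCs, treating them as independent
    estimates (approximate, as noted above: both are measured on the same cases)."""
    z = (icc_post - icc_pre) / math.sqrt(se_pre ** 2 + se_post ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail
    return z, p

# ER row of the table above (rounded SEs shift z slightly vs the reported value)
z, p = icc_z_test(0.962, 0.980, 0.006, 0.004)
```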

8.5 Paired Tests for Marker Value Changes

Test if AI systematically changes the reported values.

Paired Tests: Pre-AI vs Post-AI Values
Testing if AI systematically changes marker values
Marker  N Pairs  Mean Pre  Mean Post  Mean Difference  t Statistic  t-test P-value  Wilcoxon P-value
ER      1175     71.68     71.04      −0.63            −3.40        0.0007          0.0016
PR      1159     31.60     30.48      −1.11            −4.93        <0.0001         <0.0001
KI67    1162     25.43     31.32       5.89            26.15        <0.0001         <0.0001

8.5.1 Interpretation of Paired Tests

  • Mean Difference (Post − Pre): Positive values indicate AI tends to give higher values; negative values indicate lower.
  • P-value < 0.05: Statistically significant systematic bias.
  • Wilcoxon test: More robust to outliers than t-test.

Methodological note: The paired t-test above pools all pathologist-case pairs as independent observations (N = pathologists × cases). Because observations are clustered within pathologists, standard errors may be underestimated. The linear mixed-effects model in Section 3 below, which accounts for this clustering structure, should be considered the primary analysis; the paired t-test is presented as a simplified secondary comparison.
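A minimal version of the paired t-test and Cohen's dz is sketched below on hypothetical toy data. Two assumptions to flag: the p-value uses a normal approximation to the t distribution (reasonable at this chapter's sample sizes of ~1,160 pairs, but not exact), and the Wilcoxon companion test is omitted.

```python
import math
import statistics

def paired_t(pre, post):
    """Paired t-test with Cohen's dz = mean(diff) / sd(diff).
    The two-sided p-value uses a normal approximation to the t distribution,
    which is adequate when the number of pairs is large."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    t = mean_d / (sd_d / math.sqrt(n))
    dz = mean_d / sd_d
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal approximation
    return t, p, dz

# Hypothetical toy data: a constant upward shift of ~5 plus small noise
pre = [20, 25, 30, 35, 40] * 20
post = [x + 5 + (i % 3 - 1) for i, x in enumerate(pre)]
t, p, dz = paired_t(pre, post)
```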

Effect Sizes (Paired Cohen's dz)
Magnitude of change from Pre-AI to Post-AI (dz = mean(diff)/sd(diff))
Marker  Cohen's dz  Interpretation
ER      −0.099      Negligible
PR      −0.145      Negligible
KI67     0.767      Medium

8.6 McNemar’s Test for Categorical Changes

Test if AI significantly changes categorical classifications.

Methodological note: McNemar’s test assumes independent matched pairs. Here, observations are pooled across pathologists (4 observations per case), violating independence. As with the paired t-test, the mixed-effects models in Section 4 should be considered the primary analysis; McNemar’s test is presented as a descriptive summary of classification changes.

McNemar's Test for HER2 Changes (rows = Pre-AI, columns = Post-AI):

               Negative/Low  Positive
  Negative/Low          738        16
  Positive               44       275

Chi-square: 12.15
P-value: 0.00049
Interpretation: Significant change in HER2 classification
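The chi-square and p-value above depend only on the two discordant cells. The sketch below assumes the continuity-corrected form of McNemar's test, which matches the reported 12.15.

```python
import math

def mcnemar(b, c):
    """McNemar's test with continuity correction on the two discordant
    cell counts b and c; p-value from the chi-square(df=1) tail."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) survival function
    return chi2, p

# Discordant cells of the HER2 table above: 16 shifts to Positive,
# 44 shifts to Negative/Low
chi2, p = mcnemar(16, 44)
```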

Molecular Subtype Changes:
Total cases: 1184
Changed: 204 (17.2%)

Molecular Subtype Transitions (rows = Pre-AI, columns = Post-AI)

Pre-AI Subtype         HER2 Positive  Hormone Weak Positive  Luminal A  Luminal B  Triple Negative
HER2 Positive                    275                      7         11         26                5
Hormone Weak Positive              5                    150         11         23                0
Luminal A                          7                     13        285         82                3
Luminal B                          5                      1          3        137                0
Triple Negative                    0                      2          0          0              133

8.7 Mixed Effects Models

Account for nested structure (cases within pathologists).

Mixed Effects Models: AI Impact on Marker Values
Accounting for case and pathologist effects
Marker  AI Effect (Post − Pre)  Standard Error  t-value  P-value
ER      −0.662                  0.241            −2.753  0.0060
PR      −1.031                  0.294            −3.507  0.0005
KI67     5.896                  0.233            25.353  <0.0001

Note for Pathologist: The mixed-effects model accounts for the fact that the same pathologist sees many cases and the same case is seen by multiple pathologists. The “AI Effect” column shows how much AI changes the average score for each marker, after controlling for these repeated measurements. A positive estimate means AI pushes values up on average; a negative estimate means AI pushes values down.

8.8 Variance Component Analysis

Decompose variance into case, pathologist, and residual components.

Variance Component Analysis
Percentage of total variance by source
Marker  Case (Pre)  Pathologist (Pre)  Residual (Pre)  Case (Post)  Pathologist (Post)  Residual (Post)
ER      96.2        0.6                3.2             98.0         0.3                 1.7
PR      94.8        0.1                5.1             97.5         0.0                 2.5
KI67    93.9        0.9                5.2             93.7         0.3                 6.0

8.8.1 Interpretation

  • Case Variance: True biological variation between cases
  • Pathologist Variance: Systematic differences between pathologists
  • Residual Variance: Random disagreement (within-pathologist variability)

Ideally, AI should:
- Increase case variance % (better discrimination between truly different cases)
- Decrease pathologist variance % (less systematic bias between observers)
- Decrease residual variance % (less random disagreement)
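For a balanced design (every pathologist scores every case exactly once), this decomposition can be sketched with method-of-moments estimates from a two-way ANOVA. The matrix below is hypothetical, and the chapter's actual percentages may have been produced differently (e.g., from mixed-model variance estimates); this is only a minimal illustration of the idea.

```python
import statistics

def variance_components(scores):
    """Method-of-moments variance components for a balanced cases x raters
    matrix (one score per cell). Returns (case, rater, residual) percentages."""
    n, k = len(scores), len(scores[0])
    grand = statistics.mean(x for row in scores for x in row)
    row_means = [statistics.mean(row) for row in scores]          # per case
    col_means = [statistics.mean(row[j] for row in scores) for j in range(k)]
    msr = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)  # cases
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)  # raters
    sse = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                               # residual
    var_case = max((msr - mse) / k, 0.0)
    var_rater = max((msc - mse) / n, 0.0)
    total = var_case + var_rater + mse
    return tuple(100 * v / total for v in (var_case, var_rater, mse))

# Hypothetical 4-case x 3-rater matrix: big case spread, tiny rater offsets,
# so nearly all variance is attributed to cases
scores = [[10, 11, 10], [50, 51, 50], [30, 31, 30], [70, 71, 70]]
case_pct, rater_pct, resid_pct = variance_components(scores)
```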

8.9 Power Analysis

The a priori power analysis (see Materials and Methods, Section 3) determined a minimum required sample size of 69 cases (κ₀=0.4 vs κ₁=0.6, α=0.05, power=0.80). With 296 cases in the final cohort, the study is adequately powered for all primary and secondary analyses.

8.10 Multiple Testing Correction

Given the large number of statistical tests performed across this analysis, we apply Benjamini-Hochberg correction to the exploratory analyses to control the false discovery rate (FDR).
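The Benjamini-Hochberg step-up adjustment can be sketched in a few lines. Feeding it the rounded raw p-values from the exploratory table approximately reproduces the adjusted column; small discrepancies (e.g., 0.0011 vs the table's 0.0010) come from rounding the inputs.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values): p * (m / rank),
    made monotone non-decreasing from the largest p downward."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Rounded raw p-values of the 11 exploratory tests, in ascending order
raw = [0.0, 0.0, 0.0, 0.0, 0.0005, 0.0007, 0.0016, 0.0060,
       0.0112, 0.0116, 0.8668]
adj = bh_adjust(raw)
```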

8.10.1 Pre-Specified Primary Hypotheses (No Correction Needed)

Based on study objectives, we pre-specify 3 primary hypotheses:

  1. H1: AI improves interobserver agreement for Ki-67 (measured by ICC change)
  2. H2: AI improves interobserver agreement for HER2 (measured by Kappa change)
  3. H3: AI systematically changes Ki-67 values (measured by paired t-test)

These confirmatory hypotheses are tested at α = 0.05 without correction.

8.10.2 All Other Tests (Exploratory - FDR Correction Applied)

Multiple Testing Correction Results
Benjamini-Hochberg FDR correction for exploratory analyses¹
Test Name            Marker  Raw P-value  Test Type                     Adjusted P-value (FDR)  Sig. (Raw)  Sig. (FDR)  Sig.²
Mixed_model_KI67     KI67    <0.0001      Mixed Effects                 <0.0001                 TRUE        TRUE        ***
Wilcoxon_KI67        KI67    <0.0001      Systematic Change (Wilcoxon)  <0.0001                 TRUE        TRUE        ***
Paired_t_PR          PR      <0.0001      Systematic Change (t-test)    <0.0001                 TRUE        TRUE        ***
Wilcoxon_PR          PR      <0.0001      Systematic Change (Wilcoxon)  <0.0001                 TRUE        TRUE        ***
Mixed_model_PR       PR      0.0005       Mixed Effects                 0.0010                  TRUE        TRUE        **
Paired_t_ER          ER      0.0007       Systematic Change (t-test)    0.0013                  TRUE        TRUE        **
Wilcoxon_ER          ER      0.0016       Systematic Change (Wilcoxon)  0.0025                  TRUE        TRUE        **
Mixed_model_ER       ER      0.0060       Mixed Effects                 0.0082                  TRUE        TRUE        **
ICC_pr_comparison    pr      0.0112       ICC Agreement                 0.0127                  TRUE        TRUE        *
ICC_er_comparison    er      0.0116       ICC Agreement                 0.0127                  TRUE        TRUE        *
ICC_ki67_comparison  ki67    0.8668       ICC Agreement                 0.8668                  FALSE       FALSE       ns
¹ Primary hypotheses (Ki67 ICC, HER2 Kappa, Ki67 paired t-test) not included; these are tested at α=0.05
² *** q < 0.001, ** q < 0.01, * q < 0.05, ns = not significant

8.10.3 Sensitivity Analysis: Impact of Multiple Testing Correction

Impact of Multiple Testing Correction:
Total exploratory tests: 11
Significant (raw α=0.05): 10
Significant (FDR q<0.05): 10
Tests losing significance: 0 (0%)

8.10.4 Reporting Standards

For Primary Hypotheses: Report raw p-values at α = 0.05 (pre-specified, confirmatory).

For Exploratory Analyses: Report both raw p-values and FDR-adjusted q-values. Interpret results conservatively using adjusted values.

In Manuscript:
- State pre-specified hypotheses clearly in Methods
- Report: “To control for multiple testing in exploratory analyses, we applied Benjamini-Hochberg False Discovery Rate correction (q < 0.05)”
- Present both raw and adjusted p-values in tables
- Interpret findings based on adjusted values for exploratory tests

8.11 Clinical vs Statistical Significance

Statistical significance does not always imply clinical importance. Here we define Minimum Clinically Important Differences (MCID) for each metric.

Minimum Clinically Important Differences (MCID)
Thresholds for interpreting clinical significance
Metric                  MCID Threshold  Clinical Rationale
ICC                     0.10            ICC change >0.10 represents a meaningful shift in reliability
Kappa                   0.10            Kappa change >0.10 represents a clinically noticeable agreement improvement
Mean Difference (ER)    5.00            ER change >5% may cross the therapeutic threshold (10% cutoff ± buffer)
Mean Difference (PR)    5.00            PR change >5% may cross the therapeutic threshold (10% cutoff ± buffer)
Mean Difference (Ki67)  3.00            Ki67 change >3% may affect subtype classification (near the 20% and 30% cutoffs)
Cohen's d               0.30            Effect size >0.30 represents small-to-medium practical impact
Clinical Significance Interpretation
Distinguishing statistical from clinical significance
Finding               Observed Value¹  MCID   Exceeds MCID?  Clinical Interpretation
ER ICC change          0.018           0.100  FALSE          Negligible (ceiling/floor effect)
PR ICC change          0.022           0.100  FALSE          Negligible (ceiling/floor effect)
Ki67 ICC change       −0.002           0.100  FALSE          Negligible (ceiling/floor effect)
ER mean difference     NA              5.000  FALSE          Statistically significant but clinically trivial
PR mean difference     NA              5.000  FALSE          Statistically significant but clinically trivial
Ki67 mean difference   NA              3.000  FALSE          Statistically significant but clinically trivial
¹ Positive values indicate AI increases the metric; negative values indicate a decrease

8.11.1 Key Insights: Statistical vs Clinical Significance

Summary of Statistical vs Clinical Significance:
ER:
  - ICC change: 0.018
  - Interpretation: Negligible (ceiling/floor effect)
PR:
  - ICC change: 0.022
  - Interpretation: Negligible (ceiling/floor effect)
Ki-67:
  - ICC change: −0.002
  - Mean systematic increase: 5.89%
  - Clinical impact: Changes near the 20% and 30% cutoffs may reclassify Luminal A/B subtypes

8.12 Conclusion

8.12.1 Summary of Statistical Findings

  1. Agreement Improvement:
    • ICC values with confidence intervals show whether AI significantly improved agreement
    • Z-tests directly compare pre vs post ICC values
    • Primary hypotheses (Ki67 ICC, HER2 Kappa): Tested at α=0.05 without correction
    • Exploratory tests: FDR correction applied (q<0.05)
  2. Systematic Changes:
    • Paired t-tests reveal if AI systematically shifts values up or down
    • Effect sizes quantify the magnitude of these changes
    • Clinical significance: Only changes exceeding MCID thresholds are clinically meaningful
  3. Categorical Changes:
    • McNemar’s test shows if classification changes are statistically significant
    • Subtype transition tables reveal patterns
    • Clinical impact: Assessed by proportion of cases crossing therapeutic thresholds
  4. Variance Decomposition:
    • Identifies whether disagreement is due to cases, pathologists, or randomness
    • Shows how AI affects each variance component
    • Ideal AI: Increases case variance %, decreases pathologist and residual variance %
  5. Statistical Power:
    • Ensures we have sufficient power to detect meaningful differences
    • Values > 0.80 are generally considered adequate
    • A priori power analysis: 69-case minimum required; 296 analyzed (see Materials and Methods, Section 3)
  6. Multiple Testing Correction:
    • Confirmatory analyses: 3 pre-specified primary hypotheses (no correction)
    • Exploratory analyses: Benjamini-Hochberg FDR correction applied
    • Reporting: Both raw and adjusted p-values presented
    • Sensitivity: Impact of correction quantified
  7. Clinical Significance:
    • MCID thresholds defined based on clinical relevance
    • ER/PR: Changes >5% may affect treatment decisions
    • Ki67: Changes >3% may reclassify molecular subtypes
    • ICC/Kappa: Changes >0.10 represent meaningful reliability shifts
    • Interpretation: Statistical significance ≠ clinical importance