| ICC with Bootstrap 95% Confidence Intervals | |||||
| Marker | Condition | ICC | 95% CI Lower | 95% CI Upper | SE |
|---|---|---|---|---|---|
| ER | Pre-AI | 0.962 | 0.949 | 0.972 | 0.006 |
| ER | Post-AI | 0.980 | 0.972 | 0.986 | 0.004 |
| PR | Pre-AI | 0.952 | 0.941 | 0.961 | 0.005 |
| PR | Post-AI | 0.974 | 0.959 | 0.984 | 0.007 |
| Ki67 | Pre-AI | 0.939 | 0.919 | 0.955 | 0.009 |
| Ki67 | Post-AI | 0.937 | 0.921 | 0.951 | 0.008 |
8 Statistical Tests: Pre-AI vs Post-AI
8.1 Objective
Perform formal statistical tests to determine if AI significantly improves inter-observer agreement and changes marker values.
Note for Pathologist: This is the deep statistical dive. We use “Bootstrap” resampling and “Z-tests” to assess whether the improvement in agreement (ICC) is real or just luck. We use “Paired t-tests” to see if AI systematically pushes scores up or down (bias). We also check for “Clinical Significance”, because a tiny statistical change might not matter for patient care.
8.2 Setup
8.3 Load Data
8.4 Test if Agreement Metrics Changed
Bootstrap confidence intervals and permutation tests for ICC and Kappa values.
Note for Pathologist: ICC (Intraclass Correlation Coefficient) measures how well pathologists agree. Values closer to 1.0 mean near-perfect agreement. The bootstrap confidence intervals (CI) tell you the range of plausible ICC values. If the Pre-AI and Post-AI confidence intervals do not overlap, the change in agreement is statistically reliable.
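The percentile-bootstrap idea can be sketched in Python. This is a minimal illustration using a one-way ICC(1) on simulated scores; the report's actual ICC model (likely a two-way form fit with dedicated statistical software) may differ.

```python
import random

def icc_oneway(ratings):
    """One-way ICC(1) from a one-way ANOVA decomposition.

    ratings: list of cases, each a list of k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    case_means = [sum(r) / k for r in ratings]
    # Between-case and within-case mean squares
    msb = k * sum((m - grand) ** 2 for m in case_means) / (n - 1)
    msw = sum((x - m) ** 2 for r, m in zip(ratings, case_means) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def bootstrap_icc_ci(ratings, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample cases with replacement, recompute ICC."""
    rng = random.Random(seed)
    boots = sorted(
        icc_oneway([rng.choice(ratings) for _ in ratings]) for _ in range(n_boot)
    )
    return boots[int(n_boot * alpha / 2)], boots[int(n_boot * (1 - alpha / 2)) - 1]

# Simulated example: 100 cases read by 4 raters with small reading error,
# so agreement (and hence the ICC) should be high and the CI narrow
rng = random.Random(1)
cases = [[t + rng.gauss(0, 5) for _ in range(4)]
         for t in (rng.uniform(0, 100) for _ in range(100))]
icc = icc_oneway(cases)
lo, hi = bootstrap_icc_ci(cases)
```

Resampling is done over cases (not individual readings) so that each bootstrap replicate preserves the within-case rating structure.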
8.4.1 Statistical Comparison of ICCs
Use z-test to compare pre vs post ICC values.
| ICC Comparison: Pre-AI vs Post-AI | ||||||||
| Z-test for difference in agreement | ||||||||
| Marker | ICC Pre | ICC Post | SE Pre | SE Post | Z Statistic | P-value | Significant? | Sig.1 |
|---|---|---|---|---|---|---|---|---|
| ER | 0.962 | 0.980 | 0.006 | 0.004 | 2.525 | 0.012 | TRUE | * |
| PR | 0.952 | 0.974 | 0.005 | 0.007 | 2.536 | 0.011 | TRUE | * |
| KI67 | 0.939 | 0.937 | 0.009 | 0.008 | −0.168 | 0.867 | FALSE | ns |
| 1 *** p < 0.001, ** p < 0.01, * p < 0.05, ns = not significant | ||||||||
| Note: This z-test treats pre-AI and post-AI ICCs as independent estimates. Because both are measured on the same cases, a paired bootstrap approach would provide more precise inference. The z-test is presented as an approximate comparison; bootstrap confidence intervals for ICC differences are reported in Chapter 2. | ||||||||
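The z-statistic in the table can be reproduced (approximately, since the tabulated ICCs and SEs are rounded) from the independent-estimates formula:

```python
import math

def icc_z_test(icc_pre, icc_post, se_pre, se_post):
    """Z-test for a difference between two ICC estimates,
    treating the two estimates as independent (see note above)."""
    z = (icc_post - icc_pre) / math.sqrt(se_pre ** 2 + se_post ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# ER row, using the rounded values from the table above
z, p = icc_z_test(0.962, 0.980, 0.006, 0.004)
# z ~ 2.50, p ~ 0.013 (table: 2.525, 0.012; differences reflect input rounding)
```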
8.5 Paired Tests for Marker Value Changes
Test if AI systematically changes the reported values.
| Paired Tests: Pre-AI vs Post-AI Values | |||||||
| Testing if AI systematically changes marker values | |||||||
| Marker | N Pairs | Mean Pre | Mean Post | Mean Difference | t Statistic | t-test P-value | Wilcoxon P-value |
|---|---|---|---|---|---|---|---|
| ER | 1175 | 71.68 | 71.04 | −0.63 | −3.40 | 0.0007 | 0.0016 |
| PR | 1159 | 31.60 | 30.48 | −1.11 | −4.93 | 0.0000 | 0.0000 |
| KI67 | 1162 | 25.43 | 31.32 | 5.89 | 26.15 | 0.0000 | 0.0000 |
8.5.1 Interpretation of Paired Tests
- Mean Difference: Positive values indicate AI tends to give higher values; negative values indicate lower.
- P-value < 0.05: Statistically significant systematic bias.
- Wilcoxon test: More robust to outliers than t-test.
Methodological note: The paired t-test above pools all pathologist-case pairs as independent observations (N = pathologists × cases). Because observations are clustered within pathologists, standard errors may be underestimated. The linear mixed-effects model in Section 3 below, which accounts for this clustering structure, should be considered the primary analysis; the paired t-test is presented as a simplified secondary comparison.
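The paired statistics can be sketched in Python (function name illustrative). Note the identity dz = t/√n, which lets the effect sizes reported below be cross-checked against the t-statistics and pair counts in the table above:

```python
import math
import statistics

def paired_t_and_dz(pre, post):
    """Paired t statistic and Cohen's dz = mean(diff) / sd(diff)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    dz = statistics.fmean(diffs) / statistics.stdev(diffs)
    t = dz * math.sqrt(n)  # equivalently mean(diff) / (sd(diff) / sqrt(n))
    return t, dz

# Consistency check against the tables: dz = t / sqrt(N pairs)
dz_ki67 = 26.15 / math.sqrt(1162)   # ~ 0.767, matching the effect-size table
dz_er = -3.40 / math.sqrt(1175)     # ~ -0.099
```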
| Effect Sizes (Paired Cohen's dz) | ||
| Magnitude of change from Pre-AI to Post-AI (dz = mean(diff)/sd(diff)) | ||
| Marker | Cohen's dz | Interpretation |
|---|---|---|
| ER | −0.099 | Negligible |
| PR | −0.145 | Negligible |
| KI67 | 0.767 | Medium |
8.6 McNemar’s Test for Categorical Changes
Test if AI significantly changes categorical classifications.
Methodological note: McNemar’s test assumes independent matched pairs. Here, observations are pooled across pathologists (4 observations per case), violating independence. As with the paired t-test, the mixed-effects models in Section 4 should be considered the primary analysis; McNemar’s test is presented as a descriptive summary of classification changes.
| McNemar's Test for HER2 Changes | ||
| | Negative/Low | Positive |
|---|---|---|
| Negative/Low | 738 | 16 |
| Positive | 44 | 275 |
Chi-square: 12.15
P-value: 0.0004908786
Interpretation: Significant change in HER2 classification
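The statistic above follows from the continuity-corrected McNemar formula applied to the discordant (off-diagonal) cells, 16 and 44:

```python
import math

def mcnemar(b, c):
    """McNemar's test with continuity correction on the discordant pair counts."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p

chi2, p = mcnemar(16, 44)  # discordant cells of the HER2 table above
# chi2 = 12.15, p ~ 0.00049, matching the reported values
```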
Molecular Subtype Changes:
Total cases: 1184
Changed: 204 (17.2%)
| Molecular Subtype Transitions | |||||
| Rows = Pre-AI, Columns = Post-AI | |||||
| Pre-AI Subtype | HER2 Positive | Hormone Weak Positive | Luminal A | Luminal B | Triple Negative |
|---|---|---|---|---|---|
| HER2 Positive | 275 | 7 | 11 | 26 | 5 |
| Hormone Weak Positive | 5 | 150 | 11 | 23 | 0 |
| Luminal A | 7 | 13 | 285 | 82 | 3 |
| Luminal B | 5 | 1 | 3 | 137 | 0 |
| Triple Negative | 0 | 2 | 0 | 0 | 133 |
8.7 Mixed Effects Models
Account for nested structure (cases within pathologists).
| Mixed Effects Models: AI Impact on Marker Values | ||||
| Accounting for case and pathologist effects | ||||
| Marker | AI Effect (Post - Pre) | Standard Error | t-value | P-value |
|---|---|---|---|---|
| ER | −0.662 | 0.241 | −2.753 | 0.0060 |
| PR | −1.031 | 0.294 | −3.507 | 0.0005 |
| KI67 | 5.896 | 0.233 | 25.353 | 0.0000 |
Note for Pathologist: The mixed-effects model accounts for the fact that the same pathologist sees many cases and the same case is seen by multiple pathologists. The “AI Effect” column shows how much AI changes the average score for each marker, after controlling for these repeated measurements. A positive estimate means AI pushes values up on average; a negative estimate means AI pushes values down.
8.8 Variance Component Analysis
Decompose variance into case, pathologist, and residual components.
| Variance Component Analysis | ||||||
| Percentage of total variance by source | ||||||
| Marker | Case (Pre) | Pathologist (Pre) | Residual (Pre) | Case (Post) | Pathologist (Post) | Residual (Post) |
|---|---|---|---|---|---|---|
| ER | 96.2 | 0.6 | 3.2 | 98.0 | 0.3 | 1.7 |
| PR | 94.8 | 0.1 | 5.1 | 97.5 | 0.0 | 2.5 |
| KI67 | 93.9 | 0.9 | 5.2 | 93.7 | 0.3 | 6.0 |
8.8.1 Interpretation
- Case Variance: True biological variation between cases
- Pathologist Variance: Systematic differences between pathologists
- Residual Variance: Random disagreement (within-pathologist variability)
Ideally, AI should:
- Increase case variance % (better discrimination between truly different cases)
- Decrease pathologist variance % (less systematic bias between observers)
- Decrease residual variance % (less random disagreement)
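A method-of-moments version of this decomposition for a balanced cases × pathologists layout can be sketched as follows (a simplified stand-in for the REML estimates a mixed-model package would give; data here are simulated):

```python
import random

def variance_components(y):
    """Two-way crossed decomposition (cases x raters, one score per cell).

    Returns (case, rater, residual) variance estimates via expected mean squares.
    """
    n, k = len(y), len(y[0])
    grand = sum(map(sum, y)) / (n * k)
    row_m = [sum(r) / k for r in y]
    col_m = [sum(y[i][j] for i in range(n)) / n for j in range(k)]
    ms_case = k * sum((m - grand) ** 2 for m in row_m) / (n - 1)
    ms_rater = n * sum((m - grand) ** 2 for m in col_m) / (k - 1)
    ms_res = sum(
        (y[i][j] - row_m[i] - col_m[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))
    var_res = ms_res
    var_case = max(0.0, (ms_case - ms_res) / k)    # negative estimates truncated
    var_rater = max(0.0, (ms_rater - ms_res) / n)
    return var_case, var_rater, var_res

# Synthetic check: large between-case spread, tiny rater bias, moderate noise,
# so the case component should dominate (as in the tables above)
rng = random.Random(7)
case_eff = [rng.gauss(50, 20) for _ in range(200)]
rater_eff = [rng.gauss(0, 1) for _ in range(4)]
scores = [[c + r + rng.gauss(0, 3) for r in rater_eff] for c in case_eff]
vc, vr, ve = variance_components(scores)
total = vc + vr + ve
pcts = [100 * v / total for v in (vc, vr, ve)]  # percentages as in the table
```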
8.9 Power Analysis
The a priori power analysis (see Materials and Methods, Section 3) determined a minimum required sample size of 69 cases (κ₀=0.4 vs κ₁=0.6, α=0.05, power=0.80). With 296 cases in the final cohort, the study is adequately powered for all primary and secondary analyses.
8.10 Multiple Testing Correction
Given the large number of statistical tests performed across this analysis, we apply Benjamini-Hochberg False Discovery Rate (FDR) correction to control the expected proportion of false positives among significant results (the false discovery rate).
8.10.1 Pre-Specified Primary Hypotheses (No Correction Needed)
Based on study objectives, we pre-specify 3 primary hypotheses:
- H1: AI improves interobserver agreement for Ki-67 (measured by ICC change)
- H2: AI improves interobserver agreement for HER2 (measured by Kappa change)
- H3: AI systematically changes Ki-67 values (measured by paired t-test)
These confirmatory hypotheses are tested at α = 0.05 without correction.
8.10.2 All Other Tests (Exploratory - FDR Correction Applied)
| Multiple Testing Correction Results | |||||||
| Benjamini-Hochberg FDR correction for exploratory analyses1 | |||||||
| Test Name | Marker | Raw P-value | Test Type | Adjusted P-value (FDR) | Sig. (Raw) | Sig. (FDR) | Sig.2 |
|---|---|---|---|---|---|---|---|
| Mixed_model_KI67 | KI67 | 0.0000 | Mixed Effects | 0.0000 | TRUE | TRUE | *** |
| Wilcoxon_KI67 | KI67 | 0.0000 | Systematic Change (Wilcoxon) | 0.0000 | TRUE | TRUE | *** |
| Paired_t_PR | PR | 0.0000 | Systematic Change (t-test) | 0.0000 | TRUE | TRUE | *** |
| Wilcoxon_PR | PR | 0.0000 | Systematic Change (Wilcoxon) | 0.0000 | TRUE | TRUE | *** |
| Mixed_model_PR | PR | 0.0005 | Mixed Effects | 0.0010 | TRUE | TRUE | ** |
| Paired_t_ER | ER | 0.0007 | Systematic Change (t-test) | 0.0013 | TRUE | TRUE | ** |
| Wilcoxon_ER | ER | 0.0016 | Systematic Change (Wilcoxon) | 0.0025 | TRUE | TRUE | ** |
| Mixed_model_ER | ER | 0.0060 | Mixed Effects | 0.0082 | TRUE | TRUE | ** |
| ICC_pr_comparison | pr | 0.0112 | ICC Agreement | 0.0127 | TRUE | TRUE | * |
| ICC_er_comparison | er | 0.0116 | ICC Agreement | 0.0127 | TRUE | TRUE | * |
| ICC_ki67_comparison | ki67 | 0.8668 | ICC Agreement | 0.8668 | FALSE | FALSE | ns |
| 1 Primary hypotheses (Ki67 ICC, HER2 Kappa, Ki67 paired t-test) not included - tested at α=0.05 | |||||||
| 2 *** q < 0.001, ** q < 0.01, * q < 0.05, ns = not significant | |||||||
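The Benjamini-Hochberg adjustment behind the q-values above can be sketched in Python. Running it on the rounded raw p-values from the table reproduces most of the tabulated q-values; small discrepancies (e.g. for p = 0.0005) reflect rounding of the raw inputs.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), capped at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a running minimum of p*m/rank
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Rounded raw p-values from the exploratory-test table above
raw = [0.0, 0.0, 0.0, 0.0, 0.0005, 0.0007, 0.0016, 0.0060, 0.0112, 0.0116, 0.8668]
q = bh_adjust(raw)
# e.g. q for p=0.0060 is 0.0060*11/8 ~ 0.0082, matching the table
```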
8.10.3 Sensitivity Analysis: Impact of Multiple Testing Correction
Impact of Multiple Testing Correction:
Total exploratory tests: 11
Significant (raw α=0.05): 10
Significant (FDR q<0.05): 10
Tests losing significance: 0
Proportion losing significance: 0 %
8.10.4 Reporting Standards
For Primary Hypotheses: Report raw p-values at α = 0.05 (pre-specified, confirmatory).
For Exploratory Analyses: Report both raw p-values and FDR-adjusted q-values. Interpret results conservatively using adjusted values.
In Manuscript:
- State pre-specified hypotheses clearly in Methods
- Report: “To control for multiple testing in exploratory analyses, we applied Benjamini-Hochberg False Discovery Rate correction (q < 0.05)”
- Present both raw and adjusted p-values in tables
- Interpret findings based on adjusted values for exploratory tests
8.11 Clinical vs Statistical Significance
Statistical significance does not always imply clinical importance. Here we define Minimum Clinically Important Differences (MCID) for each metric.
| Minimum Clinically Important Differences (MCID) | ||
| Thresholds for interpreting clinical significance | ||
| Metric | MCID Threshold | Clinical Rationale |
|---|---|---|
| ICC | 0.10 | ICC change >0.10 represents meaningful shift in reliability |
| Kappa | 0.10 | Kappa change >0.10 represents clinically noticeable agreement improvement |
| Mean Difference (ER) | 5.00 | ER change >5% may cross therapeutic threshold (10% cutoff ± buffer) |
| Mean Difference (PR) | 5.00 | PR change >5% may cross therapeutic threshold (10% cutoff ± buffer) |
| Mean Difference (Ki67) | 3.00 | Ki67 change >3% may affect subtype classification (near 20%, 30% cutoffs) |
| Cohen's d | 0.30 | Effect size >0.30 represents small-to-medium practical impact |
| Clinical Significance Interpretation | ||||
| Distinguishing statistical from clinical significance | ||||
| Finding | Observed Value1 | MCID | Exceeds MCID? | Clinical Interpretation |
|---|---|---|---|---|
| ER ICC change | 0.018 | 0.100 | FALSE | Negligible (ceiling/floor effect) |
| PR ICC change | 0.022 | 0.100 | FALSE | Negligible (ceiling/floor effect) |
| Ki67 ICC change | −0.002 | 0.100 | FALSE | Negligible (ceiling/floor effect) |
| ER mean difference | −0.63 | 5.000 | FALSE | Statistically significant but clinically trivial |
| PR mean difference | −1.11 | 5.000 | FALSE | Statistically significant but clinically trivial |
| Ki67 mean difference | +5.89 | 3.000 | TRUE | Statistically and clinically significant; may reclassify subtypes |
| 1 Positive values indicate AI increases metric; negative indicates decrease | ||||
8.11.1 Key Insights: Statistical vs Clinical Significance
Summary of Statistical vs Clinical Significance:
ER:
- ICC change: +0.018 (below MCID of 0.10)
- Interpretation: Negligible (ceiling/floor effect)
PR:
- ICC change: +0.022 (below MCID of 0.10)
- Interpretation: Negligible (ceiling/floor effect)
Ki-67:
- ICC change: −0.002 (below MCID of 0.10)
- Mean systematic increase: 5.89 percentage points
- Clinical impact: Changes near the 20% and 30% cutoffs may reclassify Luminal A/B subtypes
8.12 Conclusion
8.12.1 Summary of Statistical Findings
- Agreement Improvement:
- ICC values with confidence intervals show whether AI significantly improved agreement
- Z-tests directly compare pre vs post ICC values
- Primary hypotheses (Ki67 ICC, HER2 Kappa): tested at α=0.05 without correction
- Exploratory tests: FDR correction applied (q<0.05)
- Systematic Changes:
- Paired t-tests reveal if AI systematically shifts values up or down
- Effect sizes quantify the magnitude of these changes
- Clinical significance: Only changes exceeding MCID thresholds are clinically meaningful
- Categorical Changes:
- McNemar’s test shows if classification changes are statistically significant
- Subtype transition tables reveal patterns
- Clinical impact: Assessed by proportion of cases crossing therapeutic thresholds
- Variance Decomposition:
- Identifies whether disagreement is due to cases, pathologists, or randomness
- Shows how AI affects each variance component
- Ideal AI: Increases case variance %, decreases pathologist and residual variance %
- Statistical Power:
- Ensures we have sufficient power to detect meaningful differences
- Values > 0.80 are generally considered adequate
- A priori power analysis: 69-case minimum required; 296 analyzed (see Materials and Methods, Section 3)
- Multiple Testing Correction:
- Confirmatory analyses: 3 pre-specified primary hypotheses (no correction)
- Exploratory analyses: Benjamini-Hochberg FDR correction applied
- Reporting: Both raw and adjusted p-values presented
- Sensitivity: Impact of correction quantified
- Clinical Significance:
- MCID thresholds defined based on clinical relevance
- ER/PR: Changes >5% may affect treatment decisions
- Ki67: Changes >3% may reclassify molecular subtypes
- ICC/Kappa: Changes >0.10 represent meaningful reliability shifts
- Interpretation: Statistical significance ≠ clinical importance