21  Interaction Effects Analysis

While main effects (e.g., “AI improves agreement”) are informative, interaction effects reveal when and for whom AI is most effective. Interaction effects address questions like:

  1. Marker × Biopsy Type: Does AI help more for tru-cut biopsies vs excisions?
  2. Marker × Pathologist: Do some pathologists benefit more from AI for specific markers?
  3. Threshold Proximity × AI: Does AI have greater impact near clinical decision thresholds?
  4. Baseline Agreement × AI: Does AI help more when initial disagreement is high?

This chapter provides formal statistical tests for interaction effects, extending the descriptive findings already reported in the manuscript (e.g., Table 5: marker × biopsy type opposing patterns).

Note for Pathologist: “Interaction” asks: “Does the AI effect depend on something else?” For example, does AI help more for biopsies than excisions? Or does it help Pathologist A more than Pathologist B? This tells us if a “one-size-fits-all” approach works, or if we need tailored instructions.

21.1 Marker × Biopsy Type Interaction

21.1.1 Background

Primary finding (reported in manuscript Table 5): Ki-67 shows increased agreement for tru-cut biopsies but decreased agreement for excisions, while ER/PR show the opposite pattern.

Hypothesis: Specimen type moderates the AI effect, and the direction of that moderation differs by marker.

21.1.2 Mixed Model Specification

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: value ~ marker * biopsy_type * modality + (1 | case_id) + (1 |  
    pathologist)
   Data: model_data_continuous
Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e+05))

REML criterion at convergence: 66932.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.31471 -0.72276 -0.05804  0.72109  3.09315 

Random effects:
 Groups      Name        Variance Std.Dev.
 case_id     (Intercept) 310.923  17.633  
 pathologist (Intercept)   1.727   1.314  
 Residual                721.354  26.858  
Number of obs: 7034, groups:  case_id, 296; pathologist, 4

Fixed effects:
                                            Estimate Std. Error        df
(Intercept)                                  72.5793     1.8046   97.8409
markerki67                                  -49.1515     1.4439 6726.1895
markerpr                                    -39.0568     1.4408 6724.9778
biopsy_typeTru-cut                           -2.4236     2.6176  603.8967
modalitypost                                 -0.5042     1.4424 6725.0047
markerki67:biopsy_typeTru-cut                 6.9836     2.2475 6725.6397
markerpr:biopsy_typeTru-cut                  -2.8132     2.2463 6725.4058
markerki67:modalitypost                       6.3367     2.0458 6725.4757
markerpr:modalitypost                        -0.5471     2.0439 6725.4133
biopsy_typeTru-cut:modalitypost              -0.1281     2.2479 6725.1221
markerki67:biopsy_typeTru-cut:modalitypost    0.4515     3.1855 6725.1966
markerpr:biopsy_typeTru-cut:modalitypost      0.2564     3.1858 6725.2191
                                           t value Pr(>|t|)    
(Intercept)                                 40.218  < 2e-16 ***
markerki67                                 -34.041  < 2e-16 ***
markerpr                                   -27.107  < 2e-16 ***
biopsy_typeTru-cut                          -0.926  0.35488    
modalitypost                                -0.350  0.72667    
markerki67:biopsy_typeTru-cut                3.107  0.00190 ** 
markerpr:biopsy_typeTru-cut                 -1.252  0.21048    
markerki67:modalitypost                      3.097  0.00196 ** 
markerpr:modalitypost                       -0.268  0.78894    
biopsy_typeTru-cut:modalitypost             -0.057  0.95455    
markerki67:biopsy_typeTru-cut:modalitypost   0.142  0.88729    
markerpr:biopsy_typeTru-cut:modalitypost     0.080  0.93587    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mrkr67 mrkrpr bps_T- mdltyp mr67:_T- mr:_T- mrk67: mrkrp:
markerki67  -0.398                                                          
markerpr    -0.399  0.499                                                   
bpsy_typTr- -0.598  0.275  0.275                                            
modalitypst -0.399  0.498  0.499  0.275                                     
mrkrk67:_T-  0.256 -0.642 -0.321 -0.428 -0.320                              
mrkrpr:b_T-  0.256 -0.320 -0.641 -0.428 -0.320  0.498                       
mrkrk67:mdl  0.281 -0.705 -0.352 -0.194 -0.705  0.453    0.226              
mrkrpr:mdlt  0.281 -0.352 -0.705 -0.194 -0.706  0.226    0.452  0.498       
bpsy_typT-:  0.256 -0.320 -0.320 -0.428 -0.642  0.498    0.498  0.452  0.453
mrkr67:_T-: -0.181  0.453  0.226  0.302  0.453 -0.705   -0.352 -0.642 -0.320
mrkrpr:_T-: -0.181  0.226  0.452  0.302  0.453 -0.351   -0.705 -0.319 -0.642
            bp_T-: m67:_T-:
markerki67                 
markerpr                   
bpsy_typTr-                
modalitypst                
mrkrk67:_T-                
mrkrpr:b_T-                
mrkrk67:mdl                
mrkrpr:mdlt                
bpsy_typT-:                
mrkr67:_T-: -0.706         
mrkrpr:_T-: -0.706  0.498  
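
The random-effects block of the model output can be read as a variance partition. The sketch below is plain Python arithmetic on the three reported variances, not part of the original R analysis; it expresses each component as a share of the total random variance. The shares are conditional on the fixed effects in the model.

```python
# Variance components copied from the "Random effects" block above.
variance = {
    "case_id": 310.923,
    "pathologist": 1.727,
    "residual": 721.354,
}

total = sum(variance.values())

# Share of the total random variance attributable to each source.
share = {name: value / total for name, value in variance.items()}

for name, fraction in share.items():
    print(f"{name:12s} {fraction:6.1%}")
```

On these numbers, about 30% of the remaining variance lies between cases and well under 1% between pathologists, so after the fixed effects are accounted for, differences in overall scoring level between the four pathologists are small relative to case-to-case and residual variation.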

21.1.3 ANOVA Table

Test significance of interaction terms.

ANOVA: Marker × Biopsy Type × Modality Interaction
Term                               Sum Sq       Mean Sq  NumDF      DenDF    F value   Pr(>F)
marker                       2620631.7071  1310315.8535      2  6726.9349  1816.4680  <0.0001
biopsy_type                      145.4153      145.4153      1   293.8481     0.2016   0.6538
modality                        3728.1659     3728.1659      1  6725.6422     5.1683   0.0230
marker:biopsy_type             29617.9187    14808.9593      2  6726.9407    20.5294  <0.0001
marker:modality                17382.3318     8691.1659      2  6725.2877    12.0484  <0.0001
biopsy_type:modality                4.9500        4.9500      1  6725.6229     0.0069   0.9340
marker:biopsy_type:modality       14.5854        7.2927      2  6725.2764     0.0101   0.9899

Note for Pathologist: The ANOVA table tests whether AI’s effect depends on what you are measuring (ER vs Ki67) and what type of specimen you are scoring (excision vs tru-cut). Look at the “Pr(>F)” column: values below 0.05 indicate the interaction is statistically significant. A significant three-way interaction means you cannot make a blanket statement like “AI improves all markers equally for all specimen types.”

21.1.4 Interpretation

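Three-way interaction (marker:biopsy_type:modality):

  • If p < 0.05: the AI effect (Pre vs Post change) varies by marker, and that variation itself depends on biopsy type
  • If p ≥ 0.05: no significant three-way interaction; the two-way interactions are sufficient

Here the three-way term is not significant (F = 0.0101, p = 0.9899), and neither is biopsy_type:modality (p = 0.9340). The marker:modality interaction is significant (F = 12.05, p < 0.0001): the AI effect on scored values differs by marker, but there is no evidence that it differs by biopsy type. The significant marker:biopsy_type term (F = 20.53, p < 0.0001) reflects marker values differing by specimen type regardless of modality; it is not an AI effect.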

21.1.5 Post-Hoc Contrasts

Test specific comparisons of interest using emmeans.

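AI Effect (Pre-AI - Post-AI Contrast) by Marker and Biopsy Type
Contrast    Marker  Biopsy Type  Estimate     SE   df  t ratio  p value
pre - post  er      Excision        0.504  1.442  Inf    0.350    0.727
pre - post  ki67    Excision       -5.833  1.451  Inf   -4.020   <0.001
pre - post  pr      Excision        1.051  1.448  Inf    0.726    0.468
pre - post  er      Tru-cut         0.632  1.724  Inf    0.367    0.714
pre - post  ki67    Tru-cut        -6.156  1.729  Inf   -3.560   <0.001
pre - post  pr      Tru-cut         0.923  1.732  Inf    0.533    0.594

The contrast is Pre - Post, so a negative estimate means scored values were higher after AI. Only Ki-67 changes significantly: values rise by about 5.8 points for excisions and 6.2 points for tru-cut biopsies (both p < 0.001), while the ER and PR changes are all below 1.1 points and non-significant.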
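
Each contrast in this table can be reproduced by adding up the fixed-effect coefficients from the model summary in Section 21.1.2 that involve modalitypost. The sketch below does this in plain Python rather than with emmeans; it assumes treatment coding with reference levels er, Excision and pre, which is what the coefficient names suggest. The function name is illustrative.

```python
# Fixed-effect estimates copied from the model summary in Section 21.1.2.
# Assumed reference levels (inferred from the coefficient names):
# marker = er, biopsy_type = Excision, modality = pre.
coef = {
    "modalitypost": -0.5042,
    "markerki67:modalitypost": 6.3367,
    "markerpr:modalitypost": -0.5471,
    "biopsy_typeTru-cut:modalitypost": -0.1281,
    "markerki67:biopsy_typeTru-cut:modalitypost": 0.4515,
    "markerpr:biopsy_typeTru-cut:modalitypost": 0.2564,
}


def post_minus_pre(marker: str, biopsy: str) -> float:
    """Sum every coefficient that switches on when modality goes pre -> post."""
    effect = coef["modalitypost"]
    if marker != "er":
        effect += coef[f"marker{marker}:modalitypost"]
    if biopsy == "Tru-cut":
        effect += coef["biopsy_typeTru-cut:modalitypost"]
        if marker != "er":
            effect += coef[f"marker{marker}:biopsy_typeTru-cut:modalitypost"]
    return effect


for biopsy in ("Excision", "Tru-cut"):
    for marker in ("er", "ki67", "pr"):
        effect = post_minus_pre(marker, biopsy)
        # The contrast table reports pre - post, i.e. the negative of this value.
        print(f"{marker:5s} {biopsy:9s} post-pre = {effect:+.3f}  pre-post = {-effect:+.3f}")
```

For Ki-67 on excisions, for example, the sum is -0.5042 + 6.3367 = 5.8325, which matches the reported pre - post estimate of -5.833 up to rounding. The tru-cut rows differ from the excision rows only by the two small terms involving biopsy_typeTru-cut:modalitypost, which is why the AI effect looks nearly identical across specimen types.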

21.1.6 Visualization: Interaction Plot

21.1.7 Marker × Pathologist Interaction

21.1.7.1 Research Question

Does the AI effect vary across pathologists, and does that variation differ by marker?

Clinical implication: Personalized AI recommendations (e.g., “Pathologist A benefits most from AI for Ki-67”).

21.1.7.2 Model Specification

ANOVA: Marker × Pathologist × Modality Interaction
Term                               Sum Sq       Mean Sq  NumDF      DenDF    F value   Pr(>F)
marker                        2733792.675  1366896.3375      2   6717.081  1882.7697  <0.0001
pathologist                     11186.814     3728.9378      3   6715.622     5.1363   0.0015
modality                         3772.699     3772.6992      1   6715.572     5.1965   0.0227
marker:pathologist               2912.318      485.3864      6   6715.243     0.6686   0.6751
marker:modality                 17822.797     8911.3983      2   6715.324    12.2746  <0.0001
pathologist:modality             1271.718      423.9059      3   6715.322     0.5839   0.6255
marker:pathologist:modality      1302.197      217.0329      6   6715.026     0.2989   0.9376

21.1.7.3 Pathologist-Specific AI Effects

Extract AI effect (Post - Pre) for each pathologist and marker.

Mean AI Effect by Pathologist and Marker
Pathologist         ER  Ki-67     PR
Pathologist 1    -1.68   8.30  -1.40
Pathologist 2    -1.42   3.40  -1.97
Pathologist 3     0.52   5.77  -1.07
Pathologist 4     0.04   6.05  -0.05

21.1.8 Key Observations

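For each marker, identify the pathologists with:

  1. The highest mean AI effect (most positive Post - Pre change)
  2. The lowest mean AI effect (most negative, or least positive, change)
  3. Marker-specific patterns (e.g., large effect for Ki-67 but minimal for ER)

Pathologists with Highest and Lowest AI Effects by Marker
Marker  Direction  Pathologist    Mean AI Effect
er      Highest    Pathologist 3            0.52
ki67    Highest    Pathologist 1            8.30
pr      Highest    Pathologist 4           -0.05
er      Lowest     Pathologist 1           -1.68
ki67    Lowest     Pathologist 2            3.40
pr      Lowest     Pathologist 2           -1.97

All four pathologists scored Ki-67 higher after AI (+3.40 to +8.30) and PR slightly lower (-1.97 to -0.05), so the lowest Ki-67 effect is still an increase and the highest PR effect is still a decrease. These differences between pathologists are descriptive: the pathologist:modality and marker:pathologist:modality terms in the ANOVA above are not significant (p = 0.6255 and p = 0.9376).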
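
The extremes for each marker follow directly from the Mean AI Effect by Pathologist and Marker table in Section 21.1.7.3. As a minimal sketch in plain Python (not the original analysis code; variable names are illustrative), the lookup below also records whether all four pathologists moved in the same direction, which matters when reading an extreme as an increase or a decrease.

```python
# Mean AI effect (Post - Pre) per pathologist, copied from Section 21.1.7.3.
effects = {
    "Pathologist 1": {"er": -1.68, "ki67": 8.30, "pr": -1.40},
    "Pathologist 2": {"er": -1.42, "ki67": 3.40, "pr": -1.97},
    "Pathologist 3": {"er": 0.52, "ki67": 5.77, "pr": -1.07},
    "Pathologist 4": {"er": 0.04, "ki67": 6.05, "pr": -0.05},
}

summary = {}
for marker in ("er", "ki67", "pr"):
    by_pathologist = {name: row[marker] for name, row in effects.items()}
    values = list(by_pathologist.values())
    summary[marker] = {
        "highest": max(by_pathologist, key=by_pathologist.get),
        "lowest": min(by_pathologist, key=by_pathologist.get),
        # True when every pathologist shifted in the same direction.
        "same_sign": all(v > 0 for v in values) or all(v < 0 for v in values),
    }

for marker, row in summary.items():
    print(marker, row)
```

For Ki-67 and PR the same_sign flag is true, so the extremes for those markers describe how much values shifted, not whether they rose or fell.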

21.1.9 Threshold Proximity × AI Effect

21.1.9.1 Research Question

Do cases near clinical thresholds (10%, 20%, 30% for Ki-67; 10% for ER/PR) show larger AI effects?

Hypothesis: Borderline cases have the greatest variability, so AI may have a disproportionate impact on them.

21.1.9.2 Define Threshold Proximity

Cases within ±5 percentage points of a threshold are classified as “Near Threshold”; all other cases are “Away from Threshold”.
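
The classification rule can be sketched as follows. This is an illustrative Python version, not the original code: the thresholds come from the research question above, while the inclusive boundary (a distance of exactly 5 points counts as near) and the choice of which score is classified (for example the Pre-AI value) are assumptions.

```python
# Clinical thresholds in percent, as listed in the research question.
THRESHOLDS = {"ki67": (10, 20, 30), "er": (10,), "pr": (10,)}
MARGIN = 5.0  # percentage points


def proximity(marker: str, value: float, margin: float = MARGIN) -> str:
    """Label a score by its distance to the nearest threshold for its marker.

    Assumption: the boundary is inclusive (distance == margin counts as near).
    """
    near = any(abs(value - cutoff) <= margin for cutoff in THRESHOLDS[marker])
    return "Near Threshold" if near else "Away from Threshold"


for marker, value in [("ki67", 17), ("ki67", 40), ("er", 12), ("er", 50)]:
    print(marker, value, proximity(marker, value))
```

Under this rule the three Ki-67 bands (5 to 15, 15 to 25, 25 to 35) join into a single 5% to 35% range, so most Ki-67 scores count as near a threshold, whereas ER and PR have one narrow band around 10%.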

AI Effect by Threshold Proximity
Marker  Proximity                N  Mean AI Effect  Mean |AI Effect|    SD
er      Away from Threshold   1149           -0.66              3.00  6.30
er      Near Threshold          17            2.29              2.88  3.00
ki67    Away from Threshold    349            3.92              5.74  7.65
ki67    Near Threshold         605            5.01              5.69  5.71
pr      Away from Threshold    997           -1.41              3.87  7.52
pr      Near Threshold          80            0.91              1.84  2.85

21.1.9.3 Statistical Test

Test if AI effect magnitude differs by threshold proximity.

ANOVA: Threshold Proximity × AI Effect Interaction
Term                   Sum Sq    Mean Sq  NumDF     DenDF  F value   Pr(>F)
marker              1871.4458   935.7229      2  3083.967  31.4442  <0.0001
proximity             41.6405    41.6405      1  2952.865   1.3993   0.2369
marker:proximity     100.2062    50.1031      2  3015.155   1.6837   0.1859

21.1.9.4 Visualization

21.1.9.5 Interpretation

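If near-threshold cases show larger AI effects:
- Supports the hypothesis that AI is most impactful where human variability is highest
- Clinical implication: Consider mandatory AI review for borderline cases

If no difference:
- AI effect is uniform across value ranges
- Threshold proximity does not moderate AI impact

In these data the second scenario applies: neither proximity (F = 1.40, p = 0.2369) nor marker:proximity (F = 1.68, p = 0.1859) is significant, and the mean |AI effect| near thresholds is no larger than away from them (ER 2.88 vs 3.00, Ki-67 5.69 vs 5.74, PR 1.84 vs 3.87). The ER near-threshold group contains only 17 observations, so that comparison is imprecise.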

22 Baseline Disagreement × AI Effectiveness

22.0.1 Research Question

In cases with high Pre-AI disagreement (large variance across pathologists), does AI reduce or amplify discord?

Hypothesis: High baseline disagreement indicates difficult cases; AI may help consensus.

22.0.1.1 Calculate Baseline Disagreement

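For each case and marker, calculate the coefficient of variation (CV = standard deviation / mean) of the Pre-AI assessments across pathologists. Cases are then grouped into high and low baseline disagreement; the decision framework later in this chapter labels CV > median as high.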
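
A minimal sketch of this calculation is shown below. It is illustrative Python, not the original code. The scores are hypothetical, and two details are assumptions: the sample standard deviation is used, and cases are split at the median Pre-AI CV.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(scores):
    """CV = sample SD / mean; returns None when the mean is zero (CV undefined)."""
    m = mean(scores)
    if m == 0:
        return None
    return stdev(scores) / m


# Hypothetical scores from four pathologists for three cases (not study data).
cases = {
    "case_A": {"pre": [80, 85, 90, 75], "post": [82, 84, 86, 80]},
    "case_B": {"pre": [5, 20, 10, 30], "post": [10, 15, 12, 18]},
    "case_C": {"pre": [40, 42, 41, 39], "post": [40, 45, 38, 43]},
}

rows = {}
for case_id, scores in cases.items():
    cv_pre = coefficient_of_variation(scores["pre"])
    cv_post = coefficient_of_variation(scores["post"])
    rows[case_id] = {"cv_pre": cv_pre, "cv_post": cv_post, "change": cv_post - cv_pre}

# Assumed grouping rule: Pre-AI CV above the median counts as high disagreement.
cutoff = median(row["cv_pre"] for row in rows.values())
for case_id, row in rows.items():
    row["level"] = "High" if row["cv_pre"] > cutoff else "Low"
    print(case_id, row["level"], f"change = {row['change']:+.3f}")
```

A negative change means the pathologists' scores moved closer together after AI. Because CV divides by the mean, it becomes unstable for cases with scores near zero, which is one reason the zero-mean case is returned as undefined here.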

AI Effect on Disagreement by Baseline Disagreement Level
Marker  Baseline Disagreement  N Cases  Mean CV (Pre)  Mean CV (Post)  Mean CV Change  % Cases with Reduced CV
er      High Disagreement          109          0.207           0.116          -0.091                      78.9
er      Low Disagreement           138          0.037           0.031          -0.005                      65.9
ki67    High Disagreement          144          0.377           0.260          -0.117                      76.4
ki67    Low Disagreement           146          0.065           0.121           0.056                      37.0
pr      High Disagreement           90          0.673           0.391          -0.283                      88.9
pr      Low Disagreement            92          0.101           0.067          -0.033                      70.7

22.0.1.2 Visualization

22.0.1.3 Interpretation

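Negative CV change: AI reduced disagreement (improved consensus)

Positive CV change: AI amplified disagreement (less consensus)

Key finding:
- High-disagreement cases show the larger negative CV change for every marker (ER -0.091, Ki-67 -0.117, PR -0.283), with 76% to 89% of cases showing reduced CV → AI is most helpful for difficult cases
- Low-disagreement ER and PR cases show small negative changes (-0.005 and -0.033) → AI maintains consensus but adds little value
- Low-disagreement Ki-67 cases are the exception: mean CV rose from 0.065 to 0.121 and only 37% of cases improved, so AI amplified disagreement where pathologists initially agreed

Because cases are grouped by their Pre-AI CV, part of the reduction in the high-disagreement groups (and of the increase in the low-disagreement Ki-67 group) may reflect regression to the mean and not an AI effect.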

22.0.2 Comprehensive Interaction Summary

22.0.2.1 Effect Sizes and Significance

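Summary of Interaction Effects: Findings and Clinical Implications
Interaction | Main Finding | Statistical Test | p-value | Clinical Implication
Marker × Biopsy Type | Opposing agreement patterns reported descriptively (manuscript Table 5: Ki-67 better for tru-cut, ER/PR better for excision); in the mixed model of scored values, the AI effect did not differ by biopsy type | Three-way ANOVA (mixed model) | 0.9899 (marker:biopsy_type:modality) | Specimen-type-specific AI recommendations rest on the descriptive Table 5 patterns and are not confirmed by this test
Marker × Pathologist | AI effect did not differ significantly between pathologists or pathologist-marker combinations; Ki-67 values increased for all four pathologists (+3.40 to +8.30) | Three-way ANOVA (mixed model) | 0.9376 (marker:pathologist:modality) | No evidence in this cohort that pathologist-specific AI guidance is needed; a larger cohort is required to test this
Threshold Proximity × AI | Borderline cases show similar AI effects to cases away from thresholds | Two-way ANOVA (proximity × marker) | 0.2369 (proximity); 0.1859 (marker:proximity) | These data alone do not justify mandatory AI review for cases within ±5 percentage points of thresholds
Baseline Disagreement × AI | AI reduces disagreement in high-discord cases (mean CV change -0.091 to -0.283) | CV comparison (Pre vs Post) | Descriptive only | AI may be most valuable as a quality assurance tool for difficult cases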

22.0.2.2 Forest Plot: Interaction Effect Sizes

22.0.3 Clinical Decision Framework

22.0.3.1 When to Use AI: Decision Tree

Based on the interaction findings, the following framework translates the results into actionable guidance for pathologists:

Clinical Decision Framework: When and How to Use AI
Clinical Scenario | AI Recommendation | Rationale
Tru-cut biopsy + Ki-67 assessment | Strongly recommend AI | Largest positive AI effect observed; AI improves consistency
Excision specimen + Ki-67 assessment | Use with caution; review AI suggestions critically | AI may introduce systematic upward bias
ER/PR assessment near 10% threshold | Recommend AI for borderline cases | High clinical impact; AI may reduce misclassification
HER2 1+ vs 0 distinction | Mandatory AI review recommended | Highest interobserver variability; AI improves agreement (Kappa +0.058)
High baseline disagreement case (CV > median) | Use AI as tiebreaker | AI reduces disagreement in difficult cases
Pathologist with low baseline Ki-67 agreement | Targeted AI training recommended | Personalized intervention based on marker-specific performance

22.0.4 Limitations of Interaction Analysis

  1. Sample size: With 4 pathologists, some interactions may be underpowered
  2. Multiple comparisons: Many interaction terms tested; risk of false positives despite adjustment
  3. Post-hoc exploration: Some interactions identified through data exploration rather than pre-specified
  4. Generalizability: Interactions may be specific to this AI system (Aiforia) and pathologist cohort

22.0.5 Recommendations for Future Studies

  1. Pre-specify interaction hypotheses: Reduce exploratory bias
  2. Larger pathologist cohorts: Enable more robust interaction detection (N ≥ 10 pathologists)
  3. External validation: Test if interactions replicate in independent datasets
  4. Mechanistic studies: Qualitative interviews to understand why certain interactions exist
  5. Adaptive AI: Design AI systems that adjust presentation based on detected interactions