15  Agreement Extensions and Precision

15.1 Aim

Quantify the precision of our agreement metrics with bootstrap confidence intervals, retain more cases through pairwise (rather than complete-case) handling, and compare pre- and post-AI reliability directly.

Note for Pathologist: This is a technical supplement.
We calculate “Confidence Intervals” to show how certain we are about the agreement numbers.
We also look at “Pairwise Agreement” to see if specific pairs of pathologists agree more than others (e.g., do Senior pathologists agree more with each other?).

15.2 Sample Size and Completeness

# A tibble: 10 × 3
   marker  modality n_cases
   <chr>   <chr>      <int>
 1 ER      post        1184
 2 ER      pre         1184
 3 PR      post        1184
 4 PR      pre         1184
 5 Ki67    post        1184
 6 Ki67    pre         1184
 7 HER2    post        1184
 8 HER2    pre         1184
 9 Subtype post        1184
10 Subtype pre         1184
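
All five markers have 1,184 assessed cases in both the pre- and post-AI arms, so no marker/modality cell is missing cases. A minimal sketch of how such a completeness summary can be produced is shown below, assuming a long-format data frame scores_long with columns marker, modality and case_id; the object and column names are illustrative, not the report's actual ones.

library(dplyr)

completeness <- scores_long %>%
  distinct(marker, modality, case_id) %>%     # one row per case within each marker/modality
  count(marker, modality, name = "n_cases")   # tally cases per marker and modality

completeness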

15.3 Bootstrap Confidence Intervals for Kappa/ICC

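Bootstrap CIs are obtained by resampling cases with replacement and recomputing the agreement statistic on each resample; the percentile interval of the resampled statistics gives the CI. A minimal sketch follows, assuming a wide matrix her2_post (rows = cases, columns = raters) of post-AI HER2 scores; the object name and settings are illustrative, not the report's actual pipeline.

library(boot)
library(irr)

kappa_stat <- function(data, idx) {
  # Resample cases (rows) with replacement and recompute Fleiss' kappa
  kappam.fleiss(data[idx, ])$value
}

set.seed(123)
boot_kappa <- boot(her2_post, statistic = kappa_stat, R = 2000)
boot.ci(boot_kappa, type = "perc")   # percentile 95% CI for Fleiss' kappa

# The same pattern applies to ICC for the continuous markers (ER/PR/Ki67),
# swapping the statistic for, e.g.,
# irr::icc(data[idx, ], model = "twoway", type = "agreement", unit = "single")$value
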
15.4 Pairwise Agreement by Rater Pair

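Pairwise agreement scores each rater pair separately, so a pair contributes a case whenever both of its members scored it, even if other raters did not. A minimal sketch for the categorical markers is below, using Cohen's kappa per pair and assuming the same illustrative wide matrix her2_post with one column per rater.

library(irr)

rater_pairs <- combn(colnames(her2_post), 2, simplify = FALSE)

pairwise_kappa <- do.call(rbind, lapply(rater_pairs, function(p) {
  k <- kappa2(her2_post[, p])   # unweighted Cohen's kappa for this rater pair
  data.frame(rater_a = p[1], rater_b = p[2], kappa = k$value, n_cases = k$subjects)
}))

pairwise_kappa[order(-pairwise_kappa$kappa), ]   # strongest-agreeing pairs first

For the continuous markers, the analogous table uses two-rater ICCs in place of kappa.
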
15.5 Pre vs Post Reliability Shift

Note for Pathologist: The delta table (Pre vs Post ICC/Kappa) is the key output here. A positive delta means agreement improved after AI assistance; a negative delta means it worsened. If the bootstrap confidence interval for the delta does not cross zero, the change is unlikely to be due to chance.
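
A minimal sketch of the delta bootstrap is below, assuming paired wide matrices ki67_pre and ki67_post with the same cases in the same row order; cases are resampled jointly so the pre and post scores stay paired. The names and settings are illustrative, not the report's actual objects.

library(boot)
library(irr)

icc_single <- function(x) {
  icc(x, model = "twoway", type = "agreement", unit = "single")$value
}

delta_stat <- function(cases, idx) {
  # Resample case indices, keeping pre and post paired, then take the ICC difference
  icc_single(ki67_post[cases[idx], ]) - icc_single(ki67_pre[cases[idx], ])
}

set.seed(123)
boot_delta <- boot(seq_len(nrow(ki67_pre)), statistic = delta_stat, R = 2000)
boot.ci(boot_delta, type = "perc")   # if this CI excludes zero, the shift is unlikely to be chance

# For the categorical markers (HER2, subtype), the same pattern applies with
# Fleiss' kappa in place of the ICC.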

15.6 Reporting Plan

  • Present bootstrap CIs for Fleiss’ Kappa (HER2, subtype) and ICC (ER/PR/Ki67) to convey precision.
  • Add pairwise Kappa/ICC tables to highlight which rater pairs benefit most from AI assistance.
  • Include a delta table with Pre vs Post ICC/Kappa and a note on whether the delta CIs cross zero.