1  Introduction

Breast cancer remains one of the most prevalent malignancies among women globally. Accurate pathological evaluation of key biomarkers, such as estrogen receptor (ER), progesterone receptor (PR), HER2, and the Ki-67 proliferation index, is crucial in determining prognosis and informing therapeutic decisions. However, despite the existence of well-established guidelines such as those from the American Society of Clinical Oncology (ASCO) and College of American Pathologists (CAP), significant interobserver variability persists in the interpretation of these immunohistochemistry (IHC) assays. This variability is particularly pronounced in cases of equivocal or low HER2 expression, low-positive hormone receptors (1-10% positivity), and Ki-67 scoring, leading to inconsistencies in molecular subtyping that can potentially impact treatment decisions (Ivanova et al. 2024; Eren et al. 2026).

Recent updates to HER2 testing guidelines (2023 ASCO/CAP) and the emerging clinical significance of HER2-low breast cancer have further emphasized the need for precise and reproducible biomarker assessment (Ivanova et al. 2024). Studies demonstrate considerable concordance challenges, particularly in distinguishing HER2 IHC 0 from 1+ scores, a distinction that has become clinically relevant with the development of HER2-targeted antibody-drug conjugates for HER2-low disease (Eren et al. 2026; Hou, Nitta, and Li 2023). Similarly, Ki-67 assessment remains problematic despite its prognostic and predictive value, with Gown (2023) characterizing it as showing “promise, potential, and problems” due to persistent interobserver variability and a lack of standardization across laboratories.

The implementation of digital pathology has opened new possibilities for artificial intelligence (AI) integration in routine histopathological practice (Niazi, Parwani, and Gurcan 2019; Soliman, Li, and Parwani 2024). Deep learning approaches have demonstrated remarkable capabilities in breast cancer histopathology, including prediction of biomarker status directly from H&E-stained sections and automated grading with improved reproducibility (Shamai et al. 2022; Wang et al. 2022; Chan et al. 2023). Computational pathology methods using weakly supervised deep learning on whole slide images have achieved clinical-grade performance, suggesting feasibility for routine diagnostic implementation (Campanella et al. 2019).

AI-based digital image analysis has shown significant potential in reducing observer-dependent variability and improving reproducibility of biomarker scoring (Baxi et al. 2022; Abele et al. 2023; Liu et al. 2023). In ER and PR evaluation, automated image analysis demonstrates strong correlation with pathologists’ manual assessments and improved inter-laboratory consistency (Shafi et al. 2022). For HER2 interpretation, AI assistance has been reported to enhance diagnostic concordance, especially in challenging HER2-low and equivocal categories (Jung et al. 2024; Wu et al. 2023).

Ki-67 presents particular challenges that make it an important target for AI-assisted quantification. A comprehensive evaluation by Dawe et al. (2024) compared five digital image analysis (DIA) methods for Ki-67 scoring in 278 breast cancer cases, finding that while a deep learning approach (piNET) achieved the highest agreement with manual counts (ICC: 0.850), none of the tested methods reached the clinically desired Cohen’s κ ≥ 0.8 at common diagnostic cutoffs. The study identified tumor heterogeneity, algorithm implementation differences, and image segmentation as primary contributors to variability, emphasizing the need for standardized, fully automated pipelines to achieve robust, reproducible scoring. Other studies have confirmed that AI-assisted Ki-67 quantification significantly improves reproducibility and reduces the time burden of manual assessment (Dy et al. 2024; Acs et al. 2021).
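To make the agreement statistic concrete, the cutoff-based comparison can be sketched in a few lines of Python. The Ki-67 percentages, the two raters, and the 20% cutoff below are purely hypothetical illustrations, not data from Dawe et al. (2024); only the kappa formula itself, κ = (p_o − p_e)/(1 − p_e), is standard.

```python
def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: fraction of cases where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement under independence, from the marginal frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical Ki-67 percentages from a manual count and a DIA method,
# dichotomized at a commonly used 20% cutoff (low vs. high).
manual = [12, 25, 8, 40, 18, 22, 5, 30]
dia    = [15, 28, 6, 35, 23, 19, 4, 33]
calls_m = ["high" if v >= 20 else "low" for v in manual]
calls_d = ["high" if v >= 20 else "low" for v in dia]
print(round(cohens_kappa(calls_m, calls_d), 3))  # → 0.5, well below the 0.8 target
```

Note how two borderline cases (18 vs. 23 and 22 vs. 19) straddle the cutoff and alone pull κ far below 0.8, mirroring the sensitivity of cutoff-based concordance to near-threshold tumors.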

A critical gap in the current literature is the paucity of real-world studies examining how pathologists actually integrate AI recommendations into clinical practice. Most validation studies focus on algorithm performance metrics (sensitivity, specificity, concordance) without examining pathologist behavior or the impact on clinical decision-making (Li et al. 2025; Tan-Garcia, Chua, and Leow 2025). Large-scale studies comparing local versus central laboratory assessments, such as the WSG ADAPTcycle trial (n=5,292) by Hamann et al. (2025), demonstrate ongoing challenges in achieving consistent Ki-67 measurements even without AI, highlighting the need for standardization tools.

Overall, these findings position AI as a promising tool for standardizing IHC biomarker assessment, minimizing diagnostic discrepancies, and improving clinical decision-making in breast cancer (Reis-Filho and Kather 2023; Ahn et al. 2023; Yan, Li, and Wu 2023). However, critical questions remain about real-world implementation, individual pathologist adoption patterns, cases where AI may increase rather than decrease disagreement, and potential systematic biases introduced by automated systems. The present study addresses these gaps by evaluating the impact of AI assistance (Aiforia platform) on interobserver variability, individual pathologist behavior, and clinical decision implications in ER, PR, HER2, and Ki-67 assessments. Using a two-phase design (pre-AI and post-AI evaluations by the same pathologists), we provide a comprehensive analysis of AI’s real-world impact on diagnostic concordance, systematic biases, and treatment-relevant molecular classification in a representative clinical dataset.
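As an illustration of how multi-rater concordance can be quantified in such a two-phase design, the sketch below computes Fleiss’ kappa for hypothetical pre-AI and post-AI HER2 IHC category calls by three readers. All counts are invented for the example and are not study data; the formula is the standard generalization of Cohen’s kappa to a fixed number of raters per case.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for n cases, each rated by the same number of raters.

    `ratings` is a list of dicts mapping category -> number of raters
    assigning that category to the case.
    """
    n_cases = len(ratings)
    n_raters = sum(ratings[0].values())
    cats = set()
    for r in ratings:
        cats |= r.keys()
    # Per-case agreement P_i: fraction of agreeing rater pairs.
    P = [(sum(c * c for c in r.values()) - n_raters)
         / (n_raters * (n_raters - 1)) for r in ratings]
    p_bar = sum(P) / n_cases
    # Chance agreement from overall category proportions.
    p_j = {c: sum(r.get(c, 0) for r in ratings) / (n_cases * n_raters)
           for c in cats}
    p_e = sum(v * v for v in p_j.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical HER2 IHC calls: 3 pathologists x 5 cases, before and
# after AI assistance; each dict maps category -> rater count.
pre  = [{"0": 2, "1+": 1}, {"1+": 3}, {"0": 1, "1+": 2},
        {"1+": 2, "2+": 1}, {"2+": 3}]
post = [{"0": 3}, {"1+": 3}, {"1+": 3}, {"1+": 3}, {"2+": 3}]
print(round(fleiss_kappa(pre), 3), fleiss_kappa(post))  # → 0.338 1.0
```

Comparing the pre-AI and post-AI kappas on the same cases is one simple way to express whether AI assistance moved the readers toward or away from consensus.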