Artificial intelligence (AI) assistance during upper endoscopy did not significantly improve the detection of gastric neoplasms after centralized pathology review in a large multi-center randomized controlled trial, although the system reduced blind spots and increased inspection time.
"Our study is the first to underscore the critical role of rigorous pathologic review in evaluating AI systems," noted Zehua Dong, PhD, of the Department of Gastroenterology at Renmin Hospital of Wuhan University in China, and colleagues.
The study, published in Gastroenterology, enrolled 29,514 patients undergoing esophagogastroduodenoscopy (EGD) at 24 hospitals across 12 provinces in China between December 2021 and November 2023. Patients were randomized to AI-assisted EGD using the ENDOANGEL-GN system or standard EGD without AI assistance. The primary endpoint was the detection rate of gastric neoplasms after centralized endoscopic and pathologic review.
In the intention-to-treat cohort, gastric neoplasms were detected in 1.42% of patients in the AI-assisted group compared with 1.25% in the control group, indicating no statistically significant difference.
When outcomes were assessed using the original pathology reports from participating centers, the AI-assisted group showed a higher detection rate (4.06% vs 3.57%), but this difference did not hold up after centralized pathology review. However, "subgroup analysis suggested potential benefit among less experienced endoscopists and during fatigue periods," wrote Dr. Dong and colleagues.
The trial also found substantial reclassification of dysplasia diagnoses during centralized pathology review. After expert review, 83.58% of lesions initially diagnosed as low-grade intraepithelial neoplasia and 14% of those diagnosed as high-grade intraepithelial neoplasia were reclassified as benign. In contrast, only 0.72% of lesions initially considered non-neoplastic were later reclassified as neoplasms.
Secondary outcomes showed no significant differences between groups in early gastric cancer detection or in detection of intestinal metaplasia and/or gastric atrophy in the intention-to-treat analysis after pathology review.
Quality metrics differed between the groups. The AI-assisted system significantly reduced the number of blind spots during endoscopy, with an average of 1.07 blind spots in the AI group compared with 2.52 in the control group. Procedure and inspection times were longer in the AI-assisted group (7.69 vs 7.33 minutes).
In exploratory subgroup analyses, AI assistance was associated with a higher detection rate of gastric neoplasms among endoscopists with less than three years of experience (1.44% vs 0.78%).
The investigators also evaluated the diagnostic performance of the AI system in lesion recognition. In the experimental group, ENDOANGEL-GN identified 100% of pathologically confirmed gastric adenocarcinomas, 91.9% of high-grade intraepithelial neoplasia, and 57.1% of low-grade intraepithelial neoplasia. False-positive alerts were most commonly associated with chronic inflammation, intestinal metaplasia, gastric atrophy, and other benign mucosal findings.
The authors noted that the study population included a high proportion of tertiary hospitals and experienced endoscopists, which may have influenced the results.
"Although no overall benefit was observed in detecting gastric neoplasms, the findings highlight the critical impact of centralized pathologic review on the evaluation of AI performance. Subgroup analyses suggest that AI may provide advantages in specific clinical scenarios, underscoring the importance of optimizing AI applications in targeted clinical settings," the investigators concluded.
They disclosed having no conflicts of interest.
Expert Insight
GI & Hepatology News invited Mimi Tan, MD, MPH, an assistant professor in the Section of Gastroenterology at Baylor College of Medicine, Houston, Texas, to comment on the study.
This randomized trial found that AI assistance during upper endoscopy did not significantly improve gastric neoplasm detection after centralized pathology review. How should clinicians interpret these findings?
Mimi Tan, MD, MPH
This clinical trial is important for gastroenterologists to consider as we evaluate computer-assisted detection (CADe) models for indications beyond colon polyp detection. However, the findings must be interpreted in the context of several key limitations.
Several CADe systems for gastric cancer and precancer detection have been developed in Asia, and ENDOANGEL is among the most extensively published. Despite the impressive scale of this trial, the primary outcomes deserve scrutiny. Like many other CADe systems, ENDOANGEL was trained on retrospectively collected image and video datasets, often without complete ground-truth histopathology for all patients, and was optimized using expert endoscopist image interpretation as the reference standard. Thus, when AI models are applied to scenarios where the true endpoint requires histopathology (e.g., early gastric cancer or dysplasia), there is an inherent mismatch between what the system was trained to recognize (a visual pattern) and the trial outcome (a tissue diagnosis).
With that being said, I am still optimistic that there is a role for CADe models in upper GI indications. Unlike our Asian counterparts, we, as US endoscopists, have far less clinical exposure to early gastric cancer, dysplasia, and intestinal metaplasia detection in routine practice. Thus, the potential for improvement in endoscopic detection with AI assistance could be higher in our setting. Even if a CADe system fails to move the needle in a high-volume Chinese screening center staffed by expert endoscopists, it may offer meaningful gains in a US practice where recognizing subtle early gastric lesions is not a routine competency.
The study also showed substantial reclassification of dysplasia diagnoses after expert pathology review. What does this reveal about the challenges of evaluating AI systems for cancer detection?
The study demonstrated significant reclassification of dysplasia diagnoses following expert pathology review, which exposes a critical challenge in evaluating AI systems: the reliability of the "ground truth." Similarly, gastric neoplasia detection rates dropped dramatically, from 3.57% to 4.06% as diagnosed by local pathologists to 1.25% to 1.42% when the same slides were interpreted by the central study pathologists. This exposes a foundational vulnerability in the entire CADe development pipeline: if the histopathology labels used as the ground truth have questionable accuracy, the CADe models trained on them will be equally unreliable. Therefore, establishing standardized, centralized expert pathology review is essential for both training and evaluating AI model performance in clinical settings.
Although detection rates were similar, AI assistance reduced blind spots and increased inspection time. Do you view these changes as meaningful improvements in endoscopy quality?
The secondary outcomes showing fewer blind spots and longer inspection times are meaningful improvements in endoscopy quality. These gains may offer specific advantages in certain scenarios: as the subgroup analyses suggested, the CADe model may help bridge the experience gap in gastric neoplasm detection among endoscopists with less than 3 years' experience, or mitigate human error during periods of fatigue.
Do you see a potential role for AI as a training or decision-support tool in endoscopy?
An important conclusion from this trial is the benefit of AI assistance for less experienced endoscopists. CADe for gastric cancer detection has been shown to have comparable performance to expert endoscopists and may even surpass less experienced endoscopists; CADe models could help equalize performance across different skill levels.
Conversely, there are growing concerns about "de-skilling" (clinicians losing diagnostic competencies due to over-reliance on AI) or "never skilling" (never developing diagnostic competencies due to exposure to AI early in training). While AI models can serve as valuable training or decision-support tools, endoscopists must be careful not to become overly dependent on the technology. It is essential to maintain personal diagnostic proficiency even while utilizing AI to improve detection.
Based on these results, what additional evidence or improvements would be needed before AI-assisted systems could be widely adopted for gastric cancer detection?
There is a clear role for CADe models to assist with detection of gastric cancers and precancers in US practice settings where these conditions are not routinely encountered by the gastroenterologist. But the trustworthiness of these CADe models is only as good as their training data. If we want models to accurately detect histopathology-based diseases, the training dataset of images and videos must be paired with undisputed ground truth histopathology. We saw in this trial that a massive discrepancy in ground truth histopathology (original vs revised histopathology) determined whether the primary outcome was positive or negative. Moving forward, we must ensure CADe model training pipelines incorporate rigorous, centralized pathologic review to establish a truly reliable gold standard.