AI score may boost UC trial efficiency


An artificial intelligence-based scoring method for endoscopy was more sensitive in detecting treatment effects and was associated with smaller estimated study sizes in a prospective ulcerative colitis trial, compared with the standard category-based scoring system, according to data presented at Digestive Disease Week® (DDW) 2026 during the AGA Presidential Plenary session.

In a phase 2b study of patients with ulcerative colitis, an AI system detected stronger treatment effects than the Mayo Endoscopic Subscore and was associated with reductions in estimated per-group sample size requirements of up to 47%, while maintaining statistical power. The findings come from an analysis of endoscopy data collected at baseline and week 12 in a randomized trial comparing the interleukin-23 receptor antagonist icotrokinra with placebo.


“Ulcerative colitis clinical trials commonly rely on categorical or binary endoscopic endpoints, such as the Mayo Endoscopic Subscore, which were originally designed for clinical interpretation but offer limited sensitivity to incremental changes in disease severity,” one of the study authors, Pooya Mobadersany, PhD, a Principal Scientist with Artificial Intelligence/Machine Learning & Digital Health at Johnson & Johnson, told GI & Hepatology News. “This can make it challenging to detect treatment effects, particularly in early‑phase or dose‑ranging studies, and may result in larger, more costly trials.”

The AI platform, called ARGES-UC and being developed by Johnson & Johnson, generates a continuous mucosal endoscopic score for the distal colon and a segment-averaged score across the descending colon, sigmoid colon, and rectum. The continuous scoring method is designed to pick up smaller, more subtle changes in the gut lining than the traditional Mayo Endoscopic Subscore, which groups disease severity into set categories.

ARGES-UC combines three parts. First, it uses an AI model trained on endoscopy images from ulcerative colitis and Crohn’s disease studies to learn patterns. Second, another model identifies the lower part of the colon and its different sections. Third, a final component generates continuous scores that aim to match the average assessments made by expert reviewers.

The system was trained and validated on 138.8 million endoscopy frames from 7,672 videos across six clinical trials, including UNIFI, JAKUC, VEGA, ASTRO, SEAVUE, and TRIDENT, encompassing 3,164 patients. External evaluation used data from the ANTHEM-UC trial, which enrolled 252 patients and contributed 623 endoscopy videos.

The researchers compared the change in scores from baseline to week 12 across the treatment and placebo groups using a statistical test, and they also compared each treatment group directly with placebo. To quantify the strength of the treatment effects, they calculated standardized effect sizes, and they estimated how many patients would be needed in each group to have an 80% chance of detecting a real effect.
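To illustrate why a larger standardized effect size translates into fewer patients per group, here is a minimal sketch of a textbook sample size calculation. It is not the study's method: the article does not say which test or significance level was used, so the two-sided alpha of 0.05, the equal group sizes, the normal approximation, and the function name `sample_size_per_group` are all assumptions, and the effect sizes in the loop are hypothetical values rather than results from the trial.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-group comparison of means.

    Assumptions (not taken from the study): two-sided test, equal group
    sizes, normal approximation, and effect_size expressed as a
    standardized mean difference.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, chosen only to show the trend.
    for d in (0.5, 0.6, 0.7):
        print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

Because the required sample size under this approximation scales with the inverse square of the effect size, even a modest gain in measurement sensitivity can noticeably shrink a trial.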

Across dose groups, the continuous metrics produced larger effect sizes than the categorical Mayo Endoscopic Subscore, suggesting greater sensitivity to treatment-related changes in mucosal appearance. For the 200-mg dose, the distal colon continuous score reduced the estimated sample size per group by 47%, from 64 patients to 34. The segment-averaged score reduced the sample size by 30%, from 64 to 45 patients.

At the 400-mg dose, the distal colon score reduced required sample size by 35%, from 60 to 39 patients, while the segment-averaged score reduced sample size by 20%, from 60 to 48 patients. These reductions reflect the larger observed effect sizes when using continuous scoring compared with the categorical approach.
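The percentage reductions quoted for both doses follow directly from the reported per-group sample sizes. The short check below reproduces that arithmetic; the per-group numbers come from the article, while the helper name `percent_reduction` and the dictionary layout are illustrative choices.

```python
def percent_reduction(baseline_n, new_n):
    """Percentage drop in per-group sample size relative to baseline_n."""
    return 100 * (baseline_n - new_n) / baseline_n


# Per-group sample sizes as reported: (Mayo Endoscopic Subscore, continuous score)
reported = {
    ("200 mg", "distal colon"): (64, 34),
    ("200 mg", "segment-averaged"): (64, 45),
    ("400 mg", "distal colon"): (60, 39),
    ("400 mg", "segment-averaged"): (60, 48),
}

if __name__ == "__main__":
    for (dose, score), (mayo_n, ai_n) in reported.items():
        drop = percent_reduction(mayo_n, ai_n)
        print(f"{dose}, {score}: {mayo_n} -> {ai_n} ({drop:.1f}% fewer)")
```

The unrounded values sit within half a percentage point of the rounded figures given in the text.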

According to Dr. Mobadersany, a key finding was that the continuous AI-based measures consistently showed stronger standardized treatment effects at week 12 than the traditional category-based score. This was seen across both dose levels and across different ways of assessing disease activity in the lower part of the colon. “The gains in standardized effect size were not isolated to a single comparison, but were observed systematically,” he said. “We were also encouraged that these results emerged from a fully prospective clinical trial, rather than a retrospective dataset. Seeing this level of consistency in a real‑world trial setting suggested that the increased sensitivity was not driven by post‑hoc modeling choices but reflected genuine differences in how disease activity was captured.”

When asked about the potential impact on clinical practice, he said these findings are more likely to influence how clinical trials are designed in the near term, rather than change everyday patient care. “Many pivotal UC trials are powered around binary remission or response endpoints, which by definition require a certain number of patients to cross a predefined threshold in order to demonstrate success,” he explained. “Our findings do not change that requirement.”

However, he continued, for trials or analyses that rely on continuous endoscopic endpoints, such as early‑phase studies, dose‑finding trials, or key secondary and exploratory endpoints, “increased sensitivity could meaningfully improve efficiency,” Dr. Mobadersany said. “More sensitive measurement of endoscopic change may reduce the number of patients required to detect treatment differences, support better dose selection, and enable faster, more informed development decisions. Over time, as these approaches are further validated, they may complement existing endoscopic assessments by providing more reproducible and granular measures of disease activity.”

He emphasized that he and his team view this study as one piece of a larger effort to improve how data from endoscopies are measured and used in clinical trials for inflammatory bowel disease. “The goal is not to replace expert readers or established clinical endpoints, but to augment them with continuous, objective measurements that are better aligned with statistical analysis and trial efficiency,” he said.

The study did not report subgroup analyses by baseline disease severity or extent, and results were limited to distal colon assessment.

Dr. Mobadersany and his coauthors are employees of Johnson & Johnson.

DDW is AGA’s annual meeting, jointly sponsored by AGA, AASLD, ASGE, and SSAT. Learn more at ddw.org.