Journal of Endodontics, 2026 (SCI-Expanded, Scopus)
Introduction: Artificial intelligence (AI) systems are increasingly used in dental radiology to support endodontic diagnosis. However, their diagnostic reliability across different clinical categories remains unclear. This study compared 3 vision–language AI models (ChatGPT-5 Plus, Gemini 2.5 Pro, and Copilot Pro) with expert endodontists by assessing sensitivity, specificity, overall diagnostic agreement, and Youden's Index across multiple endodontic conditions.

Methods: This retrospective diagnostic accuracy study assessed how each model interpreted periapical radiographs with respect to treatment decisions, procedural complications, and lesion detection. Assessments by expert endodontists served as the reference standard. Diagnostic categories included primary treatment selection, nonsurgical retreatment, final treatment decisions, perforation, underfilling, overfilling, broken file, calcification, and periapical lesion detection.

Results: Agreement between the endodontists was almost perfect (κ = 0.95). Gemini 2.5 Pro demonstrated the highest diagnostic accuracy, particularly in periapical lesion detection (sensitivity 100%, specificity 88%), while ChatGPT-5 Plus showed similarly strong performance in treatment selection. Copilot Pro exhibited markedly low sensitivity for complications such as perforation and instrument fracture. Kappa values for preoperative and postoperative treatment decisions were high for Gemini 2.5 Pro and ChatGPT-5 Plus, but low for Copilot Pro. The Friedman test confirmed significant differences among the groups (P < .001).

Conclusions: AI systems demonstrated promising diagnostic accuracy in treatment selection and lesion detection, but performed less reliably in identifying complex procedural complications. Gemini 2.5 Pro showed the most balanced performance, whereas Copilot Pro displayed the highest variability across diagnostic categories.
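
The agreement metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes binary labels (1 = finding present, 0 = absent), assumes that κ refers to Cohen's kappa (the abstract does not say which variant was used), and uses hypothetical function names and invented example labels.

```python
def sensitivity_specificity(reference, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = finding present)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    # Assumes at least one positive and one negative case in the reference.
    return tp / (tp + fn), tn / (tn + fp)


def youden_index(sensitivity, specificity):
    """Youden's Index: J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


# Hypothetical labels for illustration only; not data from the study.
reference = [1, 1, 1, 0, 0, 0, 0, 0]  # expert reference standard
predicted = [1, 1, 0, 0, 0, 0, 0, 1]  # one model's output

sens, spec = sensitivity_specificity(reference, predicted)
j = youden_index(sens, spec)
kappa = cohen_kappa(reference, predicted)
```

Applied to the values reported for Gemini 2.5 Pro in periapical lesion detection (sensitivity 100%, specificity 88%), the same formula gives J = 1.00 + 0.88 - 1 = 0.88.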