AUSTRALIAN ENDODONTIC JOURNAL, vol.51, pp.732-739, 2025 (SCI-Expanded, Scopus)
This study aims to compare the accuracy of four AI chatbots (Gemini 1.5 Flash, Gemini 1.5 Pro, ChatGPT-3.5 and ChatGPT-4) in answering endodontic questions and supporting clinicians. Forty yes/no questions covering 12 endodontic topics were formulated by three experts. Each question was presented to the AI models on the same day, with a new chat session initiated for each. The agreement between chatbot responses and expert consensus was assessed using Cohen's kappa (significance threshold p < 0.05). ChatGPT-3.5 demonstrated the highest accuracy (80%), followed by ChatGPT-4 (77.5%), Gemini 1.5 Pro (72.5%) and Gemini 1.5 Flash (60%). The agreement levels ranged from weak (ChatGPT models) to minimal (Gemini 1.5 Flash). The findings indicate variability in chatbot performance, with the ChatGPT models outperforming the Gemini models. However, reliance on AI-generated responses for clinical decision-making remains questionable. Future studies should incorporate more complex clinical scenarios and broader analytical approaches to enhance the assessment of AI chatbots in endodontics.
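The abstract reports two measures, raw accuracy and Cohen's kappa, and they can diverge because kappa discounts the agreement expected by chance on yes/no items. The following is a minimal sketch of how both might be computed for one chatbot against an expert consensus. The function names and the ten-item answer lists are hypothetical illustrations, not the study's data or the authors' code.

```python
def accuracy(reference, predicted):
    """Share of items on which the predicted label matches the reference."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater_a)
    p_o = accuracy(rater_a, rater_b)  # observed agreement
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: product of each rater's marginal rate, per label.
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n)
              for lab in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy answers (NOT taken from the study).
expert = ["yes", "yes", "yes", "no", "no", "yes", "no", "no", "yes", "no"]
chatbot = ["yes", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "yes"]

acc = accuracy(expert, chatbot)
kappa = cohen_kappa(expert, chatbot)
print(f"accuracy={acc:.2f}, kappa={kappa:.2f}")
```

In this toy case the chatbot matches the expert on 7 of 10 items, yet half of that agreement is expected by chance given the label frequencies, so kappa is considerably lower than the accuracy figure. This is one reason a model with 80% accuracy can still show only weak agreement.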