The quality of artificial intelligence chatbot responses to patient questions on surgical infection prevention


TÜRE YÜCE Z., MARAŞ BAYDOĞAN G., Çetinkaya F. İ., Tarkan M., Eryılmaz Eren E., ULU KILIÇ A.

American Journal of Infection Control, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI: 10.1016/j.ajic.2026.03.004
  • Journal Name: American Journal of Infection Control
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, CINAHL, EMBASE, MEDLINE
  • Keywords: Large language models, Patient education, Quality assessment, Readability, Surgical site infection
  • Erciyes University Affiliated: Yes

Abstract

Background: As patients increasingly turn to artificial intelligence (AI) chatbots for medical information, concerns remain regarding the accuracy, transparency, and readability of these tools. This study aimed to comparatively assess the quality, reliability, understandability, actionability, and readability of surgical site infection (SSI)-related responses produced by widely used AI chatbots.

Methods: A cross-sectional design was used to evaluate 6 AI chatbots (ChatGPT-5o, Gemini 2.5 Pro, Gemini 2.5 Flash, DeepSeek, Grok-1.5, Perplexity). Five patient-centered SSI questions were developed through a Delphi method and directed to each chatbot. A multidisciplinary panel of 5 blinded experts rated responses using DISCERN, QUEST, PEMAT-P, and the Web Resource Rating. Readability was assessed using the Simple Measure of Gobbledygook, Flesch Reading Ease, and Ateşman formulas. Inter-rater reliability was calculated using the intraclass correlation coefficient.

Results: No single chatbot excelled across all domains. ChatGPT-5o achieved the highest quality scores (DISCERN), while DeepSeek showed the highest accuracy (QUEST). Gemini 2.5 Pro demonstrated the best understandability; however, actionability was low across all platforms. Transparency was a major weakness: all chatbots scored poorly on the Web Resource Rating, with ChatGPT-5o performing best yet still rated a low-quality source. Reading levels were generally high, with most responses requiring high-school to university literacy (Simple Measure of Gobbledygook 11.8-13.9).

Conclusions: Current AI chatbots are not sufficiently reliable as primary educational tools for SSI prevention. Despite strengths in quality and clarity, shortcomings in transparency and readability limit safe patient use.