A Hybrid Deep Learning Approach for Improving Medical Image Captioning Using Convolutional Vision Transformer


Ramadan W., Akay B.

2nd International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2025, Antalya, Türkiye, 7-9 August 2025 (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • DOI Number: 10.1109/acdsa65407.2025.11166431
  • City of Publication: Antalya
  • Country of Publication: Türkiye
  • Keywords: Convolutional vision transformer, Hybrid deep learning, Medical image captioning, Multimodal learning
  • Erciyes University Affiliated: Yes

Abstract

The utilization of deep learning techniques, particularly computer vision (CV) and natural language processing, has led to significant advancements in the analysis of large-scale medical data, enhancing diagnosis, treatment planning, and management efficiency. In this context, medical image captioning (MIC) has emerged as a critical research area aimed at the automatic generation of clinically accurate reports from medical images. While convolutional neural networks (CNNs) and language transformers have been widely used in MIC models on the encoder and decoder sides, respectively, the adoption of vision transformers (ViTs) for visual feature extraction remains limited. Moreover, existing MIC studies have largely overlooked hybrid architectures that introduce convolutions into the vision transformer. This study proposes the integration of the convolutional vision transformer (CvT) into MIC tasks, combining the local feature extraction strength of convolutions with the global context modeling capability of transformers. The objective is to evaluate the effectiveness of CvT in improving the quality and clinical relevance of generated medical reports through its multimodal compatibility and scalability. To the best of our knowledge, this represents one of the first attempts to explore convolution-transformer hybrid methods for medical image captioning. The proposed approach is evaluated on two public chest X-ray benchmark datasets, IU X-Ray and MIMIC-CXR, using natural language processing and clinical efficacy (CE) metrics. The results demonstrate significantly improved CE scores, with 7.9% and 8.3% increases in F1-score and Precision, respectively, on MIMIC-CXR, indicating better clinical feature representation from X-ray images.
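For illustration, the sketch below shows the general architecture the abstract describes: a CvT-style hybrid encoder (convolutional token embeddings followed by transformer blocks, arranged in downsampling stages) feeding a standard transformer decoder that generates the report. This is a minimal PyTorch sketch under assumed hyperparameters (stage widths, depths, vocabulary size, and the `CvTCaptioner` name are all illustrative); the convolutional projections that CvT also places inside its attention layers are omitted for brevity, and nothing here reflects the authors' actual implementation.

```python
# Illustrative sketch only: a CvT-style encoder + transformer caption decoder.
# All module names and hyperparameters are assumptions, not the paper's setup.
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolutional token embedding, in place of the
    non-overlapping patch split used by a plain ViT."""
    def __init__(self, in_ch, dim, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=3, stride=stride, padding=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H', W')
        b, d, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H'*W', dim) token sequence
        return self.norm(x), h, w

class CvTStage(nn.Module):
    """One hybrid stage: convolutional embedding, then transformer blocks.
    No explicit positional embeddings: the convolutions encode position."""
    def __init__(self, in_ch, dim, depth, heads, stride):
        super().__init__()
        self.embed = ConvTokenEmbedding(in_ch, dim, stride)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens, h, w = self.embed(x)
        tokens = self.blocks(tokens)
        # Fold tokens back into a feature map so the next stage can convolve it.
        return tokens.transpose(1, 2).reshape(x.size(0), -1, h, w)

class CvTCaptioner(nn.Module):
    def __init__(self, vocab_size, dim=384, max_len=512):
        super().__init__()
        self.encoder = nn.Sequential(          # three downsampling stages
            CvTStage(1, 64, depth=1, heads=1, stride=4),
            CvTStage(64, 192, depth=2, heads=3, stride=2),
            CvTStage(192, dim, depth=6, heads=6, stride=2),
        )
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)       # decoder positions
        dec_layer = nn.TransformerDecoderLayer(dim, 6, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, image, report_ids):
        feats = self.encoder(image)                       # (B, dim, h, w)
        memory = feats.flatten(2).transpose(1, 2)         # (B, h*w, dim)
        pos = torch.arange(report_ids.size(1), device=report_ids.device)
        tgt = self.tok_embed(report_ids) + self.pos_embed(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(
            report_ids.size(1)).to(report_ids.device)     # causal decoding
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                          # next-token logits

model = CvTCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 1, 224, 224),               # grayscale X-rays
               torch.randint(0, 10000, (2, 40)))          # report token ids
print(logits.shape)                                       # (2, 40, 10000)
```

Folding tokens back into a feature map between stages is what lets each stage's convolutional embedding downsample the sequence, giving the encoder the hierarchical, CNN-like resolution schedule that distinguishes a convolution-transformer hybrid from a plain ViT.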