Computational Biology and Chemistry, vol. 120, 2026 (SCI-Expanded, Scopus)
Recent years have witnessed a notable rise in skin cancer cases, heightening the urgent need for accurate and timely diagnosis to ensure successful treatment and improve patient outcomes. Traditional diagnostic techniques, such as visual inspection, often involve subjective assessments that can lead to errors. As a result, deep learning algorithms have emerged as powerful and reliable alternatives for improving the accuracy and effectiveness of skin cancer identification. In this study, we conducted a thorough comparison of Convolutional Neural Network (CNN) and Vision Transformer (ViT) models, examining performance factors such as parameter count, computational cost, and overall efficiency. We evaluated 15 advanced CNN models and 15 ViT models on the publicly available HAM10000 and ISIC 2019 datasets under identical training conditions. Our findings revealed that ViT models, particularly those based on the Swin architecture, outperformed CNN models, achieving an accuracy of 0.9212 on HAM10000 and 0.9187 on ISIC 2019. However, these models have significantly more parameters and require more computational resources than CNN models, which are less demanding owing to their simpler convolutional operations. This trade-off highlights the increased memory usage and processing time of ViTs, caused by their self-attention mechanisms, which are effective at capturing long-range dependencies in data. Despite their higher computational demands, ViTs offer superior accuracy in skin cancer classification compared to CNNs. These results indicate a need for future work to improve the computational efficiency of ViT models or to develop hybrid architectures that combine the advantages of both techniques for clinical applications.
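The cost asymmetry described above can be illustrated with a back-of-envelope FLOP estimate: self-attention cost grows quadratically with the number of image tokens, while a convolution's cost is linear in the number of spatial positions. The sketch below is illustrative only and is not taken from the study; the layer dimensions (a ViT-Base-like attention layer and an early ResNet-like convolution) are assumed for the example.

```python
# Illustrative FLOP estimates for one self-attention layer vs. one
# convolutional layer. All dimensions are assumptions for this sketch,
# not configurations used in the study.

def attention_flops(n_tokens: int, dim: int) -> int:
    """Approximate multiply-add count of one self-attention layer:
    QKV + output projections (~4*n*d^2) plus attention scores and the
    weighted sum over values (~2*n^2*d). Quadratic in token count n."""
    return 4 * n_tokens * dim ** 2 + 2 * n_tokens ** 2 * dim

def conv_flops(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Approximate multiply-add count of one k x k convolution over an
    h x w feature map. Linear in the number of spatial positions."""
    return h * w * c_in * c_out * k * k

# A 224x224 input split into 16x16 patches yields 14*14 = 196 tokens.
vit_layer = attention_flops(196, 768)      # ViT-Base-like layer (assumed)
cnn_layer = conv_flops(56, 56, 64, 64, 3)  # early ResNet-like layer (assumed)
print(f"attention layer: {vit_layer:,} FLOPs")
print(f"conv layer:      {cnn_layer:,} FLOPs")
```

Doubling the input resolution quadruples the conv layer's cost but increases the quadratic attention term roughly sixteenfold, which is the mechanism behind the memory- and time-usage gap the study observes.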