Skin cancer is a serious health problem that, if not diagnosed correctly at an early stage, leads to disease progression, an increased risk of metastasis, and a reduced quality of life. It is therefore critical to develop accurate and efficient rapid-diagnosis methods that assist physicians and enable autonomous diagnosis. While deep learning-based methods generally rely on convolutional neural networks (CNNs) or vision transformers (ViTs), hybrid approaches remain limited. In this work, we propose a novel model that combines the strengths of both CNN and ViT architectures: CNN components extract local patterns and fine-grained textural features, while ViT components capture long-range relationships between skin lesion regions. First, ConvNeXt blocks extract detailed local features such as texture, edges, and color transitions; transformer blocks then model the contextual relationships and global structure of these features, mitigating the effects of class imbalance and improving overall classification performance. For a fair evaluation, the model is tested on two widely used skin cancer datasets, HAM10000 and ISIC 2019, and compared with 10 CNN-based and 10 ViT-based models trained under identical conditions. The proposed model achieves 94.30% accuracy and a 91.11% F1-score on HAM10000, and 92.50% accuracy and a 90.38% F1-score on ISIC 2019, outperforming more than 20 deep learning models as well as prior results reported in the literature. These findings indicate that integrating local and global features plays a critical role in accurately classifying skin lesions and can significantly assist dermatologists in clinical practice.
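The ConvNeXt-then-transformer pipeline described above can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class names (`ConvNeXtBlock`, `HybridCNNViT`), the depths, widths, and head counts are all assumptions chosen for brevity (HAM10000 has 7 lesion classes, hence `num_classes=7`).

```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    """ConvNeXt-style block: depthwise 7x7 conv + pointwise MLP, with residual."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to (B, C, H, W)
        return residual + x


class HybridCNNViT(nn.Module):
    """Hypothetical hybrid: ConvNeXt blocks for local texture/edge features,
    transformer encoder blocks for global context, then a linear classifier."""

    def __init__(self, num_classes=7, dim=64, depth_cnn=2, depth_vit=2):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=4, stride=4)  # patchify stem
        self.cnn = nn.Sequential(*[ConvNeXtBlock(dim) for _ in range(depth_cnn)])
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=4 * dim, batch_first=True
        )
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=depth_vit)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.cnn(self.stem(x))             # local patterns: texture, edges
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.vit(tokens)              # long-range relations via attention
        return self.head(tokens.mean(dim=1))   # global average pool + classifier


model = HybridCNNViT(num_classes=7)
logits = model(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 7])
```

The key design point is the hand-off: the CNN stage keeps the 2D feature map so convolutions can exploit locality, and only then is the map flattened into a token sequence so self-attention can relate distant lesion regions.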