Evaluating the Impact of Distance Metrics on K-means Clustering Performance with Geospatial Data


Ay M., ÖZBAKIR L.

3rd International Conference on Data, Electronics and Computing, ICDEC 2024, Kayseri, Türkiye, 18-20 September 2024, vol. 1455 LNNS, pp. 343-351 (Full-Text Paper)

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume: 1455 LNNS
  • DOI: 10.1007/978-981-96-7742-9_24
  • City of Publication: Kayseri
  • Country of Publication: Türkiye
  • Pages: pp. 343-351
  • Keywords: Clustering, Euclidean, K-means, Machine learning, Vincenty
  • Affiliated with Erciyes University: Yes

Abstract

Clustering algorithms are fundamental in data mining, machine learning, and statistical analysis, and K-means is one of the most popular because of its simplicity and efficiency. This study examines how the choice of distance metric (Euclidean or Vincenty) affects the performance of the K-means clustering algorithm when it is applied to geographical data. Euclidean distance is widely used because it is straightforward to compute, but it may not be ideal for clustering geographical locations because it ignores the Earth’s curvature. Vincenty’s formula, which accounts for the Earth’s ellipsoidal shape, offers a potentially more accurate alternative. We conducted experiments on a dataset of 69,902 geographical points and evaluated clustering performance using the Sum of Squared Errors (SSE), the Davies-Bouldin Index (DBI), and the Silhouette Index. Our results indicate that although Vincenty’s formula theoretically provides more precise distance measurements because it models the Earth’s shape, the practical differences in clustering performance between Euclidean and Vincenty distances are minimal. Statistical analysis using the Mann–Whitney U test showed that the observed differences were not statistically significant.
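
To make the two metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the WGS-84 ellipsoid for Vincenty's inverse formula and assumes that the Euclidean distance is taken directly on latitude/longitude values in degrees; the abstract states neither choice. The function names `vincenty_m`, `euclidean_deg`, and `assign_nearest` are hypothetical, and `assign_nearest` shows only the assignment step of K-means with a pluggable distance.

```python
import math

# WGS-84 ellipsoid parameters (an assumption; the abstract does not name an ellipsoid).
A_AXIS = 6378137.0             # semi-major axis in metres
FLAT = 1 / 298.257223563       # flattening
B_AXIS = (1 - FLAT) * A_AXIS   # semi-minor axis in metres


def vincenty_m(p, q, tol=1e-12, max_iter=200):
    """Geodesic distance in metres between (lat, lon) points given in degrees,
    using Vincenty's inverse formula."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    u1 = math.atan((1 - FLAT) * math.tan(lat1))  # reduced latitudes
    u2 = math.atan((1 - FLAT) * math.tan(lat2))
    su1, cu1, su2, cu2 = math.sin(u1), math.cos(u1), math.sin(u2), math.cos(u2)
    big_l = lon2 - lon1
    lam = big_l
    for _ in range(max_iter):
        sl, cl = math.sin(lam), math.cos(lam)
        sin_s = math.hypot(cu2 * sl, cu1 * su2 - su1 * cu2 * cl)
        if sin_s == 0:
            return 0.0  # coincident points
        cos_s = su1 * su2 + cu1 * cu2 * cl
        sigma = math.atan2(sin_s, cos_s)
        sin_a = cu1 * cu2 * sl / sin_s
        cos2_a = 1 - sin_a ** 2
        # cos(2*sigma_m) is set to 0 for equatorial lines, where cos2_a == 0.
        c2sm = cos_s - 2 * su1 * su2 / cos2_a if cos2_a != 0 else 0.0
        c = FLAT / 16 * cos2_a * (4 + FLAT * (4 - 3 * cos2_a))
        lam_new = big_l + (1 - c) * FLAT * sin_a * (
            sigma + c * sin_s * (c2sm + c * cos_s * (-1 + 2 * c2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        # The iteration can fail for nearly antipodal points.
        raise ValueError("Vincenty iteration did not converge")
    u_sq = cos2_a * (A_AXIS ** 2 - B_AXIS ** 2) / B_AXIS ** 2
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    d_sigma = big_b * sin_s * (c2sm + big_b / 4 * (
        cos_s * (-1 + 2 * c2sm ** 2)
        - big_b / 6 * c2sm * (-3 + 4 * sin_s ** 2) * (-3 + 4 * c2sm ** 2)))
    return B_AXIS * big_a * (sigma - d_sigma)


def euclidean_deg(p, q):
    """Straight-line distance on raw (lat, lon) values, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def assign_nearest(points, centroids, dist):
    """K-means assignment step: index of the closest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: dist(pt, centroids[k]))
            for pt in points]


if __name__ == "__main__":
    # Illustrative coordinates only, not taken from the paper's dataset.
    centroids = [(38.7, 35.5), (41.0, 29.0)]
    points = [(38.8, 35.4), (40.9, 29.1), (39.9, 32.8)]
    print(assign_nearest(points, centroids, euclidean_deg))
    print(assign_nearest(points, centroids, vincenty_m))
```

Passing the distance function as an argument keeps the assignment step identical under both metrics, so any difference in cluster labels comes from the metric alone. A full comparison would also need a centroid-update rule suited to each metric, which this sketch leaves out.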