A flexible non-monotonic discretization method for pre-processing in supervised learning


Şenozan H., SOYLU B.

Pattern Recognition Letters, vol.181, pp.77-85, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 181
  • Publication Date: 2024
  • DOI: 10.1016/j.patrec.2024.03.024
  • Journal Name: Pattern Recognition Letters
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
  • Page Numbers: pp.77-85
  • Keywords: Binarization, Classification, Machine learning, Monotonicity, Pre-processing
  • Affiliated with Erciyes University: Yes

Abstract

Discretization is an important pre-processing step in supervised learning. Discretizing attributes reduces the number of distinct values, which simplifies the data and makes it easier to understand and analyze. It can provide a better representation of knowledge and thus help improve the accuracy of a classifier. However, to minimize information loss, it is important to consider the characteristics of the data. Most approaches assume that the values of a continuous attribute are monotone with respect to the probability of belonging to a particular class. In other words, they assume that increasing or decreasing the value of the attribute leads to a proportional increase or decrease in the classification score. This assumption may not hold for every attribute. In this study, we present entropy-based, flexible discretization strategies capable of capturing the non-monotonicity of the attribute values. The algorithm can adjust the number and the values of the cut points depending on the characteristics of the data. It does not require setting any hyper-parameter or threshold. Extensive experiments on different datasets have shown that the proposed discretizers significantly improve the performance of classifiers, especially on complex and high-dimensional datasets.
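
To make the monotonicity issue concrete, the following is a minimal Python sketch of generic entropy-based cut-point selection on a toy attribute whose positive class sits in the middle of the value range. It is not the algorithm proposed in the paper: the function names, the toy data, the exhaustive search, and the fixed number of cuts `k` are all assumptions made for illustration, whereas the abstract states that the proposed method chooses the number of cut points itself.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def partition_entropy(values, labels, cuts):
    """Weighted class entropy of the bins induced by the given cut points."""
    bins = {}
    for v, y in zip(values, labels):
        bin_index = sum(v > c for c in cuts)  # number of cuts lying below v
        bins.setdefault(bin_index, []).append(y)
    n = len(labels)
    return sum(len(b) / n * entropy(b) for b in bins.values())


def candidate_cuts(values):
    """Midpoints between consecutive distinct attribute values."""
    u = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(u, u[1:])]


def best_cuts(values, labels, k):
    """Exhaustively pick the k cut points with minimal weighted entropy.

    Illustrative only: k is fixed by the caller here (an assumption),
    and the search is exponential in k.
    """
    return min(
        combinations(candidate_cuts(values), k),
        key=lambda cuts: partition_entropy(values, labels, cuts),
    )


# Hypothetical toy attribute: class 1 occurs only for mid-range values,
# so class membership is not monotone in the attribute value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0]

one_cut = best_cuts(values, labels, 1)
two_cuts = best_cuts(values, labels, 2)

print(one_cut, partition_entropy(values, labels, one_cut))
print(two_cuts, partition_entropy(values, labels, two_cuts))
```

On this toy data a single threshold always leaves one mixed bin, while two cut points isolate the middle interval and separate the classes completely. A single-threshold binarization corresponds to the monotone assumption described above, which is why a discretizer that can place several cut points may lose less class information on such attributes.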