Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

GÜRCAN, FATİH; Soylu, Ahmet

doi:10.3390/cancers16193417

GÜRCAN F., Soylu A.

CANCERS, vol. 16, no. 19, 2024 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 16, Issue: 19
Publication Date: 2024
DOI: 10.3390/cancers16193417
Journal Name: CANCERS
Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, BIOSIS, CAB Abstracts, CINAHL, EMBASE, Veterinary Science Database, Directory of Open Access Journals
Keywords: cancer diagnosis and prognosis, class imbalance, machine learning, predictive modeling, random forest, resampling techniques
Affiliated with Karadeniz Technical University: Yes

Abstract

Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors evaluate resampling methods that balance the data and enhance the performance of the classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, the study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.

Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were used for comparison.

Results: Hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). Among the classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.

Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
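To make the SMOTEENN idea concrete, the following is a minimal from-scratch sketch of hybrid resampling in Python: SMOTE-style oversampling of the minority class followed by Edited Nearest Neighbours (ENN) cleaning. It is not the authors' implementation. The toy two-dimensional data, the helper names (`knn`, `smote`, `enn`), the choice of `k=3`, and the majority-vote ENN rule are all assumptions made for illustration. In practice a maintained library such as imbalanced-learn, which provides a `SMOTEENN` class, would normally be used instead.

```python
import random
from collections import Counter


def knn(point, pool, k):
    """Indices of the k rows in `pool` closest to `point` (squared Euclidean)."""
    def dist(q):
        return sum((a - b) ** 2 for a, b in zip(point, q))
    return sorted(range(len(pool)), key=lambda i: dist(pool[i]))[:k]


def smote(X, y, minority, k=3, rng=None):
    """Oversample `minority` up to the largest class size by interpolating
    between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    mino = [x for x, lab in zip(X, y) if lab == minority]
    need = max(Counter(y).values()) - len(mino)
    X_new, y_new = list(X), list(y)
    for _ in range(need):
        base = rng.choice(mino)
        # Drop the first neighbour: it is the base point itself (distance 0).
        neigh = [mino[i] for i in knn(base, mino, k + 1)[1:]]
        other = rng.choice(neigh)
        gap = rng.random()  # position along the segment base -> other
        X_new.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their
    k nearest neighbours (applied to every class)."""
    keep = []
    for i, (x, lab) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        neigh = knn(x, [X[j] for j in others], k)
        votes = Counter(y[others[j]] for j in neigh)
        if votes.most_common(1)[0][0] == lab:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical imbalanced toy data: 40 majority points, 8 minority points.
data_rng = random.Random(0)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)] + \
    [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_sm, y_sm = smote(X, y, minority=1, rng=random.Random(1))  # oversample
X_res, y_res = enn(X_sm, y_sm)                              # then clean

print("original:  ", Counter(y))
print("after SMOTE:", Counter(y_sm))
print("after ENN:  ", Counter(y_res))
```

The oversampling step balances the class counts, and the cleaning step then removes samples of either class that sit in a neighbourhood dominated by the other class. As in any evaluation with resampling, the resampler should be fit on the training split only, so that synthetic points do not leak into the test data.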