Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis


GÜRCAN F., Soylu A.

Cancers, vol.16, no.19, 2024 (SCI-Expanded) identifier identifier

  • Publication Type: Article / Article
  • Volume: 16 Issue: 19
  • Publication Date: 2024
  • Doi Number: 10.3390/cancers16193417
  • Journal Name: Cancers
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, BIOSIS, CAB Abstracts, CINAHL, EMBASE, Veterinary Science Database, Directory of Open Access Journals
  • Keywords: cancer diagnosis and prognosis, class imbalance, machine learning, predictive modeling, random forest, resampling techniques
  • Karadeniz Technical University Affiliated: Yes

Abstract

Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance. Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison. Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.