Pareto optimal-based feature selection framework for biomarker identification

Ogutcen, Omer Faruk (2020) Pareto optimal-based feature selection framework for biomarker identification. Doctoral thesis, Northumbria University.

Text (Doctoral Thesis)
ogutcen.omer_phd.pdf - Submitted Version

Download (1MB) | Preview


Numerous computational techniques have been applied to identify the vital features of gene expression datasets in aiming to increase the efficiency of biomedical applications. The classification of microarray data samples is an important task to correctly recognise diseases by identifying small but clinically meaningful genes. However, identification of disease representative genes or biomarkers in high dimensional microarray gene-expression datasets remains a challenging task. This thesis investigates the viability of Pareto optimisation in identifying relevant subsets of biomarkers in high-dimensional microarray datasets. A robust Pareto Optimal based feature selection framework for biomarker discovery is then proposed. First, a two-stage feature selection approach using ensemble filter methods and Pareto Optimality is proposed. The integration of the multi-objective approach employing Pareto Optimality starts with well-known filter methods applied to various microarray gene-expression datasets. Although filter methods provide ranked lists of features, they do not give information about optimum subsets of features, which are namely genes in this study. To address this limitation, the Pareto Optimality is incorporated along with filter methods. The robustness of the proposed framework is successfully demonstrated on several well-known microarray gene expression datasets and it is shown to achieve comparable or up to 100% predictive accuracy with comparatively fewer features. Better performance results are obtained in comparison with other approaches, which are single-objective approaches. Furthermore, cross-validation and k-fold approaches are integrated into the framework, which can enhance the over-fitting problem and the gene selection process is subsequently more accurate under various conditions. Then the proposed framework is developed in several phases. The Sequential Forward Selection method (SFS) is first used to represent wrapper techniques, and the developed Pareto Optimality based framework is applied multiple times and tested on different data types. Given the nature of most real-life data, imbalanced classes are examined using the proposed framework. The classifier achieves high performance at a similar level of different cases using the proposed Pareto Optimal based feature selection framework, which has a novel structure for imbalanced classes. Comparable or better gene subset sizes are obtained using the proposed framework. Finally, handling missing data within the proposed framework is investigated and it is demonstrated that different data imputation methods can also help in the effective integration of various feature selection methods.

Item Type: Thesis (Doctoral)
Uncontrolled Keywords: Multi Criteria Decision Framework, Aggregated feature selection, Imbalanced Data, Missing Values, Ensemble Feature discovery
Subjects: B800 Medical Technology
G900 Others in Mathematical and Computing Sciences
Department: Faculties > Engineering and Environment > Computer and Information Sciences
University Services > Graduate School > Doctor of Philosophy
Depositing User: John Coen
Date Deposited: 04 Dec 2020 09:53
Last Modified: 31 Jul 2021 14:02

Actions (login required)

View Item View Item


Downloads per month over past year

View more statistics