Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain

Sarac, Ferdi (2017) Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain. Doctoral thesis, Northumbria University.

sarac.ferdi_phd.pdf - Submitted Version

Abstract

In line with technological developments, there is almost no limit to the collection of high dimensional data in various fields, including bioinformatics. In most cases, these high dimensional datasets contain many irrelevant or noisy features, which need to be filtered out to find a small but biologically meaningful set of attributes. Although there have been various attempts to select predictive feature sets from high dimensional data in classification and clustering, there have been only limited attempts to do this for regression problems. Since supervised feature selection methods tend to identify noisy features in addition to discriminative variables, unsupervised feature selection methods (USFSMs) are generally regarded as more unbiased approaches. The aim of this thesis is, therefore, to provide: (i) a comprehensive overview of feature selection methods for regression problems, in which the methods are listed along with their types, references, sources, and code repositories; (ii) a taxonomy of feature selection methods for regression problems to help researchers select appropriate methods for their research; (iii) a deep learning based unsupervised feature selection framework, DFSFR; and (iv) a K-means based unsupervised feature selection method, KBFS. To the best of our knowledge, DFSFR is the first deep learning based method designed specifically for regression tasks. In addition, a hybrid USFSM, DKBFS, is proposed, which combines KBFS and DFSFR to select discriminative features from very high dimensional data. The proposed frameworks are compared with state-of-the-art USFSMs, including Multi Cluster Feature Selection (MCFS), Embedded Unsupervised Feature Selection (EUFS), Infinite Feature Selection (InFS), Spectral Regression Feature Selection (SPFS), Laplacian Score Feature Selection (LapFS), and Term Variance Feature Selection (TV), as well as with the entire feature sets and the methods used in previous studies.
To evaluate the effectiveness of the proposed methods, four case studies are considered: (i) a low dimensional RV144 vaccine dataset; (ii) three high dimensional peptide binding affinity datasets; (iii) a very high dimensional GSE44763 dataset; and (iv) a very high dimensional GSE40279 dataset. Experimental results from these datasets are used to validate the effectiveness of the proposed methods. Compared to state-of-the-art feature selection methods, the proposed methods achieve improvements in prediction accuracy of as much as 9% for the RV144 vaccine dataset, 75% for the peptide binding affinity datasets, 3% for the GSE44763 dataset, and 55% for the GSE40279 dataset.

Item Type: Thesis (Doctoral)
Subjects: B900 Others in Subjects allied to Medicine
G400 Computer Science
Department: Faculties > Engineering and Environment > Computer and Information Sciences
University Services > Graduate School > Doctor of Philosophy
Depositing User: Becky Skoyles
Date Deposited: 11 Oct 2018 10:50
Last Modified: 31 Jul 2021 19:31
URI: http://nrl.northumbria.ac.uk/id/eprint/36260
