1
Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics 2024; 24:139. [PMID: 39158621 DOI: 10.1007/s10142-024-01415-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 07/30/2024] [Accepted: 08/01/2024] [Indexed: 08/20/2024]
Abstract
Recent advances in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the volume and density of data. High-dimensional NGS data, characterized by a large number of genomic, transcriptomic, proteomic, and metagenomic features relative to the number of biological samples, pose significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Feature selection and feature extraction can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of statistical, machine learning, and deep learning-based feature selection and extraction techniques tailored for the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. Various techniques, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for the microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets surveyed here. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review aims to help readers apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.
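The distinction this abstract draws between feature selection and feature extraction can be illustrated with a minimal Python sketch. It is not taken from the review: `select_top_variance` (a simple variance filter) and `first_principal_component` (a power-iteration approximation of the first PCA direction) are illustrative names and stand-in techniques chosen here, and the toy matrix is made up. Selection keeps a subset of the original columns, whereas extraction builds a new derived feature from all of them.

```python
import random
from statistics import pvariance


def select_top_variance(X, k):
    """Feature selection: keep the k original columns with the largest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    keep = sorted(range(n_features), key=lambda j: variances[j], reverse=True)[:k]
    keep.sort()  # report kept columns in their original order
    return keep, [[row[j] for j in keep] for row in X]


def first_principal_component(X, n_iter=200, seed=0):
    """Feature extraction: project samples onto the leading principal direction."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centered data
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(p)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        # Power iteration on Xc^T Xc (proportional to the covariance matrix).
        scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # all features constant: nothing to extract
            break
        v = [x / norm for x in w]
    return v, [sum(r[j] * v[j] for j in range(p)) for r in Xc]


# Toy matrix: 4 samples x 3 features (hypothetical values).
X = [[1.0, 5.0, 0.1], [2.0, 5.0, 0.2], [3.0, 5.0, 0.1], [4.0, 5.0, 0.2]]
kept_columns, X_selected = select_top_variance(X, k=2)
direction, X_extracted = first_principal_component(X)
```

Here `kept_columns` lists indices of original features, so the result stays directly interpretable, while `X_extracted` holds one synthetic score per sample that mixes all features.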
Affiliation(s)
- Kasmika Borah
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Himanish Shekhar Das
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Soumita Seth
  - Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata, 700150, West Bengal, India
- Koushik Mallick
  - Department of Computer Science and Engineering, RCC Institute of Information Technology, Canal S Rd, Beleghata, Kolkata, 700015, West Bengal, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA, 02115, USA
  - Department of Pharmacology & Toxicology, University of Arizona, Tucson, AZ, 85721, USA
2
Wang S, Liu H, Yang P, Wang Z, Ye P, Xia J, Chen S. A role of inflammaging in aortic aneurysm: new insights from bioinformatics analysis. Front Immunol 2023; 14:1260688. [PMID: 37744379 PMCID: PMC10511768 DOI: 10.3389/fimmu.2023.1260688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 08/23/2023] [Indexed: 09/26/2023] Open
Abstract
Introduction: Aortic aneurysms (AA) are prevalent worldwide, with a notable absence of drug therapies. Thus, identifying potential drug targets is of utmost importance. AA often presents in the elderly, coupled with consistently raised serum inflammatory markers. Given that ageing and inflammation are pivotal processes linked to the evolution of AA, we identified key genes involved in the inflammaging process of AA development through various bioinformatics methods, thereby providing potential molecular targets for further investigation.
Methods: The transcriptome data of AA were procured from the datasets GSE140947, GSE7084, and GSE47472, sourced from the NCBI GEO database, whilst gene data on ageing and inflammation were obtained from the GeneCards database. To identify key genes, differential expression analysis using the "Limma" package and WGCNA were implemented. Protein-protein interaction (PPI) analysis and machine learning (ML) algorithms were employed to screen potential biomarkers, followed by an assessment of their diagnostic value. Following the acquisition of the hub inflammaging and AA-related differentially expressed genes (IADEGs), a TFs-mRNAs-miRNAs regulatory network was established. The CIBERSORT algorithm was utilized to investigate immune cell infiltration in AA. The correlation of hub IADEGs with infiltrating immunocytes was also evaluated. Lastly, wet laboratory experiments were carried out to confirm the expression of hub IADEGs.
Results: In total, 342 and 715 AA-related DEGs (ADEGs) were identified from the GSE140947 and GSE7084 datasets, respectively, by intersecting the results of the "Limma" and WGCNA analyses. After 83 IADEGs were obtained, PPI analysis and ML algorithms pinpointed 7 and 5 hub IADEG candidates, respectively, and 6 of them demonstrated a high diagnostic value. Immune cell infiltration outcomes unveiled immune dysregulation in AA. In the wet laboratory experiments, 3 hub IADEGs (BLNK, HLA-DRA, and HLA-DQB1) exhibited an expression trend in line with the bioinformatics analysis results.
Discussion: Our research identified three genes (BLNK, HLA-DRA, and HLA-DQB1) that play a significant role in promoting the development of AA through inflammaging, providing novel insights for the future understanding and therapeutic intervention of AA.
Affiliation(s)
- Shilin Wang
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Hao Liu
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Peiwen Yang
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Zhiwen Wang
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Ping Ye
  - Department of Cardiology, The Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Jiahong Xia
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Shu Chen
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
3
Jiang X, Pan W, Chen M, Wang W, Song W, Lin GN. Integrative enrichment analysis of gene expression based on an artificial neuron. BMC Med Genomics 2021; 14:173. [PMID: 34433483 PMCID: PMC8386081 DOI: 10.1186/s12920-021-00988-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2020] [Accepted: 05/18/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND: Huntington's disease is a chronic progressive neurodegenerative disease with complex pathogenic mechanisms. To date, the pathogenesis of Huntington's disease is still not fully understood, and there is no effective treatment. The rapid development of high-throughput sequencing technologies makes it possible to explore the molecular mechanisms at the transcriptome level. Our previous studies on Huntington's disease have shown that it is difficult to distinguish disease-associated genes from non-disease genes. Meanwhile, recent progress in biomedicine shows that the molecular origin of chronic complex diseases may not lie in the diseased tissue, and genes differentially expressed between different tissues may help reveal the molecular origin of chronic diseases. Therefore, developing integrative computational methods for multi-tissue gene expression data that explore the relationship between genes differentially expressed in different tissues and the disease can greatly accelerate the molecular discovery process.
METHODS: To analyze intra- and inter-tissue differentially expressed genes, we designed an integrative enrichment analysis method based on an artificial neuron (IEAAN). First, we calculated differential expression scores for each gene, which serve as that gene's features, using a fold-change approach on intra- and inter-tissue gene expression data. Then, we passed the weighted sum of all differential expression scores through a sigmoid function to obtain a differential expression enrichment score. Finally, we ranked the genes according to the enrichment score. Top-ranking genes are taken to be the potential disease-associated genes.
RESULTS: We conducted extensive experiments to analyze intra- and inter-tissue differentially expressed genes. Experimental results showed that genes differentially expressed between different tissues are more likely to be Huntington's disease-associated genes. Five disease-associated genes were selected in this study, two of which have been reported to be implicated in Huntington's disease.
CONCLUSIONS: We proposed a novel integrative enrichment analysis method based on an artificial neuron (IEAAN), which predicts disease-associated genes with better precision than state-of-the-art statistical methods. Our comprehensive evaluation suggests that genes differentially expressed between striatum and liver tissues of healthy individuals are more likely to be Huntington's disease-associated genes.
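The three steps in the METHODS section (fold-change scores as features, a weighted sum passed through a sigmoid, then ranking) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' IEAAN implementation: the use of an absolute log2 fold change, the weights, the function names, and the expression values are all hypothetical.

```python
import math


def fold_change_score(mean_a, mean_b, eps=1e-9):
    """Absolute log2 fold change between two conditions or tissues (assumed score)."""
    return abs(math.log2((mean_a + eps) / (mean_b + eps)))


def enrichment_score(de_scores, weights, bias=0.0):
    """Single artificial neuron: sigmoid of the weighted sum of the scores."""
    z = bias + sum(w * s for w, s in zip(weights, de_scores))
    return 1.0 / (1.0 + math.exp(-z))


def rank_genes(gene_scores, weights):
    """Return gene names sorted from highest to lowest enrichment score."""
    enrichment = {g: enrichment_score(s, weights) for g, s in gene_scores.items()}
    return sorted(enrichment, key=enrichment.get, reverse=True)


# Hypothetical mean expression pairs per gene:
# first pair = intra-tissue comparison, second pair = inter-tissue comparison.
pairs = {
    "geneA": [(2.0, 2.0), (8.0, 1.0)],
    "geneB": [(2.0, 2.0), (2.0, 2.0)],
    "geneC": [(4.0, 2.0), (2.0, 2.0)],
}
scores = {g: [fold_change_score(a, b) for a, b in p] for g, p in pairs.items()}
ranking = rank_genes(scores, weights=[1.0, 1.0])  # weights are placeholders
```

Because the sigmoid is monotonic, the ranking depends only on the weighted sum; the sigmoid mainly maps the score into the interval (0, 1).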
Affiliation(s)
- Xue Jiang
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weihao Pan
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Miao Chen
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weidi Wang
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weichen Song
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Guan Ning Lin
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
  - Shanghai Key Laboratory of Psychotic Disorders, Shanghai, 200030 China
4
Machine-Learning Provides Patient-Specific Prediction of Metastatic Risk Based on Innovative, Mechanobiology Assay. Ann Biomed Eng 2021; 49:1774-1783. [PMID: 33483841 DOI: 10.1007/s10439-020-02720-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2020] [Accepted: 12/30/2020] [Indexed: 12/13/2022]
Abstract
Cancer mortality is mostly related to metastasis. Metastasis is currently prognosed via histopathology, disease statistics, or genetics; these approaches are potentially inaccurate, not rapidly available, and require known markers. We have developed a rapid (~2 h) mechanobiology-based approach to provide early prognosis of the clinical likelihood of metastasis. Specifically, invasive cell subsets seeded on impenetrable, physiological-stiffness polyacrylamide gels forcefully indent the gels, while non-invasive/benign cells do not. The number of indenting cells and their attained depths, together termed the mechanical invasiveness, accurately define the metastatic risk of tumors and cell lines. Utilizing our experimental database, we compare the capacity of several machine learning models to predict the metastatic risk. Models underwent supervised training on individual experiments, using classifications from the literature and commercial sources for established cell lines and clinical histopathology reports for tumor samples. We evaluated 2-class models, separating invasive from non-invasive (e.g. benign) samples, and obtained sensitivity and specificity of 0.92 and 1, respectively; this surpasses other works. We also introduce a novel approach using 5-class models (i.e. normal, benign, and non-, low-, or high-metastatic cancer) that provided average sensitivity and specificity of 0.69 and 0.91. Combining our rapid mechanical invasiveness assay with machine learning classification can provide accurate and early prognosis of metastatic risk to support the choice of treatments and disease management.
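For readers unfamiliar with the reported metrics, the following Python sketch shows how sensitivity and specificity are commonly computed for a 2-class model and averaged one-vs-rest for a multi-class model. The averaging scheme is an assumption (the abstract does not say how the 5-class averages were obtained), and the labels below are invented for illustration.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity and specificity, treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def macro_average(y_true, y_pred):
    """One-vs-rest average over all classes (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    pairs = [sensitivity_specificity(y_true, y_pred, c) for c in classes]
    return (sum(s for s, _ in pairs) / len(pairs),
            sum(sp for _, sp in pairs) / len(pairs))


# Hypothetical 2-class labels: 1 = invasive, 0 = non-invasive.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred, positive=1)
avg_sens, avg_spec = macro_average(y_true, y_pred)
```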
5
Liu P, Fu B, Yang SX, Deng L, Zhong X, Zheng H. Optimizing Survival Analysis of XGBoost for Ties to Predict Disease Progression of Breast Cancer. IEEE Trans Biomed Eng 2020; 68:148-160. [PMID: 32406821 DOI: 10.1109/tbme.2020.2993278] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Indexed: 11/09/2022]
Abstract
OBJECTIVE: Several excellent prognostic models based on survival analysis methods for breast cancer have been proposed and extensively validated, providing an essential means for clinical diagnosis and treatment to improve patient survival. To analyze clinical and follow-up data of 12119 breast cancer patients from the Clinical Research Center for Breast (CRCB) at West China Hospital of Sichuan University, we developed a gradient boosting algorithm, called EXSA, that optimizes the survival analysis of the XGBoost framework for ties to predict the disease progression of breast cancer.
METHODS: EXSA is based on the XGBoost framework in machine learning and the Cox proportional hazards model in survival analysis. By taking the Efron approximation of the partial likelihood function as the learning objective for ties, EXSA derives gradient formulas for a more precise approximation, which enhances the ability of XGBoost to handle survival data with ties. After retaining 4575 patients (3202 cases for training, 1373 cases for testing), we used the EXSA method to build a prognostic model that estimates disease progression. The model evaluates a risk score for disease progression, and the risk grouping and the continuous functions relating risk scores to the disease progression rate at 5 and 10 years are also demonstrated.
RESULTS: Experimental results on the test set show that the EXSA method achieves competitive performance, with a concordance index of 0.83454 and 5-year and 10-year AUCs of 0.83851 and 0.78155, respectively.
CONCLUSION: The proposed EXSA method can be utilized as an effective method for survival analysis.
SIGNIFICANCE: The proposed method can provide an important means of analyzing follow-up data in breast cancer or other disease research.
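To make the role of tied event times concrete, the sketch below evaluates the negative Cox log partial likelihood with Efron's approximation in plain Python. It only computes the loss value; it does not reproduce the gradient formulas that EXSA derives, nor any XGBoost integration, and the function name and toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def efron_neg_log_partial_likelihood(times, events, risk_scores):
    """Negative Cox log partial likelihood with Efron's correction for ties.

    times: observed times; events: 1 if the event occurred, 0 if censored;
    risk_scores: model outputs (log relative hazards), one per subject.
    """
    tied_events = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e:
            tied_events[t].append(i)

    loss = 0.0
    for t, event_idx in tied_events.items():
        d = len(event_idx)  # number of tied events at time t
        # Sum of exp(score) over everyone still at risk at time t.
        risk_sum = sum(math.exp(s) for s, ti in zip(risk_scores, times) if ti >= t)
        tied_sum = sum(math.exp(risk_scores[i]) for i in event_idx)
        loss -= sum(risk_scores[i] for i in event_idx)
        for l in range(d):
            # Efron: remove a growing fraction of the tied subjects from the risk set.
            loss += math.log(risk_sum - (l / d) * tied_sum)
    return loss


# Hypothetical data: two subjects share event time 1.0; all scores are zero.
loss = efron_neg_log_partial_likelihood([1.0, 1.0, 2.0], [1, 1, 1], [0.0, 0.0, 0.0])
```

With these toy inputs the tied pair contributes log(3) + log(2), whereas the simpler Breslow approximation would contribute 2·log(3); the difference grows with the number of ties, which is the situation the abstract targets.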
6
Detecting biomarkers from microarray data using distributed correlation based gene selection. Genes Genomics 2020; 42:449-465. [PMID: 32040771 DOI: 10.1007/s13258-020-00916-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 02/10/2019] [Accepted: 01/23/2020] [Indexed: 01/16/2023]
Abstract
BACKGROUND: Over the past few decades, DNA microarray technology has emerged as a prevailing approach for early identification of cancer subtypes. Several feature selection (FS) techniques have been widely applied to identify cancer from microarray gene data, but very few studies have examined distributing the feature selection process for detecting cancer subtypes.
OBJECTIVE: Not all gene expression values are needed for prediction; the objective of this article is to select discriminative biomarkers using a distributed FS method that supports accurate diagnosis of cancer subtypes. Traditional feature selection techniques have several drawbacks; for example, seemingly unrelated features that could perform well in terms of classification accuracy when combined with a suitable subset of genes are left out of the selection.
METHOD: To overcome this issue, this paper introduces a new filter-based method for gene selection that can select the highly relevant genes for distinguishing tissues from the gene expression dataset. In addition, it computes the gene-gene and gene-class relations and simultaneously identifies a subset of essential genes. The method is tested on the Diffuse Large B cell Lymphoma (DLBCL) dataset using well-known classification techniques such as support vector machine, naïve Bayes, k-nearest neighbor, and decision tree.
RESULTS: Results on the biological DLBCL dataset demonstrate that the proposed method provides a promising tool for the prediction of cancer type, with a prediction accuracy of 97.62%, precision of 94.23%, sensitivity of 94.12%, F-measure of 90.12%, and ROC value of 99.75%.
CONCLUSION: The experimental results show that the proposed method significantly improves classification accuracy and execution time compared with existing standard algorithms applied to the non-partitioned dataset. Furthermore, the extracted genes are biologically sound and agree with the outcomes of relevant biomedical studies.
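The idea of a filter that weighs gene-class relevance against gene-gene redundancy can be sketched as below. This is a generic, non-distributed correlation filter written for illustration; it is not the method proposed in the paper, and the Pearson measure, the `max_redundancy` threshold, and the toy expression values are all assumptions.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_genes(expression, labels, max_redundancy=0.9):
    """Greedy filter: rank genes by |gene-class correlation|, then skip any gene
    whose |gene-gene correlation| with an already selected gene is too high."""
    relevance = {g: abs(pearson(v, labels)) for g, v in expression.items()}
    ranked = sorted(expression, key=lambda g: relevance[g], reverse=True)
    selected = []
    for gene in ranked:
        if all(abs(pearson(expression[gene], expression[s])) < max_redundancy
               for s in selected):
            selected.append(gene)
    return selected


# Hypothetical expression values for 4 samples; labels encode two classes.
labels = [0, 0, 1, 1]
expression = {
    "g1": [1.0, 2.0, 8.0, 9.0],
    "g2": [2.0, 4.0, 16.0, 18.0],  # perfectly correlated with g1 (redundant)
    "g3": [5.0, 1.0, 6.0, 4.0],
}
selected = select_genes(expression, labels)
```

A distributed variant would presumably apply a filter like this to partitions of the gene set and merge the results, but how the paper partitions and merges is not stated in the abstract.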
7
Fu B, Liu P, Lin J, Deng L, Hu K, Zheng H. Predicting Invasive Disease-Free Survival for Early-stage Breast Cancer Patients Using Follow-up Clinical Data. IEEE Trans Biomed Eng 2018; 66:2053-2064. [PMID: 30475709 DOI: 10.1109/tbme.2018.2882867] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 11/06/2022]
Abstract
OBJECTIVE: Chinese women are seriously threatened by breast cancer, with high morbidity and mortality. The lack of robust prognosis models makes it difficult for doctors to prepare an appropriate treatment plan that may prolong patient survival time. An alternative prognosis model framework for predicting Invasive Disease-Free Survival (iDFS) in early-stage breast cancer patients, called MP4Ei, is proposed. The MP4Ei framework performs well in predicting relapse or metastasis of breast cancer in Chinese patients within 5 years.
METHODS: MP4Ei is built on statistical theory and the gradient boosting decision tree framework. A total of 5246 patients with early-stage (stage I-III) breast cancer, from the Clinical Research Center for Breast (CRCB) at West China Hospital of Sichuan University, are eligible for inclusion. Stratified feature selection, including statistical and ensemble methods, is adopted to select 23 of the 89 patient features covering demographics, diagnosis, pathology and therapy. The 23 selected features are then used as input variables to the XGBoost algorithm, with Bayesian parameter tuning and cross validation, to find the optimum simplified model for 5-year iDFS prediction.
RESULTS: On the eligible data, with 4196 patients (80%) for training and 1050 patients (20%) for testing, MP4Ei achieves comparable accuracy, with an AUC of 0.8451, which is a significant advantage (p < 0.05).
CONCLUSION: This work demonstrates a complete iDFS prognosis model with very competitive performance.
SIGNIFICANCE: The proposed method could be used in clinical practice to predict patients' prognosis and future survival status, which may help doctors make treatment plans.
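The AUC reported in the RESULTS section can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A minimal Python sketch of that pairwise definition follows; it is a generic illustration with made-up labels and scores, not the evaluation code used for MP4Ei.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = relapse or metastasis within 5 years) and risk scores.
y_true = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, risk)
```

This quadratic-time version is fine for small examples; for thousands of patients a rank-based computation is usually preferred.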
8
Alsalem MA, Zaidan AA, Zaidan BB, Hashim M, Albahri OS, Albahri AS, Hadi A, Mohammed KI. Systematic Review of an Automated Multiclass Detection and Classification System for Acute Leukaemia in Terms of Evaluation and Benchmarking, Open Challenges, Issues and Methodological Aspects. J Med Syst 2018; 42:204. [PMID: 30232632 DOI: 10.1007/s10916-018-1064-9] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 07/18/2018] [Accepted: 09/06/2018] [Indexed: 10/28/2022]
Abstract
This study aims to systematically review prior research on the evaluation and benchmarking of automated acute leukaemia classification tasks. The review draws on three reliable search engines: ScienceDirect, Web of Science and IEEE Xplore. A research taxonomy developed for the review takes a wide perspective on automated detection and classification of acute leukaemia and reflects the usage trends in the evaluation criteria in this field. The taxonomy involves two phases. The first phase covers the three main research directions in this domain. The second presents all the criteria used for evaluating acute leukaemia classification. The final set of studies includes 83 investigations, most of which focused on enhancing the accuracy and performance of detection and classification through proposed methods or systems. Few efforts addressed the evaluation issues. According to the final set of articles, three groups of articles represented the main research directions in this domain: 56 articles highlighted the proposed methods, 22 articles involved proposals for system development and 5 papers centred on evaluation and comparison. The other side of the taxonomy included 16 main and sub-evaluation and benchmarking criteria. This review highlights three serious issues in the evaluation and benchmarking of multiclass classification of acute leukaemia, namely, conflicting criteria, evaluation criteria and criteria importance. It also identifies the weakness of benchmarking tools. To solve these issues, multicriteria decision-making (MCDM) analysis techniques were proposed as recommended solutions in the methodological aspect. This methodological aspect involves a proposed decision support system based on MCDM for evaluation and benchmarking to select suitable multiclass classification models for acute leukaemia. The support system has three sequential phases. Phase One presents the identification procedure and process for establishing a decision matrix based on a crossover of evaluation criteria and acute leukaemia multiclass classification models. Phase Two describes the decision matrix development for the selection of acute leukaemia classification models based on the integrated best-worst method (BWM) and VIKOR. Phase Three entails the validation of the proposed system.
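As a rough illustration of Phase Two, the sketch below ranks alternatives in a decision matrix with the core VIKOR scores (group utility S, individual regret R, compromise index Q). It is a simplified textbook-style version, not the system proposed in the review: it assumes all criteria are benefit criteria, takes the criterion weights as given (in the proposed system they would come from BWM, which is not implemented here), and omits VIKOR's acceptable-advantage and stability checks. The matrix values are invented.

```python
def vikor(decision_matrix, weights, v=0.5):
    """Return (S, R, Q) per alternative; a lower Q means a better compromise.

    decision_matrix[i][j] is the score of alternative i on criterion j.
    All criteria are assumed to be benefit criteria (higher is better).
    """
    n_criteria = len(weights)
    best = [max(row[j] for row in decision_matrix) for j in range(n_criteria)]
    worst = [min(row[j] for row in decision_matrix) for j in range(n_criteria)]

    S, R = [], []
    for row in decision_matrix:
        regrets = [
            weights[j] * (best[j] - row[j]) / (best[j] - worst[j])
            if best[j] != worst[j] else 0.0
            for j in range(n_criteria)
        ]
        S.append(sum(regrets))  # group utility
        R.append(max(regrets))  # individual regret

    def rescale(x, lo, hi):
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    Q = [
        v * rescale(s, min(S), max(S)) + (1 - v) * rescale(r, min(R), max(R))
        for s, r in zip(S, R)
    ]
    return S, R, Q


# Hypothetical decision matrix: 3 classifiers x 2 criteria (accuracy, F-score).
matrix = [[0.9, 0.8], [0.7, 0.9], [0.6, 0.6]]
criterion_weights = [0.6, 0.4]  # assumed to come from a BWM step, not computed here
S, R, Q = vikor(matrix, criterion_weights)
best_model = min(range(len(Q)), key=Q.__getitem__)
```

The parameter `v` trades off majority agreement (S) against the worst single-criterion shortfall (R); 0.5 is a common neutral choice.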
Affiliation(s)
- M A Alsalem
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- A A Zaidan
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- B B Zaidan
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- M Hashim
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- O S Albahri
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- A S Albahri
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- Ali Hadi
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
- K I Mohammed
  - Department of Computing, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
9
Alsalem MA, Zaidan AA, Zaidan BB, Hashim M, Madhloom HT, Azeez ND, Alsyisuf S. A review of the automated detection and classification of acute leukaemia: Coherent taxonomy, datasets, validation and performance measurements, motivation, open challenges and recommendations. Computer Methods and Programs in Biomedicine 2018; 158:93-112. [PMID: 29544792 DOI: 10.1016/j.cmpb.2018.02.005] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 12/04/2017] [Revised: 01/19/2018] [Accepted: 02/02/2018] [Indexed: 06/08/2023]
Abstract
CONTEXT: Acute leukaemia diagnosis is a field requiring automated solutions, tools and methods and the ability to facilitate early detection and even prediction. Many studies have focused on the automatic detection and classification of acute leukaemia and its subtypes to enable highly accurate diagnosis.
OBJECTIVE: This study aimed to review and analyse literature related to the detection and classification of acute leukaemia. To improve understanding of the field's contextual aspects and characteristics in published studies, we considered the motivation, the open challenges that confronted researchers and the recommendations presented to researchers to enhance this vital research area.
METHODS: We systematically searched all articles about the classification and detection of acute leukaemia, as well as their evaluation and benchmarking, in three main databases: ScienceDirect, Web of Science and IEEE Xplore from 2007 to 2017. These indices were considered to be sufficiently extensive to encompass our field of literature.
RESULTS: Based on our inclusion and exclusion criteria, 89 articles were selected. Most studies (58/89) focused on the methods or algorithms of acute leukaemia classification, a number of papers (22/89) covered the developed systems for the detection or diagnosis of acute leukaemia and few papers (5/89) presented evaluation and comparative studies. The smallest portion (4/89) of articles comprised reviews and surveys.
DISCUSSION: Acute leukaemia diagnosis, which is a field requiring automated solutions, tools and methods, entails the ability to facilitate early detection or even prediction. Many studies have been performed on the automatic detection and classification of acute leukaemia and its subtypes to promote accurate diagnosis.
CONCLUSIONS: Research areas on medical-image classification vary, but they are all equally vital. We expect this systematic review to help emphasise current research opportunities and thus extend and create additional research fields.
Affiliation(s)
- M A Alsalem
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- A A Zaidan
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- B B Zaidan
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- M Hashim
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- H T Madhloom
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- N D Azeez
  - Department of Computing, Faculty of Arts, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Malaysia
- S Alsyisuf
  - Faculty of Information Science and Engineering, Management and Science University, Shah Alam, Malaysia
10
Brigham K, Gupta S, Brigham JC. Predicting responses to mechanical ventilation for preterm infants with acute respiratory illness using artificial neural networks. International Journal for Numerical Methods in Biomedical Engineering 2018; 34:e3094. [PMID: 29667359 DOI: 10.1002/cnm.3094] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/30/2017] [Revised: 02/25/2018] [Accepted: 04/04/2018] [Indexed: 06/08/2023]
Abstract
Infants born prematurely are particularly susceptible to respiratory illness due to underdeveloped lungs, which can often result in fatality. Preterm infants in acute stages of respiratory illness typically require mechanical ventilation assistance, and the efficacy of the type of mechanical ventilation and its delivery has been the subject of a number of clinical studies. With recent advances in machine learning approaches, particularly deep learning, it may be possible to estimate future responses to mechanical ventilation in real time, based on ventilation monitoring up to the point of analysis. In this work, recurrent neural networks are proposed for predicting future ventilation parameters because of the highly nonlinear behavior of the ventilation measures of interest and the ability of recurrent neural networks to model complex nonlinear functions. The resulting application of this particular class of neural networks shows promise in its ability to predict future responses for different ventilation modes. Towards improving care and treatment of preterm newborns, further development of this prediction process for ventilation could potentially aid in important clinical decisions or studies to improve preterm infant health.
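How a recurrent network turns a monitored history into a forecast can be sketched with a single Elman-style recurrent layer in plain Python. This is a generic forward pass with untrained, hypothetical weights; the abstract does not specify the architecture, inputs, or training procedure, so none of the names or numbers below come from the paper.

```python
import math


def rnn_forecast(series, w_x, w_h, b_h, w_y, b_y):
    """Run a scalar-input recurrent layer over `series`, then predict the next value.

    w_x: input weights (one per hidden unit); w_h: hidden-to-hidden weight matrix;
    b_h: hidden biases; w_y, b_y: linear read-out weights and bias.
    """
    hidden = [0.0] * len(b_h)
    for x in series:
        # Each step mixes the new measurement with the previous hidden state.
        hidden = [
            math.tanh(
                w_x[i] * x
                + sum(w_h[i][j] * hidden[j] for j in range(len(hidden)))
                + b_h[i]
            )
            for i in range(len(hidden))
        ]
    return sum(w_y[i] * hidden[i] for i in range(len(hidden))) + b_y


# Hypothetical normalized ventilation measurements and untrained weights.
history = [0.42, 0.45, 0.47, 0.50]
prediction = rnn_forecast(
    history,
    w_x=[0.8, -0.3],
    w_h=[[0.1, 0.2], [-0.2, 0.1]],
    b_h=[0.0, 0.0],
    w_y=[0.5, 0.5],
    b_y=0.1,
)
```

In practice the weights would be learned from recorded ventilation data, and the hidden state is what lets the prediction depend on the whole monitoring history rather than only the latest value.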
Affiliation(s)
- Samir Gupta
  - Neonatal Unit, University Hospital of North Tees, Stockton-on-Tees, UK
- John C Brigham
  - Department of Engineering, Durham University, Durham, UK
11
Gene selection from large-scale gene expression data based on fuzzy interactive multi-objective binary optimization for medical diagnosis. Biocybern Biomed Eng 2018. [DOI: 10.1016/j.bbe.2018.02.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 11/20/2022]
12
Shahjaman M, Kumar N, Ahmed MS, Begum A, Islam SMS, Mollah MNH. Robust Feature Selection Approach for Patient Classification using Gene Expression Data. Bioinformation 2017; 13:327-332. [PMID: 29162964 PMCID: PMC5680713 DOI: 10.6026/97320630013327] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/26/2017] [Revised: 09/11/2017] [Accepted: 09/12/2017] [Indexed: 11/23/2022] Open
Abstract
Patient classification through feature selection (FS) based on gene expression data (GED) has become popular in the research community. The t-test is a well-known statistical FS method in GED analysis. However, it produces more false positives and lower accuracy for small sample sizes or in the presence of outliers. To overcome the shortcomings of the t-test with small sample sizes, significance analysis of microarrays (SAM) has been applied to GED. However, it is highly sensitive to outliers. Recently, robust SAM using minimum β-divergence estimators has overcome the problems of the classical t-test and SAM, and it has been successfully applied to the identification of differentially expressed (DE) genes. However, it has not been applied to classification. Therefore, in this paper, we employ robust SAM as a feature selection approach along with classifiers for patient classification. We compare the performance of robust SAM with that of the classical t-test and SAM, in combination with four popular classifiers (LDA, KNN, SVM and naive Bayes), using both simulated and real gene expression datasets. The results from the simulation and real data analysis confirm that the four classifiers perform better with robust SAM than with the classical t-test and SAM. From a real colon cancer dataset, we identified 21 additional DE genes using robust SAM that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we performed KEGG pathway enrichment analysis and found that these genes are involved in some important cancer-related pathways.
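The classical baselines mentioned in this abstract can be sketched in a few lines of Python. The function below computes a pooled-variance two-sample t statistic per gene; setting the optional `s0` above zero adds a small constant to the denominator in the spirit of SAM. The robust β-divergence variant studied in the paper is not implemented, and the function names, the `s0` value, and the toy data are assumptions made for illustration.

```python
from statistics import mean, variance


def t_like_statistic(group_a, group_b, s0=0.0):
    """Pooled-variance two-sample t statistic; s0 > 0 gives a SAM-style statistic.

    Assumes each group has at least two samples and, when s0 == 0,
    that the pooled variance is non-zero.
    """
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    std_err = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (mean(group_a) - mean(group_b)) / (std_err + s0)


def rank_genes(expression, top_k, s0=0.0):
    """Return the top_k gene names with the largest absolute statistic."""
    stats = {g: abs(t_like_statistic(a, b, s0)) for g, (a, b) in expression.items()}
    return sorted(stats, key=stats.get, reverse=True)[:top_k]


# Hypothetical expression values: (control samples, patient samples) per gene.
expression = {
    "g1": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "g2": ([1.0, 2.0, 3.0], [1.5, 2.0, 3.5]),
}
t_g1 = t_like_statistic(*expression["g1"])
d_g1 = t_like_statistic(*expression["g1"], s0=1.0)
top_genes = rank_genes(expression, top_k=1)
```

Both statistics rely on sample means and variances, which is why a single outlier can distort the ranking; that sensitivity is the motivation the abstract gives for a robust alternative.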
Affiliation(s)
- Md. Shahjaman
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh
  - Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh
- Nishith Kumar
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh
  - Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md. Shakil Ahmed
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh
- AnjumanAra Begum
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh
- S. M. Shahinul Islam
  - Institute of Biological Science (IBSc), University of Rajshahi, Rajshahi-6205, Bangladesh
13
Golestan Hashemi FS, Razi Ismail M, Rafii Yusop M, Golestan Hashemi MS, Nadimi Shahraki MH, Rastegari H, Miah G, Aslani F. Intelligent mining of large-scale bio-data: Bioinformatics applications. BIOTECHNOL BIOTEC EQ 2017. [DOI: 10.1080/13102818.2017.1364977] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023] Open
Affiliation(s)
- Farahnaz Sadat Golestan Hashemi
  - Plant Genetics, AgroBioChem Department, Gembloux Agro-Bio Tech, University of Liege, Liege, Belgium
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Razi Ismail
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Rafii Yusop
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mahboobe Sadat Golestan Hashemi
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Mohammad Hossein Nadimi Shahraki
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Hamid Rastegari
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Gous Miah
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Farzad Aslani
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
14
Mandal M, Mukhopadhyay A. Multiobjective PSO-based rank aggregation: Application in gene ranking from microarray data. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.12.037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
15
Hassanzadeh HR, Phan JH, Wang MD. A Multi-Modal Graph-Based Semi-Supervised Pipeline for Predicting Cancer Survival. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2016; 2016:184-189. [PMID: 32655981 DOI: 10.1109/bibm.2016.7822516] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Cancer survival prediction is an active area of research that can help prevent unnecessary therapies and improve patients' quality of life. Gene expression profiling is widely used in cancer studies to discover informative biomarkers that aid in predicting different clinical endpoints. We use multiple modalities of data derived from RNA deep-sequencing (RNA-seq) to predict survival of cancer patients. Despite the wealth of information available in expression profiles of cancer tumors, fulfilling the aforementioned objective remains a big challenge, for the most part, due to the paucity of data samples compared to the high dimension of the expression profiles. As such, analysis of transcriptomic data modalities calls for state-of-the-art big-data analytics techniques that can maximally use all the available data to discover the relevant information hidden within a significant amount of noise. In this paper, we propose a pipeline that predicts cancer patients' survival by exploiting the structure of the input (manifold learning) and by leveraging the unlabeled samples using Laplacian support vector machines, a graph-based semi-supervised learning (GSSL) paradigm. We show that under certain circumstances, no single modality per se will result in the best accuracy and by fusing different models together via a stacked generalization strategy, we may boost the accuracy synergistically. We apply our approach to two cancer datasets and present promising results. We maintain that a similar pipeline can be used for predictive tasks where labeled samples are expensive to acquire.
Affiliation(s)
- Hamid Reza Hassanzadeh
  - Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332
- John H Phan
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia 30332
- May D Wang
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia 30332
16
Saha S, Mitra S, Yadav RK. A multiobjective based automatic framework for classifying cancer-microRNA biomarkers. GENE REPORTS 2016. [DOI: 10.1016/j.genrep.2016.04.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
17
RNA Sequencing and Genetic Disease. CURRENT GENETIC MEDICINE REPORTS 2016. [DOI: 10.1007/s40142-016-0098-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
18
Zhang X, Guan N, Jia Z, Qiu X, Luo Z. Semi-Supervised Projective Non-Negative Matrix Factorization for Cancer Classification. PLoS One 2015; 10:e0138814. [PMID: 26394323 PMCID: PMC4579132 DOI: 10.1371/journal.pone.0138814] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 03/04/2015] [Accepted: 09/03/2015] [Indexed: 01/23/2023] Open
Abstract
Advances in DNA microarray technologies have made gene expression profiles a significant candidate in identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples in the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification for two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.
Affiliation(s)
- Xiang Zhang
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
- Naiyang Guan
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
  - * E-mail: (NG); (ZL)
- Zhilong Jia
  - Department of Chemistry and Biology, College of Science, National University of Defense Technology, Changsha, Hunan, China
- Xiaogang Qiu
  - College of Information System and Management, National University of Defense Technology, Changsha, Hunan, 410073 China
- Zhigang Luo
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
  - * E-mail: (NG); (ZL)
19
Gao F, Mei J, Sun J, Wang J, Yang E, Hussain A. A Novel Classification Algorithm Based on Incremental Semi-Supervised Support Vector Machine. PLoS One 2015; 10:e0135709. [PMID: 26275294 PMCID: PMC4537225 DOI: 10.1371/journal.pone.0135709] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 04/18/2015] [Accepted: 07/26/2015] [Indexed: 11/18/2022] Open
Abstract
For current computational intelligence techniques, a major challenge is how to learn new concepts in a changing environment. Traditional learning schemes could not adequately address this problem due to the lack of a dynamic data selection mechanism. In this paper, inspired by the human learning process, a novel classification algorithm based on incremental semi-supervised support vector machine (SVM) is proposed. Through the analysis of prediction confidence of samples and data distribution in a changing environment, a "soft-start" approach, a data selection mechanism and a data cleaning mechanism are designed, which complete the construction of our incremental semi-supervised learning system. Notably, the design of the proposed algorithm effectively reduces the computational complexity. In addition, a detailed analysis is carried out for the possible appearance of new labeled samples during the learning process. The results show that our algorithm does not rely on the model of sample distribution, has an extremely low rate of introducing wrong semi-labeled samples and can effectively make use of the unlabeled samples to enrich the knowledge system of the classifier and improve the accuracy rate. Moreover, our method also has outstanding generalization performance and the ability to overcome concept drift in a changing environment.
Affiliation(s)
- Fei Gao
  - School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
- Jingyuan Mei
  - School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
- Jinping Sun
  - School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
- Jun Wang
  - School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China
- Erfu Yang
  - Space Mechatronic Systems Technology Laboratory, Department of Design, Manufacture and Engineering Management, University of Strathclyde, Glasgow, G1 1XJ, United Kingdom
- Amir Hussain
  - Cognitive Signal-Image and Control Processing Research Laboratory, School of Natural Sciences, University of Stirling, Stirling, FK9 4LA, United Kingdom
20
Banwait JK, Bastola DR. Contribution of bioinformatics prediction in microRNA-based cancer therapeutics. Adv Drug Deliv Rev 2015; 81:94-103. [PMID: 25450261 DOI: 10.1016/j.addr.2014.10.030] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 06/16/2014] [Revised: 10/13/2014] [Accepted: 10/30/2014] [Indexed: 12/15/2022]
Abstract
Despite enormous efforts, cancer remains one of the most lethal diseases in the world. With the advancement of high throughput technologies massive amounts of cancer data can be accessed and analyzed. Bioinformatics provides a platform to assist biologists in developing minimally invasive biomarkers to detect cancer, and in designing effective personalized therapies to treat cancer patients. Still, the early diagnosis, prognosis, and treatment of cancer are an open challenge for the research community. MicroRNAs (miRNAs) are small non-coding RNAs that serve to regulate gene expression. The discovery of deregulated miRNAs in cancer cells and tissues has led many to investigate the use of miRNAs as potential biomarkers for early detection, and as a therapeutic agent to treat cancer. Here we describe advancements in computational approaches to predict miRNAs and their targets, and discuss the role of bioinformatics in studying miRNAs in the context of human cancer.
Affiliation(s)
- Jasjit K Banwait
  - College of Information Science and Technology, University of Nebraska at Omaha, 1110 South 67th Street, PKI 172, Omaha, NE 68106, USA.
- Dhundy R Bastola
  - College of Information Science and Technology, University of Nebraska at Omaha, 1110 South 67th Street, PKI 172, Omaha, NE 68106, USA.
21
Chakraborty D, Maulik U. Identifying Cancer Biomarkers From Microarray Data Using Feature Selection and Semisupervised Learning. IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE-JTEHM 2014; 2:4300211. [PMID: 27170887 PMCID: PMC4848046 DOI: 10.1109/jtehm.2014.2375820] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 06/08/2014] [Revised: 09/20/2014] [Accepted: 11/22/2014] [Indexed: 11/07/2022]
Abstract
Microarrays have now gone from obscurity to being almost ubiquitous in biological research. At the same time, the statistical methodology for microarray analysis has progressed from simple visual assessments of results to novel algorithms for analyzing changes in expression profiles. In a micro-RNA (miRNA) or gene-expression profiling experiment, the expression levels of thousands of genes/miRNAs are simultaneously monitored to study the effects of certain treatments, diseases, and developmental stages on their expressions. Microarray-based gene expression profiling can be used to identify genes whose expression changes in response to pathogens or other organisms by comparing gene expression in infected cells or tissues to that in uninfected ones. Recent studies have revealed that patterns of altered microarray expression profiles in cancer can serve as molecular biomarkers for tumor diagnosis, prognosis of disease-specific outcomes, and prediction of therapeutic responses. Microarray data sets containing expression profiles of a number of miRNAs or genes are used to identify biomarkers, which have dysregulation in normal and malignant tissues. However, small sample size remains a bottleneck in designing successful classification methods. On the other hand, an adequate amount of microarray data without clinical knowledge can be employed as an additional source of information. In this paper, a combination of kernelized fuzzy rough set (KFRS) and semisupervised support vector machine (S³VM) is proposed for predicting cancer biomarkers from one miRNA and three gene expression data sets. Biomarkers are discovered employing three feature selection methods, including KFRS. The effectiveness of the proposed KFRS and S³VM combination on the microarray data sets is demonstrated, and the cancer biomarkers identified from miRNA data are reported. Furthermore, biological significance tests are conducted for miRNA cancer biomarkers.
22
Wang Y, Fan X, Cai Y. A comparative study of improvements Pre-filter methods bring on feature selection using microarray data. Health Inf Sci Syst 2014; 2:7. [PMID: 25825671 PMCID: PMC4340279 DOI: 10.1186/2047-2501-2-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/18/2014] [Accepted: 10/03/2014] [Indexed: 12/13/2022] Open
Abstract
Background: Feature selection techniques have become a clear need in biomarker discovery with the development of microarrays. However, the high-dimensional nature of microarray data makes feature selection time-consuming. To overcome this difficulty, filtering data according to background knowledge before applying feature selection techniques has become a hot topic in microarray analysis. Different methods may affect the final results greatly, so it is important to evaluate these pre-filter methods in a systematic way.
Methods: In this paper, we compared the performance of statistical-based and biological-based pre-filter methods, and their combination, on microRNA-mRNA parallel expression profiles using L1 logistic regression as the feature selection technique. Four types of data were built for both microRNA and mRNA expression profiles.
Results: Pre-filter methods greatly reduced the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filter procedures were shown to be biologically significant at levels such as biological process and microRNA function. Analyses of classification performance based on precision showed that the pre-filter methods were necessary when the number of raw features was much larger than the number of samples. Computing time was greatly shortened in all cases after pre-filter procedures.
Conclusions: Because it yields similar or better classification performance with fewer but biologically significant features, pre-filter-based feature selection should be considered when researchers need fast results for complex computing problems in bioinformatics.
Affiliation(s)
- Yingying Wang
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Xiaomao Fan
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Yunpeng Cai
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
23
Fuzzy Preference Based Feature Selection and Semisupervised SVM for Cancer Classification. IEEE Trans Nanobioscience 2014; 13:152-60. [DOI: 10.1109/tnb.2014.2312132] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Indexed: 11/06/2022]