Yin H, Tao J, Peng Y, Xiong Y, Li B, Li S, Yang H. MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning. Comput Struct Biotechnol J 2022; 20:3783-3795. [PMID: 35891786 PMCID: PMC9304602 DOI: 10.1016/j.csbj.2022.07.022] [Received: 03/14/2022] [Revised: 07/10/2022] [Accepted: 07/11/2022] [Indexed: 11/24/2022]
Abstract
In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), small sample sizes are frequently used in experiments. Therefore, methods to screen reliable and stable features are urgently needed for analyses with limited sample sizes. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method combines three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking with 165 real datasets. The results showed that, among these methods, MSPJ had the best performance in most small gene expression datasets, especially those with sample sizes below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
Key Words
- AUC, area under the ROC curve
- DEGs, differentially expressed genes
- Differentially expressed genes
- FDR, false discovery rate
- Feature selection
- GA, genetic algorithm
- GEO, Gene Expression Omnibus
- GO, gene ontology
- MSPJ, the Joint method of Meta-analysis, SVM-RFE, and Permutation test
- Machine learning
- RF, random forest
- ROC, receiver operating characteristic
- Random sampling
- SAM, significance analysis of microarrays
- SMDs, standardized mean differences
- SNR, signal-to-noise ratio
- SVM-RFE, support vector machines-recursive feature elimination
- Small sample size
- mRMR, minimum-redundancy-maximum-relevance
Affiliation(s)
- HuaChun Yin
  - Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
  - Department of Neurobiology, Chongqing Key Laboratory of Neurobiology, The Army Medical University, Chongqing 400038, China
- JingXin Tao
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Yuyang Peng
  - Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China
- Ying Xiong
  - Department of Neurobiology, Chongqing Key Laboratory of Neurobiology, The Army Medical University, Chongqing 400038, China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Song Li
  - Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China
  - Guangyang Bay Laboratory, Chongqing Institute for Brain and Intelligence, Chongqing, China
- Hui Yang
  - Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China
  - Guangyang Bay Laboratory, Chongqing Institute for Brain and Intelligence, Chongqing, China