1
Wang Y, Li X, Wong KC, Chang Y, Yang S. Evolutionary Multiobjective Clustering Algorithms With Ensemble for Patient Stratification. IEEE Transactions on Cybernetics 2022; 52:11027-11040. [PMID: 33961576 DOI: 10.1109/tcyb.2021.3069434] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Patient stratification has been studied widely to tackle subtype diagnosis problems for effective treatment. Owing to the curse of dimensionality and the poor interpretability of the data, constructing a stratification model with high diagnostic ability and good generalization remains a long-standing challenge. To address these problems, this article proposes two novel evolutionary multiobjective clustering algorithms with ensemble (NSGA-II-ECFE and MOEA/D-ECFE), with four cluster validity indices used as the objective functions. First, an effective ensemble construction method is developed to enrich the ensemble diversity. After that, an ensemble clustering fitness evaluation (ECFE) method is proposed to evaluate the ensembles by measuring the consensus clustering under those four objective functions. To generate the consensus clustering, ECFE exploits the hybrid co-association matrix from the ensembles and then dynamically selects the suitable clustering algorithm on that matrix. Multiple experiments have been conducted to demonstrate the effectiveness of the proposed algorithm in comparison with seven clustering algorithms, twelve ensemble clustering approaches, and two multiobjective clustering algorithms on 55 synthetic datasets and 35 real patient stratification datasets. The experimental results demonstrate the competitive edge of the proposed algorithms over the compared methods. Furthermore, the proposed algorithm is applied to identify cancer subtypes from five cancer-related single-cell RNA-seq datasets, extending its advantages to that setting.
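The co-association idea mentioned in the abstract above can be illustrated with a minimal sketch. It computes the plain co-association matrix (the fraction of base clusterings that put two samples in the same cluster), not the hybrid variant used by ECFE, whose definition the abstract does not give. The function name `co_association` and the toy ensemble are assumptions made for illustration only.

```python
from itertools import combinations


def co_association(partitions):
    """Return the n x n co-association matrix for a list of label vectors.

    Entry [i][j] is the fraction of partitions that place samples i and j
    in the same cluster. This is the plain (not hybrid) co-association matrix.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same samples")
        # A sample always co-occurs with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Three hypothetical base clusterings of four samples.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(ensemble)
```

A consensus clustering is then typically obtained by running a clustering algorithm on this matrix (or on one minus it, treated as a distance); which algorithm to run is the part that ECFE is described as selecting dynamically.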
2
Li X, Zhang S, Wong KC. Multiobjective Genome-Wide RNA-Binding Event Identification From CLIP-Seq Data. IEEE Transactions on Cybernetics 2021; 51:5811-5824. [PMID: 31940583 DOI: 10.1109/tcyb.2019.2960515] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) are master regulators of mRNA processing and vital players in the post-transcriptional control of gene expression. In recent years, crosslinking immunoprecipitation sequencing (CLIP-seq) technologies have enabled us to sequence massive amounts of genome-wide RNA-binding event data. The increasing availability of such data provides opportunities to identify protein-RNA interactions on a genome-wide scale. Genome-wide RNA-binding event detection methods have been developed to aid the understanding of the proteins' functions within cellular processes. Unfortunately, those methods often suffer from realistic restrictions, such as high costs, intensive computation, high dimensionality, numerical instability, and data sparsity. We present a computational method, the multiobjective forest algorithm (MFA), to identify protein-RNA interactions from CLIP-seq data by synergizing multiobjective biogeography-based optimization (BBO) with random forest (RF). Since most of the tree-structured classifiers in RF are unnecessarily bulky, with extra time costs and memory consumption, multiobjective BBO is designed to prune the unsuitable tree-structured classifiers dynamically. Moreover, to direct the evolution dynamics of the MFA, two objective functions are formulated to balance model generality and complexity for robust performance. To validate our MFA method, we compare its performance across 31 large-scale CLIP-seq datasets. The experimental results demonstrate that MFA can obtain superior performance over the current state-of-the-art methods. Mechanistic insights are also revealed and discussed to explore the multifaceted aspects of MFA through data source importance analysis, matrix rank estimations, seeding component perturbations, and multiobjective optimization methodology comparisons.
3
Cheng Z, Liu L, Lin G, Yi C, Chu X, Liang Y, Zhou W, Jin X. ReHiC: Enhancing Hi-C data resolution via residual convolutional network. J Bioinform Comput Biol 2021; 19:2150001. [PMID: 33685371 DOI: 10.1142/s0219720021500013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
High-throughput chromosome conformation capture (Hi-C) is one of the most popular methods for studying the three-dimensional organization of genomes. However, Hi-C protocols can be expensive since they require large amounts of sample material and may be time-consuming. Most commonly used Hi-C data are low-resolution. Such data can only be used to identify large-scale genomic interactions and are not sufficient to identify small-scale patterns. We propose a novel deep learning-based computational approach (named ReHiC) that enhances the resolution of Hi-C data and allows us to obtain high-resolution Hi-C data at a relatively low cost. Our model requires only a 1/16 down-sampling ratio of the original sequencing reads to predict higher-resolution Hi-C data. The predicted data are very close to high-resolution data in terms of numerical distribution and interaction distribution. More importantly, our framework stacks deeper and converges faster due to the residual blocks in the core of the network. Extensive experiments show that ReHiC performs better than HiCPlus and HiCNN, two recently developed and frequently used methods for examining the spatial organization of chromatin structure in the cell. Moreover, extensive experiments verify the portability of our framework: the trained model can also efficiently enhance the Hi-C matrices of other cell types. In conclusion, ReHiC offers more accurate high-resolution image reconstruction in a broad field.
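To make the 1/16 down-sampling ratio in the abstract above concrete, here is a minimal sketch of one common way to simulate low-coverage Hi-C input: keeping each read in a contact-count matrix independently with a fixed probability. The abstract does not say how ReHiC's training data were down-sampled, so this procedure, the function name `downsample_contacts`, and the toy matrix are all assumptions.

```python
import random


def downsample_contacts(matrix, ratio=1 / 16, seed=0):
    """Thin a symmetric contact-count matrix.

    Each read counted in the upper triangle is kept independently with
    probability `ratio`; the result is mirrored so it stays symmetric.
    """
    rng = random.Random(seed)
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # One Bernoulli draw per read in bin pair (i, j).
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < ratio)
            out[i][j] = kept
            out[j][i] = kept
    return out


# Hypothetical 3 x 3 matrix of read counts between genomic bins.
high_res = [
    [320, 160, 16],
    [160, 480, 64],
    [16, 64, 256],
]
low_res = downsample_contacts(high_res)
```

A resolution-enhancement model would then be trained to map matrices like `low_res` back to matrices like `high_res`.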
Affiliation(s)
- Zhe Cheng
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
- Lin Liu
  - School of Information, Yunnan Normal University, Kunming 650000, China
- Guoliang Lin
  - State Key Laboratory for Conservation and Utilization of Bio-resource and School of Life Sciences, Yunnan University, Kunming 650000, China
- Chao Yi
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
- Xing Chu
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
- Yu Liang
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
- Wei Zhou
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
- Xin Jin
  - National Pilot School of Software, Yunnan University, Kunming 650000, China; Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
4
Li X, Zhang S, Wong KC. Nature-Inspired Multiobjective Epistasis Elucidation from Genome-Wide Association Studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:226-237. [PMID: 29994485 DOI: 10.1109/tcbb.2018.2849759] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
In recent years, detecting epistatic interactions among multiple genetic variants that underlie complex diseases has posed a significant challenge in genome-wide association studies (GWAS). However, most of the existing methods still suffer from algorithmic limitations such as single-objective optimization, intensive computational requirements, and premature convergence. In this paper, we propose and formulate an epistatic interaction multi-objective artificial bee colony algorithm based on decomposition (EIMOABC/D) to address those problems for genetic interaction detection in genome-wide association studies. First, to direct the genetic interaction detection, two objective functions are formulated to characterize various epistatic models; a rank probability model is proposed to sort each population into different nondomination levels based on the fast nondominated sorting approach. After that, a mutual information-based local search algorithm is proposed to guide the population search for disease model evaluations in an unbiased manner. To validate the effectiveness of EIMOABC/D, we compare it against seven state-of-the-art methods on 77 epistatic models, including eight small-scale epistatic models with marginal effects, eight large-scale epistatic models with marginal effects, 60 large-scale epistatic models without any marginal effect, and one case study. The experimental results indicate that our proposed algorithm EIMOABC/D outperforms the seven state-of-the-art methods on those epistatic models. Furthermore, time complexity analysis and parameter analysis are conducted to demonstrate various properties of our proposed algorithm.
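The fast nondominated sorting approach named in the abstract above can be sketched as follows. This is the generic procedure that splits candidate solutions into nondomination levels; it does not reproduce the paper's rank probability model or its two objective functions, which the abstract does not define. The sketch assumes both objectives are minimized, and the `scores` values are made-up placeholders.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_sort(points):
    """Return a list of fronts; each front is a sorted list of indices."""
    n = len(points)
    dominated = [[] for _ in range(n)]  # dominated[i]: indices that i dominates
    counts = [0] * n                    # counts[i]: how many points dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                dominated[i].append(j)
            elif dominates(points[j], points[i]):
                counts[i] += 1
    fronts = []
    current = [i for i in range(n) if counts[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated[i]:
                # Removing front member i frees j from one dominator.
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        current = sorted(nxt)
    return fronts


# Hypothetical two-objective scores for five candidate SNP combinations.
scores = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
fronts = nondominated_sort(scores)
```

Here candidates 0, 1, and 3 form the first front because none of them is dominated; candidate 2 is dominated only by candidate 1, and candidate 4 is dominated by all the others.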
5
Classification of high dimensional biomedical data based on feature selection using redundant removal. PLoS One 2019; 14:e0214406. [PMID: 30964868 PMCID: PMC6456288 DOI: 10.1371/journal.pone.0214406] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/10/2018] [Accepted: 03/12/2019] [Indexed: 11/26/2022]
Abstract
High dimensional biomedical data contain tens of thousands of features. Accurate and effective identification of the core features in these data can assist in diagnosing related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously affect subsequent classification accuracy and machine learning efficiency. To solve this problem, a novel filter feature selection algorithm based on redundant removal (FSBRR) is proposed in this paper to classify high dimensional biomedical data. First, two redundancy criteria are determined by vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between one feature and another). Second, to quantify the redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials based on typical feature selection algorithms are conducted using three different classifiers, and the experimental results indicate that the FSBRR algorithm can effectively reduce the feature dimension and improve the classification accuracy. In addition, an experiment on a small sample dataset is designed and conducted in the discussion and analysis section to illustrate the specific implementation process of the FSBRR algorithm.
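The two kinds of relevance described in the abstract above can be illustrated with a generic mutual-information filter. This is a sketch of the general idea only, not the FSBRR algorithm: the greedy selection rule, the thresholds `min_relevance` and `max_redundancy`, and the feature names are hypothetical, and features are assumed to be discrete.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(features, labels, min_relevance=0.1, max_redundancy=0.9):
    """Greedy filter over a dict of name -> discrete feature values.

    Vertical relevance: MI between a feature and the class labels.
    Horizontal relevance: MI between a feature and an already selected one.
    """
    relevance = {name: mutual_information(vals, labels) for name, vals in features.items()}
    ranked = sorted(features, key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # irrelevant to the class attribute
        if any(mutual_information(features[name], features[s]) > max_redundancy
               for s in selected):
            continue  # redundant with a feature that is already kept
        selected.append(name)
    return selected


# Hypothetical toy data: eight samples, four binary features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly informative
    "gene_b": [1, 1, 1, 1, 0, 0, 0, 0],  # mirror of gene_a, hence redundant
    "gene_c": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
    "gene_d": [0, 0, 0, 1, 0, 1, 1, 1],  # weakly informative
}
selected = select_features(features, labels)
```

On this toy data the filter keeps `gene_a` and `gene_d`, drops `gene_b` as redundant with `gene_a`, and drops `gene_c` as irrelevant.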
6
Li X, Zhang S, Wong KC. Single-cell RNA-seq interpretations using evolutionary multiobjective ensemble pruning. Bioinformatics 2018; 35:2809-2817. [DOI: 10.1093/bioinformatics/bty1056] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 07/26/2018] [Revised: 10/31/2018] [Accepted: 12/21/2018] [Indexed: 11/14/2022]
Abstract
Motivation
In recent years, single-cell RNA sequencing has enabled us to discover cell types or even subtypes. Its increasing availability provides opportunities to identify cell populations from single-cell RNA-seq data. Computational methods have been employed to reveal the gene expression variations among multiple cell populations. Unfortunately, the existing methods can suffer from realistic restrictions such as experimental noise, numerical instability, high dimensionality and limited computational scalability.
Results
We propose an evolutionary multiobjective ensemble pruning algorithm (EMEP) that addresses those realistic restrictions. Our EMEP algorithm first applies unsupervised dimensionality reduction to project data from the original high dimensions to low-dimensional subspaces; basic clustering algorithms are then applied in those new subspaces to generate different clustering results that form cluster ensembles. However, most of those cluster ensembles are unnecessarily bulky, at the expense of extra time costs and memory consumption. To overcome that problem, EMEP is designed to dynamically select the suitable clustering results from the ensembles. Moreover, to guide the multiobjective ensemble evolution, three cluster validity indices (the overall cluster deviation, the within-cluster compactness and the number of basic partition clusters) are formulated as the objective functions to unleash its cell type discovery performance using evolutionary multiobjective optimization. We applied EMEP to 55 simulated datasets and seven real single-cell RNA-seq datasets, including six single-cell RNA-seq datasets and one large-scale dataset with 3005 cells and 4412 genes. Two case studies are also conducted to reveal mechanistic insights into the biological relevance of EMEP. We found that EMEP can achieve superior performance over the other clustering algorithms, demonstrating that EMEP can identify cell populations clearly.
Availability and implementation
EMEP is written in Matlab and available at https://github.com/lixt314/EMEP
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiangtao Li
  - School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin, China
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Shixiong Zhang
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
7
Wang Y, Pang W, Zhou Y. Density propagation based adaptive multi-density clustering algorithm. PLoS One 2018; 13:e0198948. [PMID: 30020928 PMCID: PMC6051564 DOI: 10.1371/journal.pone.0198948] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/15/2018] [Accepted: 05/29/2018] [Indexed: 11/21/2022]
Abstract
The performance of density based clustering algorithms may be greatly influenced by the chosen parameter values, and achieving optimal or near optimal results depends heavily on empirical knowledge obtained from previous experiments. To address this limitation, we propose a novel density based clustering algorithm called the Density Propagation based Adaptive Multi-density clustering (DPAM) algorithm. DPAM can adaptively cluster spatial data. To avoid manual intervention when choosing the parameters of density clustering while still achieving high performance, DPAM performs clustering in three stages: (1) generating the micro-clusters graph, (2) propagating density with a redefinition of between-class margin and intra-class cohesion, and (3) calculating regional density. Experimental results demonstrated that DPAM could achieve better performance than several state-of-the-art density clustering algorithms in most of the tested cases. Because no parameters need to be adjusted, the proposed algorithm can achieve promising performance without manual tuning.
Affiliation(s)
- Yizhang Wang
  - College of Computer Science and Technology, Jilin University, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun, China
- Wei Pang
  - Department of Computing Science, University of Aberdeen, United Kingdom
- You Zhou
  - College of Computer Science and Technology, Jilin University, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun, China