1
|
Diop EHS, Ngom A, Prasath VBS. Signal Approximations Based on Nonlinear and Optimal Piecewise Affine Functions. CIRCUITS, SYSTEMS, AND SIGNAL PROCESSING 2023; 42:2366-2384. [DOI: 10.1007/s00034-022-02224-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/13/2021] [Revised: 10/18/2022] [Accepted: 10/19/2022] [Indexed: 09/24/2023]
|
2
|
Mohammadi M. A Compact Neural Network for Fused Lasso Signal Approximator. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:4327-4336. [PMID: 31329147 DOI: 10.1109/tcyb.2019.2925707] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The fused lasso signal approximator (FLSA) is a vital optimization problem with extensive applications in signal processing and biomedical engineering. However, the optimization problem is difficult to solve since it is both nonsmooth and nonseparable. The existing numerical solutions implicate the use of several auxiliary variables in order to deal with the nondifferentiable penalty. Thus, the resulting algorithms are both time- and memory-inefficient. This paper proposes a compact neural network to solve the FLSA. The neural network has a one-layer structure with the number of neurons proportionate to the dimension of the given signal, thanks to the utilization of consecutive projections. The proposed neural network is stable in the Lyapunov sense and is guaranteed to converge globally to the optimal solution of the FLSA. Experiments on several applications from signal processing and biomedical engineering confirm the reasonable performance of the proposed neural network.
Collapse
|
3
|
Xie K, Tian Y, Yuan X. A Density Peak-Based Method to Detect Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 11:632311. [PMID: 33519925 PMCID: PMC7838601 DOI: 10.3389/fgene.2020.632311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2020] [Accepted: 12/21/2020] [Indexed: 11/29/2022] Open
Abstract
Copy number variation (CNV) is a common type of structural variations in human genome and confers biological meanings to human complex diseases. Detection of CNVs is an important step for a systematic analysis of CNVs in medical research of complex diseases. The recent development of next-generation sequencing (NGS) platforms provides unprecedented opportunities for the detection of CNVs at a base-level resolution. However, due to the intrinsic characteristics behind NGS data, accurate detection of CNVs is still a challenging task. In this article, we propose a new density peak-based method, called dpCNV, for the detection of CNVs from NGS data. The algorithm of dpCNV is designed based on density peak clustering algorithm. It extracts two features, i.e., local density and minimum distance, from sequencing read depth (RD) profile and generates a two-dimensional data. Based on the generated data, a two-dimensional null distribution is constructed to test the significance of each genome bin and then the significant genome bins are declared as CNVs. We test the performance of the dpCNV method on a number of simulated datasets and make comparison with several existing methods. The experimental results demonstrate that our proposed method outperforms others in terms of sensitivity and F1-score. We further apply it to a set of real sequencing samples and the results demonstrate the validity of dpCNV. Therefore, we expect that dpCNV can be used as a supplementary to existing methods and may become a routine tool in the field of genome mutation analysis.
Collapse
Affiliation(s)
- Kun Xie
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Ye Tian
- The School of Computer Science and Technology, Xidian University, Xi'an, China.,Xi'an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- The School of Computer Science and Technology, Xidian University, Xi'an, China.,Xi'an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
4
|
Alshawaqfeh M, Al Kawam A, Serpedin E, Datta A. Robust Recurrent CNV Detection in the Presence of Inter-Subject Variability. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1056-1067. [PMID: 30387737 DOI: 10.1109/tcbb.2018.2878560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains a challenging topic. This paper proposes a novel method for identification of the recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrated the ability of our method to separate accurately recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
Collapse
|
5
|
Xi J, Li A, Wang M. HetRCNA: A Novel Method to Identify Recurrent Copy Number Alternations from Heterogeneous Tumor Samples Based on Matrix Decomposition Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:422-434. [PMID: 29994262 DOI: 10.1109/tcbb.2018.2846599] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
A common strategy to discovering cancer associated copy number aberrations (CNAs) from a cohort of cancer samples is to detect recurrent CNAs (RCNAs). Although the previous methods can successfully identify communal RCNAs shared by nearly all tumor samples, detecting subgroup-specific RCNAs and their related subgroup samples from cancer samples with heterogeneity is still invalid for these existing approaches. In this paper, we introduce a novel integrated method called HetRCNA, which can identify statistically significant subgroup-specific RCNAs and their related subgroup samples. Based on matrix decomposition framework with weight constraint, HetRCNA can successfully measure the subgroup samples by coefficients of left vectors with weight constraint and subgroup-specific RCNAs by coefficients of the right vectors and significance test. When we evaluate HetRCNA on simulated dataset, the results show that HetRCNA gives the best performances among the competing methods and is robust to the noise factors of the simulated data. When HetRCNA is applied on a real breast cancer dataset, our approach successfully identifies a bunch of RCNA regions and the result is highly correlated with the results of the other two investigated approaches. Notably, the genomic regions identified by HetRCNA harbor many breast cancer related genes reported by previous researches.
Collapse
|
6
|
Diop EHS, Boudraa AO, Prasath VBS. Optimal Nonlinear Signal Approximations Based on Piecewise Constant Functions. CIRCUITS SYSTEMS AND SIGNAL PROCESSING 2019. [DOI: 10.1007/s00034-019-01285-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
|
7
|
Mohammadi M, Farahi F. An Entropy-Regularized Framework for Detecting Copy Number Variants. IEEE Trans Biomed Eng 2019; 66:682-688. [DOI: 10.1109/tbme.2018.2854628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
8
|
Cai H, Chen P, Chen J, Cai J, Song Y, Han G. WaveDec: A Wavelet Approach to Identify Both Shared and Individual Patterns of Copy-Number Variations. IEEE Trans Biomed Eng 2019; 65:353-364. [PMID: 29346103 DOI: 10.1109/tbme.2017.2769677] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Copy-number variations (CNVs) are associated with complex diseases and particular tumor types. Array-based comparative genomic hybridization (aCGH) is a common approach for the detection of CNVs. Traditional CNV detection methods for multiple aCGH samples mainly use batch samples to find common variations, not accounting for the individual characteristics of each sample. Accurately differentiating both the commonly shared and the individual CNV patterns is pivotal to identify cell populations, or to distinguish cell growth (as in cancer) from invasion of new cells. Our preliminary results have now demonstrated that both the shared and individual CNV patterns have distinctive characteristics after wavelet transform. METHODS To exploit these characteristics, we propose to formulate a quadratic data-separation problem within the wavelet space to discriminate the shared and individual CNVs from raw data. We have elaborated a numerical solution and shown that the solution can be obtained by solving decoupled subproblems. By this approach, computational costs can be limited, enabling efficient application in the analysis of large sequencing datasets. RESULTS The advantages of our proposed method, called WaveDec, have been demonstrated by comparison with popular CNV-detection methods using synthetic and empirical aCGH data. The performance of WaveDec was further validated by experiments with single-cell-sequencing data. CONCLUSION WaveDec can successfully differentiate shared and individual patterns, and performs well even in data contaminated with high levels of noise. SIGNIFICANCE Both the shared and individual patterns can be uniquely characterized as well as effectively decomposed within the wavelet space.
Collapse
|
9
|
Mohammadi M, Tan YH, Hofman W, Mousavi SH. A novel one-layer recurrent neural network for the l1-regularized least square problem. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.07.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
10
|
Mohammadi M, Mansoori A. A Projection Neural Network for Identifying Copy Number Variants. IEEE J Biomed Health Inform 2018; 23:2182-2188. [PMID: 30235154 DOI: 10.1109/jbhi.2018.2871619] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
The identification of copy number variations (CNVs) helps the diagnosis of many diseases. One major hurdle in the path of CNVs discovery is that the boundaries of normal and aberrant regions cannot be distinguished from the raw data, since various types of noise contaminate them. To tackle this challenge, the total variation regularization is mostly used in the optimization problems to approximate the noise-free data from corrupted observations. The minimization using such regularization is challenging to deal with since it is non-differentiable. In this paper, we propose a projection neural network to solve the non-smooth problem. The proposed neural network has a simple one-layer structure and is theoretically assured to have the global exponential convergence to the solution of the total variation-regularized problem. The experiments on several real and simulated datasets illustrate the reasonable performance of the proposed neural network and show that its performance is comparable with those of more sophisticated algorithms.
Collapse
|
11
|
|
12
|
Yuan X, Zhang J, Yang L, Bai J, Fan P. Detection of Significant Copy Number Variations From Multiple Samples in Next-Generation Sequencing Data. IEEE Trans Nanobioscience 2018; 17:12-20. [PMID: 29570071 DOI: 10.1109/tnb.2017.2783910] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Analyzing copy number variations (CNVs) from next-generation sequencing (NGS) data has become a common approach to detect disease susceptibility genes. The main challenge is how to utilize the NGS data with limited coverage depth to detect significant CNVs. Here, we introduce a new statistical method, the derivative of correlation coefficient (DCC), to detect significant CNVs that recurrently occur in multiple samples using read depth signals. We use a sliding window to calculate a correlation coefficient for each genome bin, and compute corresponding derivatives by fitting curves to the correlation coefficient. Then, the detection of significant CNVs was transformed into a problem of detecting significant derivatives reflecting genome breakpoints that can be solved using statistical hypothesis testing. We tested and compared the performance of DCC against several peer methods using a large number of simulation data sets, and validated DCC using several real sequencing data sets derived from the European Genome-Phenome archive, DNA Data Bank of Japan, and the 1000 Genomes Project. Experimental results suggest that DCC is an effective approach for identifying CNVs, outperforming peer methods in the terms of detection power and accuracy. DCC can be used to detect significant or recurrent CNVs in various NGS data sets, thus providing useful information to study genomic mutations and find disease susceptibility genes.
Collapse
|
13
|
Abstract
Many statistical learning methods such as matrix completion, matrix regression, and multiple response regression estimate a matrix of parameters. The nuclear norm regularization is frequently employed to achieve shrinkage and low rank solutions. To minimize a nuclear norm regularized loss function, a vital and most time-consuming step is singular value thresholding, which seeks the singular values of a large matrix exceeding a threshold and their associated singular vectors. Currently MATLAB lacks a function for singular value thresholding. Its built-in svds function computes the top r singular values/vectors by Lanczos iterative method but is only efficient for sparse matrix input, while aforementioned statistical learning algorithms perform singular value thresholding on dense but structured matrices. To address this issue, we provide a MATLAB wrapper function svt that implements singular value thresholding. It encompasses both top singular value decomposition and thresholding, handles both large sparse matrices and structured matrices, and reduces the computation cost in matrix learning algorithms.
Collapse
Affiliation(s)
- Cai Li
- North Carolina State University
| | - Hua Zhou
- University of California, Los Angeles
| |
Collapse
|
14
|
Zhang Y, Cheung YM, Xu B, Su W. Detection Copy Number Variants from NGS with Sparse and Smooth Constraints. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:856-867. [PMID: 27164604 DOI: 10.1109/tcbb.2016.2561933] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
It is known that copy number variations (CNVs) are associated with complex diseases and particular tumor types, thus reliable identification of CNVs is of great potential value. Recent advances in next generation sequencing (NGS) data analysis have helped manifest the richness of CNV information. However, the performances of these methods are not consistent. Reliably finding CNVs in NGS data in an efficient way remains a challenging topic, worthy of further investigation. Accordingly, we tackle the problem by formulating CNVs identification into a quadratic optimization problem involving two constraints. By imposing the constraints of sparsity and smoothness, the reconstructed read depth signal from NGS is anticipated to fit the CNVs patterns more accurately. An efficient numerical solution tailored from alternating direction minimization (ADM) framework is elaborated. We demonstrate the advantages of the proposed method, namely ADM-CNV, by comparing it with six popular CNV detection methods using synthetic, simulated, and empirical sequencing data. It is shown that the proposed approach can successfully reconstruct CNV patterns from raw data, and achieve superior or comparable performance in detection of the CNVs compared to the existing counterparts.
Collapse
|
15
|
A novel network regularized matrix decomposition method to detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Sci Rep 2017; 7:2855. [PMID: 28588243 PMCID: PMC5460199 DOI: 10.1038/s41598-017-03141-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2017] [Accepted: 04/20/2017] [Indexed: 01/01/2023] Open
Abstract
Inter-patient heterogeneity is a major challenge for mutated cancer genes detection which is crucial to advance cancer diagnostics and therapeutics. To detect mutated cancer genes in heterogeneous tumour samples, a prominent strategy is to determine whether the genes are recurrently mutated in their interaction network context. However, recent studies show that some cancer genes in different perturbed pathways are mutated in different subsets of samples. Subsequently, these genes may not display significant mutational recurrence and thus remain undiscovered even in consideration of network information. We develop a novel method called mCGfinder to efficiently detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Based on matrix decomposition framework incorporated with gene interaction network information, mCGfinder can successfully measure the significance of mutational recurrence of genes in a subset of samples. When applying mCGfinder on TCGA somatic mutation datasets of five types of cancers, we find that the genes detected by mCGfinder are significantly enriched for known cancer genes, and yield substantially smaller p-values than other existing methods. All the results demonstrate that mCGfinder is an efficient method in detecting mutated cancer genes.
Collapse
|
16
|
Xi J, Wang M, Li A. Discovering potential driver genes through an integrated model of somatic mutation profiles and gene functional information. MOLECULAR BIOSYSTEMS 2017; 13:2135-2144. [DOI: 10.1039/c7mb00303j] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
An integrated approach to identify driver genes based on information of somatic mutations, the interaction network and Gene Ontology similarity.
Collapse
Affiliation(s)
- Jianing Xi
- School of Information Science and Technology
- University of Science and Technology of China
- Hefei AH 230027
- People’s Republic of China
| | - Minghui Wang
- School of Information Science and Technology
- University of Science and Technology of China
- Hefei AH 230027
- People’s Republic of China
- Centers for Biomedical Engineering
| | - Ao Li
- School of Information Science and Technology
- University of Science and Technology of China
- Hefei AH 230027
- People’s Republic of China
- Centers for Biomedical Engineering
| |
Collapse
|
17
|
Sharifi Noghabi H, Mohammadi M, Tan Y. Robust group fused lasso for multisample copy number variation detection under uncertainty. IET Syst Biol 2016; 10:229-236. [DOI: 10.1049/iet-syb.2015.0081] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Affiliation(s)
- Hossein Sharifi Noghabi
- Department of Computer EngineeringFerdowsi University of MashhadIran
- The Center of Excellence of Soft Computing and Intelligent Information Processing (SCIIP)Ferdowsi University of MashhadIran
| | - Majid Mohammadi
- Department of Technology, Policy and ManagementDelft University of TechnologyNetherlands
| | - Yao‐Hua Tan
- Department of Technology, Policy and ManagementDelft University of TechnologyNetherlands
| |
Collapse
|
18
|
Xi J, Li A. Discovering Recurrent Copy Number Aberrations in Complex Patterns via Non-Negative Sparse Singular Value Decomposition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:656-668. [PMID: 26372614 DOI: 10.1109/tcbb.2015.2474404] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Recurrent copy number aberrations (RCNAs) in multiple cancer samples are strongly associated with tumorigenesis, and RCNA discovery is helpful to cancer research and treatment. Despite the emergence of numerous RCNA discovering methods, most of them are unable to detect RCNAs in complex patterns that are influenced by complicating factors including aberration in partial samples, co-existing of gains and losses and normal-like tumor samples. Here, we propose a novel computational method, called non-negative sparse singular value decomposition (NN-SSVD), to address the RCNA discovering problem in complex patterns. In NN-SSVD, the measurement of RCNA is based on the aberration frequency in a part of samples rather than all samples, which can circumvent the complexity of different RCNA patterns. We evaluate NN-SSVD on synthetic dataset by comparison on detection scores and Receiver Operating Characteristics curves, and the results show that NN-SSVD outperforms existing methods in RCNA discovery and demonstrate more robustness to RCNA complicating factors. Applying our approach on a breast cancer dataset, we successfully identify a number of genomic regions that are strongly correlated with previous studies, which harbor a bunch of known breast cancer associated genes.
Collapse
|
19
|
Wang D, Gu J. Integrative clustering methods of multi-omics data for molecule-based cancer classifications. QUANTITATIVE BIOLOGY 2016. [DOI: 10.1007/s40484-016-0063-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
20
|
Gao X. Penalized weighted low-rank approximation for robust recovery of recurrent copy number variations. BMC Bioinformatics 2015; 16:407. [PMID: 26652207 PMCID: PMC4676147 DOI: 10.1186/s12859-015-0835-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2015] [Accepted: 11/23/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Copy number variation (CNV) analysis has become one of the most important research areas for understanding complex disease. With increasing resolution of array-based comparative genomic hybridization (aCGH) arrays, more and more raw copy number data are collected for multiple arrays. It is natural to realize the co-existence of both recurrent and individual-specific CNVs, together with the possible data contamination during the data generation process. Therefore, there is a great need for an efficient and robust statistical model for simultaneous recovery of both recurrent and individual-specific CNVs. RESULT We develop a penalized weighted low-rank approximation method (WPLA) for robust recovery of recurrent CNVs. In particular, we formulate multiple aCGH arrays into a realization of a hidden low-rank matrix with some random noises and let an additional weight matrix account for those individual-specific effects. Thus, we do not restrict the random noise to be normally distributed, or even homogeneous. We show its performance through three real datasets and twelve synthetic datasets from different types of recurrent CNV regions associated with either normal random errors or heavily contaminated errors. CONCLUSION Our numerical experiments have demonstrated that the WPLA can successfully recover the recurrent CNV patterns from raw data under different scenarios. Compared with two other recent methods, it performs the best regarding its ability to simultaneously detect both recurrent and individual-specific CNVs under normal random errors. More importantly, the WPLA is the only method which can effectively recover the recurrent CNVs region when the data is heavily contaminated.
Collapse
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, University of North Carolina at Greensboro, 1400 Spring Garden St, Greensoboro, NC, USA.
| |
Collapse
|
21
|
Wu D, Wang D, Zhang MQ, Gu J. Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics 2015; 16:1022. [PMID: 26626453 PMCID: PMC4667498 DOI: 10.1186/s12864-015-2223-8] [Citation(s) in RCA: 87] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2015] [Accepted: 11/16/2015] [Indexed: 11/30/2022] Open
Abstract
Background One major goal of large-scale cancer omics study is to identify molecular subtypes for more accurate cancer diagnoses and treatments. To deal with high-dimensional cancer multi-omics data, a promising strategy is to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integrative and efficiently find the principal low-dimensional manifold of the high-dimensional cancer multi-omics data. Results In this study, we proposed a novel low-rank approximation based integrative probabilistic model to fast find the shared principal subspace across multiple data types: the convexity of the low-rank regularized likelihood function of the probabilistic model ensures efficient and stable model fitting. Candidate molecular subtypes can be identified by unsupervised clustering hundreds of cancer samples in the reduced low-dimensional subspace. On testing datasets, our method LRAcluster (low-rank approximation based multi-omics data clustering) runs much faster with better clustering performances than the existing method. Then, we applied LRAcluster on large-scale cancer multi-omics data from TCGA. The pan-cancer analysis results show that the cancers of different tissue origins are generally grouped as independent clusters, except squamous-like carcinomas. While the single cancer type analysis suggests that the omics data have different subtyping abilities for different cancer types. Conclusions LRAcluster is a very useful method for fast dimension reduction and unsupervised clustering of large-scale multi-omics data. LRAcluster is implemented in R and freely available via http://bioinfo.au.tsinghua.edu.cn/software/lracluster/. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2223-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Dingming Wu
- MOE Key Laboratory of Bioinformatics, TNLIST Bioinformatics Division & Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Dongfang Wang
- MOE Key Laboratory of Bioinformatics, TNLIST Bioinformatics Division & Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Michael Q Zhang
- MOE Key Laboratory of Bioinformatics, TNLIST Bioinformatics Division & Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China. .,Department of Biological Sciences, Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA.
| | - Jin Gu
- MOE Key Laboratory of Bioinformatics, TNLIST Bioinformatics Division & Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China.
| |
Collapse
|