1
Liu F, Yang Y, Xu XS, Yuan M. MESBC: A novel mutually exclusive spectral biclustering method for cancer subtyping. Comput Biol Chem 2024; 109:108009. [PMID: 38219419 DOI: 10.1016/j.compbiolchem.2023.108009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 01/16/2024]
Abstract
Many soft biclustering algorithms have been developed and applied to various biological and biomedical data analyses. However, few mutually exclusive (hard) biclustering algorithms have been proposed, which could better identify disease or molecular subtypes with survival significance based on genomic or transcriptomic data. In this study, we developed a novel mutually exclusive spectral biclustering (MESBC) algorithm based on a spectral method to detect mutually exclusive biclusters. MESBC simultaneously detects relevant features (genes) and the corresponding condition (patient) subgroups and, therefore, automatically uses the signature features for each subtype to perform the clustering. Extensive simulations revealed that MESBC provided superior accuracy in detecting pre-specified biclusters compared with non-negative matrix factorization (NMF) and Dhillon's algorithm, particularly in very noisy data. Further analysis of the algorithm on real datasets obtained from the TCGA database showed that MESBC provided more accurate (i.e., smaller p-value) overall survival prediction in patients with lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) when compared to the existing, gold-standard subtypes for lung cancers (integrative clustering). Furthermore, MESBC detected several genes with significant prognostic value in both LUAD and LUSC patients. External validation on an independent, unseen GEO dataset of LUAD showed that MESBC-derived clusters based on TCGA data still exhibited clear biclustering patterns and consistent, outstanding prognostic predictability, demonstrating robust generalizability of MESBC. Therefore, MESBC could potentially be used as a risk stratification tool to optimize treatment for patients, improve the selection of patients for clinical trials, and contribute to the development of novel therapeutic agents.
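The abstract names Dhillon's algorithm as a spectral baseline but does not describe any procedure. As a rough illustration of what a hard (mutually exclusive) spectral split of rows and columns can look like, the Python sketch below follows the two-cluster case of Dhillon's bipartite spectral co-clustering: normalise the matrix by its row and column sums, take the second singular vectors, and split on their sign. It is not MESBC. The function name `dhillon_bipartition`, the power-iteration solver, the sign threshold (used here instead of k-means) and the toy matrix are all assumptions made for illustration.

```python
import math


def dhillon_bipartition(A, n_iter=200):
    """Split rows and columns of a non-negative matrix into two co-clusters.

    Assumes every row and column sum is strictly positive and that the
    matrix has at least two columns (illustrative sketch only).
    """
    m, n = len(A), len(A[0])
    d1 = [sum(row) for row in A]                             # row sums
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]  # column sums

    # Normalised matrix An = D1^{-1/2} A D2^{-1/2}.
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n)]
          for i in range(m)]

    # The leading right singular vector of An is proportional to sqrt(d2).
    total = sum(d2)
    v1 = [math.sqrt(x / total) for x in d2]

    def deflate_and_normalise(v):
        # Remove the component along v1, then scale to unit length.
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    # Power iteration on An^T An, kept orthogonal to v1, converges to the
    # second right singular vector when there is a spectral gap.
    v = deflate_and_normalise([(-1.0) ** j + 0.01 * j for j in range(n)])
    for _ in range(n_iter):
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = deflate_and_normalise(
            [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)])
    u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]

    # Rescale by D^{-1/2} and split on sign (a simplification of k-means).
    row_labels = [int(u[i] / math.sqrt(d1[i]) > 0) for i in range(m)]
    col_labels = [int(v[j] / math.sqrt(d2[j]) > 0) for j in range(n)]
    return row_labels, col_labels


# Toy matrix (hypothetical): rows 0-1 are high in columns 0-2,
# rows 2-3 are high in columns 3-4.
toy = [[5, 5, 5, 1, 1],
       [5, 5, 5, 1, 1],
       [1, 1, 1, 5, 5],
       [1, 1, 1, 5, 5]]
rows, cols = dhillon_bipartition(toy)
print(rows, cols)
```

Each row and each column receives exactly one label, which is the sense in which the resulting biclusters are mutually exclusive. A row and a column that share a label belong to the same bicluster.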
Affiliation(s)
- Fengrong Liu
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Min Yuan
  - School of Public Health Administration, Anhui Medical University, Hefei 230032, China
2
Ai D, Chen L, Xie J, Cheng L, Zhang F, Luan Y, Li Y, Hou S, Sun F, Xia LC. Identifying local associations in biological time series: algorithms, statistical significance, and applications. Brief Bioinform 2023; 24:bbad390. [PMID: 37930023 DOI: 10.1093/bib/bbad390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 08/21/2023] [Accepted: 09/14/2023] [Indexed: 11/07/2023] Open
Abstract
Local associations refer to spatial-temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.
Affiliation(s)
- Dongmei Ai
  - School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Lulu Chen
  - School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Jiemin Xie
  - Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
- Longwei Cheng
  - School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Fang Zhang
  - Shenwan Hongyuan Securities Co. Ltd., Shanghai 200031, China
- Yihui Luan
  - School of Mathematics, Shandong University, Jinan 250100, China
- Yang Li
  - Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
- Shengwei Hou
  - Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
- Fengzhu Sun
  - Department of Quantitative and Computational Biology, University of Southern California, California, 90007, USA
- Li Charlie Xia
  - Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
3
Shan A, Zhang F, Luan Y. Efficient Approximation of Statistical Significance in Local Trend Analysis of Dependent Time Series. Front Genet 2022; 13:729011. [PMID: 35559007 PMCID: PMC9086404 DOI: 10.3389/fgene.2022.729011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2021] [Accepted: 03/01/2022] [Indexed: 11/13/2022] Open
Abstract
Biological time series data plays an important role in exploring the dynamic changes of biological systems, while the determinate patterns of association between various biological factors can further deepen the understanding of biological system functions and the interactions between them. At present, local trend analysis (LTA) is commonly conducted in many biological fields, where the biological time series data can be a sequence at the level of gene expression, OTU abundance, etc. A local trend score can be obtained by taking the degree of similarity of the upward, constant or downward trends of time series data as an indicator of the correlation between different biological factors. However, a major limitation of local trend analysis is that the permutation test used to calculate its statistical significance is time-consuming. Therefore, a problem attracting much attention from bioinformatics scientists is how to evaluate the statistical significance of local trend scores quickly and effectively. In this paper, a new approach is proposed to efficiently approximate the statistical significance in the local trend analysis of dependent time series, and the effectiveness of the new method is demonstrated through simulation and real data set analysis.
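The local trend score described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: each series is reduced to per-step trends (+1 up, 0 constant, -1 down), and a dynamic programme finds the best-scoring contiguous local alignment of the two trend series within a maximum delay. The function names, the absolute-difference `threshold`, and the normalisation by the number of steps are assumptions, and both series are assumed to have the same length.

```python
def trend_series(x, threshold=0.0):
    """Map a series to +1 (up), -1 (down) or 0 (constant) for each step."""
    out = []
    for prev, cur in zip(x, x[1:]):
        diff = cur - prev
        out.append(1 if diff > threshold else -1 if diff < -threshold else 0)
    return out


def local_trend_score(x, y, max_delay=0, threshold=0.0):
    """Magnitude of the best local alignment of two trend series.

    pos tracks runs where the trends agree, neg tracks runs where they
    oppose; only cells with |i - j| <= max_delay are allowed.
    """
    dx, dy = trend_series(x, threshold), trend_series(y, threshold)
    n = len(dx)
    pos = [[0] * (n + 1) for _ in range(n + 1)]
    neg = [[0] * (n + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if abs(i - j) > max_delay:
                continue  # outside the allowed delay band
            prod = dx[i - 1] * dy[j - 1]
            pos[i][j] = max(0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0, neg[i - 1][j - 1] - prod)
            best = max(best, pos[i][j], neg[i][j])
    return best / n


# Hypothetical example: y follows the same shape as x one step later.
x = [1, 2, 3, 2, 1, 1]
y = [0, 1, 2, 3, 2, 1]
print(local_trend_score(x, y, max_delay=0))
print(local_trend_score(x, y, max_delay=1))
```

Allowing a delay of one step lets the alignment recover the shifted co-trend that the zero-delay comparison misses.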
Affiliation(s)
- Ang Shan
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - Postdoctoral Programme of Zhongtai Securities Co. Ltd, Jinan, China
- Fang Zhang
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Yihui Luan
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
4
Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022] Open
Abstract
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including, but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in the public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.
5
Kehl T, Schneider L, Kattler K, Stöckel D, Wegert J, Gerstner N, Ludwig N, Distler U, Schick M, Keller U, Tenzer S, Gessler M, Walter J, Keller A, Graf N, Meese E, Lenhof HP. REGGAE: a novel approach for the identification of key transcriptional regulators. Bioinformatics 2019; 34:3503-3510. [PMID: 29741575 PMCID: PMC6184769 DOI: 10.1093/bioinformatics/bty372] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/22/2017] [Accepted: 05/03/2018] [Indexed: 12/13/2022] Open
Abstract
Motivation: Transcriptional regulators play a major role in most biological processes. Alterations in their activities are associated with a variety of diseases and in particular with tumor development and progression. Hence, it is important to assess the effects of deregulated regulators on pathological processes. Results: Here, we present REGulator-Gene Association Enrichment (REGGAE), a novel method for the identification of key transcriptional regulators that have a significant effect on the expression of a given set of genes, e.g. genes that are differentially expressed between two sample groups. REGGAE uses a Kolmogorov-Smirnov-like test statistic that implicitly combines associations between regulators and their target genes with an enrichment approach to prioritize the influence of transcriptional regulators. We evaluated our method in two different application scenarios, which demonstrate that REGGAE is well suited for uncovering the influence of transcriptional regulators and is a valuable tool for the elucidation of complex regulatory mechanisms. Availability and implementation: REGGAE is freely available at https://regulatortrail.bioinf.uni-sb.de. Supplementary information: Supplementary data are available at Bioinformatics online.
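The abstract does not define REGGAE's statistic, so the following Python sketch only illustrates the general idea behind a Kolmogorov-Smirnov-like enrichment score: walk down a ranked list, step up at each member of a set of interest and down otherwise, and report the largest deviation of the running sum from zero. The function name `ks_running_sum`, the unweighted step sizes and the toy gene labels are assumptions. How REGGAE builds its ranked list and combines regulator-target associations is not shown here.

```python
def ks_running_sum(ranked_items, target_set):
    """Unweighted KS-style running-sum statistic for a ranked list.

    Returns the running-sum value with the largest absolute deviation
    from zero: positive when targets cluster near the top of the list,
    negative when they cluster near the bottom.
    """
    hits = sum(1 for item in ranked_items if item in target_set)
    misses = len(ranked_items) - hits
    if hits == 0 or misses == 0:
        return 0.0  # statistic is undefined without both hits and misses
    running = 0.0
    best = 0.0
    for item in ranked_items:
        # Step up by 1/hits on a target, down by 1/misses otherwise,
        # so the running sum starts and ends at zero.
        running += 1.0 / hits if item in target_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked gene list and target set.
ranked = ["g1", "g2", "g3", "g4"]
print(ks_running_sum(ranked, {"g1", "g2"}))
print(ks_running_sum(ranked, {"g3", "g4"}))
```

Significance for a score like this is usually assessed against a null distribution, for example by permuting the list, which is outside the scope of this sketch.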
Affiliation(s)
- Tim Kehl
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
- Lara Schneider
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
- Kathrin Kattler
  - Department of Genetics, Saarland University, Saarbrücken D-66041, Germany
- Daniel Stöckel
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
- Jenny Wegert
  - Theodor-Boveri-Institute/Biocenter, Developmental Biochemistry, and Comprehensive Cancer Center Mainfranken, Würzburg University, Würzburg, Germany
- Nico Gerstner
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
- Nicole Ludwig
  - Department of Human Genetics, Medical School, Saarland University, Homburg, Germany
- Ute Distler
  - Institute for Immunology, Johannes Gutenberg University Mainz, Mainz, Germany
- Markus Schick
  - Department of Internal Medicine III, School of Medicine, Technische Universität München, Munich, Germany
- Ulrich Keller
  - Department of Internal Medicine III, School of Medicine, Technische Universität München, Munich, Germany
  - German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany
- Stefan Tenzer
  - Institute for Immunology, Johannes Gutenberg University Mainz, Mainz, Germany
- Manfred Gessler
  - Theodor-Boveri-Institute/Biocenter, Developmental Biochemistry, and Comprehensive Cancer Center Mainfranken, Würzburg University, Würzburg, Germany
- Jörn Walter
  - Department of Genetics, Saarland University, Saarbrücken D-66041, Germany
- Andreas Keller
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
- Norbert Graf
  - Department of Pediatric Oncology and Hematology, Medical School, Saarland University, Homburg, Germany
- Eckart Meese
  - Department of Human Genetics, Medical School, Saarland University, Homburg, Germany
- Hans-Peter Lenhof
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken D-66041, Germany
6
Kehl T, Schneider L, Schmidt F, Stöckel D, Gerstner N, Backes C, Meese E, Keller A, Schulz MH, Lenhof HP. RegulatorTrail: a web service for the identification of key transcriptional regulators. Nucleic Acids Res 2017; 45:W146-W153. [PMID: 28472408 PMCID: PMC5570139 DOI: 10.1093/nar/gkx350] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 02/10/2017] [Revised: 04/07/2017] [Accepted: 04/20/2017] [Indexed: 12/14/2022] Open
Abstract
Transcriptional regulators such as transcription factors and chromatin modifiers play a central role in most biological processes. Alterations in their activities have been observed in many diseases, e.g. cancer. Hence, it is of utmost importance to evaluate and assess the effects of transcriptional regulators on natural and pathogenic processes. Here, we present RegulatorTrail, a web service that provides rich functionality for the identification and prioritization of key transcriptional regulators that have a strong impact on, e.g. pathological processes. RegulatorTrail offers eight methods that use regulator binding information in combination with transcriptomic or epigenomic data to infer the most influential regulators. Our web service not only provides an intuitive web interface, but also a well-documented RESTful API that allows for a straightforward integration into third-party workflows. The presented case studies highlight the capabilities of our web service and demonstrate its potential for the identification of influential regulators: we successfully identified regulators that might explain the increased malignancy in metastatic melanoma compared to primary tumors, as well as important regulators in macrophages. RegulatorTrail is freely accessible at: https://regulatortrail.bioinf.uni-sb.de/.
Affiliation(s)
- Tim Kehl
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Lara Schneider
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Florian Schmidt
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
  - Cluster of Excellence Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Daniel Stöckel
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Nico Gerstner
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Christina Backes
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Eckart Meese
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
  - Human Genetics, Saarland University, 66421 Homburg, Germany
- Andreas Keller
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Marcel H Schulz
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
  - Cluster of Excellence Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarland University, Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Hans-Peter Lenhof
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
7
Quantitative assessment of gene expression network module-validation methods. Sci Rep 2015; 5:15258. [PMID: 26470848 PMCID: PMC4607977 DOI: 10.1038/srep15258] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 03/20/2015] [Accepted: 09/21/2015] [Indexed: 02/01/2023] Open
Abstract
Validation of pluripotent modules in diverse networks holds enormous potential for systems biology and network pharmacology. An arising challenge is how to assess the accuracy of discovering all potential modules from multi-omic networks and validating their architectural characteristics based on innovative computational methods beyond function enrichment and biological validation. To display the framework progress in this domain, we systematically divided the existing Computational Validation Approaches based on Modular Architecture (CVAMA) into topology-based approaches (TBA) and statistics-based approaches (SBA). We compared the available module validation methods based on 11 gene expression datasets, and partially consistent results in the form of homogeneous models were obtained with each individual approach, whereas discrepant contradictory results were found between TBA and SBA. The TBA of the Zsummary value had a higher Validation Success Ratio (VSR) (51%) and a higher Fluctuation Ratio (FR) (80.92%), whereas the SBA of the approximately unbiased (AU) p-value had a lower VSR (12.3%) and a lower FR (45.84%). The Gray area simulated study revealed a consistent result for these two models and indicated a lower Variation Ratio (VR) (8.10%) of TBA at 6 simulated levels. Despite facing many novel challenges and evidence limitations, CVAMA may offer novel insights into modular networks.
8
Xia LC, Ai D, Cram JA, Liang X, Fuhrman JA, Sun F. Statistical significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains. BMC Bioinformatics 2015; 16:301. [PMID: 26390921 PMCID: PMC4578688 DOI: 10.1186/s12859-015-0732-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 03/17/2015] [Accepted: 09/05/2015] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Local trend (i.e. shape) analysis of time series data reveals co-changing patterns in dynamics of biological systems. However, slow permutation procedures to evaluate the statistical significance of local trend scores have limited its applications to high-throughput time series data analysis, e.g., data from the next generation sequencing technology based studies. RESULTS: By extending the theories for the tail probability of the range of sum of Markovian random variables, we propose formulae for approximating the statistical significance of local trend scores. Using simulations and real data, we show that the approximate p-value is close to that obtained using a large number of permutations (starting at time points >20 with no delay and >30 with delay of at most three time steps) in that the non-zero decimals of the p-values obtained by the approximation and the permutations are mostly the same when the approximate p-value is less than 0.05. In addition, the approximate p-value is slightly larger than that based on permutations, making hypothesis testing based on the approximate p-value conservative. The approximation enables efficient calculation of p-values for pairwise local trend analysis, making large scale all-versus-all comparisons possible. We also propose a hybrid approach by integrating the approximation and permutations to obtain accurate p-values for significantly associated pairs. We further demonstrate its use with the analysis of the Plymouth Marine Laboratory (PML) microbial community time series from high-throughput sequencing data and find interesting organism co-occurrence dynamic patterns. AVAILABILITY: The software tool is integrated into the eLSA software package that now provides accelerated local trend and similarity analysis pipelines for time series data. The package is freely available from the eLSA website: http://bitbucket.org/charade/elsa.
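To make the cost of the permutation baseline concrete, here is a minimal Python sketch of a permutation p-value for a trend score. It is not the eLSA implementation and it does not reproduce the Markov-chain approximation proposed in the paper. The helper names, the zero-delay score, the default of 200 permutations and the add-one p-value estimate are assumptions chosen for illustration.

```python
import random


def sign(v):
    """Return +1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def zero_delay_trend_score(x, y):
    """Best run of agreeing or opposing trend steps, scaled by step count."""
    prods = [sign(b - a) * sign(d - c)
             for a, b, c, d in zip(x, x[1:], y, y[1:])]
    best = pos = neg = 0
    for p in prods:
        pos = max(0, pos + p)   # run of agreeing trends
        neg = max(0, neg - p)   # run of opposing trends
        best = max(best, pos, neg)
    return best / len(prods)


def permutation_p_value(x, y, score_fn, n_perm=200, seed=0):
    """Estimate P(score >= observed) by shuffling x and rescoring."""
    rng = random.Random(seed)
    observed = score_fn(x, y)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_fn(shuffled, y) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical series: a rise followed by a fall, compared with itself.
x = list(range(10)) + list(range(9, -1, -1))
print(permutation_p_value(x, x, zero_delay_trend_score))
```

Every pair of series needs `n_perm` rescoring passes, so an all-versus-all comparison grows with the number of pairs times the number of permutations, which is the bottleneck the approximation targets. Plain shuffling also ignores any autocorrelation in the series.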
Affiliation(s)
- Li C Xia
  - Department of Medicine, Division of Oncology, Stanford University School of Medicine, Stanford, 94305-5151, CA, USA
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, 19104, PA, USA
- Dongmei Ai
  - School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, 100083, China
- Jacob A Cram
  - Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-0371, CA, USA
- Xiaoyi Liang
  - School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, 100083, China
- Jed A Fuhrman
  - Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-0371, CA, USA
- Fengzhu Sun
  - Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-2910, CA, USA
  - Centre for Computational Systems Biology, Fudan University, Shanghai, 200433, China
9
Cui Y, Zheng CH, Yang J. Identifying subspace gene clusters from microarray data using low-rank representation. PLoS One 2013; 8:e59377. [PMID: 23527177 PMCID: PMC3602020 DOI: 10.1371/journal.pone.0059377] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 10/23/2012] [Accepted: 02/13/2013] [Indexed: 12/23/2022] Open
Abstract
Identifying subspace gene clusters from gene expression data is useful for discovering novel functional gene interactions. In this paper, we propose to use low-rank representation (LRR) to identify the subspace gene clusters from microarray data. LRR seeks the lowest-rank representation among all the candidates that can represent the genes as linear combinations of the bases in the dataset. The clusters can be extracted based on the block diagonal representation matrix obtained using LRR, and they capture the intrinsic patterns of genes with similar functions well. Meanwhile, the parameter of LRR can balance the effect of noise so that the method is capable of extracting useful information from data with a high level of background noise. Compared with traditional methods, our approach can identify genes with similar functions that do not have similar expression profiles. It can also assign one gene to several clusters. Moreover, our method is robust to noise and can identify more biologically relevant gene clusters. When applied to three public datasets, the results show that the LRR-based method is superior to existing methods for identifying subspace gene clusters.
Affiliation(s)
- Yan Cui
  - School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Chun-Hou Zheng
  - College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui, China
- Jian Yang
  - School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, Jiangsu, China