1
|
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
Collapse
Affiliation(s)
- Eduardo N Castanho
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
| | - Helena Aidos
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
| |
Collapse
|
2
|
Jia X, Yin Z, Peng Y. Gene differential co-expression analysis of male infertility patients based on statistical and machine learning methods. Front Microbiol 2023; 14:1092143. [PMID: 36778885 PMCID: PMC9911419 DOI: 10.3389/fmicb.2023.1092143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Accepted: 01/11/2023] [Indexed: 01/28/2023] Open
Abstract
Male infertility has always been one of the important factors affecting the infertility of couples of gestational age. The reasons that affect male infertility includes living habits, hereditary factors, etc. Identifying the genetic causes of male infertility can help us understand the biology of male infertility, as well as the diagnosis of genetic testing and the determination of clinical treatment options. While current research has made significant progress in the genes that cause sperm defects in men, genetic studies of sperm content defects are still lacking. This article is based on a dataset of gene expression data on the X chromosome in patients with azoospermia, mild and severe oligospermia. Due to the difference in the degree of disease between patients and the possible difference in genetic causes, common classical clustering methods such as k-means, hierarchical clustering, etc. cannot effectively identify samples (realize simultaneous clustering of samples and features). In this paper, we use machine learning and various statistical methods such as hypergeometric distribution, Gibbs sampling, Fisher test, etc. and genes the interaction network for cluster analysis of gene expression data of male infertility patients has certain advantages compared with existing methods. The cluster results were identified by differential co-expression analysis of gene expression data in male infertility patients, and the model recognition clusters were analyzed by multiple gene enrichment methods, showing different degrees of enrichment in various enzyme activities, cancer, virus-related, ATP and ADP production, and other pathways. At the same time, as this paper is an unsupervised analysis of genetic factors of male infertility patients, we constructed a simulated data set, in which the clustering results have been determined, which can be used to measure the effect of discriminant model recognition. Through comparison, it finds that the proposed model has a better identification effect.
Collapse
|
3
|
Abstract
Sensors deployed within water distribution systems collect consumption data that enable the application of data analysis techniques to extract essential information. Time series clustering has been traditionally applied for modeling end-user water consumption profiles to aid water management. However, its effectiveness is limited by the diversity and local nature of consumption patterns. In addition, existing techniques cannot adequately handle changes in household composition, disruptive events (e.g., vacations), and consumption dynamics at different time scales. In this context, biclustering approaches provide a natural alternative to detect groups of end-users with coherent consumption profiles during local time periods while addressing the aforementioned limitations. This work discusses when, why and how to apply biclustering techniques for water consumption data analysis, and further proposes a methodology to this end. To the best of our knowledge, this is the first work introducing biclustering to water consumption data analysis. Results on data from a real-world water distribution system—Quinta do Lago, Portugal—confirm the potentialities of the proposed approach for pattern discovery with guarantees of statistical significance and robustness that entities can rely on for strategic planning.
Collapse
|
4
|
Shan A, Zhang F, Luan Y. Efficient Approximation of Statistical Significance in Local Trend Analysis of Dependent Time Series. Front Genet 2022; 13:729011. [PMID: 35559007 PMCID: PMC9086404 DOI: 10.3389/fgene.2022.729011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Accepted: 03/01/2022] [Indexed: 11/13/2022] Open
Abstract
Biological time series data plays an important role in exploring the dynamic changes of biological systems, while the determinate patterns of association between various biological factors can further deepen the understanding of biological system functions and the interactions between them. At present, local trend analysis (LTA) has been commonly conducted in many biological fields, where the biological time series data can be the sequence at either the level of gene expression or OTU abundance, etc., A local trend score can be obtained by taking the similarity degree of the upward, constant or downward trend of time series data as an indicator of the correlation between different biological factors. However, a major limitation facing local trend analysis is that the permutation test conducted to calculate its statistical significance requires a time-consuming process. Therefore, the problem attracting much attention from bioinformatics scientists is to develop a method of evaluating the statistical significance of local trend scores quickly and effectively. In this paper, a new approach is proposed to evaluate the efficient approximation of statistical significance in the local trend analysis of dependent time series, and the effectiveness of the new method is demonstrated through simulation and real data set analysis.
Collapse
Affiliation(s)
- Ang Shan
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Postdoctoral Programme of Zhongtai Securities Co. Ltd, Jinan, China
| | - Fang Zhang
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
| | - Yihui Luan
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
| |
Collapse
|
5
|
Chang H, Zhang H, Zhang T, Su L, Qin QM, Li G, Li X, Wang L, Zhao T, Zhao E, Zhao H, Liu Y, Stacey G, Xu D. A Multi-Level Iterative Bi-Clustering Method for Discovering miRNA Co-regulation Network of Abiotic Stress Tolerance in Soybeans. FRONTIERS IN PLANT SCIENCE 2022; 13:860791. [PMID: 35463453 PMCID: PMC9021755 DOI: 10.3389/fpls.2022.860791] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/23/2022] [Accepted: 02/24/2022] [Indexed: 06/14/2023]
Abstract
Although growing evidence shows that microRNA (miRNA) regulates plant growth and development, miRNA regulatory networks in plants are not well understood. Current experimental studies cannot characterize miRNA regulatory networks on a large scale. This information gap provides an excellent opportunity to employ computational methods for global analysis and generate valuable models and hypotheses. To address this opportunity, we collected miRNA-target interactions (MTIs) and used MTIs from Arabidopsis thaliana and Medicago truncatula to predict homologous MTIs in soybeans, resulting in 80,235 soybean MTIs in total. A multi-level iterative bi-clustering method was developed to identify 483 soybean miRNA-target regulatory modules (MTRMs). Furthermore, we collected soybean miRNA expression data and corresponding gene expression data in response to abiotic stresses. By clustering these data, 37 MTRMs related to abiotic stresses were identified, including stress-specific MTRMs and shared MTRMs. These MTRMs have gene ontology (GO) enrichment in resistance response, iron transport, positive growth regulation, etc. Our study predicts soybean MTRMs and miRNA-GO networks under different stresses, and provides miRNA targeting hypotheses for experimental analyses. The method can be applied to other biological processes and other plants to elucidate miRNA co-regulation mechanisms.
Collapse
Affiliation(s)
- Haowu Chang
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Hao Zhang
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Tianyue Zhang
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Lingtao Su
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
| | - Qing-Ming Qin
- College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
| | - Guihua Li
- College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
| | - Xueqing Li
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Li Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Tianheng Zhao
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Enshuang Zhao
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Hengyi Zhao
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
| | - Yuanning Liu
- Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Gary Stacey
- Division of Plant Sciences and Technology, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Dong Xu
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| |
Collapse
|
6
|
Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022] Open
Abstract
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.
Collapse
|
7
|
Zhang F, Sun F, Luan Y. Statistical significance approximation for local similarity analysis of dependent time series data. BMC Bioinformatics 2019; 20:53. [PMID: 30691412 PMCID: PMC6348690 DOI: 10.1186/s12859-019-2595-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2017] [Accepted: 01/03/2019] [Indexed: 11/29/2022] Open
Abstract
Background Local similarity analysis (LSA) of time series data has been extensively used to investigate the dynamics of biological systems in a wide range of environments. Recently, a theoretical method was proposed to approximately calculate the statistical significance of local similarity (LS) scores. However, the method assumes that the time series data are independent identically distributed, which can be violated in many problems. Results In this paper, we develop a novel approach to accurately approximate statistical significance of LSA for dependent time series data using nonparametric kernel estimated long-run variance. We also investigate an alternative method for LSA statistical significance approximation by computing the local similarity score of the residuals based on a predefined statistical model. We show by simulations that both methods have controllable type I errors for dependent time series, while other approaches for statistical significance can be grossly oversized. We apply both methods to human and marine microbial datasets, where most of possible significant associations are captured and false positives are efficiently controlled. Conclusions Our methods provide fast and effective approaches for evaluating statistical significance of dependent time series data with controllable type I error. They can be applied to a variety of time series data to reveal inherent relationships among the different factors. Electronic supplementary material The online version of this article (10.1186/s12859-019-2595-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Fang Zhang
- School of Mathematics, Shandong University, Jinan, Shandong, 250100, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, 1050 Childs Way, Los Angeles, 90089, CA, USA.,Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, 200433, China
| | - Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong, 250100, China.
| |
Collapse
|
8
|
Zhang F, Shan A, Luan Y. A novel method to accurately calculate statistical significance of local similarity analysis for high-throughput time series. Stat Appl Genet Mol Biol 2018; 17:/j/sagmb.ahead-of-print/sagmb-2018-0019/sagmb-2018-0019.xml. [PMID: 30447151 DOI: 10.1515/sagmb-2018-0019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
In recent years, a large number of time series microbial community data has been produced in molecular biological studies, especially in metagenomics. Among the statistical methods for time series, local similarity analysis is used in a wide range of environments to capture potential local and time-shifted associations that cannot be distinguished by traditional correlation analysis. Initially, the permutation test is popularly applied to obtain the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to achieve this aim. However, all these methods require the assumption that the time series are independent and identically distributed. In this paper, we propose a new approach based on moving block bootstrap to approximate the statistical significance of local similarity scores for dependent time series. Simulations show that our method can control the type I error rate reasonably, while theoretical approximation and the permutation test perform less well. Finally, our method is applied to human and marine microbial community datasets, indicating that it can identify potential relationship among operational taxonomic units (OTUs) and significantly decrease the rate of false positives.
Collapse
Affiliation(s)
- Fang Zhang
- School of Mathematics, Shandong University, Jinan, 250100, P.R. China
| | - Ang Shan
- School of Mathematics, Shandong University, Jinan, 250100, P.R. China
| | - Yihui Luan
- School of Mathematics, Shandong University, Jinan, 250100, P.R. China
| |
Collapse
|
9
|
Mahfouz A, Huisman SMH, Lelieveldt BPF, Reinders MJT. Brain transcriptome atlases: a computational perspective. Brain Struct Funct 2017; 222:1557-1580. [PMID: 27909802 PMCID: PMC5406417 DOI: 10.1007/s00429-016-1338-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2016] [Accepted: 11/15/2016] [Indexed: 01/31/2023]
Abstract
The immense complexity of the mammalian brain is largely reflected in the underlying molecular signatures of its billions of cells. Brain transcriptome atlases provide valuable insights into gene expression patterns across different brain areas throughout the course of development. Such atlases allow researchers to probe the molecular mechanisms which define neuronal identities, neuroanatomy, and patterns of connectivity. Despite the immense effort put into generating such atlases, to answer fundamental questions in neuroscience, an even greater effort is needed to develop methods to probe the resulting high-dimensional multivariate data. We provide a comprehensive overview of the various computational methods used to analyze brain transcriptome atlases.
Collapse
Affiliation(s)
- Ahmed Mahfouz
- Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands.
- Delft Bioinformatics Laboratory, Delft University of Technology, Delft, The Netherlands.
| | - Sjoerd M H Huisman
- Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Delft, The Netherlands
| | - Boudewijn P F Lelieveldt
- Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Delft Bioinformatics Laboratory, Delft University of Technology, Delft, The Netherlands
| | - Marcel J T Reinders
- Delft Bioinformatics Laboratory, Delft University of Technology, Delft, The Netherlands
| |
Collapse
|
10
|
Balasubramanian A, Wang J, Prabhakaran B. Discovering Multidimensional Motifs in Physiological Signals for Personalized Healthcare. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING 2016; 10:832-841. [PMID: 28191269 PMCID: PMC5298205 DOI: 10.1109/jstsp.2016.2543679] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Personalized diagnosis and therapy requires monitoring patient activity using various body sensors. Sensor data generated during personalized exercises or tasks may be too specific or inadequate to be evaluated using supervised methods such as classification. We propose multidimensional motif (MDM) discovery as a means for patient activity monitoring, since such motifs can capture repeating patterns across multiple dimensions of the data, and can serve as conformance indicators. Previous studies pertaining to mining MDMs have proposed approaches that lack the capability of concurrently processing multiple dimensions, thus limiting their utility in online scenarios. In this paper, we propose an efficient real-time approach to MDM discovery in body sensor generated time series data for monitoring performance of patients during therapy. We present two alternative models for MDMs based on motif co-occurrences and temporal ordering among motifs across multiple dimensions, with detailed formulation of the concepts proposed. The proposed method uses an efficient hashing based record to enable speedy update and retrieval of motif sets, and identification of MDMs. Performance evaluation using synthetic and real body sensor data in unsupervised motif discovery tasks shows that the approach is effective for (a) concurrent processing of multidimensional time series information suitable for real-time applications, (b) finding unknown naturally occurring patterns with minimal delay, and
Collapse
Affiliation(s)
- Arvind Balasubramanian
- Department of Computer Science, The University of Texas at Dallas, Richardson, TX, 75080 USA
| | - Jun Wang
- Department of Bioengineering, The University of Texas at Dallas, Richardson, TX, 75080 USA
| | | |
Collapse
|
11
|
Xia LC, Ai D, Cram JA, Liang X, Fuhrman JA, Sun F. Statistical significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains. BMC Bioinformatics 2015; 16:301. [PMID: 26390921 PMCID: PMC4578688 DOI: 10.1186/s12859-015-0732-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2015] [Accepted: 09/05/2015] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Local trend (i.e. shape) analysis of time series data reveals co-changing patterns in dynamics of biological systems. However, slow permutation procedures to evaluate the statistical significance of local trend scores have limited its applications to high-throughput time series data analysis, e.g., data from the next generation sequencing technology based studies. RESULTS By extending the theories for the tail probability of the range of sum of Markovian random variables, we propose formulae for approximating the statistical significance of local trend scores. Using simulations and real data, we show that the approximate p-value is close to that obtained using a large number of permutations (starting at time points >20 with no delay and >30 with delay of at most three time steps) in that the non-zero decimals of the p-values obtained by the approximation and the permutations are mostly the same when the approximate p-value is less than 0.05. In addition, the approximate p-value is slightly larger than that based on permutations making hypothesis testing based on the approximate p-value conservative. The approximation enables efficient calculation of p-values for pairwise local trend analysis, making large scale all-versus-all comparisons possible. We also propose a hybrid approach by integrating the approximation and permutations to obtain accurate p-values for significantly associated pairs. We further demonstrate its use with the analysis of the Polymouth Marine Laboratory (PML) microbial community time series from high-throughput sequencing data and found interesting organism co-occurrence dynamic patterns. AVAILABILITY The software tool is integrated into the eLSA software package that now provides accelerated local trend and similarity analysis pipelines for time series data. The package is freely available from the eLSA website: http://bitbucket.org/charade/elsa.
Collapse
Affiliation(s)
- Li C Xia
- Department of Medicine, Division of Oncology, Stanford University School of Medicine, Stanford, 94305-5151, CA, USA.,Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, 19104, PA, USA
| | - Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, 100083, China
| | - Jacob A Cram
- Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-0371, CA, USA
| | - Xiaoyi Liang
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, 100083, China
| | - Jed A Fuhrman
- Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-0371, CA, USA
| | - Fengzhu Sun
- Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, 90089-2910, CA, USA. .,Centre for Computational Systems Biology, Fudan University, Shanghai, 200433, China.
| |
Collapse
|