1
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
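As a rough illustration of the "local instead of global" idea in this abstract, the sketch below scores a subset of rows and columns with the mean squared residue of Cheng and Church, one classical bicluster coherence measure. The choice of measure, the toy matrix, and the function name `mean_squared_residue` are assumptions made for illustration; the survey itself covers many bicluster types and evaluation measures.

```python
def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church mean squared residue of the submatrix rows x cols.

    A value near 0 means the selected rows and columns follow a
    coherent additive pattern (row effect + column effect).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(sub), len(sub[0])
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    total = 0.0
    for i in range(n_r):
        for j in range(n_c):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_r * n_c)


# Toy data (hypothetical): rows 0, 1, 3 and columns 0, 1, 3 share an
# additive pattern, while row 2 and column 2 break it.
data = [
    [1.0, 2.0, 9.0, 4.0],
    [2.0, 3.0, 0.0, 5.0],
    [7.0, 1.0, 3.0, 8.0],
    [4.0, 5.0, 6.0, 7.0],
]

local_score = mean_squared_residue(data, [0, 1, 3], [0, 1, 3])
global_score = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3])
print(local_score, global_score)
```

The submatrix scores close to zero while the full matrix does not, which is the sense in which a bicluster is a local model: the pattern holds only for some rows under some columns.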
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
2
Abbas N, Tirmizi SA, Shabir G, Saeed A, Hussain G, Channer PA, Saleem R, Ayaz M. Chromium (III) complexes of azo dye ligands: Synthesis, characterization, DNA binding and application studies. Inorg Nano-Met Chem 2017. [DOI: 10.1080/24701556.2017.1357632] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022]
Affiliation(s)
- Nasir Abbas
  - Department of Chemistry, Quaid-I-Azam University, Islamabad, Pakistan
- Ghulam Shabir
  - Department of Chemistry, Quaid-I-Azam University, Islamabad, Pakistan
- Aamer Saeed
  - Department of Chemistry, Quaid-I-Azam University, Islamabad, Pakistan
- Ghulam Hussain
  - Institute of Chemistry, Punjab University, Lahore, Pakistan
- Rashid Saleem
  - R&D Manager, Leather Division Shafi Reso Chem, Lahore, Pakistan
- Muhammad Ayaz
  - CECOS University of IT and Emerging Sciences, Hayatabad, Peshawar, Pakistan
3
Henriques R, Madeira SC. BicNET: Flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 2016; 11:14. [PMID: 27213009 PMCID: PMC4875761 DOI: 10.1186/s13015-016-0074-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 12/11/2015] [Accepted: 04/22/2016] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Despite the recognized importance of module discovery in biological networks to enhance our understanding of complex biological systems, existing methods generally suffer from two major drawbacks. First, there is a focus on modules where biological entities are strongly connected, leading to the discovery of trivial/well-known modules and to the inaccurate exclusion of biological entities with subtler yet relevant roles. Second, there is a generalized intolerance towards different forms of noise, including uncertainty associated with less-studied biological entities (in the context of literature-driven networks) and experimental noise (in the context of data-driven networks). Although state-of-the-art biclustering algorithms are able to discover modules with varying coherency and robustness to noise, their application for the discovery of non-dense modules in biological networks has been poorly explored and it is further challenged by efficiency bottlenecks. METHODS This work proposes Biclustering NETworks (BicNET), a biclustering algorithm to discover non-trivial yet coherent modules in weighted biological networks with heightened efficiency. Three major contributions are provided. First, we motivate the relevance of discovering network modules given by constant, symmetric, plaid and order-preserving biclustering models. Second, we propose an algorithm to discover these modules and to robustly handle noisy and missing interactions. Finally, we provide new searches to tackle time and memory bottlenecks by effectively exploring the inherent structural sparsity of network data. RESULTS Results in synthetic network data confirm the soundness, efficiency and superiority of BicNET. The application of BicNET on protein interaction and gene interaction networks from yeast, E. coli and Human reveals new modules with heightened biological significance. 
CONCLUSIONS BicNET is, to our knowledge, the first method enabling the efficient unsupervised analysis of large-scale network data for the discovery of coherent modules with parameterizable homogeneity.
Affiliation(s)
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
4
Quantitative assessment of gene expression network module-validation methods. Sci Rep 2015; 5:15258. [PMID: 26470848 PMCID: PMC4607977 DOI: 10.1038/srep15258] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 03/20/2015] [Accepted: 09/21/2015] [Indexed: 02/01/2023] Open
Abstract
Validation of pluripotent modules in diverse networks holds enormous potential for systems biology and network pharmacology. An arising challenge is how to assess the accuracy of discovering all potential modules from multi-omic networks and validating their architectural characteristics based on innovative computational methods beyond function enrichment and biological validation. To display the framework progress in this domain, we systematically divided the existing Computational Validation Approaches based on Modular Architecture (CVAMA) into topology-based approaches (TBA) and statistics-based approaches (SBA). We compared the available module validation methods based on 11 gene expression datasets, and partially consistent results in the form of homogeneous models were obtained with each individual approach, whereas discrepant contradictory results were found between TBA and SBA. The TBA of the Zsummary value had a higher Validation Success Ratio (VSR) (51%) and a higher Fluctuation Ratio (FR) (80.92%), whereas the SBA of the approximately unbiased (AU) p-value had a lower VSR (12.3%) and a lower FR (45.84%). The Gray area simulated study revealed a consistent result for these two models and indicated a lower Variation Ratio (VR) (8.10%) of TBA at 6 simulated levels. Despite facing many novel challenges and evidence limitations, CVAMA may offer novel insights into modular networks.
5
Luo H, Ye H, Ng H, Shi L, Tong W, Mattes W, Mendrick D, Hong H. Understanding and predicting binding between human leukocyte antigens (HLAs) and peptides by network analysis. BMC Bioinformatics 2015; 16 Suppl 13:S9. [PMID: 26424483 PMCID: PMC4597169 DOI: 10.1186/1471-2105-16-s13-s9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND As the major histocompatibility complex (MHC), human leukocyte antigens (HLAs) are one of the most polymorphic genes in humans. Patients carrying certain HLA alleles may develop adverse drug reactions (ADRs) after taking specific drugs. Peptides play an important role in HLA related ADRs as they are the necessary co-binders of HLAs with drugs. Many experimental data have been generated for understanding HLA-peptide binding. However, efficiently utilizing the data for understanding and accurately predicting HLA-peptide binding is challenging. Therefore, we developed a network analysis based method to understand and predict HLA-peptide binding. METHODS Qualitative Class I HLA-peptide binding data were harvested and prepared from four major databases. An HLA-peptide binding network was constructed from this dataset and modules were identified by the fast greedy modularity optimization algorithm. To examine the significance of signals in the yielded models, the modularity was compared with the modularity values generated from 1,000 random networks. The peptides and HLAs in the modules were characterized by similarity analysis. The neighbor-edges based and unbiased leverage algorithm (Nebula) was developed for predicting HLA-peptide binding. Leave-one-out (LOO) validations and two-fold cross-validations were conducted to evaluate the performance of Nebula using the constructed HLA-peptide binding network. RESULTS Nine modules were identified from analyzing the HLA-peptide binding network with a highest modularity compared to all the random networks. Peptide length and functional side chains of amino acids at certain positions of the peptides were different among the modules. HLA sequences were module dependent to some extent. Nebula achieved an overall prediction accuracy of 0.816 in the LOO validations and average accuracy of 0.795 in the two-fold cross-validations and outperformed the method reported in the literature.
CONCLUSIONS Network analysis is a useful approach for analyzing large and sparse datasets such as the HLA-peptide binding dataset. The modules identified from the network analysis clustered peptides and HLAs with similar sequences and properties of amino acids. Nebula performed well in the predictions of HLA-peptide binding. We demonstrated that network analysis coupled with Nebula is an efficient approach to understand and predict HLA-peptide binding interactions and thus, could further our understanding of ADRs.
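The comparison of an observed modularity against 1,000 random networks, described in the METHODS of this abstract, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the graph is a small toy network rather than the HLA-peptide data, the random networks are drawn uniformly with the same number of nodes and edges (the abstract does not state the null model), and the partition is held fixed instead of being re-detected with fast greedy optimization in each random network. The helper names `modularity` and `random_edges` are hypothetical.

```python
import itertools
import random


def modularity(edges, community_of):
    """Newman modularity Q of a fixed node partition (undirected graph)."""
    m = len(edges)
    intra = {}   # number of edges inside each community
    degree = {}  # summed node degree of each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def random_edges(nodes, n_edges, rng):
    """Assumed null model: n_edges distinct node pairs chosen uniformly."""
    return rng.sample(list(itertools.combinations(nodes, 2)), n_edges)


# Toy network: two 4-node cliques joined by a single edge.
group_a, group_b = [0, 1, 2, 3], [4, 5, 6, 7]
edges = (list(itertools.combinations(group_a, 2))
         + list(itertools.combinations(group_b, 2))
         + [(3, 4)])
community_of = {n: "A" for n in group_a}
community_of.update({n: "B" for n in group_b})

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_random = 1000
q_observed = modularity(edges, community_of)
q_random = [modularity(random_edges(range(8), len(edges), rng), community_of)
            for _ in range(n_random)]

# Empirical p-value with the usual +1 correction.
p_value = (1 + sum(q >= q_observed for q in q_random)) / (n_random + 1)
print(round(q_observed, 4), p_value)
```

A small empirical p-value indicates that the modular structure is unlikely to arise from random wiring alone, which is the kind of evidence the abstract reports for its nine modules.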
6
Integrative approaches for finding modular structure in biological networks. Nat Rev Genet 2013; 14:719-32. [PMID: 24045689 DOI: 10.1038/nrg3552] [Citation(s) in RCA: 351] [Impact Index Per Article: 31.9] [Indexed: 12/16/2022]
Abstract
A central goal of systems biology is to elucidate the structural and functional architecture of the cell. To this end, large and complex networks of molecular interactions are being rapidly generated for humans and model organisms. A recent focus of bioinformatics research has been to integrate these networks with each other and with diverse molecular profiles to identify sets of molecules and interactions that participate in a common biological function - that is, 'modules'. Here, we classify such integrative approaches into four broad categories, describe their bioinformatic principles and review their applications.
7
Alroobi R, Ahmed S, Salem S. Mining maximal cohesive induced subnetworks and patterns by integrating biological networks with gene profile data. Interdiscip Sci 2013; 5:211-24. [DOI: 10.1007/s12539-013-0168-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/07/2013] [Revised: 03/30/2013] [Accepted: 06/12/2013] [Indexed: 01/28/2023]
8
Couto Alves A, Bruhn S, Ramasamy A, Wang H, Holloway JW, Hartikainen AL, Jarvelin MR, Benson M, Balding DJ, Coin LJM. Dysregulation of complement system and CD4+ T cell activation pathways implicated in allergic response. PLoS One 2013; 8:e74821. [PMID: 24116013 PMCID: PMC3792967 DOI: 10.1371/journal.pone.0074821] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 04/30/2013] [Accepted: 08/06/2013] [Indexed: 11/18/2022] Open
Abstract
Allergy is a complex disease that is likely to involve dysregulated CD4+ T cell activation. Here we propose a novel methodology to gain insight into how coordinated behaviour emerges between disease-dysregulated pathways in response to pathophysiological stimuli. Using peripheral blood mononuclear cells of allergic rhinitis patients and controls cultured with and without pollen allergens, we integrate CD4+ T cell gene expression from microarray data and genetic markers of allergic sensitisation from GWAS data at the pathway level using enrichment analysis; implicating the complement system in both cellular and systemic response to pollen allergens. We delineate a novel disease network linking T cell activation to the complement system that is significantly enriched for genes exhibiting correlated gene expression and protein-protein interactions, suggesting a tight biological coordination that is dysregulated in the disease state in response to pollen allergen but not to diluent. This novel disease network has high predictive power for the gene and protein expression of the Th2 cytokine profile (IL-4, IL-5, IL-10, IL-13) and of the Th2 master regulator (GATA3), suggesting its involvement in the early stages of CD4+ T cell differentiation. Dissection of the complement system gene expression identifies 7 genes specifically associated with atopic response to pollen, including C1QR1, CFD, CFP, ITGB2, ITGAX and confirms the role of C3AR1 and C5AR1. Two of these genes (ITGB2 and C3AR1) are also implicated in the network linking complement system to T cell activation, which comprises 6 differentially expressed genes. C3AR1 is also significantly associated with allergic sensitisation in GWAS data.
MESH Headings
- Allergens/pharmacology
- CD4-Positive T-Lymphocytes/drug effects
- CD4-Positive T-Lymphocytes/immunology
- CD4-Positive T-Lymphocytes/metabolism
- Cell Differentiation/drug effects
- Cell Differentiation/genetics
- Cytokines/genetics
- Cytokines/metabolism
- GATA3 Transcription Factor/genetics
- GATA3 Transcription Factor/metabolism
- Gene Expression Profiling
- Humans
- Leukocytes, Mononuclear/drug effects
- Leukocytes, Mononuclear/immunology
- Leukocytes, Mononuclear/metabolism
- Lymphocyte Activation/drug effects
- Lymphocyte Activation/genetics
- Lymphocyte Activation/immunology
- Pollen
- Receptors, Complement/genetics
- Receptors, Complement/metabolism
- Rhinitis, Allergic, Seasonal/genetics
- Rhinitis, Allergic, Seasonal/immunology
- Rhinitis, Allergic, Seasonal/metabolism
Affiliation(s)
- Alexessander Couto Alves
  - Department of Epidemiology and Biostatistics, Imperial College London, MRC-HPA Centre for Environment and Health, Imperial College London, London, United Kingdom
- Sören Bruhn
  - Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- Adaikalavan Ramasamy
  - Department of Epidemiology and Biostatistics, Imperial College London, MRC-HPA Centre for Environment and Health, Imperial College London, London, United Kingdom
  - Department of Medical and Molecular Genetics, King's College London, London, United Kingdom
- Hui Wang
  - Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
  - Dept of Paediatrics, Gothenburg University, Gothenburg, Sweden
- John W. Holloway
  - Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, United Kingdom
- Anna-Liisa Hartikainen
  - Department of Clinical Sciences, Obstetrics and Gynecology, Institute of Clinical Medicine, University of Oulu, Oulu, Finland
- Marjo-Riitta Jarvelin
  - Department of Epidemiology and Biostatistics, Imperial College London, MRC-HPA Centre for Environment and Health, Imperial College London, London, United Kingdom
  - Institute of Health Sciences, University of Oulu, and Unit of General Practice, University Hospital of Oulu, Oulu, Finland
  - Biocenter Oulu, University of Oulu, Oulu, Finland
  - National Institute of Health and Welfare, Oulu, Finland
- Mikael Benson
  - Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- David J. Balding
  - Department of Epidemiology and Biostatistics, Imperial College London, MRC-HPA Centre for Environment and Health, Imperial College London, London, United Kingdom
  - Genetics Institute, University College London, United Kingdom
- Lachlan J. M. Coin
  - Department of Genomics of Common Diseases, School of Public Health, Imperial College London, London, United Kingdom
  - Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
9
Abstract
High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
10
Feng C, Chen L, Li W, Wang H, Zhang L, Jia X, Miao Z, Qu X, Li W, He W. Identifying grade/stage-related active modules in human co-regulatory networks: a case study for breast cancer. OMICS: A Journal of Integrative Biology 2013; 16:681-9. [PMID: 23215806 DOI: 10.1089/omi.2012.0015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
The histological grade/stage of tumor is widely acknowledged as an important clinical prognostic factor for cancer progression. Recent experimental studies have explored the following two topics at the molecular level: (1) whether or not gene expression levels vary by different degrees among different tumor grades/stages, and (2) whether some well-defined modules could distinguish one grade/stage from another. In this article, using breast cancer as an example, we investigated this topic and identified grade/stage-related active modules under the framework of a weighted network integrated from a human protein interaction network and a transcriptional regulatory network. Our results enabled us to draw the conclusion that the gene expression profile could provide more clues about tumor grade, but reveals less evidence about tumor stage. In addition, we found that our modular biomarker method had additional advantages in identifying some tumor grade/stage-related genes with slightly altered expression. According to our case study, the framework we introduced could be used for other cancers to identify their modules during grading or staging.
Affiliation(s)
- Chenchen Feng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Hei Longjiang Province, China
11
Systems genetics in "-omics" era: current and future development. Theory Biosci 2012; 132:1-16. [PMID: 23138757 DOI: 10.1007/s12064-012-0168-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 03/26/2012] [Accepted: 10/25/2012] [Indexed: 02/06/2023]
Abstract
The systems genetics is an emerging discipline that integrates high-throughput expression profiling technology and systems biology approaches for revealing the molecular mechanism of complex traits, and will improve our understanding of gene functions in the biochemical pathway and genetic interactions between biological molecules. With the rapid advances of microarray analysis technologies, bioinformatics is extensively used in the studies of gene functions, SNP-SNP genetic interactions, LD block-block interactions, miRNA-mRNA interactions, DNA-protein interactions, protein-protein interactions, and functional mapping for LD blocks. Based on bioinformatics panel, which can integrate "-omics" datasets to extract systems knowledge and useful information for explaining the molecular mechanism of complex traits, systems genetics is all about to enhance our understanding of biological processes. Systems biology has provided systems level recognition of various biological phenomena, and constructed the scientific background for the development of systems genetics. In addition, the next-generation sequencing technology and post-genome wide association studies empower the discovery of new gene and rare variants. The integration of different strategies will help to propose novel hypothesis and perfect the theoretical framework of systems genetics, which will make contribution to the future development of systems genetics, and open up a whole new area of genetics.
12
Dao P, Colak R, Salari R, Moser F, Davicioni E, Schönhuth A, Ester M. Inferring cancer subnetwork markers using density-constrained biclustering. Bioinformatics 2010; 26:i625-31. [PMID: 20823331 PMCID: PMC2935415 DOI: 10.1093/bioinformatics/btq393] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 11/29/2022]
Abstract
Motivation: Recent genomic studies have confirmed that cancer is of utmost phenotypical complexity, varying greatly in terms of subtypes and evolutionary stages. When classifying cancer tissue samples, subnetwork marker approaches have proven to be superior over single gene marker approaches, most importantly in cross-platform evaluation schemes. However, prior subnetwork-based approaches do not explicitly address the great phenotypical complexity of cancer. Results: We explicitly address this and employ density-constrained biclustering to compute subnetwork markers, which reflect pathways being dysregulated in many, but not necessarily all samples under consideration. In breast cancer we achieve substantial improvements over all cross-platform applicable approaches when predicting TP53 mutation status in a well-established non-cross-platform setting. In colon cancer, we raise prediction accuracy in the most difficult instances from 87% to 93% for cancer versus non-cancer and from 83% to (astonishing) 92%, for with versus without liver metastasis, in well-established cross-platform evaluation schemes. Availability: Software is available on request. Contact: alexsch@math.berkeley.edu; ester@cs.sfu.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Phuong Dao
  - School of Computing Science, Simon Fraser University, Burnaby, Canada