1
Jain N, Ghosh S, Ghosh A. A parameter free relative density based biclustering method for identifying non-linear feature relations. Heliyon 2024; 10:e34736. [PMID: 39157398; PMCID: PMC11327522; DOI: 10.1016/j.heliyon.2024.e34736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Revised: 07/09/2024] [Accepted: 07/16/2024] [Indexed: 08/20/2024] Open
Abstract
The existing biclustering algorithms often depend on assumptions like monotonicity or linearity of feature relations for finding biclusters. Though a few algorithms overcome this problem using density-based methods, they tend to miss out many biclusters because they use global criteria for identifying dense regions. The proposed method, PF-RelDenBi, uses local variations in marginal and joint densities for each pair of features to find the subset of observations, forming the basis of the relation between them. It then finds the set of features connected by a common set of observations using a non-linear feature relation index, resulting in a bicluster. This approach allows us to find biclusters based on feature relations, even if the relations are non-linear or non-monotonous. Additionally, the proposed method does not require the user to provide any parameters, allowing its application to datasets from different domains. To study the behaviour of PF-RelDenBi on datasets with different properties, experiments were carried out on sixteen simulated datasets and the performance has been compared with eleven state-of-the-art algorithms. The proposed method is seen to produce better results for most of the simulated datasets. Experiments were conducted with five benchmark datasets and biclusters were detected using PF-RelDenBi. For the first two datasets, the detected biclusters were used to generate additional features that improved classification performance. For the other three datasets, the performance of PF-RelDenBi was compared with the eleven state-of-the-art methods in terms of accuracy, NMI and ARI. The proposed method is seen to detect biclusters with greater accuracy. The proposed technique has also been applied to the COVID-19 dataset to identify some demographic features that are likely to affect the spread of COVID-19.
Affiliation(s)
- Namita Jain
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Susmita Ghosh
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Ashish Ghosh
  - International Institute of Information Technology, Bhubaneswar 751003, India
2
Chekouo T, Mukherjee H. A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data. Biom J 2024; 66:e2300173. [PMID: 38817110; PMCID: PMC11239327; DOI: 10.1002/bimj.202300173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Revised: 02/18/2024] [Accepted: 03/02/2024] [Indexed: 06/01/2024]
Abstract
We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterized each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.
Affiliation(s)
- Thierry Chekouo
  - Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minnesota, USA
- Himadri Mukherjee
  - Department of Mathematics and Statistics, University of Minnesota Duluth, Minnesota, USA
3
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596; PMCID: PMC11247412; DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
4
Liu F, Yang Y, Xu XS, Yuan M. MESBC: A novel mutually exclusive spectral biclustering method for cancer subtyping. Comput Biol Chem 2024; 109:108009. [PMID: 38219419; DOI: 10.1016/j.compbiolchem.2023.108009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 01/16/2024]
Abstract
Many soft biclustering algorithms have been developed and applied to various biological and biomedical data analyses. However, few mutually exclusive (hard) biclustering algorithms have been proposed, which could better identify disease or molecular subtypes with survival significance based on genomic or transcriptomic data. In this study, we developed a novel mutually exclusive spectral biclustering (MESBC) algorithm based on spectral method to detect mutually exclusive biclusters. MESBC simultaneously detects relevant features (genes) and corresponding conditions (patients) subgroups and, therefore, automatically uses the signature features for each subtype to perform the clustering. Extensive simulations revealed that MESBC provided superior accuracy in detecting pre-specified biclusters compared with the non-negative matrix factorization (NMF) and Dhillon's algorithm, particularly in very noisy data. Further analysis of the algorithm on real datasets obtained from the TCGA database showed that MESBC provided more accurate (i.e., smaller p-value) overall survival prediction in patients with lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) cancers when compared to the existing, gold-standard subtypes for lung cancers (integrative clustering). Furthermore, MESBC detected several genes with significant prognostic value in both LUAD and LUSC patients. External validation on an independent, unseen GEO dataset of LUAD showed that MESBC-derived clusters based on TCGA data still exhibited clear biclustering patterns and consistent, outstanding prognostic predictability, demonstrating robust generalizability of MESBC. Therefore, MESBC could potentially be used as a risk stratification tool to optimize the treatment for the patient, improve the selection of patients for clinical trials, and contribute to the development of novel therapeutic agents.
Affiliation(s)
- Fengrong Liu
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Min Yuan
  - School of Public Health Administration, Anhui Medical University, Hefei 230032, China
5
Han W, Zhang S, Gao H, Bu D. Clustering on hierarchical heterogeneous data with prior pairwise relationships. BMC Bioinformatics 2024; 25:40. [PMID: 38262930; PMCID: PMC10807103; DOI: 10.1186/s12859-024-05652-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024] Open
Abstract
BACKGROUND Clustering is a fundamental problem in statistics and has broad applications in various areas. Traditional clustering methods treat features equally and ignore the potential structure brought by the characteristic difference of features. Especially in cancer diagnosis and treatment, several types of biological features are collected and analyzed together. Treating these features equally fails to identify the heterogeneity of both data structure and cancer itself, which leads to incompleteness and inefficacy of current anti-cancer therapies. OBJECTIVES In this paper, we propose a clustering framework based on hierarchical heterogeneous data with prior pairwise relationships. The proposed clustering method fully characterizes the difference of features and identifies potential hierarchical structure by rough and refined clusters. RESULTS The refined clustering further divides the clusters obtained by the rough clustering into different subtypes. Thus it provides a deeper insight of cancer that can not be detected by existing clustering methods. The proposed method is also flexible with prior information, additional pairwise relationships of samples can be incorporated to help to improve clustering performance. Finally, well-grounded statistical consistency properties of our proposed method are rigorously established, including the accurate estimation of parameters and determination of clustering structures. CONCLUSIONS Our proposed method achieves better clustering performance than other methods in simulation studies, and the clustering accuracy increases with prior information incorporated. Meaningful biological findings are obtained in the analysis of lung adenocarcinoma with clinical imaging data and omics data, showing that hierarchical structure produced by rough and refined clustering is necessary and reasonable.
Affiliation(s)
- Wei Han
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Hailong Gao
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China
- Deliang Bu
  - School of Statistics, Capital University of Economics and Business, Beijing, China
6
Pauk J, Daunoraviciene K, Ziziene J, Minta-Bielecka K, Dzieciol-Anikiej Z. Classification of muscle activity patterns in healthy children using biclustering algorithm. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2023.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/08/2023]
7
Zhang W, Wendt C, Bowler R, Hersh CP, Safo SE. Robust integrative biclustering for multi-view data. Stat Methods Med Res 2022; 31:2201-2216. [PMID: 36113157; PMCID: PMC10153449; DOI: 10.1177/09622802221122427] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
In much biomedical research, multiple views of data (e.g. genomics, proteomics) are available, and the detection of sample subgroups characterized by specific groups of variables may be of particular interest. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for detecting row-column clusters in a view but few methods exist for data from multiple views. The few existing algorithms are heavily dependent on regularization parameters for getting row-column clusters, and they impose an unnecessary burden on users, thus limiting their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability of samples and variables to belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that integrative sparse singular value decomposition outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters. iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.
Affiliation(s)
- Weijie Zhang
  - Division of Biostatistics, University of Minnesota, MN, USA
- Christine Wendt
  - Division of Pulmonary, Allergy and Critical Care, University of Minnesota, MN, USA
- Russel Bowler
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, USA
- Craig P Hersh
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, USA
- Sandra E Safo
  - Division of Biostatistics, University of Minnesota, MN, USA
8
Liu T, Yu H, Blair RH. Stability estimation for unsupervised clustering: A review. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2022; 14:e1575. [PMID: 36583207; PMCID: PMC9787023; DOI: 10.1002/wics.1575] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/01/2021] [Revised: 11/24/2021] [Accepted: 12/08/2021] [Indexed: 01/01/2023]
Abstract
Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and dissimilarity measures. The purpose of clustering is versatile, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial, and are ultimately what sets many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification.
Affiliation(s)
- Tianmou Liu
  - Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, New York, USA
- Han Yu
  - Roswell Park Comprehensive Cancer Center, Buffalo, New York, USA
- Rachael Hageman Blair
  - Department of Biostatistics, Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, New York, USA
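The key idea stated in the abstract of entry 8 (perturb the data slightly, re-cluster, and check whether the original clusters are preserved) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the review: the toy one-dimensional two-means clusterer, the Gaussian jitter used as the perturbation, the plain Rand index used as the similarity measure, and the hypothetical names `kmeans_1d`, `rand_index`, and `stability_score`.

```python
import random
from itertools import combinations


def kmeans_1d(xs, n_iter=50):
    """Toy two-cluster k-means on 1-D data (illustrative only).

    The two centers start at the minimum and maximum of the data.
    """
    lo, hi = min(xs), max(xs)
    labels = [0] * len(xs)
    for _ in range(n_iter):
        # Assign each point to the nearer of the two centers.
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        # Move each center to the mean of its assigned points.
        if g0:
            lo = sum(g0) / len(g0)
        if g1:
            hi = sum(g1) / len(g1)
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability_score(xs, n_perturb=20, noise=0.05, seed=0):
    """Mean Rand index between the original clustering and clusterings
    of slightly jittered copies of the data (assumed perturbation)."""
    rng = random.Random(seed)
    base = kmeans_1d(xs)
    scores = []
    for _ in range(n_perturb):
        perturbed = [x + rng.gauss(0.0, noise) for x in xs]
        scores.append(rand_index(base, kmeans_1d(perturbed)))
    return sum(scores) / len(scores)


# Two well-separated groups: small jitter should not change the clustering.
well_separated = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(stability_score(well_separated))
```

A score near 1 suggests that the clustering survives the perturbation. As the abstract notes, the choice of perturbation and of similarity measure is itself nontrivial, so other choices could reasonably replace the ones used in this sketch.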
9
Abstract
Biclustering is a two-dimensional data analysis technique that, applied to a matrix, searches for a subset of rows and columns that intersect to produce a submatrix with given, expected features. Such an approach requires different methods to those of typical classification or regression tasks. In recent years it has become possible to express biclustering goals in the form of Boolean reasoning. This paper presents a new, heuristic approach to bicluster induction in binary data.
10
Maisog JM, DeMarco AT, Devarajan K, Young SS, Fogel P, Luta G. Assessing Methods for Evaluating the Number of Components in Non-Negative Matrix Factorization. MATHEMATICS (BASEL, SWITZERLAND) 2021; 9:2840. [PMID: 35694180; PMCID: PMC9181460; DOI: 10.3390/math9222840] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 05/09/2023]
Abstract
Non-negative matrix factorization is a relatively new method of matrix decomposition which factors an m×n data matrix X into an m×k matrix W and a k×n matrix H, so that X≈W×H. Importantly, all values in X, W, and H are constrained to be non-negative. NMF can be used for dimensionality reduction, since the k columns of W can be considered components into which X has been decomposed. The question arises: how does one choose k? In this paper, we first assess methods for estimating k in the context of NMF in synthetic data. Second, we examine the effect of normalization on this estimate's accuracy in empirical data. In synthetic data with orthogonal underlying components, methods based on PCA and Brunet's Cophenetic Correlation Coefficient achieved the highest accuracy. When evaluated on a well-known real dataset, normalization had an unpredictable effect on the estimate. For any given normalization method, the methods for estimating k gave widely varying results. We conclude that when estimating k, it is best not to apply normalization. If underlying components are known to be orthogonal, then Velicer's MAP or Minka's Laplace-PCA method might be best. However, when orthogonality of the underlying components is unknown, none of the methods seemed preferable.
Affiliation(s)
- Andrew T. DeMarco
  - Department of Rehabilitation Medicine, Georgetown University Medical Center
- Karthik Devarajan
  - Department of Biostatistics and Bioinformatics, Fox Chase Cancer Center, Temple University Health System, Philadelphia, PA 19111
- Paul Fogel
  - Advestis, 69 Boulevard Haussmann 75008 Paris, France
- George Luta
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center
  - Department of Clinical Epidemiology, Aarhus University, Aarhus, Denmark
  - The Parker Institute, Copenhagen University Hospital, Frederiksberg, Denmark
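To make the factorization in the abstract of entry 10 concrete (an m×n non-negative matrix X approximated by an m×k matrix W times a k×n matrix H), the following is a minimal pure-Python sketch. It uses the well-known multiplicative-update rules for the squared-error objective. This is an assumption chosen for brevity, since the abstract does not say which solver the authors used. The helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) and the toy matrix are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Frobenius norm of the residual X - W*H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Sketch of NMF via multiplicative updates (assumed solver).

    Returns W (m x k) and H (k x n), both non-negative.
    """
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Strictly positive random initialization keeps the updates well defined.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise.
        Wt = transpose(W)
        num = matmul(Wt, X)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num = matmul(X, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy 4 x 3 non-negative matrix built from two underlying row patterns.
X = [[1, 0, 2], [2, 0, 4], [0, 3, 1], [0, 6, 2]]
W, H = nmf(X, k=2)
print(frobenius_error(X, W, H))
```

The question the paper addresses, how to choose k, sits outside this sketch: here k is simply passed in, and one would have to rerun the factorization for several values of k and apply one of the selection criteria the abstract mentions.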
11
Fang Q, Su D, Ng W, Feng J. An Effective Biclustering-Based Framework for Identifying Cell Subpopulations From scRNA-seq Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2249-2260. [PMID: 32167906; DOI: 10.1109/tcbb.2020.2979717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) techniques opens up new opportunities for studying the cell-specific changes in the transcriptomic data. An important research problem related with scRNA-seq data analysis is to identify cell subpopulations with distinct functions. However, the expression profiles of individual cells are usually measured over tens of thousands of genes, and it remains a difficult problem to effectively cluster the cells based on the high-dimensional profiles. An additional challenge of performing the analysis is that, the scRNA-seq data are often noisy and sometimes extremely sparse due to technical limitations and sampling deficiencies. In this paper, we propose a biclustering-based framework called DivBiclust that effectively identifies the cell subpopulations based on the high-dimensional noisy scRNA-seq data. Compared with nine state-of-the-art methods, DivBiclust excels in identifying cell subpopulations with high accuracy as evidenced by our experiments on ten real scRNA-seq datasets with different size and diverse dropout rates. The supplemental materials of DivBiclust, including the source codes, data, and a supplementary document, are available at https://www.github.com/Qiong-Fang/DivBiclust.
12
A Holistic Performance Comparison for Lung Cancer Classification Using Swarm Intelligence Techniques. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:6680424. [PMID: 34373776; PMCID: PMC8349254; DOI: 10.1155/2021/6680424] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/14/2020] [Accepted: 07/17/2021] [Indexed: 12/22/2022]
Abstract
In the field of bioinformatics, feature selection in classification of cancer is a primary area of research and utilized to select the most informative genes from thousands of genes in the microarray. Microarray data is generally noisy, is highly redundant, and has an extremely asymmetric dimensionality, as the majority of the genes present here are believed to be uninformative. The paper adopts a methodology of classification of high dimensional lung cancer microarray data utilizing feature selection and optimization techniques. The methodology is divided into two stages; firstly, the ranking of each gene is done based on the standard gene selection techniques like Information Gain, Relief–F test, Chi-square statistic, and T-statistic test. As a result, the gathering of top scored genes is assimilated, and a new feature subset is obtained. In the second stage, the new feature subset is further optimized by using swarm intelligence techniques like Grasshopper Optimization (GO), Moth Flame Optimization (MFO), Bacterial Foraging Optimization (BFO), Krill Herd Optimization (KHO), and Artificial Fish Swarm Optimization (AFSO), and finally, an optimized subset is utilized. The selected genes are used for classification, and the classifiers used here are Naïve Bayesian Classifier (NBC), Decision Trees (DT), Support Vector Machines (SVM), and K-Nearest Neighbour (KNN). The best results are shown when Relief-F test is computed with AFSO and classified with Decision Trees classifier for hundred genes, and the highest classification accuracy of 99.10% is obtained.
13
Salcedo EC, Winter MB, Khuri N, Knudsen GM, Sali A, Craik CS. Global Protease Activity Profiling Identifies HER2-Driven Proteolysis in Breast Cancer. ACS Chem Biol 2021; 16:712-723. [PMID: 33765766; DOI: 10.1021/acschembio.0c01000] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 02/07/2023]
Abstract
Differential expression of extracellular proteases and endogenous protease inhibitors has been associated with distinct molecular subtypes of breast cancer. However, due to the tight post-translational regulation of protease activity, protease expression-level data alone are not sufficient to understand the role of proteases in malignant transformation. Therefore, we hypothesized that global profiles of extracellular protease activity could more completely reflect differences observed at the transcriptional level in breast cancer and that subtype-associated protease activity may be leveraged to identify specific proteases that play a functional role in cancer signaling. Here, we used a global peptide library-based approach to profile the activities of proteases within distinct breast cancer subtypes. Analysis of 3651 total peptide cleavages from a panel of well-characterized breast cancer cell lines demonstrated differences in proteolytic signatures between cell lines. Cell line clustering based on protease cleavages within the peptide library expanded upon the expected classification derived from transcriptional profiling. An isogenic cell line model developed to further interrogate proteolysis in the HER2 subtype revealed a proteolytic signature consistent with activation of TGF-β signaling. Specifically, we determined that a metalloprotease involved in TGF-β signaling, BMP1, was upregulated at both the protein (2-fold, P = 0.001) and activity (P = 0.0599) levels. Inhibition of BMP1 and HER2 suppressed invasion of HER2-expressing cells by 35% (P < 0.0001), compared to 15% (P = 0.0086) observed in cells where only HER2 was inhibited. In summary, through global identification of extracellular proteolysis in breast cancer cell lines, we demonstrate subtype-specific differences in protease activity and elucidate proteolysis associated with HER2-mediated signaling.
14
Li Y, Bandyopadhyay D, Xie F, Xu Y. BAREB: A Bayesian repulsive biclustering model for periodontal data. Stat Med 2020; 39:2139-2151. [PMID: 32246534; PMCID: PMC7272289; DOI: 10.1002/sim.8536] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/22/2019] [Revised: 02/12/2020] [Accepted: 03/07/2020] [Indexed: 11/11/2022]
Abstract
Preventing periodontal diseases (PD) and maintaining the structure and function of teeth are important goals for personal oral care. To understand the heterogeneity in patients with diverse PD patterns, we develop a Bayesian repulsive biclustering method that can simultaneously cluster the PD patients and their tooth sites after taking the patient- and site-level covariates into consideration. BAREB uses the determinantal point process prior to induce diversity among different biclusters to facilitate parsimony and interpretability. Since PD progression is hypothesized to be spatially referenced, BAREB factors in the spatial dependence among tooth sites. In addition, since PD is the leading cause for tooth loss, the missing data mechanism is nonignorable. Such nonrandom missingness is incorporated into BAREB. For the posterior inference, we design an efficient reversible jump Markov chain Monte Carlo sampler. Simulation studies show that BAREB is able to accurately estimate the biclusters, and compares favorably to alternatives. For real world application, we apply BAREB to a dataset from a clinical PD study, and obtain desirable and interpretable results. A major contribution of this article is the Rcpp implementation of our methodology, available in the R package BAREB.
Affiliation(s)
- Yuliang Li
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, MD, USA
- Fangzheng Xie
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, MD, USA
- Yanxun Xu
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, MD, USA
15
16
Biswal BS, Mohapatra A, Vipsita S. Triclustering of gene expression microarray data using coarse grained and dynamic deme based parallel genetic approach. EVOLUTIONARY INTELLIGENCE 2019. [DOI: 10.1007/s12065-019-00330-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022]
17
Saini N, Saha S, Soni C, Bhattacharyya P. Automatic evolution of bi-clusters from microarray data using self-organized multi-objective evolutionary algorithm. APPL INTELL 2019. [DOI: 10.1007/s10489-019-01554-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
18
Cirrincione G, Ciravegna G, Barbiero P, Randazzo V, Pasero E. The GH-EXIN neural network for hierarchical clustering. Neural Netw 2019; 121:57-73. [PMID: 31536900; DOI: 10.1016/j.neunet.2019.07.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/15/2019] [Revised: 06/11/2019] [Accepted: 07/21/2019] [Indexed: 10/26/2022]
Abstract
Hierarchical clustering is an important tool for extracting information from data in a multi-resolution way. It is more meaningful if driven by data, as in the case of divisive algorithms, which split data until no more division is allowed. However, they have the drawback of the splitting threshold setting. The neural networks can address this problem, because they basically depend on data. The growing hierarchical GH-EXIN neural network builds a hierarchical tree in an incremental (data-driven architecture) and self-organized way. It is a top-down technique which defines the horizontal growth by means of an anisotropic region of influence, based on the novel idea of neighborhood convex hull. It also reallocates data and detects outliers by using a novel approach on all the leaves, simultaneously. Its complexity is estimated and an analysis of its user-dependent parameters is given. The advantages of the proposed approach, with regard to the best existing networks, are shown and analyzed, qualitatively and quantitatively, both in benchmark synthetic problems and in a real application (image recognition from video), in order to test the performance in building hierarchical trees. Furthermore, an important and very promising application of GH-EXIN in two-way hierarchical clustering, for the analysis of gene expression data in the study of the colorectal cancer is described.
Affiliation(s)
- Giansalvo Cirrincione
  - University of South Pacific, Suva, Fiji; University of Picardie Jules Verne, Amiens, France
19
Acharya S, Saha S, Sahoo P. Bi-clustering of microarray data using a symmetry-based multi-objective optimization framework. Soft comput 2019. [DOI: 10.1007/s00500-018-3227-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/16/2022]
20
|
Zhang L, Wei Y, Yan X, Li N, Song H, Yang L, Wu Y, Xi YF, Weng HW, Li JH, Lin EH, Zou LQ. Survivin is a prognostic marker and therapeutic target for extranodal, nasal-type natural killer/T cell lymphoma. ANNALS OF TRANSLATIONAL MEDICINE 2019; 7:316. [PMID: 31475186 DOI: 10.21037/atm.2019.06.53] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 02/05/2023]
Abstract
Background The relationship between survivin and extranodal, nasal-type natural killer/T cell lymphoma (ENKTCL) has not yet been clearly established. We studied the potential prognostic role of survivin and its implications as a target in ENKTCL therapy. Methods Peripheral blood from ENKTCL patients was collected and tested by ELISA. ENKTCL cell lines were cultured with or without a survivin inhibitor and tested by MTT assay and flow cytometry. Using the gene expression profiles from the ArrayExpress Archive under accession E-TABM-702, a survivin co-regulated cluster was established with the Coupled Two-way Clustering Algorithm. Results Of the 17 ENKTCL patients, 17.6% were serum survivin-positive. These patients had poorer outcomes than survivin-negative cases (P<0.01). Analysis of survivin co-regulated genes in ENKTCL revealed that survivin was significantly involved in pluripotency, drug resistance, cell cycle and proliferation, indicating that it may be one of the key regulators in ENKTCL and might be a latent therapeutic target. Our results showed that YM155, a survivin inhibitor, had a strong anti-tumor effect on ENKTCL cell lines in a dose-dependent manner. It increased the sub-G1 phase population and reduced the G1- and G2-M phase populations (P<0.05). In addition, combining YM155 with DDP induced a larger decrease in cell viability than either agent alone and had a higher inhibition rate than the Bliss index, suggesting synergistic inhibition. Conclusions We concluded that survivin is a potential prognostic marker and a critical regulatory molecule in the pathological process of ENKTCL. It would be a promising target in drug discovery for ENKTCL therapy.
Affiliation(s)
- Li Zhang
- Department of Medical Oncology, Cancer Center, West China Hospital of Sichuan University, Chengdu 610041, China; National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen 518116, China
- Yi Wei
- The Centre Transport Department of West China Hospital, Sichuan University, Chengdu 610065, China
- Xiaowei Yan
- Institute for Systems Biology, Seattle, Washington, USA
- Na Li
- Department of Medical Oncology, Cancer Center, West China Hospital of Sichuan University, Chengdu 610041, China
- Haolan Song
- Department of Laboratory Medicine, West China Hospital of Sichuan University, Chengdu 610041, China
- Li Yang
- State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610065, China
- Yang Wu
- State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610065, China
- Yu-Feng Xi
- State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610065, China
- Hua-Wei Weng
- Department of Medical Oncology, Cancer Center, West China Hospital of Sichuan University, Chengdu 610041, China
- Jian-Hua Li
- Department of Medical Oncology, Cancer Center, West China Hospital of Sichuan University, Chengdu 610041, China
- Edward H Lin
- P4 Medicine Institute, University of Washington, Seattle, Washington, USA
- Li-Qun Zou
- Department of Medical Oncology, Cancer Center, West China Hospital of Sichuan University, Chengdu 610041, China; State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610065, China
|
21
|
Singh A, Bhanot G, Khiabanian H. TuBA: Tunable biclustering algorithm reveals clinically relevant tumor transcriptional profiles in breast cancer. Gigascience 2019; 8:giz064. [PMID: 31216036 PMCID: PMC6582332 DOI: 10.1093/gigascience/giz064] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/20/2019] [Revised: 04/17/2019] [Accepted: 05/06/2019] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Traditional clustering approaches for gene expression data are not well adapted to address the complexity and heterogeneity of tumors, where small sets of genes may be aberrantly co-expressed in specific subsets of tumors. Biclustering algorithms that perform local clustering on subsets of genes and conditions help address this problem. We propose a graph-based Tunable Biclustering Algorithm (TuBA) based on a novel pairwise proximity measure, examining the relationship of samples at the extremes of genes' expression profiles to identify similarly altered signatures. RESULTS TuBA's predictions are consistent in 3,940 breast invasive carcinoma samples from 3 independent sources, using different technologies for measuring gene expression (RNA sequencing and Microarray). More than 60% of biclusters identified independently in each dataset had significant agreement in their gene sets, as well as similar clinical implications. Approximately 50% of biclusters were enriched in the estrogen receptor-negative/HER2-negative (or basal-like) subtype, while >50% were associated with transcriptionally active copy number changes. Biclusters representing gene co-expression patterns in stromal tissue were also identified in tumor specimens. CONCLUSIONS TuBA offers a simple biclustering method that can identify biologically relevant gene co-expression signatures not captured by traditional unsupervised clustering approaches. It complements biclustering approaches that are designed to identify constant or coherent submatrices in gene expression datasets, and outperforms them in identifying a multitude of altered transcriptional profiles that are associated with observed genomic heterogeneity of diseased states in breast cancer, both within and across tumor subtypes, a promising step in understanding disease heterogeneity, and a necessary first step in individualized therapy.
Affiliation(s)
- Amartya Singh
- Department of Physics and Astronomy, Rutgers University, 136 Frelinghuysen Rd, Piscataway, NJ 08854
- Center for Systems and Computational Biology, Rutgers Cancer Institute, Rutgers University, 195 Little Albany St, New Brunswick, NJ 08903
- Gyan Bhanot
- Department of Physics and Astronomy, Rutgers University, 136 Frelinghuysen Rd, Piscataway, NJ 08854
- Center for Systems and Computational Biology, Rutgers Cancer Institute, Rutgers University, 195 Little Albany St, New Brunswick, NJ 08903
- Department of Molecular Biology and Biochemistry, Rutgers University, 604 Allison Rd, Piscataway, NJ 08854
- Hossein Khiabanian
- Department of Physics and Astronomy, Rutgers University, 136 Frelinghuysen Rd, Piscataway, NJ 08854
- Center for Systems and Computational Biology, Rutgers Cancer Institute, Rutgers University, 195 Little Albany St, New Brunswick, NJ 08903
- Department of Molecular Biology and Biochemistry, Rutgers University, 604 Allison Rd, Piscataway, NJ 08854
- Department of Pathology and Laboratory Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers University, One Robert Wood Johnson Place, New Brunswick, NJ 08903
|
22
|
Wang T, Zhang J, Huang K. Generalized gene co-expression analysis via subspace clustering using low-rank representation. BMC Bioinformatics 2019; 20:196. [PMID: 31074376 PMCID: PMC6509871 DOI: 10.1186/s12859-019-2733-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022] Open
Abstract
Background Gene Co-expression Network Analysis (GCNA) helps identify gene modules with potential biological functions and has become a popular method in bioinformatics and biomedical research. However, most current GCNA algorithms use correlation to build gene co-expression networks and identify modules with highly correlated genes. There is a need to look beyond correlation and identify gene modules using other similarity measures for finding novel biologically meaningful modules. Results We propose a new generalized gene co-expression analysis algorithm via subspace clustering that can identify biologically meaningful gene co-expression modules with genes that are not all highly correlated. We use low-rank representation to construct gene co-expression networks and local maximal quasi-clique merger to identify gene co-expression modules. We applied our method on three large microarray datasets and a single-cell RNA sequencing dataset. We demonstrate that our method can identify gene modules with different biological functions than current GCNA methods and find gene modules with prognostic values. Conclusions The presented method takes advantage of subspace clustering to generate gene co-expression networks rather than using correlation as the similarity measure between genes. Our generalized GCNA method can provide new insights from gene expression datasets and serve as a complement to current GCNA algorithms.
Affiliation(s)
- Tongxin Wang
- Department of Computer Science, Indiana University Bloomington, Bloomington, 47408, IN, USA
- Jie Zhang
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, 46202, IN, USA
- Kun Huang
- Department of Medicine, Indiana University School of Medicine, Indianapolis, 46202, IN, USA; Regenstrief Institute, Indianapolis, 46202, IN, USA
|
23
|
Probabilistic density-based estimation of the number of clusters using the DBSCAN-martingale process. Pattern Recognit Lett 2019. [DOI: 10.1016/j.patrec.2019.03.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
|
24
|
Rahaman MA, Turner JA, Gupta CN, Rachakonda S, Chen J, Liu J, van Erp TGM, Potkin S, Ford J, Mathalon D, Lee HJ, Jiang W, Mueller BA, Andreassen O, Agartz I, Sponheim SR, Mayer AR, Stephen J, Jung RE, Canive J, Bustillo J, Calhoun VD. N-BiC: A Method for Multi-Component and Symptom Biclustering of Structural MRI Data: Application to Schizophrenia. IEEE Trans Biomed Eng 2019; 67:110-121. [PMID: 30946659 PMCID: PMC7906485 DOI: 10.1109/tbme.2019.2908815] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/07/2022]
Abstract
OBJECTIVE We propose and develop a novel biclustering (N-BiC) approach for performing N-way biclustering of neuroimaging data. Our approach is applicable to an arbitrary number of features from both imaging and behavioral data (e.g., symptoms). We applied it to structural MRI data from patients with schizophrenia. METHODS It uses a source-based morphometry approach [i.e., independent component analysis of gray matter segmentation maps] to decompose the data into a set of spatial maps, each of which includes regions that covary among individuals. Then, the loading parameters for components of interest are entered to an exhaustive search, which incorporates a modified depth-first search technique to carry out the biclustering, with the goal of obtaining submatrices where the selected rows (individuals) show homogeneity in their expressions of selected columns (components) and vice versa. RESULTS Findings demonstrate that multiple biclusters have an evident association with distinct brain networks for the different types of symptoms in schizophrenia. The study identifies two components: inferior temporal gyrus (16) and brainstem (7), which are related to positive (distortion/excess of normal function) and negative (diminution/loss of normal function) symptoms in schizophrenia, respectively. CONCLUSION N-BiC is a data-driven method of biclustering MRI data that can exhaustively explore relationships/substructures from a dataset without any prior information with a higher degree of robustness than earlier biclustering applications. SIGNIFICANCE The use of such approaches is important to investigate the underlying biological substrates of mental illness by grouping patients into homogeneous subjects, as the schizophrenia diagnosis is known to be relatively nonspecific and heterogeneous.
|
25
|
Kalisky T, Oriel S, Bar-Lev TH, Ben-Haim N, Trink A, Wineberg Y, Kanter I, Gilad S, Pyne S. A brief review of single-cell transcriptomic technologies. Brief Funct Genomics 2019; 17:64-76. [PMID: 28968725 DOI: 10.1093/bfgp/elx019] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Indexed: 02/04/2023] Open
Abstract
In recent years, there has been an effort to develop new technologies for measuring gene expression and sequence information from thousands of individual cells. Large data sets obtained using these 'single cell' technologies have allowed scientists to address fundamental questions in biomedicine ranging from stem cells and development to cancer and immunology. Here, we provide a brief review of recent developments in single-cell technology. Our intention is to provide a quick background for newcomers to the field as well as a deeper description of some of the leading technologies to date.
|
26
|
Gupta MK, Vadde R. Identification and characterization of differentially expressed genes in Type 2 Diabetes using in silico approach. Comput Biol Chem 2019; 79:24-35. [PMID: 30708140 DOI: 10.1016/j.compbiolchem.2019.01.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 03/29/2018] [Revised: 12/26/2018] [Accepted: 01/23/2019] [Indexed: 12/14/2022]
Abstract
Diabetes mellitus is clinically characterized by hyperglycemia. Although many studies have sought to understand the mechanism of Type 2 Diabetes (T2D), the complete network of diabetes and its associated disorders through polygenic involvement is still under debate. The present study was designed to re-analyze publicly available T2D-related raw microarray datasets in the GEO database and T2D gene information in the GWAS catalog, in order to screen for differentially expressed genes (DEGs) and identify key hub genes associated with T2D. T2D-related microarray data were downloaded from the Gene Expression Omnibus (GEO) database and re-analyzed with in-house R package scripts for background correction, normalization and identification of DEGs in T2D. T2D-related DEG information was also retrieved from the GWAS catalog. The two DEG lists were merged after removal of overlapping genes. The screened DEGs were then used to identify and characterize key hub genes in T2D and its associated diseases using the STRING, WebGestalt and Panther databases. Computational analysis revealed that, out of 99 key hub gene candidates identified from 348 DEGs, only four genes (CCL2, ELMO1, VEGFA and TCF7L2), along with FOS, play a key role in causing T2D and its associated disorders, such as nephropathy, neuropathy, rheumatoid arthritis and cancer, via p53 or Wnt signaling pathways. MIR-29 and MAZ_Q6 were identified as a potential target microRNA and transcription factor (TF), respectively, along with the probable drugs alprostadil, collagenase and dinoprostone, for the key hub gene candidates. The results suggest that the identified key DEGs may play promising roles in the prevention of diabetes.
Affiliation(s)
- Manoj Kumar Gupta
- Department of Biotechnology & Bioinformatics, Yogi Vemana University, Kadapa 516003, Andhra Pradesh, India.
- Ramakrishna Vadde
- Department of Biotechnology & Bioinformatics, Yogi Vemana University, Kadapa 516003, Andhra Pradesh, India.
|
27
|
Abstract
Cluster analysis has been widely applied by researchers from several scientific fields over recent decades. Advances in knowledge of biological phenomena have revived interest in cluster analysis, due in part to the large amount of microarray data. Traditional clustering algorithms, apart from requiring user-defined parameters, show clear limitations in handling microarray data owing to its inherent characteristics: high dimensionality with low sample size, high redundancy, and noise. This has motivated the study of clustering algorithms tailored to the task of analyzing microarray data, which continue to be developed and adapted. The present chapter reviews clustering methods with different cluster analysis approaches in the challenging context of microarray data. Furthermore, the validation of clustering results is briefly discussed by means of validity indexes used to assess the goodness of the number of clusters and the induced cluster assignments.
Affiliation(s)
- Juana-María Vivo
- Department of Statistics and Operations Research, University of Murcia, Murcia, Spain.
|
28
|
Single-cell analyses demonstrate that a heme-GATA1 feedback loop regulates red cell differentiation. Blood 2018; 133:457-469. [PMID: 30530752 DOI: 10.1182/blood-2018-05-850412] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 05/15/2018] [Accepted: 12/01/2018] [Indexed: 01/07/2023] Open
Abstract
Erythropoiesis is the complex, dynamic, and tightly regulated process that generates all mature red blood cells. To understand this process, we mapped the developmental trajectories of progenitors from wild-type, erythropoietin-treated, and Flvcr1-deleted mice at single-cell resolution. Importantly, we linked the quantity of each cell's surface proteins to its total transcriptome, which is a novel method. Deletion of Flvcr1 results in high levels of intracellular heme, allowing us to identify heme-regulated circuitry. Our studies demonstrate that in early erythroid cells (CD71+Ter119neg-lo), heme increases ribosomal protein transcripts, suggesting that heme, in addition to upregulating globin transcription and translation, guarantees ample ribosomes for globin synthesis. In later erythroid cells (CD71+Ter119lo-hi), heme decreases GATA1, GATA1-target gene, and mitotic spindle gene expression. These changes occur quickly. For example, in confirmatory studies using human marrow erythroid cells, ribosomal protein transcripts and proteins increase, and GATA1 transcript and protein decrease, within 15 to 30 minutes of amplifying endogenous heme synthesis with aminolevulinic acid. Because GATA1 initiates heme synthesis, GATA1 and heme together direct red cell maturation, and heme stops GATA1 synthesis, our observations reveal a GATA1-heme autoregulatory loop and implicate GATA1 and heme as the comaster regulators of the normal erythroid differentiation program. In addition, as excessive heme could amplify ribosomal protein imbalance, prematurely lower GATA1, and impede mitosis, these data may help explain the ineffective (early termination of) erythropoiesis in Diamond Blackfan anemia and del(5q) myelodysplasia, disorders with excessive heme in colony-forming unit-erythroid/proerythroblasts, explain why these anemias are macrocytic, and show why children with GATA1 mutations have DBA-like clinical phenotypes.
|
29
|
Clustering, Pathway Enrichment, and Protein-Protein Interaction Analysis of Gene Expression in Neurodevelopmental Disorders. Adv Pharmacol Sci 2018; 2018:3632159. [PMID: 30598663 PMCID: PMC6288580 DOI: 10.1155/2018/3632159] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 08/09/2018] [Accepted: 10/30/2018] [Indexed: 12/21/2022] Open
Abstract
Neuronal developmental disorder is a class of diseases in which there is impairment of the central nervous system and brain function. The brain in its developmental phase undergoes tremendous changes depending upon the stage and environmental factors. Neurodevelopmental disorders include abnormalities associated with cognitive, speech, reading, writing, linguistic, communication, and growth disorders with lifetime effects. Computational methods provide great potential for betterment of research and insight into the molecular mechanism of diseases. In this study, we have used four samples of microarray neuronal developmental data: control, RV (resveratrol), NGF (nerve growth factor), and RV + NGF. By using computational methods, we have identified genes that are expressed in the early stage of neuronal development and also involved in neuronal diseases. We have used MeV application to cluster the raw data using distance metric Pearson correlation coefficient. Finally, 60 genes were selected on the basis of coexpression analysis. Further pathway analysis was done using the Metascape tool, and the biological process was studied using gene ontology database. A total of 13 genes AKT1, BAD, BAX, BCL2, BDNF, CASP3, CASP8, CASP9, MYC, PIK3CD, MAPK1, MAPK10, and CYCS were identified that are common in all clusters. These genes are involved in neuronal developmental disorders and cancers like colorectal cancer, apoptosis, tuberculosis, amyotrophic lateral sclerosis (ALS), neuron death, and prostate cancer pathway. A protein-protein interaction study was done to identify proteins that belong to the same pathway. These genes can be used to design potential inhibitors against neurological disorders at the early stage of neuronal development. The microarray samples discussed in this publication are part of the data deposited in NCBI's Gene Expression Omnibus (Yadav et al., 2018) and are accessible through GEO Series (accession number GSE121261).
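The abstract above mentions clustering with the Pearson correlation coefficient as the distance metric. As a rough illustration only, the sketch below computes a correlation-based distance between two expression profiles. The form `1 - r` and the function name `pearson_distance` are assumptions made here for illustration; the abstract does not state which variant the MeV application uses.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson r for two equal-length profiles (assumed distance form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations and the two root sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


# Hypothetical expression profiles across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # perfectly co-expressed with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with gene_a

d_same = pearson_distance(gene_a, gene_b)
d_opposite = pearson_distance(gene_a, gene_c)
```

Under this assumed definition, co-expressed profiles have a distance near 0 and anti-correlated profiles a distance near 2, regardless of scale.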
|
30
|
Yin L, Qiu J, Gao S. Biclustering of Gene Expression Data Using Cuckoo Search and Genetic Algorithm. INT J PATTERN RECOGN 2018. [DOI: 10.1142/s0218001418500398] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
Biclustering analysis of gene expression data can reveal a large number of biologically significant local gene expression patterns. Therefore, a large number of biclustering algorithms apply meta-heuristic algorithms such as the genetic algorithm (GA) and cuckoo search (CS) to find biclusters. However, different meta-heuristic algorithms have different applicability and characteristics. For example, the CS algorithm can obtain high-quality biclusters and has strong global search ability, but its local search ability is relatively poor. In contrast, the GA has strong local search ability, but its global search ability is poor. In order to improve both the global search ability and coverage of the biclusters and the local search ability and quality of the biclusters, this paper proposes a meta-heuristic algorithm based on the GA and CS algorithms (GA-CS Biclustering, GACSB) to solve the problem of gene expression data clustering. The algorithm uses the CS algorithm as the main framework, and uses the tournament strategy and the elite retention strategy based on the GA to generate the next generation of the population. Compared with the experimental results of common biclustering algorithms such as Cheng and Church (CC), FLexible Overlapped biClustering (FLOC), the Iterative Signature Algorithm (ISA), Sequential Evolutionary Biclustering (SEBI), SSB and CSB, the GACSB algorithm can obtain biclusters of both high quality and high biological significance. In addition, we use different bicluster evaluation indicators, such as Average Correlation Value (ACV), Mean-Squared Residue (MSR) and Virtual Error (VE), and verify that the GACSB algorithm has strong scalability.
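Of the evaluation indicators named in this abstract, the Mean-Squared Residue is the simplest to state: for a bicluster with rows I and columns J, each residue is the entry minus its row mean, minus its column mean, plus the overall mean, and the MSR is the average of the squared residues. The following is a minimal sketch of that standard Cheng and Church definition, not the GACSB authors' implementation; the function name and the toy matrix are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean-Squared Residue of the submatrix selected by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: deviation of the entry from the additive row/column model.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Toy data: the top-left 3x3 block follows a purely additive (shifting) pattern.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [0.0, 8.0, 2.0, 5.0],
]
coherent = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
whole = mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
```

A perfectly additive bicluster has an MSR of zero (up to floating-point error), while the full toy matrix, which contains unrelated values, scores higher; lower MSR therefore indicates a more coherent bicluster.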
Affiliation(s)
- Lu Yin
- School of Computer and Software, Huaiyin Institute of Technology, Mei Cheng Road No. 1, Huaian 223001, P. R. China
- Junlin Qiu
- Huaian Yile Education and Technology Co. Ltd., Huaihai Road No. 23, Huaian 223001, P. R. China
- Shangbing Gao
- School of Computer and Software, Huaiyin Institute of Technology, Mei Cheng Road No. 1, Huaian 223001, P. R. China
|
31
|
Ding KF, Finlay D, Yin H, Hendricks WPD, Sereduk C, Kiefer J, Sekulic A, LoRusso PM, Vuori K, Trent JM, Schork NJ. Network Rewiring in Cancer: Applications to Melanoma Cell Lines and the Cancer Genome Atlas Patients. Front Genet 2018; 9:228. [PMID: 30042785 PMCID: PMC6048451 DOI: 10.3389/fgene.2018.00228] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/17/2018] [Accepted: 06/08/2018] [Indexed: 01/21/2023] Open
Abstract
Genes do not work in isolation, but rather as part of networks that have many feedback and redundancy mechanisms. Studying the properties of genetic networks and how individual genes contribute to overall network functions can provide insight into genetically-mediated disease processes. Most analytical techniques assume a network topology based on normal state networks. However, gene perturbations often lead to the rewiring of relevant networks and impact relationships among other genes. We apply a suite of analysis methodologies to assess the degree of transcriptional network rewiring observed in different sets of melanoma cell lines using whole genome gene expression microarray profiles. We assess evidence for network rewiring in melanoma patient tumor samples using RNA-sequence data available from The Cancer Genome Atlas. We make a distinction between “unsupervised” and “supervised” network-based methods and contrast their use in identifying consistent differences in networks between subsets of cell lines and tumor samples. We find that different genes play more central roles within subsets of genes within a broader network and hence are likely to be better drug targets in a disease state. Ultimately, we argue that our results have important implications for understanding the molecular pathology of melanoma as well as the choice of treatments to combat that pathology.
Affiliation(s)
- Kuan-Fu Ding
- J. Craig Venter Institute, La Jolla, CA, United States; Department of Bioengineering, University of California, San Diego, San Diego, CA, United States
- Darren Finlay
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, United States
- Hongwei Yin
- The Translational Genomics Research Institute, Phoenix, AZ, United States
- Chris Sereduk
- The Translational Genomics Research Institute, Phoenix, AZ, United States
- Jeffrey Kiefer
- The Translational Genomics Research Institute, Phoenix, AZ, United States
- Aleksandar Sekulic
- The Translational Genomics Research Institute, Phoenix, AZ, United States
- Patricia M LoRusso
- Department of Medical Oncology, Yale Cancer Center, Yale University, New Haven, CT, United States
- Kristiina Vuori
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, United States
- Jeffrey M Trent
- The Translational Genomics Research Institute, Phoenix, AZ, United States
- Nicholas J Schork
- J. Craig Venter Institute, La Jolla, CA, United States; Department of Bioengineering, University of California, San Diego, San Diego, CA, United States; The Translational Genomics Research Institute, Phoenix, AZ, United States; Department of Psychiatry, University of California, San Diego, San Diego, CA, United States
|
32
|
|
33
|
A contiguous column coherent evolution biclustering algorithm for time-series gene expression data. INT J MACH LEARN CYB 2018. [DOI: 10.1007/s13042-015-0487-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
34
|
Bentham RB, Bryson K, Szabadkai G. MCbiclust: a novel algorithm to discover large-scale functionally related gene sets from massive transcriptomics data collections. Nucleic Acids Res 2017; 45:8712-8730. [PMID: 28911113 PMCID: PMC5587796 DOI: 10.1093/nar/gkx590] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 09/12/2016] [Accepted: 07/01/2017] [Indexed: 12/16/2022] Open
Abstract
The potential to understand fundamental biological processes from gene expression data has grown in parallel with the recent explosion of the size of data collections. However, to exploit this potential, novel analytical methods are required, capable of discovering large co-regulated gene networks. We found current methods limited in the size of correlated gene sets they could discover within biologically heterogeneous data collections, hampering the identification of multi-gene controlled fundamental cellular processes such as energy metabolism, organelle biogenesis and stress responses. Here we describe a novel biclustering algorithm called Massively Correlated Biclustering (MCbiclust) that selects samples and genes from large datasets with maximal correlated gene expression, allowing regulation of complex networks to be examined. The method has been evaluated using synthetic data and applied to large bacterial and cancer cell datasets. We show that the large biclusters discovered, so far elusive to identification by existing techniques, are biologically relevant and thus MCbiclust has great potential in the analysis of transcriptomics data to identify large-scale unknown effects hidden within the data. The identified massive biclusters can be used to develop improved transcriptomics based diagnosis tools for diseases caused by altered gene expression, or used for further network analysis to understand genotype-phenotype correlations.
Affiliation(s)
- Robert B Bentham
- Department of Cell and Developmental Biology, Consortium for Mitochondrial Research, University College London, London WC1E 6BT, UK; The Francis Crick Institute, London NW1 1AT, UK
- Kevin Bryson
- Department of Computer Sciences, University College London, London WC1E 6BT, UK
- Gyorgy Szabadkai
- Department of Cell and Developmental Biology, Consortium for Mitochondrial Research, University College London, London WC1E 6BT, UK; The Francis Crick Institute, London NW1 1AT, UK; Department of Biomedical Sciences, University of Padua, 35131 Padua, Italy
|
35
|
|
36
|
Shi F, Huang H. Identifying Cell Subpopulations and Their Genetic Drivers from Single-Cell RNA-Seq Data Using a Biclustering Approach. J Comput Biol 2017; 24:663-674. [PMID: 28657835 PMCID: PMC5510693 DOI: 10.1089/cmb.2017.0049] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
Single-cell RNA-Seq (scRNA-Seq) has attracted much attention recently because it allows unprecedented resolution into cellular activity; the technology, therefore, has been widely applied in studying cell heterogeneity such as the heterogeneity among embryonic cells at varied developmental stages or cells of different cancer types or subtypes. A pertinent question in such analyses is to identify cell subpopulations as well as their associated genetic drivers. Consequently, a multitude of approaches have been developed for clustering or biclustering analysis of scRNA-Seq data. In this article, we present a fast and simple iterative biclustering approach called "BiSNN-Walk" based on the existing SNN-Cliq algorithm. One of BiSNN-Walk's differentiating features is that it returns a ranked list of clusters, which may serve as an indicator of a cluster's reliability. Another important feature is that BiSNN-Walk ranks genes in a gene cluster according to their level of affiliation to the associated cell cluster, making the result more biologically interpretable. We also introduce an entropy-based measure for choosing a highly clusterable similarity matrix as our starting point among a wide selection to facilitate the efficient operation of our algorithm. We applied BiSNN-Walk to three large scRNA-Seq studies, where we demonstrated that BiSNN-Walk was able to retain and sometimes improve the cell clustering ability of SNN-Cliq. We were able to obtain biologically sensible gene clusters in terms of GO term enrichment. In addition, we saw that there was significant overlap in top characteristic genes for clusters corresponding to similar cell states, further demonstrating the fidelity of our gene clusters.
Affiliation(s)
- Funan Shi
  - Department of Statistics, University of California, Berkeley, California
- Haiyan Huang
  - Department of Statistics, University of California, Berkeley, California
|
37
|
Huang Y. Clustering multi-typed objects in extended star-structured heterogeneous data. Intell Data Anal 2017. [DOI: 10.3233/ida-150416] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
|
38
|
Henriques R, Ferreira FL, Madeira SC. BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics 2017; 18:82. [PMID: 28153040 PMCID: PMC5290636 DOI: 10.1186/s12859-017-1493-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/22/2016] [Accepted: 01/21/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific.
METHODS: To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1-30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1-12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39-48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces.
RESULTS: Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data.
CONCLUSIONS: BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data.
SOFTWARE: BicPAMS and its tutorial are available at http://www.bicpams.com.
Affiliation(s)
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
39
|
Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, Achas M, Adebiyi E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform Biol Insights 2016; 10:237-253. [PMID: 27932867 PMCID: PMC5135122 DOI: 10.4137/bbi.s38316] [Citation(s) in RCA: 69] [Impact Index Per Article: 8.6] [Received: 05/18/2016] [Revised: 09/05/2016] [Accepted: 09/09/2016] [Indexed: 12/17/2022] Open
Abstract
Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a valuable opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved increase the challenge of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Clustering techniques are therefore a first step toward addressing these challenges, and an essential part of the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. The clustering of gene expression data has proven useful for revealing the natural structure inherent in the data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to identify the appropriate clustering technique that will guarantee stability and a high degree of accuracy in the analysis procedure.
Affiliation(s)
- Jelili Oyelade
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Itunuoluwa Isewon
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Funke Oladipupo
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Olufemi Aromolaran
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Efosa Uwoghiren
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Faridah Ameh
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Moses Achas
  - Department of Computer Science and Information Technology, Bells University of Technology, Ota, Ogun State, Nigeria
- Ezekiel Adebiyi
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
|
40
|
Fushing H, Hsueh CH, Heitkamp C, Matthews MA, Koehl P. Unravelling the geometry of data matrices: effects of water stress regimes on winemaking. J R Soc Interface 2016; 12:20150753. [PMID: 26468072 DOI: 10.1098/rsif.2015.0753] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
A new method is proposed for unravelling the patterns between a set of experiments and the features that characterize those experiments. The aims are to extract these patterns in the form of a coupling between the rows and columns of the corresponding data matrix and to use this geometry as a support for model testing. These aims are reached through two key steps, namely application of an iterative geometric approach to couple the metric spaces associated with the rows and columns, and use of statistical physics to generate matrices that mimic the original data while maintaining their inherent structure, thereby providing the basis for hypothesis testing and statistical inference. The power of this new method is illustrated on the study of the impact of water stress conditions on the attributes of 'Cabernet Sauvignon' Grapes, Juice, Wine and Bottled Wine from two vintages. The first step, named data mechanics, de-convolutes the intrinsic effects of grape berries and wine attributes due to the experimental irrigation conditions from the extrinsic effects of the environment. The second step provides an analysis of the associations of some attributes of the bottled wine with characteristics of either the matured grape berries or the resulting juice, thereby identifying statistically significant associations between the juice pH, yeast assimilable nitrogen, and sugar content and the bottled wine alcohol level.
Affiliation(s)
- Hsieh Fushing
  - Department of Statistics, University of California, Davis, CA 95616, USA
- Chih-Hsin Hsueh
  - Department of Statistics, University of California, Davis, CA 95616, USA
- Constantin Heitkamp
  - Department of Viticulture and Enology, University of California, Davis, CA 95616, USA
- Mark A Matthews
  - Department of Viticulture and Enology, University of California, Davis, CA 95616, USA
- Patrice Koehl
  - Department of Computer Science and Genome Center, University of California, Davis, CA 95616, USA
|
41
|
Abstract
In this paper we present a structured overview of methods for two-mode clustering, that is, methods that provide a simultaneous clustering of the rows and columns of a rectangular data matrix. Key structuring principles include the nature of row, column and data clusters and the type of model structure or associated loss function. We illustrate with analyses of symptom data on archetypal psychiatric patients.
Affiliation(s)
- Iven Van Mechelen
  - Psychology Department, University of Leuven, Tiensestraat 102, B-3000 Leuven, Belgium
|
42
|
Ensemble Feature Learning of Genomic Data Using Support Vector Machine. PLoS One 2016; 11:e0157330. [PMID: 27304923 PMCID: PMC4909287 DOI: 10.1371/journal.pone.0157330] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 07/25/2015] [Accepted: 05/28/2016] [Indexed: 11/29/2022] Open
Abstract
The identification of a subset of genes that captures the information needed to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in gene selection and classification. A case in point is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy that underlies the RFE algorithm. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate features is based on the ranking of multiple SVM models instead of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that ESVM-RFE achieves on average 9% better accuracy than SVM-RFE and 5% better than a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters in the selected data.
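The control flow described in this abstract (balanced bootstrap samples, one feature ranking per model, rank aggregation, backward elimination) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: to stay dependency-free, the per-model ranking uses a hypothetical stand-in scorer (absolute difference of class means) where the paper uses SVM weights, and the function names, the number of models and the toy data are all invented for the example.

```python
import random
from statistics import mean


def balanced_bootstrap(X, y, rng):
    """Draw a bootstrap sample with the same number of rows from each class."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    per_class = min(len(rows) for rows in by_class.values())
    sample_X, sample_y = [], []
    for label, rows in by_class.items():
        for _ in range(per_class):
            sample_X.append(rng.choice(rows))  # sampling with replacement
            sample_y.append(label)
    return sample_X, sample_y


def stand_in_scores(X, y, features):
    """Hypothetical scorer (NOT an SVM): |difference of class means| per feature."""
    first, second = sorted(set(y))
    a = [row for row, label in zip(X, y) if label == first]
    b = [row for row, label in zip(X, y) if label == second]
    return {f: abs(mean(r[f] for r in a) - mean(r[f] for r in b)) for f in features}


def ensemble_rfe(X, y, n_keep, n_models=25, seed=0):
    """Backward elimination driven by rankings aggregated over bootstrap models."""
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        rank_sum = {f: 0 for f in features}
        for _ in range(n_models):
            bx, by = balanced_bootstrap(X, y, rng)
            scores = stand_in_scores(bx, by, features)
            ordered = sorted(features, key=lambda f: scores[f], reverse=True)
            for rank, f in enumerate(ordered):
                rank_sum[f] += rank  # rank 0 is the best feature in this model
        worst = max(features, key=lambda f: rank_sum[f])
        features.remove(worst)  # eliminate the feature with the worst aggregate rank
    return features


# Toy imbalanced data: 10 samples of class 1, 20 of class 0.
# Features 0 and 2 carry signal; features 1 and 3 are pure noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(30):
    label = 1 if i < 10 else 0
    X.append([
        4 * label + data_rng.gauss(0, 1),
        data_rng.gauss(0, 1),
        -4 * label + data_rng.gauss(0, 1),
        data_rng.gauss(0, 1),
    ])
    y.append(label)

selected = ensemble_rfe(X, y, n_keep=2)
print(sorted(selected))
```

Swapping `stand_in_scores` for the squared weights of a linear SVM fitted on each bootstrap sample would bring the sketch closer to the method as described, at the cost of an external dependency.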
|
43
|
Wagner JR, Lee CT, Durrant JD, Malmstrom RD, Feher VA, Amaro RE. Emerging Computational Methods for the Rational Discovery of Allosteric Drugs. Chem Rev 2016; 116:6370-90. [PMID: 27074285 PMCID: PMC4901368 DOI: 10.1021/acs.chemrev.5b00631] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Indexed: 12/17/2022]
Abstract
Allosteric drug development holds promise for delivering medicines that are more selective and less toxic than those that target orthosteric sites. To date, the discovery of allosteric binding sites and lead compounds has been mostly serendipitous, achieved through high-throughput screening. Over the past decade, structural data has become more readily available for larger protein systems and more membrane protein classes (e.g., GPCRs and ion channels), which are common allosteric drug targets. In parallel, improved simulation methods now provide better atomistic understanding of the protein dynamics and cooperative motions that are critical to allosteric mechanisms. As a result of these advances, the field of predictive allosteric drug development is now on the cusp of a new era of rational structure-based computational methods. Here, we review algorithms that predict allosteric sites based on sequence data and molecular dynamics simulations, describe tools that assess the druggability of these pockets, and discuss how Markov state models and topology analyses provide insight into the relationship between protein dynamics and allosteric drug binding. In each section, we first provide an overview of the various method classes before describing relevant algorithms and software packages.
Affiliation(s)
- Jeffrey R Wagner
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Christopher T Lee
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Jacob D Durrant
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Robert D Malmstrom
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Victoria A Feher
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Rommie E Amaro
  - Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
|
44
|
Wang Z, Li G, Robinson RW, Huang X. UniBic: Sequential row-based biclustering algorithm for analysis of gene expression data. Sci Rep 2016; 6:23466. [PMID: 27001340 PMCID: PMC4802312 DOI: 10.1038/srep23466] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 06/18/2015] [Accepted: 03/08/2016] [Indexed: 11/29/2022] Open
Abstract
Biclustering algorithms, which aim to provide an effective and efficient way to analyze gene expression data by finding a group of genes with trend-preserving expression patterns under certain conditions, have been widely developed since Morgan et al. pioneered work on partitioning a data matrix into submatrices with approximately constant values. However, the identification of general trend-preserving biclusters, which are the most meaningful substructures hidden in gene expression data, remains a highly challenging problem. We found an elementary method by which biologically meaningful trend-preserving biclusters can be readily identified from large, noisy and complex data. The basic idea is to apply the longest common subsequence (LCS) framework to selected pairs of rows in an index matrix derived from an input data matrix to locate a seed for each bicluster to be identified. We tested it on synthetic and real datasets and compared its performance with currently competitive biclustering tools. We found that the new algorithm, named UniBic, outperformed all previous biclustering algorithms in terms of commonly used evaluation scenarios except for BicSPAM on narrow biclusters. The latter was somewhat better at finding narrow biclusters, the task for which it was specifically designed.
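The seed-finding idea in this abstract can be illustrated with a small sketch. It assumes that a row of the index matrix lists the column indices sorted by increasing expression value, which the abstract does not spell out, and it uses the textbook dynamic-programming LCS; the function names and the two toy rows are hypothetical. Columns in the LCS of two index rows are exactly those on which both genes rise in the same order.

```python
def index_row(values):
    """Column indices ordered by increasing value (assumed form of an index-matrix row)."""
    return sorted(range(len(values)), key=lambda j: values[j])


def longest_common_subsequence(a, b):
    """Textbook O(len(a) * len(b)) dynamic-programming LCS, returning one optimal subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


# Two toy expression rows over five conditions (columns 0..4).
gene_a = [1.0, 5.0, 2.0, 8.0, 3.0]
gene_b = [10.0, 50.0, 70.0, 90.0, 30.0]

seed_columns = longest_common_subsequence(index_row(gene_a), index_row(gene_b))
print(seed_columns)  # [0, 4, 1, 3]
```

On columns 0, 4, 1, 3 both rows increase (1, 3, 5, 8 and 10, 30, 50, 90), so these four conditions form a trend-preserving seed for the pair; column 2 breaks the shared order and is left out.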
Affiliation(s)
- Zhenjia Wang
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, P.R. China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, P.R. China
  - Department of Computer Science, Arkansas State University, Jonesboro, AR 72467
- Robert W. Robinson
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Xiuzhen Huang
  - Department of Computer Science, Arkansas State University, Jonesboro, AR 72467
|
45
|
Mortlock SA, Booth R, Mazrier H, Khatkar MS, Williamson P. Visualization of Genome Diversity in German Shepherd Dogs. Bioinform Biol Insights 2016; 9:37-42. [PMID: 26884680 PMCID: PMC4750897 DOI: 10.4137/bbi.s30524] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/23/2015] [Revised: 12/06/2015] [Accepted: 12/11/2015] [Indexed: 12/16/2022] Open
Abstract
A loss of genetic diversity may lead to increased disease risks in subpopulations of dogs. The canine breed structure has contributed to relatively small effective population size in many breeds and can limit the options for selective breeding strategies to maintain diversity. With the completion of the canine genome sequencing project, and the subsequent reduction in the cost of genotyping on a genomic scale, evaluating diversity in dogs has become much more accurate and accessible. This provides a potential tool for advising dog breeders and developing breeding programs within a breed. A challenge in doing this is to present complex relationship data in a form that can be readily utilized. Here, we demonstrate the use of a pipeline, known as NetView, to visualize the network of relationships in a subpopulation of German Shepherd Dogs.
Affiliation(s)
- Rachel Booth
  - Faculty of Veterinary Science, The University of Sydney, NSW, Australia
- Hamutal Mazrier
  - Faculty of Veterinary Science, The University of Sydney, NSW, Australia
- Mehar S Khatkar
  - Faculty of Veterinary Science, The University of Sydney, NSW, Australia
- Peter Williamson
  - Faculty of Veterinary Science, The University of Sydney, NSW, Australia
|
46
|
|
47
|
Discovery of bidirectional contiguous column coherent bicluster in time-series gene expression data. Int J Mach Learn Cyb 2015. [DOI: 10.1007/s13042-015-0464-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
48
|
Cheon M, Kim C, Chang I. Uncovering multiloci-ordering by algebraic property of Laplacian matrix and its Fiedler vector. Bioinformatics 2015; 32:801-7. [PMID: 26568627 DOI: 10.1093/bioinformatics/btv669] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/31/2015] [Accepted: 11/09/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Loci-ordering, based on two-point recombination fractions for pairs of loci, is the most important step in constructing a reliable and fine genetic map.
RESULTS: Using concepts from complex graph theory, we propose a Laplacian ordering approach that uncovers the ordering of multiple loci simultaneously. The algebraic property of the Fiedler vector of a Laplacian matrix, constructed from the recombination fractions for 26 loci of barley chromosome IV, 846 loci of Arabidopsis thaliana and 1903 loci of Malus domestica, together with a variable threshold, uncovers their loci-orders. It offers an alternative yet robust approach for ordering multiloci.
AVAILABILITY AND IMPLEMENTATION: Source code with the data set is available as supplementary data and also in the software category of the website (http://biophysics.dgist.ac.kr).
CONTACT: crkim@pusan.ac.kr or iksoochang@dgist.ac.kr
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
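A small sketch of ordering by Fiedler vector, under assumptions that go beyond the abstract: pairwise similarity is taken as 0.5 minus the recombination fraction, the toy recombination fractions are generated from made-up map positions with Haldane's map function, and the Fiedler vector is obtained by a plain shifted power iteration so that no external library is needed. The variable threshold mentioned in the abstract is not modelled, and all names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Haldane map function: recombination fraction for a map distance in centimorgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def fiedler_vector(weights, iterations=5000):
    """Eigenvector of the second-smallest eigenvalue of L = D - W, by shifted power iteration."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2.0 * max(degree)  # Gershgorin bound on the largest eigenvalue of L
    v = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean_v = sum(v) / n
        v = [x - mean_v for x in v]  # project out the constant eigenvector
        # Multiply by (shift * I - L) = shift * I - D + W.
        v = [
            shift * v[i] - degree[i] * v[i] + sum(weights[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Hypothetical loci, listed in scrambled order, with made-up positions in cM.
positions = {
    "locus_c": 20.0, "locus_a": 0.0, "locus_e": 41.0,
    "locus_b": 8.0, "locus_f": 60.0, "locus_d": 35.0,
}
names = list(positions)

# Assumed similarity: 0.5 - r, so tightly linked loci get large weights.
weights = [
    [
        0.0 if a == b
        else 0.5 - haldane_recombination(abs(positions[a] - positions[b]))
        for b in names
    ]
    for a in names
]

v = fiedler_vector(weights)
ordered = [names[i] for i in sorted(range(len(names)), key=lambda i: v[i])]
print(ordered)
```

Because an eigenvector is only defined up to sign, the recovered order may come out reversed; for a genetic map the two directions are equivalent.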
Affiliation(s)
- Mookyung Cheon
  - Creative Research Initiatives Center for Proteome Biophysics, Department of Brain and Cognitive Sciences, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu 711-873, Korea
- Choongrak Kim
  - Department of Statistics, Pusan National University, Busan 609-735, Korea
- Iksoo Chang
  - Creative Research Initiatives Center for Proteome Biophysics, Department of Brain and Cognitive Sciences, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu 711-873, Korea
|
49
|
Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS. Scatter search-based identification of local patterns with positive and negative correlations in gene expression data. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.06.019] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 01/08/2023]
|
50
|
Pontes B, Giráldez R, Aguilar-Ruiz JS. Biclustering on expression data: A review. J Biomed Inform 2015; 57:163-80. [PMID: 26160444 DOI: 10.1016/j.jbi.2015.06.028] [Citation(s) in RCA: 165] [Impact Index Per Article: 18.3] [Received: 01/22/2015] [Revised: 06/22/2015] [Accepted: 06/30/2015] [Indexed: 11/28/2022]
Abstract
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics on which they are based.
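As an illustration of what a bicluster quality measure looks like, the sketch below computes the mean squared residue introduced by Cheng and Church. The abstract does not name this measure; it is chosen here only as one widely known example of the evaluation measures the survey refers to, and the function name and toy matrices are hypothetical.

```python
def mean_squared_residue(block):
    """Mean squared residue of a bicluster given as a list of equal-length rows.

    Each residue is a_ij - row_mean_i - col_mean_j + overall_mean; a score of 0
    means the block follows a perfectly additive (shifting) pattern.
    """
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(block[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = block[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Rows differ only by a constant shift, so the residue is (numerically) zero.
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
# The same block with one perturbed entry scores clearly worse.
noisy = [[1.0, 2.0, 3.0], [2.0, 9.0, 4.0], [4.0, 5.0, 6.0]]

print(mean_squared_residue(additive))
print(mean_squared_residue(noisy))
```

A search heuristic of the measure-based kind would add or remove rows and columns so as to keep such a score below a chosen bound while growing the bicluster.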
Affiliation(s)
- Beatriz Pontes
  - Department of Languages and Computer Systems, University of Seville, Seville, Spain
- Raúl Giráldez
  - School of Engineering, Pablo de Olavide University, Seville, Spain
|