1
|
Kumar N, Srivastava R. Deep learning in structural bioinformatics: current applications and future perspectives. Brief Bioinform 2024; 25:bbae042. [PMID: 38701422 PMCID: PMC11066934 DOI: 10.1093/bib/bbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 01/05/2024] [Accepted: 01/18/2024] [Indexed: 05/05/2024] Open
Abstract
In this review article, we explore the transformative impact of deep learning (DL) on structural bioinformatics, emphasizing its pivotal role in a scientific revolution driven by extensive data, accessible toolkits and robust computing resources. As big data continue to advance, DL is poised to become an integral component in healthcare and biology, revolutionizing analytical processes. Our comprehensive review provides detailed insights into DL, featuring specific demonstrations of its notable applications in bioinformatics. We address challenges tailored for DL, spotlight recent successes in structural bioinformatics and present a clear exposition of DL, from basic shallow neural networks to advanced models such as convolutional, recurrent, artificial and transformer neural networks. This paper discusses the emerging use of DL for understanding biomolecular structures, anticipating ongoing developments and applications in the realm of structural bioinformatics.
Affiliation(s)
- Niranjan Kumar
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
| | - Rakesh Srivastava
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
| |
|
2
|
An iterative approach to unsupervised outlier detection using ensemble method and distance-based data filtering. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00674-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Outlier or anomaly detection is the process through which datum/data with different properties from the rest of the data is/are identified. Their importance lies in their use in various domains such as fraud detection, network intrusion detection, and spam filtering. In this paper, we introduce a new outlier detection algorithm based on an ensemble method and distance-based data filtering with an iterative approach to detect outliers in unlabeled data. The ensemble method is used to cluster the unlabeled data and to filter out potential isolated outliers from the same by iteratively using a cluster membership threshold until the Dunn index score for clustering is maximized. The distance-based data filtering, on the other hand, removes the potential outlier clusters from the post-clustered data based on a distance threshold using the Euclidean distance measure of each data point from the majority cluster as the filtering factor. The performance of our algorithm is evaluated by applying it to 10 real-world machine learning datasets. Finally, we compare the results of our algorithm to various supervised and unsupervised outlier detection algorithms using Precision@n and F-score evaluation metrics.
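The Dunn index named in this abstract as the stopping criterion can be illustrated with a short sketch. This is only the textbook definition (smallest distance between points in different clusters divided by the largest within-cluster diameter), not the authors' implementation; the function name `dunn_index` and the toy points are assumptions made for illustration.

```python
from itertools import combinations
from math import dist, inf


def dunn_index(clusters):
    """Textbook Dunn index for a list of clusters (each a list of point tuples).

    Higher values indicate compact, well-separated clusters.
    """
    # Smallest Euclidean distance between two points in different clusters.
    min_between = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest Euclidean distance between two points in the same cluster.
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    return inf if max_diameter == 0.0 else min_between / max_diameter


# Hypothetical toy data: the second version has one stray point in cluster 1.
clean = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
noisy = [[(0.0, 0.0), (0.0, 1.0), (1.5, 0.5)], [(3.0, 0.0), (3.0, 1.0)]]
print(dunn_index(clean), dunn_index(noisy))
```

In this toy case, dropping the stray point raises the index, which is the kind of improvement an iterative filter could use as its signal to continue or stop.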
|
3
|
Band-based similarity indices for gene expression classification and clustering. Sci Rep 2021; 11:21609. [PMID: 34732744 PMCID: PMC8566472 DOI: 10.1038/s41598-021-00678-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2021] [Accepted: 10/11/2021] [Indexed: 11/16/2022] Open
Abstract
The concept of depth induces an ordering from centre outwards in multivariate data. Most depth definitions are unfeasible for dimensions larger than three or four, but the Modified Band Depth (MBD) is a notable exception that has proven to be a valuable tool in the analysis of high-dimensional gene expression data. This depth definition relates the centrality of each individual to its (partial) inclusion in all possible bands formed by elements of the data set. We assess (dis)similarity between pairs of observations by accounting for such bands and constructing binary matrices associated to each pair. From these, contingency tables are calculated and used to derive standard similarity indices. Our approach is computationally efficient and can be applied to bands formed by any number of observations from the data set. We have evaluated the performance of several band-based similarity indices with respect to that of other classical distances in standard classification and clustering tasks in a variety of simulated and real data sets. However, the use of the method is not restricted to these, the extension to other similarity coefficients being straightforward. Our experiments show the benefits of our technique, with some of the selected indices outperforming, among others, the Euclidean distance.
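As a rough illustration of the "(partial) inclusion in all possible bands" idea, the sketch below computes the Modified Band Depth for bands formed by pairs of observations. Using two-observation bands is an assumption made here for simplicity; the abstract allows bands of any size, and the band-based similarity indices themselves are not reproduced. The function name `modified_band_depth` is hypothetical.

```python
from itertools import combinations


def modified_band_depth(data):
    """Modified Band Depth with bands delimited by pairs of observations.

    data: list of equal-length numeric sequences (one per observation).
    Returns one depth per observation: the average, over all pairs, of the
    fraction of coordinates where the observation lies inside the pair's band.
    """
    dims = len(data[0])
    pairs = list(combinations(range(len(data)), 2))
    depths = []
    for x in data:
        total = 0.0
        for i, j in pairs:
            inside = sum(
                1
                for d in range(dims)
                if min(data[i][d], data[j][d]) <= x[d] <= max(data[i][d], data[j][d])
            )
            total += inside / dims  # partial inclusion in this band
        depths.append(total / len(pairs))
    return depths


# Hypothetical toy profiles: the middle one should be the deepest.
profiles = [[1, 1], [2, 2], [3, 3]]
print(modified_band_depth(profiles))
```

The most central observation receives the highest depth, which gives the centre-outward ordering the abstract refers to.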
|
4
|
Cao D, Chen Y, Chen J, Zhang H, Yuan Z. An improved algorithm for the maximal information coefficient and its application. ROYAL SOCIETY OPEN SCIENCE 2021; 8:201424. [PMID: 33972855 PMCID: PMC8074658 DOI: 10.1098/rsos.201424] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/10/2020] [Accepted: 01/18/2021] [Indexed: 06/12/2023]
Abstract
The maximal information coefficient (MIC) captures both linear and nonlinear correlations between variable pairs. In this paper, we proposed the BackMIC algorithm for MIC estimation. The BackMIC algorithm adds a searching back process on the equipartitioned axis to obtain a better grid partition than the original implementation algorithm ApproxMaxMI. Similar to the ChiMIC algorithm, it terminates the grid search process by the χ²-test instead of the maximum number of bins B(n, α). Results on simulated data show that the BackMIC algorithm maintains the generality of MIC, and gives more reasonable grid partition and MIC values for independent and dependent variable pairs under comparable running times. Moreover, it is robust under different α in B(n, α). MIC calculated by the BackMIC algorithm reveals an improvement in statistical power and equitability. We applied (1-MIC) as the distance measurement in the K-means algorithm to perform a clustering of the cancer/normal samples. The results on four cancer datasets demonstrated that the MIC values calculated by the BackMIC algorithm can obtain better clustering results, indicating the correlations between samples measured by the BackMIC algorithm were more credible than those measured by other algorithms.
Affiliation(s)
- Dan Cao
- Hunan Engineering and Technology Research Centre for Agricultural Big Data Analysis and Decision-making, Hunan Agricultural University, Changsha 410000, People's Republic of China
- Orient Science and Technology College of Hunan Agricultural University, Changsha 410000, Hunan, People's Republic of China
| | - Yuan Chen
- Hunan Engineering and Technology Research Centre for Agricultural Big Data Analysis and Decision-making, Hunan Agricultural University, Changsha 410000, People's Republic of China
| | - Jin Chen
- Hunan Engineering and Technology Research Centre for Agricultural Big Data Analysis and Decision-making, Hunan Agricultural University, Changsha 410000, People's Republic of China
| | - Hongyan Zhang
- Hunan Engineering and Technology Research Centre for Agricultural Big Data Analysis and Decision-making, Hunan Agricultural University, Changsha 410000, People's Republic of China
| | - Zheming Yuan
- Hunan Engineering and Technology Research Centre for Agricultural Big Data Analysis and Decision-making, Hunan Agricultural University, Changsha 410000, People's Republic of China
| |
|
5
|
Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. INT J MACH LEARN CYB 2019. [DOI: 10.1007/s13042-019-00989-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]
|
7
|
Alhusain L, Hafez AM. Cluster ensemble based on Random Forests for genetic data. BioData Min 2017; 10:37. [PMID: 29270227 PMCID: PMC5732374 DOI: 10.1186/s13040-017-0156-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/27/2017] [Accepted: 11/21/2017] [Indexed: 11/25/2022] Open
Abstract
Background Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the obtainment of genetic datasets with exceptional sizes. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Results Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, a comparison of RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. Conclusions This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach.
The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
Affiliation(s)
- Luluah Alhusain
- College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Alaaeldin M Hafez
- College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| |
|
8
|
Huo Z, Tseng G. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery. Ann Appl Stat 2017; 11:1011-1039. [PMID: 28959370 PMCID: PMC5613668 DOI: 10.1214/17-aoas1033] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
Abstract
Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency.
Affiliation(s)
- Zhiguang Huo
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA
| | - George Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA
| |
|
9
|
Iam-On N, Boongoen T. Generating descriptive model for student dropout: a review of clustering approach. HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES 2017. [DOI: 10.1186/s13673-016-0083-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Indexed: 11/10/2022]
Abstract
The implementation of data mining is widely considered as a powerful instrument for acquiring new knowledge from a pile of historical data, which is normally left unstudied. This data driven methodology has proven effective to improve the quality of decision-making in several domains such as business, medical and complex engineering problems. Recently, educational data mining (EDM) has obtained a great deal of attention among educational researchers and computer scientists. In general, publications in the field of EDM focus on understanding student types and targeted marketing, using both descriptive and predictive models to maximize student retention. Inspired by previous attempts, this paper aims to establish the clustering approach as a practical guideline to explore student categories and characteristics, with the working example on a real dataset to illustrate analytical procedures and results.
|
10
|
Abstract
Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data.
Affiliation(s)
- Tom Ronan
- Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
| | - Zhijie Qi
- Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
| | - Kristen M Naegle
- Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA.
| |
|
11
|
Huo Z, Ding Y, Liu S, Oesterreich S, Tseng G. Meta-analytic framework for sparse K-means to identify disease subtypes in multiple transcriptomic studies. J Am Stat Assoc 2016; 111:27-42. [PMID: 27330233 PMCID: PMC4908837 DOI: 10.1080/01621459.2015.1086354] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 08/01/2014] [Revised: 08/01/2015] [Indexed: 12/15/2022]
Abstract
Disease phenotyping by omics data has become a popular approach that potentially can lead to better personalized treatment. Identifying disease subtypes via unsupervised machine learning is the first step towards this goal. In this paper, we extend a sparse K-means method towards a meta-analytic framework to identify novel disease subtypes when expression profiles of multiple cohorts are available. The lasso regularization and meta-analysis identify a unique set of gene features for subtype characterization. An additional pattern matching reward function guarantees consistent subtype signatures across studies. The method was evaluated by simulations and leukemia and breast cancer data sets. The identified disease subtypes from meta-analysis were characterized with improved accuracy and stability compared to single study analysis. The breast cancer model was applied to an independent METABRIC dataset and generated improved survival difference between subtypes. These results provide a basis for diagnosis and development of targeted treatments for disease subgroups.
Affiliation(s)
- Zhiguang Huo
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
| | - Ying Ding
- Department of Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261
| | - Silvia Liu
- Department of Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261
| | | | - George Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
| |
|
12
|
Rodenas-Cuadrado P, Chen XS, Wiegrebe L, Firzlaff U, Vernes SC. A novel approach identifies the first transcriptome networks in bats: a new genetic model for vocal communication. BMC Genomics 2015; 16:836. [PMID: 26490347 PMCID: PMC4618519 DOI: 10.1186/s12864-015-2068-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 05/25/2015] [Accepted: 10/13/2015] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Bats are able to employ an astonishingly complex vocal repertoire for navigating their environment and conveying social information. A handful of species also show evidence for vocal learning, an extremely rare ability shared only with humans and few other animals. However, despite their potential for the study of vocal communication, bats remain severely understudied at a molecular level. To address this fundamental gap we performed the first transcriptome profiling and genetic interrogation of molecular networks in the brain of a highly vocal bat species, Phyllostomus discolor. RESULTS Gene network analysis typically needs large sample sizes for correct clustering, which can be prohibitive where samples are limited, as in this study. To overcome this, we developed a novel bioinformatics methodology for identifying robust co-expression gene networks using few samples (N=6). Using this approach, we identified tissue-specific functional gene networks from the bat PAG, a brain region fundamental for mammalian vocalisation. The most highly connected network identified represented a cluster of genes involved in glutamatergic synaptic transmission. Glutamatergic receptors play a significant role in vocalisation from the PAG, suggesting that this gene network may be mechanistically important for vocal-motor control in mammals. CONCLUSION We have developed an innovative approach to cluster co-expressing gene networks and show that it is highly effective in detecting robust functional gene networks with limited sample sizes. Moreover, this work represents the first gene network analysis performed in a bat brain and establishes bats as a novel, tractable model system for understanding the genetics of vocal mammalian communication.
Affiliation(s)
- Pedro Rodenas-Cuadrado
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, Nijmegen, 6525 XD, The Netherlands.
| | - Xiaowei Sylvia Chen
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, Nijmegen, 6525 XD, The Netherlands.
| | - Lutz Wiegrebe
- Ludwig-Maximilians-Universität, Division of Neurobiology, Department Biology II, Großhaderner Straße 2, Planegg-Martinsried, Munich, D-82152, Germany.
| | - Uwe Firzlaff
- Lehrstuhl für Zoologie, TU München, Liesel-Beckmann-Str. 4, Freising-Weihenstephan, Munich, 85350, Germany.
| | - Sonja C Vernes
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, Nijmegen, 6525 XD, The Netherlands; Donders Centre for Cognitive Neuroimaging, Kapittelweg 29, Nijmegen, 6525 EN, The Netherlands.
| |
|
13
|
Wang J, Zhong J, Chen G, Li M, Wu FX, Pan Y. ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:815-822. [PMID: 26357321 DOI: 10.1109/tcbb.2014.2361348] [Citation(s) in RCA: 74] [Impact Index Per Article: 8.2] [Indexed: 06/05/2023]
Abstract
Cluster analysis of biological networks is one of the most important approaches for identifying functional modules and predicting protein functions. Furthermore, visualization of clustering results is crucial to uncover the structure of biological networks. In this paper, ClusterViz, an APP of Cytoscape 3 for cluster analysis and visualization, has been developed. In order to reduce complexity and enable extendibility for ClusterViz, we designed the architecture of ClusterViz based on the framework of Open Services Gateway Initiative. According to the architecture, the implementation of ClusterViz is partitioned into three modules: the interface of ClusterViz, clustering algorithms, and visualization and export. ClusterViz facilitates the comparison of the results of different algorithms for further related analysis. Three commonly used clustering algorithms, FAG-EC, EAGLE and MCODE, are included in the current version. Because the clustering algorithms module adopts an abstract interface for algorithms, more clustering algorithms can be included in the future. To illustrate the usability of ClusterViz, we provided three examples with detailed steps from important scientific articles, which show that our tool has helped several research teams do their research work on the mechanism of biological networks.
|
14
|
Improved student dropout prediction in Thai University using ensemble of mixed-type data clusterings. INT J MACH LEARN CYB 2015. [DOI: 10.1007/s13042-015-0341-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 10/24/2022]
|
15
|
Șenbabaoğlu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014; 4:6207. [PMID: 25158761 PMCID: PMC4145288 DOI: 10.1038/srep06207] [Citation(s) in RCA: 187] [Impact Index Per Article: 18.7] [Received: 05/19/2014] [Accepted: 08/08/2014] [Indexed: 11/09/2022] Open
Abstract
Consensus clustering (CC) has been adopted for unsupervised class discovery in many genomic studies. It calculates how frequently two samples are grouped together in repeated clustering runs, and uses the resulting pairwise "consensus rates" for visual demonstration that clusters exist, for comparing cluster stability, and for estimating the optimal cluster number (K). However, the sensitivity and specificity of CC have not been systemically assessed. Through simulations we find that CC is able to divide randomly generated unimodal data into apparently stable clusters for a range of K, essentially reporting chance partitions of cluster-less data. For data with known structure, the common implementations of CC perform poorly in identifying the true K. These results suggest that CC should be applied and interpreted with caution. We found that a new metric based on CC, the proportion of ambiguously clustered pairs (PAC), infers K equally or more reliably than similar methods in simulated data with known K. Our overall approach involves the use of realistic null distributions based on the observed gene-gene correlation structure in a given study, and the implementation of PAC to more accurately estimate K. We discuss the strength of our approach in the context of other ensemble-based methods.
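The two quantities described in this abstract, pairwise consensus rates and the proportion of ambiguously clustered pairs (PAC), can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the input layout (one dict of sample labels per clustering run, with subsampled-out samples simply absent), and the 0.1 and 0.9 cut-offs for "ambiguous" are all assumptions.

```python
from itertools import combinations


def consensus_rates(runs):
    """Fraction of runs in which each pair of samples was clustered together.

    runs: list of dicts mapping sample id -> cluster label. Only runs that
    contain both samples of a pair count towards that pair's rate.
    """
    samples = sorted({s for run in runs for s in run})
    rates = {}
    for a, b in combinations(samples, 2):
        both = [run for run in runs if a in run and b in run]
        if both:
            together = sum(1 for run in both if run[a] == run[b])
            rates[(a, b)] = together / len(both)
    return rates


def pac(rates, lower=0.1, upper=0.9):
    """Proportion of pairs whose consensus rate is strictly between the
    (assumed) cut-offs, i.e. neither clearly together nor clearly apart."""
    ambiguous = sum(1 for r in rates.values() if lower < r < upper)
    return ambiguous / len(rates)


# Hypothetical clustering runs over four samples; "d" is missing from run 4.
runs = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 1, "b": 1, "c": 0, "d": 0},
    {"a": 0, "b": 1, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 1},
]
rates = consensus_rates(runs)
print(rates[("a", "b")], pac(rates))
```

Under this reading, a lower PAC at a given K means fewer pairs sit in the ambiguous middle range, so one would compute it for each candidate K and prefer the K with the smallest value.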
Affiliation(s)
- Yasin Șenbabaoğlu
- [1] Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, USA [2]
| | - George Michailidis
- Department of Statistics and EECS, University of Michigan, Ann Arbor, MI, USA
| | - Jun Z Li
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| |
|
16
|
Wang X, Laird PW, Hinoue T, Groshen S, Siegmund KD. Non-specific filtering of beta-distributed data. BMC Bioinformatics 2014; 15:199. [PMID: 24943962 PMCID: PMC4230495 DOI: 10.1186/1471-2105-15-199] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 11/03/2013] [Accepted: 06/12/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias. RESULTS We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets. 
CONCLUSIONS We found two different filter statistics that tended to prioritize features with different characteristics, each performed well for identifying clusters of cancer and non-cancer tissue, and identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.
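The bias described in this abstract, where filtering on the standard deviation of a proportion favours probes with means near 0.5, can be seen in a small sketch. The arcsine square-root transform used here is a textbook variance-stabilizing transformation for proportions and is an assumption for illustration; it is not claimed to be the filter statistic developed in the paper. The probe values and function names are hypothetical.

```python
from math import asin, sqrt
from statistics import stdev


def sd_raw(betas):
    """Standard deviation of methylation proportions (the common filter)."""
    return stdev(betas)


def sd_arcsine(betas):
    """Standard deviation after the arcsine square-root transform, a standard
    variance-stabilizing transformation for proportions in [0, 1]."""
    return stdev([asin(sqrt(b)) for b in betas])


# Hypothetical probes: one hovering around 0.5, one normally unmethylated
# probe with a single sample showing elevated methylation.
probe_mid = [0.4, 0.5, 0.6]
probe_low = [0.0, 0.02, 0.10]

print(sd_raw(probe_mid), sd_raw(probe_low))
print(sd_arcsine(probe_mid), sd_arcsine(probe_low))
```

On the raw scale the mid-range probe ranks higher, while on the transformed scale the mostly unmethylated probe does, which illustrates why the two kinds of filter can select largely different features.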
Affiliation(s)
- Xinhui Wang
- Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA
| | - Peter W Laird
- Epigenome Center, USC Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Toshinori Hinoue
- Epigenome Center, USC Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Susan Groshen
- Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA
| | - Kimberly D Siegmund
- Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA
| |
|
17
|
Wang Y, Pan Y. Semi-supervised consensus clustering for gene expression data analysis. BioData Min 2014; 7:7. [PMID: 24920961 PMCID: PMC4036113 DOI: 10.1186/1756-0381-7-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/18/2013] [Accepted: 04/05/2014] [Indexed: 01/08/2023] Open
Abstract
Background Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis; but they are unable to deal with noise and high dimensionality associated with the microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge in clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and domain knowledge. Methods We proposed semi-supervised consensus clustering (SSCC) to integrate the consensus clustering with semi-supervised clustering for analyzing gene expression data. We investigated the roles of consensus clustering and prior knowledge in improving the quality of clustering. SSCC was compared with one semi-supervised clustering algorithm, one consensus clustering algorithm, and k-means. Experiments on eight gene expression datasets were performed using h-fold cross-validation. Results Using prior knowledge improved the clustering quality by reducing the impact of noise and high dimensionality in microarray data. Integration of consensus clustering with semi-supervised clustering improved performance as compared to using consensus clustering or semi-supervised clustering separately. Our SSCC method outperformed the others tested in this paper.
Affiliation(s)
- Yunli Wang
- National Research Council Canada, 46 Dineen Dr., Fredericton, Canada
| | - Youlian Pan
- National Research Council Canada, 1200 Montreal Rd., Ottawa, Canada
| |
|
19
|
Liseron-Monfils C, Lewis T, Ashlock D, McNicholas PD, Fauteux F, Strömvik M, Raizada MN. Promzea: a pipeline for discovery of co-regulatory motifs in maize and other plant species and its application to the anthocyanin and phlobaphene biosynthetic pathways and the Maize Development Atlas. BMC PLANT BIOLOGY 2013; 13:42. [PMID: 23497159 PMCID: PMC3658923 DOI: 10.1186/1471-2229-13-42] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 04/01/2012] [Accepted: 03/08/2013] [Indexed: 05/05/2023]
Abstract
BACKGROUND The discovery of genetic networks and cis-acting DNA motifs underlying their regulation is a major objective of transcriptome studies. The recent release of the maize genome (Zea mays L.) has facilitated in silico searches for regulatory motifs. Several algorithms exist to predict cis-acting elements, but none have been adapted for maize. RESULTS A benchmark data set was used to evaluate the accuracy of three motif discovery programs: BioProspector, Weeder and MEME. Analysis showed that each motif discovery tool had limited accuracy and appeared to retrieve a distinct set of motifs. Therefore, using the benchmark, statistical filters were optimized to reduce the false discovery ratio, and then remaining motifs from all programs were combined to improve motif prediction. These principles were integrated into a user-friendly pipeline for motif discovery in maize called Promzea, available at http://www.promzea.org and on the Discovery Environment of the iPlant Collaborative website. Promzea was subsequently expanded to include rice and Arabidopsis. Within Promzea, a user enters cDNA sequences or gene IDs; corresponding upstream sequences are retrieved from the maize genome. Predicted motifs are filtered, combined and ranked. Promzea searches the chosen plant genome for genes containing each candidate motif, providing the user with the gene list and corresponding gene annotations. Promzea was validated in silico using a benchmark data set: the Promzea pipeline showed a 22% increase in nucleotide sensitivity compared to the best standalone program tool, Weeder, with equivalent nucleotide specificity. Promzea was also validated by its ability to retrieve the experimentally defined binding sites of transcription factors that regulate the maize anthocyanin and phlobaphene biosynthetic pathways. 
Promzea predicted additional promoter motifs, and genome-wide motif searches by Promzea identified 127 non-anthocyanin/phlobaphene genes that each contained all five predicted promoter motifs in their promoters, perhaps uncovering a broader co-regulated gene network. Promzea was also tested against tissue-specific microarray data from maize. CONCLUSIONS An online tool customized for promoter motif discovery in plants has been generated called Promzea. Promzea was validated in silico by its ability to retrieve benchmark motifs and experimentally defined motifs and was tested using tissue-specific microarray data. Promzea predicted broader networks of gene regulation associated with the historic anthocyanin and phlobaphene biosynthetic pathways. Promzea is a new bioinformatics tool for understanding transcriptional gene regulation in maize and has been expanded to include rice and Arabidopsis.
Affiliation(s)
- Tim Lewis
- Department of Plant Agriculture, University of Guelph, Guelph, ON N1G 2W1, Canada
- Daniel Ashlock
- Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada
- Paul D McNicholas
- Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada
- François Fauteux
- Department of Plant Sciences, McGill University, Ste. Anne de Bellevue, QC H9X 3V9, Canada
- Martina Strömvik
- Department of Plant Sciences, McGill University, Ste. Anne de Bellevue, QC H9X 3V9, Canada
- Manish N Raizada
- Department of Plant Agriculture, University of Guelph, Guelph, ON N1G 2W1, Canada
|
20
|
Kim EY, Hwang DU, Ko TW. Multiscale ensemble clustering for finding modules in complex networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 85:026119. [PMID: 22463291 DOI: 10.1103/physreve.85.026119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2011] [Indexed: 05/31/2023]
Abstract
The identification of modules in complex networks is important for the understanding of systems. Here, we propose an ensemble clustering method incorporating node groupings in various sizes and the sequential removal of weak ties between nodes which are rarely grouped together. This method successfully detects modules in various networks, such as hierarchical random networks and the American college football network, with known modular structures. Some of the results are compared with those obtained by modularity optimization and K-means clustering.
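The idea of discarding ties between nodes that are rarely grouped together can be illustrated with a generic co-association sketch. This is not the authors' multiscale method: the function names `co_association` and `modules`, the fixed `threshold`, and the use of pre-computed label lists as input are all assumptions made for illustration.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of nodes shares a label.

    `partitions` is a list of label lists, one per clustering run,
    all of the same length (hypothetical input format).
    """
    n = len(partitions[0])
    counts = {}
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


def modules(n, coassoc, threshold):
    """Drop ties below `threshold`, then return the connected groups."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), freq in coassoc.items():
        if freq >= threshold:  # keep only ties that are often co-grouped
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Raising `threshold` step by step mimics a sequential removal of weak ties: groups that stay intact at high thresholds are the ones the ensemble agrees on most consistently.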
Affiliation(s)
- Eun-Youn Kim
- Computational Neuroscience Team, National Institute for Mathematical Sciences, Daejeon 305-811, Republic of Korea
|
21
|
Mimaroglu S, Aksehirli E. DICLENS: divisive clustering ensemble with automatic cluster number. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:408-420. [PMID: 21968960 DOI: 10.1109/tcbb.2011.129] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods on the same data set, or the same method with varying input parameters or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. DICLENS implementation runs on standard personal computers by being scalable, and by consuming very little memory and CPU.
|
22
|
Cancer classification based on microarray gene expression data using a principal component accumulation method. Sci China Chem 2011. [DOI: 10.1007/s11426-011-4263-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
|
23
|
New possibilistic method for discovering linear local behavior using hyper-Gaussian distributed membership function. Knowl Inf Syst 2011. [DOI: 10.1007/s10115-011-0385-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/26/2022]
|
24
|
Bayá AE, Granitto PM. Clustering gene expression data with a penalized graph-based metric. BMC Bioinformatics 2011; 12:2. [PMID: 21205299 PMCID: PMC3023695 DOI: 10.1186/1471-2105-12-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 07/10/2010] [Accepted: 01/04/2011] [Indexed: 12/05/2022] Open
Abstract
Background The search for cluster structure in microarray datasets is a base problem for the so-called "-omic sciences". A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped in the form of compact clouds of points, forming arbitrary shapes or paths embedded in a high-dimensional space, as could be the case of some gene expression datasets. Results In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k-value and then it adds edges with a highly penalized weight for connecting the subgraphs produced by the first step. We discuss several possible schemes for connecting the different sub-graphs as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric. Conclusions In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance based clustering methods, to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method, they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data.
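The two-step procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to connect every pair of subgraphs through their closest pair of points, and the caller-supplied `penalty` function are assumptions, since the abstract mentions several connection schemes and penalization functions without specifying them.

```python
import heapq
import math
from itertools import combinations


def knn_graph(points, k):
    """Step 1: symmetric k-nearest-neighbour graph as {node: {neighbour: weight}}."""
    adj = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            adj[i][j] = d
            adj[j][i] = d
    return adj


def components(adj):
    """Connected subgraphs of the graph, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def connect_components(points, adj, penalty):
    """Step 2: join subgraphs with edges whose weight is penalty(distance).

    Assumed scheme: one edge per pair of subgraphs, between their closest points.
    """
    for a, b in combinations(components(adj), 2):
        d, i, j = min((math.dist(points[i], points[j]), i, j) for i in a for j in b)
        adj[i][j] = adj[j][i] = penalty(d)
    return adj


def graph_distance(adj, src, dst):
    """Shortest-path length on the penalized graph (Dijkstra)."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf
```

With a penalty such as `lambda d: d * math.exp(d / mu)` (an illustrative choice, with `mu` a hypothetical scale parameter), distances within a subgraph stay close to their geometric values while distances across subgraphs are inflated. The resulting pairwise distances can then be passed to any distance-based method such as hierarchical clustering.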
Affiliation(s)
- Ariel E Bayá
- CIFASIS French Argentine International Center for Information and Systems Sciences, UPCAM (France)/UNR-CONICET (Argentina), Bv 27 de Febrero 210 Bis, 2000 Rosario, República Argentina.
|
25
|
Sung MK, Bae YJ. Linking obesity to colorectal cancer: application of nutrigenomics. Biotechnol J 2010; 5:930-41. [PMID: 20715079 DOI: 10.1002/biot.201000165] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Diet is one of the most influential environmental factors in cancer development. Due to the complicated nature of the diet, it has been very difficult to provide clear explanations for the role of dietary components in carcinogenesis. However, as high-throughput omics techniques have become available, researchers are now able to analyze large sets of gene transcripts, proteins, and metabolites to identify molecules involved in disease development. Bioinformatics uses these data to perform network analyses and suggest possible interactions between metabolic processes and environmental factors. Obesity is known as one of the risk factors most closely related to colorectal cancer (CRC). Metabolic disturbances due to a positive energy balance may trigger and accelerate CRC development. In this review, we have summarized reports on genes, proteins and metabolites that are related to either obesity or CRC, and suggested candidate molecules linking obesity and CRC based on currently available literature. Possible applications of bioinformatics for large-scale network analysis in studying cause-effect relationships between dietary components and CRC are suggested.
Affiliation(s)
- Mi-Kyung Sung
- Department of Food and Nutrition, Sookmyung Women's University, Seoul, South Korea.
|
26
|
Iam-on N, Boongoen T, Garrett S. LCE: a link-based cluster ensemble method for improved gene expression data analysis. Bioinformatics 2010; 26:1513-9. [PMID: 20444838 DOI: 10.1093/bioinformatics/btq226] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION It is far from trivial to select the most effective clustering method and its parameterization, for a particular set of gene expression data, because there are a very large number of possibilities. Although many researchers still prefer to use hierarchical clustering in one form or another, this is often sub-optimal. Cluster ensemble research solves this problem by automatically combining multiple data partitions from different clusterings to improve both the robustness and quality of the clustering result. However, many existing ensemble techniques use an association matrix to summarize sample-cluster co-occurrence statistics, and relations within an ensemble are encapsulated only at coarse level, while those existing among clusters are completely neglected. Discovering these missing associations may greatly extend the capability of the ensemble methodology for microarray data clustering. RESULTS The link-based cluster ensemble (LCE) method, presented here, implements these ideas and demonstrates outstanding performance. Experiment results on real gene expression and synthetic datasets indicate that LCE: (i) usually outperforms the existing cluster ensemble algorithms in individual tests and, overall, is clearly class-leading; (ii) generates excellent, robust performance across different types of data, especially with the presence of noise and imbalanced data clusters; (iii) provides a high-level data matrix that is applicable to many numerical clustering techniques; and (iv) is computationally efficient for large datasets and gene clustering. AVAILABILITY Online supplementary and implementation are available at: http://users.aber.ac.uk/nii07/bioinformatics2010. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Natthakan Iam-on
- Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, UK.
|
27
|
Newman AM, Cooper JB. AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics 2010; 11:117. [PMID: 20202218 PMCID: PMC2846907 DOI: 10.1186/1471-2105-11-117] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.9] [Received: 11/27/2009] [Accepted: 03/04/2010] [Indexed: 12/25/2022] Open
Abstract
Background Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry. Results We integrated strategies from machine learning, cartography, and graph theory into a new informatics method for automatically clustering self-organizing map ensembles of high-dimensional data. Our new method, called AutoSOME, readily identifies discrete and fuzzy data clusters without prior knowledge of cluster number or structure in diverse datasets including whole genome microarray data. Visualization of AutoSOME output using network diagrams and differential heat maps reveals unexpected variation among well-characterized cancer cell lines. Co-expression analysis of data from human embryonic and induced pluripotent stem cells using AutoSOME identifies >3400 up-regulated genes associated with pluripotency, and indicates that a recently identified protein-protein interaction network characterizing pluripotency was underestimated by a factor of four. Conclusions By effectively extracting important information from high-dimensional microarray data without prior knowledge or the need for data filtration, AutoSOME can yield systems-level insights from whole genome microarray expression studies. Due to its generality, this new method should also have practical utility for a variety of data-intensive applications, including the results of deep sequencing experiments. AutoSOME is available for download at http://jimcooperlab.mcdb.ucsb.edu/autosome.
Affiliation(s)
- Aaron M Newman
- Biomolecular Science and Engineering Program, University of California, Santa Barbara, CA 93106, USA
|