1
Alhusain L, Hafez AM. Nonparametric approaches for population structure analysis. Hum Genomics 2018; 12:25. [PMID: 29743099 PMCID: PMC5944014 DOI: 10.1186/s40246-018-0156-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/06/2018] [Accepted: 04/24/2018] [Indexed: 12/28/2022] Open
Abstract
The analysis of population structure has many applications in medical and population genetic research. Such analysis is used to provide clear insight into the underlying genetic population substructure and is a crucial prerequisite for any analysis of genetic data. The analysis involves grouping individuals into subpopulations based on shared genetic variations. The most widely used markers to study the variation of DNA sequences between populations are single nucleotide polymorphisms. Data preprocessing is a necessary step to assess the quality of the data and to determine which markers or individuals can reasonably be included in the analysis. After preprocessing, several methods can be utilized to uncover population substructure, which can be categorized into two broad approaches: parametric and nonparametric. Parametric approaches use statistical models to infer population structure and assign individuals into subpopulations. However, these approaches suffer from many drawbacks that make them impractical for large datasets. In contrast, nonparametric approaches do not suffer from these drawbacks, making them more viable than parametric approaches for analyzing large datasets. Consequently, nonparametric approaches are increasingly used to reveal population substructure. Thus, this paper reviews and discusses the nonparametric approaches that are available for population structure analysis along with some implications to resolve challenges.
Affiliation(s)
- Luluah Alhusain
- College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
- Alaaeldin M Hafez
- College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
2
Parejo M, Wragg D, Gauthier L, Vignal A, Neumann P, Neuditschko M. Using Whole-Genome Sequence Information to Foster Conservation Efforts for the European Dark Honey Bee, Apis mellifera mellifera. Front Ecol Evol 2016. [DOI: 10.3389/fevo.2016.00140] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 11/13/2022] Open
3
Tetko IV, Maran U, Tropsha A. Public (Q)SAR Services, Integrated Modeling Environments, and Model Repositories on the Web: State of the Art and Perspectives for Future Development. Mol Inform 2016; 36. [PMID: 27778468 DOI: 10.1002/minf.201600082] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/08/2016] [Accepted: 10/03/2016] [Indexed: 01/08/2023]
Abstract
Thousands of (Quantitative) Structure-Activity Relationships (Q)SAR models have been described in peer-reviewed publications; however, this way of sharing seldom makes models available for the use by the research community outside of the developer's laboratory. Conversely, on-line models allow broad dissemination and application representing the most effective way of sharing the scientific knowledge. Approaches for sharing and providing on-line access to models range from web services created by individual users and laboratories to integrated modeling environments and model repositories. This emerging transition from the descriptive and informative, but "static", and for the most part, non-executable print format to interactive, transparent and functional delivery of "living" models is expected to have a transformative effect on modern experimental research in areas of scientific and regulatory use of (Q)SAR models.
Affiliation(s)
- Igor V Tetko
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany; BigChem GmbH, Ingolstädter Landstraße 1, b. 60w, D-85764 Neuherberg, Germany
- Uko Maran
- Institute of Chemistry, University of Tartu, Ravila 14A, Tartu, 50411, Estonia
- Alexander Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA; Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya St. 18, 420008, Kazan, Russia
4
Skuta C, Bartůněk P, Svozil D. InCHlib - interactive cluster heatmap for web applications. J Cheminform 2014; 6:44. [PMID: 25264459 PMCID: PMC4173117 DOI: 10.1186/s13321-014-0044-4] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 07/15/2014] [Accepted: 09/08/2014] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Hierarchical clustering is an exploratory data analysis method that reveals the groups (clusters) of similar objects. The result of the hierarchical clustering is a tree structure called dendrogram that shows the arrangement of individual clusters. To investigate the row/column hierarchical cluster structure of a data matrix, a visualization tool called 'cluster heatmap' is commonly employed. In the cluster heatmap, the data matrix is displayed as a heatmap, a 2-dimensional array in which the colour of each element corresponds to its value. The rows/columns of the matrix are ordered such that similar rows/columns are near each other. The ordering is given by the dendrogram which is displayed on the side of the heatmap. RESULTS We developed InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration. InCHlib enables the user to select individual or clustered heatmap rows, to zoom in and out of clusters or to flexibly modify heatmap appearance. The cluster heatmap can be augmented with additional metadata displayed in a different colour scale. In addition, to further enhance the visualization, the cluster heatmap can be interconnected with external data sources or analysis tools. Data clustering and the preparation of the input file for InCHlib is facilitated by the Python utility script inchlib_clust. CONCLUSIONS The cluster heatmap is one of the most popular visualizations of large chemical and biomedical data sets originating, e.g., in high-throughput screening, genomics or transcriptomics experiments. The presented JavaScript library InCHlib is a client-side solution for cluster heatmap exploration. InCHlib can be easily deployed into any modern web application and configured to cooperate with external tools and data sources. 
Though InCHlib is primarily intended for the analysis of chemical or biological data, it is a versatile tool whose application domain is not limited to the life sciences.
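The row ordering that a cluster heatmap takes from its dendrogram can be illustrated with a minimal sketch. This is not InCHlib or inchlib_clust code; it is a small pure-Python average-linkage clustering on hypothetical toy data, and the function name `average_linkage_order` is an assumption made for illustration only.

```python
from itertools import combinations
from math import dist


def average_linkage_order(rows):
    """Return row indices in dendrogram leaf order (average linkage, Euclidean)."""
    # Each cluster is a list of row indices kept in leaf order.
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0]


# Hypothetical toy data matrix: rows 0 and 2 are similar, as are rows 1 and 3.
data = [
    [1.0, 1.1, 0.9],
    [8.0, 8.2, 7.9],
    [1.2, 0.9, 1.0],
    [7.8, 8.1, 8.3],
]
order = average_linkage_order(data)
reordered = [data[i] for i in order]  # rows as they would appear in the heatmap
```

After reordering, similar rows sit next to each other, which is the property the heatmap relies on. The naive merge loop is cubic or worse in the number of rows and is meant only to show the idea, not to scale.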
Affiliation(s)
- Ctibor Skuta
- Laboratory of Informatics and Chemistry, Faculty of Chemical Technology, Institute of Chemical Technology Prague, Technická 5, CZ-166 28 Prague, Czech Republic ; CZ-OPENSCREEN, Institute of Molecular Genetics of the ASCR, v. v. i, Vídeňská 1083, CZ-142 20 Prague, Czech Republic
- Petr Bartůněk
- CZ-OPENSCREEN, Institute of Molecular Genetics of the ASCR, v. v. i, Vídeňská 1083, CZ-142 20 Prague, Czech Republic
- Daniel Svozil
- Laboratory of Informatics and Chemistry, Faculty of Chemical Technology, Institute of Chemical Technology Prague, Technická 5, CZ-166 28 Prague, Czech Republic ; CZ-OPENSCREEN, Institute of Molecular Genetics of the ASCR, v. v. i, Vídeňská 1083, CZ-142 20 Prague, Czech Republic
5
Li L, Liu C, Wang F, Miao W, Zhang J, Kang Z, Chen Y, Peng L. Unraveling the hidden heterogeneities of breast cancer based on functional miRNA cluster. PLoS One 2014; 9:e87601. [PMID: 24498150 PMCID: PMC3907466 DOI: 10.1371/journal.pone.0087601] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/22/2013] [Accepted: 12/23/2013] [Indexed: 12/21/2022] Open
Abstract
It has become increasingly clear that the current taxonomy of clinical phenotypes is mixed with molecular heterogeneity, which potentially affects the treatment effect for involved patients. Defining the hidden molecularly distinct diseases using modern large-scale genomic approaches is therefore useful for refining clinical practice and improving intervention strategies. Given that microRNA expression profiling has provided a powerful way to dissect hidden genetic heterogeneity for complex diseases, the aim of the study was to develop a bioinformatics approach that identifies microRNA features leading to the hidden subtyping of complex clinical phenotypes. The basic strategy of the proposed method was to identify optimal miRNA clusters by iteratively partitioning the sample and feature space using the two-way super-paramagnetic clustering technique. We evaluated the obtained optimal miRNA cluster by determining the consistency of co-expression and the chromosome location among the within-cluster microRNAs, and concluded that the optimal miRNA cluster could lead to a natural partition of disease samples. We applied the proposed method to a publicly available microarray dataset of breast cancer patients that have notoriously heterogeneous phenotypes. We obtained a feature subset of 13 microRNAs that could classify the 71 breast cancer patients into five subtypes with significantly different five-year overall survival rates (45%, 82.4%, 70.6%, 100% and 60% respectively; p = 0.008). By building a multivariate Cox proportional-hazards prediction model for the feature subset, we identified hsa-miR-146b as one of the most significant predictors (p = 0.045; hazard ratio = 0.39). The proposed algorithm is a promising computational strategy for dissecting hidden genetic heterogeneity for complex diseases, and will be of value for improving cancer diagnosis and treatment.
Affiliation(s)
- Li Li
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Chang Liu
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Fang Wang
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Wei Miao
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Jie Zhang
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Zhiqian Kang
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- Yihan Chen
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- * E-mail: (YC); (LP)
- Luying Peng
- Division of Medical Genetics, Tongji University School of Medicine, Shanghai, China
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University, Shanghai, China
- * E-mail: (YC); (LP)
6
Neuditschko M, Khatkar MS, Raadsma HW. NetView: a high-definition network-visualization approach to detect fine-scale population structures from genome-wide patterns of variation. PLoS One 2012; 7:e48375. [PMID: 23152744 PMCID: PMC3485224 DOI: 10.1371/journal.pone.0048375] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 03/24/2012] [Accepted: 09/25/2012] [Indexed: 02/06/2023] Open
Abstract
High-throughput sequencing and single nucleotide polymorphism (SNP) genotyping can be used to infer complex population structures. Fine-scale population structure analysis tracing individual ancestry remains one of the major challenges. Based on network theory and recent advances in SNP chip technology, we investigated an unsupervised network clustering method called Super Paramagnetic Clustering (Spc). When applied to whole-genome marker data it identifies the natural divisions of groups of individuals into population clusters without use of prior ancestry information. Furthermore, we optimised an analysis pipeline called NetView, a high-definition network visualization, starting with computation of genetic distance, followed by clustering using Spc and finally visualization of clusters with Cytoscape. We compared NetView against commonly used methodologies including Principal Component Analyses (PCA) and a model-based algorithm, Admixture, on whole-genome-wide SNP data derived from three previously described data sets: simulated (2.5 million SNPs, 5 populations), human (1.4 million SNPs, 11 populations) and cattle (32,653 SNPs, 19 populations). We demonstrate that individuals can be effectively allocated to their correct population whilst simultaneously revealing fine-scale structure within the populations. Analyzing the human HapMap populations, we identified unexpected genetic relatedness among individuals, and population stratification within the Indian, African and Mexican samples. In the cattle data set, we correctly assigned all individuals to their respective breeds and detected fine-scale population sub-structures reflecting different sample origins and phenotypes. The NetView pipeline is computationally extremely efficient and can be easily applied on large-scale genome-wide data sets to assign individuals to particular populations and to reproduce fine-scale population structures without prior knowledge of individual ancestry. 
NetView can be used on any data from which a genetic relationship/distance between individuals can be calculated.
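The first pipeline step, computing a pairwise genetic distance, can be sketched as follows. The abstract does not say which distance NetView uses, so the allele-sharing distance below is an assumption chosen for illustration, as are the function names and the toy genotypes (coded as 0, 1 or 2 copies of one allele, with `None` for a missing call).

```python
def allele_sharing_distance(g1, g2):
    """Mean per-SNP difference in allele counts, scaled to the range [0, 1]."""
    # Use only SNPs genotyped in both individuals.
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no SNPs genotyped in both individuals")
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise distances between individuals."""
    n = len(genotypes)
    return [
        [allele_sharing_distance(genotypes[i], genotypes[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical genotypes for three individuals at four SNPs.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, None],
    [2, 1, 0, 2],
]
matrix = distance_matrix(genotypes)
```

A matrix of this kind is the only input the later clustering and visualization steps need, which is consistent with the statement that the pipeline applies to any data from which a relationship or distance between individuals can be calculated.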
Affiliation(s)
- Markus Neuditschko
- Reprogen-Animal Bioscience, Faculty of Veterinary Science, University of Sydney, Camden, New South Wales, Australia.
7
Linard B, Nguyen NH, Prosdocimi F, Poch O, Thompson JD. EvoluCode: Evolutionary Barcodes as a Unifying Framework for Multilevel Evolutionary Data. Evol Bioinform Online 2011; 8:61-77. [PMID: 22267905 PMCID: PMC3256995 DOI: 10.4137/ebo.s8814] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022] Open
Abstract
Evolutionary systems biology aims to uncover the general trends and principles governing the evolution of biological networks. An essential part of this process is the reconstruction and analysis of the evolutionary histories of these complex, dynamic networks. Unfortunately, the methodologies for representing and exploiting such complex evolutionary histories in large scale studies are currently limited. Here, we propose a new formalism, called EvoluCode (Evolutionary barCode), which allows the integration of different evolutionary parameters (eg, sequence conservation, orthology, synteny …) in a unifying format and facilitates the multilevel analysis and visualization of complex evolutionary histories at the genome scale. The advantages of the approach are demonstrated by constructing barcodes representing the evolution of the complete human proteome. Two large-scale studies are then described: (i) the mapping and visualization of the barcodes on the human chromosomes and (ii) automatic clustering of the barcodes to highlight protein subsets sharing similar evolutionary histories and their functional analysis. The methodologies developed here open the way to the efficient application of other data mining and knowledge extraction techniques in evolutionary systems biology studies. A database containing all EvoluCode data is available at: http://lbgi.igbmc.fr/barcodes.
Affiliation(s)
- Benjamin Linard
- Laboratoire De Bioinformatique Et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/UDS, Illkirch, France
- Ngoc Hoan Nguyen
- Laboratoire De Bioinformatique Et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/UDS, Illkirch, France
- Olivier Poch
- Laboratoire De Bioinformatique Et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/UDS, Illkirch, France
- Julie D. Thompson
- Laboratoire De Bioinformatique Et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/UDS, Illkirch, France
8
Carotid plaque age is a feature of plaque stability inversely related to levels of plasma insulin. PLoS One 2011; 6:e18248. [PMID: 21490968 PMCID: PMC3072386 DOI: 10.1371/journal.pone.0018248] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 10/19/2010] [Accepted: 03/02/2011] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND The stability of atherosclerotic plaques determines the risk for rupture, which may lead to thrombus formation and potentially severe clinical complications such as myocardial infarction and stroke. Although the rate of plaque formation may be important for plaque stability, this process is not well understood. We took advantage of the atmospheric ¹⁴C-declination curve (a result of the atomic bomb tests in the 1950s and 1960s) to determine the average biological age of carotid plaques. METHODOLOGY/PRINCIPAL FINDINGS The cores of carotid plaques were dissected from 29 well-characterized, symptomatic patients with carotid stenosis and analyzed for ¹⁴C content by accelerator mass spectrometry. The average plaque age (i.e. formation time) was 9.6±3.3 years. All but two plaques had formed within 5–15 years before surgery. Plaque age was not associated with the chronological ages of the patients but was inversely related to plasma insulin levels (p = 0.0014). Most plaques were echo-lucent rather than echo-rich (2.24±0.97, range 1–5). However, plaques in the lowest tercile of plaque age (most recently formed) were characterized by further instability with a higher content of lipids and macrophages (67.8±12.4 vs. 50.4±6.2, p = 0.00005; 57.6±26.1 vs. 39.8±25.7, p<0.0005, respectively), less collagen (45.3±6.1 vs. 51.1±9.8, p<0.05), and fewer smooth muscle cells (130±31 vs. 141±21, p<0.05) than plaques in the highest tercile. Microarray analysis of plaques in the lowest tercile also showed increased activity of genes involved in immune responses and oxidative phosphorylation. CONCLUSIONS/SIGNIFICANCE Our results show, for the first time, that plaque age, as judged by relative incorporation of ¹⁴C, can improve our understanding of carotid plaque stability and therefore risk for clinical complications. Our results also suggest that levels of plasma insulin might be involved in determining carotid plaque age.
9
Abstract
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.
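The comparative strategy described above, tuning a clustering parameter on a curated reference species and reusing it on a related genome, can be sketched in a few lines. This is not TRIBE-MCL: as a loudly labelled stand-in, the sketch uses single-linkage clustering with a similarity threshold in place of Markov clustering and its parameters, and all gene names, similarity scores and function names are hypothetical.

```python
from itertools import combinations


def cluster(genes, sims, threshold):
    """Single-linkage stand-in: link genes whose similarity is >= threshold."""
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), s in sims.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    return {g: find(g) for g in genes}


def agreement(labels, curated):
    """Fraction of gene pairs on which clustering and curation agree (Rand index)."""
    pairs = list(combinations(sorted(labels), 2))
    same = sum((labels[a] == labels[b]) == (curated[a] == curated[b]) for a, b in pairs)
    return same / len(pairs)


def best_threshold(genes, sims, curated, candidates):
    """Pick the candidate that best reproduces the curated families."""
    return max(candidates, key=lambda t: agreement(cluster(genes, sims, t), curated))


# Hypothetical reference species with curated families A and B.
ref_genes = ["a1", "a2", "a3", "b1", "b2"]
ref_sims = {("a1", "a2"): 0.9, ("a2", "a3"): 0.8, ("a3", "b1"): 0.4, ("b1", "b2"): 0.7}
ref_curated = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
chosen = best_threshold(ref_genes, ref_sims, ref_curated, [0.3, 0.6, 0.95])

# Reuse the chosen parameter on a hypothetical related genome.
new_genes = ["x1", "x2", "y1"]
new_sims = {("x1", "x2"): 0.85, ("x2", "y1"): 0.35}
new_labels = cluster(new_genes, new_sims, chosen)
```

In this toy case the lowest threshold merges both families and the highest splits every gene apart, so the middle candidate is selected. That mirrors the abstract's point that performance is highly sensitive to parameter setting.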
Affiliation(s)
- Christian Frech
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
- Nansheng Chen
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
10
Multi-organ expression profiling uncovers a gene module in coronary artery disease involving transendothelial migration of leukocytes and LIM domain binding 2: the Stockholm Atherosclerosis Gene Expression (STAGE) study. PLoS Genet 2009; 5:e1000754. [PMID: 19997623 PMCID: PMC2780352 DOI: 10.1371/journal.pgen.1000754] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.5] [Received: 05/26/2009] [Accepted: 11/04/2009] [Indexed: 02/07/2023] Open
Abstract
Environmental exposures filtered through the genetic make-up of each individual alter the transcriptional repertoire in organs central to metabolic homeostasis, thereby affecting arterial lipid accumulation, inflammation, and the development of coronary artery disease (CAD). The primary aim of the Stockholm Atherosclerosis Gene Expression (STAGE) study was to determine whether there are functionally associated genes (rather than individual genes) important for CAD development. To this end, two-way clustering was used on 278 transcriptional profiles of liver, skeletal muscle, and visceral fat (n = 66/tissue) and atherosclerotic and unaffected arterial wall (n = 40/tissue) isolated from CAD patients during coronary artery bypass surgery. The first step, across all mRNA signals (n = 15,042/12,621 RefSeqs/genes) in each tissue, resulted in a total of 60 tissue clusters (n = 3958 genes). In the second step (performed within tissue clusters), one atherosclerotic lesion (n = 49/48) and one visceral fat (n = 59) cluster segregated the patients into two groups that differed in the extent of coronary stenosis (P = 0.008 and P = 0.00015). The associations of these clusters with coronary atherosclerosis were validated by analyzing carotid atherosclerosis expression profiles. Remarkably, in one cluster (n = 55/54) relating to carotid stenosis (P = 0.04), 27 genes in the two clusters relating to coronary stenosis were confirmed (n = 16/17, P < 10^-27 and 10^-30). Genes in the transendothelial migration of leukocytes (TEML) pathway were overrepresented in all three clusters, referred to as the atherosclerosis module (A-module). In a second validation step, using three independent cohorts, the A-module was found to be genetically enriched with CAD risk by 1.8-fold (P<0.004). 
The transcription co-factor LIM domain binding 2 (LDB2) was identified as a potential high-hierarchy regulator of the A-module, a notion supported by subnetwork analysis, by cellular and lesion expression of LDB2, and by the expression of 13 TEML genes in Ldb2-deficient arterial wall. Thus, the A-module appears to be important for atherosclerosis development and, together with LDB2, merits further attention in CAD research.
11
Andreopoulos B, An A, Wang X, Schroeder M. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform 2009; 10:297-314. [PMID: 19240124 DOI: 10.1093/bib/bbn058] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.1] [Indexed: 11/14/2022] Open
Abstract
Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
12
Surmeli D, Ratmann O, Mewes HW, Tetko IV. FunCat functional inference with belief propagation and feature integration. Comput Biol Chem 2008; 32:375-7. [DOI: 10.1016/j.compbiolchem.2008.06.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/20/2007] [Revised: 06/03/2008] [Accepted: 06/22/2008] [Indexed: 11/26/2022]
13
Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008; 24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
Affiliation(s)
- Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany.
14
Rattei T, Tischler P, Arnold R, Hamberger F, Krebs J, Krumsiek J, Wachinger B, Stümpflen V, Mewes W. SIMAP--structuring the network of protein similarities. Nucleic Acids Res 2007; 36:D289-92. [PMID: 18037617 PMCID: PMC2238827 DOI: 10.1093/nar/gkm963] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022] Open
Abstract
Protein sequences are the most important source of evolutionary and functional information for new proteins. In order to facilitate the computationally intensive tasks of sequence analysis, the Similarity Matrix of Proteins (SIMAP) database aims to provide a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. As of September 2007, SIMAP covers ∼17 million proteins and more than 6 million non-redundant sequences and provides a complete annotation based on InterPro 16. Novel features of SIMAP include a new, portlet-based web portal providing multiple, structured views on retrieved proteins and integration of protein clusters and a unique search method for similar domain architectures. Access to SIMAP is freely provided for academic use through the web portal for individuals at http://mips.gsf.de/simap/ and through Web Services for programmatic access at http://mips.gsf.de/webservices/services/SimapService2.0?wsdl.
Affiliation(s)
- Thomas Rattei
- Chair of Genome Oriented Bioinformatics, Center of Life and Food Science, Technische Universität München, 85350 Freising-Weihenstephan, Germany
15
Zhang W, Li L, Li X, Jiang W, Huo J, Wang Y, Lin M, Rao S. Unravelling the hidden heterogeneities of diffuse large B-cell lymphoma based on coupled two-way clustering. BMC Genomics 2007; 8:332. [PMID: 17888167 PMCID: PMC2082044 DOI: 10.1186/1471-2164-8-332] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 09/28/2006] [Accepted: 09/22/2007] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND It is becoming increasingly clear that our current taxonomy of clinical phenotypes is mixed with molecular heterogeneity. Defining the hidden molecularly distinct diseases using modern large-scale genomic approaches is of vital importance for refined clinical practice and improved intervention strategies. Microarray omics technology has provided a powerful way to dissect hidden genetic heterogeneity of complex diseases. The aim of this study was thus to develop a bioinformatics approach to seek the transcriptional features leading to the hidden subtyping of a complex clinical phenotype. The basic strategy of the proposed method was to iteratively partition the sample and feature space in two ways with the super-paramagnetic clustering technique and to seek hard and robust gene clusters that lead to a natural partition of disease samples and that have the highest functionally conceptual consensus evaluated with Gene Ontology. RESULTS We applied the proposed method to two publicly available microarray datasets of diffuse large B-cell lymphoma (DLBCL), a notoriously heterogeneous phenotype. A feature subset of 30 genes (38 probes) derived from analysis of the first dataset consisting of 4026 genes and 42 DLBCL samples identified three categories of patients with very different five-year overall survival rates (70.59%, 44.44% and 14.29% respectively; p = 0.0017). Analysis of the second dataset consisting of 7129 genes and 58 DLBCL samples revealed a feature subset of 13 genes (16 probes) that not only replicated the findings of the important DLBCL genes (e.g. JAW1 and BCL7A), but also identified three clinically similar subtypes (with 5-year overall survival rates of 63.13%, 34.92% and 15.38% respectively; p = 0.0009) to those identified in the first dataset. 
Finally, we built a multivariate Cox proportional-hazards prediction model for each feature subset and defined JAW1 as one of the most significant predictor (p = 0.005 and 0.014; hazard ratios = 0.02 and 0.03, respectively for two datasets) for both DLBCL cohorts under study. CONCLUSION Our results showed that the proposed algorithm is a promising computational strategy for peeling off the hidden genetic heterogeneity based on transcriptionally profiling disease samples, which may lead to an improved diagnosis and treatment of cancers.
Affiliation(s)
- Wei Zhang
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Li Li
- Institute of Medical Genetics, Tongji University, Shanghai 200092, China
- Xia Li
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Institute of Medical Genetics, Tongji University, Shanghai 200092, China
- Department of Computer Science, Harbin Institute of Technology, Harbin 150080, China
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Wei Jiang
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Jianmin Huo
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- Yadong Wang
- Department of Computer Science, Harbin Institute of Technology, Harbin 150080, China
- Meihua Lin
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Department of Molecular Cardiology, Cleveland Clinic, Cleveland, OH 44195, USA
- Shaoqi Rao
- The First Clinical College, Department of Bioinformatics, and the Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
- The Biomedical Engineering Institute, Capital Medical University, Beijing 100054, China
- Department of Molecular Cardiology, Cleveland Clinic, Cleveland, OH 44195, USA
16
Kelil A, Wang S, Brzezinski R, Fleury A. CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics 2007; 8:286. [PMID: 17683581 PMCID: PMC1976428 DOI: 10.1186/1471-2105-8-286] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Received: 02/06/2007] [Accepted: 08/04/2007] [Indexed: 11/14/2022]
Abstract
Background The rapid growth of available protein data makes clustering within protein families increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is designed specifically for non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions". Results To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms on both hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity. Conclusion We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.
Affiliation(s)
- Abdellali Kelil
- Département d'informatique, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC, Canada.
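As a rough illustration of what "alignment-free" similarity means in the CLUSS abstract above, the sketch below scores two unaligned sequences by the overlap of their length-k subsequences. This is not the SMS measure, which the abstract does not specify; the k-mer Jaccard score, the function names and the toy sequences are all assumptions made for illustration.

```python
# Illustrative sketch only: a k-mer Jaccard score, NOT the SMS measure from the paper.

def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity between the k-mer sets of two unaligned sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    # Sequences shorter than k have no k-mers; treat such a pair as dissimilar.
    return len(ka & kb) / len(union) if union else 0.0


# Hypothetical toy sequences, not taken from the paper's datasets.
seqs = {"s1": "MKTAYIAKQR", "s2": "MKTAYIAKQL", "s3": "GSHMLEDPVA"}
names = sorted(seqs)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, round(kmer_similarity(seqs[x], seqs[y]), 3))
```

A pairwise score of this kind could feed any distance-based clustering routine. Unlike a substitution-aware measure, it treats all amino acid mismatches as equally costly.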
17
Cho YR, Hwang W, Ramanathan M, Zhang A. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007; 8:265. [PMID: 17650343 PMCID: PMC1971074 DOI: 10.1186/1471-2105-8-265] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Received: 04/06/2007] [Accepted: 07/24/2007] [Indexed: 12/05/2022]
Abstract
Background The systematic analysis of protein-protein interactions can enable a better understanding of cellular organization, processes and functions. Functional modules can be identified from the protein interaction networks derived from experimental data sets. However, these analyses are challenging because of the presence of unreliable interactions and the complex connectivity of the network. Integrating protein-protein interactions with data from other sources can improve the effectiveness of functional module detection algorithms. Results We have developed novel metrics, called semantic similarity and semantic interactivity, which use Gene Ontology (GO) annotations to measure the reliability of protein-protein interactions. A protein interaction network can be converted into a weighted graph representation by assigning the reliability value of each interaction as its weight. We present a flow-based modularization algorithm to efficiently identify overlapping modules in the weighted interaction networks. The experimental results show that the semantic similarity and semantic interactivity of interacting pairs were positively correlated with functional co-occurrence. The effectiveness of the algorithm for identifying modules was evaluated using functional categories from the MIPS database. We demonstrated that our algorithm had higher accuracy than competing approaches. Conclusion The integration of protein interaction networks with GO annotation data and the capability of detecting overlapping modules substantially improve the accuracy of module identification.
Affiliation(s)
- Young-Rae Cho
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
- Woochang Hwang
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
- Murali Ramanathan
- Department of Pharmaceutical Science, State University of New York, Buffalo, NY, USA
- Aidong Zhang
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
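The weighting step described in the Cho et al. abstract above, in which each interaction receives a reliability value as its edge weight, can be sketched as follows. The abstract does not define the semantic similarity metric, so a plain Jaccard overlap of annotation sets stands in for it here. The function names, protein identifiers and GO placeholders are hypothetical.

```python
# Illustrative sketch only: Jaccard overlap of GO annotation sets is a stand-in
# for the paper's semantic similarity metric, which the abstract does not define.

def go_overlap(terms_a: set[str], terms_b: set[str]) -> float:
    """Jaccard overlap of two GO annotation sets (0.0 if both are empty)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def weight_interactions(interactions, annotations):
    """Build an undirected weighted graph as a dict of dicts.

    Each interaction (p, q) becomes an edge weighted by the overlap of the
    two proteins' annotation sets; unannotated proteins get an empty set.
    """
    graph: dict[str, dict[str, float]] = {}
    for p, q in interactions:
        w = go_overlap(annotations.get(p, set()), annotations.get(q, set()))
        graph.setdefault(p, {})[q] = w
        graph.setdefault(q, {})[p] = w
    return graph


# Hypothetical proteins and placeholder annotation IDs (not real GO terms).
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]
annotations = {
    "P1": {"GO:x1", "GO:x2"},
    "P2": {"GO:x1", "GO:x2", "GO:x3"},
    "P3": {"GO:x4"},
}
graph = weight_interactions(interactions, annotations)
for protein in sorted(graph):
    print(protein, graph[protein])
```

Edges with low weights mark interactions whose partners share few annotations. A module detection step could down-weight or ignore those edges.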
18
Marttinen P, Corander J, Törönen P, Holm L. Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics 2006; 22:2466-74. [PMID: 16870932 DOI: 10.1093/bioinformatics/btl411] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION The rapid increase in the amount of protein sequence data has created a need for automated identification of evolutionarily related subgroups from large datasets. Existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution. RESULTS We introduce a Bayesian model-based approach to the simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often used for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm, which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets, the method also highlights the residues close to the active site. AVAILABILITY Software 'kPax' is available at http://www.rni.helsinki.fi/jic/softa.html
Affiliation(s)
- Pekka Marttinen
- Department of Mathematics and Statistics, PO Box 68, 00014 University of Helsinki, Finland.
19
Abstract
The development of on-line software tools is changing the way we traditionally perform our analysis in drug design, but will chemoinformatics be forever behind bioinformatics in this development?
Affiliation(s)
- Igor V Tetko
- Institute for Bioinformatics, GSF, Forschungszentrum fuer Umwelt und Gesundheit, Ingolstaedter Landstrasse 1, D-85764 Neuherberg, Germany.