1
Elmanova A, Jahn BO, Presselt M. Catching the π-Stacks: Prediction of Aggregate Structures of Porphyrin. J Phys Chem A 2024. [PMID: 39520375 DOI: 10.1021/acs.jpca.4c05969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
π-π interactions decisively shape the supramolecular structure and functionality of π-conjugated molecular semiconductor materials. Despite the customizable molecular building blocks, predicting their supramolecular structure remains a challenge. Traditionally, force field methods have been used due to the complexity of these structures, but advances in computational power have enabled ab initio approaches such as density functional theory (DFT). DFT is particularly suitable for finding energetically favorable structures of dye aggregates, which are determined by a large number of different interactions, but a systematic aggregate search can still be very challenging due to the large number of possible geometries. In this work, we show ways to overcome this challenge. We investigate how finely translational and rotational lattices must be structured to identify all energetic minima of π-stack structures, focusing on porphyrins as a prototype challenge. Our approach involves single-point DFT calculations of systematically varied dimer geometries, identification of local energy minima, hierarchical grouping of geometrically similar structures, and optimization of the energetically favorable representatives of each geometric family. This ab initio method provides a general framework for the systematic prediction of aggregate structures and reveals geometrically diverse and energetically favorable dimers.
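The grid-scan and minimum-identification steps named in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_energy` is an analytic stand-in and not a DFT calculation, the two coordinates (a lateral shift and a rotation angle) and their step sizes are hypothetical, and the neighbour test is one simple way to flag local minima on a grid, not the authors' procedure.

```python
import math
from itertools import product


def toy_energy(dx, phi_deg):
    """Analytic stand-in for a single-point energy (NOT a DFT result)."""
    phi = math.radians(phi_deg)
    return math.cos(2 * math.pi * dx / 4.0) + 0.5 * math.cos(4 * phi)


def grid_local_minima(xs, phis, energy):
    """Return (x, phi, E) for grid points strictly lower than all neighbours.

    The translation axis is treated as open (no wrap-around); the rotation
    axis is treated as periodic, so the last angle neighbours the first.
    """
    e = [[energy(x, p) for p in phis] for x in xs]
    minima = []
    for i, j in product(range(len(xs)), range(len(phis))):
        neighbours = []
        for di, dj in product((-1, 0, 1), repeat=2):
            if (di, dj) == (0, 0):
                continue
            ni = i + di
            if 0 <= ni < len(xs):
                nj = (j + dj) % len(phis)  # periodic in the angle
                neighbours.append(e[ni][nj])
        if all(e[i][j] < n for n in neighbours):
            minima.append((xs[i], phis[j], e[i][j]))
    return minima


# Hypothetical grid: shifts 0.0 to 4.0 in steps of 0.5, angles 0 to 165 in steps of 15.
xs = [0.5 * k for k in range(9)]
phis = [15 * k for k in range(12)]
minima = grid_local_minima(xs, phis, toy_energy)
for x, phi, e_min in minima:
    print(f"shift={x:.1f}  angle={phi:3d}  E={e_min:.3f}")
```

On a real scan, each flagged grid point would then be grouped with geometrically similar minima and the best representative of each group optimized, as the abstract describes; a coarser grid than the one assumed here can skip a minimum entirely, which is the resolution question the paper addresses.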
Affiliation(s)
- Anna Elmanova
  - Institute of Physical Chemistry, Friedrich Schiller University Jena, Helmholtzweg 4, 07743 Jena, Germany
  - Leibniz Institute of Photonic Technology (IPHT), Albert-Einstein-Str. 9, 07745 Jena, Germany
  - SciClus GmbH & Co. KG, Moritz-von-Rohr-Str. 1a, 07745 Jena, Germany
- Burkhard O Jahn
  - SciClus GmbH & Co. KG, Moritz-von-Rohr-Str. 1a, 07745 Jena, Germany
- Martin Presselt
  - Institute of Physical Chemistry, Friedrich Schiller University Jena, Helmholtzweg 4, 07743 Jena, Germany
  - Leibniz Institute of Photonic Technology (IPHT), Albert-Einstein-Str. 9, 07745 Jena, Germany
  - SciClus GmbH & Co. KG, Moritz-von-Rohr-Str. 1a, 07745 Jena, Germany
  - Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
2
Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. International Journal of Data Science and Analytics 2023. [DOI: 10.1007/s41060-022-00381-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2023]
3
Possible functional proximity of various organisms based on the bioinformatics analysis of their taste receptors. Int J Biol Macromol 2022; 222:2105-2121. [DOI: 10.1016/j.ijbiomac.2022.10.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Accepted: 10/02/2022] [Indexed: 11/05/2022]
4
Robin V, Bodein A, Scott-Boyer MP, Leclercq M, Périn O, Droit A. Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context. Front Mol Biosci 2022; 9:962799. [PMID: 36158572 PMCID: PMC9494275 DOI: 10.3389/fmolb.2022.962799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Accepted: 08/16/2022] [Indexed: 11/26/2022]
Abstract
At the heart of the cellular machinery through the regulation of cellular functions, protein-protein interactions (PPIs) have a significant role. PPIs can be analyzed with network approaches. Construction of a PPI network requires prediction of the interactions. All PPIs form a network. Different biases such as lack of data, recurrence of information, and false interactions make the network unstable. Integrated strategies allow solving these different challenges. These approaches have shown encouraging results for the understanding of molecular mechanisms, drug action mechanisms, and identification of target genes. In order to give more importance to an interaction, it is evaluated by different confidence scores. These scores allow the filtration of the network and thus facilitate the representation of the network, essential steps to the identification and understanding of molecular mechanisms. In this review, we will discuss the main computational methods for predicting PPI, including ones confirming an interaction as well as the integration of PPIs into a network, and we will discuss visualization of these complex data.
Affiliation(s)
- Vivian Robin
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Antoine Bodein
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Marie-Pier Scott-Boyer
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Mickaël Leclercq
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Olivier Périn
  - Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Arnaud Droit
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
5
Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time. PLoS Comput Biol 2017; 13:e1005518. [PMID: 28437450 PMCID: PMC5421816 DOI: 10.1371/journal.pcbi.1005518] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 10/23/2016] [Revised: 05/08/2017] [Accepted: 04/13/2017] [Indexed: 12/30/2022]
Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
Affiliation(s)
- Yunpeng Cai
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - * E-mail: (YC); (YS)
- Wei Zheng
  - Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Jin Yao
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Yujie Yang
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Volker Mai
  - Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
- Qi Mao
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Yijun Sun
  - Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - Department of Biostatistics, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - * E-mail: (YC); (YS)
6
Zaslavsky L, Ciufo S, Fedorov B, Tatusova T. Clustering analysis of proteins from microbial genomes at multiple levels of resolution. BMC Bioinformatics 2016; 17 Suppl 8:276. [PMID: 27586436 PMCID: PMC5009818 DOI: 10.1186/s12859-016-1112-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/27/2022]
Abstract
Background Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy. Results Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provides a robust representation and high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters. Conclusion The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data at different levels of detail to be obtained and data redundancy eliminated while keeping biologically interesting variations.
Affiliation(s)
- Leonid Zaslavsky
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Stacy Ciufo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Boris Fedorov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Tatiana Tatusova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
7
Huwe PJ, Xu Q, Shapovalov MV, Modi V, Andrake MD, Dunbrack RL. Biological function derived from predicted structures in CASP11. Proteins 2016; 84 Suppl 1:370-91. [PMID: 27181425 DOI: 10.1002/prot.24997] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 07/11/2015] [Revised: 01/10/2016] [Accepted: 01/18/2016] [Indexed: 12/26/2022]
Abstract
In CASP11, the organizers sought to bring the biological inferences from predicted structures to the fore. To accomplish this, we assessed the models for their ability to perform quantifiable tasks related to biological function. First, for 10 targets that were probable homodimers, we measured the accuracy of docking the models into homodimers as a function of GDT-TS of the monomers, which produced characteristic L-shaped plots. At low GDT-TS, none of the models could be docked correctly as homodimers. Above GDT-TS of ∼60%, some models formed correct homodimers in one of the largest docked clusters, while many other models at the same values of GDT-TS did not. Docking was more successful when many of the templates shared the same homodimer. Second, we docked a ligand from an experimental structure into each of the models of one of the targets. Docking to the models with two different programs produced poor ligand RMSDs with the experimental structure. Measures that evaluated similarity of contacts were reasonable for some of the models, although there was not a significant correlation with model accuracy. Finally, we assessed whether models would be useful in predicting the phenotypes of missense mutations in three human targets by comparing features calculated from the models with those calculated from the experimental structures. The models were successful in reproducing accessible surface areas but there was little correlation of model accuracy with calculation of FoldX evaluation of the change in free energy between the wild-type and the mutant.
Affiliation(s)
- Peter J Huwe
  - Fox Chase Cancer Center, Philadelphia, Pennsylvania, 19111
- Qifang Xu
  - Fox Chase Cancer Center, Philadelphia, Pennsylvania, 19111
- Vivek Modi
  - Fox Chase Cancer Center, Philadelphia, Pennsylvania, 19111
- Mark D Andrake
  - Fox Chase Cancer Center, Philadelphia, Pennsylvania, 19111
8
Furlong SE, Ford A, Albarnez-Rodriguez L, Valvano MA. Topological analysis of the Escherichia coli WcaJ protein reveals a new conserved configuration for the polyisoprenyl-phosphate hexose-1-phosphate transferase family. Sci Rep 2015; 5:9178. [PMID: 25776537 PMCID: PMC4361858 DOI: 10.1038/srep09178] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 11/24/2014] [Accepted: 02/24/2015] [Indexed: 11/25/2022]
Abstract
WcaJ is an Escherichia coli membrane enzyme catalysing the biosynthesis of undecaprenyl-diphosphate-glucose, the first step in the assembly of colanic acid exopolysaccharide. WcaJ belongs to a large family of polyisoprenyl-phosphate hexose-1-phosphate transferases (PHPTs) sharing a similar predicted topology consisting of an N-terminal domain containing four transmembrane helices (TMHs), a large central periplasmic loop, and a C-terminal domain containing the fifth TMH (TMH-V) and a cytosolic tail. However, the topology of PHPTs has not been experimentally validated. Here, we investigated the topology of WcaJ using a combination of LacZ/PhoA reporter fusions and sulfhydryl labelling by PEGylation of novel cysteine residues introduced into a cysteine-less WcaJ. The results showed that the large central loop and the C-terminal tail both reside in the cytoplasm and are separated by TMH-V, which does not fully span the membrane, likely forming a "hairpin" structure. Modelling of TMH-V revealed that a highly conserved proline might contribute to a helix-break-helix structure in all PHPT members. Bioinformatic analyses show that all of these features are conserved in PHPT homologues from Gram-negative and Gram-positive bacteria. Our data demonstrate a novel topological configuration for PHPTs, which is proposed as a signature for all members of this enzyme family.
Affiliation(s)
- Sarah E. Furlong
  - Centre for Human Immunology, Department of Microbiology and Immunology, University of Western Ontario, London, Ontario, N6A 5C1, Canada
- Amy Ford
  - Centre for Infection and Immunity, Queen's University Belfast, Belfast, United Kingdom, BT9 7AE
- Lorena Albarnez-Rodriguez
  - Centre for Human Immunology, Department of Microbiology and Immunology, University of Western Ontario, London, Ontario, N6A 5C1, Canada
- Miguel A. Valvano
  - Centre for Human Immunology, Department of Microbiology and Immunology, University of Western Ontario, London, Ontario, N6A 5C1, Canada
  - Centre for Infection and Immunity, Queen's University Belfast, Belfast, United Kingdom, BT9 7AE
9
Massive fungal biodiversity data re-annotation with multi-level clustering. Sci Rep 2014; 4:6837. [PMID: 25355642 PMCID: PMC4213798 DOI: 10.1038/srep06837] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/21/2014] [Accepted: 10/10/2014] [Indexed: 11/08/2022]
Abstract
With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours.
10
Ben-Tal N, Kolodny R. Representation of the Protein Universe using Classifications, Maps, and Networks. Isr J Chem 2014. [DOI: 10.1002/ijch.201400001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
11
Chesters D, Zhu CD. A protocol for species delineation of public DNA databases, applied to the Insecta. Syst Biol 2014; 63:712-25. [PMID: 24929897 DOI: 10.1093/sysbio/syu038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 01/30/2023]
Abstract
Public DNA databases are composed of data from many different taxa, although the taxonomic annotation on sequences is not always complete, which impedes the utilization of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, although applying species clustering to whole databases requires consolidation of results from numerous undefined gene regions, and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After filtration of duplicated data, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, then (iii) species units are matched across loci to form the matrix L × S, a set of global (multilocus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences, then delineation of species units and assignment of species names were performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial at the cytochrome oxidase subunit I locus in particular, but made feasible through the development of software that performs pairwise alignments within the taxonomic framework, while accounting for the different ranks at which sequences are labeled with taxonomic information. 
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species IDs (98.7% of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multipartite matching algorithm to form multilocus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. 38,574 of these units contained only species labeled data, 34,891 contained only unlabeled data, leaving 4,626 units composed both of labeled and unlabeled sequences. In addition to giving estimates of species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence data sets. In particular, the L × S matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
Affiliation(s)
- Douglas Chesters
  - Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, PR China
- Chao-Dong Zhu
  - Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, PR China
12
Hüffner F, Komusiewicz C, Liebtrau A, Niedermeier R. Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:455-467. [PMID: 26356014 DOI: 10.1109/tcbb.2013.177] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
A popular clustering algorithm for biological networks which was proposed by Hartuv and Shamir identifies nonoverlapping highly connected components. We extend the approach taken by this algorithm by introducing the combinatorial optimization problem Highly Connected Deletion, which asks for removing as few edges as possible from a graph such that the resulting graph consists of highly connected components. We show that Highly Connected Deletion is NP-hard and provide a fixed-parameter algorithm and a kernelization. We propose exact and heuristic solution strategies, based on polynomial-time data reduction rules and integer linear programming with column generation. The data reduction typically identifies 75 percent of the edges that are deleted for an optimal solution; the column generation method can then optimally solve protein interaction networks with up to 6,000 vertices and 13,500 edges within five hours. Additionally, we present a new heuristic that finds more clusters than the method by Hartuv and Shamir.
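To make the objective of Highly Connected Deletion concrete, here is a minimal sketch that only verifies a candidate clustering; it is not the fixed-parameter algorithm, the data reduction, or the column generation from the paper. It assumes the Hartuv and Shamir definition (a graph on n vertices is highly connected if its edge connectivity exceeds n/2) and uses the equivalent minimum-degree test that follows from Chartrand's theorem for simple graphs. The function names, the example graph, and the convention that a single vertex counts as a valid cluster are all assumptions made for illustration.

```python
def is_highly_connected(vertices, edges):
    """True if the subgraph induced by `vertices` is highly connected.

    Test used: every vertex has more than n/2 neighbours inside the set.
    For simple graphs this is equivalent to edge connectivity > n/2.
    Assumption: a single vertex is accepted as a trivial cluster.
    `edges` must list each undirected edge once, with no self-loops.
    """
    vs = set(vertices)
    n = len(vs)
    if n == 1:
        return True
    degree = {v: 0 for v in vs}
    for u, w in edges:
        if u in vs and w in vs:
            degree[u] += 1
            degree[w] += 1
    return all(2 * d > n for d in degree.values())


def deletion_cost(edges, clusters):
    """Count edges running between clusters, i.e. the edges to delete.

    Returns None if any cluster is not highly connected, because such a
    partition is not a feasible solution.
    """
    if not all(is_highly_connected(c, edges) for c in clusters):
        return None
    label = {v: k for k, c in enumerate(clusters) for v in c}
    return sum(1 for u, w in edges if label[u] != label[w])


# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(deletion_cost(edges, [{0, 1, 2}, {3, 4, 5}]))  # feasible split
print(deletion_cost(edges, [{0, 1, 2, 3, 4, 5}]))    # one big cluster
```

Splitting the example into its two triangles deletes one edge, while keeping all six vertices together is infeasible because some vertices have only two of the more than three neighbours required. The optimization problem asks for the feasible partition with the smallest such cost.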
13
Moktali V, Park J, Fedorova-Abrams ND, Park B, Choi J, Lee YH, Kang S. Systematic and searchable classification of cytochrome P450 proteins encoded by fungal and oomycete genomes. BMC Genomics 2012; 13:525. [PMID: 23033934 PMCID: PMC3505482 DOI: 10.1186/1471-2164-13-525] [Citation(s) in RCA: 112] [Impact Index Per Article: 9.3] [Received: 05/28/2012] [Accepted: 09/28/2012] [Indexed: 12/15/2022]
Abstract
BACKGROUND Cytochrome P450 proteins (CYPs) play diverse and pivotal roles in fungal metabolism and adaptation to specific ecological niches. Fungal genomes encode extremely variable "CYPomes" ranging from one to more than 300 CYPs. Despite the rapid growth of sequenced fungal and oomycete genomes and the resulting influx of predicted CYPs, the vast majority of CYPs remain functionally uncharacterized. To facilitate the curation and functional and evolutionary studies of CYPs, we previously developed Fungal Cytochrome P450 Database (FCPD), which included CYPs from 70 fungal and oomycete species. Here we present a new version of FCPD (1.2) with more data and an improved classification scheme. RESULTS The new database contains 22,940 CYPs from 213 species divided into 2,579 clusters and 115 clans. By optimizing the clustering pipeline, we were able to uncover 36 novel clans and to assign 153 orphan CYP families to specific clans. To augment their functional annotation, CYP clusters were mapped to David Nelson's P450 databases, which archive a total of 12,500 manually curated CYPs. Additionally, over 150 clusters were functionally classified based on sequence similarity to experimentally characterized CYPs. Comparative analysis of fungal and oomycete CYPomes revealed cases of both extreme expansion and contraction. The most dramatic expansions in fungi were observed in clans CYP58 and CYP68 (Pezizomycotina), clans CYP5150 and CYP63 (Agaricomycotina), and family CYP509 (Mucoromycotina). Although much of the extraordinary diversity of the pan-fungal CYPome can be attributed to gene duplication and adaptive divergence, our analysis also suggests a few potential horizontal gene transfer events. Updated families and clans can be accessed through the new version of the FCPD database. 
CONCLUSIONS FCPD version 1.2 provides a systematic and searchable catalogue of 9,550 fungal CYP sequences (292 families) encoded by 108 fungal species and 147 CYP sequences (9 families) encoded by five oomycete species. In comparison to the first version, it offers a more comprehensive clan classification, is fully compatible with Nelson's P450 databases, and has expanded functional categorization. These features will facilitate functional annotation and classification of CYPs encoded by newly sequenced fungal and oomycete genomes. Additionally, the classification system will aid in studying the roles of CYPs in the evolution of fungal adaptation to specific ecological niches.
Affiliation(s)
- Venkatesh Moktali
  - Integrative Biosciences program in Bioinformatics & Genomics, The Pennsylvania State University, University Park, PA, USA
14
Meinel T, Krause A. Meta-analysis of general bacterial subclades in whole-genome phylogenies using tree topology profiling. Evol Bioinform Online 2012; 8:489-525. [PMID: 22915837 PMCID: PMC3422217 DOI: 10.4137/ebo.s9642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
In the last two decades, a large number of whole-genome phylogenies have been inferred to reconstruct the Tree of Life (ToL). Underlying data models range from gene or functionality content in species to phylogenetic gene family trees and multiple sequence alignments of concatenated protein sequences. Diversity in data models together with the use of different tree reconstruction techniques, disruptive biological effects and the steadily increasing number of genomes have led to a huge diversity in published phylogenies. Comparison of those and, moreover, identification of the impact of inference properties (underlying data model, inference technique) on particular reconstructions is almost impossible. In this work, we introduce tree topology profiling as a method to compare already published whole-genome phylogenies. This method requires visual determination of the particular topology in a drawn whole-genome phylogeny for a set of particular bacterial clans. For each clan, neighborhoods to other bacteria are collected into a catalogue of generalized alternative topologies. Particular topology alternatives found for an ordered list of bacterial clans reveal a topology profile that represents the analyzed phylogeny. To simulate the inhomogeneity of published gene content phylogenies we generate a set of seven phylogenies using different inference techniques and the SYSTERS-PhyloMatrix data model. After tree topology profiling on in total 54 selected published and newly inferred phylogenies, we separate artefactual from biologically meaningful phylogenies and associate particular inference results (phylogenies) with inference background (inference techniques as well as data models). Topological relationships of particular bacterial species groups are presented. With this work we introduce tree topology profiling into the scientific field of comparative phylogenomics.
Affiliation(s)
- Thomas Meinel
  - Charité-University Medicine Berlin, Institute for Physiology, Structural Bioinformatics Group, Thielallee 71, 14195 Berlin, Germany
15
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A. GFam: a platform for automatic annotation of gene families. Nucleic Acids Res 2012; 40:e152. [PMID: 22790981 PMCID: PMC3479161 DOI: 10.1093/nar/gks631] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
Affiliation(s)
- Rajkumar Sasidharan
- Department of Molecular, Cell and Developmental Biology, University of California at Los Angeles, Los Angeles, CA 90095, USA.
|
16
|
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V. A laboratory information management system for DNA barcoding workflows. Integr Biol (Camb) 2012; 4:744-55. [PMID: 22344310 DOI: 10.1039/c2ib00146b] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/21/2022]
Abstract
This paper presents a laboratory information management system (LIMS) for DNA sequences, created for and based on the needs of a DNA barcoding project at the CBS-KNAW Fungal Biodiversity Centre (Utrecht, the Netherlands). DNA barcoding is a global initiative for species identification through simple DNA sequence markers. We aim to generate barcode data for all strains (or specimens) included in the collection (currently ca. 80 k). The LIMS has been developed to better manage large amounts of sequence data and to keep track of the whole experimental procedure. The system has allowed us to classify strains more efficiently as the quality of sequence data has improved; as a result, up-to-date taxonomic names have been given to strains and more accurate correlation analyses have been carried out.
Affiliation(s)
- Thuy Duong Vu
- Bioinformatics Group, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands.
|
17
|
Nikas JB, Low WC. Application of clustering analyses to the diagnosis of Huntington disease in mice and other diseases with well-defined group boundaries. Comput Methods Programs Biomed 2011; 104:e133-e147. [PMID: 21529982 PMCID: PMC3166551 DOI: 10.1016/j.cmpb.2011.03.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/03/2010] [Revised: 02/08/2011] [Accepted: 03/11/2011] [Indexed: 05/30/2023]
Abstract
Nuclear magnetic resonance (NMR) spectroscopy has emerged as a technology that can provide metabolite information within organ systems in vivo. In this study, we introduced a new method of employing a clustering algorithm to develop a diagnostic model that can differentially diagnose a single unknown subject in a disease with well-defined group boundaries. We investigated four clustering algorithms (K-means, Fuzzy, Hierarchical, and Medoid Partitioning) and used three tests to assess whether each was suitable and accurate enough for diagnostic purposes. To accomplish this goal, we studied the striatal metabolomic profile of R6/2 Huntington disease (HD) transgenic mice and that of wild type (WT) mice using high field in vivo proton NMR spectroscopy (9.4T). We tested all four clustering algorithms (1) with the original R6/2 HD mice and WT mice, (2) with unknown mice, whose status had been determined via genotyping, and (3) for their ability to separate the original R6/2 mice into the two age subgroups (8 and 12 weeks old). Only our diagnostic models that employed ROC-supervised Fuzzy, unsupervised Fuzzy, and ROC-supervised K-means Clustering passed all three stringent tests with 100% accuracy, indicating that they may be used for diagnostic purposes.
Affiliation(s)
- Jason B. Nikas
- Department of Neurosurgery, University of Minnesota, Minneapolis, MN, USA
- Pharmaco-Neuro-Immunology Program, University of Minnesota, Minneapolis, MN, USA
- Walter C. Low
- Department of Neurosurgery, University of Minnesota, Minneapolis, MN, USA
- Graduate Program in Neuroscience, University of Minnesota, Minneapolis, MN, USA
- Department of Integrative Biology and Physiology, Medical School, University of Minnesota, Minneapolis, MN, USA
|
18
|
Conservation and Occurrence of Trans-Encoded sRNAs in the Rhizobiales. Genes (Basel) 2011; 2:925-56. [PMID: 24710299 PMCID: PMC3927594 DOI: 10.3390/genes2040925] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 08/31/2011] [Revised: 10/24/2011] [Accepted: 10/26/2011] [Indexed: 12/13/2022] Open
Abstract
Post-transcriptional regulation by trans-encoded sRNAs, for example via base-pairing with target mRNAs, is a common feature in bacteria and influences various cell processes, e.g., response to stress factors. Several studies based on computational and RNA-seq approaches identified approximately 180 trans-encoded sRNAs in Sinorhizobium meliloti. The starting point of this report is a set of 52 trans-encoded sRNAs derived from these earlier studies. Sequence homology and structural conservation analyses were combined to elucidate the occurrence and distribution of conserved trans-encoded sRNAs in the order Rhizobiales. This approach resulted in 39 RNA family models (RFMs), which showed various taxonomic distribution patterns. Whereas the majority of RFMs were restricted to Sinorhizobium species or the Rhizobiaceae, members of a few RFMs were more widely distributed in the Rhizobiales. Access to these data is provided via the RhizoGATE portal [1,2].
|
19
|
Meinel T, Schweiger MR, Ludewig AH, Chenna R, Krobitsch S, Herwig R. Ortho2ExpressMatrix--a web server that interprets cross-species gene expression data by gene family information. BMC Genomics 2011; 12:483. [PMID: 21970648 PMCID: PMC3202273 DOI: 10.1186/1471-2164-12-483] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/07/2011] [Accepted: 10/04/2011] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND The study of gene families is pivotal for the understanding of gene evolution across different organisms, and such phylogenetic background is often used to infer biochemical functions of genes. Modern high-throughput experiments offer the possibility to analyze the entire transcriptome of an organism; however, it is often difficult to deduce functional information from these data. RESULTS To improve functional interpretation of gene expression we introduce Ortho2ExpressMatrix, a novel tool that integrates complex gene family information, computed from sequence similarity, with comparative gene expression profiles of two pre-selected biological objects: gene families are displayed as two-dimensional matrices. Parameters of the tool are object type (two organisms, two individuals, two tissues, etc.), type of computational gene family inference, experimental meta-data, microarray platform, gene annotation level and genome build. Family information in Ortho2ExpressMatrix is based on computationally different protein family approaches such as EnsemblCompara, InParanoid, SYSTERS and Ensembl Family. Currently, the respective all-against-all associations are available for five species: human, mouse, worm, fruit fly and yeast. Additionally, microRNA expression can be examined with respect to miRBase or TargetScan families. The visualization typical of Ortho2ExpressMatrix is a matrix view that displays functional traits of genes (differential expression) as well as sequence similarity of protein family members (BLAST e-values) in colour codes. Such translations are intended to facilitate the user's perception of the research object. CONCLUSIONS Ortho2ExpressMatrix integrates gene family information with genome-wide expression data in order to enhance functional interpretation of high-throughput analyses of diseases, environmental factors, genetic modification or compound treatment experiments. The tool explores differential gene expression in the light of orthology, paralogy and structure of gene families up to the point of ambiguity analyses. Results can be used for filtering and prioritization in functional genomic, biomedical and systems biology applications. The web server is freely accessible at http://bioinf-data.charite.de/o2em/cgi-bin/o2em.pl.
Affiliation(s)
- Thomas Meinel
- Structural Bioinformatics Group, Institute for Physiology, Charité-University Medicine Berlin, Thielallee 71, 14195 Berlin, Germany.
|
20
|
Trigg J, Gutwin K, Keating AE, Berger B. Multicoil2: predicting coiled coils and their oligomerization states from sequence in the twilight zone. PLoS One 2011; 6:e23519. [PMID: 21901122 PMCID: PMC3162000 DOI: 10.1371/journal.pone.0023519] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Received: 04/01/2011] [Accepted: 07/20/2011] [Indexed: 12/23/2022] Open
Abstract
The alpha-helical coiled coil can adopt a variety of topologies, among the most common of which are parallel and antiparallel dimers and trimers. We present Multicoil2, an algorithm that predicts both the location and oligomerization state (two versus three helices) of coiled coils in protein sequences. Multicoil2 combines the pairwise correlations of the previous Multicoil method with the flexibility of Hidden Markov Models (HMMs) in a Markov Random Field (MRF). The resulting algorithm integrates sequence features, including pairwise interactions, through multinomial logistic regression to devise an optimized scoring function for distinguishing dimer, trimer and non-coiled-coil oligomerization states; this scoring function is used to produce Markov Random Field potentials that incorporate pairwise correlations localized in sequence. Multicoil2 significantly improves both coiled-coil detection and dimer versus trimer state prediction over the original Multicoil algorithm retrained on a newly constructed database of coiled-coil sequences. The new database, comprising 2,105 sequences containing 124,088 residues, includes reliable structural annotations based on experimental data in the literature. Notably, the enhanced performance of Multicoil2 is evident when tested in stringent leave-family-out cross-validation on the new database, reflecting expected performance on challenging new prediction targets that have minimal sequence similarity to known coiled-coil families. The Multicoil2 program and training database are available for download from http://multicoil2.csail.mit.edu.
Affiliation(s)
- Jason Trigg
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Karl Gutwin
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Amy E. Keating
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- * E-mail: (BB); (AEK)
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory and Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- * E-mail: (BB); (AEK)
|
21
|
Abstract
Transitivity Clustering is a method for partitioning biological data into groups of similar objects, such as genes. It provides integrated access to various functions addressing each step of a typical cluster analysis. To facilitate this, Transitivity Clustering is accessible online and offers three user-friendly interfaces: a powerful stand-alone version, a web interface, and a collection of Cytoscape plug-ins. In this paper, we describe three major workflows: (i) protein (super)family detection with Cytoscape, (ii) protein homology detection with incomplete gold standards and (iii) clustering of gene expression data. This protocol guides the user through the most important features of Transitivity Clustering and takes ∼1 h to complete.
|
22
|
Abstract
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.
Affiliation(s)
- Christian Frech
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
- Nansheng Chen
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
|
23
|
Abstract
Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymie clustering algorithms. Results: Here, we investigate a network-rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with current methods. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data. Contact: jacobmj@cmu.edu
Affiliation(s)
- Jacob M Joseph
- Department of Biological and Computer Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
|
24
|
Fayech S, Essoussi N, Limam M. Partitioning clustering algorithms for protein sequence data sets. BioData Min 2009; 2:3. [PMID: 19341454 PMCID: PMC2678123 DOI: 10.1186/1756-0381-2-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 11/13/2008] [Accepted: 04/02/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome-sequencing projects are currently producing an enormous number of new sequences, causing a rapid increase in the size of protein sequence databases. The unsupervised classification of these data into functional groups or families, i.e. clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs that automatically and accurately classify sequences into families have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them can be categorized into three major groups: hierarchical, graph-based and partitioning methods. Among the various sequence clustering methods in the literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are widely used in other fields, few applications have been found in the field of protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data and whether they can be efficient compared with the published clustering methods. METHODS We developed four partitioning clustering approaches that use the Smith-Waterman local-alignment algorithm to determine pairwise similarities of sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods. RESULTS We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and especially in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.
Affiliation(s)
- Sondes Fayech
- Department of Computer Science, LARODEC Laboratory, Higher Institute of Management, University of Tunis, Tunis, Tunisia.
|
25
|
Francoleon DR, Boontheung P, Yang Y, Kim U, Ytterberg AJ, Denny PA, Denny PC, Loo JA, Gunsalus RP, Ogorzalek Loo RR. S-layer, surface-accessible, and concanavalin A binding proteins of Methanosarcina acetivorans and Methanosarcina mazei. J Proteome Res 2009; 8:1972-82. [PMID: 19228054 PMCID: PMC2666069 DOI: 10.1021/pr800923e] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 12/16/2022]
Abstract
The outermost cell envelope structure of many archaea and bacteria contains a proteinaceous lattice termed the surface layer or S-layer. It is typically composed of only one or two abundant, often posttranslationally modified proteins that self-assemble to form highly organized arrays. Surprisingly, over 100 proteins were annotated to be S-layer components in the archaeal species Methanosarcina acetivorans C2A and Methanosarcina mazei Gö1, reflecting limitations of current predictions. An in vivo biotinylation methodology was devised to affinity tag surface-exposed proteins while overcoming unique challenges in working with these fragile organisms. Cells were adapted to growth under N2 fixing conditions, thus minimizing free amines reactive to the NHS-label, and high pH media compatible with the acylation chemistry were used. A 3-phase separation procedure was employed to isolate intact, labeled cells from lysed-cell derived proteins. Streptavidin affinity enrichment followed by stringent wash conditions removed nonspecifically bound proteins. This methodology revealed S-layer proteins in M. acetivorans C2A and M. mazei Gö1 to be MA0829 and MM1976, respectively. Each was demonstrated to exist as multiple glycosylated forms using SDS-PAGE coupled with glycoprotein-specific staining, and by interaction with the lectin Concanavalin A. A number of additional surface-exposed proteins and glycoproteins were identified, including all three subunits of the thermosome; the latter suggests that the chaperonin complex is both surface- and cytoplasmically localized. This approach provides an alternative strategy to study surface proteins in the archaea.
Affiliation(s)
- Deborah R. Francoleon
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095
- Pinmanee Boontheung
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095
- Yanan Yang
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095
- Unmi Kim
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
- A. Jimmy Ytterberg
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095
- Patricia A. Denny
- University of Southern California School of Dentistry, Los Angeles, CA 90089
- Paul C. Denny
- University of Southern California School of Dentistry, Los Angeles, CA 90089
- Joseph A. Loo
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095
- Department of Biological Chemistry, University of California, Los Angeles, CA 90095
- Robert P. Gunsalus
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
|
26
|
Andreopoulos B, An A, Wang X, Schroeder M. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform 2009; 10:297-314. [PMID: 19240124 DOI: 10.1093/bib/bbn058] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.1] [Indexed: 11/14/2022] Open
Abstract
Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
|
27
|
Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol 2009; 10:207. [PMID: 19226439 PMCID: PMC2688287 DOI: 10.1186/gb-2009-10-2-207] [Citation(s) in RCA: 148] [Impact Index Per Article: 9.9] [Indexed: 11/10/2022] Open
Abstract
Where information on homologous proteins is available, progress is being made in automated prediction of protein function from sequence and structure. With many genomes now sequenced, computational annotation methods to characterize genes and proteins from their sequence are increasingly important. The BioSapiens Network has developed tools to address all stages of this process, and here we review progress in the automated prediction of protein function based on protein sequence and structure.
Affiliation(s)
- Yaniv Loewenstein
- Department of Biological Chemistry, The Hebrew University of Jerusalem, Sudarsky Center, Jerusalem 91904, Israel
|
28
|
Abstract
Motivation: The classification of proteins into homologous groups (families) allows their structure and function to be analysed and compared in an evolutionary context. The modular nature of eukaryotic proteins presents a considerable challenge to the delineation of families, as different local regions within a single protein may share common ancestry with distinct, even mutually exclusive, sets of homologs, thereby creating an intricate web of homologous relationships if full-length sequences are taken as the unit of evolution. We attempt to disentangle this web by developing a fully automated pipeline to delineate protein subsequences that represent sensible units for homology inference, and clustering them into putatively homologous families using the Markov clustering algorithm. Results: Using six eukaryotic proteomes as input, we clustered 162 349 protein sequences into 19 697–77 415 subsequence families depending on granularity of clustering. We validated these Markov clusters of homologous subsequences (MACHOS) against the manually curated Pfam domain families, using a quality measure to assess overlap. Our subsequence families correspond well to known domain families and achieve higher quality scores than do groups generated by fully automated domain family classification methods. We illustrate our approach by analysis of a group of proteins that contains the glutamyl/glutaminyl-tRNA synthetase domain, and conclude that our method can produce high-coverage decomposition of protein sequence space into precise homologous families in a way that takes the modularity of eukaryotic proteins into account. This approach allows for a fine-scale examination of evolutionary histories of proteins encoded in eukaryotic genomes. Contact: m.ragan@imb.uq.edu.au Supplementary information: Supplementary data are available at Bioinformatics online. MACHOS for the six proteomes are available as FASTA-formatted files: http://research1t.imb.uq.edu.au/ragan/machos
Affiliation(s)
- Simon Wong
- ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia
|
29
|
Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 2008; 24:i41-9. [PMID: 18586742 PMCID: PMC2718652 DOI: 10.1093/bioinformatics/btn174] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. APPLICATION We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. RESULTS We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows significant improvement over current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. AVAILABILITY A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
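For orientation, the memory requirement mentioned in this abstract can be seen in a minimal sketch of plain UPGMA. This is the naive textbook procedure that keeps every pairwise dissimilarity in memory, not the authors' MC-UPGMA; the function name `upgma`, the input layout and the toy distances are assumptions made for illustration only.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive UPGMA: repeatedly merge the two closest clusters.

    labels: list of item names (hypothetical input layout).
    dist:   dict mapping frozenset({a, b}) -> dissimilarity for every pair.
    Returns a list of (sorted member tuple, merge height) in merge order.
    """
    clusters = {i: (label,) for i, label in enumerate(labels)}
    # The full set of pairwise cluster distances lives in memory: O(n^2) entries.
    d = {
        (i, j): dist[frozenset((labels[i], labels[j]))]
        for i, j in combinations(range(len(labels)), 2)
    }
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (i, j), height = min(d.items(), key=lambda item: item[1])
        size_i, size_j = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            # Size-weighted mean equals the average over all member pairs.
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        del d[(i, j)]
        merged = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = merged
        merges.append((tuple(sorted(merged)), height))
        next_id += 1
    return merges


# Toy dissimilarities (made up for illustration).
example_dist = {
    frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("AD"): 10.0,
    frozenset("BC"): 6.0, frozenset("BD"): 10.0, frozenset("CD"): 8.0,
}
for members, height in upgma(["A", "B", "C", "D"], example_dist):
    print(members, round(height, 3))
```

The dictionary `d` holds one entry per cluster pair, which is the quadratic memory footprint the abstract describes as prohibitive for very large datasets.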
Affiliation(s)
- Yaniv Loewenstein
- School of Computer Science and Engineering, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
|
30
|
Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing. BMC Bioinformatics 2007; 8:396. [PMID: 17941985 PMCID: PMC2147039 DOI: 10.1186/1471-2105-8-396] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.8] [Received: 08/01/2007] [Accepted: 10/17/2007] [Indexed: 11/28/2022] Open
Abstract
Background Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed. Results We argue that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion FORCE is a practical alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at .
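The objective behind weighted cluster editing can be illustrated with a short sketch. It only scores a given partition under the usual formulation (pairs above a similarity threshold count as edges; removing or adding an edge costs its distance from the threshold) and does not reproduce the FORCE layout heuristic itself. The function name, the threshold value, the toy similarities and the choice to treat missing pairs as similarity 0.0 are all assumptions for illustration.

```python
from itertools import combinations


def cluster_editing_cost(similarity, clusters, threshold):
    """Cost of turning the thresholded similarity graph into the given clusters.

    similarity: dict mapping frozenset({u, v}) -> similarity score
                (pairs not listed are assumed to have similarity 0.0).
    clusters:   iterable of disjoint collections of nodes.
    threshold:  pairs scoring above it are treated as edges.
    """
    cluster_of = {
        node: index for index, members in enumerate(clusters) for node in members
    }
    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        score = similarity.get(frozenset((u, v)), 0.0)
        together = cluster_of[u] == cluster_of[v]
        if score > threshold and not together:
            cost += score - threshold  # an existing edge is cut
        elif score < threshold and together:
            cost += threshold - score  # a missing edge is added
    return cost


# Toy similarities (made up for illustration).
example_similarity = {
    frozenset("ab"): 0.9,
    frozenset("bc"): 0.8,
    frozenset("ac"): 0.2,
    frozenset("cd"): 0.6,
}
for partition in (["abc", "d"], ["ab", "cd"], ["a", "b", "c", "d"]):
    print(partition, cluster_editing_cost(example_similarity, partition, 0.5))
```

A heuristic for this problem searches for the partition with the lowest such cost; the sketch only evaluates candidates supplied by the caller.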
|
31
|
Zhao T, Murphy RF. Automated learning of generative models for subcellular location: Building blocks for systems biology. Cytometry A 2007; 71:978-90. [DOI: 10.1002/cyto.a.20487] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.0] [Indexed: 11/11/2022]
|
32
|
Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space? Malar J 2006; 5:110. [PMID: 17112376 PMCID: PMC1665468 DOI: 10.1186/1475-2875-5-110] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 09/29/2006] [Accepted: 11/17/2006] [Indexed: 11/21/2022] Open
Abstract
The organization and mining of malaria genomic and post-genomic data is important for significantly increasing knowledge of the biology of its causative agents and is motivated, in the longer term, by the need to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space designed from the genomic data of Plasmodium falciparum, while also using the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should, therefore, be as reliable and versatile as possible. In this context, five aspects of the organization and mining of malaria genomic and post-genomic data were examined: 1) the comparison of protein sequences, including compositionally atypical malaria sequences, 2) the high-throughput reconstruction of molecular phylogenies, 3) the representation of biological processes, particularly metabolic pathways, 4) versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Recent progress towards a grid-enabled chemogenomic knowledge space is discussed.
Collapse
|
33
|
Abstract
One of the goals of structural genomics is to obtain a structural representative of almost every fold in nature. A recent estimate suggests that 70%-80% of soluble protein domains identified in the first 1000 genome sequences should be covered by about 25,000 structures, a reasonably achievable goal. As no current estimates exist for the number of membrane protein families, however, it is not possible to know whether family coverage is a realistic goal for membrane proteins. Here we find that virtually all polytopic helical membrane protein families are present in the already known sequences, so we can estimate the total number of families. We find that only approximately 700 polytopic membrane protein families account for 80% of structured residues and approximately 1700 cover 90% of structured residues. Although this is apparently a finite and reachable goal, we estimate that it will likely take more than three decades to obtain the structures needed for 90% residue coverage if current trends continue.
Affiliation(s)
- Amit Oberai
- Department of Chemistry and Biochemistry, UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095-1570, USA
|
34
|
Lienau EK, DeSalle R, Rosenfeld JA, Planet PJ. Reciprocal illumination in the gene content tree of life. Syst Biol 2006; 55:441-53. [PMID: 16861208 DOI: 10.1080/10635150600697416] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022] Open
Abstract
Phylogenies based on gene content rely on statements of primary homology to characterize gene presence or absence. These statements (hypotheses) are usually determined by techniques based on threshold similarity or distance measurements between genes. This fundamental but problematic step can be examined by evaluating each homology hypothesis by the extent to which it is corroborated by the rest of the data. Here we test the effects of varying the stringency for making primary homology statements using a range of similarity (e-value) cutoffs in 166 fully sequenced and annotated genomes spanning the tree of life. By evaluating each resulting data set with tree-based measurements of character consistency and information content, we find a set of homology statements that optimizes overall corroboration. The resulting data set produces well-resolved and well-supported trees of life and greatly ameliorates previously noted inconsistencies such as the misclassification of small genomes. The method presented here, which can be used to test any technique for recognizing primary homology, provides an objective framework for evaluating phylogenetic hypotheses and data sets for the tree of life. It also can serve as a technique for identifying well-corroborated sets of homologous genes for functional genomic applications.
Affiliation(s)
- E Kurt Lienau
- American Museum of Natural History, Molecular Laboratories, Central Park West at 79th Street, (P.J.P.), New York, New York 10024, USA
|
35
|
Abstract
In an era of rapid genome sequencing and high-throughput technology, automatic function prediction for a novel sequence is of utmost importance in bioinformatics. While automatic annotation methods based on local alignment searches can be simple and straightforward, they suffer from several drawbacks, including relatively low sensitivity and the assignment of incorrect annotations that are not associated with the region of similarity. ProtoNet is a hierarchical organization of the protein sequences in the UniProt database. Although the hierarchy is constructed in an unsupervised, automatic manner, it has been shown to be coherent with several biological data sources. We extend the ProtoNet system to assign functional annotations automatically. By leveraging the scaffold of the hierarchical classification, the method is able to overcome some frequent annotation pitfalls.
Affiliation(s)
- Ori Sasson
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
|