1
Hernandez-Guerrero R, Galán-Vásquez E, Pérez-Rueda E. The protein architecture in Bacteria and Archaea identifies a set of promiscuous and ancient domains. PLoS One 2019; 14:e0226604. [PMID: 31856202 PMCID: PMC6922389 DOI: 10.1371/journal.pone.0226604] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/02/2019] [Accepted: 11/29/2019] [Indexed: 11/19/2022]
Abstract
In this work, we describe a systematic comparative genomic analysis of promiscuous domains in genomes of Bacteria and Archaea. A quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was used and applied to 1317 domains in 1320 genomes of Bacteria and Archaea. A functional analysis associated with the WDAS per genome showed that 18 of 50 functional categories were identified as significantly enriched in the promiscuous domains; in particular, small-molecule binding domains, transferase domains, DNA binding domains (transcription factors), and signal transduction domains were identified as promiscuous. In contrast, non-promiscuous domains were identified as associated with 6 of 50 functional categories, and the category Function unknown was enriched. In addition, the WDASs of 52 domains correlated with genome size, i.e., WDAS values decreased as the genome size increased, suggesting that the number of combinations at larger domains increases, including domains in the superfamilies Winged helix-turn-helix and P-loop-containing nucleoside triphosphate hydrolases. Finally, based on classification of the domains according to their ancestry, we determined that the set of 52 promiscuous domains is also ancient and abundant among all the genomes, in contrast to the non-promiscuous domains. In summary, we consider that the association between these two classes of protein domains (promiscuous and non-promiscuous) provides bacterial and archaeal cells with the ability to respond to diverse environmental challenges.
Affiliation(s)
- Rafael Hernandez-Guerrero
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
- Edgardo Galán-Vásquez
- Departamento de Ingeniería de Sistemas Computacionales y Automatización, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Ciudad Universitaria, Universidad Nacional Autónoma de México, Ciudad de México, México
- Ernesto Pérez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica Yucatán, Mérida, Yucatán, México
- Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago, Chile
2
Abstract
The Rossmann fold is one of the most commonly observed structural domains in proteins. The fold is composed of consecutive alternating β-strands and α-helices that form a layer of β-sheet with one (or two) layer(s) of α-helices. Here, we will discuss the Rossmann fold starting from its discovery 55 years ago, then overview entries of the fold in the major protein classification databases, SCOP and CATH, as well as the number of the occurrences of the fold in genomes. We also discuss the Rossmann fold as an interesting target of protein engineering as the site-directed mutagenesis of the fold can alter the ligand-binding specificity of the structure.
Affiliation(s)
- Woong-Hee Shin
- Department of Biological Science, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
- Department of Biological Science, Purdue University, West Lafayette, IN, USA.
- Department of Computer Science, Purdue University, West Lafayette, IN, USA.
3
Kauko A, Lehto K. Eukaryote specific folds: Part of the whole. Proteins 2018; 86:868-881. [PMID: 29675831 DOI: 10.1002/prot.25517] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 07/30/2017] [Revised: 04/17/2018] [Accepted: 04/18/2018] [Indexed: 01/07/2023]
Abstract
The origin of eukaryotes is one of the central transitions in the history of life; without eukaryotes there would be no complex multicellular life. The most accepted scenarios suggest the endosymbiosis of a mitochondrial ancestor with a complex archaeon, even though the details regarding the host and the triggering factors are still being discussed. Accordingly, phylogenetic analyses have demonstrated archaeal affiliations with key informational systems, while metabolic genes are often related to bacteria, mostly to the mitochondrial ancestor. Despite this, there exists a large number of protein families and folds found only in eukaryotes. In this study, we have analyzed structural superfamilies and folds that probably appeared during eukaryogenesis. These folds typically represent relatively small binding domains of larger multidomain proteins. They are commonly involved in biological processes that are particularly complex in eukaryotes, such as signaling, trafficking/cytoskeleton, ubiquitination, transcription and RNA processing, but according to recent studies, these processes also have prokaryotic roots. Thus the folds originating from a eukaryotic stem seem to represent accessory parts that have contributed to the expansion of several prokaryotic processes to a new level of complexity. This might have taken place as a co-evolutionary process where increasing complexity and fold innovations have supported each other.
Affiliation(s)
- Anni Kauko
- Department of Biochemistry, University of Turku, Turku, Finland
- Kirsi Lehto
- Department of Biochemistry, University of Turku, Turku, Finland
4
Li H. Structural Principles of CRISPR RNA Processing. Structure 2014; 23:13-20. [PMID: 25435327 DOI: 10.1016/j.str.2014.10.006] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 08/15/2014] [Revised: 10/02/2014] [Accepted: 10/07/2014] [Indexed: 10/24/2022]
Abstract
The Cas6 superfamily, the Cas5d subclass, and the host RNase III endoribonucleases are responsible for producing small RNAs (crRNA) that function in the CRISPR-Cas immunity. The three enzymes may also interact with the crRNA-associated nucleic acid interference complexes. Recent development in structural biology of Cas6 and Cas5d and their complexes with RNA substrates has lent new insights on principles of crRNA processing and the structural basis for linking crRNA processing to interference. Both Cas6 and Cas5d are characterized by the presence of the ferredoxin-like fold, but each has unique domain arrangement and insertion elements. Cas6 proteins often interact strongly with stable RNA stem-loop structures but can also fold unstructured RNA into stem-loop structures for their cleavage. The extraordinarily simple fold, the wide range of substrates, and kinetic properties of Cas6/Cas5d make them excellent candidates for exploring molecular evolution, protein-RNA interaction, and biotechnology applications.
Affiliation(s)
- Hong Li
- Institute of Molecular Biophysics, Florida State University, Tallahassee, FL 32306, USA; Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL 32306, USA.
5
Abstract
As more and more systems biology approaches are used to investigate the different types of biological macromolecules, increasing numbers of whole genomic studies are now available for a large array of organisms. Whether it is genomics, transcriptomics, proteomics, interactomics or metabolomics, the full complement of genomic information on all different levels can be juxtaposed between different organisms to reveal similarities or differences, and even to provide consensus models. At the intersection of comparative genomics and systems biology lies great possibility for discovery, analysis and prediction. This paper explores this nexus and the relationship from four general levels: DNA, RNA, protein and extragenomic. For each level, we provide an overview of the methods, discuss the potential challenges and survey the current research. Finally, we suggest some organizing principles and make proposals for new areas that will be important for future research.
Affiliation(s)
- Jimmy Lin
- Wilmer Institute, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
6
Abstract
ORFan genes can constitute a large fraction of a bacterial genome, but due to their lack of homologs, their functions have remained largely unexplored. To determine if particular features of ORFan-encoded proteins promote their presence in a genome, we analyzed properties of ORFans that originated over a broad evolutionary timescale. We also compared ORFan genes to another class of acquired genes, heterogeneous occurrence in prokaryotes (HOPs), which have homologs in other bacteria. A total of 54 ORFan and HOP genes selected from different phylogenetic depths in the Escherichia coli lineage were cloned, expressed, purified, and subjected to circular dichroism (CD) spectroscopy. A majority of genes could be expressed, but only 18 yielded sufficient soluble protein for spectral analysis. Of these, half were significantly alpha-helical, three were predominantly beta-sheet, and six were of intermediate/indeterminate structure. Although a higher proportion of HOPs yielded soluble proteins with resolvable secondary structures, ORFans resembled HOPs with regard to most of the other features tested. Overall, we found that those ORFan and HOP genes that have persisted in the E. coli lineage were more likely to encode soluble and folded proteins, more likely to display environmental modulation of their gene expression, and by extrapolation, are more likely to be functional.
Affiliation(s)
- Hema Prasad Narra
- Department of Biochemistry & Molecular Biophysics, University of Arizona, Tucson, AZ 85721, USA
7
Characterization of YvcJ, a conserved P-loop-containing protein, and its implication in competence in Bacillus subtilis. J Bacteriol 2008; 191:1556-64. [PMID: 19074378 DOI: 10.1128/jb.01493-08] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 01/17/2023]
Abstract
The uncharacterized protein family UPF0042 of the Swiss-Prot database is predicted to be a member of the conserved group of bacterium-specific P-loop-containing proteins. Here we show that two of its members, YvcJ from Bacillus subtilis and YhbJ, its homologue from Escherichia coli, indeed bind and hydrolyze nucleotides. The cellular function of yvcJ was then addressed. In contrast to results recently obtained for E. coli, which indicated that yhbJ mutants strongly overproduced glucosamine-6-phosphate synthase (GlmS), comparison of the wild type with the yvcJ mutant of B. subtilis showed that GlmS expression was quite similar in the two strains. However, in mutants defective in yvcJ, the transformation efficiency and the fraction of cells that expressed competence were reduced. Furthermore, our data show that YvcJ positively controls the expression of late competence genes. The overexpression of comK or comS compensates for the decrease in competence of the yvcJ mutant. Our results show that even if YvcJ and YhbJ belong to the same family of P-loop-containing proteins, the deletion of corresponding genes has different consequences in B. subtilis and in E. coli.
8
Frenkel ZM. Does Protein Relatedness Require Sequence Matching? Alignment via Networks in Sequence Space. J Biomol Struct Dyn 2008; 26:215-22. [DOI: 10.1080/07391102.2008.10507237] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
9
Hartling J, Kim J. Mutational robustness and geometrical form in protein structures. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 2008; 310:216-26. [PMID: 17973270 DOI: 10.1002/jez.b.21203] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
Abstract
Theoretical studies of RNA and lattice protein models suggest that mutationally robust or the so-called designable phenotypes tend to have special geometric features such as being more compact and more geometrically regular. Such geometrical forms have also been linked to speed of folding and stability properties that may also assist in promoting mutational robustness. Here we test these theoretical predictions on a non-redundant collection of 2,660 experimentally determined structures from the PDB (Protein Data Bank) and CATH (Class Architecture Topology Homologous superfamily) database. We first developed an index summarizing the geometrical regularity of the structures and then used this index to show that the statistical pattern of empirical data is consistent with the theoretical predictions relating geometry to mutational robustness. Mutationally robust proteins tend to be more symmetric and compact. But, the relationship between compactness and robustness cannot be explained simply by the geometrical packing of individual amino acids in proteins; rather, it is the property of the whole system that is related to the statistical characteristics of the folding landscape. Finally, we hypothesize that a triplet relationship between mutational robustness, stability and form is a general property of objects that optimize real-valued relationships between sequences and discrete structures.
Affiliation(s)
- Julia Hartling
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, USA
10
Linial M. Fishing with (Proto)Net-a principled approach to protein target selection. Comp Funct Genomics 2008; 4:542-8. [PMID: 18629007 PMCID: PMC2447289 DOI: 10.1002/cfg.328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2003] [Revised: 08/05/2003] [Accepted: 08/05/2003] [Indexed: 12/02/2022]
Abstract
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is by rationally selecting proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we yield a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational–statistical method to determine novel superfamilies for structural genomics projects is also discussed.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
11
Frenkel ZM, Trifonov EN. Evolutionary Networks in the Formatted Protein Sequence Space. J Comput Biol 2007; 14:1044-57. [DOI: 10.1089/cmb.2007.0066] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Affiliation(s)
- Zakharia M. Frenkel
- Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel
- Edward N. Trifonov
- Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel
12
Sprinzak E, Altuvia Y, Margalit H. Characterization and prediction of protein-protein interactions within and between complexes. Proc Natl Acad Sci U S A 2006; 103:14718-23. [PMID: 17003128 PMCID: PMC1595418 DOI: 10.1073/pnas.0603352103] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022]
Abstract
Databases of experimentally determined protein interactions provide information on binary interactions and on involvement in multiprotein complexes. These data are valuable for understanding the general properties of the interaction between proteins as well as for the development of prediction schemes for unknown interactions. Here we analyze experimentally determined protein interactions by measuring various sequence, genomic, transcriptomic, and proteomic attributes of each interacting pair in the yeast Saccharomyces cerevisiae. We find that dividing the data into two groups, one that includes binary interactions within protein complexes (stable) and another that includes binary interactions that are not within complexes (transient), enables better characterization of the interactions by the different attributes and improves the prediction of new interactions. This analysis revealed that most attributes were more indicative in the set of intracomplex interactions. Using this data set for training, we integrated the different attributes by logistic regression and developed a predictive scheme that distinguishes between interacting and noninteracting protein pairs. Analysis of the logistic-regression model showed that one of the strongest contributors to the discrimination between interacting and noninteracting pairs is the presence of distinct pairs of domain signatures that were suggested previously to characterize interacting proteins. The predictive algorithm succeeds in identifying both intracomplex and other interactions (possibly the more stable ones), and its correct identification rate is 2-fold higher than that of large-scale yeast two-hybrid experiments.
Affiliation(s)
- Einat Sprinzak
- Department of Molecular Genetics and Biotechnology, Faculty of Medicine, Hebrew University, Jerusalem 91120, Israel
- Yael Altuvia
- Department of Molecular Genetics and Biotechnology, Faculty of Medicine, Hebrew University, Jerusalem 91120, Israel
- Hanah Margalit
- Department of Molecular Genetics and Biotechnology, Faculty of Medicine, Hebrew University, Jerusalem 91120, Israel
13
Wong P, Frishman D. Fold designability, distribution, and disease. PLoS Comput Biol 2006; 2:e40. [PMID: 16680196 PMCID: PMC1456317 DOI: 10.1371/journal.pcbi.0020040] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 11/02/2005] [Accepted: 03/17/2006] [Indexed: 12/04/2022]
Abstract
Fold designability has been estimated by the number of families contained in that fold. Here, we show that among orthologous proteins, sequence divergence is higher for folds with greater numbers of families. Folds with greater numbers of families also tend to have families that appear more often in the proteome and greater promiscuity (the number of unique “partner” folds that the fold is found with within the same protein). We also find that many disease-related proteins have folds with relatively few families. In particular, a number of these proteins are associated with diseases occurring at high frequency. These results suggest that family counts reflect how certain structures are distributed in nature and are an important characteristic associated with many human diseases. Most proteins are composed of structural domains that can be classified into “folds.” Domains with the same fold type share overall structural similarity. The number of amino acid sequences that encode a fold is termed the “designability” of the fold. Folds that have higher designability are thought to be more robust to stresses and mutations. Such features may also allow the fold to appear in a greater variety of contexts. Here, the authors show that proteins with folds estimated to be of higher designability are more widespread amongst proteins in human, mouse, and yeast, consistent with this hypothesis. The authors also find that many hereditary disease-associated proteins have folds estimated to be of low designability. A number of these diseases occur at a relatively high frequency. These results suggest that the estimate of designability employed reflects how certain structures are distributed in nature and is an important characteristic associated with many human diseases.
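The promiscuity measure defined in this abstract (the number of unique partner folds a fold co-occurs with inside the same protein) can be illustrated with a minimal Python sketch. The input format, the function name `fold_promiscuity`, and the example fold assignments are assumptions made for illustration; they are not taken from the paper and do not reproduce its pipeline.

```python
from collections import defaultdict

def fold_promiscuity(proteins):
    """Count, for each fold, the distinct partner folds seen in the same protein.

    `proteins` maps a protein identifier to the list of fold labels assigned
    to its domains (a hypothetical input format chosen for this sketch).
    """
    partners = defaultdict(set)
    for folds in proteins.values():
        unique = set(folds)
        for fold in unique:
            # Every other fold in the same protein counts as a partner;
            # repeated copies of the same fold do not.
            partners[fold].update(unique - {fold})
    return {fold: len(found) for fold, found in partners.items()}

# Toy data, invented for illustration only.
example = {
    "protA": ["Rossmann", "ferredoxin-like"],
    "protB": ["Rossmann", "TIM barrel", "Rossmann"],
    "protC": ["TIM barrel"],
}
promiscuity = fold_promiscuity(example)
print(promiscuity)
```

In this toy input the Rossmann fold has two partners, while the other two folds have one each; a fold that only ever appears alone would score zero.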
Affiliation(s)
- Philip Wong
- Institute for Bioinformatics, GSF–National Research Center for Environment and Health, Neuherberg, Germany
- Dmitrij Frishman
- Institute for Bioinformatics, GSF–National Research Center for Environment and Health, Neuherberg, Germany
- Department of Genome-Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, Freising, Germany
14
Abstract
We review fold usage on completed genomes to explore protein structure evolution. The patterns of presence or absence of folds on genomes give us insights into the relationships between folds, the age of different folds and how we have arrived at the set of folds we see today. We examine the relationships between different measures which describe protein fold usage, such as the number of copies of a fold per genome, the number of families per fold, and the number of genomes a fold occurs on. We obtained these measures of fold usage by searching for the structural domains on 157 completed genome sequences from all three kingdoms of life. In our comparisons of these measures we found that bacteria have relatively more distinct folds on their genomes than archaea. Eukaryotes were found to have many more copies of a fold on their genomes. If we separate out the different fold classes, the alpha/beta class has relatively fewer distinct folds on large genomes, more copies of a fold on bacteria and more folds occurring in all three kingdoms simultaneously. These results possibly indicate that most alpha/beta folds originated earlier than other folds. The expected power law distribution is observed for copies of a fold per genome and we found a similar distribution for the number of families per fold. However, a more complicated distribution appears for fold occurrence across genomes, which strongly depends on fold class and kingdom. We also show that there is not a clear relationship between the three measures of fold usage. A fold which occurs on many genomes does not necessarily have many copies on each genome. Similarly, folds with many copies do not necessarily have many families or vice versa.
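The three fold-usage measures named in this abstract (copies of a fold per genome, families per fold, and genomes a fold occurs on) can be tallied from a table of domain assignments. The following Python sketch assumes a hypothetical input of `(genome, fold, family)` tuples, one per domain hit; the function name `fold_usage` and the toy data are illustrative only and are not the authors' method.

```python
from collections import Counter, defaultdict

def fold_usage(hits):
    """Tally three fold-usage measures from (genome, fold, family) tuples."""
    copies = Counter()            # (genome, fold) -> number of domain copies
    families = defaultdict(set)   # fold -> distinct families observed
    genomes = defaultdict(set)    # fold -> distinct genomes observed
    for genome, fold, family in hits:
        copies[(genome, fold)] += 1
        families[fold].add(family)
        genomes[fold].add(genome)
    families_per_fold = {fold: len(s) for fold, s in families.items()}
    genomes_per_fold = {fold: len(s) for fold, s in genomes.items()}
    return copies, families_per_fold, genomes_per_fold

# Toy assignments, invented for illustration only.
hits = [
    ("g1", "alpha/beta", "fam1"),
    ("g1", "alpha/beta", "fam2"),
    ("g2", "alpha/beta", "fam1"),
    ("g1", "all-alpha", "fam3"),
]
copies, families_per_fold, genomes_per_fold = fold_usage(hits)
print(copies[("g1", "alpha/beta")], families_per_fold, genomes_per_fold)
```

Keeping the three tallies separate mirrors the abstract's point that the measures need not agree: a fold can occur on many genomes without having many copies on any one of them.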
Affiliation(s)
- Sanne Abeln
- Department of Statistics, University of Oxford, United Kingdom
15
Lee D, Grant A, Marsden RL, Orengo C. Identification and distribution of protein families in 120 completed genomes using Gene3D. Proteins 2006; 59:603-15. [PMID: 15768405 DOI: 10.1002/prot.20409] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structure genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.
Affiliation(s)
- David Lee
- Biomolecular Structure and Modelling Group, Department of Biochemistry, University College London, Gower Street, London.
16
Doolittle RF. Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol 2005; 15:248-53. [PMID: 15963888 DOI: 10.1016/j.sbi.2005.04.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Received: 02/08/2005] [Revised: 02/08/2005] [Accepted: 04/12/2005] [Indexed: 11/28/2022]
Abstract
A decade of access to whole-genome sequences has been increasingly revealing about the informational network relating all living organisms. Although at one point there was concern that extensive horizontal gene transfer might hopelessly muddle phylogenies, it has not proved a severe hindrance. The melding of sequence and structural information is being used to great advantage, and the prospect exists that some of the earliest aspects of life on Earth can be reconstructed, including the invention of biosynthetic and metabolic pathways. Still, some fundamental phylogenetic problems remain, including determining the root--if there is one--of the historical relationship between Archaea, Bacteria and Eukarya.
Affiliation(s)
- Russell F Doolittle
- Department of Chemistry & Biochemistry, University of California San Diego, La Jolla, CA 92093-0314, USA.
17
Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of Structural Genomics Initiatives: An Analysis of Solved Target Structures. J Mol Biol 2005; 348:1235-60. [PMID: 15854658 DOI: 10.1016/j.jmb.2005.03.037] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Received: 12/23/2004] [Revised: 02/28/2005] [Accepted: 03/15/2005] [Indexed: 11/27/2022]
Abstract
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes.
From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.
Affiliation(s)
- Annabel E Todd
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK.
18
Ranea JAG, Grant A, Thornton JM, Orengo CA. Microeconomic principles explain an optimal genome size in bacteria. Trends Genet 2005; 21:21-5. [PMID: 15680509 DOI: 10.1016/j.tig.2004.11.014] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Indexed: 12/17/2022]
Abstract
Bacteria can clearly enhance their survival by expanding their genetic repertoire. However, the tight packing of the bacterial genome and the fact that the most evolved species do not necessarily have the biggest genomes suggest there are other evolutionary factors limiting their genome expansion. To clarify these restrictions on size, we studied those protein families contributing most significantly to bacterial-genome complexity. We found that all bacteria apply the same basic and ancestral 'molecular technology' to optimize their reproductive efficiency. The same microeconomics principles that define the optimum size in a factory can also explain the existence of a statistical optimum in bacterial genome size. This optimum is reached when the bacterial genome obtains the maximum metabolic complexity (revenue) for minimal regulatory genes (logistic cost).
Affiliation(s)
- Juan A G Ranea
- Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, London, UK.
|
19
|
Caetano-Anollés G, Caetano-Anollés D. Universal Sharing Patterns in Proteomes and Evolution of Protein Fold Architecture and Life. J Mol Evol 2005; 60:484-98. [PMID: 15883883 DOI: 10.1007/s00239-004-0221-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Received: 06/16/2004] [Accepted: 10/11/2004] [Indexed: 11/30/2022]
Abstract
Protein evolution is imprinted in both the sequence and the structure of evolutionary building blocks known as protein domains. These domains share a common ancestry and can be unified into a comparatively small set of folding architectures, the protein folds. We have traced the distribution of protein folds between and within proteomes belonging to Eukarya, Archaea, and Bacteria along the branches of a universal phylogeny of protein architecture. This tree was reconstructed from global fold-usage statistics derived from a structural census of proteomes. We found that folds shared by the three organismal domains were placed almost exclusively at the base of the rooted tree and that there were marked heterogeneities in fold distribution and clear evolutionary patterns related to protein architecture and organismal diversification. These include a relative timing for the emergence of prokaryotes, congruent episodes of architectural loss and diversification in Archaea and Bacteria, and a late and quite massive rise of architectural novelties in Eukarya perhaps linked to multicellularity.
Affiliation(s)
- Gustavo Caetano-Anollés
- Department of Crop Sciences, University of Illinois, 332 NSRC, 1101 West Peabody Drive, Urbana, IL, 61801, USA.
|
20
|
Pagel P, Wong P, Frishman D. A Domain Interaction Map Based on Phylogenetic Profiling. J Mol Biol 2004; 344:1331-46. [PMID: 15561146 DOI: 10.1016/j.jmb.2004.10.019] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.4] [Revised: 07/20/2004] [Accepted: 10/12/2004] [Indexed: 11/17/2022]
Abstract
Phylogenetic profiling is a well established method for predicting functional relations and physical interactions between proteins. We present a new method for finding such relations based on phylogenetic profiling of conserved domains rather than proteins, avoiding computationally expensive all versus all sequence comparisons among genomes. The resulting domain interaction map (DIMA) can be explored directly or mapped to a genome of interest. We demonstrate that the performance of DIMA is comparable to that of classical phylogenetic profiling and its predictions often yield information that cannot be detected by profiling of entire protein chains. We provide a list of novel domain associations predicted by our method.
Affiliation(s)
- Philipp Pagel
- Institute for Bioinformatics, GSF-National Research Center for Environment and Health, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
|
21
|
Kihara D, Skolnick J. Microbial genomes have over 72% structure assignment by the threading algorithm PROSPECTOR_Q. Proteins 2004; 55:464-73. [PMID: 15048836 DOI: 10.1002/prot.20044] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.5] [Indexed: 12/26/2022]
Abstract
The genome scale threading of five complete microbial genomes is revisited using our state-of-the-art threading algorithm, PROSPECTOR_Q. Considering that structure assignment to an ORF could be useful for predicting biochemical function as well as for analyzing pathways, it is important to assess the current status of genome scale threading. The fraction of ORFs in each genome to which we could assign protein structures with a reasonably good confidence level is over 72%, which is significantly higher than in earlier studies. Using the assigned structures, we have predicted the function of several ORFs through "single-function" template structures, obtained from an analysis of the relationship between protein fold and function. The fold distribution of the genomes and the effect of the number of homologous sequences on structure assignment are also discussed.
Affiliation(s)
- Daisuke Kihara
- UB Center of Excellence in Bioinformatics, University at Buffalo, Buffalo, New York 14215, USA
|
22
|
Ranea JAG, Buchan DWA, Thornton JM, Orengo CA. Evolution of protein superfamilies and bacterial genome size. J Mol Biol 2004; 336:871-87. [PMID: 15095866 DOI: 10.1016/j.jmb.2003.12.044] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.4] [Received: 09/24/2003] [Revised: 12/11/2003] [Accepted: 12/12/2003] [Indexed: 10/26/2022]
Abstract
We present the structural annotation of 56 different bacterial species based on the assignment of genes to 816 evolutionary superfamilies in the CATH domain structure database. These assignments have enabled us to analyse the recurrence of specific superfamilies within and across the genomes. We have selected the superfamilies that have a very broad representation and therefore appear to be universally distributed in a significant number of bacterial lineages. Occurrence profiles of these universally distributed superfamilies are compared with genome size in order to estimate the correlation between superfamily duplication and the increase in proteome size. This distinguishes between those size-dependent superfamilies where frequency of occurrence is highly correlated with increase in genome size, and size-independent superfamilies where no correlation is observed. Consideration of the size correlation and the ratio between the mean and the standard deviations for all the superfamily profiles allows more detailed subdivisions and classification of superfamilies. For example, within the size-independent superfamilies, we distinguished a group that are distributed evenly amongst all the genomes. Within the size-dependent superfamilies we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COG database was performed for all superfamilies in each of these groups, and this revealed significant differences amongst the three sets of superfamilies. Evenly distributed, size-independent domains are shown to be involved primarily in protein translation and biosynthesis. For the size-dependent superfamilies, linearly distributed superfamilies are involved mainly in metabolism, and non-linearly distributed superfamily domains are involved principally in gene regulation.
Affiliation(s)
- Juan A G Ranea
- Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK.
|
23
|
Zhang Y, Skolnick J. Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci U S A 2004; 101:7594-9. [PMID: 15126668 PMCID: PMC419651 DOI: 10.1073/pnas.0305695101] [Citation(s) in RCA: 245] [Impact Index Per Article: 12.3] [Indexed: 11/18/2022] Open
Abstract
We have developed TASSER, a hierarchical approach to protein structure prediction that consists of template identification by threading, followed by tertiary structure assembly via the rearrangement of continuous template fragments guided by an optimized Cα and side-chain-based potential driven by threading-based, predicted tertiary restraints. TASSER was applied to a comprehensive benchmark set of 1,489 medium-sized proteins in the Protein Data Bank. With homologues excluded, in 927 cases, the templates identified by our threading algorithm PROSPECTOR_3 have a rms deviation from native <6.5 Å with approximately 80% alignment coverage. After template reassembly, this number increases to 1,172. This shows significant and systematic improvement of the final models with respect to the initial template alignments. Furthermore, significant improvements in loop modeling are demonstrated. We then apply TASSER to the 1,360 medium-sized ORFs in the Escherichia coli genome; approximately 920 can be predicted with high accuracy based on confidence criteria established in the Protein Data Bank benchmark. These results from our unprecedented comprehensive folding benchmark on all protein categories provide a reliable basis for the application of TASSER to structural genomics, especially to proteins of low sequence identity to solved protein structures.
Affiliation(s)
- Yang Zhang
- Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington Street, Buffalo, NY 14203, USA
|
24
|
Cherkasov A, Jones SJM. Structural characterization of genomes by large scale sequence-structure threading. BMC Bioinformatics 2004; 5:37. [PMID: 15061866 PMCID: PMC419331 DOI: 10.1186/1471-2105-5-37] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/09/2003] [Accepted: 04/03/2004] [Indexed: 12/02/2022] Open
Abstract
Background: Using sequence-structure threading we have conducted structural characterization of complete proteomes of 37 archaeal, bacterial and eukaryotic organisms (including worm, fly, mouse and human) totaling 167,888 genes. Results: The reported data represent the first rather general evaluation of the performance of full sequence-structure threading on multiple genomes, providing an opportunity to evaluate its general applicability for large scale studies. According to the estimated results, sequence-structure threading has assigned protein folds to more than 60% of eukaryotic, 68% of archaeal and 70% of bacterial proteomes. The repertoires of protein classes, architectures, topologies and homologous superfamilies (according to the CATH 2.4 classification) have been established for distant organisms and superkingdoms. It has been found that the average abundance of CATH classes decreases from "alpha and beta" to "mainly beta", followed by "mainly alpha" and "few secondary structures". 3-Layer (aba) Sandwich has been characterized as the most abundant protein architecture and the Rossmann fold as the most common topology. Conclusion: The analysis of genomic occurrences of CATH 2.4 protein homologous superfamilies and topologies has revealed the power-law character of their distributions. The corresponding double logarithmic "frequency – genomic occurrence" dependences characteristic of scale-free systems have been established for individual organisms and for three superkingdoms. Supplementary materials to this work are available at [1].
Affiliation(s)
- Artem Cherkasov
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
- Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Steven JM Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
|
25
|
Abstract
Chalcone isomerase, an enzyme in the isoflavonoid pathway in plants, catalyzes the cyclization of chalcone into (2S)-naringenin. Chalcone isomerase sequence family and three-dimensional fold appeared to be unique to plants and has been proposed as a plant-specific gene marker. Using sensitive methods of sequence comparison and fold recognition, we have identified genes homologous to chalcone isomerase in all completely sequenced fungi, in slime molds, and in many gammaproteobacteria. The residues directly involved in the enzyme's catalytic function are among the best conserved across species, indicating that the newly discovered homologs are enzymatically active. At the same time, fungal and bacterial species that have chalcone isomerase-like genes tend to lack the orthologs of the upstream enzyme chalcone synthase, suggesting a novel variation of the pathway in these species.
Affiliation(s)
- Michael Gensheimer
- Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA.
|
26
|
Caetano-Anollés G, Caetano-Anollés D. An evolutionarily structured universe of protein architecture. Genome Res 2003; 13:1563-71. [PMID: 12840035 PMCID: PMC403752 DOI: 10.1101/gr.1161903] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.4] [Received: 01/07/2003] [Accepted: 04/17/2003] [Indexed: 11/25/2022]
Abstract
Protein structural diversity encompasses a finite set of architectural designs. Embedded in these topologies are evolutionary histories that we here uncover using cladistic principles and measurements of protein-fold usage and sharing. The reconstructed phylogenies are inherently rooted and depict histories of protein and proteome diversification. Proteome phylogenies showed two monophyletic sister-groups delimiting Bacteria and Archaea, and a topology rooted in Eucarya. This suggests three dramatic evolutionary events and a common ancestor with a eukaryotic-like, gene-rich, and relatively modern organization. Conversely, a general phylogeny of protein architectures showed that structural classes of globular proteins appeared early in evolution and in defined order, the alpha/beta class being the first. Although most ancestral folds shared a common architecture of barrels or interleaved beta-sheets and alpha-helices, many were clearly derived, such as polyhedral folds in the all-alpha class and beta-sandwiches, beta-propellers, and beta-prisms in all-beta proteins. We also describe transformation pathways of architectures that are prevalently used in nature. For example, beta-barrels with increased curl and stagger were favored evolutionary outcomes in the all-beta class. Interestingly, we found cases where structural change followed the alpha-to-beta tendency uncovered in the tree of architectures. Lastly, we traced the total number of enzymatic functions associated with folds in the trees and show that there is a general link between structure and enzymatic function.
|
27
|
Anantharaman V, Aravind L, Koonin EV. Emergence of diverse biochemical activities in evolutionarily conserved structural scaffolds of proteins. Curr Opin Chem Biol 2003; 7:12-20. [PMID: 12547421 DOI: 10.1016/s1367-5931(02)00018-2] [Citation(s) in RCA: 119] [Impact Index Per Article: 5.7] [Indexed: 11/28/2022]
Abstract
Comparative analysis of numerous protein structures that have become available in the past few years, combined with genome comparison, has yielded new insights into the evolution of enzymes and their functions. In addition to the well-known diversification of substrate specificities, enzymes with several widespread catalytic folds, particularly the TIM barrel, the RRM-like domain and the double-stranded beta-helix (cupin) domain, have been extensively explored in 'reaction space', resulting in the evolution of numerous, diverse catalytic activities supported by the same structural scaffold. Common protein folds differ widely in the diversity of catalyzed reactions. The biochemical plasticity of a fold seems to hinge on the presence of a generic, symmetrical substrate-binding pocket as opposed to highly specialized binding sites.
Affiliation(s)
- Vivek Anantharaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
28
|
Lin J, Qian J, Greenbaum D, Bertone P, Das R, Echols N, Senes A, Stenger B, Gerstein M. GeneCensus: genome comparisons in terms of metabolic pathway activity and protein family sharing. Nucleic Acids Res 2002; 30:4574-82. [PMID: 12384605 PMCID: PMC137121 DOI: 10.1093/nar/gkf555] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 02/06/2002] [Revised: 08/08/2002] [Accepted: 08/08/2002] [Indexed: 11/15/2022] Open
Abstract
We present a prototype of a new database tool, GeneCensus, which focuses on comparing genomes globally, in terms of the collective properties of many genes, rather than in terms of the attributes of a single gene (e.g. sequence similarity for a particular ortholog). The comparisons are presented in a visual fashion over the web at GeneCensus.org. The system concentrates on two types of comparisons: (i) trees based on the sharing of generalized protein families between genomes, and (ii) whole pathway analysis in terms of activity levels. For the trees, we have developed a module (TreeViewer) that clusters genomes in terms of the folds, superfamilies or orthologs (all can be considered as generalized 'families' or 'protein parts') they share, and compares the resulting trees side-by-side with those built from sequence similarity of individual genes (e.g. a traditional tree built on ribosomal similarity). We also include comparisons to trees built on whole-genome dinucleotide or codon composition. For pathway comparisons, we have implemented a module (PathwayPainter) that graphically depicts, in selected metabolic pathways, the fluxes or expression levels of the associated enzymes (i.e. generalized 'activities'). One can, consequently, compare organisms (and organism states) in terms of representations of these systemic quantities. Development of this module involved compiling, calculating and standardizing flux and expression information from many different sources. We illustrate pathway analysis for enzymes involved in central metabolism. We are able to show that, to some degree, flux and expression fluctuations have characteristic values in different sections of the central metabolism and that control points in this system (e.g. hexokinase, pyruvate kinase, phosphofructokinase, isocitrate dehydrogenase and citric synthase) tend to be especially variable in flux and expression.
Both the TreeViewer and PathwayPainter modules connect to other information sources related to individual-gene or organism properties (e.g. a single-gene structural annotation viewer).
Affiliation(s)
- J Lin
- Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA
|
29
|
Harrison PM, Gerstein M. Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol 2002; 318:1155-74. [PMID: 12083509 DOI: 10.1016/s0022-2836(02)00109-2] [Citation(s) in RCA: 120] [Impact Index Per Article: 5.5] [Indexed: 11/18/2022]
Abstract
Protein families can be used to understand many aspects of genomes, both their "live" and their "dead" parts (i.e. genes and pseudogenes). Surveys of genomes have revealed that, in every organism, there are always a few large families and many small ones, with the overall distribution following a power-law. This commonality is equally true for both genes and pseudogenes, and exists despite the fact that the specific families that are enlarged differ greatly between organisms. Furthermore, because of family structure there is great redundancy in proteomes, a fact linked to the large number of dispensable genes for each organism and the small size of the minimal, indispensable sub-proteome. Pseudogenes in prokaryotes represent families that are in the process of being dispensed with. In particular, the genome sequences of certain pathogenic bacteria (Mycobacterium leprae, Yersinia pestis and Rickettsia prowazekii) show how an organism can undergo reductive evolution on a large scale (i.e. the dying out of families) as a result of niche change. There appears to be less pressure to delete pseudogenes in eukaryotes. These can be divided into two varieties, duplicated and processed, where the latter involves reverse transcription from an mRNA intermediate. We discuss these collectively in yeast, worm, fly, and human. The fly has few pseudogenes apparently because of its high rate of genomic DNA deletion. In the other three organisms, the distribution of pseudogenes on the chromosome and amongst different families is highly non-uniform. Pseudogenes tend not to occur in the middle of chromosome arms, and tend to be associated with lineage-specific (as opposed to highly conserved) families that have environmental-response functions. This may be because, rather than being dead, they may form a reservoir of diverse "extra parts" that can be resurrected to help an organism adapt to its surroundings. 
In yeast, there may be a novel mechanism involving the [PSI+] prion that potentially enables this resurrection. In worm, the pseudogenes tend to arise out of families (e.g. chemoreceptors) that are greatly expanded in it compared to the fly. The human genome stands out in having many processed pseudogenes. These have a character very different from those of the duplicated variety, to a large extent just representing random insertions. Thus, their occurrence tends to be roughly in proportion to the amount of mRNA for a particular protein and to reflect the extent of the intergenic sequences. Further information about pseudogenes is available at http://genecensus.org/pseudogene
Affiliation(s)
- Paul M Harrison
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520-8114, USA
|