1
Zielezinski A, Dobrychlop W, Karlowski WM. TRGdb: a universal resource for the exploration of taxonomically restricted genes in bacteria. Database (Oxford) 2023; 2023:baad058. [PMID: 37555549 PMCID: PMC10410690 DOI: 10.1093/database/baad058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 06/30/2023] [Accepted: 07/31/2023] [Indexed: 08/10/2023]
Abstract
The TRGdb database is a resource dedicated to taxonomically restricted genes (TRGs) in bacteria. It provides a comprehensive collection of genes that are specific to different genera and species, according to the latest release of bacterial taxonomy. The user interface allows for easy browsing and searching as well as sequence similarity exploration. The website also provides information on each TRG protein sequence, including its level of disorder, complexity and tendency to aggregate. TRGdb is a valuable resource for gaining a deeper understanding of the TRG-associated unique features and characteristics of bacterial organisms. Database URL: www.combio.pl/trgdb
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
- Wojciech Dobrychlop
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
- Wojciech M Karlowski
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
2
Karlowski WM, Varshney D, Zielezinski A. Taxonomically Restricted Genes in Bacillus may Form Clusters of Homologs and Can be Traced to a Large Reservoir of Noncoding Sequences. Genome Biol Evol 2023; 15:7039703. [PMID: 36790099 PMCID: PMC10003748 DOI: 10.1093/gbe/evad023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Revised: 01/09/2023] [Accepted: 02/08/2023] [Indexed: 02/16/2023]
Abstract
Taxonomically restricted genes (TRGs) are unique for a defined group of organisms and may act as potential genetic determinants of lineage-specific, biological properties. Here, we explore the TRGs of highly diverse and economically important Bacillus bacteria by examining commonly used TRG identification parameters and data sources. We show the significant effects of sequence similarity thresholds, composition, and the size of the reference database in the identification process. Subsequently, we applied stringent TRG search parameters and expanded the identification procedure by incorporating an analysis of noncoding and non-syntenic regions of non-Bacillus genomes. A multiplex annotation procedure minimized the number of false-positive TRG predictions and showed nearly one-third of the alleged TRGs could be mapped to genes missed in genome annotations. We traced the putative origin of TRGs by identifying homologous, noncoding genomic regions in non-Bacillus species and detected sequence changes that could transform these regions into protein-coding genes. In addition, our analysis indicated that Bacillus TRGs represent a specific group of genes mostly showing intermediate sequence properties between genes that are conserved across multiple taxa and nonannotated peptides encoded by open reading frames.
Affiliation(s)
- Wojciech M Karlowski
- Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
- Deepti Varshney
- Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
- Andrzej Zielezinski
- Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
3
Abstract
Here we report the discovery of Yaravirus, a lineage of amoebal virus with a puzzling origin and evolution. Yaravirus presents 80-nm-sized particles and a 44,924-bp dsDNA genome encoding 74 predicted proteins. Yaravirus genome annotation showed that none of its genes matched with sequences of known organisms at the nucleotide level; at the amino acid level, six predicted proteins had distant matches in the nr database. Complementary prediction of three-dimensional structures indicated possible function of 17 proteins in total. Furthermore, we were not able to retrieve viral genomes closely related to Yaravirus in 8,535 publicly available metagenomes spanning diverse habitats around the globe. The Yaravirus genome also contained six types of tRNAs that did not match commonly used codons. Proteomics revealed that Yaravirus particles contain 26 viral proteins, one of which potentially represents a divergent major capsid protein (MCP) with a predicted double jelly-roll domain. Structure-guided phylogeny of MCP suggests that Yaravirus groups together with the MCPs of Pleurochrysis endemic viruses. Yaravirus expands our knowledge of the diversity of DNA viruses. The phylogenetic distance between Yaravirus and all other viruses highlights our still preliminary assessment of the genomic diversity of eukaryotic viruses, reinforcing the need for the isolation of new viruses of protists.
4
Disentangling the effects of selection and loss bias on gene dynamics. Proc Natl Acad Sci U S A 2017; 114:E5616-E5624. [PMID: 28652353 DOI: 10.1073/pnas.1704925114] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Indexed: 11/18/2022]
Abstract
We combine mathematical modeling of genome evolution with comparative analysis of prokaryotic genomes to estimate the relative contributions of selection and intrinsic loss bias to the evolution of different functional classes of genes and mobile genetic elements (MGE). An exact solution for the dynamics of gene family size was obtained under a linear duplication-transfer-loss model with selection. With the exception of genes involved in information processing, particularly translation, which are maintained by strong selection, the average selection coefficient for most nonparasitic genes is low albeit positive, compatible with observed positive correlation between genome size and effective population size. Free-living microbes evolve under stronger selection for gene retention than parasites. Different classes of MGE show a broad range of fitness effects, from the nearly neutral transposons to prophages, which are actively eliminated by selection. Genes involved in antiparasite defense, on average, incur a fitness cost to the host that is at least as high as the cost of plasmids. This cost is probably due to the adverse effects of autoimmunity and curtailment of horizontal gene transfer caused by the defense systems and selfish behavior of some of these systems, such as toxin-antitoxin and restriction modification modules. Transposons follow a biphasic dynamics, with bursts of gene proliferation followed by decay in the copy number that is quantitatively captured by the model. The horizontal gene transfer to loss ratio, but not duplication to loss ratio, correlates with genome size, potentially explaining increased abundance of neutral and costly elements in larger genomes.
5
Makarova KS, Wolf YI, Forterre P, Prangishvili D, Krupovic M, Koonin EV. Dark matter in archaeal genomes: a rich source of novel mobile elements, defense systems and secretory complexes. Extremophiles 2014; 18:877-93. [PMID: 25113822 PMCID: PMC4158269 DOI: 10.1007/s00792-014-0672-7] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 04/09/2014] [Accepted: 07/06/2014] [Indexed: 01/29/2023]
Abstract
Microbial genomes encompass a sizable fraction of poorly characterized, narrowly spread, fast-evolving genes. Using sensitive methods for sequence comparison and protein structure prediction, we performed a detailed comparative analysis of clusters of such genes, which we denote "dark matter islands", in archaeal genomes. The dark matter islands comprise up to 20% of archaeal genomes and show remarkable heterogeneity and diversity. Nevertheless, three classes of entities are common in these genomic loci: (a) integrated viral genomes and other mobile elements; (b) defense systems; and (c) secretory and other membrane-associated systems. The dark matter islands in the genomes of thermophiles and mesophiles show similar general trends of gene content, but thermophiles are substantially enriched in predicted membrane proteins whereas mesophiles have a greater proportion of recognizable mobile elements. Based on this analysis, we predict the existence of several novel groups of viruses and mobile elements, previously unnoticed variants of CRISPR-Cas immune systems, and new secretory systems that might be involved in stress response, intermicrobial conflicts and biogenesis of novel, uncharacterized membrane structures.
Affiliation(s)
- Kira S Makarova
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, 20894, USA
6
Divergent evolutionary and expression patterns between lineage specific new duplicate genes and their parental paralogs in Arabidopsis thaliana. PLoS One 2013; 8:e72362. [PMID: 24009676 PMCID: PMC3756979 DOI: 10.1371/journal.pone.0072362] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 04/30/2013] [Accepted: 07/11/2013] [Indexed: 12/14/2022]
Abstract
Gene duplication is an important mechanism for the origination of functional novelties in organisms. We performed a comparative genome analysis to systematically estimate recent lineage specific gene duplication events in Arabidopsis thaliana and further investigate whether and how these new duplicate genes (NDGs) play a functional role in the evolution and adaption of A. thaliana. We accomplished this using syntenic relationship among four closely related species, A. thaliana, A. lyrata, Capsella rubella and Brassica rapa. We identified 100 NDGs, showing clear origination patterns, whose parental genes are located in syntenic regions and/or have clear orthologs in at least one of three outgroup species. All 100 NDGs were transcribed and under functional constraints, while 24% of the NDGs have differential expression patterns compared to their parental genes. We explored the underlying evolutionary forces of these paralogous pairs through conducting neutrality tests with sequence divergence and polymorphism data. Evolution of about 15% of NDGs appeared to be driven by natural selection. Moreover, we found that 3 NDGs not only altered their expression patterns when compared with parental genes, but also evolved under positive selection. We investigated the underlying mechanisms driving the differential expression of NDGs and their parents, and found a number of NDGs had different cis-elements and methylation patterns from their parental genes. Overall, we demonstrated that NDGs acquired divergent cis-elements and methylation patterns and may experience sub-functionalization or neo-functionalization influencing the evolution and adaption of A. thaliana.
7
Hypothetical Proteins Present During Recovery Phase of Radiation Resistant Bacterium Deinococcus radiodurans are Under Purifying Selection. J Mol Evol 2013; 77:31-42. [DOI: 10.1007/s00239-013-9577-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/18/2013] [Accepted: 07/26/2013] [Indexed: 01/15/2023]
8
Re-annotation of protein-coding genes in the genome of Saccharomyces cerevisiae based on support vector machines. PLoS One 2013; 8:e64477. [PMID: 23874379 PMCID: PMC3707884 DOI: 10.1371/journal.pone.0064477] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 01/20/2013] [Accepted: 04/15/2013] [Indexed: 11/19/2022]
Abstract
The annotation of the well-studied organism Saccharomyces cerevisiae has been improving over the past decade, while there are unresolved debates over the number of biologically significant open reading frames (ORFs) in the yeast genome. We revisited the total count of protein-coding genes in the S. cerevisiae S288c genome using a theoretical approach that combines the Support Vector Machine (SVM) method with six widely used measurements of sequence statistical features. The accuracy of our method is over 99.5% in 10-fold cross-validation. Based on the annotation data in the Saccharomyces Genome Database (SGD), we studied the coding capacity of all 1744 ORFs which lack experimental results and suggested that the overall number of chromosomal ORFs encoding proteins in yeast should be 6091 after removing 488 spurious ORFs. The importance of the present work lies in at least two aspects. First, cross-validation and retrospective examination showed the fidelity of our method in recognizing ORFs that likely encode proteins. Second, we have provided a web service that can be accessed at http://cobi.uestc.edu.cn/services/yeast/, which enables the prediction of protein-coding ORFs of the genus Saccharomyces with a high accuracy.
9
Capra JA, Pollard KS, Singh M. Novel genes exhibit distinct patterns of function acquisition and network integration. Genome Biol 2010; 11:R127. [PMID: 21187012 PMCID: PMC3046487 DOI: 10.1186/gb-2010-11-12-r127] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.2] [Received: 07/08/2010] [Revised: 11/18/2010] [Accepted: 12/27/2010] [Indexed: 01/03/2023]
Abstract
BACKGROUND: Genes are created by a variety of evolutionary processes, some of which generate duplicate copies of an entire gene, while others rearrange pre-existing genetic elements or co-opt previously non-coding sequence to create genes with 'novel' sequences. These novel genes are thought to contribute to distinct phenotypes that distinguish organisms. The creation, evolution, and function of duplicated genes are well-studied; however, the genesis and early evolution of novel genes are not well-characterized. We developed a computational approach to investigate these issues by integrating genome-wide comparative phylogenetic analysis with functional and interaction data derived from small-scale and high-throughput experiments. RESULTS: We examine the function and evolution of new genes in the yeast Saccharomyces cerevisiae. We observed significant differences in the functional attributes and interactions of genes created at different times and by different mechanisms. Novel genes are initially less integrated into cellular networks than duplicate genes, but they appear to gain functions and interactions more quickly than duplicates. Recently created duplicated genes show evidence of adapting existing functions to environmental changes, while young novel genes do not exhibit enrichment for any particular functions. Finally, we found a significant preference for genes to interact with other genes of similar age and origin. CONCLUSIONS: Our results suggest a strong relationship between how and when genes are created and the roles they play in the cell. Overall, genes tend to become more integrated into the functional networks of the cell with time, but the dynamics of this process differ significantly between duplicate and novel genes.
Affiliation(s)
- John A Capra
- Gladstone Institutes, University of California, San Francisco, 1650 Owens St, San Francisco, CA 94158, USA.
10
Poptsova MS, Gogarten JP. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology (Reading) 2010; 156:1909-1917. [DOI: 10.1099/mic.0.033811-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Indexed: 11/18/2022]
Abstract
Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
Affiliation(s)
- Maria S. Poptsova
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
- J. Peter Gogarten
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
11
Yooseph S, Li W, Sutton G. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinformatics 2008; 9:182. [PMID: 18402669 PMCID: PMC2362130 DOI: 10.1186/1471-2105-9-182] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Received: 10/15/2007] [Accepted: 04/10/2008] [Indexed: 11/30/2022]
Abstract
Background: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools. Results: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net). Conclusion: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.
Affiliation(s)
- Shibu Yooseph
- J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
13
Yin Y, Fischer D. Identification and investigation of ORFans in the viral world. BMC Genomics 2008; 9:24. [PMID: 18205946 PMCID: PMC2245933 DOI: 10.1186/1471-2164-9-24] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.9] [Received: 03/09/2007] [Accepted: 01/19/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Genome-wide studies have already shed light on the evolution and enormous diversity of the viral world. Nevertheless, one of the unresolved mysteries in comparative genomics today is the abundance of ORFans (ORFs with no detectable sequence similarity to any other ORF in the databases). Recently, studies attempting to understand the origin and functions of bacterial ORFans have been reported. Here we present a first genome-wide identification and analysis of ORFans in the viral world, with a focus on bacteriophages. RESULTS: Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes. Like prokaryotic ORFans, viral ORFans are shorter and have a lower GC content than non-ORFans. Nevertheless, a statistically significant lower GC content is found in only a minority of viruses. By focusing on phages, we find that 38.4% of phage ORFs have no homologs in other phages, and 30.1% have no homologs in either the viral or the prokaryotic world. Phages with different host ranges have different percentages of ORFans, reflecting different sampling status and suggesting various diversities. Similarity searches of the phage ORFeome (ORFans and non-ORFans) against prokaryotic genomes show that almost half of the phage ORFs have prokaryotic homologs, suggesting the major role that horizontal transfer plays in bacterial evolution. Surprisingly, the percentage of phage ORFans with prokaryotic homologs is only 18.7%. This suggests that phage ORFans play a lesser role in horizontal transfer to prokaryotes, but may be among the major players contributing to the vast phage diversity. CONCLUSION: Although the current sampling of viral genomes is extremely low, ORFans and near-ORFans are likely to continue to grow in number as more genomes are sequenced. The abundance of phage ORFans may be partially due to the expected vast viral diversity, and may be instrumental in understanding viral evolution. The functions, origins and fates of the majority of viral ORFans remain a mystery. Further computational and experimental studies are likely to shed light on the mechanisms that have given rise to so many bacterial and viral ORFans.
Affiliation(s)
- Yanbin Yin
- Computer Science and Engineering Dept, 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, USA.
15
Daughdrill GW, Narayanaswami P, Gilmore SH, Belczyk A, Brown CJ. Dynamic behavior of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. J Mol Evol 2007; 65:277-88. [PMID: 17721672 DOI: 10.1007/s00239-007-9011-2] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.3] [Received: 12/06/2006] [Accepted: 05/18/2007] [Indexed: 01/19/2023]
Abstract
Proteins or regions of proteins that do not form compact globular structures are classified as intrinsically unstructured proteins (IUPs). IUPs are common in nature and have essential molecular functions, but even a limited understanding of the evolution of their dynamic behavior is lacking. The primary objective of this work was to test the evolutionary conservation of dynamic behavior for a particular class of IUPs that form intrinsically unstructured linker domains (IULD) that tether flanking folded domains. This objective was accomplished by measuring the backbone flexibility of several IULD homologues using nuclear magnetic resonance (NMR) spectroscopy. The backbone flexibility of five IULDs, representing three kingdoms, was measured and analyzed. Two IULDs from animals, one IULD from fungi, and two IULDs from plants showed similar levels of backbone flexibility that were consistent with the absence of a compact globular structure. In contrast, the amino acid sequences of the IULDs from these three taxa showed no significant similarity. To investigate how the dynamic behavior of the IULDs could be conserved in the absence of detectable sequence conservation, evolutionary rate studies were performed on a set of nine mammalian IULDs. The results of this analysis showed that many sites in the IULD are evolving neutrally, suggesting that dynamic behavior can be maintained in the absence of natural selection. This work represents the first experimental test of the evolutionary conservation of dynamic behavior and demonstrates that amino acid sequence conservation is not required for the conservation of dynamic behavior and presumably molecular function.
Affiliation(s)
- Gary W Daughdrill
- Department of Microbiology, Molecular Biology, and Biochemistry, University of Idaho, Moscow, ID 83844-3052, USA.
16
Yin Y, Fischer D. On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer. BMC Evol Biol 2006; 6:63. [PMID: 16914045 PMCID: PMC1559721 DOI: 10.1186/1471-2148-6-63] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Received: 04/10/2006] [Accepted: 08/16/2006] [Indexed: 11/10/2022]
Abstract
Background: The origin of microbial ORFans, ORFs having no detectable homology to other ORFs in the databases, is one of the unexplained puzzles of the post-genomic era. Several hypotheses on the origin of ORFans have been suggested in the last few years, most of which are based on selected, relatively small subsets of ORFans. One of the hypotheses for the origin of ORFans is that they have been acquired through lateral transfer from viruses. Here we carry out a comprehensive, genome-wide study on the origins of ORFans to quantify the strength of current evidence supporting this hypothesis. Results: We performed similarity searches by querying all current ORFans against the public virus protein database. Surprisingly, we found that only 2.8% of all microbial ORFans have detectable homologs in viruses, while the percentage of non-ORFans with detectable homologs in viruses is 7.9%, a significantly higher figure. This suggests that the current evidence for the origin of ORFans from lateral transfer from viruses is at best weak. However, an analysis of individual genomes revealed a number of organisms with much higher percentages, many of them belonging to the Firmicutes and Gamma-proteobacteria. We provide evidence suggesting that the current virus database may be biased towards those viruses attacking Firmicutes and Gamma-proteobacteria. Conclusion: We conclude that as more viral genomes are sequenced, more microbial ORFans will find homologs in viruses, but this trend may vary considerably between individual genomes. Thus, lateral transfer from viruses alone is unlikely to explain the origin of the majority of ORFans in the majority of prokaryotes and consequently, other, not necessarily exclusive, mechanisms are likely to better explain the origin of the increasing number of ORFans.
Affiliation(s)
- Yanbin Yin
- Computer Science and Engineering Dept. 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, US
- Daniel Fischer
- Computer Science and Engineering Dept. 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, US
- Bioinformatics/Dept. of Computer Science, Ben Gurion University, Beer-Sheva 84015, Israel
17
Siew N, Saini HK, Fischer D. A putative novel alpha/beta hydrolase ORFan family in Bacillus. FEBS Lett 2005; 579:3175-82. [PMID: 15922334 DOI: 10.1016/j.febslet.2005.04.030] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 12/30/2004] [Revised: 03/25/2005] [Accepted: 04/11/2005] [Indexed: 10/25/2022]
Abstract
A large number of sequences in each newly sequenced genome correspond to lineage and species-specific proteins, also known as ORFans. Amongst these ORFans, a large number are sequences with unknown structures and functions. We have identified a family of sequences, annotated as hypothetical proteins, which are specific to Bacillus and have carried out a computational study aimed at characterizing this family. Fold-recognition methods predict that these sequences belong to the alpha/beta hydrolase fold. We suggest possible catalytic triads for the ORFans and propose a hypothesis regarding the possible families within the alpha/beta hydrolase superfamily to which they may belong.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel
18
Siew N, Fischer D. Structural Biology Sheds Light on the Puzzle of Genomic ORFans. J Mol Biol 2004; 342:369-73. [PMID: 15327940 DOI: 10.1016/j.jmb.2004.06.073] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.7] [Received: 03/02/2004] [Revised: 06/09/2004] [Accepted: 06/19/2004] [Indexed: 10/26/2022]
Abstract
Genomic ORFans are orphan open reading frames (ORFs) with no significant sequence similarity to other ORFs. ORFans comprise 20-30% of the ORFs of most completely sequenced genomes. Because nothing can be learnt about ORFans via sequence homology, the functions and evolutionary origins of ORFans remain a mystery. Furthermore, because relatively few ORFans have been experimentally characterized, it has been suggested that most ORFans are not likely to correspond to functional, expressed proteins, but rather to spurious ORFs, pseudo-genes or to rapidly evolving proteins with non-essential roles. As a snapshot view of current ORFan structural studies, we searched for ORFans among proteins whose three-dimensional structures have been recently determined. We find that functional and structural studies of ORFans are not as underemphasized as previously suggested. These recently determined structures correspond to ORFans from all Kingdoms of life, and include proteins that have previously been functionally characterized, as well as structural genomics targets of unknown function labeled as "hypothetical proteins". This suggests that many of the ORFans in the databases are likely to correspond to expressed, functional (and even essential) proteins. Furthermore, the recently determined structures include examples of the various types of ORFans, suggesting that the functions and evolutionary origins of ORFans are diverse. Although this survey sheds some light on the ORFan mystery, further experimental studies are required to gain a better understanding of the role and origins of the tens of thousands of ORFans awaiting characterization.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel
19
Abstract
As each newly sequenced genome contains a significant number of protein-coding ORFs that are species-, family- or lineage-specific, many interesting questions arise about the evolution and role of these ORFs and of the genomes they are part of. We refer to these poorly conserved ORFs as singleton or paralogous ORFans if they are unique to one genome, or as orthologous ORFans if they appear only in a family of closely related organisms and have no homolog in other genomes. In order to study and classify ORFans we have constructed the ORFanage, an ORFan database. This database consists of the predicted ORFs in fully sequenced microbial genomes, and enables searching for the three types of ORFans in any subset of the genomes chosen by the user. The ORFanage could help in choosing interesting targets for further genomic and evolutionary studies. The ORFanage is accessible via http://www.bioinformatics.buffalo.edu/ORFanage.
Affiliation(s)
- Naomi Siew
- Department of Chemistry, Ben Gurion University, Beer-Sheva 84105, Israel.