1. Zielezinski A, Dobrychlop W, Karlowski WM. TRGdb: a universal resource for the exploration of taxonomically restricted genes in bacteria. Database (Oxford) 2023; 2023:baad058. [PMID: 37555549 PMCID: PMC10410690 DOI: 10.1093/database/baad058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 06/30/2023] [Accepted: 07/31/2023] [Indexed: 08/10/2023]
Abstract
The TRGdb database is a resource dedicated to taxonomically restricted genes (TRGs) in bacteria. It provides a comprehensive collection of genes that are specific to different genera and species, according to the latest release of bacterial taxonomy. The user interface allows for easy browsing and searching as well as sequence similarity exploration. The website also provides information on each TRG protein sequence, including its level of disorder, complexity and tendency to aggregate. TRGdb is a valuable resource for gaining a deeper understanding of the unique TRG-associated features and characteristics of bacterial organisms. Database URL: www.combio.pl/trgdb
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
- Wojciech Dobrychlop
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
- Wojciech M Karlowski
- Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Uniwersytetu Poznanskiego 6, Poznan 61-614, Poland
2. Sanejouand YH. On the Unknown Proteins of Eukaryotic Proteomes. J Mol Evol 2023:10.1007/s00239-023-10116-1. [PMID: 37219573 DOI: 10.1007/s00239-023-10116-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2022] [Accepted: 05/07/2023] [Indexed: 05/24/2023]
Abstract
To study unknown proteins on a large scale, a reference system has been set up for the three better studied eukaryotic kingdoms, built with 36 proteomes as taxonomically diverse as possible. Proteins from 362 other eukaryotic proteomes with no known homologue in this set were then analyzed, focusing notably on singletons, that is, on proteins with no known homologue in their own proteome. Consistently, for a given species, no more than 12% of the singletons thus found are known at the protein level, according to UniProt. In addition, since they rely on the information found in the alignment of homologous sequences, predictions of AlphaFold2 for their three-dimensional structure are poor. In the case of metazoan species, the number of singletons rarely exceeds 1000 for the species closest to the reference system (divergence times below 75 Myr). Interestingly, in the cases of viridiplantae and fungi, larger numbers of singletons are found for such species, as if the timescale on which singletons are added to proteomes were different in metazoa and in other eukaryotic kingdoms. In order to confirm this phenomenon, further studies of proteomes closer to those of the reference system are, however, needed.
Affiliation(s)
- Yves-Henri Sanejouand
- US2B, UMR 6286 of CNRS, Nantes University, rue de la Houssinière, 44322, Nantes, France.
3. Jiang M, Li X, Dong X, Zu Y, Zhan Z, Piao Z, Lang H. Research Advances and Prospects of Orphan Genes in Plants. Front Plant Sci 2022; 13:947129. [PMID: 35874010 PMCID: PMC9305701 DOI: 10.3389/fpls.2022.947129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Accepted: 06/23/2022] [Indexed: 06/15/2023]
Abstract
Orphan genes (OGs) are defined as genes having no sequence similarity with genes present in other lineages. OGs have been regarded to play a key role in the development of lineage-specific adaptations and can also serve as a constant source of evolutionary novelty. These genes have often been found related to various stress responses, species-specific traits, special expression regulation, and also participate in primary substance metabolism. The advancement in sequencing tools and genome analysis methods has made the identification and characterization of OGs comparatively easier. In the study of OG functions in plants, significant progress has been made. We review recent advances in the fast evolving characteristics, expression modulation, and functional analysis of OGs with a focus on their role in plant biology. We also emphasize current challenges, adoptable strategies and discuss possible future directions of functional study of OGs.
Affiliation(s)
- Mingliang Jiang
- School of Agriculture, Jilin Agricultural Science and Technology College, Jilin, China
- Xiaonan Li
- College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Xiangshu Dong
- School of Agriculture, Yunnan University, Kunming, China
- Ye Zu
- College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Zongxiang Zhan
- College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Zhongyun Piao
- College of Horticulture, Shenyang Agricultural University, Shenyang, China
- Hong Lang
- School of Agriculture, Jilin Agricultural Science and Technology College, Jilin, China
4. Lobb B, Tremblay BJM, Moreno-Hagelsieb G, Doxey AC. An assessment of genome annotation coverage across the bacterial tree of life. Microb Genom 2020; 6. [PMID: 32124724 PMCID: PMC7200070 DOI: 10.1099/mgen.0.000341] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 12/13/2022]
Abstract
Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.
Affiliation(s)
- Briallen Lobb
- Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada
- Gabriel Moreno-Hagelsieb
- Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo, ON, Canada
- Andrew C Doxey
- Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada
5. Chen K, Tian Z, Chen P, He H, Jiang F, Long CA. Genome-wide identification, characterization and expression analysis of lineage-specific genes within Hanseniaspora yeasts. FEMS Microbiol Lett 2020; 367:5837084. [PMID: 32407480 DOI: 10.1093/femsle/fnaa077] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/27/2019] [Accepted: 05/12/2020] [Indexed: 12/13/2022]
Abstract
Lineage-specific genes (LSGs) are defined as genes with sequences that are not significantly similar to those in any other lineage. LSGs have been proposed, and sometimes shown, to have significant effects in the evolution of biological function. In this study, two sets of Hanseniaspora spp. LSGs were identified by comparing the sequences of the Kloeckera apiculata genome and of 80 other yeast genomes. This study identified 344 Hanseniaspora-specific genes (HSGs) and 109 genes ('orphan genes') specific to K. apiculata. Three thousand three hundred thirty-one K. apiculata genes that showed significant similarity to at least one sequence outside the Hanseniaspora were classified into evolutionarily conserved genes. We analyzed their sequence features, functional categories, gene origin, gene structure and gene expression. We also investigated the predicted cellular roles and Gene Ontology categories of the LSGs using functional inference. The patterns of the functions of LSGs do not deviate significantly from genome-wide average. The results showed that a few LSGs were formed by gene duplication, followed by rapid sequence divergence. Many of the HSGs and orphan genes exhibited altered expression in response to abiotic stress. Studying these LSGs might be helpful for understanding the molecular mechanism of yeast adaption.
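The three-way split described in this abstract (orphan genes specific to one species, genus-specific genes, and evolutionarily conserved genes) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function name `classify_gene`, the tuple layout of a hit, and the E-value cutoff of 1e-3 are all assumptions made for the example, since the abstract does not state the search settings.

```python
from typing import Iterable, Tuple

# Hypothetical significance cutoff; the abstract does not give the threshold used.
EVALUE_CUTOFF = 1e-3


def classify_gene(hits: Iterable[Tuple[str, str, float]],
                  species: str, genus: str) -> str:
    """Classify one query gene from its similarity-search hits.

    Each hit is (hit_species, hit_genus, evalue). Hits to the query
    species itself are ignored, as are hits above the cutoff.
    """
    significant = [
        (hit_species, hit_genus)
        for hit_species, hit_genus, evalue in hits
        if evalue <= EVALUE_CUTOFF and hit_species != species
    ]
    if not significant:
        return "orphan"           # no significant hit outside the species
    if all(hit_genus == genus for _, hit_genus in significant):
        return "genus-specific"   # significant hits only inside the genus
    return "conserved"            # at least one significant hit outside the genus
```

A gene with hits only in sister species of the same genus would come out as "genus-specific" here, while a single significant hit in any other genus is enough to mark it "conserved".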
Affiliation(s)
- Kai Chen
- School of Biological Engineering and Food, Hubei University of Technology, Wuhan 430068, China
- Zhonghuan Tian
- Key Laboratory of Horticultural Plant Biology of the Ministry of Education, National Centre of Citrus Breeding, Huazhong Agricultural University, Wuhan 430070, China
- Ping Chen
- Department of Pediatric Hematology, Tongji Hospital Affiliated to Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430000, China
- Hua He
- School of Landscape Architecture and Horticulture, Wuhan Institute of Bioengineering, Wuhan 430415, China
- Fatang Jiang
- School of Biological Engineering and Food, Hubei University of Technology, Wuhan 430068, China
- Chao-An Long
- Key Laboratory of Horticultural Plant Biology of the Ministry of Education, National Centre of Citrus Breeding, Huazhong Agricultural University, Wuhan 430070, China
6. Orphan Genes Shared by Pathogenic Genomes Are More Associated with Bacterial Pathogenicity. mSystems 2019; 4:mSystems00290-18. [PMID: 30801025 PMCID: PMC6372840 DOI: 10.1128/msystems.00290-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/15/2018] [Accepted: 01/08/2019] [Indexed: 11/20/2022]
Abstract
Recent pangenome analyses of numerous bacterial species have suggested that each genome of a single species may have a significant fraction of its gene content unique or shared by a very few genomes (i.e., ORFans). We selected nine bacterial genera, each containing at least five pathogenic and five nonpathogenic genomes, to compare their ORFans in relation to pathogenicity-related genes. Pathogens in these genera are known to cause a number of common and devastating human diseases such as pneumonia, diphtheria, melioidosis, and tuberculosis. Thus, they are worthy of in-depth systems microbiology investigations, including the comparative study of ORFans between pathogens and nonpathogens. We provide direct evidence to suggest that ORFans shared by more pathogens are more associated with pathogenicity-related genes and thus are more important targets for development of new diagnostic markers or therapeutic drugs for bacterial infectious diseases. Orphan genes (also known as ORFans [i.e., orphan open reading frames]) are new genes that enable an organism to adapt to its specific living environment. Our focus in this study is to compare ORFans between pathogens (P) and nonpathogens (NP) of the same genus. Using the pangenome idea, we have identified 130,169 ORFans in nine bacterial genera (505 genomes) and classified these ORFans into four groups: (i) SS-ORFans (P), which are only found in a single pathogenic genome; (ii) SS-ORFans (NP), which are only found in a single nonpathogenic genome; (iii) PS-ORFans (P), which are found in multiple pathogenic genomes; and (iv) NS-ORFans (NP), which are found in multiple nonpathogenic genomes. Within the same genus, pathogens do not always have more genes, more ORFans, or more pathogenicity-related genes (PRGs)—including prophages, pathogenicity islands (PAIs), virulence factors (VFs), and horizontal gene transfers (HGTs)—than nonpathogens. 
Interestingly, in pathogens of the nine genera, the percentages of PS-ORFans are consistently higher than those of SS-ORFans, which is not true in nonpathogens. Similarly, in pathogens of the nine genera, the percentages of PS-ORFans matching the four types of PRGs are also always higher than those of SS-ORFans, but this is not true in nonpathogens. All of these findings suggest the greater importance of PS-ORFans for bacterial pathogenicity.
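The four ORFan groups named in this abstract can be expressed as a small decision rule. The sketch below is only an illustration under stated assumptions: the function name `orfan_group` and its inputs are hypothetical, and the abstract does not say how an ORFan found in both pathogenic and nonpathogenic genomes is handled, so that case is returned as `None` here rather than guessed.

```python
from typing import Dict, Iterable, Optional


def orfan_group(genomes_with_gene: Iterable[str],
                is_pathogen: Dict[str, bool]) -> Optional[str]:
    """Assign an ORFan to one of the four groups listed in the abstract.

    genomes_with_gene: genomes (within one genus) that carry the ORFan.
    is_pathogen: maps each genome name to True (P) or False (NP).
    """
    genomes = set(genomes_with_gene)
    if not genomes:
        return None
    statuses = {is_pathogen[g] for g in genomes}
    if len(statuses) > 1:
        # Mixed pathogenic/nonpathogenic: not covered by the four groups (assumption).
        return None
    pathogenic = statuses.pop()
    if len(genomes) == 1:
        return "SS-ORFan (P)" if pathogenic else "SS-ORFan (NP)"
    return "PS-ORFan (P)" if pathogenic else "NS-ORFan (NP)"
```

Counting the labels returned for every ORFan in a genus would give the per-group percentages that the abstract compares between pathogens and nonpathogens.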
7. Description and genomic characterization of Massiliimalia massiliensis gen. nov., sp. nov., and Massiliimalia timonensis gen. nov., sp. nov., two new members of the family Ruminococcaceae isolated from the human gut. Antonie van Leeuwenhoek 2019; 112:905-918. [DOI: 10.1007/s10482-018-01223-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/11/2018] [Accepted: 12/29/2018] [Indexed: 12/16/2022]
8. Angers A, Ouimet P, Tsyvian-Dzyabko A, Nock T, Breton S. [The underestimated coding potential of mitochondrial DNA]. Med Sci (Paris) 2019; 35:46-54. [PMID: 30672456 DOI: 10.1051/medsci/2018308] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022]
Abstract
Mitochondria are ancient organelles that emerged from the endosymbiosis of free-living proto-bacteria. They still retain a semi-autonomous genetic system with a small genome. Mitochondrial DNA (mtDNA) codes for 13 essential proteins for the production of ATP, the sequences of which are relatively conserved across Metazoans. The discovery of additional mitochondria-derived peptides (MDPs) indicates an underestimated coding potential. Humanin, an anti-apoptotic peptide, is likely independently transcribed from within the 16S rRNA gene, as are recently described SHLPs. MOTS-c, discovered in silico, has been demonstrated to be involved in metabolism and insulin sensitivity. Gau is a positionally conserved open reading frame (ORF) sequence found in the antisense strand of the COX1 gene, and its corresponding peptide is strictly colocalized with mitochondrial markers. In bivalves with doubly uniparental inheritance of mtDNA, male and female mtDNAs each carry a separate additional gene possibly involved in sex determination. Other MDPs likely exist and their investigation will shed light on the underestimated functional repertoire of mitochondria.
Affiliation(s)
- Annie Angers
- Département de sciences biologiques, université de Montréal, pavillon Marie-Victorin, faculté des arts et des sciences. CP 6128, succursale centre-ville, Montréal QC, H3C 3J7, Canada
- Philip Ouimet
- Département de sciences biologiques, université de Montréal, pavillon Marie-Victorin, faculté des arts et des sciences. CP 6128, succursale centre-ville, Montréal QC, H3C 3J7, Canada
- Assia Tsyvian-Dzyabko
- Département de sciences biologiques, université de Montréal, pavillon Marie-Victorin, faculté des arts et des sciences. CP 6128, succursale centre-ville, Montréal QC, H3C 3J7, Canada
- Tanya Nock
- Département de sciences biologiques, université de Montréal, pavillon Marie-Victorin, faculté des arts et des sciences. CP 6128, succursale centre-ville, Montréal QC, H3C 3J7, Canada
- Sophie Breton
- Département de sciences biologiques, université de Montréal, pavillon Marie-Victorin, faculté des arts et des sciences. CP 6128, succursale centre-ville, Montréal QC, H3C 3J7, Canada
9. Basile W, Sachenkova O, Light S, Elofsson A. High GC content causes orphan proteins to be intrinsically disordered. PLoS Comput Biol 2017; 13:e1005375. [PMID: 28355220 PMCID: PMC5389847 DOI: 10.1371/journal.pcbi.1005375] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 07/30/2016] [Revised: 04/12/2017] [Accepted: 01/21/2017] [Indexed: 01/29/2023]
Abstract
De novo creation of protein coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. These orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should for instance not aggregate. Therefore, although the creation of short ORFs could be truly random, the fixation should be subjected to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed. To solve this riddle we studied structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in the properties between proteins of different ages. However, when we take the GC content into account we noted that it could explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead these proteins largely resemble random proteins given a particular GC level. During evolution the properties of a protein change faster than the GC level causing the relationship between disorder and GC to gradually weaken. We show that the GC content of a genome is of great importance for the properties of an orphan protein. GC content affects the frequency of the codons and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding for Ala, Pro and Gly contain 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. 
The three high GC amino acids are all disorder promoting, while Phe, Tyr and Ile are order promoting. Therefore, random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast, structural properties of ancient proteins only show a weak correlation with GC content. This suggests that even after fixation in the population, proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution the correlation between structural properties and GC weakens.
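The codon GC figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below takes the standard genetic code codons for the amino acids the abstract names and computes the mean GC fraction per amino acid. It assumes an unweighted average over synonymous codons, which ignores codon usage bias and may differ from how the authors derived their rounded values.

```python
# Standard genetic code codons (DNA alphabet) for the amino acids named in the abstract.
CODONS = {
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Pro": ["CCT", "CCC", "CCA", "CCG"],
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
    "Asn": ["AAT", "AAC"],
    "Tyr": ["TAT", "TAC"],
    "Ile": ["ATT", "ATC", "ATA"],
}


def mean_codon_gc(codons):
    """Fraction of G or C bases across all given codons (unweighted)."""
    gc = sum(base in "GC" for codon in codons for base in codon)
    total = sum(len(codon) for codon in codons)
    return gc / total


GC_BY_AMINO_ACID = {aa: mean_codon_gc(codons) for aa, codons in CODONS.items()}

for aa, frac in GC_BY_AMINO_ACID.items():
    print(f"{aa}: {frac:.1%}")
```

Under this assumption Ala, Pro and Gly come out at about 83% GC, Lys, Phe, Asn and Tyr at about 17%, and Ile at about 11%, which is in line with the "80%" and "20% or less" stated in the abstract.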
Affiliation(s)
- Walter Basile
- Science for Life Laboratory, Stockholm University, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Oxana Sachenkova
- Science for Life Laboratory, Stockholm University, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Sara Light
- Science for Life Laboratory, Stockholm University, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Bioinformatics Infrastructure for Life Sciences (BILS), Linköping University, Linköping, Sweden
- Arne Elofsson
- Science for Life Laboratory, Stockholm University, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Swedish e-Science Research Center (SeRC), Kungliga Tekniska Högskolan, Stockholm, Sweden
10. Gupta RS. Impact of genomics on the understanding of microbial evolution and classification: the importance of Darwin's views on classification. FEMS Microbiol Rev 2016; 40:520-53. [PMID: 27279642 DOI: 10.1093/femsre/fuw011] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Accepted: 05/14/2016] [Indexed: 12/24/2022]
Abstract
Analyses of genome sequences, by some approaches, suggest that the widespread occurrence of horizontal gene transfers (HGTs) in prokaryotes disguises their evolutionary relationships and have led to questioning of the Darwinian model of evolution for prokaryotes. These inferences are critically examined in the light of comparative genome analysis, characteristic synapomorphies, phylogenetic trees and Darwin's views on examining evolutionary relationships. Genome sequences are enabling discovery of numerous molecular markers (synapomorphies) such as conserved signature indels (CSIs) and conserved signature proteins (CSPs), which are distinctive characteristics of different prokaryotic taxa. Based on these molecular markers, exhibiting high degree of specificity and predictive ability, numerous prokaryotic taxa of different ranks, currently identified based on the 16S rRNA gene trees, can now be reliably demarcated in molecular terms. Within all studied groups, multiple CSIs and CSPs have been identified for successive nested clades providing reliable information regarding their hierarchical relationships and these inferences are not affected by HGTs. These results strongly support Darwin's views on evolution and classification and supplement the current phylogenetic framework based on 16S rRNA in important respects. The identified molecular markers provide important means for developing novel diagnostics, therapeutics and for functional studies providing important insights regarding prokaryotic taxa.
Affiliation(s)
- Radhey S Gupta
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada
11. Lobb B, Doxey AC. Novel function discovery through sequence and structural data mining. Curr Opin Struct Biol 2016; 38:53-61. [DOI: 10.1016/j.sbi.2016.05.017] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 03/19/2016] [Revised: 05/17/2016] [Accepted: 05/24/2016] [Indexed: 01/30/2023]
12. Xu Y, Wu G, Hao B, Chen L, Deng X, Xu Q. Identification, characterization and expression analysis of lineage-specific genes within sweet orange (Citrus sinensis). BMC Genomics 2015; 16:995. [PMID: 26597278 PMCID: PMC4657247 DOI: 10.1186/s12864-015-2211-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 05/07/2015] [Accepted: 11/13/2015] [Indexed: 11/23/2022]
Abstract
Background: With the availability of a rapidly increasing number of genome and transcriptome sequences, lineage-specific genes (LSGs) can be identified and characterized. Like other conserved functional genes, LSGs play important roles in biological evolution and functions. Results: Two sets of citrus LSGs, 296 citrus-specific genes (CSGs) and 1039 orphan genes specific to sweet orange, were identified by comparative analysis between the sweet orange genome sequences and 41 genomes and 273 transcriptomes. With the two sets of genes, gene structure and gene expression pattern were investigated. On average, both the CSGs and orphan genes have fewer exons, shorter gene length and higher GC content when compared with evolutionarily conserved genes (ECs). Expression profiling indicated that most of the LSGs were expressed in various tissues of sweet orange and that some of them exhibited distinct temporal and spatial expression patterns. In particular, the orphan genes were preferentially expressed in callus, which is an important pluripotent tissue of citrus. In addition, some of the CSGs and orphan genes were expressed in response to abiotic stress, indicating their potential functions during interaction with the environment. Conclusion: This study identified and characterized two sets of LSGs in citrus, dissected their sequence features and expression patterns, and provided valuable clues for future functional analysis of the LSGs in sweet orange.
Affiliation(s)
- Yuantao Xu
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), Huazhong Agricultural University, Wuhan, 430070, China.
- Guizhi Wu
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), Huazhong Agricultural University, Wuhan, 430070, China.
- Baohai Hao
- Agricultural Bioinformatics Key laboratory of Hubei Province, College of Information, Huazhong Agricultural University, Wuhan, 430070, China.
- Lingling Chen
- Agricultural Bioinformatics Key laboratory of Hubei Province, College of Information, Huazhong Agricultural University, Wuhan, 430070, China.
- Xiuxin Deng
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), Huazhong Agricultural University, Wuhan, 430070, China.
- Qiang Xu
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), Huazhong Agricultural University, Wuhan, 430070, China.
13. Hugon P, Dufour JC, Colson P, Fournier PE, Sallah K, Raoult D. A comprehensive repertoire of prokaryotic species identified in human beings. Lancet Infect Dis 2015; 15:1211-1219. [PMID: 26311042 DOI: 10.1016/s1473-3099(15)00293-5] [Citation(s) in RCA: 209] [Impact Index Per Article: 23.2] [Received: 07/04/2014] [Revised: 02/17/2015] [Accepted: 02/27/2015] [Indexed: 02/07/2023]
Abstract
The compilation of the complete prokaryotic repertoire associated with human beings as commensals or pathogens is a major goal for the scientific and medical community. The use of bacterial culture techniques remains a crucial step to describe new prokaryotic species. The large number of officially acknowledged bacterial species described since 1980 and the recent increase in the number of recognised pathogenic species have highlighted the absence of an exhaustive compilation of species isolated in human beings. By means of a thorough investigation of several large culture databases and a search of the scientific literature, we built an online database containing all human-associated prokaryotic species described, whether or not they had been validated and have standing in nomenclature. We list 2172 species that have been isolated in human beings. They were classified in 12 different phyla, mostly in the Proteobacteria, Firmicutes, Actinobacteria, and Bacteroidetes phyla. Our online database is useful for both clinicians and microbiologists and forms part of the Human Microbiome Project, which aims to characterise the whole human microbiota and help improve our understanding of the human predisposition and susceptibility to infectious agents.
Affiliation(s)
- Perrine Hugon
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, UM63, CNRS 7278, IRD 198, INSERM 1095, Marseille, France
- Jean-Charles Dufour
- Assistance Publique des Hôpitaux de Marseille, BioSTIC, Pôle de Santé Publique, Marseille, France; Aix-Marseille Université, UMR912 SESSTIM (AMU-INSERM-IRD), Marseille, France
- Philippe Colson
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, UM63, CNRS 7278, IRD 198, INSERM 1095, Marseille, France
- Pierre-Edouard Fournier
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, UM63, CNRS 7278, IRD 198, INSERM 1095, Marseille, France
- Kankoe Sallah
- Aix-Marseille Université, UMR912 SESSTIM (AMU-INSERM-IRD), Marseille, France
- Didier Raoult
- Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, UM63, CNRS 7278, IRD 198, INSERM 1095, Marseille, France; Special Infectious Agents Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia.
14. Lobb B, Kurtz DA, Moreno-Hagelsieb G, Doxey AC. Remote homology and the functions of metagenomic dark matter. Front Genet 2015; 6:234. [PMID: 26257768 PMCID: PMC4508852 DOI: 10.3389/fgene.2015.00234] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 03/05/2015] [Accepted: 06/22/2015] [Indexed: 01/26/2023]
Abstract
Predicted open reading frames (ORFs) that lack detectable homology to known proteins are termed ORFans. Despite their prevalence in metagenomes, the extent to which ORFans encode real proteins, the degree to which they can be annotated, and their functional contributions, remain unclear. To gain insights into these questions, we applied sensitive remote-homology detection methods to functionally analyze ORFans from soil, marine, and human gut metagenome collections. ORFans were identified, clustered into sequence families, and annotated through profile-profile comparison to proteins of known structure. We found that a considerable number of metagenomic ORFans (73,896 of 484,121, 15.3%) exhibit significant remote homology to structurally characterized proteins, providing a means for ORFan functional profiling. The extent of detected remote homology far exceeds that obtained for artificial protein families (1.4%). As expected for real genes, the predicted functions of ORFans are significantly similar to the functions of their gene neighbors (p < 0.001). Compared to the functional profiles predicted through standard homology searches, ORFans show biologically intriguing differences. Many ORFan-enriched functions are virus-related and tend to reflect biological processes associated with extreme sequence diversity. Each environment also possesses a large number of unique ORFan families and functions, including some known to play important community roles such as gut microbial polysaccharide digestion. Lastly, ORFans are a valuable resource for finding novel enzymes of interest, as we demonstrate through the identification of hundreds of novel ORFan metalloproteases that all possess a signature catalytic motif despite a general lack of similarity to known proteins. Our ORFan functional predictions are a valuable resource for discovering novel protein families and exploring the boundaries of protein sequence space. 
All remote homology predictions are available at http://doxey.uwaterloo.ca/ORFans.
Affiliation(s)
- Briallen Lobb
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
- Daniel A Kurtz
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
- Andrew C Doxey
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
|
15
|
Zhou K, Huang B, Zou M, Lu D, He S, Wang G. Genome-wide identification of lineage-specific genes within Caenorhabditis elegans. Genomics 2015; 106:242-8. [PMID: 26188256 DOI: 10.1016/j.ygeno.2015.07.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 01/15/2015] [Revised: 07/08/2015] [Accepted: 07/09/2015] [Indexed: 11/19/2022]
Abstract
With the rapid growth of sequencing technology, a number of genomes and transcriptomes of various species have been sequenced, contributing to the study of lineage-specific genes (LSGs). We identified two sets of LSGs using BLAST: one included Caenorhabditis elegans species-specific genes (1423, SSGs), and the other consisted of Caenorhabditis genus-specific genes (4539, GSGs). The subsequent characterization and analysis of the SSGs and GSGs showed that they have significant differences in evolution and that most LSGs were generated by gene duplication and integration of transposable elements (TEs). We then performed temporal expression profiling and protein function prediction and observed that many SSGs and GSGs are expressed and that genes involved with sex determination, specific stress, immune response, and morphogenesis are over-represented, suggesting that these specific genes may be related to the Caenorhabditis nematodes' special ability to survive in severe and extreme environments.
Affiliation(s)
- Kun Zhou
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, China.
- Beibei Huang
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, China.
- Ming Zou
- Huazhong Agriculture University, Wuhan 430070, China.
- Dandan Lu
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, China.
- Shunping He
- The Key Laboratory of Aquatic Biodiversity and Conservation of the Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China.
- Guoxiu Wang
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, China.
|
16
|
Yang YS, Fernandez B, Lagorce A, Aloin V, De Guillen KM, Boyer JB, Dedieu A, Confalonieri F, Armengaud J, Roumestand C. Prioritizing targets for structural biology through the lens of proteomics: the archaeal protein TGAM_1934 from Thermococcus gammatolerans. Proteomics 2015; 15:114-23. [PMID: 25359407 DOI: 10.1002/pmic.201300535] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/03/2013] [Revised: 10/01/2014] [Accepted: 10/24/2014] [Indexed: 11/09/2022]
Abstract
ORFans are hypothetical proteins lacking any significant sequence similarity with other proteins. Here, we highlighted by quantitative proteomics the TGAM_1934 ORFan from the hyperradioresistant Thermococcus gammatolerans archaeon as one of the most abundant hypothetical proteins. This protein has been selected as a priority target for structure determination on the basis of its abundance in three cellular conditions. Its solution structure has been determined using multidimensional heteronuclear NMR spectroscopy. TGAM_1934 displays an original fold, although sharing some similarities with the 3D structure of the bacterial ortholog of frataxin, CyaY, a protein conserved in bacteria and eukaryotes and involved in iron-sulfur cluster biogenesis. These results highlight the potential of structural proteomics in prioritizing ORFan targets for structure determination based on quantitative proteomics data. The proteomic data and structure coordinates have been deposited to the ProteomeXchange with identifier PXD000402 (http://proteomecentral.proteomexchange.org/dataset/PXD000402) and Protein Data Bank under the accession number 2mcf, respectively.
Affiliation(s)
- Yin-Shan Yang
- Centre de Biochimie Structurale, Universités de Montpellier, Montpellier, France
|
17
|
Boros Á, Pankovics P, Reuter G. Avian picornaviruses: molecular evolution, genome diversity and unusual genome features of a rapidly expanding group of viruses in birds. Infect Genet Evol 2014; 28:151-66. [PMID: 25278047 DOI: 10.1016/j.meegid.2014.09.027] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 07/29/2014] [Revised: 09/15/2014] [Accepted: 09/21/2014] [Indexed: 12/29/2022]
Abstract
Picornaviridae is one of the most diverse families of viruses infecting vertebrate species. In contrast to the relatively small number of mammal species compared to other vertebrates, the abundance of mammal-infecting picornaviruses was significantly overrepresented among the presently known picornaviruses. Therefore most of the current knowledge about the genome diversity/organization patterns and common genome features was based on the analysis of mammal-infecting picornaviruses. Besides the well-known reservoir role of birds in the case of several emerging viral pathogens, little is known about the diversity of picornaviruses circulating among birds, although in the last decade the number of known avian picornavirus species with complete genome has increased from one to at least 15. However, little is known about the geographic distribution, host spectrum or pathogenic potential of the recently described picornaviruses of birds. Despite the low number of known avian picornaviruses, the phylogenetic and genome organization diversity of these viruses was remarkable. Besides the common L-4-3-4 and 4-3-4 genome layouts, unusual genome patterns (3-4-4; 3-5-4, 3-6-4; 3-8-4) with variable, multicistronic 2A genome regions were found among avian picornaviruses. The phylogenetic and genomic analysis revealed the presence of several conserved structures at the untranslated regions among phylogenetically distant avian and non-avian picornaviruses as well as at least five different avian picornavirus phylogenetic clusters located in every main picornavirus lineage with characteristic genome layouts, which suggests the complex evolutionary history of these viruses. Based on the remarkable genetic diversity of the few known avian picornaviruses, the emergence of further divergent picornaviruses causing challenges in the current taxonomy and also in the understanding of the evolution and genome organization of picornaviruses is strongly expected. 
In this review we would like to summarize the current knowledge about the taxonomy, pathogenic potential, phylogenetic/genomic diversity and evolutional relationship of avian picornaviruses.
Affiliation(s)
- Ákos Boros
- Regional Laboratory of Virology, National Reference Laboratory of Gastroenteric Viruses, ÁNTSZ Regional Institute of State Public Health Service, Pécs, Hungary
- Péter Pankovics
- Regional Laboratory of Virology, National Reference Laboratory of Gastroenteric Viruses, ÁNTSZ Regional Institute of State Public Health Service, Pécs, Hungary
- Gábor Reuter
- Regional Laboratory of Virology, National Reference Laboratory of Gastroenteric Viruses, ÁNTSZ Regional Institute of State Public Health Service, Pécs, Hungary.
|
18
|
Mewalal R, Mizrachi E, Mansfield SD, Myburg AA. Cell wall-related proteins of unknown function: missing links in plant cell wall development. Plant Cell Physiol 2014; 55:1031-43. [PMID: 24683037 DOI: 10.1093/pcp/pcu050] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 05/18/2023]
Abstract
Lignocellulosic biomass is an important feedstock for the pulp and paper industry as well as emerging biofuel and biomaterial industries. However, the recalcitrance of the secondary cell wall to chemical or enzymatic degradation remains a major hurdle for efficient extraction of economically important biopolymers such as cellulose. It has been estimated that approximately 10-15% of about 27,000 protein-coding genes in the Arabidopsis genome are dedicated to cell wall development; however, only about 130 Arabidopsis genes thus far have experimental evidence validating cell wall function. While many genes have been implicated through co-expression analysis with known genes, a large number are broadly classified as proteins of unknown function (PUFs). Recently the functionality of some of these unknown proteins in cell wall development has been revealed using reverse genetic approaches. Given the large number of cell wall-related PUFs, how do we approach and subsequently prioritize the investigation of such unknown genes that may be essential to or influence plant cell wall development and structure? Here, we address the aforementioned question in two parts; we first identify the different kinds of PUFs based on known and predicted features such as protein domains. Knowledge of inherent features of PUFs may allow for functional inference and a concomitant link to biological context. Secondly, we discuss omics-based technologies and approaches that are helping identify and prioritize cell wall-related PUFs by functional association. In this way, hypothesis-driven experiments can be designed for functional elucidation of many proteins that remain missing links in our understanding of plant cell wall biosynthesis.
Affiliation(s)
- Ritesh Mewalal
- Department of Genetics, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, Private bag X20, Hatfield, Pretoria, 0028, South Africa
- Eshchar Mizrachi
- Department of Genetics, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, Private bag X20, Hatfield, Pretoria, 0028, South Africa
- Shawn D Mansfield
- Department of Wood Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Alexander A Myburg
- Department of Genetics, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, Private bag X20, Hatfield, Pretoria, 0028, South Africa
|
19
|
Jeanniard A, Dunigan DD, Gurnon JR, Agarkova IV, Kang M, Vitek J, Duncan G, McClung OW, Larsen M, Claverie JM, Van Etten JL, Blanc G. Towards defining the chloroviruses: a genomic journey through a genus of large DNA viruses. BMC Genomics 2013; 14:158. [PMID: 23497343 PMCID: PMC3602175 DOI: 10.1186/1471-2164-14-158] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Received: 10/25/2012] [Accepted: 02/22/2013] [Indexed: 11/29/2022] Open
Abstract
Background Giant viruses in the genus Chlorovirus (family Phycodnaviridae) infect eukaryotic green microalgae. The prototype member of the genus, Paramecium bursaria chlorella virus 1, was sequenced more than 15 years ago, and to date there are only 6 fully sequenced chloroviruses in public databases. Presented here are the draft genome sequences of 35 additional chloroviruses (287 – 348 Kb/319 – 381 predicted protein encoding genes) collected across the globe; they infect one of three different green algal species. These new data allowed us to analyze the genomic landscape of 41 chloroviruses, which revealed some remarkable features about these viruses. Results Genome colinearity, nucleotide conservation and phylogenetic affinity were limited to chloroviruses infecting the same host, confirming the validity of the three previously known subgenera. Clues for the existence of a fourth new subgenus indicate that the boundaries of chlorovirus diversity are not completely determined. Comparison of the chlorovirus phylogeny with that of the algal hosts indicates that chloroviruses have changed hosts in their evolutionary history. Reconstruction of the ancestral genome suggests that the last common chlorovirus ancestor had a slightly more diverse protein repertoire than modern chloroviruses. However, more than half of the defined chlorovirus gene families have a potential recent origin (after Chlorovirus divergence), among which a portion shows compositional evidence for horizontal gene transfer. Only a few of the putative acquired proteins had close homologs in databases raising the question of the true donor organism(s). Phylogenomic analysis identified only seven proteins whose genes were potentially exchanged between the algal host and the chloroviruses. Conclusion The present evaluation of the genomic evolution pattern suggests that chloroviruses differ from that described in the related Poxviridae and Mimiviridae. 
Our study shows that the fixation of algal host genes has been anecdotal in the evolutionary history of chloroviruses. We finally discuss the incongruence between compositional evidence of horizontal gene transfer and lack of close relative sequences in the databases, which suggests that the recently acquired genes originate from a still largely un-sequenced reservoir of genomes, possibly other unknown viruses that infect the same hosts.
Affiliation(s)
- Adrien Jeanniard
- Information Génomique & Structurale, IGS UMR7256, CNRS, Aix-Marseille Université, FR-13288, Marseille, France
|
20
|
Yang L, Zou M, Fu B, He S. Genome-wide identification, characterization, and expression analysis of lineage-specific genes within zebrafish. BMC Genomics 2013; 14:65. [PMID: 23368736 PMCID: PMC3599513 DOI: 10.1186/1471-2164-14-65] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 09/22/2012] [Accepted: 01/29/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The genomic basis of teleost phenotypic complexity remains obscure, despite increasing availability of genome and transcriptome sequence data. Fish-specific genome duplication cannot provide sufficient explanation for the morphological complexity of teleosts, considering the relatively large number of extinct basal ray-finned fishes. RESULTS In this study, we performed comparative genomic analysis to discover the Conserved Teleost-Specific Genes (CTSGs) and orphan genes within zebrafish and found that these two sets of lineage-specific genes may have played important roles during zebrafish embryogenesis. Lineage-specific genes within zebrafish share many of the characteristics of their counterparts in other species: shorter length, fewer exon numbers, higher GC content, and fewer of them have transcript support. Chromosomal location analysis indicated that neither the CTSGs nor the orphan genes were distributed evenly in the chromosomes of zebrafish. The significant enrichment of immunity proteins in CTSGs annotated by gene ontology (GO) or predicted ab initio may imply that defense against pathogens may be an important reason for the diversification of teleosts. The evolutionary origin of the lineage-specific genes was determined and a very high percentage of lineage-specific genes were generated via gene duplications. The temporal and spatial expression profile of lineage-specific genes obtained by expressed sequence tags (EST) and RNA-seq data revealed two novel properties: in addition to being highly tissue-preferred expression, lineage-specific genes are also highly temporally restricted, namely they are expressed in narrower time windows than evolutionarily conserved genes and are specifically enriched in later-stage embryos and early larval stages. 
CONCLUSIONS Our study provides the first systematic identification of two different sets of lineage-specific genes within zebrafish and provides valuable information leading towards a better understanding of the molecular mechanisms of the genomic basis of teleost phenotypic complexity for future studies.
Affiliation(s)
- Liandong Yang
- The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, Hubei 430072, People's Republic of China
|
21
|
Georgiades K, Raoult D. How microbiology helps define the rhizome of life. Front Cell Infect Microbiol 2012; 2:60. [PMID: 22919651 PMCID: PMC3417629 DOI: 10.3389/fcimb.2012.00060] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/18/2012] [Accepted: 04/16/2012] [Indexed: 01/24/2023] Open
Abstract
In contrast to the tree of life (TOF) theory, species are mosaics of gene sequences with different origins. Observations of the extensive lateral sequence transfers in all organisms have demonstrated that the genomes of all life forms are collections of genes with different evolutionary histories that cannot be represented by a single TOF. Moreover, genes themselves commonly have several origins due to recombination. The human genome is not free from recombination events, so it is a mosaic like other organisms' genomes. Recent studies have demonstrated evidence for the integration of parasitic DNA into the human genome. Lateral transfer events have been accepted as major contributors of genome evolution in free-living bacteria. Furthermore, the accumulation of genomic sequence data provides evidence for extended genetic exchanges in intracellular bacteria and suggests that such events constitute an agent that promotes and maintains all bacterial species. Archaea and viruses also form chimeras containing primarily bacterial but also eukaryotic sequences. In addition to lateral transfers, orphan genes are indicative of the fact that gene creation is a permanent and unsettled phenomenon. Currently, a rhizome may more adequately represent the multiplicity and de novo creation of a genome. We wanted to confirm that the term “rhizome” in evolutionary biology applies to the entire cellular life history. This view of evolution should resemble a clump of roots representing the multiple origins of the repertoires of the genes of each species.
Affiliation(s)
- Kalliopi Georgiades
- Faculté de Médecine La Timone, Unité de Recherche en Maladies Infectieuses Tropical Emergentes (URMITE), CNRS-IRD UMR 6236-198, Université de la Méditerranée Marseille, France
|
22
|
Evolutionary link between the mycobacterial plasmid pAL5000 replication protein RepB and the extracytoplasmic function family of σ factors. J Bacteriol 2012; 194:1331-41. [PMID: 22247504 DOI: 10.1128/jb.06218-11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/30/2022] Open
Abstract
Mycobacterial plasmid pAL5000 represents a family of plasmids found mostly in the Actinobacteria. It replicates using two plasmid-encoded proteins, RepA and RepB. While BLAST searches indicate that RepA is a replicase family protein, the evolutionary connection of RepB cannot be established, as no significant homologous partner (E < 10⁻³) outside the RepB family can be identified. To obtain insight into the structure-function and evolutionary connections of RepB, an investigation was undertaken using homology modeling, phylogenetic, and mutational analysis methods. The results indicate that although they are synthesized from the same operon, the phylogenetic affinities of RepA and RepB differ. Thus, the operon may have evolved through random breaking and joining events. Homology modeling predicted the presence of a three-helical helix-turn-helix domain characteristic of region 4 of extracytoplasmic function (ECF) σ factors in the C-terminal region of RepB. At the N-terminal region, there is a helical stretch, which may be distantly related to region 3 of σ factors. Mutational analysis identified two arginines indispensable for RepB activity, one each located within the C- and N-terminal conserved regions. Apart from analyzing the domain organization of the protein, the significance of the presence of a highly conserved A/T-rich element within the RepB binding site was investigated. Mutational analysis revealed that although this motif does not bind RepB, its integrity is important for efficient DNA-protein interactions and replication to occur. The present investigation unravels the possibility that RepB-like proteins and their binding sites represent ancient DNA-protein interaction modules.
|
23
|
Abstract
Large-scale databases are available that contain homologous gene families constructed from hundreds of complete genome sequences from across the three domains of life. Here, we discuss the approaches of increasing complexity aimed at extracting information on the pattern and process of gene family evolution from such datasets. In particular, we consider the models that invoke processes of gene birth (duplication and transfer) and death (loss) to explain the evolution of gene families. First, we review birth-and-death models of family size evolution and their implications in light of the universal features of family size distribution observed across different species and the three domains of life. Subsequently, we proceed to recent developments on models capable of more completely considering information in the sequences of homologous gene families through the probabilistic reconciliation of the phylogenetic histories of individual genes with the phylogenetic history of the genomes in which they have resided. To illustrate the methods and results presented, we use data from the HOGENOM database, demonstrating that the distribution of homologous gene family sizes in the genomes of the eukaryota, archaea, and bacteria exhibits remarkably similar shapes. We show that these distributions are best described by models of gene family size evolution, where for individual genes the death (loss) rate is larger than the birth (duplication and transfer) rate but new families are continually supplied to the genome by a process of origination. Finally, we use probabilistic reconciliation methods to take into consideration additional information from gene phylogenies, and find that, for prokaryotes, the majority of birth events are the result of transfer.
|
24
|
Georgiades K, Merhej V, Raoult D. The influence of rickettsiologists on post-modern microbiology. Front Cell Infect Microbiol 2011; 1:8. [PMID: 22919574 PMCID: PMC3417371 DOI: 10.3389/fcimb.2011.00008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/05/2011] [Accepted: 10/10/2011] [Indexed: 11/29/2022] Open
Abstract
Many of the definitions in microbiology are currently false. We have reviewed the great denominations of microbiology and attempted to free microorganisms from the theories of the twentieth century. The presence of compartmentation and a nucleoid in Planctomycetes clearly calls into question the accuracy of the definitions of eukaryotes and prokaryotes. Archaea are viewed as prokaryotes resembling bacteria. However, the name archaea, suggesting an archaic origin of lifestyle, is inconsistent with the lifestyle of this family. Viruses are defined as small, filterable infectious agents, but giant viruses challenge the size criteria used for the definition of a virus. Pathogenicity does not require the acquisition of virulence factors (except for toxins), and in many cases, gene loss is significantly linked to the emergence of virulence. Species classification based on 16S rRNA is useless for taxonomic purposes of human pathogens, as a 2% divergence would classify all Rickettsiae within the same species and would not identify bacteria specialized for mammal infection. The use of metagenomics helps us to understand evolution and physiology by elucidating the structure, function, and interactions of the major microbial communities, but it neglects the minority populations. Finally, Darwin’s descent with modification theory, as represented by the tree of life, no longer matches our current genomic knowledge because genomics has revealed the occurrence of de novo-created genes and the mosaic structure of genomes; the Rhizome of life is therefore more appropriate.
Affiliation(s)
- Kalliopi Georgiades
- Unité de Recherche en Maladies Infectieuses Tropical Emergentes, CNRS-IRD UMR 6236-198, Université de la Méditerranée Marseille, France.
|
25
|
Donoghue MT, Keshavaiah C, Swamidatta SH, Spillane C. Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana. BMC Evol Biol 2011; 11:47. [PMID: 21332978 PMCID: PMC3049755 DOI: 10.1186/1471-2148-11-47] [Citation(s) in RCA: 126] [Impact Index Per Article: 9.7] [Received: 05/10/2010] [Accepted: 02/18/2011] [Indexed: 11/21/2022] Open
Abstract
Background All sequenced genomes contain a proportion of lineage-specific genes, which exhibit no sequence similarity to any genes outside the lineage. Despite their prevalence, the origins and functions of most lineage-specific genes remain largely unknown. As more genomes are sequenced opportunities for understanding evolutionary origins and functions of lineage-specific genes are increasing. Results This study provides a comprehensive analysis of the origins of lineage-specific genes (LSGs) in Arabidopsis thaliana that are restricted to the Brassicaceae family. In this study, lineage-specific genes within the nuclear (1761 genes) and mitochondrial (28 genes) genomes are identified. The evolutionary origins of two thirds of the lineage-specific genes within the Arabidopsis thaliana genome are also identified. Almost a quarter of lineage-specific genes originate from non-lineage-specific paralogs, while the origins of ~10% of lineage-specific genes are partly derived from DNA exapted from transposable elements (twice the proportion observed for non-lineage-specific genes). Lineage-specific genes are also enriched in genes that have overlapping CDS, which is consistent with such novel genes arising from overprinting. Over half of the subset of the 958 lineage-specific genes found only in Arabidopsis thaliana have alignments to intergenic regions in Arabidopsis lyrata, consistent with either de novo origination or differential gene loss and retention, with both evolutionary scenarios explaining the lineage-specific status of these genes. A smaller number of lineage-specific genes with an incomplete open reading frame across different Arabidopsis thaliana accessions are further identified as accession-specific genes, most likely of recent origin in Arabidopsis thaliana. 
Putative de novo origination for two of the Arabidopsis thaliana-only genes is identified via additional sequencing across accessions of Arabidopsis thaliana and closely related sister species lineages. We demonstrate that lineage-specific genes have high tissue specificity and low expression levels across multiple tissues and developmental stages. Finally, stress responsiveness is identified as a distinct feature of Brassicaceae-specific genes; where these LSGs are enriched for genes responsive to a wide range of abiotic stresses. Conclusion Improving our understanding of the origins of lineage-specific genes is key to gaining insights regarding how novel genes can arise and acquire functionality in different lineages. This study comprehensively identifies all of the Brassicaceae-specific genes in Arabidopsis thaliana and identifies how the majority of such lineage-specific genes have arisen. The analysis allows the relative importance (and prevalence) of different evolutionary routes to the genesis of novel ORFs within lineages to be assessed. Insights regarding the functional roles of lineage-specific genes are further advanced through identification of enrichment for stress responsiveness in lineage-specific genes, highlighting their likely importance for environmental adaptation strategies.
Affiliation(s)
- Mark Ta Donoghue
- Department of Biochemistry, University College Cork, Cork, Ireland
|
26
|
Capra JA, Pollard KS, Singh M. Novel genes exhibit distinct patterns of function acquisition and network integration. Genome Biol 2010; 11:R127. [PMID: 21187012 PMCID: PMC3046487 DOI: 10.1186/gb-2010-11-12-r127] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.2] [Received: 07/08/2010] [Revised: 11/18/2010] [Accepted: 12/27/2010] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Genes are created by a variety of evolutionary processes, some of which generate duplicate copies of an entire gene, while others rearrange pre-existing genetic elements or co-opt previously non-coding sequence to create genes with 'novel' sequences. These novel genes are thought to contribute to distinct phenotypes that distinguish organisms. The creation, evolution, and function of duplicated genes are well-studied; however, the genesis and early evolution of novel genes are not well-characterized. We developed a computational approach to investigate these issues by integrating genome-wide comparative phylogenetic analysis with functional and interaction data derived from small-scale and high-throughput experiments. RESULTS We examine the function and evolution of new genes in the yeast Saccharomyces cerevisiae. We observed significant differences in the functional attributes and interactions of genes created at different times and by different mechanisms. Novel genes are initially less integrated into cellular networks than duplicate genes, but they appear to gain functions and interactions more quickly than duplicates. Recently created duplicated genes show evidence of adapting existing functions to environmental changes, while young novel genes do not exhibit enrichment for any particular functions. Finally, we found a significant preference for genes to interact with other genes of similar age and origin. CONCLUSIONS Our results suggest a strong relationship between how and when genes are created and the roles they play in the cell. Overall, genes tend to become more integrated into the functional networks of the cell with time, but the dynamics of this process differ significantly between duplicate and novel genes.
Affiliation(s)
- John A Capra
- Gladstone Institutes, University of California, San Francisco, 1650 Owens St, San Francisco, CA 94158, USA.
|
27
|
Molecular signatures for the Crenarchaeota and the Thaumarchaeota. Antonie van Leeuwenhoek 2010; 99:133-57. [PMID: 20711675 DOI: 10.1007/s10482-010-9488-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 06/04/2010] [Accepted: 07/26/2010] [Indexed: 10/19/2022]
Abstract
Crenarchaeotes found in mesophilic marine environments were recently placed into a new phylum of Archaea called the Thaumarchaeota. However, very few molecular characteristics of this new phylum are currently known that can be used to distinguish them from the Crenarchaeota. In addition, their relationships to deep-branching archaeal lineages are unclear. We report here detailed analyses of protein sequences from Crenarchaeota and Thaumarchaeota that have identified many conserved signature indels (CSIs) and signature proteins (SPs) (i.e., proteins for which all significant BLAST hits are from these groups) that are specific for these archaeal groups. Of the identified signatures, 6 CSIs and 13 SPs are specific for the Crenarchaeota phylum; 6 CSIs and >250 SPs are uniquely found in various Thaumarchaeota (viz. Cenarchaeum symbiosum, Nitrosopumilus maritimus and a number of uncultured marine crenarchaeotes); and 3 CSIs and ~10 SPs are found in both Thaumarchaeota and Crenarchaeota species. Some of the molecular signatures are also present in Korarchaeum cryptofilum, which forms the independent phylum Korarchaeota. Although some of these molecular signatures suggest a distant shared ancestry between Thaumarchaeota and Crenarchaeota, our identification of large numbers of Thaumarchaeota-specific proteins and their deep branching between the Crenarchaeota and Euryarchaeota phyla in phylogenetic trees shows that they are distinct from both Crenarchaeota and Euryarchaeota in both genetic and phylogenetic terms. These observations support the placement of marine mesophilic archaea into the separate phylum Thaumarchaeota. Additionally, many CSIs and SPs have been found that are specific for different orders within Crenarchaeota (viz. Sulfolobales: 3 CSIs and 169 SPs; Thermoproteales: 5 CSIs and 25 SPs; Desulfurococcales: 4 SPs; and Sulfolobales and Desulfurococcales: 2 CSIs and 18 SPs).
The signatures described here provide novel means for distinguishing the Crenarchaeota and the Thaumarchaeota and for the classification of related and novel species in different environments. Functional studies on these signature proteins could lead to discovery of novel biochemical properties that are unique to these groups of archaea.
|
28
|
Ellrott K, Jaroszewski L, Li W, Wooley JC, Godzik A. Expansion of the protein repertoire in newly explored environments: human gut microbiome specific protein families. PLoS Comput Biol 2010; 6:e1000798. [PMID: 20532204 PMCID: PMC2880560 DOI: 10.1371/journal.pcbi.1000798] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 07/16/2009] [Accepted: 04/27/2010] [Indexed: 02/01/2023] Open
Abstract
The microbes that inhabit particular environments must be able to perform molecular functions that provide them with a competitive advantage to thrive in those environments. As most molecular functions are performed by proteins and are conserved between related proteins, we can expect that organisms successful in a given environmental niche would contain protein families that are specific for functions that are important in that environment. For instance, the human gut is rich in polysaccharides from the diet or secreted by the host, and is dominated by Bacteroides, whose genomes contain a highly expanded repertoire of protein families involved in carbohydrate metabolism. To identify other protein families that are specific to this environment, we investigated the distribution of protein families in the currently available human gut genomic and metagenomic data. Using an automated procedure, we identified a group of protein families strongly overrepresented in the human gut. These not only include many families described previously but also, interestingly, a large group of previously unrecognized protein families, which suggests that we still have much to discover about this environment. The identification and analysis of these families could provide us with new information about an environment critical to our health and well-being. Metagenomics provides a unique opportunity to sample the gene content of microbial communities adapted to specific environments and to study the correlations between the presence or absence of gene families that occur in organisms within that environment. Such studies provide detailed information about the adaptation of microbes to a given environment and, indirectly, provide clues about the most important molecular processes that are specific for that environment.
Having performed such an analysis for the community of the human distal gut, we report many new protein families and identify many others that are highly specific for this particular environment. The function of most of these proteins is unknown, which illustrates the extent of our ignorance about the organisms within this environment that are so important for human health and well-being.
Affiliation(s)
- Kyle Ellrott
- Joint Center for Structural Genomics, Bioinformatics Core, University of California San Diego, La Jolla, California, United States of America
| | - Lukasz Jaroszewski
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
| | - Weizhong Li
- California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, United States of America
| | - John C. Wooley
- Joint Center for Structural Genomics, Bioinformatics Core, University of California San Diego, La Jolla, California, United States of America
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
- California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, United States of America
| | - Adam Godzik
- Joint Center for Structural Genomics, Bioinformatics Core, University of California San Diego, La Jolla, California, United States of America
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, California, United States of America
- California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, United States of America
- Joint Center for Molecular Modeling, Burnham Institute for Medical Research, La Jolla, California, United States of America
|
29
|
Yomtovian I, Teerakulkittipong N, Lee B, Moult J, Unger R. Composition bias and the origin of ORFan genes. Bioinformatics 2010; 26:996-9. [PMID: 20231229 PMCID: PMC2853687 DOI: 10.1093/bioinformatics/btq093] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
Abstract
Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome. Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time. Contact: ron@biocoml.ls.biu.ac.il Supplementary information: Supplementary data are available at Bioinformatics online.
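The composition-bias idea in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the abstract does not state the exact formula, so the Euclidean distance between frequency vectors, the pooling of all proteome residues to form the average, and the names `composition` and `composition_bias` are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, in a fixed order for the frequency vectors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Return pooled amino acid frequencies over one or more sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore any character that is not a standard amino acid.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def composition_bias(sequence, proteome):
    """Deviation of one sequence's composition from the proteome average.

    Assumption: deviation is measured as the Euclidean distance between
    the two 20-dimensional frequency vectors.
    """
    seq_freq = composition([sequence])
    ref_freq = composition(proteome)
    return sqrt(sum((s - r) ** 2 for s, r in zip(seq_freq, ref_freq)))
```

Under these assumptions, a sequence whose composition matches the proteome average scores 0, and larger values indicate a stronger deviation.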
Affiliation(s)
- Inbal Yomtovian
- Department of Computer Sciences, Bar-Ilan University, Ramat-Gan 52900, Israel
|
30
|
Lin H, Moghe G, Ouyang S, Iezzoni A, Shiu SH, Gu X, Buell CR. Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol Biol 2010; 10:41. [PMID: 20152032 PMCID: PMC2829037 DOI: 10.1186/1471-2148-10-41] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Received: 05/08/2009] [Accepted: 02/12/2010] [Indexed: 11/25/2022] Open
Abstract
Background The availability of genome and transcriptome sequences for a number of species permits the identification and characterization of conserved as well as divergent genes such as lineage-specific genes which have no detectable sequence similarity to genes from other lineages. While genes conserved among taxa provide insight into the core processes among species, lineage-specific genes provide insights into evolutionary processes and biological functions that are likely clade or species specific. Results Comparative analyses using the Arabidopsis thaliana genome and sequences from 178 other species within the Plant Kingdom enabled the identification of 24,624 A. thaliana genes (91.7%) that were termed Evolutionary Conserved (EC) as defined by sequence similarity to a database entry as well as two sets of lineage-specific genes within A. thaliana. One of the A. thaliana lineage-specific gene sets shares sequence similarity only to sequences from species within the Brassicaceae family and is termed Conserved Brassicaceae-Specific Genes (914, 3.4%, CBSG). The other set of A. thaliana lineage-specific genes, the Arabidopsis Lineage-Specific Genes (1,324, 4.9%, ALSG), lacks sequence similarity to any sequence outside A. thaliana. While many CBSGs (76.7%) and ALSGs (52.9%) are transcribed, the majority of the CBSGs (76.1%) and ALSGs (94.4%) have no annotated function. Co-expression analysis indicated significant enrichment of the CBSGs and ALSGs in multiple functional categories suggesting their involvement in a wide range of biological functions. Subcellular localization prediction revealed that the CBSGs were significantly enriched in proteins targeted to the secretory pathway (412, 45.1%).
Among the 107 putatively secreted CBSGs with known functions, 67 encode a putative pollen coat protein or cysteine-rich protein with sequence similarity to the S-locus cysteine-rich protein that is the pollen determinant controlling allele specific pollen rejection in self-incompatible Brassicaceae species. Overall, the ALSGs and CBSGs were more highly methylated in floral tissue compared to the ECs. Single Nucleotide Polymorphism (SNP) analysis showed an elevated ratio of non-synonymous to synonymous SNPs within the ALSGs (1.99) and CBSGs (1.65) relative to the EC set (0.92), mainly caused by an elevated number of non-synonymous SNPs, indicating that they are fast-evolving at the protein sequence level. Conclusions Our analyses suggest that while a significant fraction of the A. thaliana proteome is conserved within the Plant Kingdom, evolutionarily distinct sets of genes that may function in defining biological processes unique to these lineages have arisen within the Brassicaceae and A. thaliana.
Affiliation(s)
- Haining Lin
- Department of Plant Biology, Michigan State University, 166 Plant Biology Building, East Lansing, MI 48824, USA
|
31
|
Gupta RS, Mathews DW. Signature proteins for the major clades of Cyanobacteria. BMC Evol Biol 2010; 10:24. [PMID: 20100331 PMCID: PMC2823733 DOI: 10.1186/1471-2148-10-24] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.5] [Received: 04/27/2009] [Accepted: 01/25/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The phylogeny and taxonomy of cyanobacteria is currently poorly understood due to a paucity of reliable markers for identification and circumscription of its major clades. RESULTS A combination of phylogenomic and protein signature based approaches was used to characterize the major clades of cyanobacteria. Phylogenetic trees were constructed for 44 cyanobacteria based on 44 conserved proteins. In parallel, BLASTP searches were carried out on each ORF in the genomes of Synechococcus WH8102, Synechocystis PCC6803, Nostoc PCC7120, Synechococcus JA-3-3Ab, Prochlorococcus MIT9215 and Prochlorococcus marinus subsp. marinus CCMP1375 to identify proteins that are specific for various main clades of cyanobacteria. These studies have identified 39 proteins that are specific for all (or most) cyanobacteria and large numbers of proteins for other cyanobacterial clades. The identified signature proteins include: (i) 14 proteins for a deep branching clade (Clade A) of Gloeobacter violaceus and two diazotrophic Synechococcus strains (JA-3-3Ab and JA2-3-B'a); (ii) 5 proteins that are present in all other cyanobacteria except those from Clade A; (iii) 60 proteins that are specific for a clade (Clade C) consisting of various marine unicellular cyanobacteria (viz. Synechococcus and Prochlorococcus); (iv) 14 and 19 signature proteins that are specific for the Clade C Synechococcus and Prochlorococcus strains, respectively; (v) 67 proteins that are specific for the Low B/A ecotype Prochlorococcus strains, containing a lower ratio of chl b/a2 and adapted to growth at high light intensities; (vi) 65 and 8 proteins that are specific for the Nostocales and Chroococcales orders, respectively; and (vii) 22 and 9 proteins that are uniquely shared by various Nostocales and Oscillatoriales orders, or by these two orders and the Chroococcales, respectively.
We also describe 3 conserved indels in flavoprotein, heme oxygenase and protochlorophyllide oxidoreductase proteins that are specific for either Clade C cyanobacteria or for various subclades of Prochlorococcus. Many other conserved indels for cyanobacterial clades have been described recently. CONCLUSIONS These signature proteins and indels provide novel means for circumscription of various cyanobacterial clades in clear molecular terms. Their functional studies should lead to discovery of novel properties that are unique to these groups of cyanobacteria.
Affiliation(s)
- Radhey S Gupta
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada.
|
32
|
Mazza R, Strozzi F, Caprera A, Ajmone-Marsan P, Williams JL. The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species. BMC Genomics 2009; 10:604. [PMID: 20003425 PMCID: PMC2808326 DOI: 10.1186/1471-2164-10-604] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/13/2009] [Accepted: 12/14/2009] [Indexed: 11/10/2022] Open
Abstract
Background With the rapid growth in the availability of genome sequence data, the automated identification of orthologous genes between species (orthologs) is of fundamental importance to facilitate functional annotation and studies on comparative and evolutionary genomics. Genes with no apparent orthologs between the bovine and human genome may be responsible for major differences between the species; however, such genes are often neglected in functional genomics studies. Results A BLAST-based method was exploited to explore the current annotation and orthology predictions in Ensembl. Genes with no orthologs between the two genomes were classified into groups based on alignments, ontology, manual curation and publicly available information. Starting from a high quality and specific set of orthology predictions, as provided by Ensembl, hidden relationships between genes and genomes of different mammalian species were unveiled using a highly sensitive approach, based on sequence similarity and genomic comparison. Conclusions The analysis identified 3,801 bovine genes with no orthologs in human and 1,010 human genes with no orthologs in cow, among which 411 and 43 genes, respectively, had no match at all in the other species. Most of the apparently non-orthologous genes may potentially have orthologs which were missed in the annotation process, despite having a high percentage of identity, because of differences in gene length and structure. The comparative analysis reported here identified gene variants, new genes and species-specific features and gave an overview of the other side of orthology which may help to improve the annotation of the bovine genome and the knowledge of structural differences between species.
Affiliation(s)
- Raffaele Mazza
- Istituto di Zootecnica, Università Cattolica del Sacro Cuore, 29100 Piacenza, Italy.
|
33
|
Ekman D, Elofsson A. Identifying and quantifying orphan protein sequences in fungi. J Mol Biol 2009; 396:396-405. [PMID: 19944701 DOI: 10.1016/j.jmb.2009.11.053] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.5] [Received: 03/16/2009] [Revised: 11/17/2009] [Accepted: 11/20/2009] [Indexed: 11/15/2022]
Abstract
For large regions of many proteins, and even entire proteins, no homology to known domains or proteins can be detected. These sequences are often referred to as orphans. Surprisingly, it has been reported that the large number of orphans is sustained in spite of a rapid increase of available genomic sequences. However, it is believed that de novo creation of coding sequences is rare in comparison to mechanisms such as domain shuffling and gene duplication; hence, most sequences should have homologs in other genomes. To investigate this, the sequences of 19 complete fungi genomes were compared. By using the phylogenetic relationship between these genomes, we could identify potentially de novo created orphans in Saccharomyces cerevisiae. We found that only a small fraction, <2%, of the S. cerevisiae proteome is orphan, which confirms that de novo creation of coding sequences is indeed rare. Furthermore, we found it necessary to compare the most closely related species to distinguish between de novo created sequences and rapidly evolving sequences where homologs are present but cannot be detected. Next, the orphan proteins (OPs) and orphan domains (ODs) were characterized. First, it was observed that both OPs and ODs are short. In addition, at least some of the OPs have been shown to be functional in experimental assays, showing that they are not pseudogenes. Furthermore, in contrast to what has been reported before and what is seen for older orphans, S. cerevisiae specific ODs and proteins are not more disordered than other proteins. This might indicate that many of the older, and earlier classified, orphans indeed are fast-evolving sequences. Finally, >90% of the detected ODs are located at the protein termini, which suggests that these orphans could have been created by mutations that have affected the start or stop codons.
Affiliation(s)
- Diana Ekman
- Stockholm Bioinformatics Center/Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
|
34
|
Cortez D, Forterre P, Gribaldo S. A hidden reservoir of integrative elements is the major source of recently acquired foreign genes and ORFans in archaeal and bacterial genomes. Genome Biol 2009; 10:R65. [PMID: 19531232 PMCID: PMC2718499 DOI: 10.1186/gb-2009-10-6-r65] [Citation(s) in RCA: 95] [Impact Index Per Article: 6.3] [Received: 03/25/2009] [Revised: 06/04/2009] [Accepted: 06/16/2009] [Indexed: 11/10/2022] Open
Abstract
A large-scale survey of potential recently acquired integrative elements in 119 archaeal and bacterial genomes reveals that many recently acquired genes have originated from integrative elements. Background Archaeal and bacterial genomes contain a number of genes of foreign origin that arose from recent horizontal gene transfer, but the role of integrative elements (IEs), such as viruses, plasmids, and transposable elements, in this process has not been extensively quantified. Moreover, it is not known whether IEs play an important role in the origin of ORFans (open reading frames without matches in current sequence databases), whose proportion remains stable despite the growing number of complete sequenced genomes. Results We have performed a large-scale survey of potential recently acquired IEs in 119 archaeal and bacterial genomes. We developed an accurate in silico Markov model-based strategy to identify clusters of genes that show atypical sequence composition (clusters of atypical genes or CAGs) and are thus likely to be recently integrated foreign elements, including IEs. Our method identified a high number of new CAGs. Probabilistic analysis of gene content indicates that 56% of these new CAGs are likely IEs, whereas only 7% likely originated via horizontal gene transfer from distant cellular sources. Thirty-four percent of CAGs remain unassigned, which may reflect a still poor sampling of IEs associated with bacterial and archaeal diversity. Moreover, our study contributes to the issue of the origin of ORFans, because 39% of these are found inside CAGs, many of which likely represent recently acquired IEs. Conclusions Our results strongly indicate that archaeal and bacterial genomes contain an impressive proportion of recently acquired foreign genes (including ORFans) coming from a still largely unexplored reservoir of IEs.
Affiliation(s)
- Diego Cortez
- Institut Pasteur, Département de Microbiologie, Unité de Biologie Moléculaire du Gène chez les Extrêmophiles, Paris, France.
|
35
|
Abstract
Both supervised and unsupervised neural networks have been applied to the prediction of protein structure and function. Here, we focus on feedforward neural networks and describe how these learning machines can be applied to protein prediction. We discuss how to select an appropriate data set, how to choose and encode protein features into the neural network input, and how to assess the predictor's performance.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
|
36
|
Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, Albà MM. Origin of primate orphan genes: a comparative genomics approach. Mol Biol Evol 2008; 26:603-12. [PMID: 19064677 DOI: 10.1093/molbev/msn281] [Citation(s) in RCA: 182] [Impact Index Per Article: 11.4] [Indexed: 12/18/2022] Open
Abstract
Genomes contain a large number of genes that do not have recognizable homologues in other species and that are likely to be involved in important species-specific adaptive processes. The origin of many such "orphan" genes remains unknown. Here we present the first systematic study of the characteristics and mechanisms of formation of primate-specific orphan genes. We determine that codon usage values for most orphan genes fall within the bulk of the codon usage distribution of bona fide human proteins, supporting their current protein-coding annotation. We also show that primate orphan genes display distinctive features in relation to genes of wider phylogenetic distribution: higher tissue specificity, more rapid evolution, and shorter peptide size. We estimate that around 24% are highly divergent members of mammalian protein families. Interestingly, around 53% of the orphan genes contain sequences derived from transposable elements (TEs) and are mostly located in primate-specific genomic regions. This indicates frequent recruitment of TEs as part of novel genes. Finally, we also obtain evidence that a small fraction of primate orphan genes, around 5.5%, might have originated de novo from mammalian noncoding genomic regions.
Affiliation(s)
- Macarena Toll-Riera
- Evolutionary Genomics Group, Biomedical Informatics Research Programme, Fundació Institut Municipal d'Investigació Mèdica, Barcelona, Spain
|
37
|
Abstract
Bacteria experience a continual influx of novel genetic material from a wide range of sources and yet their genomes remain relatively small. This aspect of bacterial evolution indicates that most newly arriving sequences are rapidly eliminated; however, numerous new genes persist, as evident from the presence of unique genes in almost all bacterial genomes. This review summarizes the methods for identifying new genes in bacterial genomes and examines the features that promote the retention and elimination of these evolutionary novelties.
Affiliation(s)
- Chih-Horng Kuo
- Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
|
38
|
Luhua S, Ciftci-Yilmaz S, Harper J, Cushman J, Mittler R. Enhanced tolerance to oxidative stress in transgenic Arabidopsis plants expressing proteins of unknown function. Plant Physiol 2008; 148:280-92. [PMID: 18614705 PMCID: PMC2528079 DOI: 10.1104/pp.108.124875] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.3] [Received: 06/16/2008] [Accepted: 07/02/2008] [Indexed: 05/19/2023]
Abstract
Over one-quarter of all plant genes encode proteins of unknown function that can be further classified as proteins with obscure features (POFs), which lack currently defined motifs or domains, or proteins with defined features, which contain at least one previously defined domain or motif. Although empirical data in the form of transcriptome and proteome profiling suggest that many of these proteins play important roles in plants, their functional characterization remains one of the main challenges in modern biology. To begin the functional annotation of proteins with unknown function, which are involved in the oxidative stress response of Arabidopsis (Arabidopsis thaliana), we generated transgenic Arabidopsis plants that constitutively expressed 23 different POFs (four of which were specific to Arabidopsis) and 18 different proteins with defined features. All were previously found to be expressed in response to oxidative stress in Arabidopsis. Transgenic plants were tested for their tolerance to oxidative stress imposed by paraquat or t-butyl hydroperoxide, or were subjected to osmotic, salinity, cold, and heat stresses. More than 70% of all expressed proteins conferred tolerance to oxidative stress. In contrast, >90% of the expressed proteins did not confer enhanced tolerance to the other abiotic stresses tested, and approximately 50% rendered plants more susceptible to osmotic or salinity stress. Two Arabidopsis-specific POFs, and an Arabidopsis and Brassica-specific protein of unknown function, conferred enhanced tolerance to oxidative stress. Our findings suggest that tolerance to oxidative stress involves mechanisms and pathways that are unknown at present, including some that are specific to Arabidopsis or the Brassicaceae.
Affiliation(s)
- Song Luhua
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
|
39
|
Peregrín-Alvarez JM, Parkinson J. The global landscape of sequence diversity. Genome Biol 2008; 8:R238. [PMID: 17996061 PMCID: PMC2258180 DOI: 10.1186/gb-2007-8-11-r238] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/25/2007] [Revised: 10/18/2007] [Accepted: 11/08/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Systematic comparisons between genomic sequence datasets have revealed a wide spectrum of sequence specificity from sequences that are highly conserved to those that are specific to individual species. Due to the limited number of fully sequenced eukaryotic genomes, analyses of this spectrum have largely focused on prokaryotes. Combining existing genomic datasets with the partial genomes of 193 eukaryotes derived from collections of expressed sequence tags, we performed a quantitative analysis of the sequence specificity spectrum to provide a global view of the origins and extent of sequence diversity across the three domains of life. RESULTS Comparisons with prokaryotic datasets reveal a greater genetic diversity within eukaryotes that may be related to differences in modes of genetic inheritance. Mapping this diversity within a phylogenetic framework revealed that the majority of sequences are either highly conserved or specific to the species or taxon from which they derive. Between these two extremes, several evolutionary landmarks consisting of large numbers of sequences conserved within specific taxonomic groups were identified. For example, 8% of sequences derived from metazoan species are specific and conserved within the metazoan lineage. Many of these sequences likely mediate metazoan specific functions, such as cell-cell communication and differentiation. CONCLUSION Through the use of partial genome datasets, this study provides a unique perspective of sequence conservation across the three domains of life. The provision of taxon restricted sequences should prove valuable for future computational and biochemical analyses aimed at understanding evolutionary and functional relationships.
Affiliation(s)
- José Manuel Peregrín-Alvarez
- Molecular Structure and Function, Hospital for Sick Children, 555 University Avenue, Toronto, ON M5G 1X8, Canada.
|
40
|
Wasmuth J, Schmid R, Hedley A, Blaxter M. On the extent and origins of genic novelty in the phylum Nematoda. PLoS Negl Trop Dis 2008; 2:e258. [PMID: 18596977 PMCID: PMC2432500 DOI: 10.1371/journal.pntd.0000258] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.3] [Received: 02/13/2008] [Accepted: 06/09/2008] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND The phylum Nematoda is biologically diverse, including parasites of plants and animals as well as free-living taxa. Underpinning this diversity will be commensurate diversity in expressed genes, including gene sets associated specifically with evolution of parasitism. METHODS AND FINDINGS Here we have analyzed the extensive expressed sequence tag data (available for 37 nematode species, most of which are parasites) and define over 120,000 distinct putative genes from which we have derived robust protein translations. Combined with the complete proteomes of Caenorhabditis elegans and Caenorhabditis briggsae, these proteins have been grouped into 65,000 protein families that in turn contain 40,000 distinct protein domains. We have mapped the occurrence of domains and families across the Nematoda and compared the nematode data to that available for other phyla. Gene loss is common, and in particular we identify nearly 5,000 genes that may have been lost from the lineage leading to the model nematode C. elegans. We find a preponderance of novelty, including 56,000 nematode-restricted protein families and 26,000 nematode-restricted domains. Mapping of the latest time-of-origin of these new families and domains across the nematode phylogeny revealed ongoing evolution of novelty. A number of genes from parasitic species had signatures of horizontal transfer from their host organisms, and parasitic species had a greater proportion of novel, secreted proteins than did free-living ones. CONCLUSIONS These classes of genes may underpin parasitic phenotypes, and thus may be targets for development of effective control measures.
Affiliation(s)
- James Wasmuth
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Program for Molecular Structure and Function, Hospital for Sick Children, Toronto, Ontario, Canada
| | - Ralf Schmid
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Department of Biochemistry, University of Leicester, Leicester, United Kingdom
| | - Ann Hedley
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
| | - Mark Blaxter
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
|
41
|
van Passel MWJ, Marri PR, Ochman H. The emergence and fate of horizontally acquired genes in Escherichia coli. PLoS Comput Biol 2008; 4:e1000059. [PMID: 18404206 PMCID: PMC2275313 DOI: 10.1371/journal.pcbi.1000059] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.3] [Received: 10/03/2007] [Accepted: 03/14/2008] [Indexed: 11/18/2022]
Abstract
Bacterial species, and even strains within species, can vary greatly in their gene contents and metabolic capabilities. We examine the evolution of this diversity by assessing the distribution and ancestry of each gene in 13 sequenced isolates of Escherichia coli and Shigella. We focus on the emergence and demise of two specific classes of genes, ORFans (genes with no homologs in present databases) and HOPs (genes with distant homologs), since these genes, in contrast to most conserved ancestral sequences, are known to be a major source of the novel features in each strain. We find that the rates of gain and loss of these genes vary greatly among strains as well as through time, and that ORFans and HOPs show very different behavior with respect to their emergence and demise. Although HOPs, which mostly represent gene acquisitions from other bacteria, originate more frequently, ORFans are much more likely to persist. This difference suggests that many adaptive traits are conferred by completely novel genes that do not originate in other bacterial genomes. With respect to the demise of these acquired genes, we find that strains of Shigella lose genes, both by disruption events and by complete removal, at accelerated rates.
Affiliation(s)
- Mark W J van Passel
- Department of Biochemistry and Molecular Biophysics, University of Arizona, Tucson, Arizona, United States of America.
|
42
|
|
43
|
Yin Y, Fischer D. Identification and investigation of ORFans in the viral world. BMC Genomics 2008; 9:24. [PMID: 18205946 PMCID: PMC2245933 DOI: 10.1186/1471-2164-9-24] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.9] [Received: 03/09/2007] [Accepted: 01/19/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genome-wide studies have already shed light on the evolution and enormous diversity of the viral world. Nevertheless, one of the unresolved mysteries in comparative genomics today is the abundance of ORFans (ORFs with no detectable sequence similarity to any other ORF in the databases). Recently, studies attempting to understand the origin and functions of bacterial ORFans have been reported. Here we present a first genome-wide identification and analysis of ORFans in the viral world, with a focus on bacteriophages. RESULTS Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes. Like prokaryotic ORFans, viral ORFans are shorter and have a lower GC content than non-ORFans. Nevertheless, a statistically significantly lower GC content is found in only a minority of viruses. By focusing on phages, we find that 38.4% of phage ORFs have no homologs in other phages, and 30.1% have homologs in neither the viral nor the prokaryotic world. Phages with different host ranges have different percentages of ORFans, reflecting different sampling status and suggesting various diversities. Similarity searches of the phage ORFeome (ORFans and non-ORFans) against prokaryotic genomes show that almost half of the phage ORFs have prokaryotic homologs, suggesting the major role that horizontal transfer plays in bacterial evolution. Surprisingly, the percentage of phage ORFans with prokaryotic homologs is only 18.7%. This suggests that phage ORFans play a lesser role in horizontal transfer to prokaryotes, but may be among the major players contributing to the vast phage diversity. CONCLUSION Although the current sampling of viral genomes is extremely low, ORFans and near-ORFans are likely to continue to grow in number as more genomes are sequenced. The abundance of phage ORFans may be partially due to the expected vast viral diversity, and may be instrumental in understanding viral evolution. The functions, origins and fates of the majority of viral ORFans remain a mystery. Further computational and experimental studies are likely to shed light on the mechanisms that have given rise to so many bacterial and viral ORFans.
Affiliation(s)
- Yanbin Yin
- Computer Science and Engineering Dept, 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000, USA.
|
44
|
Gupta RS, Mok A. Phylogenomics and signature proteins for the alpha proteobacteria and its main groups. BMC Microbiol 2007; 7:106. [PMID: 18045498 PMCID: PMC2241609 DOI: 10.1186/1471-2180-7-106] [Citation(s) in RCA: 102] [Impact Index Per Article: 6.0] [Received: 07/20/2007] [Accepted: 11/28/2007] [Indexed: 01/11/2023]
Abstract
Background Alpha proteobacteria are one of the largest and most extensively studied groups within bacteria. However, for these bacteria as a whole and for all of their major subgroups (viz. Rhizobiales, Rhodobacterales, Rhodospirillales, Rickettsiales, Sphingomonadales and Caulobacterales), very few or no distinctive molecular or biochemical characteristics are known. Results We have carried out comprehensive phylogenomic analyses by means of Blastp and PSI-Blast searches on the open reading frames in the genomes of several α-proteobacteria (viz. Bradyrhizobium japonicum, Brucella suis, Caulobacter crescentus, Gluconobacter oxydans, Mesorhizobium loti, Nitrobacter winogradskyi, Novosphingobium aromaticivorans, Rhodobacter sphaeroides 2.4.1, Silicibacter sp. TM1040, Rhodospirillum rubrum and Wolbachia (Drosophila) endosymbiont). These studies have identified several proteins that are distinctive characteristics of all α-proteobacteria, as well as numerous proteins that are unique repertoires of all of their main orders (viz. Rhizobiales, Rhodobacterales, Rhodospirillales, Rickettsiales, Sphingomonadales and Caulobacterales) and many families (viz. Rickettsiaceae, Anaplasmataceae, Rhodospirillaceae, Acetobacteraceae, Bradyrhizobiaceae, Brucellaceae and Bartonellaceae). Many other proteins that are present at different phylogenetic depths in α-proteobacteria provide important information regarding their evolution. The evolutionary relationships among α-proteobacteria as deduced from these studies are in excellent agreement with their branching pattern in the phylogenetic trees and character compatibility cliques based on concatenated sequences for many conserved proteins. These studies provide evidence that the major groups within α-proteobacteria have diverged in the following order: (Rickettsiales(Rhodospirillales (Sphingomonadales (Rhodobacterales (Caulobacterales-Parvularculales (Rhizobiales)))))). We also describe two conserved inserts in DNA Gyrase B and RNA polymerase beta subunit that are distinctive characteristics of the Sphingomonadales and Rhodospirillales species, respectively. The results presented here also provide support for the grouping of Hyphomonadaceae and Parvularcula species with the Caulobacterales and the placement of Stappia aggregata with the Rhizobiaceae group. Conclusion The α-proteobacteria-specific proteins and indels described here provide novel and powerful means for the taxonomic, biochemical and molecular biological studies on these bacteria. Their functional studies should prove helpful in identifying novel biochemical and physiological characteristics that are unique to these bacteria.
Affiliation(s)
- Radhey S Gupta
- Department of Biochemistry and Biomedical Science, McMaster University, Hamilton L8N3Z5, Canada.
|
45
|
Chin KH, Ruan SK, Wang AHJ, Chou SH. XC5848, an ORFan protein from Xanthomonas campestris, adopts a novel variant of Sm-like motif. Proteins 2007; 68:1006-10. [PMID: 17546661 DOI: 10.1002/prot.21375] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
Affiliation(s)
- Ko-Hsin Chin
- Institute of Biochemistry, National Chung-Hsing University, Taichung, 40227, Taiwan, Republic of China
|
46
|
Jain M, Khurana P, Tyagi AK, Khurana JP. Genome-wide analysis of intronless genes in rice and Arabidopsis. Funct Integr Genomics 2007; 8:69-78. [PMID: 17578610 DOI: 10.1007/s10142-007-0052-9] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.5] [Received: 02/03/2007] [Revised: 04/07/2007] [Accepted: 05/06/2007] [Indexed: 10/23/2022]
Abstract
Intronless genes, a characteristic feature of prokaryotes, constitute a significant portion of the eukaryotic genomes. Our analysis revealed the presence of 11,109 (19.9%) and 5,846 (21.7%) intronless genes in rice and Arabidopsis genomes, respectively, belonging to different cellular role and gene ontology categories. The distribution and conservation of rice and Arabidopsis intronless genes among different taxonomic groups have been analyzed. A total of 301 and 296 intronless genes from rice and Arabidopsis, respectively, are conserved among organisms representing the three major domains of life, i.e., archaea, bacteria, and eukaryotes. These evolutionarily conserved proteins are predicted to be involved in housekeeping cellular functions. Interestingly, among the 68% of rice and 77% of Arabidopsis intronless genes present only in eukaryotic genomes, approximately 51% and 57% genes have orthologs only in plants, and thus may represent the plant-specific genes. Furthermore, 831 and 144 intronless genes of rice and Arabidopsis, respectively, referred to as ORFans, do not exhibit homology to any of the genes in the database and may perform species-specific functions. These data can serve as a resource for further comparative, evolutionary, and functional analysis of intronless genes in plants and other organisms.
Affiliation(s)
- Mukesh Jain
- Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi 110 021, India
|
47
|
Gollery M, Harper J, Cushman J, Mittler T, Girke T, Zhu JK, Bailey-Serres J, Mittler R. What makes species unique? The contribution of proteins with obscure features. Genome Biol 2007; 7:R57. [PMID: 16859532 PMCID: PMC1779552 DOI: 10.1186/gb-2006-7-7-r57] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Received: 03/06/2006] [Revised: 04/28/2006] [Accepted: 06/27/2006] [Indexed: 11/23/2022]
Abstract
An analysis of proteins with obscure features in ten eukaryotic genomes revealed that the majority are species-specific. Background Proteins with obscure features (POFs), which lack currently defined motifs or domains, represent between 18% and 38% of a typical eukaryotic proteome. To evaluate the contribution of this class of proteins to the diversity of eukaryotes, we performed a comparative analysis of the predicted proteomes derived from 10 different sequenced genomes, including budding and fission yeast, worm, fly, mosquito, Arabidopsis, rice, mouse, rat, and human. Results Only 1,650 protein groups were found to be conserved among these proteomes (BLAST E-value threshold of 10^-6). Of these, only three were designated as POFs. Surprisingly, we found that, on average, 60% of the POFs identified in these 10 proteomes (44,236 in total) were species specific. In contrast, only 7.5% of the proteins with defined features (PDFs) were species specific (17,554 in total). As a group, POFs appear similar to PDFs in their relative contribution to biological functions, as indicated by their expression, participation in protein-protein interactions and association with mutant phenotypes. However, POFs have more predicted disordered structure than PDFs, implying that they may exhibit preferential involvement in species-specific regulatory and signaling networks. Conclusion Because the majority of eukaryotic POFs are not well conserved, and by definition do not have defined domains or motifs upon which to formulate a functional working hypothesis, understanding their biochemical and biological functions will require species-specific investigations.
Affiliation(s)
- Martin Gollery
- Department of Biochemistry and Molecular Biology, University Of Nevada, Reno, NV 89557, USA
| | - Jeff Harper
- Department of Biochemistry and Molecular Biology, University Of Nevada, Reno, NV 89557, USA
| | - John Cushman
- Department of Biochemistry and Molecular Biology, University Of Nevada, Reno, NV 89557, USA
| | - Taliah Mittler
- Department of Biochemistry and Molecular Biology, University Of Nevada, Reno, NV 89557, USA
| | - Thomas Girke
- Center for Plant Cell Biology, University Of California, Riverside, CA 92521, USA
| | - Jian-Kang Zhu
- Center for Plant Cell Biology, University Of California, Riverside, CA 92521, USA
| | - Julia Bailey-Serres
- Center for Plant Cell Biology, University Of California, Riverside, CA 92521, USA
| | - Ron Mittler
- Department of Biochemistry and Molecular Biology, University Of Nevada, Reno, NV 89557, USA
|
48
|
Luban S, Kihara D. Comparative genomics of small RNAs in bacterial genomes. OMICS 2007; 11:58-73. [PMID: 17411396 DOI: 10.1089/omi.2006.0005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
In recent years, various families of small non-coding RNAs (sRNAs) have been discovered by experimental and computational approaches, both in bacterial and eukaryotic genomes. Although most of them await elucidation of their function, it has been reported that some play important roles in gene regulation. Here we carried out a comparative genomics analysis of possible sRNAs that are computationally identified in 30 bacterial genomes from gamma- and alpha-proteobacteria and Deinococcus radiodurans. Identified sRNAs are clustered by a complete-linkage clustering method to assess conservation among the organisms. On average, sRNAs are found in approximately 30% of intergenic regions of each genome sequence. Of these, 25.7% are conserved among three or more organisms. Approximately 60% of the conserved sRNAs are not located in orthologous intergenic regions, implying that sRNAs may have shuffled their positions in genomes. The current study implies that sRNAs may be involved in a more extensive range of functions in bacteria.
Affiliation(s)
- Stan Luban
- Department of Computer Science, Markey Center for Structural Biology, West Lafayette, Indiana 47907, USA
|
49
|
Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007; 8:86. [PMID: 17349043 PMCID: PMC1829165 DOI: 10.1186/1471-2105-8-86] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 10/31/2006] [Accepted: 03/09/2007] [Indexed: 11/25/2022]
Abstract
Background Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is focussed towards structurally characterising protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families. Results In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterised families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterised domain families while also pursuing additional targets from large, structurally characterised protein superfamilies. Conclusion This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
| | - Tony A Lewis
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
| | - Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
|
50
|
Abstract
Background Bacterial genomes develop new mechanisms to tide them over the imposing conditions they encounter during the course of their evolution. Acquisition of new genes by lateral gene transfer (LGT) may be one of the dominant ways of adaptation in bacterial genome evolution. Lateral gene transfer provides the bacterial genome with a new set of genes that help it to explore and adapt to new ecological niches. Methods A maximum likelihood analysis was done on the five sequenced corynebacterial genomes to model the rates of gene insertions/deletions at various depths of the phylogeny. Results The study shows that most of the laterally acquired genes are transient and the inferred rates of gene movement are higher on the external branches of the phylogeny and decrease as the phylogenetic depth increases. The newly acquired genes are under relaxed selection and evolve faster than their older counterparts. Analysis of some of the functionally characterised LGTs in each species has indicated that they may have a possible adaptive role. Conclusion The five corynebacterial genomes sequenced to date have evolved by acquiring between 8% and 14% of their genomes by LGT, and some of these genes may have a role in adaptation.
Affiliation(s)
- Pradeep Reddy Marri
- Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada
| | - Weilong Hao
- Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada
| | - G Brian Golding
- Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada
|