251
Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D, Mair RD, Tatti KM, Tondella ML, Harcourt BH, Mayer LW, Jordan IK. A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 2010; 26:1819-26. [PMID: 20519285 PMCID: PMC2905547 DOI: 10.1093/bioinformatics/btq284] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. RESULTS We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. AVAILABILITY AND IMPLEMENTATION The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems.
Affiliation(s)
- Andrey O Kislyuk
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
252
Hawkins T, Chitale M, Kihara D. Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC Bioinformatics 2010; 11:265. [PMID: 20482861 PMCID: PMC2882935 DOI: 10.1186/1471-2105-11-265] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 10/15/2009] [Accepted: 05/19/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance. RESULTS Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences, Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (malaria). The number of annotated genes is increased by PFP to over 90% for all of the genomes. Using the large coverage of the function annotation, we introduced the functional similarity networks which represent the functional space of the proteomes. Four different functional similarity networks are constructed for each proteome, one each by considering similarity in a single Gene Ontology (GO) category, i.e. Biological Process, Cellular Component, and Molecular Function, and another one by considering overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks by showing a higher clustering degree exponent value and thus has a higher tendency to be hierarchical. In addition, examining function assignments to the protein-protein interaction network and local regions of genomes has identified numerous cases where subnetworks or local regions have functionally coherent proteins. These results will help in interpreting interactions of proteins and gene orders in a genome. Several examples of both analyses are highlighted.
CONCLUSION The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on a researcher's ability to interpret the immense biological data that are being generated today. The newly introduced functional similarity networks of the three organisms show different network properties as compared with the protein-protein interaction networks.
Affiliation(s)
- Troy Hawkins
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
253
Jung J, Yi G, Sukno SA, Thon MR. PoGO: Prediction of Gene Ontology terms for fungal proteins. BMC Bioinformatics 2010; 11:215. [PMID: 20429880 PMCID: PMC2882390 DOI: 10.1186/1471-2105-11-215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 01/20/2010] [Accepted: 04/29/2010] [Indexed: 11/10/2022] Open
Abstract
Background Automated protein function prediction methods are the only practical approach for assigning functions to genes obtained from model organisms. Many of the previously reported function annotation methods are of limited utility for fungal protein annotation. They are often trained on only one species, are not available for high-volume data processing, or require the use of data derived by experiments such as microarray analysis. To meet the increasing need for high throughput, automated annotation of fungal genomes, we have developed a tool for annotating fungal protein sequences with terms from the Gene Ontology. Results We describe a classifier called PoGO (Prediction of Gene Ontology terms) that uses statistical pattern recognition methods to assign Gene Ontology (GO) terms to proteins from filamentous fungi. PoGO is organized as a meta-classifier in which each evidence source (sequence similarity, protein domains, protein structure and biochemical properties) is used to train independent base-level classifiers. The outputs of the base classifiers are used to train a meta-classifier, which provides the final assignment of GO terms. An independent classifier is trained for each GO term, making the system amenable to updating without having to re-train the whole system. The resulting system is robust. It provides better accuracy and can assign GO terms to a higher percentage of unannotated protein sequences than other methods that we tested. Conclusions Our annotation system overcomes many of the shortcomings that we found in other methods. We also provide a web server where users can submit protein sequences to be annotated.
Affiliation(s)
- Jaehee Jung
- Centro Hispano-Luso de Investigaciones Agrarias (CIALE), Department of Microbiology and Genetics, University of Salamanca, Villamayor 37185, Spain
254
Jiang SY, Ma Z, Ramachandran S. Evolutionary history and stress regulation of the lectin superfamily in higher plants. BMC Evol Biol 2010; 10:79. [PMID: 20236552 PMCID: PMC2846932 DOI: 10.1186/1471-2148-10-79] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Received: 08/27/2009] [Accepted: 03/18/2010] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Lectins are a class of carbohydrate-binding proteins. They play roles in various biological processes. However, little is known about their evolutionary history and their functions in plant stress regulation. The availability of full genome sequences from various plant species makes it possible to perform a whole-genome exploration to further understand their biological functions. RESULTS Higher plant genomes encode large numbers of lectin proteins. Based on their domain structures and phylogenetic analyses, a new classification system has been proposed. In this system, 12 different families have been classified and four of them consist of recently identified plant lectin members. Further analyses show that some lectin families exhibit species-specific expansion and rapid birth-and-death evolution. Tandem and segmental duplications have been regarded as the major mechanisms driving lectin expansion, although retrogenes also significantly contributed to the birth of new lectin genes in soybean and rice. Evidence shows that lectin genes have been involved in biotic/abiotic stress regulation, and tandem/segmental duplications may be regarded as drivers for plants to adapt to various environmental stresses through duplication followed by expression divergence. Each member of this gene superfamily may play specialized roles in a specific stress condition and function as a regulator of various environmental factors such as cold, drought and high salinity as well as biotic stresses. CONCLUSIONS Our studies provide a new outline of the plant lectin gene superfamily and advance the understanding of plant lectin genes in lineage-specific expansion and their functions in biotic/abiotic stress-related developmental processes.
Affiliation(s)
- Shu-Ye Jiang
- Temasek Life Sciences Laboratory, 1 Research Link, the National University of Singapore, Singapore 117604
- Zhigang Ma
- Temasek Life Sciences Laboratory, 1 Research Link, the National University of Singapore, Singapore 117604
- Srinivasan Ramachandran
- Temasek Life Sciences Laboratory, 1 Research Link, the National University of Singapore, Singapore 117604
255
Li J, Hosseini Moghaddam SH, Chen X, Chen M, Zhong B. Shotgun strategy-based proteome profiling analysis on the head of silkworm Bombyx mori. Amino Acids 2010; 39:751-61. [PMID: 20198493 DOI: 10.1007/s00726-010-0517-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 08/13/2009] [Accepted: 02/05/2010] [Indexed: 01/09/2023]
Abstract
The insect head contains important sensory systems for communicating with the internal and external environment, as well as endocrine organs such as the brain and corpus allatum that regulate insect growth and development. To understand comprehensively how these components act and interact within the head, it is necessary to investigate their molecular basis at the protein level. Here, the spectra of peptides digested from silkworm larval heads were obtained by liquid chromatography tandem mass spectrometry (LC-MS/MS) and analyzed by bioinformatics methods. In total, 539 proteins with a low false discovery rate (FDR) were identified by searching against an in-house database with the SEQUEST and X!Tandem algorithms, followed by trans-proteomic pipeline (TPP) validation. Forty-three proteins had a theoretical isoelectric point (pI) greater than 10, which makes them difficult to separate by two-dimensional gel electrophoresis (2-DE). Four chemosensory proteins, one odorant-binding protein, two diapause-related proteins and many cuticle proteins, interestingly including pupal cuticle proteins, were identified. The proteins involved in nervous system development, stress response, apoptosis and other processes were related to the physiological status of the head. Pathway analysis revealed that many proteins were highly homologous to human proteins involved in human neurodegenerative disease pathways, possibly a sign of the forthcoming metamorphosis of the silkworm. These data and analysis methods are expected to benefit proteomics research on the silkworm and other insects.
Affiliation(s)
- Jianying Li
- College of Animal Sciences, Zhejiang University, Hangzhou, 310029, People's Republic of China
256
Lokanathan Y, Mohd-Adnan A, Wan KL, Nathan S. Transcriptome analysis of the Cryptocaryon irritans tomont stage identifies potential genes for the detection and control of cryptocaryonosis. BMC Genomics 2010; 11:76. [PMID: 20113487 PMCID: PMC2828411 DOI: 10.1186/1471-2164-11-76] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 07/01/2009] [Accepted: 01/29/2010] [Indexed: 01/26/2023] Open
Abstract
Background Cryptocaryon irritans is a parasitic ciliate that causes cryptocaryonosis (white spot disease) in marine fish. Diagnosis of cryptocaryonosis often depends on the appearance of white spots on the surface of the fish, which are usually visible only during later stages of the disease. Identifying suitable biomarkers of this parasite would aid the development of diagnostic tools and control strategies for C. irritans. The C. irritans genome is virtually unexplored; therefore, we generated and analyzed expressed sequence tags (ESTs) of the parasite to identify genes that encode surface proteins, excretory/secretory proteins and repeat-containing proteins. Results ESTs were generated from a cDNA library of C. irritans tomonts isolated from infected Asian sea bass, Lates calcarifer. Clustering of the 5356 ESTs produced 2659 unique transcripts (UTs) containing 1989 singletons and 670 consensi. BLAST analysis showed that 74% of the UTs had significant similarity (E-value < 10^-5) to sequences that are currently available in the GenBank database, with more than 15% of the significant hits showing unknown function. Forty percent of the UTs had significant similarity to ciliates from the genera Tetrahymena and Paramecium. Comparative gene family analysis with related taxa showed that many protein families are conserved among the protozoans. Based on gene ontology annotation, functional groups were successfully assigned to 790 UTs. Genes encoding excretory/secretory proteins and membrane and membrane-associated proteins were identified because these proteins often function as antigens and are good antibody targets. A total of 481 UTs were classified as encoding membrane proteins, 54 were classified as encoding membrane-bound proteins, and 155 were found to contain excretory/secretory protein-coding sequences.
Amino acid repeat-containing proteins and GPI-anchored proteins were also identified as potential candidates for the development of diagnostic and control strategies for C. irritans. Conclusions We successfully discovered and examined a large portion of the previously unexplored C. irritans transcriptome and identified potential genes for the development and validation of diagnostic and control strategies for cryptocaryonosis.
Affiliation(s)
- Yogeswaran Lokanathan
- School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Selangor, Malaysia
257
Gong J, Wei T, Zhang N, Jamitzky F, Heckl WM, Rössle SC, Stark RW. TollML: a database of toll-like receptor structural motifs. J Mol Model 2010; 16:1283-9. [PMID: 20084417 DOI: 10.1007/s00894-009-0640-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 10/18/2009] [Accepted: 11/19/2009] [Indexed: 02/06/2023]
Abstract
Toll-like receptors (TLRs) play a key role in the innate immune system. TLRs recognize pathogen-associated molecular patterns and initiate an intracellular kinase cascade to induce an immediate defensive response. During recent years TLRs have become the focus of tremendous research interest. A central repository for the growing amount of relevant TLR sequence information has been created. Nevertheless, structural motifs of most sequenced TLR proteins, such as leucine-rich repeats (LRRs), are poorly annotated in the established databases. A database that organizes the structural motifs of TLRs could be useful for developing pattern recognition programs, structural modeling and understanding functional mechanisms of TLRs. We describe TollML, a database that integrates all of the TLR sequencing data from the NCBI protein database. Entries were first divided into TLR families (TLR1-23) and then semi-automatically subdivided into three levels of structural motif categories: (1) signal peptide (SP), ectodomain (ECD), transmembrane domain (TD) and Toll/IL-1 receptor (TIR) domain of each TLR; (2) LRRs of each ECD; (3) highly conserved segment (HCS), variable segment (VS) and insertions of each LRR. These categories can be searched quickly using an easy-to-use web interface and dynamically displayed by graphics. Additionally, all entries have hyperlinks to various sources including NCBI, Swiss-Prot, PDB, LRRML and PubMed in order to provide broad external information for users. The TollML database is available at http://tollml.lrz.de.
Affiliation(s)
- Jing Gong
- Center for Nanoscience, Ludwig-Maximilians-Universität München, 80799, Munich, Germany
258
Abstract
Protein sequence databases contain not just the sequence of the protein itself but also annotation that reflects our knowledge of its function and contributing residues. In this chapter, we will discuss various public protein sequence databases, with a focus on those that are generally applicable. Special attention is paid to issues related to the reliability of both sequence and annotation, as those are fundamental to many questions researchers will ask. Using both well-annotated and scarcely annotated human proteins as examples, we show what information about the targets can be collected from freely available Internet resources and how this information can be used. The results are summarized in a simple graphical model of the protein's sequence architecture that highlights its structural and functional modules.
Affiliation(s)
- Michael Rebhan
- Head Bioinformatics Support, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
259
Fang Y, Xie K, Hou X, Hu H, Xiong L. Systematic analysis of GT factor family of rice reveals a novel subfamily involved in stress responses. Mol Genet Genomics 2009; 283:157-69. [PMID: 20039179 DOI: 10.1007/s00438-009-0507-x] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Received: 06/08/2009] [Accepted: 12/11/2009] [Indexed: 01/25/2023]
Abstract
GT factors constitute a plant-specific transcription factor family with a conserved trihelix DNA-binding domain. In this study, comprehensive sequence analysis suggested that 26 putative GT factors exist in rice. Phylogenetic analysis revealed three distinctive subfamilies (GTalpha, GTbeta, and GTgamma) of plant GT factors, and each subfamily has a unique composition of predicted motifs. We characterized the OsGTgamma-1 gene, a typical member of the GTgamma subfamily in rice. This gene encodes a protein containing a conserved trihelix domain, and the OsGTgamma-1:GFP fusion protein was targeted to nuclei of rice cells. The transcript level of OsGTgamma-1 was strongly induced by salt stress and slightly induced by drought and cold stresses and abscisic acid treatment. Two other members of the GTgamma subfamily, OsGTgamma-2 and OsGTgamma-3, were also induced by most of the abiotic stresses. These results suggested that the genes of the GTgamma subfamily in rice may be involved in stress responses. A homozygous mutant osgtgamma-1 (with T-DNA inserted in the promoter region of OsGTgamma-1) was more sensitive to salt stress than wild-type rice. Overexpression of OsGTgamma-1 in rice enhanced salt tolerance at the seedling stage. This evidence suggests that the OsGTgamma subfamily may participate in the regulation of stress tolerance in rice.
Affiliation(s)
- Yujie Fang
- National Key Laboratory of Crop Genetic Improvement, National Center of Plant Gene Research (Wuhan), Huazhong Agricultural University, 430070, Wuhan, China
260
Naamati G, Fromer M, Linial M. Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty? BMC Genomics 2009; 10:593. [PMID: 20003297 PMCID: PMC2805694 DOI: 10.1186/1471-2164-10-593] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/15/2009] [Accepted: 12/10/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. RESULTS We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are comprised of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. CONCLUSIONS While most TR-proteins have originated from prediction tools and are still awaiting experimental validation, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.
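To make the notion of a TR-unit concrete, the following is a minimal sketch of detecting exact tandem repeats with unit lengths of 5-20 residues in a protein sequence. It is only an illustration: the abstract does not say which detection tool the authors used, real tools tolerate mismatches between copies, and the function name, the `TandemRepeat` type and the toy sequence are all hypothetical.

```python
from typing import List, NamedTuple


class TandemRepeat(NamedTuple):
    start: int   # 0-based position of the first copy
    unit: str    # the repeated unit
    copies: int  # number of consecutive exact copies


def find_tandem_repeats(seq: str, min_unit: int = 5, max_unit: int = 20,
                        min_copies: int = 2) -> List[TandemRepeat]:
    """Greedy left-to-right scan for exact, back-to-back repeats.

    Assumption: copies must match exactly; approximate repeats are ignored.
    """
    hits: List[TandemRepeat] = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(min_unit, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for a unit of this length
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            covered = copies * k
            # Keep the candidate that covers the longest stretch of sequence.
            if copies >= min_copies and (
                    best is None or covered > best.copies * len(best.unit)):
                best = TandemRepeat(i, unit, copies)
        if best is not None:
            hits.append(best)
            i += best.copies * len(best.unit)  # skip past the repeat segment
        else:
            i += 1
    return hits


# Toy sequence (invented): a 6-residue unit repeated three times.
toy_protein = "MK" + "CHGTPA" * 3 + "LLW"
repeats = find_tandem_repeats(toy_protein)
```

On the toy sequence the scan reports a single segment starting at position 2 with unit `CHGTPA` and three copies. The greedy skip keeps the output free of overlapping hits, at the cost of possibly missing a repeat that starts inside an earlier one.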
261
Zeng J, Alhajj R, Demetrick DJ. Representative transcript sets for evaluating a translational initiation sites predictor. BMC Bioinformatics 2009; 10:206. [PMID: 19573244 PMCID: PMC2712473 DOI: 10.1186/1471-2105-10-206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/02/2009] [Accepted: 07/02/2009] [Indexed: 11/10/2022] Open
Abstract
Background Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens. Results In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the construction algorithm that we proposed. Comparative study which reports the predictors' performance on our data set as well as three other existing benchmark collections has demonstrated the actual merits of our data sets as benchmark testing collections. Conclusion The proposed data set construction algorithm has demonstrated its property of being a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. 
Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. The algorithm proposed by us aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, our approach's focus is to obtain an "average" protein data set for any particular organism without posing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be a trivial task to extend the current method for obtaining a more selective protein data set, which may facilitate the study of some particular protein structure.
262
Clifford M, Twigg J, Upton C. Evidence for a novel gene associated with human influenza A viruses. Virol J 2009; 6:198. [PMID: 19917120 PMCID: PMC2780412 DOI: 10.1186/1743-422x-6-198] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 10/15/2009] [Accepted: 11/16/2009] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Influenza A virus genomes are comprised of 8 negative strand single-stranded RNA segments and are thought to encode 11 proteins, which are all translated from mRNAs complementary to the genomic strands. Although human, swine and avian influenza A viruses are very similar, cross-species infections are usually limited. However, antigenic differences are considerable, and when viruses become established in a different host or if novel viruses are created by re-assortment, devastating pandemics may arise. RESULTS Examination of influenza A virus genomes from the early 20th Century revealed the association of a 167 codon ORF encoded by the genomic strand of segment 8 with human isolates. Close to the timing of the 1948 pseudopandemic, a mutation occurred that resulted in the extension of this ORF to 216 codons. Since 1948, this ORF has been almost totally maintained in human influenza A viruses, suggesting a selectable biological function. The discovery of cytotoxic T cells responding to an epitope encoded by this ORF suggests that it is translated into protein. Evidence of several other non-traditionally translated polypeptides in influenza A virus supports the translation of this genomic strand ORF. The gene product is predicted to have a signal sequence and two transmembrane domains. CONCLUSION We hypothesize that the genomic strand of segment 8 encodes a novel influenza A virus protein. The persistence and conservation of this genomic strand ORF for almost a century in human influenza A viruses provides strong evidence that it is translated into a polypeptide that enhances viral fitness in the human host. This has important consequences for the interpretation of experiments that utilize mutations in the NS1 and NEP genes of segment 8 and also for the consideration of events that may alter the spread and/or pathogenesis of swine and avian influenza A viruses in the human population.
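The idea of an ORF on the genomic strand being lengthened by a single mutation can be sketched in a few lines. The code below is a toy illustration, not the authors' analysis: the sequences are invented and far shorter than segment 8, the function names are hypothetical, and it assumes that an ORF runs from an ATG to the first in-frame stop codon. A sequence given in mRNA sense is reverse-complemented to obtain the genomic-sense strand, which is then scanned in three frames.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def longest_orf(dna: str) -> Tuple[int, int]:
    """Return (start, codons) of the longest ATG-to-stop ORF in three frames.

    The codon count excludes the stop codon. ORFs that run off the end of
    the sequence without a stop are ignored. Returns (-1, 0) if none found.
    """
    best = (-1, 0)
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                codons = (pos - start) // 3
                if codons > best[1]:
                    best = (start, codons)
                start = None
    return best


# Invented genomic-sense toy sequence: ATG, four GCT codons, then a TAA stop,
# followed by two more codons and a second in-frame stop (TGA).
genomic = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GCA" * 2 + "TGA" + "G"
mrna_sense = reverse_complement(genomic)

before = longest_orf(reverse_complement(mrna_sense))

# A single T-to-C change turns the first stop (TAA) into CAA, so the ORF
# reads through to the next in-frame stop.
mutated = genomic[:17] + "C" + genomic[18:]
after = longest_orf(mutated)
```

In this toy case the ORF grows from 5 to 8 codons after the point mutation, which mirrors in miniature the 167 to 216 codon extension described in the abstract.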
Affiliation(s)
- Monica Clifford
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8W 3P6, Canada
- James Twigg
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8W 3P6, Canada
- Chris Upton
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8W 3P6, Canada
263
Li D, Su Z, Dong J, Wang T. An expression database for roots of the model legume Medicago truncatula under salt stress. BMC Genomics 2009; 10:517. [PMID: 19906315 PMCID: PMC2779821 DOI: 10.1186/1471-2164-10-517] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.9] [Received: 07/05/2009] [Accepted: 11/11/2009] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Medicago truncatula is a model legume whose genome is currently being sequenced by an international consortium. Abiotic stresses such as salt stress limit plant growth and crop productivity, including those of legumes. We anticipate that studies on M. truncatula will shed light on other economically important legumes across the world. Here, we report the development of a database called MtED that contains gene expression profiles of the roots of M. truncatula based on time-course salt stress experiments using the Affymetrix Medicago GeneChip. Our hope is that MtED will provide information to assist in improving abiotic stress resistance in legumes. DESCRIPTION The results of our microarray experiment with roots of M. truncatula under 180 mM sodium chloride were deposited in the MtED database. Additionally, sequence and annotation information regarding microarray probe sets was included. MtED provides functional category analysis based on Gene and GeneBins Ontology, and other Web-based tools for querying and retrieving query results, browsing pathways and transcription factor families, showing metabolic maps, and comparing and visualizing expression profiles. Utilities such as mapping probe sets to the genome of M. truncatula and In-Silico PCR were implemented with the BLAT software suite and are also available through the MtED database. CONCLUSION MtED was built in the PHP script language and as a MySQL relational database system on a Linux server. It has an integrated Web interface, which facilitates ready examination and interpretation of the results of microarray experiments. It is intended to help in selecting gene markers to improve abiotic stress resistance in legumes. MtED is available at http://bioinformatics.cau.edu.cn/MtED/.
Collapse
Affiliation(s)
- Daofeng Li
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, 100193, PR China
| | - Zhen Su
- State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, PR China
| | - Jiangli Dong
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, 100193, PR China
| | - Tao Wang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, 100193, PR China
|
264
|
Souza CS, Oliveira BM, Costa GGL, Schriefer A, Selbach-Schnadelbach A, Uetanabaro APT, Pirovani CP, Pereira GAG, Taranto AG, Cascardo JCDM, Góes-Neto A. Identification and characterization of a class III chitin synthase gene of Moniliophthora perniciosa, the fungus that causes witches' broom disease of cacao. J Microbiol 2009; 47:431-40. [PMID: 19763417 DOI: 10.1007/s12275-008-0166-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/02/2008] [Accepted: 04/01/2009] [Indexed: 11/30/2022]
Abstract
Chitin synthase (CHS) is a glucosyltransferase that converts UDP-N-acetylglucosamine into chitin, one of the main components of fungal cell wall. Class III chitin synthases act directly in the formation of the cell wall. They catalyze the conversion of the immediate precursor of chitin and are responsible for the majority of chitin synthesis in fungi. As such, they are highly specific molecular targets for drugs that can inhibit the growth and development of fungal pathogens. In this work, we have identified and characterized a chitin synthase gene of Moniliophthora perniciosa (Mopchs) by primer walking. The complete gene sequence is 3,443 bp, interrupted by 13 small introns, and comprises a cDNA with an ORF with 2,739 bp, whose terminal region was experimentally determined, encoding a protein with 913 aa that harbors all the motifs and domains typically found in class III chitin synthases. This is the first report on the characterization of a chitin synthase gene, its mature transcription product, and its putative protein in basidioma and secondary mycelium stages of M. perniciosa, a basidiomycotan fungus that causes witches' broom disease of cacao.
Affiliation(s)
- Catiane S Souza
- Laboratório de Pesquisa em Microbiologia (LAPEM), Departamento de Ciências Biológicas, Universidade Estadual de Feira de Santana (UEFS), Avenida Transnordestina, s/n, Bairro Novo Horizonte, Feira de Santana, BA 44036-900, Brazil
|
265
|
Savas S, Geraci J, Jurisica I, Liu G. A comprehensive catalogue of functional genetic variations in the EGFR pathway: protein-protein interaction analysis reveals novel genes and polymorphisms important for cancer research. Int J Cancer 2009; 125:1257-65. [PMID: 19499547 DOI: 10.1002/ijc.24535] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 12/21/2022]
Abstract
The EGFR pathway is a critical signaling pathway deregulated in many solid tumors. In addition to the initiation and progression of cancer, the EGFR pathway is also implicated in variable treatment responses and prognoses. Genetic variation in the form of Single Nucleotide Polymorphisms (SNPs) can affect the function/expression of the EGFR pathway genes. Here, we applied a systematic and comprehensive approach utilizing diverse public databases and in silico analysis tools to select putative functional genetic variations from 244 genes involved in the EGFR pathway. Our data comprises 649 SNPs. Three hundred sixty SNPs are predicted to have biological consequences (functional SNPs). These SNPs can be directly used in further studies to test their association with risk, treatment response and prognosis in cancer. To systematically cover the EGFR pathway, we also performed a network-based analysis to further select putative functional SNPs from the genes whose protein products physically interact with the EGFR pathway proteins. We utilized protein-protein interaction information and focused on 14 proteins that have a high degree of connectivity (interacting with > or = 10 proteins) with the EGFR pathway genes identified to have functional SNPs (f-EGFR genes). Two of these proteins (FYN and LCK) had interactions with 17 of the f-EGFR genes, yet both lacked any putative functional SNP. However, our analysis indicated the presence of potentially functional SNPs in 9 other highly interactive proteins. The genes and their SNPs identified in the network-based analysis represent potential candidates for gene-gene and SNP-SNP interaction studies in cancer research.
Affiliation(s)
- Sevtap Savas
- Division of Applied Molecular Oncology, Department of Medical Biophysics, Ontario Cancer Institute, Toronto, Ontario, Canada.
|
266
|
Jiang SY, Christoffels A, Ramamoorthy R, Ramachandran S. Expansion mechanisms and functional annotations of hypothetical genes in the rice genome. PLANT PHYSIOLOGY 2009; 150:1997-2008. [PMID: 19535473 PMCID: PMC2719134 DOI: 10.1104/pp.109.139402] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 04/02/2009] [Accepted: 06/15/2009] [Indexed: 05/18/2023]
Abstract
In each completely sequenced genome, 30% to 50% of genes are annotated as uncharacterized hypothetical genes. In the rice (Oryza sativa) genome, 10,918 hypothetical genes were annotated in the latest version (release 6) of the Michigan State University rice genome annotation. We have implemented an integrative approach to analyze their duplication/expansion and function. The analyses show that tandem/segmental duplication and transposition/retrotransposition have significantly contributed to the expansion of hypothetical genes despite their different contribution rates. A total of 3,769 hypothetical genes have been detected from retrogene, tandem, segmental, Pack-MULE, or long terminated direct repeat-related duplication/expansion. The nonsynonymous substitutions per site and synonymous substitutions per site analyses showed that 21.65% of them were still functional, accounting for 7.47% of total hypothetical genes. Global expression analyses have identified 1,672 expressed hypothetical genes. Among them, 415 genes might function in a developmental stage-specific manner. Antisense strand expression and small RNA analyses have demonstrated that a high percentage of these hypothetical genes might play important roles in negatively regulating gene expression. Homologous searches against Arabidopsis (Arabidopsis thaliana), maize (Zea mays), sorghum (Sorghum bicolor), and indica rice genomes suggest that most of the hypothetical genes could be annotated from recently evolved genomic sequences. These data advance the understanding of rice hypothetical genes as being involved in lineage-specific expansion and that they function in a specific developmental stage. Our analyses also provide a valuable means to facilitate the characterization and functional annotation of hypothetical genes in other organisms.
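The percentages quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that "21.65% of them" refers to the 3,769 duplication/expansion-derived genes and that "total hypothetical genes" means the 10,918 annotated in release 6; the variable names are illustrative and do not come from the paper.

```python
# Figures taken from the abstract; the interpretation of "them" is an assumption.
total_hypothetical = 10_918   # hypothetical genes annotated in release 6
duplication_derived = 3_769   # genes detected from duplication/expansion events
functional_fraction = 0.2165  # share reported as still functional (Ka/Ks analysis)

functional = duplication_derived * functional_fraction
share_of_total = 100 * functional / total_hypothetical

print(f"about {functional:.0f} genes, {share_of_total:.2f}% of all hypothetical genes")
```

Under that reading the two reported percentages agree with each other: roughly 816 genes, or 7.47% of the 10,918 hypothetical genes.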
Affiliation(s)
- Shu-Ye Jiang
- Rice Functional Genomics Group, Temasek Life Sciences Laboratory, National University of Singapore, Singapore 117604
|
267
|
Almeida CR, Stoco PH, Wagner G, Sincero TC, Rotava G, Bayer-Santos E, Rodrigues JB, Sperandio MM, Maia AA, Ojopi EP, Zaha A, Ferreira HB, Tyler KM, Dávila AM, Grisard EC, Dias-Neto E. Transcriptome analysis of Taenia solium cysticerci using Open Reading Frame ESTs (ORESTES). Parasit Vectors 2009; 2:35. [PMID: 19646239 PMCID: PMC2731055 DOI: 10.1186/1756-3305-2-35] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 05/05/2009] [Accepted: 07/31/2009] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Human infection by the pork tapeworm Taenia solium affects more than 50 million people worldwide, particularly in underdeveloped and developing countries. Cysticercosis, which arises from larval encystation, can be life threatening and difficult to treat. Here, we investigate for the first time the transcriptome of the clinically relevant cysticerci larval form. RESULTS Using the ORESTES method, a total of 1,520 high-quality Expressed Sequence Tags (ESTs) were generated from 20 ORESTES cDNA mini-libraries. Their analysis revealed fragments of genes with promising applications, including 51 ESTs matching antigens previously described in other species, as well as 113 sequences representing proteins with potential extracellular localization, with obvious applications for immune-diagnosis or vaccine development. CONCLUSION The set of sequences described here will contribute to deciphering the expression profile of this important parasite and will be informative for the genome assembly and annotation, as well as for studies of intra- and inter-specific sequence variability. Genes of interest for developing new diagnostic and therapeutic tools are described and discussed.
Affiliation(s)
- Carolina R Almeida
- Laboratórios de Protozoologia e de Bioinformática, Departamento de Microbiologia, Imunologia e Parasitologia, Universidade Federal de Santa Catarina (UFSC), Caixa postal 476, CEP 88040-970, Florianópolis, SC, Brazil.
|
268
|
Shen YQ, Lang BF, Burger G. Diversity and dispersal of a ubiquitous protein family: acyl-CoA dehydrogenases. Nucleic Acids Res 2009; 37:5619-31. [PMID: 19625492 PMCID: PMC2761260 DOI: 10.1093/nar/gkp566] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/25/2022] Open
Abstract
Acyl-CoA dehydrogenases (ACADs), which are key enzymes in fatty acid and amino acid catabolism, form a large, pan-taxonomic protein family with at least 13 distinct subfamilies. Yet most reported ACAD members have no subfamily assigned, and little is known about the taxonomic distribution and evolution of the subfamilies. In completely sequenced genomes from approximately 210 species (eukaryotes, bacteria and archaea), we detect ACAD subfamilies by rigorous ortholog identification combining sequence similarity search with phylogeny. We then construct taxonomic subfamily-distribution profiles and build phylogenetic trees with orthologous proteins. Subfamily profiles provide unparalleled insight into the organisms’ energy sources based on genome sequence alone and further predict enzyme substrate specificity, thus generating explicit working hypotheses for targeted biochemical experimentation. Eukaryotic ACAD subfamilies are traditionally considered as mitochondrial proteins, but we found evidence that in fungi one subfamily is located in peroxisomes and participates in a distinct β-oxidation pathway. Finally, we discern horizontal transfer, duplication, loss and secondary acquisition of ACAD genes during evolution of this family. Through these unorthodox expansion strategies, the ACAD family is proficient in utilizing a large range of fatty acids and amino acids—strategies that could have shaped the evolutionary history of many other ancient protein families.
Affiliation(s)
- Yao-Qing Shen
- Robert Cedergren Center for Bioinformatics and Genomics, Biochemistry Department, Université de Montréal, 2900 Edouard-Montpetit, Montreal, QC, H3T 1J4, Canada.
|
269
|
Van Auken K, Jaffery J, Chan J, Müller HM, Sternberg PW. Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 2009; 10:228. [PMID: 19622167 PMCID: PMC2719631 DOI: 10.1186/1471-2105-10-228] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Received: 01/29/2009] [Accepted: 07/21/2009] [Indexed: 11/28/2022] Open
Abstract
Background Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. Results We employ the Textpresso category-based information retrieval and extraction system, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.
Conclusion Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
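The F-scores in this abstract can be recomputed from the reported recall and precision. The sketch below assumes the balanced F-score (harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and task labels are illustrative only.

```python
def f_score(recall_pct, precision_pct):
    """Balanced F-score (harmonic mean) of recall and precision, in percent."""
    return 2 * recall_pct * precision_pct / (recall_pct + precision_pct)

# (recall, precision) pairs as reported in the abstract.
reported = {
    "curatable papers": (79.1, 61.8),
    "relevant sentences": (30.3, 80.1),
    "GO annotations": (66.2, 97.3),
}

for task, (recall, precision) in reported.items():
    print(f"{task}: F = {f_score(recall, precision):.1f}%")
```

With these rounded inputs the second and third values reproduce the reported 44.0% and 78.8%. The first comes out at 69.4% against the reported 69.5%, a gap that is plausibly explained by the published recall and precision having been rounded before printing.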
Affiliation(s)
- Kimberly Van Auken
- Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA.
|
270
|
Molecular evolution and functional divergence of HAK potassium transporter gene family in rice (Oryza sativa L.). J Genet Genomics 2009; 36:161-72. [PMID: 19302972 DOI: 10.1016/s1673-8527(08)60103-4] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Received: 10/07/2008] [Revised: 12/02/2008] [Accepted: 12/10/2008] [Indexed: 11/22/2022]
Abstract
The high-affinity K+ (HAK) transporter gene family is the largest family of potassium transporters in plants and is important for various aspects of plant life. In the present study, we identified 27 members of this family in the rice genome. The phylogenetic tree divided the land plant HAK transporter proteins into 6 distinct groups. Although the main characteristics of this family were established before the origin of seed plants, some differences were also found between the members from non-seed and seed plants. The HAK genes in rice were found to have expanded in a lineage-specific manner after the split of monocots and dicots, and both segmental and tandem duplication events contributed to the expansion of this family. Functional divergence analysis for this family provided statistical evidence for a shifted evolutionary rate after gene duplication. Further analysis indicated that both point mutations under positive selection and gene conversion events contributed to the evolution of this family in rice.
|
271
|
Chitale M, Hawkins T, Park C, Kihara D. ESG: extended similarity group method for automated protein function prediction. Bioinformatics 2009; 25:1739-45. [PMID: 19435743 DOI: 10.1093/bioinformatics/btp309] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION Importance of accurate automatic protein function prediction is ever increasing in the face of a large number of newly sequenced genomes and proteomics data that are awaiting biological interpretation. Conventional methods have focused on high sequence similarity-based annotation transfer which relies on the concept of homology. However, many cases have been reported that simple transfer of function from top hits of a homology search causes erroneous annotation. New methods are required to handle the sequence similarity in a more robust way to combine together signals from strongly and weakly similar proteins for effectively predicting function for unknown proteins with high reliability. RESULTS We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned with probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We will depict how the statistical framework of ESG improves the prediction accuracy by iteratively taking into account the neighborhood of query protein in the sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. It is found that the iterative search is effective in capturing multiple-domains in a query protein, enabling accurately predicting several functions which originate from different domains. AVAILABILITY ESG web server is available for automated protein function prediction at http://dragon.bio.purdue.edu/ESG/.
Affiliation(s)
- Meghana Chitale
- Department of Computer Science, Purdue University, IN 47907, USA
|
272
|
Da Silva M, Upton C. Vaccinia virus G8R protein: a structural ortholog of proliferating cell nuclear antigen (PCNA). PLoS One 2009; 4:e5479. [PMID: 19421403 PMCID: PMC2674943 DOI: 10.1371/journal.pone.0005479] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 01/28/2009] [Accepted: 04/15/2009] [Indexed: 11/30/2022] Open
Abstract
Background Eukaryotic DNA replication involves the synthesis of both a DNA leading and lagging strand, the latter requiring several additional proteins including flap endonuclease (FEN-1) and proliferating cell nuclear antigen (PCNA) in order to remove RNA primers used in the synthesis of Okazaki fragments. Poxviruses are complex viruses (dsDNA genomes) that infect eukaryotes, but surprisingly little is known about the process of DNA replication. Given our previous results that the vaccinia virus (VACV) G5R protein may be structurally similar to a FEN-1-like protein and a recent finding that poxviruses encode a primase function, we undertook a series of in silico analyses to identify whether VACV also encodes a PCNA-like protein. Results An InterProScan of all VACV proteins using the JIPS software package was used to identify any PCNA-like proteins. The VACV G8R protein was identified as the only vaccinia protein that contained a PCNA-like sliding clamp motif. The VACV G8R protein plays a role in poxvirus late transcription and is known to interact with several other poxvirus proteins including itself. The secondary and tertiary structure of the VACV G8R protein was predicted and compared to the secondary and tertiary structure of both human and yeast PCNA proteins, and a high degree of similarity between all three proteins was noted. Conclusions The structure of the VACV G8R protein is predicted to closely resemble the eukaryotic PCNA protein; it possesses several other features including a conserved ubiquitylation and SUMOylation site that suggest that, like its counterpart in T4 bacteriophage (gp45), it may function as a sliding clamp ushering transcription factors to RNA polymerase during late transcription.
Affiliation(s)
- Melissa Da Silva
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, Canada
| | - Chris Upton
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, Canada
|
273
|
Godin KS, Walbott H, Leulliot N, van Tilbeurgh H, Varani G. The box H/ACA snoRNP assembly factor Shq1p is a chaperone protein homologous to Hsp90 cochaperones that binds to the Cbf5p enzyme. J Mol Biol 2009; 390:231-44. [PMID: 19426738 DOI: 10.1016/j.jmb.2009.04.076] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 12/18/2008] [Revised: 04/27/2009] [Accepted: 04/28/2009] [Indexed: 11/15/2022]
Abstract
Box H/ACA small nucleolar (sno) ribonucleoproteins (RNPs) are responsible for the formation of pseudouridine in a variety of RNAs and are essential for ribosome biogenesis, modification of spliceosomal RNAs, and telomerase stability. A mature snoRNP has been reconstituted in vitro and is composed of a single RNA and four proteins. However, snoRNP biogenesis in vivo requires multiple factors to coordinate a complex and poorly understood assembly and maturation process. Among the factors required for snoRNP biogenesis in yeast is Shq1p, an essential protein necessary for stable expression of box H/ACA snoRNAs. We have found that Shq1p consists of two independent domains that contain casein kinase 1 phosphorylation sites. We also demonstrate that Shq1p binds the pseudouridylating enzyme Cbf5p through the C-terminal domain, in synergy with the N-terminal domain. The NMR solution structure of the N-terminal domain has striking homology to the 'Chord and Sgt1' domain of known Hsp90 cochaperones, yet Shq1p does not interact with the yeast Hsp90 homologue in vitro. Surprisingly, Shq1p has stand-alone chaperone activity in vitro. This activity is harbored by the C-terminal domain, but it is increased by the presence of the N-terminal domain. These results provide the first evidence of a specific biochemical activity for Shq1p and a direct link to the H/ACA snoRNP.
Affiliation(s)
- Katherine S Godin
- Department of Chemistry, University of Washington, Seattle, 98195, USA
|
274
|
May P, Christian JO, Kempa S, Walther D. ChlamyCyc: an integrative systems biology database and web-portal for Chlamydomonas reinhardtii. BMC Genomics 2009; 10:209. [PMID: 19409111 PMCID: PMC2688524 DOI: 10.1186/1471-2164-10-209] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Received: 01/12/2009] [Accepted: 05/04/2009] [Indexed: 01/10/2023] Open
Abstract
Background The unicellular green alga Chlamydomonas reinhardtii is an important eukaryotic model organism for the study of photosynthesis and plant growth. In the era of modern high-throughput technologies there is an imperative need to integrate large-scale data sets from high-throughput experimental techniques using computational methods and database resources to provide comprehensive information about the molecular and cellular organization of a single organism. Results In the framework of the German Systems Biology initiative GoFORSYS, a pathway database and web-portal for Chlamydomonas (ChlamyCyc) was established, which currently features about 250 metabolic pathways with associated genes, enzymes, and compound information. ChlamyCyc was assembled using an integrative approach combining the recently published genome sequence, bioinformatics methods, and experimental data from metabolomics and proteomics experiments. We analyzed and integrated a combination of primary and secondary database resources, such as existing genome annotations from JGI, EST collections, orthology information, and MapMan classification. Conclusion ChlamyCyc provides a curated and integrated systems biology repository that will enable and assist in systematic studies of fundamental cellular processes in Chlamydomonas. The ChlamyCyc database and web-portal is freely available under .
Affiliation(s)
- Patrick May
- Max-Planck-Institute of Molecular Plant Physiology, Potsdam, Germany.
|
275
|
Ooi HS, Kwo CY, Wildpaner M, Sirota FL, Eisenhaber B, Maurer-Stroh S, Wong WC, Schleiffer A, Eisenhaber F, Schneider G. ANNIE: integrated de novo protein sequence annotation. Nucleic Acids Res 2009; 37:W435-40. [PMID: 19389726 PMCID: PMC2703921 DOI: 10.1093/nar/gkp254] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 12/21/2022] Open
Abstract
Function prediction of proteins with computational sequence analysis requires the use of dozens of prediction tools with a bewildering range of input and output formats. Each of these tools focuses on a narrow aspect and researchers are having difficulty obtaining an integrated picture. ANNIE is the result of years of close interaction between computational biologists and computer scientists and automates an essential part of this sequence analytic process. It brings together over 20 function prediction algorithms that have proven sufficiently reliable and indispensable in daily sequence analytic work and are meant to give scientists a quick overview of possible functional assignments of sequence segments in the query proteins. The results are displayed in an integrated manner using an innovative AJAX-based sequence viewer. ANNIE is available online at: http://annie.bii.a-star.edu.sg. This website is free and open to all users and there is no login requirement.
|
276
|
Price DRG, Bell HA, Hinchliffe G, Fitches E, Weaver R, Gatehouse JA. A venom metalloproteinase from the parasitic wasp Eulophus pennicornis is toxic towards its host, tomato moth (Lacanobia oleracae). INSECT MOLECULAR BIOLOGY 2009; 18:195-202. [PMID: 19320760 DOI: 10.1111/j.1365-2583.2009.00864.x] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 05/27/2023]
Abstract
Three genes encoding clan MB metalloproteinases (EpMP1-3) were identified from venom glands of the ectoparasitic wasp Eulophus pennicornis. The derived amino acid sequences predict mature proteins of approximately 46 kDa, with a novel two-domain structure comprising a C-terminal reprolysin domain, and an N-terminal domain of unknown function. EpMP3 expressed as a recombinant protein in Pichia pastoris had gelatinase activity, which was inhibited by EDTA. Injection of recombinant EpMP3 into fifth instar Lacanobia oleracea (host) larvae resulted in partial insect mortality associated with the moult to sixth instar, with surviving insects showing retarded development and growth. EpMP3 is expressed specifically in venom glands. These results suggest that EpMP3 is a functional component of Eulophus venom, which is able to manipulate host development.
|
277
|
Hong Y, Chalkia D, Ko KD, Bhardwaj G, Chang GS, van Rossum DB, Patterson RL. Phylogenetic Profiles Reveal Structural and Functional Determinants of Lipid-binding. J Proteomics Bioinform 2009; 2:139-149. [PMID: 19946567 DOI: 10.4172/jpb.1000071] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
One of the major challenges in the genomic era is annotating structure/function to the vast quantities of sequence information now available. Indeed, most of the protein sequence database lacks comprehensive annotation, even when experimental evidence exists. Further, within structurally resolved and functionally annotated protein domains, additional functionalities contained in these domains are not apparent. To add further complication, small changes in the amino-acid sequence can lead to profound changes in both structure and function, underscoring the need for rapid and reliable methods to analyze these types of data. Phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. Using all of the structurally resolved Src-Homology-2 (SH2) domains, we demonstrate that knowledge-bases can be used to create single-amino acid phylogenetic profiles which reliably annotate lipid-binding. Indeed, these measures isolate the known phosphotyrosine and hydrophobic pockets as integral to lipid-binding function. In addition, we determined that the SH2 domain of Tec family kinases bind to lipids with varying affinity and specificity. Simulating mutations in Bruton's tyrosine kinase (BTK) that cause X-Linked Agammaglobulinemia (XLA) predict that these mutations alter lipid-binding, which we confirm experimentally. In light of these results, we propose that XLA-causing mutations in the SH3-SH2 domain of BTK alter lipid-binding, which could play a causative role in the XLA-phenotype. Overall, our study suggests that the number of lipid-binding proteins is drastically underestimated and, with further development, phylogenetic profiles can provide a method for rapidly increasing the functional annotation of protein sequences.
Affiliation(s)
- Yoojin Hong
- Center for Computational Proteomics, The Pennsylvania State University
|
278
|
Bellgard MI, Wanchanthuek P, La T, Ryan K, Moolhuijzen P, Albertyn Z, Shaban B, Motro Y, Dunn DS, Schibeci D, Hunter A, Barrero R, Phillips ND, Hampson DJ. Genome sequence of the pathogenic intestinal spirochete Brachyspira hyodysenteriae reveals adaptations to its lifestyle in the porcine large intestine. PLoS One 2009; 4:e4641. [PMID: 19262690 PMCID: PMC2650404 DOI: 10.1371/journal.pone.0004641] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.0] [Received: 12/22/2008] [Accepted: 01/06/2009] [Indexed: 11/30/2022] Open
Abstract
Brachyspira hyodysenteriae is an anaerobic intestinal spirochete that colonizes the large intestine of pigs and causes swine dysentery, a disease of significant economic importance. The genome sequence of B. hyodysenteriae strain WA1 was determined, making it the first representative of the genus Brachyspira to be sequenced, and the seventeenth spirochete genome to be reported. The genome consisted of a circular 3,000,694 base pair (bp) chromosome, and a 35,940 bp circular plasmid that has not previously been described. The spirochete had 2,122 protein-coding sequences. Of the predicted proteins, more had similarities to proteins of the enteric Escherichia coli and Clostridium species than they did to proteins of other spirochetes. Many of these genes were associated with transport and metabolism, and they may have been gradually acquired through horizontal gene transfer in the environment of the large intestine. A reconstruction of central metabolic pathways identified a complete set of coding sequences for glycolysis, gluconeogenesis, a non-oxidative pentose phosphate pathway, nucleotide metabolism, lipooligosaccharide biosynthesis, and a respiratory electron transport chain. A notable finding was the presence on the plasmid of the genes involved in rhamnose biosynthesis. Potential virulence genes included those for 15 proteases and six hemolysins. Other adaptations to an enteric lifestyle included the presence of large numbers of genes associated with chemotaxis and motility. B. hyodysenteriae has diverged from other spirochetes in the process of accommodating to its habitat in the porcine large intestine.
Affiliation(s)
- Matthew I. Bellgard
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Phatthanaphong Wanchanthuek
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Faculty of Informatics, Mahasarakham University, Mahasarakham, Thailand
- Tom La
- Animal Research Institute, School Veterinary and Biomedical Science, Murdoch University, Murdoch, Western Australia, Australia
- Karon Ryan
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Paula Moolhuijzen
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Zayed Albertyn
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Babak Shaban
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Yair Motro
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- David S. Dunn
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- David Schibeci
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Adam Hunter
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Roberto Barrero
- Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia, Australia
- Nyree D. Phillips
- Animal Research Institute, School Veterinary and Biomedical Science, Murdoch University, Murdoch, Western Australia, Australia
- David J. Hampson
- Animal Research Institute, School Veterinary and Biomedical Science, Murdoch University, Murdoch, Western Australia, Australia
279
Fontana P, Cestaro A, Velasco R, Formentin E, Toppo S. Rapid annotation of anonymous sequences from genome projects using semantic similarities and a weighting scheme in gene ontology. PLoS One 2009; 4:e4619. [PMID: 19247487 PMCID: PMC2645684 DOI: 10.1371/journal.pone.0004619] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 11/09/2008] [Accepted: 01/09/2009] [Indexed: 11/22/2022]
Abstract
Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to quickly process thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage.
Affiliation(s)
- Paolo Fontana
- FEM-IASMA Research Center, San Michele all'Adige (TN), Italy
- Stefano Toppo
- Department of Biological Chemistry, University of Padova, Padova, Italy
280
Derrien T, Thézé J, Vaysse A, André C, Ostrander EA, Galibert F, Hitte C. Revisiting the missing protein-coding gene catalog of the domestic dog. BMC Genomics 2009; 10:62. [PMID: 19193219 PMCID: PMC2644713 DOI: 10.1186/1471-2164-10-62] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 08/28/2008] [Accepted: 02/04/2009] [Indexed: 12/19/2022]
Abstract
Background Among mammals for which there is a high sequence coverage, the whole genome assembly of the dog is unique in that it predicts a low number of protein-coding genes, ~19,000, compared to the over 20,000 reported for other mammalian species. Of particular interest are the more than 400 genes annotated in primate and rodent genomes, but missing in dog. Results Using over 14,000 orthologous genes between human, chimpanzee, mouse, rat and dog, we built multiple pairwise synteny maps to infer short orthologous intervals that were targeted for characterizing the canine missing genes. Based on gene prediction and a functionality test using the ratio of replacement to silent nucleotide substitution rates (dN/dS), we provide compelling structural and functional evidence for the identification of 232 new protein-coding genes in the canine genome and 69 gene losses, characterized as undetected genes or pseudogenes. Gene loss phyletic pattern analysis using ten species from chicken to human allowed us to characterize 28 canine-specific gene losses that have functional orthologs continuously from chicken or marsupials through human, and 10 genes that arose specifically in the evolutionary lineage leading to rodents and primates. Conclusion This study demonstrates the central role of comparative genomics for refining gene catalogs and exploring the evolutionary history of gene repertoires, particularly as applied for the characterization of species-specific gene gains and losses.
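The dN/dS functionality test named in this abstract can be illustrated with a minimal sketch. It assumes that substitution counts and site counts are already available and applies no multiple-hit correction; the function name and the toy numbers are hypothetical and are not taken from the paper. A ratio well below 1 is conventionally read as purifying selection, which is consistent with a functional gene.

```python
def dn_ds_ratio(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Uncorrected dN/dS from raw counts (hypothetical helper, no multiple-hit correction)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_subs / nonsyn_sites  # replacement substitutions per replacement site
    ds = syn_subs / syn_sites        # silent substitutions per silent site
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


# Toy counts, invented for illustration only.
omega = dn_ds_ratio(nonsyn_subs=12, nonsyn_sites=600, syn_subs=20, syn_sites=200)
print(f"dN/dS = {omega:.2f}")
```

Real analyses would typically estimate dN and dS with a codon substitution model rather than raw proportions.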
Affiliation(s)
- Thomas Derrien
- Institut de Génétique et Développement, CNRS UMR6061, Université de Rennes 1, Léon Bernard, Rennes, France.
282
The 2008 update of the Aspergillus nidulans genome annotation: a community effort. Fungal Genet Biol 2008; 46 Suppl 1:S2-13. [PMID: 19146970 DOI: 10.1016/j.fgb.2008.12.003] [Citation(s) in RCA: 87] [Impact Index Per Article: 5.4] [Received: 10/17/2008] [Revised: 12/15/2008] [Accepted: 12/15/2008] [Indexed: 01/28/2023]
Abstract
The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology.
283
Jackson AP, Quail MA, Berriman M. Insights into the genome sequence of a free-living Kinetoplastid: Bodo saltans (Kinetoplastida: Euglenozoa). BMC Genomics 2008; 9:594. [PMID: 19068121 PMCID: PMC2621209 DOI: 10.1186/1471-2164-9-594] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Received: 09/04/2008] [Accepted: 12/09/2008] [Indexed: 12/02/2022]
Abstract
Background Bodo saltans is a free-living kinetoplastid and among the closest relatives of the trypanosomatid parasites, which cause such human diseases as African sleeping sickness, leishmaniasis and Chagas disease. A B. saltans genome sequence will provide a free-living comparison with parasitic genomes necessary for comparative analyses of existing and future trypanosomatid genomic resources. Various coding regions were sequenced to provide a preliminary insight into the bodonid genome sequence, relative to trypanosomatid sequences. Results 0.4 Mbp of B. saltans genome was sequenced from 12 distinct regions and contained 178 coding sequences. As in trypanosomatids, introns were absent and %GC was elevated in coding regions, greatly assisting in gene finding. In the regions studied, roughly 60% of all genes had homologs in trypanosomatids, while 28% were Bodo-specific. Intergenic sequences were typically short, resulting in higher gene density than in trypanosomatids. Although synteny was typically conserved for those genes with trypanosomatid homologs, strict colinearity was rarely observed because gene order was regularly disrupted by Bodo-specific genes. Conclusion The B. saltans genome contains both sequences homologous to trypanosomatids and sequences never seen before. Structural similarities suggest that its assembly should be solvable, and, although de novo assembly will be necessary, existing trypanosomatid projects will provide some guide to annotation. A complete genome sequence will provide an effective ancestral model for understanding the shared and derived features of known trypanosomatid genomes, but it will also identify those kinetoplastid genome features lost during the evolution of parasitism.
Affiliation(s)
- Andrew P Jackson
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK.
284
Reeves GA, Eilbeck K, Magrane M, O'Donovan C, Montecchi-Palazzi L, Harris MA, Orchard S, Jimenez RC, Prlic A, Hubbard TJP, Hermjakob H, Thornton JM. The Protein Feature Ontology: a tool for the unification of protein feature annotations. Bioinformatics 2008; 24:2767-72. [PMID: 18936051 PMCID: PMC2912506 DOI: 10.1093/bioinformatics/btn528] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of uncharacterized protein structures and sequences. Consequently, many computational tools have been developed to help elucidate protein function. However, such services are spread throughout the world, often with standalone web pages. Integration of these methods is needed and so far this has not been possible as there was no common vocabulary available that could be used as a standard language. RESULTS The Protein Feature Ontology has been developed to provide a structured controlled vocabulary for features on a protein sequence or structure and comprises approximately 100 positional terms, now integrated into the Sequence Ontology (SO) and 40 non-positional terms which describe features relating to the whole-protein sequence. In addition, post-translational modifications are described by using a pre-existing ontology, the Protein Modification Ontology (MOD). This ontology is being used to integrate over 150 distinct annotations provided by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in Europe. AVAILABILITY The Protein Feature Ontology can be browsed by accessing the ontology lookup service at the European Bioinformatics Institute (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS).
Affiliation(s)
- Gabrielle A Reeves
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
285
Vizcaíno JA, Mueller M, Hermjakob H, Martens L. Charting online OMICS resources: A navigational chart for clinical researchers. Proteomics Clin Appl 2008; 3:18-29. [PMID: 21136933 DOI: 10.1002/prca.200800082] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 03/30/2008] [Indexed: 12/22/2022]
Abstract
The life sciences have sprouted several popular and successful OMICS technologies that span all levels of biological information transfer. Ever since the start of the Human Genome Project, the then revolutionary idea to make all resulting data publicly available has been central to all of the efforts across OMICS technologies. As a result, a great variety of publicly available data repositories and resources is currently available to the research community. This widespread availability of data does come at the price of increased confusion on the part of the users, especially for those that see the OMICS technologies as tools to help unravel a larger biological or clinical question. We therefore provide a comprehensive overview of the available resources across OMICS fields, with a special emphasis on those databases that are relevant to the study of proteins. Additionally, we also describe various integrative systems that have been established, and highlight new developments in the field that can revolutionize the way in which live data integration is achieved over the internet.
Affiliation(s)
- Juan Antonio Vizcaíno
- EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
286
Wei T, Gong J, Jamitzky F, Heckl WM, Stark RW, Rössle SC. LRRML: a conformational database and an XML description of leucine-rich repeats (LRRs). BMC Struct Biol 2008; 8:47. [PMID: 18986514 PMCID: PMC2645405 DOI: 10.1186/1472-6807-8-47] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 06/05/2008] [Accepted: 11/05/2008] [Indexed: 11/22/2022]
Abstract
Background Leucine-rich repeats (LRRs) are present in more than 6000 proteins. They are found in organisms ranging from viruses to eukaryotes and play an important role in protein-ligand interactions. To date, more than one hundred crystal structures of LRR containing proteins have been determined. This knowledge has increased our ability to use the crystal structures as templates to model LRR proteins with unknown structures. Since the individual three-dimensional LRR structures are not directly available from the established databases and since there are only a few detailed annotations for them, a conformational LRR database useful for homology modeling of LRR proteins is desirable. Description We developed LRRML, a conformational database and an extensible markup language (XML) description of LRRs. The release 0.2 contains 1261 individual LRR structures, which were identified from 112 PDB structures and annotated manually. An XML structure was defined to exchange and store the LRRs. LRRML provides a source for homology modeling and structural analysis of LRR proteins. In order to demonstrate the capabilities of the database we modeled the mouse Toll-like receptor 3 (TLR3) by multiple templates homology modeling and compared the result with the crystal structure. Conclusion LRRML is an information source for investigators involved in both theoretical and applied research on LRR proteins. It is available at .
Affiliation(s)
- Tiandi Wei
- Department of Earth and Environmental Sciences, Ludwig-Maximilians-Universität München, Theresienstr. 41, 80333 Munich, Germany.
287
Loewenstein Y, Linial M. Connect the dots: exposing hidden protein family connections from the entire sequence tree. Bioinformatics 2008; 24:i193-9. [PMID: 18689824 DOI: 10.1093/bioinformatics/btn301] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Mapping of remote evolutionary links is a classic computational problem of much interest. Relating protein families allows for functional and structural inference on uncharacterized families. Since sequences have diverged beyond reliable alignment, these are too remote to identify by conventional methods. APPROACH We present a method to systematically identify remote evolutionary relations between protein families, leveraging a novel evolutionary-driven tree of all protein sequences and families. A global approach which considers the entire volume of similarities while clustering sequences, leads to a robust tree that allows tracing of very faint evolutionary links. The method systematically scans the tree for clusters which partition exceptionally well into extant protein families, thus suggesting an evolutionary breakpoint in a putative ancient superfamily. Our method does not require family profiles (or HMMs), or multiple alignment. RESULTS Considering the entire Pfam database, we are able to suggest 710 links between protein families, 125 of which are confirmed by existence of Pfam clans. The quality of our predictions is also validated by structural assignments. We further provide an intrinsic characterization of the validity of our results and provide examples for new biological findings, from our systematic scan. For example, we are able to relate several bacterial pore-forming toxin families, and then link them with a novel family of eukaryotic toxins expressed in plants, fish venom and notably also uncharacterized proteins from human pathogens. AVAILABILITY A detailed list of putative homologous superfamilies, including 210 families of unknown function, has been made available online: http://www.protonet.cs.huji.ac.il/dots
Affiliation(s)
- Yaniv Loewenstein
- School of Computer Science and Engineering, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 91904, Israel.
288
Uhl GR, Drgon T, Johnson C, Li CY, Contoreggi C, Hess J, Naiman D, Liu QR. Molecular genetics of addiction and related heritable phenotypes: genome-wide association approaches identify "connectivity constellation" and drug target genes with pleiotropic effects. Ann N Y Acad Sci 2008; 1141:318-81. [PMID: 18991966 PMCID: PMC3922196 DOI: 10.1196/annals.1441.018] [Citation(s) in RCA: 131] [Impact Index Per Article: 8.2] [Indexed: 12/14/2022]
Abstract
Genome-wide association (GWA) can elucidate molecular genetic bases for human individual differences in complex phenotypes that include vulnerability to addiction. Here, we review (a) evidence that supports polygenic models with (at least) modest heterogeneity for the genetic architectures of addiction and several related phenotypes; (b) technical and ethical aspects of importance for understanding GWA data, including genotyping in individual samples versus DNA pools, analytic approaches, power estimation, and ethical issues in genotyping individuals with illegal behaviors; (c) the samples and the data that shape our current understanding of the molecular genetics of individual differences in vulnerability to substance dependence and related phenotypes; (d) overlaps between GWA data sets for dependence on different substances; and (e) overlaps between GWA data for addictions versus other heritable, brain-based phenotypes that include bipolar disorder, cognitive ability, frontal lobe brain volume, the ability to successfully quit smoking, neuroticism, and Alzheimer's disease. These convergent results identify potential targets for drugs that might modify addictions and play roles in these other phenotypes. They add to evidence that individual differences in the quality and quantity of brain connections make pleiotropic contributions to individual differences in vulnerability to addictions and to related brain disorders and phenotypes. A "connectivity constellation" of brain phenotypes and disorders appears to receive substantial pathogenic contributions from individual differences in a constellation of genes whose variants provide individual differences in the specification of brain connectivities during development and in adulthood. Heritable brain differences that underlie addiction vulnerability thus lie squarely in the midst of the repertoire of heritable brain differences that underlie vulnerability to other common brain disorders and phenotypes.
Affiliation(s)
- George R Uhl
- Molecular Neurobiology Branch, National Institutes of Health (NIH), Intramural Research Program (IRP), National Institute on Drug Abuse (NIDA), Baltimore, MD 21224, USA.
289
Lacroix V, Cottret L, Thébault P, Sagot MF. An introduction to metabolic networks and their structural analysis. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:594-617. [PMID: 18989046 DOI: 10.1109/tcbb.2008.79] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Indexed: 05/27/2023]
Abstract
There has been a renewed interest in metabolism in the computational biology community, leading to an avalanche of papers coming from methodological network analysis as well as experimental and theoretical biology. This paper is meant to serve as an initial guide for both the biologists interested in formal approaches and the mathematicians or computer scientists wishing to inject more realism into their models. The paper is focused on the structural aspects of metabolism only. The literature is vast enough already, and the thread through it difficult to follow even for the more experienced worker in the field. We explain methods for acquiring data and reconstructing metabolic networks, and review the various models that have been used for their structural analysis. Several concepts such as modularity are introduced, as are the controversies that have beset the field these past few years, for instance, on whether metabolic networks are small-world or scale-free, and on which model better explains the evolution of metabolism. Clarifying the work that has been done also helps in identifying open questions and in proposing relevant future directions in the field, which we do throughout the paper and in the conclusion.
Affiliation(s)
- Vincent Lacroix
- Genome Bioinformatics Research Group, Centre de Regulacio Genomica (CRG), PRBB, Aiguader 88, 08003 Barcelona, Spain.
290
Li CY, Liu QR, Zhang PW, Li XM, Wei L, Uhl GR. OKCAM: an ontology-based, human-centered knowledgebase for cell adhesion molecules. Nucleic Acids Res 2008; 37:D251-60. [PMID: 18790807 PMCID: PMC2686464 DOI: 10.1093/nar/gkn568] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/16/2022]
Abstract
‘Cell adhesion molecules’ (CAMs) are essential elements of cell/cell communication that are important for proper development and plasticity of a variety of organs and tissues. In the brain, appropriate assembly and tuning of neuronal connections is likely to require appropriate function of many cell adhesion processes. Genetic studies have linked and/or associated CAM variants with psychiatric, neurologic, neoplastic, immunologic and developmental phenotypes. However, despite increasing recognition of their functional and pathological significance, no systematic study has enumerated CAMs or documented their global features. We now report compilation of 496 human CAM genes in six gene families based on manual curation of protein domain structures, Gene Ontology annotations, and 1487 NCBI Entrez annotations. We map these genes onto a cell adhesion molecule ontology that contains 850 terms, up to seven levels of depth and provides a hierarchical description of these molecules and their functions. We develop OKCAM, a CAM knowledgebase that provides ready access to these data and ontologic system at http://okcam.cbi.pku.edu.cn. We identify global CAM properties that include: (i) functional enrichment, (ii) over-represented regulation modes and expression patterns and (iii) relationships to human Mendelian and complex diseases, and discuss the strengths and limitations of these data.
Affiliation(s)
- Chuan-Yun Li
- Molecular Neurobiology Branch, NIH-IRP (NIDA), Baltimore, MD 21224, USA
291
Aftab S, Semenec L, Chu JSC, Chen N. Identification and characterization of novel human tissue-specific RFX transcription factors. BMC Evol Biol 2008; 8:226. [PMID: 18673564 PMCID: PMC2533330 DOI: 10.1186/1471-2148-8-226] [Citation(s) in RCA: 99] [Impact Index Per Article: 6.2] [Received: 04/01/2008] [Accepted: 08/01/2008] [Indexed: 02/06/2023]
Abstract
Background Five regulatory factor X (RFX) transcription factors (TFs)–RFX1-5–have been previously characterized in the human genome, which have been demonstrated to be critical for development and are associated with an expanding list of serious human disease conditions including major histocompatibility (MHC) class II deficiency and ciliopathies. Results In this study, we have identified two additional RFX genes–RFX6 and RFX7–in the current human genome sequences. Both RFX6 and RFX7 are demonstrated to be winged-helix TFs and have well conserved RFX DNA binding domains (DBDs), which are also found in winged-helix TFs RFX1-5. Phylogenetic analysis suggests that the RFX family in the human genome has undergone at least three gene duplications in evolution and the seven human RFX genes can be clearly categorized into three subgroups: (1) RFX1-3, (2) RFX4 and RFX6, and (3) RFX5 and RFX7. Our functional genomics analysis suggests that RFX6 and RFX7 have distinct expression profiles. RFX6 is expressed almost exclusively in the pancreatic islets, while RFX7 has high ubiquitous expression in nearly all tissues examined, particularly in various brain tissues. Conclusion The identification and further characterization of these two novel RFX genes hold promise for gaining critical insight into development and many disease conditions in mammals, potentially leading to identification of disease genes and biomarkers.
Affiliation(s)
- Syed Aftab
- Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada.
292
Espadaler J, Eswar N, Querol E, Avilés FX, Sali A, Marti-Renom MA, Oliva B. Prediction of enzyme function by combining sequence similarity and protein interactions. BMC Bioinformatics 2008; 9:249. [PMID: 18505562 PMCID: PMC2430716 DOI: 10.1186/1471-2105-9-249] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Received: 07/31/2007] [Accepted: 05/27/2008] [Indexed: 11/18/2022]
Abstract
Background A number of studies have used protein interaction data alone for protein function prediction. Here, we introduce a computational approach for annotation of enzymes, based on the observation that similar protein sequences are more likely to perform the same function if they share similar interacting partners. Results The method has been tested against the PSI-BLAST program using a set of 3,890 protein sequences for which interaction data were available. For protein sequences that align with at least 40% sequence identity to a known enzyme, the specificity of our method in predicting the first three EC digits increased from 80% to 90% at 80% coverage when compared to PSI-BLAST. Conclusion Our method can also be used in proteins for which homologous sequences with known interacting partners can be detected. Thus, our method could increase by 10% the specificity of genome-wide enzyme predictions based on sequence matching by PSI-BLAST alone.
Affiliation(s)
- Jordi Espadaler
- Laboratori de Bioinformàtica Estructural (GRIB), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra-IMIM, 08003-Barcelona, Catalonia, Spain.
293
The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology. J Biotechnol 2008; 136:77-90. [PMID: 18597880 DOI: 10.1016/j.jbiotec.2008.05.008] [Citation(s) in RCA: 261] [Impact Index Per Article: 16.3] [Received: 03/12/2008] [Revised: 04/16/2008] [Accepted: 05/08/2008] [Indexed: 11/21/2022]
Abstract
Composition and gene content of a biogas-producing microbial community from a production-scale biogas plant fed with renewable primary products was analysed by means of a metagenomic approach applying the ultrafast 454-pyrosequencing technology. Sequencing of isolated total community DNA on a Genome Sequencer FLX System resulted in 616,072 reads with an average read length of 230 bases accounting for 141,664,289 bases of sequence information. Assignment of obtained single reads to COG (Clusters of Orthologous Groups of proteins) categories revealed a genetic profile characteristic of an anaerobic microbial consortium conducting fermentative metabolic pathways. Assembly of single reads resulted in the formation of 8752 contigs larger than 500 bases in size. Contigs longer than 10 kb mainly encode house-keeping proteins, e.g. DNA polymerase, recombinase, DNA ligase, sigma factor RpoD and genes involved in sugar and amino acid metabolism. A significant portion of contigs was allocated to the genome sequence of the archaeal methanogen Methanoculleus marisnigri JR1. Mapping of single reads to the M. marisnigri JR1 genome revealed that approximately 64% of the reference genome including methanogenesis gene regions are deeply covered. These results suggest that species related to those of the genus Methanoculleus play a dominant role in methanogenesis in the analysed fermentation sample. Moreover, assignment of numerous contig sequences to clostridial genomes including gene regions for cellulolytic functions indicates that clostridia are important for hydrolysis of cellulosic plant biomass in the biogas fermenter under study. Metagenome sequence data from a biogas-producing microbial community residing in a fermenter of a biogas plant provide the basis for a rational approach to improve the biotechnological process of biogas production.
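The read statistics quoted in this abstract are mutually consistent, which a short arithmetic check makes explicit. The sketch below only divides the reported total base count by the reported read count; the function name is a hypothetical helper and is not part of any sequencing toolkit.

```python
def mean_read_length(total_bases, n_reads):
    """Average read length in bases (hypothetical helper for a sanity check)."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_bases / n_reads


# Figures as reported in the abstract.
reported_reads = 616_072
reported_bases = 141_664_289

average = mean_read_length(reported_bases, reported_reads)
print(f"mean read length: {average:.1f} bases")
```

Rounding the result to the nearest whole base reproduces the average read length stated in the abstract.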
294
Ivanic J, Wallqvist A, Reifman J. Evidence of probabilistic behaviour in protein interaction networks. BMC Syst Biol 2008; 2:11. [PMID: 18237403 PMCID: PMC2267158 DOI: 10.1186/1752-0509-2-11] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/13/2007] [Accepted: 01/31/2008] [Indexed: 11/25/2022]
Abstract
Background Data from high-throughput experiments of protein-protein interactions are commonly used to probe the nature of biological organization and extract functional relationships between sets of proteins. What has not been appreciated is that the underlying mechanisms involved in assembling these networks may exhibit considerable probabilistic behaviour. Results We find that the probability of an interaction between two proteins is generally proportional to the numerical product of their individual interacting partners, or degrees. The degree-weighted behaviour is manifested throughout the protein-protein interaction networks studied here, except for the high-degree, or hub, interaction areas. However, we find that the probabilities of interaction between the hubs are still high. Further evidence is provided by path length analyses, which show that these hubs are separated by very few links. Conclusion The results suggest that protein-protein interaction networks incorporate probabilistic elements that lead to scale-rich hierarchical architectures. These observations seem to be at odds with a biologically-guided organization. One interpretation of the findings is that we are witnessing the ability of proteins to indiscriminately bind rather than the protein-protein interactions that are actually utilized by the cell in biological processes. Therefore, the topological study of a degree-weighted network requires a more refined methodology to extract biological information about pathways, modules, or other inferred relationships among proteins.
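The degree-product relationship described in the Results can be made concrete with a small sketch. It assumes an undirected interaction list with hypothetical protein labels, and it simply tabulates, for each degree product k_i * k_j, the fraction of protein pairs that actually interact. It is an illustration of the quantity involved, not the authors' analysis.

```python
from collections import Counter
from itertools import combinations


def degree_products(edges):
    """Return node degrees and the degree product k_i * k_j for every node pair."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    products = {
        (a, b): degree[a] * degree[b]
        for a, b in combinations(sorted(degree), 2)
    }
    return dict(degree), products


def interaction_fraction_by_product(edges):
    """Group node pairs by degree product and report the fraction that interact."""
    _, products = degree_products(edges)
    observed = {tuple(sorted(edge)) for edge in edges}
    totals, hits = Counter(), Counter()
    for pair, product in products.items():
        totals[product] += 1
        if pair in observed:
            hits[product] += 1
    return {product: hits[product] / totals[product] for product in sorted(totals)}


# Toy network with invented protein labels.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(interaction_fraction_by_product(toy_edges))
```

On a real interaction network, plotting these fractions against the degree product would show whether interaction probability rises roughly in proportion to it, as the abstract reports.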
Affiliation(s)
- Joseph Ivanic
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Ft. Detrick, MD 21702, USA.