1. Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024; 6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024] Open
Abstract
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
Affiliation(s)
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Department of Statistics, Penn State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
2. Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
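To make the k-mer idea behind alignment-free comparison concrete, the following is a minimal Python sketch. It is not taken from the review: the function names `kmer_counts` and `cosine_distance` are hypothetical, and cosine distance over k-mer count vectors is only one of many possible alignment-free measures, chosen here for brevity.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two k-mer count vectors.

    Assumption: cosine distance stands in for the many AF measures
    discussed in the literature; it is not singled out by the review.
    """
    dot = sum(a[m] * b[m] for m in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other
    return 1.0 - dot / (norm_a * norm_b)


profile_x = kmer_counts("ACGTACGT", 3)
profile_y = kmer_counts("ACGTTCGT", 3)
distance = cosine_distance(profile_x, profile_y)
```

A matrix of such pairwise distances could then be passed to a standard distance-based tree-building method, with no multiple sequence alignment required.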
3. Dong Q, Wang K, Liu X. Identifying the missing proteins in human proteome by biological language model. BMC Syst Biol 2016; 10:113. [PMID: 28155671 PMCID: PMC5259966 DOI: 10.1186/s12918-016-0352-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
Abstract
BACKGROUND With the rapid development of high-throughput sequencing technology, proteomics research has become a prominent field in the post-genomics era. It is necessary to identify all the native-encoding protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks in the Human Proteome Project. Considering the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins. RESULTS Since there is an analogy between biological sequences and natural language, n-gram models from the Natural Language Processing field have been used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. There are 102 proteins deduced by the n-gram model that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrum datasets. The evaluation shows that the results reported here are in good agreement with those obtained by other well-established databases. CONCLUSION The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.
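The n-gram approach mentioned in this abstract can be illustrated with a small Python sketch. This is an assumption-laden toy, not the authors' model: it uses a bigram (n = 2) model with add-one smoothing, and the names `train_bigram` and `avg_log_prob` as well as the training sequences are hypothetical.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def train_bigram(seqs):
    """Count residue pairs and their left-hand residues in known proteins."""
    pair, first = Counter(), Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            pair[(a, b)] += 1
            first[a] += 1
    return pair, first


def avg_log_prob(seq, pair, first):
    """Mean log P(next residue | previous residue), add-one smoothed.

    Higher (less negative) scores mean the sequence looks more like
    the training proteins under this toy model.
    """
    vocab = len(AMINO_ACIDS)
    steps = list(zip(seq, seq[1:]))
    if not steps:
        return 0.0
    total = sum(log((pair[(a, b)] + 1) / (first[a] + vocab)) for a, b in steps)
    return total / len(steps)


# Hypothetical training set; a real model would use a curated proteome.
pair, first = train_bigram(["MKKLL", "MKKLV"])
familiar = avg_log_prob("MKKL", pair, first)
unfamiliar = avg_log_prob("WWWW", pair, first)
```

Candidate sequences could then be ranked by score, with a threshold (which would have to be chosen empirically) separating likely native proteins from the rest.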
Affiliation(s)
- Qiwen Dong
- Institute for Data Science and Engineering, East China Normal University, Shanghai, 200062, People's Republic of China; Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, People's Republic of China.
- Kai Wang
- College of Animal Science and Technology, Jilin Agricultural University, Changchun, 130118, People's Republic of China
- Xuan Liu
- College of Engineering, Shanghai Ocean University, Shanghai, 201303, People's Republic of China.
4. Cunial F, Apostolico A. Phylogeny Construction with Rigid Gapped Motifs. J Comput Biol 2012; 19:911-27. [DOI: 10.1089/cmb.2012.0060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Affiliation(s)
- Fabio Cunial
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
- Alberto Apostolico
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
5. Rajasekaran S, Balla S, Gradie P, Gryk MR, Kadaveru K, Kundeti V, Maciejewski MW, Mi T, Rubino N, Vyas J, Schiller MR. Minimotif miner 2nd release: a database and web system for motif search. Nucleic Acids Res 2009; 37:D185-90. [PMID: 18978024 PMCID: PMC2686579 DOI: 10.1093/nar/gkn865] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Received: 08/15/2008] [Accepted: 10/16/2008] [Indexed: 11/24/2022] Open
Abstract
Minimotif Miner (MnM) consists of a minimotif database and a web-based application that enables prediction of motif-based functions in user-supplied protein queries. We have revised MnM by expanding the database more than 10-fold to approximately 5000 motifs and standardized the motif function definitions. The web-application user interface has been redeveloped with new features including improved navigation, screencast-driven help, support for alias names and expanded SNP analysis. A sample analysis of prion shows how MnM 2 can be used. Weblink: http://mnm.engr.uconn.edu, weblink for version 1 is http://sms.engr.uconn.edu.
Affiliation(s)
- Sanguthevar Rajasekaran, Sudha Balla, Patrick Gradie, Michael R. Gryk, Krishna Kadaveru, Vamsi Kundeti, Mark W. Maciejewski, Tian Mi, Nicholas Rubino, Jay Vyas, Martin R. Schiller
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
6. Wang A, Ren L, Abenes G, Hai R. Genome sequence divergences and functional variations in human cytomegalovirus strains. FEMS Immunol Med Microbiol 2008; 55:23-33. [PMID: 19076227 DOI: 10.1111/j.1574-695x.2008.00489.x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 12/01/2022]
Abstract
Genome sequences of numerous and wide-ranging species have been completed, but genome-wide sequence variation patterns linked to biological functions are just starting to be investigated. Here, by comparatively analyzing the genome variation patterns of human cytomegalovirus (HCMV) genomes, we revealed large sequence divergences and functional variations existing in HCMV genomes. They are divergent in genome-size, inversion, orientation and coding potential, even within conserved genes, including nucleotide polymorphism, DNA strand composition asymmetry, and evolutionary rate variation in conserved genes. These divergences in conserved genes are linked to HCMV biology. Codon usage variation of conserved genes located in the negative DNA strand is significantly different between HCMV strains, and this variation associates with virion production and virulence factor, suggesting that the negative DNA strand primarily contributes to virion production and virulence factor in HCMV. In addition, we also revealed that genes functioning for entry and egress are the most adaptable, and that those for transcription and replication are the most conserved in HCMV genomes. The conserved-transcription system is generally controlled by a genome-wide motif GCGC revealed in this study by Chaos map analysis. Our findings demonstrated that genome sequences of HCMV are generally divergent and these divergences directly reflect viral biology.
Affiliation(s)
- Anyou Wang
- School of Public Health, University of California, Berkeley, CA, USA.
7. Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, Lim B, Rigoutsos I. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 2006; 126:1203-17. [PMID: 16990141 DOI: 10.1016/j.cell.2006.07.031] [Citation(s) in RCA: 1499] [Impact Index Per Article: 83.3] [Received: 04/05/2006] [Revised: 06/16/2006] [Accepted: 07/26/2006] [Indexed: 12/12/2022]
Abstract
We present rna22, a method for identifying microRNA binding sites and their corresponding heteroduplexes. Rna22 does not rely upon cross-species conservation, is resilient to noise, and, unlike previous methods, it first finds putative microRNA binding sites in the sequence of interest, then identifies the targeting microRNA. Computationally, we show that rna22 identifies most of the currently known heteroduplexes. Experimentally, with luciferase assays, we demonstrate average repressions of 30% or more for 168 of 226 tested targets. The analysis suggests that some microRNAs may have as many as a few thousand targets, and that between 74% and 92% of the gene transcripts in four model genomes are likely under microRNA control through their untranslated and amino acid coding regions. We also extended the method's key idea to a low-error microRNA-precursor-discovery scheme; our studies suggest that the number of microRNA precursors in mammalian genomes likely ranges in the tens of thousands.
Affiliation(s)
- Kevin C Miranda
- Bioinformatics and Pattern Discovery Group, IBM Thomas J. Watson Research Center, Yorktown Heights, P.O. Box 218, NY 10598, USA
8. Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O. MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics 2006; 7:318. [PMID: 16792820 PMCID: PMC1539025 DOI: 10.1186/1471-2105-7-318] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Received: 04/18/2006] [Accepted: 06/23/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. RESULTS MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. CONCLUSION MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at http://bips.u-strasbg.fr/MACSIMS/.
Affiliation(s)
- Julie D Thompson
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
- Arnaud Muller
- The Laboratory of Molecular Biology, Genetic Analysis & Modelling, Luxembourg
- Andrew Waterhouse
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Jim Procter
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Geoffrey J Barton
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Frédéric Plewniak
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
- Olivier Poch
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
9. Darzentas N, Rigoutsos I, Ouzounis CA. Sensitive detection of sequence similarity using combinatorial pattern discovery: A challenging study of two distantly related protein families. Proteins 2005; 61:926-37. [PMID: 16224785 DOI: 10.1002/prot.20608] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
We investigate the performance of combinatorial pattern discovery to detect remote sequence similarities in terms of both biological accuracy and computational efficiency for a pair of distantly related families, as a case study. The two families represent the cupredoxins and multicopper oxidases, both containing blue copper-binding domains. These families present a challenging case due to low sequence similarity, different local structure, and variable sequence conservation at their copper-binding active sites. In this study, we investigate a new approach for automatically identifying weak sequence similarities that is based on combinatorial pattern discovery. We compare its performance with a traditional, HMM-based scheme and obtain estimates for sensitivity and specificity of the two approaches. Our analysis suggests that pattern discovery methods can be substantially more sensitive in detecting remote protein relationships while at the same time guaranteeing high specificity.
Affiliation(s)
- Nikos Darzentas
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, UK
10. Pal D, Eisenberg D. Inference of Protein Function from Protein Structure. Structure 2005; 13:121-30. [PMID: 15642267 DOI: 10.1016/j.str.2004.10.015] [Citation(s) in RCA: 152] [Impact Index Per Article: 8.0] [Received: 08/05/2004] [Revised: 10/18/2004] [Accepted: 10/20/2004] [Indexed: 11/28/2022]
Abstract
Structural genomics has brought us three-dimensional structures of proteins with unknown functions. To shed light on such structures, we have developed ProKnow (http://www.doe-mbi.ucla.edu/Services/ProKnow/), which annotates proteins with Gene Ontology functional terms. The method extracts features from the protein such as 3D fold, sequence, motif, and functional linkages and relates them to function via the ProKnow knowledgebase of features, which links features to annotated functions via annotation profiles. Bayes' theorem is used to compute weights of the functions assigned, using likelihoods based on the extracted features. The description level of the assigned function is quantified by the ontology depth (from 1 = general to 9 = specific). Jackknife tests show approximately 89% correct assignments at ontology depth 1 and 40% at depth 9, with 93% coverage of 1507 distinct folded proteins. Overall, about 70% of the assignments were inferred correctly. This level of performance suggests that ProKnow is a useful resource in functional assessments of novel proteins.
Affiliation(s)
- Debnath Pal
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095, USA
11. Lu X, Zhai C, Gopalakrishnan V, Buchanan BG. Automatic annotation of protein motif function with Gene Ontology terms. BMC Bioinformatics 2004; 5:122. [PMID: 15345032 PMCID: PMC517493 DOI: 10.1186/1471-2105-5-122] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 10/16/2003] [Accepted: 09/02/2004] [Indexed: 11/15/2022] Open
Abstract
Background Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. Results This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. Conclusions In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
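The mutual-information feature named in this abstract can be sketched in Python as follows. This uses the textbook definition of mutual information between two binary events (a sequence matches the motif; a sequence carries the GO term) computed from a 2x2 table of sequence counts. Whether the paper uses exactly this form is an assumption, and the function name `mutual_information` and its count parameters are hypothetical.

```python
from math import log2


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information (in bits) of two binary variables.

    n11: sequences with the motif and the GO term
    n10: sequences with the motif but not the GO term
    n01: sequences with the GO term but not the motif
    n00: sequences with neither
    """
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    motif, no_motif = n11 + n10, n01 + n00
    term, no_term = n11 + n01, n10 + n00
    mi = 0.0
    for joint, row, col in (
        (n11, motif, term),
        (n10, motif, no_term),
        (n01, no_motif, term),
        (n00, no_motif, no_term),
    ):
        if joint:  # empty cells contribute nothing to the sum
            mi += joint / n * log2(joint * n / (row * col))
    return mi


independent = mutual_information(25, 25, 25, 25)
perfectly_linked = mutual_information(50, 0, 0, 50)
```

A value like this could serve as one input feature, alongside frequency-based features, to a logistic regression classifier of the kind the abstract describes.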
Affiliation(s)
- Xinghua Lu
- Dept. of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, 135 Cannon St. Suite 303, Charleston, SC 29425, USA
- Chengxiang Zhai
- Dept of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL 61801 USA
- Vanathi Gopalakrishnan
- Center for Biomedical Informatics, University of Pittsburgh, 200 Lothrop Street, Suite 8084, Pittsburgh, PA 15213 USA
- Bruce G Buchanan
- Center for Biomedical Informatics, University of Pittsburgh, 200 Lothrop Street, Suite 8084, Pittsburgh, PA 15213 USA
12. Huynh T, Rigoutsos I. The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update. Nucleic Acids Res 2004; 32:W10-5. [PMID: 15215340 PMCID: PMC441505 DOI: 10.1093/nar/gkh367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification--directly from sequence--of structural deviations from alpha-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Affiliation(s)
- Tien Huynh
- Bioinformatics and Pattern Discovery group, IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA
13.
Abstract
Herpesviruses represent an exceptionally suitable model to analyze evolutionarily old pathogens, their competency to adapt to existing and changing molecular niches in host species, and the modulation of the gene content and function to comply with the requirements of life. The basis for numerous studies dealing with these questions is reliable statements about the gene content of herpesviral genomes and the functions of viral proteins. The recent determination of the coding strategy of the chimpanzee cytomegalovirus genome and the re-evaluation of the gene content of the human cytomegalovirus genome made it also necessary to restructure the putative transcription map of the Tupaia herpesvirus (THV) genome. Twenty-three THV-specific ORFs formerly predicted to be coding for viral proteins were deleted from the THV transcription map, resulting in a gene layout that is now characterized by the presence of conserved genes in the genome center, which probably reflect the genome structure of common herpesviral ancestors, and species-specific genes at the termini. The conserved regions in the THV genome are characterized by high G + C contents between 60% and 80%, a high CpG dinucleotide frequency, and the presence of densely packed putative CpG islands. The genome termini seem to provide the requirements of large scale rearrangements and complements of the gene content to adapt to new environmental demands. With the help of the recently designed method of dictionary-driven, pattern-based protein annotation it was possible to assign putative functions to almost all potential THV proteins; for example, 123 were found to be putative membrane or secreted proteins, putative signal domains were identified in 69, and 29 proteins were predicted to be glycosylated.
The present study adds new aspects to the knowledge about the precise gene composition of herpesvirus genomes and viral protein functions that are of exceptional importance for studies dealing with the phylogeny, the evolution, vaccine vector development, virus-host interactions, pathogenesis and the determination of protein functions of herpesviruses.
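The G + C content and CpG frequency statistics mentioned in this abstract can be computed with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the observed/expected CpG ratio shown is the commonly used form (CpG count times sequence length, divided by the product of the C and G counts), which may differ from the exact measure used in the study.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one strand.

    Assumption: expected CpG count = (#C * #G) / length, the usual
    convention for this ratio; values near or above 1 indicate
    little or no CpG depletion.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg * len(seq) / (c * g)


window = "ACGCGCGTTACGCGGC"  # hypothetical sequence window
window_gc = gc_content(window)
window_ratio = cpg_obs_exp(window)
```

Sliding these two functions over fixed-length windows of a genome is one simple way to locate candidate CpG islands, although the thresholds used to call an island vary between studies.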
Affiliation(s)
- Udo Bahr
- Hygiene-Institut, Abteilung Virologie, Universität Heidelberg, Im Neuenheimer Feld 324, D-69120 Heidelberg, Germany
14. Rigoutsos I, Riek P, Graham RM, Novotny J. Structural details (kinks and non-alpha conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors. Nucleic Acids Res 2003; 31:4625-31. [PMID: 12888523 PMCID: PMC169910 DOI: 10.1093/nar/gkg639] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022] Open
Abstract
One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular alpha-helical character (i.e. pi-helices, 3(10)-helices and kinks). A 'search engine' derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above 'non-canonical' helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from alpha-helicity are encoded locally in sequence patterns only about 7-9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure-function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html.
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Research Group, IBM Thomas J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA.
|
15
|
Ouzounis CA, Coulson RMR, Enright AJ, Kunin V, Pereira-Leal JB. Classification schemes for protein structure and function. Nat Rev Genet 2003; 4:508-19. [PMID: 12838343 DOI: 10.1038/nrg1113] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.6] [Indexed: 11/09/2022]
Abstract
We examine the structural and functional classifications of the protein universe, providing an overview of the existing classification schemes, their features and inter-relationships. We argue that a unified scheme should be based on a natural classification approach and that more comparative analyses of the present schemes are required both to understand their limitations and to help delimit the number of known protein folds and their corresponding functional roles in cells.
Affiliation(s)
- Christos A Ouzounis
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
|
16
|
Huynh T, Rigoutsos I, Parida L, Platt D, Shibuya T. The web server of IBM's Bioinformatics and Pattern Discovery group. Nucleic Acids Res 2003; 31:3645-50. [PMID: 12824385 PMCID: PMC169027 DOI: 10.1093/nar/gkg621] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 02/13/2003] [Revised: 04/08/2003] [Accepted: 04/08/2003] [Indexed: 11/12/2022] Open
Abstract
We herein present and discuss the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server is operational around the clock and provides access to a variety of methods that have been published by the group's members and collaborators. The available tools correspond to applications ranging from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences and the interactive annotation of amino acid sequences. Additionally, annotations for more than 70 archaeal, bacterial, eukaryotic and viral genomes are available on-line and can be searched interactively. The tools and code bundles can be accessed beginning at http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Affiliation(s)
- Tien Huynh
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, PO BOX 218, Yorktown Heights, NY 10598, USA.
|
17
|
Abstract
The draft of the human genome sequence is still incomplete. The outstanding tasks include filling in some gaps, finalizing the assembly of short sequences, improving sequence accuracy and correctly identifying coding regions. However, a closely related problem that receives little attention is the substantial number of incorrect annotations that have penetrated some of the widely used databases. This article illustrates this problem using the example of ubiquitin genes, and draws some conclusions that apply to false annotations in other short open reading frames (ORFs). Although the focus is on the human genome, other genomes are equally prone to similar propagation of false annotations.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem, 91904, Israel.
|
18
|
Mondal S, Jaishankar SP, Ramakumar S. Role of context in the relationship between form and function: structural plasticity of some PROSITE patterns. Biochem Biophys Res Commun 2003; 305:1078-84. [PMID: 12767941 DOI: 10.1016/s0006-291x(03)00882-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 10/27/2022]
Abstract
True positive hits of a PROSITE sequence pattern are expected to have a characteristic three-dimensional structure. The combined sequence-structure attributes of PROSITE patterns can be used for function prediction of an uncharacterized protein with known primary and 3D structure, a situation that might arise in structural genomics projects. We have found specific examples of true hits of PROSITE patterns displaying structural plasticity by assuming significantly different local conformations, depending upon the context. Our work highlights the importance of taking into account all the known distinct conformations of PROSITE patterns when creating a sensitive 3D template for the pattern, for use in functional annotation.
Affiliation(s)
- Sukanta Mondal
- Department of Physics, Indian Institute of Science, Bangalore 560 012, India
|
19
|
Rigoutsos I, Novotny J, Huynh T, Chin-Bow ST, Parida L, Platt D, Coleman D, Shenk T. In silico pattern-based analysis of the human cytomegalovirus genome. J Virol 2003; 77:4326-44. [PMID: 12634390 PMCID: PMC150618 DOI: 10.1128/jvi.77.7.4326-4344.2003] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.5] [Received: 07/10/2002] [Accepted: 12/23/2002] [Indexed: 11/20/2022] Open
Abstract
More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/).
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, New York 10598, USA.
|