1
Chen C, Wang Q, Huang H, Vinayaka CR, Garavelli JS, Arighi CN, Natale DA, Wu CH. PIRSitePredict for protein functional site prediction using position-specific rules. Database (Oxford) 2019; 2019:baz026. [PMID: 30805646 PMCID: PMC6389862 DOI: 10.1093/database/baz026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/15/2018] [Revised: 01/24/2019] [Accepted: 02/04/2019] [Indexed: 11/14/2022]
Abstract
Methods focused on predicting 'global' annotations for proteins (such as molecular function, biological process and presence of domains or membership in a family) have reached a relatively mature stage. Methods to provide fine-grained 'local' annotation of functional sites (at the level of individual amino acid) are now coming to the forefront, especially in light of the rapid accumulation of genetic variant data. We have developed a computational method and workflow that predicts functional sites within proteins using position-specific conditional template annotation rules (namely PIR Site Rules or PIRSRs for short). Such rules are curated through review of known protein structural and other experimental data by structural biologists and are used to generate high-quality annotations for the UniProt Knowledgebase (UniProtKB) unreviewed section. To share the PIRSR functional site prediction method with the broader scientific community, we have streamlined our workflow and developed a stand-alone Java software package named PIRSitePredict. We demonstrate the use of PIRSitePredict for functional annotation of de novo assembled genome/transcriptome by annotating uncharacterized proteins from Trinity RNA-seq assembly of embryonic transcriptomes of the following three cartilaginous fishes: Leucoraja erinacea (Little Skate), Scyliorhinus canicula (Small-spotted Catshark) and Callorhinchus milii (Elephant Shark). On average about 1200 lines of annotations were predicted for each species.
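To make the idea of a position-specific conditional rule concrete, the sketch below maps a site from a template sequence onto a target through a pairwise alignment and transfers the annotation only when the aligned residue satisfies the rule's condition. It is a minimal Python illustration, not the PIRSitePredict implementation (which the abstract describes as a Java package); the `SiteRule` class, the function names, and the toy alignment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SiteRule:
    """Hypothetical position-specific rule (not the real PIRSR format)."""
    template_pos: int      # 1-based position in the ungapped template
    allowed: frozenset     # residues accepted at the aligned target position
    annotation: str        # label to transfer when the condition holds


def map_position(template_aln: str, target_aln: str, template_pos: int) -> Optional[int]:
    """Map a 1-based template position to a 1-based target position
    through a gapped pairwise alignment ('-' marks a gap)."""
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")
    t_idx = q_idx = 0
    for t_char, q_char in zip(template_aln, target_aln):
        if t_char != "-":
            t_idx += 1
        if q_char != "-":
            q_idx += 1
        if t_char != "-" and t_idx == template_pos:
            # The site is lost if the target has a gap in this column.
            return q_idx if q_char != "-" else None
    return None


def apply_rule(rule: SiteRule, template_aln: str, target_aln: str) -> Optional[Tuple[int, str, str]]:
    """Return (target position, residue, annotation) if the rule fires, else None."""
    pos = map_position(template_aln, target_aln, rule.template_pos)
    if pos is None:
        return None
    residue = target_aln.replace("-", "")[pos - 1]
    if residue not in rule.allowed:
        return None
    return (pos, residue, rule.annotation)


if __name__ == "__main__":
    # Toy alignment: the target carries one inserted residue relative to the template.
    rule = SiteRule(template_pos=3, allowed=frozenset("H"), annotation="Active site")
    print(apply_rule(rule, "MK-HDS", "MKAHES"))
```

The condition check is what separates this from plain homology transfer: a target that aligns well but carries a different residue at the site receives no annotation.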
Affiliation(s)
- Chuming Chen
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Qinghua Wang
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Hongzhan Huang
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- John S Garavelli
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
- Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
2
Abstract
UniProt is an important collection of protein sequences and their annotations, which has doubled in size to 80 million sequences during the past year. This growth in sequences has prompted an extension of UniProt accession number space from 6 to 10 characters. An increasing fraction of new sequences are identical to a sequence that already exists in the database with the majority of sequences coming from genome sequencing projects. We have created a new proteome identifier that uniquely identifies a particular assembly of a species and strain or subspecies to help users track the provenance of sequences. We present a new website that has been designed using a user-experience design process. We have introduced an annotation score for all entries in UniProt to represent the relative amount of knowledge known about each protein. These scores will be helpful in identifying which proteins are the best characterized and most informative for comparative analysis. All UniProt data is provided freely and is available on the web at http://www.uniprot.org/.
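The move from 6- to 10-character accessions matters for any code that validates identifiers. The Python sketch below accepts both lengths. The regular expression follows the accession pattern given in the UniProt help pages, quoted here as an assumption to be checked against the current documentation; the function name `accession_length` is hypothetical.

```python
import re

# Assumed UniProt accession pattern: the first alternative and the
# single-repeat case cover 6-character accessions; the double repeat
# covers the extended 10-character form.
ACCESSION_RE = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)


def accession_length(acc: str):
    """Return 6 or 10 for a well-formed accession, otherwise None."""
    return len(acc) if ACCESSION_RE.fullmatch(acc) else None


if __name__ == "__main__":
    for acc in ("P12345", "A0A022YWF9", "12345"):
        print(acc, accession_length(acc))
```

Using `fullmatch` rather than `match` rejects strings that merely begin with a valid accession, such as a 9-character truncation of a 10-character identifier.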
Affiliation(s)
- The UniProt Consortium
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street North West, Suite 1200, Washington, DC 20007, USA
- Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE 19711, USA
- To whom correspondence should be addressed. Tel: +44 1223 494100; Fax: +44 1223 494468;
3
Gong S, Ware JS, Walsh R, Cook SA. NECTAR: a database of codon-centric missense variant annotations. Nucleic Acids Res 2013; 42:D1013-9. [PMID: 24297257 PMCID: PMC3965063 DOI: 10.1093/nar/gkt1245] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
NECTAR (Non-synonymous Enriched Coding muTation ARchive; http://nectarmutation.org) is a database and web application to annotate disease-related and functionally important amino acids in human proteins. A number of tools are available to facilitate the interpretation of DNA variants identified in diagnostic or research sequencing. These typically identify previous reports of DNA variation at a given genomic location, predict its effects on transcript and protein sequence and may predict downstream functional consequences. Previous reports and functional annotations are typically linked by the genomic location of the variant observed. NECTAR collates disease-causing variants and functionally important amino acid residues from a number of sources. Importantly, rather than simply linking annotations by a shared genomic location, NECTAR annotates variants of interest with details of previously reported variation affecting the same codon. This provides a much richer data set for the interpretation of a novel DNA variant. NECTAR also identifies functionally equivalent amino acid residues in evolutionarily related proteins (paralogues) and, where appropriate, transfers annotations between them. As well as accessing these data through a web interface, users can upload batches of variants in variant call format (VCF) for annotation on-the-fly. The database is freely available to download from the ftp site: ftp://ftp.nectarmutation.org.
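The difference between linking by genomic location and linking by codon can be shown with a small Python sketch. It assumes variants are already expressed as 1-based coding-sequence positions on a single transcript. The gene name, the variant labels, and the function names are made up for illustration and do not reflect NECTAR's schema or data.

```python
from collections import defaultdict


def codon_number(cds_pos: int) -> int:
    """Convert a 1-based coding-sequence position to a 1-based codon number."""
    if cds_pos < 1:
        raise ValueError("coding-sequence positions are 1-based")
    return (cds_pos - 1) // 3 + 1


def group_by_codon(variants):
    """Index (gene, cds_pos, label) records by (gene, codon number)."""
    index = defaultdict(list)
    for gene, cds_pos, label in variants:
        index[(gene, codon_number(cds_pos))].append(label)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical records: two distinct nucleotide positions in the same codon.
    known = [
        ("GENE1", 4, "c.4A>G"),
        ("GENE1", 5, "c.5C>T"),
        ("GENE1", 7, "c.7G>A"),
    ]
    index = group_by_codon(known)
    # A novel variant at position 6 shares codon 2 with the first two records,
    # although no earlier report exists at exactly that nucleotide.
    print(index.get(("GENE1", codon_number(6)), []))
```

A lookup keyed on exact position would return nothing for the novel variant; keying on the codon surfaces both earlier reports.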
Affiliation(s)
- Sungsam Gong
- NIHR Cardiovascular Biomedical Research Unit, Royal Brompton and Harefield NHS Foundation Trust and Imperial College London, London SW3 6NP, UK, National Heart and Lung Institute, Imperial College, London SW3 6LY, UK, National Heart Centre Singapore, Singapore 168752, Singapore and Cardiovascular & Metabolic Disorders, Duke National University of Singapore, Singapore 169857, Singapore
4
Gana R, Rao S, Huang H, Wu C, Vasudevan S. Structural and functional studies of S-adenosyl-L-methionine binding proteins: a ligand-centric approach. BMC Structural Biology 2013; 13:6. [PMID: 23617634 PMCID: PMC3662625 DOI: 10.1186/1472-6807-13-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 10/29/2012] [Accepted: 04/09/2013] [Indexed: 12/31/2022]
Abstract
BACKGROUND The post-genomic era poses several challenges. The biggest is the identification of biochemical function for protein sequences and structures resulting from genomic initiatives. Most sequences lack a characterized function and are annotated as hypothetical or uncharacterized. While homology-based methods are useful, and work well for sequences with sequence identities above 50%, they fail for sequences in the twilight zone (<30%) of sequence identity. For cases where sequence methods fail, structural approaches are often used, based on the premise that structure preserves function for longer evolutionary time-frames than sequence alone. It is now clear that no single method can be used successfully for functional inference. Given the growing need for functional assignments, we describe here a systematic new approach, designated ligand-centric, which is primarily based on analysis of ligand-bound/unbound structures in the PDB. Results of applying our approach to S-adenosyl-L-methionine (SAM) binding proteins are presented.
RESULTS Our analysis included 1,224 structures that belong to 172 unique families of the Protein Information Resource Superfamily system. Our ligand-centric approach was divided into four levels: residue, protein/domain, ligand, and family levels. The residue level included the identification of conserved binding site residues based on structure-guided sequence alignments of representative members of a family, and the identification of conserved structural motifs. The protein/domain level included structural classification of proteins, Pfam domains, domain architectures, and protein topologies. The ligand level included ligand conformations, ribose sugar puckering, and the identification of conserved ligand-atom interactions. The family level included phylogenetic analysis.
CONCLUSION We found that SAM bound to a total of 18 different fold types (I-XVIII). We identified 4 new fold types and 11 additional topological arrangements of strands within the well-studied Rossmann fold Methyltransferases (MTases). This extends the existing structural classification of SAM binding proteins. A striking correlation between fold type and the conformation of the bound SAM (classified as types) was found across the 18 fold types. Several site-specific rules were created for the assignment of functional residues to families and proteins that do not have a bound SAM or a solved structure.
Affiliation(s)
- Rajaram Gana
- Department of Biostatistics and Bioinformatics, Georgetown University Medical Center, Washington, DC 20007, USA
5
Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 2013; 41:D43-7. [PMID: 23161681 PMCID: PMC3531094 DOI: 10.1093/nar/gks1068] [Citation(s) in RCA: 543] [Impact Index Per Article: 49.4] [Received: 09/26/2012] [Revised: 10/11/2012] [Accepted: 10/11/2012] [Indexed: 12/22/2022]
Abstract
The mission of the Universal Protein Resource (UniProt) (http://www.uniprot.org) is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. It integrates, interprets and standardizes data from numerous resources to achieve the most comprehensive catalogue of protein sequences and functional annotation. UniProt comprises four major components, each optimized for different uses, the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is updated and distributed every 4 weeks and can be accessed online for searches or downloads.
Affiliation(s)
- The UniProt Consortium
- The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street North West, Suite 1200, Washington, DC 20007 and Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE 19711, USA
6
Pedruzzi I, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A. HAMAP in 2013, new developments in the protein family classification and annotation system. Nucleic Acids Res 2012. [PMID: 23193261 PMCID: PMC3531088 DOI: 10.1093/nar/gks1157] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 12/22/2022]
Abstract
HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles.
Affiliation(s)
- Ivo Pedruzzi
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland