1
|
Marciano DC, Wang C, Hsu TK, Bourquard T, Atri B, Nehring RB, Abel NS, Bowling EA, Chen TJ, Lurie PD, Katsonis P, Rosenberg SM, Herman C, Lichtarge O. Evolutionary action of mutations reveals antimicrobial resistance genes in Escherichia coli. Nat Commun 2022; 13:3189. [PMID: 35680894 PMCID: PMC9184624 DOI: 10.1038/s41467-022-30889-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/15/2021] [Accepted: 05/24/2022] [Indexed: 11/08/2022]
Abstract
Since antibiotic development lags, we search for potential drug targets through directed evolution experiments. A challenge is that many resistance genes hide in a noisy mutational background as mutator clones emerge in the adaptive population. Here, to overcome this noise, we quantify the impact of mutations through evolutionary action (EA). After sequencing ciprofloxacin- or colistin-resistant strains grown under different mutational regimes, we find that an elevated sum of the evolutionary action of mutations in a gene identifies known resistance drivers. This EA integration approach also suggests new antibiotic resistance genes, which are then shown to provide a fitness advantage in competition experiments. Moreover, EA integration analysis of clinical and environmental isolates of antibiotic-resistant E. coli identifies gene drivers of resistance where a standard approach fails. Together these results inform the genetic basis of de novo colistin resistance and support the robust discovery of phenotype-driving genes via the evolutionary action of genetic perturbations in fitness landscapes.
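The core idea of EA integration, summing the evolutionary action of the mutations observed in each gene and ranking genes by that sum, can be illustrated with a minimal sketch. The function name `integrate_ea`, the gene labels, and the scores below are hypothetical, and the 0 to 100 score range is an assumption; the published analysis also involves comparison against a mutational background, which this sketch omits.

```python
from collections import defaultdict


def integrate_ea(mutations):
    """Sum EA scores per gene and rank genes by the total.

    mutations: iterable of (gene, ea_score) pairs. Scores are assumed
    to lie on a 0-100 scale (an assumption made for this sketch).
    Returns a list of (gene, total_ea) sorted from highest to lowest.
    """
    totals = defaultdict(float)
    for gene, ea in mutations:
        totals[gene] += ea
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: three mutations in geneA, one each in geneB and geneC.
toy_mutations = [
    ("geneA", 85.0),
    ("geneA", 72.5),
    ("geneB", 12.0),
    ("geneA", 64.0),
    ("geneC", 30.0),
]

ranked = integrate_ea(toy_mutations)
for gene, total in ranked:
    print(f"{gene}\t{total:.1f}")
```

A gene hit by several high-impact mutations rises to the top even when many low-impact passenger mutations are scattered across other genes, which is the noise-suppression property the abstract describes.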
Affiliation(s)
- David C Marciano
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
- Chen Wang
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Teng-Kuei Hsu
  - The Verna and Marrs McLean Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- Thomas Bourquard
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Benu Atri
  - Structural and Computational Biology & Molecular Biophysics Program, Baylor College of Medicine, Houston, TX, 77030, USA
  - Clara Analytics Inc., 451 El Camino Real #201, Santa Clara, CA, 95050, USA
- Ralf B Nehring
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - The Verna and Marrs McLean Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Nicholas S Abel
  - Department of Pharmacology, Baylor College of Medicine, Houston, TX, 77030, USA
- Elizabeth A Bowling
  - The Verna and Marrs McLean Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- Taylor J Chen
  - Integrative Molecular & Biomedical Biosciences Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Pamela D Lurie
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Panagiotis Katsonis
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Susan M Rosenberg
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - The Verna and Marrs McLean Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
  - Integrative Molecular & Biomedical Biosciences Program, Baylor College of Medicine, Houston, TX, 77030, USA
- Christophe Herman
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Structural and Computational Biology & Molecular Biophysics Program, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, TX, 77030, USA.
|
2
|
Katsonis P, Lichtarge O. A formal perturbation equation between genotype and phenotype determines the Evolutionary Action of protein-coding variations on fitness. Genome Res 2014; 24:2050-8. [PMID: 25217195 PMCID: PMC4248321 DOI: 10.1101/gr.176214.114] [Citation(s) in RCA: 114] [Impact Index Per Article: 11.4] [Indexed: 11/25/2022]
Abstract
The relationship between genotype mutations and phenotype variations determines health in the short term and evolution over the long term, and it hinges on the action of mutations on fitness. A fundamental difficulty in determining this action, however, is that it depends on the unique context of each mutation, which is complex and often cryptic. As a result, the effect of most genome variations on molecular function and overall fitness remains unknown and stands apart from population genetics theories linking fitness effect to polymorphism frequency. Here, we hypothesize that evolution is a continuous and differentiable physical process coupling genotype to phenotype. This leads to a formal equation for the action of coding mutations on fitness that can be interpreted as a product of the evolutionary importance of the mutated site with the difference in amino acid similarity. Approximations for these terms are readily computable from phylogenetic sequence analysis, and we show mutational, clinical, and population genetic evidence that this action equation predicts the effect of point mutations in vivo and in vitro in diverse proteins, correlates disease-causing gene mutations with morbidity, and determines the frequency of human coding polymorphisms, respectively. Thus, elementary calculus and phylogenetics can be integrated into a perturbation analysis of the evolutionary relationship between genotype and phenotype that quantitatively links point mutations to function and fitness and that opens a new analytic framework for equations of biology. In practice, this work explicitly bridges molecular evolution with population genetics with applications from protein redesign to the clinical assessment of human genetic variations.
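The product form described in this abstract can be written compactly. The notation below is an illustrative sketch: the symbols $f$, $\gamma$, $\varphi$ and $r_i$ are labels assumed here and do not appear in the text above. If the fitness phenotype $\varphi$ is a differentiable function $f$ of the genotype $\gamma$, then a substitution of amino acid $X$ by $Y$ at residue position $i$ perturbs fitness, to first order, by

```latex
\varphi = f(\gamma), \qquad
\Delta\varphi \;\approx\; \frac{\partial f}{\partial r_i}\,\Delta r_{i,\,X \to Y}
```

Here $\partial f / \partial r_i$ plays the role of the evolutionary importance of the mutated site and $\Delta r_{i,\,X \to Y}$ the difference in amino acid similarity, the two terms the abstract says can be approximated from phylogenetic sequence analysis.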
Affiliation(s)
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Department of Biochemistry & Molecular Biology, Department of Pharmacology, Baylor College of Medicine, Houston, Texas 77030, USA; Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas 77030, USA
|
3
|
Wong WC, Maurer-Stroh S, Eisenhaber B, Eisenhaber F. On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation. BMC Bioinformatics 2014; 15:166. [PMID: 24890864 PMCID: PMC4061105 DOI: 10.1186/1471-2105-15-166] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 10/25/2013] [Accepted: 05/27/2014] [Indexed: 02/01/2023]
Abstract
Background: Protein sequence similarities to any type of non-globular segment (coiled coils, low complexity regions, transmembrane regions, long loops, etc., where either positional sequence conservation is the result of a very simple, physically induced pattern or rather integral sequence properties are critical) are pertinent sources for mistaken homologies. Regretfully, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute for manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. Results: The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks for conferring the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept for cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied to a large-scale study of SMART and PFAM domains in the space of seed sequences and in the space of UniProt/SwissProt. Conclusions: Sequence similarity score dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only supported by structure comparison.
Affiliation(s)
- Wing-Cheong Wong
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore.
|
4
|
Chakraborty A, Chakrabarti S. A survey on prediction of specificity-determining sites in proteins. Brief Bioinform 2014; 16:71-88. [DOI: 10.1093/bib/bbt092] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Indexed: 11/13/2022]
|
5
|
Gao YF, Li BQ, Cai YD, Feng KY, Li ZD, Jiang Y. Prediction of active sites of enzymes by maximum relevance minimum redundancy (mRMR) feature selection. Mol Biosyst 2013; 9:61-9. [DOI: 10.1039/c2mb25327e] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 01/14/2023]
|
6
|
Abstract
The evolutionary trace (ET) is the single most validated approach to identify protein functional determinants and to target mutational analysis, protein engineering and drug design to the most relevant sites of a protein. It applies to the entire proteome; its predictions come with a reliability score; and its results typically reach significance in most protein families with 20 or more sequence homologs. In order to identify functional hot spots, ET scans a multiple sequence alignment for residue variations that correlate with major evolutionary divergences. In case studies this enables the selective separation, recoding, or mimicry of functional sites and, on a large scale, this enables specific function predictions based on motifs built from select ET-identified residues. ET is therefore an accurate, scalable and efficient method to identify the molecular determinants of protein function and to direct their rational perturbation for therapeutic purposes. Public ET servers are located at: http://mammoth.bcm.tmc.edu/.
|
7
|
Wilkins AD, Lua R, Erdin S, Ward RM, Lichtarge O. Sequence and structure continuity of evolutionary importance improves protein functional site discovery and annotation. Protein Sci 2010; 19:1296-311. [PMID: 20506260 DOI: 10.1002/pro.406] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 12/19/2022]
Abstract
Protein functional sites control most biological processes and are important targets for drug design and protein engineering. To characterize them, the evolutionary trace (ET) ranks the relative importance of residues according to their evolutionary variations. Generally, top-ranked residues cluster spatially to define evolutionary hotspots that predict functional sites in structures. Here, various functions that measure the physical continuity of ET ranks among neighboring residues in the structure, or in the sequence, are shown to inform sequence selection and to improve functional site resolution. This is shown first in 110 proteins, for which the overlap between top-ranked residues and actual functional sites rose by 8% in significance. Then, on a structural proteomic scale, optimized ET led to better 3D structure-function motifs (3D templates) and, in turn, to enzyme function prediction by the Evolutionary Trace Annotation (ETA) method with better sensitivity (40% to 53%) and positive predictive value (93% to 94%). This suggests that the similarity of evolutionary importance among neighboring residues in the sequence and in the structure is a universal feature of protein evolution. In practice, this yields a tool for optimizing sequence selections for comparative analysis and, via ET, for better predictions of functional site and function. This should prove useful for the efficient mutational redesign of protein function and for pharmaceutical targeting.
Affiliation(s)
- A D Wilkins
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
|
8
|
Evolution: a guide to perturb protein function and networks. Curr Opin Struct Biol 2010; 20:351-9. [PMID: 20444593 DOI: 10.1016/j.sbi.2010.04.002] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/06/2010] [Accepted: 04/08/2010] [Indexed: 12/11/2022]
Abstract
Protein interactions give rise to networks that control cell fate in health and disease; selective means to probe these interactions are therefore of wide interest. We discuss here Evolutionary Tracing (ET), a comparative method to identify protein functional sites and to guide experiments that selectively block, recode, or mimic their amino acid determinants. These studies suggest, in principle, a scalable approach to perturb individual links in protein networks.
|
9
|
Shudler M, Niv MY. BlockMaster: partitioning protein kinase structures using normal-mode analysis. J Phys Chem A 2009; 113:7528-34. [PMID: 19485335 DOI: 10.1021/jp900885w] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 02/06/2023]
Abstract
Protein kinases are key signaling enzymes which are dysregulated in many health disorders and therefore represent major targets of extensive drug discovery efforts. Their regulation in the cell is exerted via various mechanisms, including control of the 3D conformation of their catalytic domains. We developed a procedure, BlockMaster, for partitioning protein structures into semirigid blocks and flexible regions based on residue-residue correlations calculated from normal modes. BlockMaster provided correct partitioning into domains and subdomains of several test set proteins for which documented expert annotation of subdomains exists. When applied to representative structures of protein kinases, BlockMaster identified semirigid blocks within the traditional N-terminal and C-terminal lobes of the kinase domain. In general, the block regions had elevated helical content and reduced, but significant, coil content compared to the nonblock (flexible) regions. The specificity-determining regions, previously used to derive inhibitory peptides, were found to be more flexible in the tyrosine kinases than in serine/threonine kinases. Two blocks were identified which spanned both lobes. The first, which we termed the "pivot" block, included the alphaC-beta4 loop in the N-terminal lobe and part of the activation loop in the C-terminal lobe and appeared in both the active and inactive conformations of the kinases. The second, which we termed the "loop" block, differed between the active and inactive conformations. In the structures of active kinases, this block included part of the activation loop in the C-terminal lobe and the alphaC helix in the N-terminal lobe, representing a known interaction that stabilizes the active conformation. In the inactive structures, this block included G loop residues instead of the alphaC residues. This novel inactive "loop" block may stabilize the inactive conformation and thus downregulate kinase activity.
Affiliation(s)
- Marina Shudler
  - The Institute of Biochemistry, Food Science and Nutrition, The Hebrew University of Jerusalem, Rehovot 76100, Israel
|
10
|
Exploiting three kinds of interface propensities to identify protein binding sites. Comput Biol Chem 2009; 33:303-11. [DOI: 10.1016/j.compbiolchem.2009.07.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 09/01/2008] [Revised: 06/22/2009] [Accepted: 07/01/2009] [Indexed: 11/21/2022]
|
11
|
Li N, Sun Z, Jiang F. Prediction of protein-protein binding site by using core interface residue and support vector machine. BMC Bioinformatics 2008; 9:553. [PMID: 19102736 PMCID: PMC2627892 DOI: 10.1186/1471-2105-9-553] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Received: 08/12/2008] [Accepted: 12/22/2008] [Indexed: 12/04/2022]
Abstract
Background: The prediction of protein-protein binding sites can provide structural annotation to the protein interaction data from proteomics studies. This is very important for the biological application of the protein interaction data that is increasing rapidly. Moreover, methods for predicting protein interaction sites can also provide crucial information for improving the speed and accuracy of protein docking methods. Results: In this work, we describe a binding site prediction method by designing a new residue neighbour profile and by selecting only the core-interface residues for SVM training. The residue neighbour profile includes both the sequential and the spatial neighbour residues of an interface residue, which is a more complete description of the physical and chemical characteristics surrounding the interface residue. The concept of core interface is applied in selecting the interface residues for training the SVM models, which is shown to result in better discrimination between the core interface and other residues. The best SVM model trained was tested on a test set of 50 randomly selected proteins. The sensitivity, specificity, and MCC for the prediction of the core interface residues were 60.6%, 53.4%, and 0.243, respectively. Our prediction results on this test set were compared with three other binding site prediction methods and found to perform better. Furthermore, our method was tested on the 101 unbound proteins from the protein-protein interaction benchmark v2.0. The sensitivity, specificity, and MCC of this test were 57.5%, 32.5%, and 0.168, respectively. Conclusion: By improving both the descriptions of the interface residues and their surrounding environment and the training strategy, better SVM models were obtained and shown to outperform previous methods. Our tests on the unbound protein structures suggest further improvement is possible.
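For readers unfamiliar with the three figures of merit quoted in this abstract, the sketch below computes sensitivity, specificity, and the Matthews correlation coefficient (MCC) from a confusion matrix using the textbook definitions. This is an assumption about conventions: binding-site papers sometimes use "specificity" for what is elsewhere called precision, and the abstract does not say which definition was applied. The function name `binary_metrics` and the counts in the example are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    Textbook definitions (an assumption about the convention in use):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Ratios with a zero denominator are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for residues labelled interface / non-interface.
sens, spec, mcc = binary_metrics(tp=6, fp=2, tn=10, fn=4)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

MCC is reported alongside the two rates because interface residues are a minority class, so a predictor can score well on one rate alone while being uninformative overall.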
Affiliation(s)
- Nan Li
  - Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing, PR China.
|
12
|
Ming D, Cohn JD, Wall ME. Fast dynamics perturbation analysis for prediction of protein functional sites. BMC Struct Biol 2008; 8:5. [PMID: 18234095 PMCID: PMC2276503 DOI: 10.1186/1472-6807-8-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 10/18/2007] [Accepted: 01/30/2008] [Indexed: 11/10/2022]
Abstract
Background: We present a fast version of the dynamics perturbation analysis (DPA) algorithm to predict functional sites in protein structures. The original DPA algorithm finds regions in proteins where interactions cause a large change in the protein conformational distribution, as measured using the relative entropy Dx. Such regions are associated with functional sites. Results: The Fast DPA algorithm, which accelerates DPA calculations, is motivated by an empirical observation that Dx in a normal-modes model is highly correlated with an entropic term that only depends on the eigenvalues of the normal modes. The eigenvalues are accurately estimated using first-order perturbation theory, resulting in an N-fold reduction in the overall computational requirements of the algorithm, where N is the number of residues in the protein. The performance of the original and Fast DPA algorithms was compared using protein structures from a standard small-molecule docking test set. For nominal implementations of each algorithm, top-ranked Fast DPA predictions overlapped the true binding site 94% of the time, compared to 87% of the time for original DPA. In addition, per-protein recall statistics (fraction of binding-site residues that are among predicted residues) were slightly better for Fast DPA. On the other hand, per-protein precision statistics (fraction of predicted residues that are among binding-site residues) were slightly better using original DPA. Overall, the performance of Fast DPA in predicting ligand-binding-site residues was comparable to that of the original DPA algorithm. Conclusion: Compared to the original DPA algorithm, the decreased run time with comparable performance makes Fast DPA well-suited for implementation on a web server and for high-throughput analysis.
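The per-protein recall and precision statistics are defined in the abstract in terms of residue sets, so they can be sketched directly. The function name `recall_precision` and the residue numbers in the example are hypothetical; only the two definitions come from the text.

```python
def recall_precision(predicted, binding_site):
    """Per-protein recall and precision over residue sets.

    recall    = fraction of binding-site residues that are among predicted residues
    precision = fraction of predicted residues that are among binding-site residues
    An empty denominator yields 0.0.
    """
    predicted = set(predicted)
    binding_site = set(binding_site)
    overlap = predicted & binding_site
    recall = len(overlap) / len(binding_site) if binding_site else 0.0
    precision = len(overlap) / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical residue indices for a single protein.
rec, prec = recall_precision(predicted={1, 2, 3, 4}, binding_site={3, 4, 5, 6, 7, 8})
print(f"recall={rec:.3f} precision={prec:.3f}")
```

The two statistics trade off against each other: predicting a larger patch of residues tends to raise recall and lower precision, which is consistent with one algorithm being slightly better on each measure.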
Affiliation(s)
- Dengming Ming
  - Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico, USA.
|
13
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 360] [Impact Index Per Article: 21.2] [Indexed: 11/09/2022]
|
14
|
Livesay DR, Kidd PD, Eskandari S, Roshan U. Assessing the ability of sequence-based methods to provide functional insight within membrane integral proteins: a case study analyzing the neurotransmitter/Na+ symporter family. BMC Bioinformatics 2007; 8:397. [PMID: 17941992 PMCID: PMC2194793 DOI: 10.1186/1471-2105-8-397] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 04/27/2007] [Accepted: 10/17/2007] [Indexed: 01/09/2023]
Abstract
Background: Efforts to predict functional sites from globular proteins are increasingly common; however, the most successful of these methods generally require structural insight. Unfortunately, despite several recent technological advances, structural coverage of membrane integral proteins continues to be sparse. Consequently, sequence-based methods represent an important alternative to illuminate functional roles. In this report, we critically examine the ability of several computational methods to provide functional insight within two specific areas. First, can phylogenomic methods accurately describe the functional diversity across a membrane integral protein family? And second, can sequence-based strategies accurately predict key functional sites? Due to the presence of a recently solved structure and a vast amount of experimental mutagenesis data, the neurotransmitter/Na+ symporter (NSS) family is an ideal model system to assess the quality of our predictions. Results: The raw NSS sequence dataset contains 181 sequences, which have been aligned by various methods. The resultant phylogenetic trees always contain six major subfamilies that are consistent with the functional diversity across the family. Moreover, in well-represented subfamilies, phylogenetic clustering recapitulates several nuanced functional distinctions. Functional sites are predicted using six different methods (phylogenetic motifs, two methods that identify subfamily-specific positions, and three different conservation scores). A canonical set of 34 functional sites identified by Yamashita et al. within the recently solved LeuTAa structure is used to assess the quality of the predictions, most of which are predicted by the bioinformatic methods. Remarkably, the importance of these sites is largely confirmed by experimental mutagenesis. Furthermore, the collective set of functional site predictions qualitatively clusters along the proposed transport pathway, further demonstrating their utility. Interestingly, the various prediction schemes provide results that are predominantly orthogonal to each other. However, when the methods do provide overlapping results, specificity is shown to increase dramatically (e.g., sites predicted by any three methods have both accuracy and coverage greater than 50%). Conclusion: The results presented herein clearly establish the viability of sequence-based bioinformatic strategies to provide functional insight within the NSS family. As such, we expect similar bioinformatic investigations will streamline functional investigations within membrane integral families in the absence of structure.
Affiliation(s)
- Dennis R Livesay
  - Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28262, USA.
|
15
|
Mistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics 2007; 8:298. [PMID: 17688688 PMCID: PMC2025603 DOI: 10.1186/1471-2105-8-298] [Citation(s) in RCA: 166] [Impact Index Per Article: 9.8] [Received: 04/05/2007] [Accepted: 08/09/2007] [Indexed: 12/03/2022]
Abstract
Background: Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. Description: We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. Conclusion: We have predicted 606,110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200-fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.
Affiliation(s)
- Jaina Mistry
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Alex Bateman
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Robert D Finn
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
|
16
|
Dong Q, Wang X, Lin L, Guan Y. Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics 2007; 8:147. [PMID: 17480235 PMCID: PMC1885810 DOI: 10.1186/1471-2105-8-147] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Received: 02/01/2007] [Accepted: 05/05/2007] [Indexed: 01/14/2023]
Abstract
Background: Recognition of binding sites in proteins is a direct computational approach to the characterization of proteins in terms of biological and biochemical function. Residue preferences have been widely used in many studies but the results are often not satisfactory. Although different amino acid compositions among the interaction sites of different complexes have been observed, such differences have not been integrated into the prediction process. Furthermore, evolutionary information has not been exploited to achieve a more powerful propensity. Results: In this study, the residue interface propensities of four kinds of complexes (homo-permanent complexes, homo-transient complexes, hetero-permanent complexes and hetero-transient complexes) are investigated. These propensities, combined with sequence profiles and accessible surface areas, are input to a support vector machine for the prediction of protein binding sites. Such propensities are further improved by taking evolutionary information into consideration, which results in a class of novel propensities at the profile level, i.e. the binary profile interface propensities. Experiments are performed on 1139 non-redundant protein chains. Although different residue interface propensities among different complexes are observed, the improvement of the classifier with residue interface propensities is negligible in comparison with that without propensities. The binary profile interface propensities can significantly improve the performance of binding site prediction, by about ten percent in terms of both precision and recall. Conclusion: Although there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins. The binary profile interface propensities can significantly improve the performance of protein binding site prediction, which indicates that the propensities at the profile level are more accurate than those at the residue level.
Affiliation(s)
- Qiwen Dong
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xiaolong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Lei Lin
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yi Guan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
17
|
Wallace IM, Higgins DG. Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics 2007; 8:135. [PMID: 17451607 PMCID: PMC1878507 DOI: 10.1186/1471-2105-8-135] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 12/14/2006] [Accepted: 04/23/2007] [Indexed: 11/29/2022]
Abstract
Background: Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments. Results: We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids. Conclusion: The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples we used in this paper, we have only used 2 or 3 classes for demonstration purposes but any number can be used and visualised.
Affiliation(s)
- Iain M Wallace
  - The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
- Desmond G Higgins
  - The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
|