1
|
Leelananda SP, Kloczkowski A, Jernigan RL. Fold-specific sequence scoring improves protein sequence matching. BMC Bioinformatics 2016; 17:328. [PMID: 27578239 PMCID: PMC5006591 DOI: 10.1186/s12859-016-1198-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/21/2016] [Accepted: 08/24/2016] [Indexed: 11/10/2022]
Abstract
Background: Sequence matching is extremely important for applications throughout biology, particularly for discovering information such as functional and evolutionary relationships, and also for discriminating between unimportant and disease mutants. At present the functions of a large fraction of genes are unknown; improvements in sequence matching will improve gene annotations. Universal amino acid substitution matrices such as Blosum62 are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. However, such single matrices do not take into account important structural information evident within the different topologies of proteins, and they treat substitutions within all protein folds identically. Others have suggested that the use of structural information can lead to significant improvements in sequence matching, but this has not yet been very effective. Here we develop novel substitution matrices that not only include general sequence information but also have a topology-specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. We have used a novel multi-structure alignment method for each homology level of CATH in order to extract topological information.
Results: We obtain statistically significant improved sequence matching scores for 73% of the alpha helical test cases. On average, 61% of the test cases showed improvements in homology detection when structure information was incorporated into the substitution matrices. On average, z-scores for homology detection are improved by more than 54% for all cases, and some individual cases have z-scores more than twice those obtained using generic matrices. Our topology-specific similarity matrices also outperform other traditional similarity matrices and single-matrix-based structure methods. When the default amino acid substitution matrix in the Psi-blast algorithm is replaced by our structure-based matrices, the structure matching is significantly improved over conventional Psi-blast. It also outperforms results obtained for the corresponding HMM profiles generated for each topology.
Conclusions: We show that by incorporating topology-specific structure information in addition to sequence information into specific amino acid substitution matrices, the sequence matching scores and homology detection are significantly improved. Our topology-specific similarity matrices outperform other traditional similarity matrices and single-matrix-based structure methods, and also show improvement over conventional Psi-blast and HMM profile based methods in sequence matching. The results support the discriminatory ability of the new amino acid similarity matrices to distinguish between distant homologs and structurally dissimilar pairs.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1198-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sumudu P Leelananda
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Present Address: 2120 Newman and Wolfrom Laboratory, The Ohio State University, 100 W 18th Ave, Columbus, OH, 43210, USA; Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Andrzej Kloczkowski
- Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA; Present Address: Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
- Robert L Jernigan
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
|
2
|
|
3
|
Abstract
A method (SPREK) was developed to evaluate the register of a sequence on a structure based on the matching of structural patterns against a library derived from the protein structure databank. The scores obtained were normalized against random background distributions derived from sequence shuffling and permutation methods. 'Random' structures were also used to evaluate the effectiveness of the method. These were generated by a simple random walk and by a more sophisticated structure prediction method that produced protein-like folds. For comparison with other methods, the performance of the method was assessed using collections of models including decoys and models from the CASP-5 exercise. The performance of SPREK on the decoy models was equivalent to (and sometimes better than) that obtained with more complex approaches. An exception was the two smallest proteins, for which SPREK did not perform well due to a lack of patterns. Using the best parameter combination from trials on decoy models, the CASP models of intermediate difficulty were evaluated by SPREK and the quality of the top scoring model was evaluated by its CASP ranking. Of the 14 targets in this class, half lie in the top 10% (out of around 140 models for each target). The two worst rankings resulted from the selection by our method of a well-packed model that was based on the wrong fold. Of the other poor rankings, one was the smallest protein and the others were the four largest (all over 250 residues).
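The normalization against a shuffled-sequence background mentioned in this abstract can be illustrated with a minimal sketch. This is not SPREK itself: `toy_score` is a hypothetical stand-in for the structural pattern score, and `n_shuffles` and `seed` are assumed parameters chosen only for the illustration.

```python
import random
import statistics


def shuffled_z_score(sequence, score_fn, n_shuffles=200, seed=0):
    """Return the z-score of score_fn(sequence) against shuffled copies.

    The background is built by shuffling the residues of the same
    sequence, so composition is preserved and only the order changes.
    """
    rng = random.Random(seed)
    observed = score_fn(sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(score_fn("".join(residues)))
    mean = statistics.fmean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        # Every shuffle scored the same, so there is no usable background.
        return 0.0
    return (observed - mean) / spread


def toy_score(sequence):
    """Hypothetical score: number of adjacent identical residues."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a == b)


if __name__ == "__main__":
    print(shuffled_z_score("AAAAACCCCCGGGGG", toy_score))
```

A strongly ordered sequence scores well above its shuffled copies, giving a positive z-score, while a sequence whose shuffles are all identical falls back to 0.0.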
Affiliation(s)
- William R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
|
4
|
Abstract
Iterated sequence databank search methods were assessed from the viewpoint of someone with the sequence of a novel gene product wishing to find distant relatives to their protein and, with the specific searches against the PDB, also hoping to find a relative of known structure. We examined three methods in detail, spanning a range from simple pattern-matching to sophisticated weighted profiles. Rather than apply these methods 'blindly' (with default parameters) to a large number of test queries, we have concentrated on the globins, so allowing a more detailed investigation of each method on different data subsets with different parameter settings. Despite its widespread use, regular-expression matching proved to be very limited, seldom extending beyond the sub-family from which the pattern was derived. To attain any generality, the patterns had to be 'stripped-down' to include only the most highly conserved parts. The QUEST program avoided these problems by introducing a more flexible (weighted) matching. On the PDB sequences this was highly effective, missing only a few globins with probes based on each sub-family or even a single representative from each sub-family. In addition, very few false-positives were encountered, and those that did match often did so for only a few cycles before being lost again. On the larger sequence collection, however, QUEST encountered problems with maintaining (or achieving) the alignment of the full globin family. psi-BLAST also recognised almost all the globins when matching against the PDB sequences, typically missing three or four of the most distantly related sequences while picking up a few false-positives. In contrast to QUEST, psi-BLAST performed very well on the larger databank, getting almost a full collection of globins although still retaining the same proportion of false-positives. SAM applied to the PDB sequences performed reasonably well with the myoglobin and hemoglobin families as probes, typically missing several of the more difficult proteins, but performed poorly with the leghemoglobin probe. Only with the full family range as a probe did it produce results comparable to psi-BLAST and QUEST. With the larger databank, SAM produced a good result but, again, this was only achieved using the full range of sequence variation with the default regulariser; the use of Dirichlet mixtures failed completely in this situation.
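The trade-off described for regular-expression matching, where a strict pattern stays within its sub-family and only a 'stripped-down' pattern generalizes, can be shown with a small sketch. The sequences and both patterns below are invented for illustration and are not taken from the globin study.

```python
import re


def matching_ids(pattern, sequences):
    """Return the names of sequences that contain a match to pattern."""
    regex = re.compile(pattern)
    return [name for name, seq in sequences.items() if regex.search(seq)]


# Hypothetical toy databank: two close relatives, one distant relative,
# and one unrelated sequence.
TOY_SEQUENCES = {
    "member_a": "MKVLSAGHWTRK",
    "member_b": "MKVLTAGHWSRK",
    "distant": "MRILSPGHFTQK",
    "unrelated": "MDDEEPPLLNNQ",
}

# Strict pattern derived from the two close relatives only.
STRICT_PATTERN = r"VL[ST]AGHW"
# Stripped-down pattern that keeps only the most conserved positions.
STRIPPED_PATTERN = r"[VI]L.{2}GH"

if __name__ == "__main__":
    print(matching_ids(STRICT_PATTERN, TOY_SEQUENCES))
    print(matching_ids(STRIPPED_PATTERN, TOY_SEQUENCES))
```

In this toy case the strict pattern finds only the two close relatives, while the stripped-down pattern also recovers the distant one without matching the unrelated sequence. A shorter pattern would normally raise the risk of false positives on a real databank.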
Affiliation(s)
- W R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, Mill Hill, London, UK.
|
5
|
Schwartz HL, Chandonia JM, Kash SF, Kanaani J, Tunnell E, Domingo A, Cohen FE, Banga JP, Madec AM, Richter W, Baekkeskov S. High-resolution autoreactive epitope mapping and structural modeling of the 65 kDa form of human glutamic acid decarboxylase. J Mol Biol 1999; 287:983-99. [PMID: 10222205 DOI: 10.1006/jmbi.1999.2655] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.1] [Indexed: 12/11/2022]
Abstract
The smaller isoform of the GABA-synthesizing enzyme, glutamic acid decarboxylase 65 (GAD65), is unusually susceptible to becoming a target of autoimmunity affecting its major sites of expression, GABA-ergic neurons and pancreatic beta-cells. In contrast, a highly homologous isoform, GAD67, is not an autoantigen. We used homolog-scanning mutagenesis to identify GAD65-specific amino acid residues which form autoreactive B-cell epitopes in this molecule. Detailed mapping of 13 conformational epitopes, recognized by human monoclonal antibodies derived from patients, together with two- and three-dimensional structure prediction led to a model of the GAD65 dimer. GAD65 has structural similarities to ornithine decarboxylase in the pyridoxal-5'-phosphate-binding middle domain (residues 201-460) and to dialkylglycine decarboxylase in the C-terminal domain (residues 461-585). Six distinct conformational epitopes and one linear epitope cluster on the hydrophilic face of three amphipathic alpha-helices in exons 14-16 in the C-terminal domain. Two of those epitopes also require amino acids in exon 4 in the N-terminal domain. Two distinct epitopes reside entirely in the N-terminal domain. In the middle domain, four distinct conformational epitopes cluster on a charged patch formed by amino acids from three alpha-helices away from the active site, and a fifth epitope resides at the back of the pyridoxal-5'-phosphate binding site and involves amino acid residues in exons 6 and 11-12. The epitopes localize to multiple hydrophilic patches, several of which also harbor DR*0401-restricted T-cell epitopes, and cover most of the surface of the protein. The results reveal a remarkable spectrum of human autoreactivity to GAD65, targeting almost the entire surface, and suggest that native folded GAD65 is the immunogen for autoreactive B-cells.
Affiliation(s)
- H L Schwartz
- Departments of Microbiology/Immunology and Medicine, Hormone Research Institute, San Francisco, CA, 94143-0534, USA
|
6
|
Abstract
Sequence databank searches are often performed iteratively, taking the results of a search to form a probe (either a pattern or profile) for a subsequent scan of the databank. The advantage of this approach is that, as more sequences are drawn into the probe, it should, in principle, be possible to detect increasingly distant members of the family. This approach works well when supervised by an "expert" who has a good "eye" for the quality of the sequence alignment and whether novel matches should be rejected or incorporated into the probe. However, all attempts to automate the process have proved difficult, as the process is inherently unstable. Errors in the alignment, or the misalignment of a non-family member, lead to a deterioration of the probe specificity, so allowing further incorrect sequences to be identified. Here, a combination of two methods is used to provide a check on such instability. A pattern matching (template) search method is used (with a BLAST-like pre-filter for speed) to return sequence segments for alignment in a standard multiple alignment program (MULTAL). Sequences are aligned only to a fixed limit of similarity and any sequences or sub-families that have not joined the original "seed" family are rejected. The remaining core family then provides the basis for a subsequent pattern derivation and databank search. The constant check by the multiple alignment phase allows the search phase to be pushed continually towards the boundary of similarity. This is maintained by lowering the cutoff on the scores of acceptable sequences each time the family remains the same over successive search cycles. The procedure was observed to be stable under misalignments and to have an ability to recognise distantly related family members across super-families that was comparable to Psi-BLAST. The method is applied to the analysis of the hormone-binding domains of the insulin and related growth-factor receptors.
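The control loop described in this abstract (search, check candidates by joining at a fixed similarity limit, lower the search cutoff whenever the family is stable) can be sketched as follows. This is only an illustration under strong assumptions: a single toy measure, `shared_positions`, stands in for both the pattern-search score and the MULTAL alignment similarity, and the names `grow_family`, `join_limit`, `start_cutoff` and `min_cutoff` are hypothetical.

```python
def shared_positions(a, b):
    """Toy similarity: count identical positions in two equal-length strings."""
    return sum(x == y for x, y in zip(a, b))


def grow_family(seed, databank, join_limit, start_cutoff, min_cutoff):
    """Grow a family from seed sequences with a fixed-limit joining check."""
    family = set(seed)
    cutoff = start_cutoff
    while cutoff >= min_cutoff:
        # Search phase: collect databank sequences scoring above the cutoff.
        candidates = {
            s for s in databank
            if s not in family
            and max(shared_positions(s, m) for m in family) >= cutoff
        }
        # Check phase: a candidate is kept only if it links, at the fixed
        # limit, to the seed family or to a candidate that already joined.
        joined = set(family)
        grew = True
        while grew:
            grew = False
            for s in candidates - joined:
                if any(shared_positions(s, m) >= join_limit for m in joined):
                    joined.add(s)
                    grew = True
        if joined == family:
            # Stable family: push the search towards more distant matches.
            cutoff -= 1
        else:
            family = joined
    return family


if __name__ == "__main__":
    bank = ["AAAAAAAC", "AAAAAACC", "AAAACCCC", "CCCCCCCC"]
    print(sorted(grow_family({"AAAAAAAA"}, bank, 7, 7, 5)))
```

Because the toy uses the same measure for searching and joining, lowering the cutoff here only admits candidates that the fixed limit then rejects. In the method described above the two phases use different measures, which is what makes the lowered cutoff productive.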
Affiliation(s)
- W R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, London, NW7 1AA, UK
|
7
|
Searls DB. Grand challenges in computational biology. COMPUTATIONAL METHODS IN MOLECULAR BIOLOGY 1998. [DOI: 10.1016/s0167-7306(08)60458-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 02/10/2023]
|
8
|
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Heidelberg, Germany
|
9
|
Affiliation(s)
- W R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, London, United Kingdom
|
10
|
Abstract
The length of an alignment of biological sequences is typically longer than the mean length of its component sequences. (This arises from the insertion of gaps in the alignment.) When such an alignment is used as a profile for the alignment of further sequences (or profiles), it will have a bias toward additional sequences that match the length of the profile, rather than the mean length of sequences in the profile, as the alignment of these will entail fewer (or smaller) insertions (so avoiding gap-penalties). An algorithm is described to correct this bias that entails monitoring the correspondence, for every pair of positions, of the mean separations in both profiles as they are aligned. The correction was incorporated into a standard dynamic programming algorithm through a modification of the gap-penalty, but, unlike other approaches, this modification is not local and takes into consideration the overall alignment of the sequences. This implies that the algorithm cannot guarantee to find the optimal alignment, but tests suggest that close approximations are obtained. The method was tested on protein families by measuring the area in the parameter space of the phase containing the correct multiple alignment. No improvement (increase in phase area) was found with a family that required few gaps to be aligned correctly. However, for highly gapped alignments, a 50% increase in area was obtained with one family and the correct alignment was found for another that could not be aligned with the unbiased method.
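The opening observation of this abstract, that an alignment is longer than the mean length of its component sequences, can be checked directly. The sketch below is not the correction algorithm itself; it only measures the two lengths, and the function name `length_bias` and the use of `-` as the gap character are assumptions.

```python
def length_bias(alignment):
    """Return (alignment length, mean ungapped sequence length).

    `alignment` is a list of equal-length rows using '-' for gaps.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one row")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same length")
    # Strip gaps to recover the true length of each component sequence.
    ungapped = [len(row.replace("-", "")) for row in alignment]
    return width, sum(ungapped) / len(ungapped)


if __name__ == "__main__":
    print(length_bias(["ACD-EFG", "AC--EFG", "ACDWEFG"]))
```

The gap between the two numbers is the quantity that biases a profile towards sequences matching its full length rather than the mean length of its members.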
Affiliation(s)
- W R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, London, UK
|
11
|
Hatrick K, Taylor WR. Sequence conservation and correlation measures in protein structure prediction. Comput Chem 1994; 18:245-9. [PMID: 16649265 DOI: 10.1016/0097-8485(94)85019-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 10/18/2022]
Abstract
The rapid elucidation of protein sequences has allowed multiple sequence alignments to be calculated for a wide variety of proteins. Such alignments reveal positions that exhibit amino acid conservation, either of specific chemical groups in active and binding sites or of the more chemically inert hydrophobic residues that contribute to the protein core. The latter can provide constraints on the position of the protein chain and any local periodicity can suggest the type of secondary structure. Conservation measures, however, cannot provide specific pairwise packing information (each conserved hydrophobic position might pack against any other). If correlated changes between positions were observed, then specific pairs of residues could be identified as interacting and therefore probably spatially adjacent. Most 'observations' of correlated changes have been anecdotal and, of the few systematic studies that have been made, most have mistakenly incorporated a strong bias towards selecting conserved positions. When the conservation effect is separated (as best as possible) then little correlation signal remains to help identify adjacent positions.
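A per-position conservation measure of the kind discussed in this abstract can be sketched in a few lines. The measure below (fraction of rows carrying the most common residue, with gaps counted against conservation) is an assumed, simple choice for illustration and is not the measure used by the authors.

```python
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    `alignment` is a list of equal-length rows using '-' for gaps.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column of gaps only carries no conservation signal.
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of rows so gaps lower the score.
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    print(column_conservation(["ACD", "ACE", "AGE", "A-F"]))
```

A fully conserved column scores 1.0 regardless of what any other column does, which is why a correlation measure that rewards agreement between columns can end up selecting conserved positions rather than genuinely co-varying ones.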
Affiliation(s)
- K Hatrick
- Laboratory of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, England
|
12
|
Abstract
Many methods exist for taking a sequence that exhibits similarity to another of known structure and building a molecular model. However, when the sequence similarity is very remote and fragmentary, this 'modelling-by-homology' approach is less reliable. Current methods that tackle this problem are reviewed below, taking as an example the construction of a predicted model for the retroviral protease. This earlier work, which was only partially automatic, identified many of the outstanding difficulties that have subsequently been automated in computer programs, developed both by the author and many others. Because of the rapid proliferation of methods and their variants, an exhaustive review of the literature has not been possible and the following survey concentrates on the developments of the author and colleagues to explain the basic methods.
Affiliation(s)
- W R Taylor
- Laboratory of Mathematical Biology, National Institute for Medical Research, London, UK
|
13
|
Abstract
Through the comprehensive analysis of protein sequence and structural data, relationships can be established that suggest, with varying degrees of success, structural models for a protein for which only the sequence is known. The certainty with which a model can be proposed depends on the degree of similarity between the sequence of unknown structure and the sequence of a protein of known structure. Methods are being developed to detect remote similarities between sequences or structures, and to predict protein structure based on such small levels of similarity.
Affiliation(s)
- W R Taylor
- Laboratory of Mathematical Biology, National Institute for Medical Research, London, UK
|
14
|
Gautier J, Solomon MJ, Booher RN, Bazan JF, Kirschner MW. cdc25 is a specific tyrosine phosphatase that directly activates p34cdc2. Cell 1991; 67:197-211. [PMID: 1913817 DOI: 10.1016/0092-8674(91)90583-k] [Citation(s) in RCA: 628] [Impact Index Per Article: 19.0] [Indexed: 12/29/2022]
Abstract
cdc25 controls the activity of the cyclin-p34cdc2 complex by regulating the state of tyrosine phosphorylation of p34cdc2. Drosophila cdc25 protein from two different expression systems activates inactive cyclin-p34cdc2 and induces M phase in Xenopus oocytes and egg extracts. We find that the cdc25 sequence shows weak but significant homology to a phylogenetically diverse group of protein tyrosine phosphatases. cdc25 itself is a very specific protein tyrosine phosphatase. Bacterially expressed cdc25 directly dephosphorylates bacterially expressed p34cdc2 on Tyr-15 in a minimal system devoid of eukaryotic cell components, but does not dephosphorylate other tyrosine-phosphorylated proteins at appreciable rates. In addition, mutations in the putative catalytic site abolish the in vivo activity of cdc25 and its phosphatase activity in vitro. Therefore, cdc25 is a specific protein phosphatase that dephosphorylates tyrosine and possibly threonine residues on p34cdc2 and regulates MPF activation.
Affiliation(s)
- J Gautier
- Department of Biochemistry and Biophysics, University of California, San Francisco 94143-0448
|
15
|
|