1
|
Wilburn GW, Eddy SR. Remote homology search with hidden Potts models. PLoS Comput Biol 2020; 16:e1008085. [PMID: 33253143 PMCID: PMC7728182 DOI: 10.1371/journal.pcbi.1008085] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2020] [Revised: 12/10/2020] [Accepted: 10/27/2020] [Indexed: 12/03/2022] Open
Abstract
Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.
Collapse
Affiliation(s)
- Grey W. Wilburn
- Department of Physics, Harvard University, Cambridge, Massachusetts, United States of America
| | - Sean R. Eddy
- Howard Hughes Medical Institute, Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts, United States of America
- John A Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America
| |
Collapse
|
2
|
Daniels NM, Gallant A, Ramsey N, Cowen LJ. MRFy: Remote Homology Detection for Beta-Structural Proteins Using Markov Random Fields and Stochastic Search. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:4-16. [PMID: 26357074 DOI: 10.1109/tcbb.2014.2344682] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
We introduce MRFy, a tool for protein remote homology detection that captures beta-strand dependencies in the Markov random field. Over a set of 11 SCOP beta-structural superfamilies, MRFy shows a 14 percent improvement in mean Area Under the Curve for the motif recognition problem as compared to HMMER, 25 percent improvement as compared to RAPTOR, 14 percent improvement as compared to HHPred, and a 18 percent improvement as compared to CNFPred and RaptorX. MRFy was implemented in the Haskell functional programming language, and parallelizes well on multi-core systems. MRFy is available, as source code as well as an executable, from http://mrfy.cs.tufts.edu/.
Collapse
|
3
|
Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014; 53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]
|
4
|
Daniels NM, Hosur R, Berger B, Cowen LJ. SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone. Bioinformatics 2012; 28:1216-22. [PMID: 22408192 PMCID: PMC3338012 DOI: 10.1093/bioinformatics/bts110] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models (HMMs). However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta sheets. These dependencies have been partially captured in the HMM setting by simulated evolution in the training phase and can be fully captured by Markov random fields (MRFs). However, the MRFs can be computationally prohibitive when beta strands are interleaved in complex topologies. We introduce SMURFLite, a method that combines both simplified MRFs and simulated evolution to substantially improve remote homology detection for beta structures. Unlike previous MRF-based methods, SMURFLite is computationally feasible on any beta-structural motif. Results: We test SMURFLite on all propeller and barrel folds in the mainly-beta class of the SCOP hierarchy in stringent cross-validation experiments. We show a mean 26% (median 16%) improvement in area under curve (AUC) for beta-structural motif recognition as compared with HMMER (a well-known HMM method) and a mean 33% (median 19%) improvement as compared with RAPTOR (a well-known threading method) and even a mean 18% (median 10%) improvement in AUC over HHPred (a profile–profile HMM method), despite HHpred's use of extensive additional training data. We demonstrate SMURFLite's ability to scale to whole genomes by running a SMURFLite library of 207 beta-structural SCOP superfamilies against the entire genome of Thermotoga maritima, and make over a 100 new fold predictions. Availability and implementaion: A webserver that runs SMURFLite is available at: http://smurf.cs.tufts.edu/smurflite/ Contact:lenore.cowen@tufts.edu; bab@mit.edu
Collapse
Affiliation(s)
- Noah M Daniels
- Department of Computer Science, Tufts University, Medford, MA 02155, USA
| | | | | | | |
Collapse
|
5
|
Sreekumar J, ter Braak CJF, van Ham RCHJ, van Dijk ADJ. Correlated mutations via regularized multinomial regression. BMC Bioinformatics 2011; 12:444. [PMID: 22082126 PMCID: PMC3247924 DOI: 10.1186/1471-2105-12-444] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2011] [Accepted: 11/14/2011] [Indexed: 11/13/2022] Open
Abstract
Background In addition to sequence conservation, protein multiple sequence alignments contain evolutionary signal in the form of correlated variation among amino acid positions. This signal indicates positions in the sequence that influence each other, and can be applied for the prediction of intra- or intermolecular contacts. Although various approaches exist for the detection of such correlated mutations, in general these methods utilize only pairwise correlations. Hence, they tend to conflate direct and indirect dependencies. Results We propose RMRCM, a method for Regularized Multinomial Regression in order to obtain Correlated Mutations from protein multiple sequence alignments. Importantly, our method is not restricted to pairwise (column-column) comparisons only, but takes into account the network nature of relationships between protein residues in order to predict residue-residue contacts. The use of regularization ensures that the number of predicted links between columns in the multiple sequence alignment remains limited, preventing overprediction. Using simulated datasets we analyzed the performance of our approach in predicting residue-residue contacts, and studied how it is influenced by various types of noise. For various biological datasets, validation with protein structure data indicates a good performance of the proposed algorithm for the prediction of residue-residue contacts, in comparison to previous results. RMRCM can also be applied to predict interactions (in addition to only predicting interaction sites or contact sites), as demonstrated by predicting PDZ-peptide interactions. Conclusions A novel method is presented, which uses regularized multinomial regression in order to obtain correlated mutations from protein multiple sequence alignments. Availability R-code of our implementation is available via http://www.ab.wur.nl/rmrcm
Collapse
Affiliation(s)
- Janardanan Sreekumar
- Central Tuber Crops Research Institute, Thiruvananthapuram-695017, Kerala, India
| | | | | | | |
Collapse
|
6
|
Abstract
This article investigates aspects of pairwise and multiple structure comparison, and the problem of automatically discover common patterns in a set of structures. Descriptions and representation of structures and patterns are described, as well as scoring and algorithms for comparison and discovery. A framework and nomenclature is developed for classifying different methods, and many of these are reviewed and placed into this framework.
Collapse
Affiliation(s)
- I Eidhammer
- Department of Informatics, University of Bergen, Høyteknologisentret, N-5020 Bergen, Norway.
| | | | | |
Collapse
|
7
|
Bieńkowska JR, Rogers RG, Smith TF. Performance of threading scoring functions designed using new optimization method. J Comput Biol 2001; 6:299-311. [PMID: 10582568 DOI: 10.1089/106652799318283] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We present a new procedure for optimization of a threading scoring function. A scoring function is usually formulated in terms of the structural environment states that describe the protein fold model. We propose a method for the optimal selection of those structural environment states that naturally follows from the probabilistic description of the threading problem and is done prior to threading experiments. We demonstrate the selection of the optimal structural environment states for the solvent exposure of the amino acid position, and present the results of threading experiments performed using scoring functions designed with and without the optimization of the structural environment states. These results confirm that the optimal scoring function predicts the sequence-to-structure alignments most accurately. Threading experiments performed with 15 optimally designed scoring functions show that the correlation coefficient between the information content of the amino acid distribution that determines the scoring function and the accuracy of the optimal sequence-to-structure alignment is 0.94.
Collapse
Affiliation(s)
- J R Bieńkowska
- BioMolecular Engineering Research Center, College of Engineering, Boston University, Massachusetts 02215, USA.
| | | | | |
Collapse
|
8
|
Smith TF, Lo Conte L, Bienkowska J, Gaitatzes C, Rogers RG, Lathrop R. Current limitations to protein threading approaches. J Comput Biol 2001; 4:217-25. [PMID: 9278056 DOI: 10.1089/cmb.1997.4.217] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
A short review of the threading approach to protein structure prediction, including presentation of some open statistical problems. Also discussed is one of the likely sources of the current limited success, that being the form of the pairwise potentials used in most threading approaches.
Collapse
Affiliation(s)
- T F Smith
- BioMolecular Engineering Research Center, College of Engineering, Boston University, Massachusetts 02215, USA.
| | | | | | | | | | | |
Collapse
|
9
|
Lathrop RH. An anytime local-to-global optimization algorithm for protein threading in theta (m2ñ2) space. J Comput Biol 1999; 6:405-18. [PMID: 10582575 DOI: 10.1089/106652799318355] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
This paper describes a novel anytime branch-and-bound or best-first threading search algorithm for gapped block protein sequence-structure alignment with general sequence residue pair interactions. The new algorithm (1) returns a good approximate answer quickly, (2) iteratively improves that answer to the global optimum if allowed more time, (3) eventually produces a proof that the final answer found is indeed the global optimum, and (4) always terminates correctly within a bounded number of steps if allowed sufficient space and time. It runs in polynomial space, which is asymptotically dominated by the theta(m2ñ2) space required by the lower bound computation. Using previously published data sets and the Bryant-Lawrence (1993) objective function, the algorithm found the true (proven) global optimum in less than 5 min in all search spaces size 10(25) or smaller (sequences to 478 residues), and a putative (not guaranteed) optimum in less than 5 hr in all search spaces size 10(60) or smaller (sequences to 793 residues, cores to 42 secondary structure segments). The threading in the largest case studied was eventually proven to be globally optimal; the corresponding search speed in that case was the equivalent of 1.5 x 10(56) threadings/sec, a speed-up exceeding 10(25) over previously published batch branch-and-bound speeds, and exceeding 10(50) over previously published exhaustive search speeds, using the same objective function and threading paradigm. Implementation-independent measures of search efficiency are defined for equivalent branching factor, depth, and probability of success per draw; empirical data on these measures are given. The general approach should apply to other alignment methodologies and search methods that use a divide-and-conquer strategy.
Collapse
Affiliation(s)
- R H Lathrop
- Department of Information and Computer Science, University of California, Irvine 92697, USA.
| |
Collapse
|
10
|
|
11
|
Lathrop RH, Rogers RG, White JV, Gaitatzes C, Smith TF, Bienkowska J, Bryant BK, Buturović LJ, Nambudripad R. Analysis and algorithms for protein sequence–structure alignment. COMPUTATIONAL METHODS IN MOLECULAR BIOLOGY 1998. [DOI: 10.1016/s0167-7306(08)60469-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
|
12
|
Stultz CM, Nambudripad R, Lathrop RH, White JV. Predicting Protein Structure With Probabilistic Models. ADVANCES IN MOLECULAR AND CELL BIOLOGY 1997. [DOI: 10.1016/s1569-2558(08)60483-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
|
13
|
Abstract
'Profiles' of protein structures and sequence alignments can detect subtle homologies. Profile analysis has been put on firmer mathematical ground by the introduction of hidden Markov model (HMM) methods. During the past year, applications of these powerful new HMM-based profiles have begun to appear in the fields of protein-structure prediction and large-scale genome-sequence analysis.
Collapse
Affiliation(s)
- S R Eddy
- Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA.
| |
Collapse
|
14
|
Abstract
In the past years, much effort has been put on the development of new methodologies and algorithms for the prediction of protein secondary and tertiary structures from (sequence) data; this is reviewed in detail. New approaches for these predictions such as neural network methods, genetic algorithms, machine learning, and graph theoretical methods are discussed. Secondary structure prediction algorithms were improved mostly by considering families of related proteins; however, for the reliable tertiary structure modeling of proteins, knowledge-based techniques are still preferred. Methods and examples with more or less successful results are described. Also, programs and parameterizations for energy minimisations, molecular dynamics, and electrostatic interactions have been improved, especially with respect to their former limits of applicability. Other topics discussed in this review include the use of traditional and on-line databases, the docking problem and surface properties of biomolecules, packing of protein cores, de novo design and protein engineering, prediction of membrane protein structures, the verification and reliability of model structures, and progress made with currently available software and computer hardware. In summary, the prediction of the structure, function, and other properties of a protein is still possible only within limits, but these limits continue to be moved.
Collapse
Affiliation(s)
- G Böhm
- Institut für Biotechnologie, Martin-Luther-Universität Halle-Wittenberg, Germany
| |
Collapse
|