Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For:	Crooks GE, Brenner SE. An alternative model of amino acid replacement. Bioinformatics 2004;21:975-80. [PMID: 15531614 DOI: 10.1093/bioinformatics/bti109] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Number

Cited by Other Article(s)

Jia K, Kilinc M, Jernigan RL. New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions. FRONTIERS IN BIOINFORMATICS 2023;3:1227193. [PMID: 37900964 PMCID: PMC10602800 DOI: 10.3389/fbinf.2023.1227193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2023] [Accepted: 08/14/2023] [Indexed: 10/31/2023] Open

The Structure of Evolutionary Model Space for Proteins across the Tree of Life. BIOLOGY 2023;12:biology12020282. [PMID: 36829559 PMCID: PMC9952988 DOI: 10.3390/biology12020282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Revised: 02/04/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023]

Abstract

The factors that determine the relative rates of amino acid substitution during protein evolution are complex and known to vary among taxa. We estimated relative exchangeabilities for pairs of amino acids from clades spread across the tree of life and assessed the historical signal in the distances among these clade-specific models. We separately trained these models on collections of arbitrarily selected protein alignments and on ribosomal protein alignments. In both cases, we found a clear separation between the models trained using multiple sequence alignments from bacterial clades and the models trained on archaeal and eukaryotic data. We assessed the predictive power of our novel clade-specific models of sequence evolution by asking whether fit to the models could be used to identify the source of multiple sequence alignments. Model fit was generally able to correctly classify protein alignments at the level of domain (bacterial versus archaeal), but the accuracy of classification at finer scales was much lower. The only exceptions to this were the relatively high classification accuracy for two archaeal lineages: Halobacteriaceae and Thermoprotei. Genomic GC content had a modest impact on relative exchangeabilities despite having a large impact on amino acid frequencies. Relative exchangeabilities involving aromatic residues exhibited the largest differences among models. There were a small number of exchangeabilities that exhibited large differences in comparisons among major clades and between generalized models and ribosomal protein models. Taken as a whole, these results reveal that a small number of relative exchangeabilities are responsible for much of the structure of the "model space" for protein sequence evolution. The clade-specific models we generated may be useful tools for protein phylogenetics, and the structure of evolutionary model space that they revealed has implications for phylogenomic inference across the tree of life.

Collapse

Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020;21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022] Open

Abstract

Background

The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of different background matrices have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins.

Results

We tested both the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum); structural alignment based matrices, contact energy matrix, and matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect the divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependences of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and the evolutionary distance to which the substitution matrices correspond. Optimization of a combination of matrices and the penalty parameters was carried out for local and global alignment on the values of penalty function parameters.

Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true.

Conclusions

This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent in matrices not only of evolutionary origin, but also of another background corresponding to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close enough to its value for evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.

Collapse

Pandey A, Braun EL. Phylogenetic Analyses of Sites in Different Protein Structural Environments Result in Distinct Placements of the Metazoan Root. BIOLOGY 2020;9:E64. [PMID: 32231097 PMCID: PMC7235752 DOI: 10.3390/biology9040064] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/24/2019] [Revised: 03/09/2020] [Accepted: 03/20/2020] [Indexed: 12/23/2022]

Abstract

Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life; this could reflect, at least in part, the poor-fit of the models used to analyze heterogeneous datasets. Some of the heterogeneity may reflect the different patterns of selection on proteins based on their structures. To test that hypothesis, we developed a pipeline to divide phylogenomic protein datasets into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had distinct signals for the topology of the deepest branches in the metazoan tree. We focused on a dataset that appeared to have a mixture of signals and we found that the most striking difference in phylogenetic signal reflected relative solvent accessibility. Analyses of exposed sites (residues located on the surface of proteins) yielded a tree that placed ctenophores sister to all other animals whereas sites buried inside proteins yielded a tree with a sponge+ctenophore clade. These differences in phylogenetic signal were not ameliorated when we conducted analyses using a set of maximum-likelihood profile mixture models. These models are very similar to the Bayesian CAT model, which has been used in many analyses of deep metazoan phylogeny. In contrast, analyses conducted after recoding amino acids to limit the impact of deviations from compositional stationarity increased the congruence in the estimates of phylogeny for exposed and buried sites; after recoding amino acid trees estimated using the exposed and buried site both supported placement of ctenophores sister to all other animals. Although the central conclusion of our analyses is that sites in different structural environments yield distinct trees when analyzed using models of protein evolution, our amino acid recoding analyses also have implications for metazoan evolution. Specifically, our results add to the evidence that ctenophores are the sister group of all other animals and they further suggest that the placozoa+cnidaria clade found in some other studies deserves more attention. Taken as a whole, these results provide striking evidence that it is necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.

Collapse

Xie J, Trus I, Oh D, Kvisgaard LK, Rappe JCF, Ruggli N, Vanderheijden N, Larsen LE, Lefèvre F, Nauwynck HJ. A Triple Amino Acid Substitution at Position 88/94/95 in Glycoprotein GP2a of Type 1 Porcine Reproductive and Respiratory Syndrome Virus (PRRSV1) Is Responsible for Adaptation to MARC-145 Cells. Viruses 2019;11:v11010036. [PMID: 30626009 PMCID: PMC6356402 DOI: 10.3390/v11010036] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2018] [Revised: 12/31/2018] [Accepted: 01/03/2019] [Indexed: 02/05/2023] Open

Casadio R, Vassura M, Tiwari S, Fariselli P, Luigi Martelli P. Correlating disease-related mutations to their effect on protein stability: a large-scale analysis of the human proteome. Hum Mutat 2011;32:1161-70. [PMID: 21853506 DOI: 10.1002/humu.21555] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2011] [Accepted: 06/03/2011] [Indexed: 11/08/2022]

Kim JW, Choi HS, Kwon MG, Park MA, Hwang JY, Kim DH, Park CI. Molecular identification and expression analysis of a natural killer cell enhancing factor (NKEF) from rock bream Oplegnathus fasciatus and the biological activity of its recombinant protein. RESULTS IN IMMUNOLOGY 2011;1:45-52. [PMID: 24371552 DOI: 10.1016/j.rinim.2011.08.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2011] [Revised: 08/05/2011] [Accepted: 08/22/2011] [Indexed: 01/17/2023]

Pokarowski P, Kloczkowski A, Nowakowski S, Pokarowska M, Jernigan RL, Kolinski A. Ideal amino acid exchange forms for approximating substitution matrices. Proteins 2009;69:379-93. [PMID: 17623859 DOI: 10.1002/prot.21509] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]

Le SQ, Lartillot N, Gascuel O. Phylogenetic mixture models for proteins. Philos Trans R Soc Lond B Biol Sci 2009;363:3965-76. [PMID: 18852096 DOI: 10.1098/rstb.2008.0180] [Citation(s) in RCA: 142] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Abstract

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores in maximum-likelihood framework phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution.We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE.We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.

Collapse

Stojmirović A, Yu YK. Geometric aspects of biological sequence comparison. J Comput Biol 2009;16:579-610. [PMID: 19361329 DOI: 10.1089/cmb.2008.0100] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Quang LS, Gascuel O, Lartillot N. Empirical profile mixture models for phylogenetic reconstruction. ACTA ACUST UNITED AC 2008;24:2317-23. [PMID: 18718941 DOI: 10.1093/bioinformatics/btn445] [Citation(s) in RCA: 207] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]

Blanquart S, Lartillot N. A site- and time-heterogeneous model of amino acid replacement. Mol Biol Evol 2008;25:842-58. [PMID: 18234708 DOI: 10.1093/molbev/msn018] [Citation(s) in RCA: 166] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Abstract

We combined the category (CAT) mixture model (Lartillot N, Philippe H. 2004) and the nonstationary break point (BP) model (Blanquart S, Lartillot N. 2006) into a new model, CAT-BP, accounting for variations of the evolutionary process both along the sequence and across lineages. As in CAT, the model implements a mixture of distinct Markovian processes of substitution distributed among sites, thus accommodating site-specific selective constraints induced by protein structure and function. Furthermore, as in BP, these processes are nonstationary, and their equilibrium frequencies are allowed to change along lineages in a correlated way, through discrete shifts in global amino acid composition distributed along the phylogenetic tree. We implemented the CAT-BP model in a Bayesian Markov Chain Monte Carlo framework and compared its predictions with those of 3 simpler models, BP, CAT, and the site- and time-homogeneous general time-reversible (GTR) model, on a concatenation of 4 mitochondrial proteins of 20 arthropod species. In contrast to GTR, BP, and CAT, which all display a phylogenetic reconstruction artifact positioning the bees Apis mellifera and Melipona bicolor among chelicerates, the CAT-BP model is able to recover the monophyly of insects. Using posterior predictive tests, we further show that the CAT-BP combination yields better anticipations of site- and taxon-specific amino acid frequencies and that it better accounts for the homoplasies that are responsible for the artifact. Altogether, our results show that the joint modeling of heterogeneities across sites and along time results in a synergistic improvement of the phylogenetic inference, indicating that it is essential to disentangle the combined effects of both sources of heterogeneity, in order to overcome systematic errors in protein phylogenetic analyses.

Collapse

Miller GK, Fridell SL. A Forgotten Discrete Distribution? Reviving the Negative Hypergeometric Model. AM STAT 2007. [DOI: 10.1198/000313007x245140] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Lartillot N, Brinkmann H, Philippe H. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 2007;7 Suppl 1:S4. [PMID: 17288577 PMCID: PMC1796613 DOI: 10.1186/1471-2148-7-s1-s4] [Citation(s) in RCA: 416] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Thanks to the large amount of signal contained in genome-wide sequence alignments, phylogenomic analyses are converging towards highly supported trees. However, high statistical support does not imply that the tree is accurate. Systematic errors, such as the Long Branch Attraction (LBA) artefact, can be misleading, in particular when the taxon sampling is poor, or the outgroup is distant. In an otherwise consistent probabilistic framework, systematic errors in genome-wide analyses can be traced back to model mis-specification problems, which suggests that better models of sequence evolution should be devised, that would be more robust to tree reconstruction artefacts, even under the most challenging conditions.

METHODS

We focus on a well characterized LBA artefact analyzed in a previous phylogenomic study of the metazoan tree, in which two fast-evolving animal phyla, nematodes and platyhelminths, emerge either at the base of all other Bilateria, or within protostomes, depending on the outgroup. We use this artefactual result as a case study for comparing the robustness of two alternative models: a standard, site-homogeneous model, based on an empirical matrix of amino-acid replacement (WAG), and a site-heterogeneous mixture model (CAT). In parallel, we propose a posterior predictive test, allowing one to measure how well a model acknowledges sequence saturation.

RESULTS

Adopting a Bayesian framework, we show that the LBA artefact observed under WAG disappears when the site-heterogeneous model CAT is used. Using cross-validation, we further demonstrate that CAT has a better statistical fit than WAG on this data set. Finally, using our statistical goodness-of-fit test, we show that CAT, but not WAG, correctly accounts for the overall level of saturation, and that this is due to a better estimation of site-specific amino-acid preferences.

CONCLUSION

The CAT model appears to be more robust than WAG against LBA artefacts, essentially because it correctly anticipates the high probability of convergences and reversions implied by the small effective size of the amino-acid alphabet at each site of the alignment. More generally, our results provide strong evidence that site-specificities in the substitution process need be accounted for in order to obtain more reliable phylogenetic trees.

Collapse

Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schäffer AA, Yu YK. Protein database searches using compositionally adjusted substitution matrices. FEBS J 2005;272:5101-9. [PMID: 16218944 PMCID: PMC1343503 DOI: 10.1111/j.1742-4658.2005.04945.x] [Citation(s) in RCA: 740] [Impact Index Per Article: 38.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

Crooks GE, Green RE, Brenner SE. Pairwise alignment incorporating dipeptide covariation. Bioinformatics 2005;21:3704-10. [PMID: 16123116 DOI: 10.1093/bioinformatics/bti616] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open