1
|
Jia K, Kilinc M, Jernigan RL. New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions. FRONTIERS IN BIOINFORMATICS 2023; 3:1227193. [PMID: 37900964 PMCID: PMC10602800 DOI: 10.3389/fbinf.2023.1227193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2023] [Accepted: 08/14/2023] [Indexed: 10/31/2023] Open
Abstract
Understanding protein sequences and how they relate to the functions of proteins is extremely important. One of the most basic operations in bioinformatics is sequence alignment and usually the first things learned from these are which positions are the most conserved and often these are critical parts of the structure, such as enzyme active site residues. In addition, the contact pairs in a protein usually correspond closely to the correlations between residue positions in the multiple sequence alignment, and these usually change in a systematic and coordinated way, if one position changes then the other member of the pair also changes to compensate. In the present work, these correlated pairs are taken as anchor points for a new type of sequence alignment. The main advantage of the method here is its combining the remote homolog detection from our method PROST with pairwise sequence substitutions in the rigorous method from Kleinjung et al. We show a few examples of some resulting sequence alignments, and how they can lead to improvements in alignments for function, even for a disordered protein.
Collapse
Affiliation(s)
- Kejue Jia
- Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
| | - Mesih Kilinc
- Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
| | - Robert L. Jernigan
- Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
| |
Collapse
|
2
|
The Structure of Evolutionary Model Space for Proteins across the Tree of Life. BIOLOGY 2023; 12:biology12020282. [PMID: 36829559 PMCID: PMC9952988 DOI: 10.3390/biology12020282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Revised: 02/04/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023]
Abstract
The factors that determine the relative rates of amino acid substitution during protein evolution are complex and known to vary among taxa. We estimated relative exchangeabilities for pairs of amino acids from clades spread across the tree of life and assessed the historical signal in the distances among these clade-specific models. We separately trained these models on collections of arbitrarily selected protein alignments and on ribosomal protein alignments. In both cases, we found a clear separation between the models trained using multiple sequence alignments from bacterial clades and the models trained on archaeal and eukaryotic data. We assessed the predictive power of our novel clade-specific models of sequence evolution by asking whether fit to the models could be used to identify the source of multiple sequence alignments. Model fit was generally able to correctly classify protein alignments at the level of domain (bacterial versus archaeal), but the accuracy of classification at finer scales was much lower. The only exceptions to this were the relatively high classification accuracy for two archaeal lineages: Halobacteriaceae and Thermoprotei. Genomic GC content had a modest impact on relative exchangeabilities despite having a large impact on amino acid frequencies. Relative exchangeabilities involving aromatic residues exhibited the largest differences among models. There were a small number of exchangeabilities that exhibited large differences in comparisons among major clades and between generalized models and ribosomal protein models. Taken as a whole, these results reveal that a small number of relative exchangeabilities are responsible for much of the structure of the "model space" for protein sequence evolution. The clade-specific models we generated may be useful tools for protein phylogenetics, and the structure of evolutionary model space that they revealed has implications for phylogenomic inference across the tree of life.
Collapse
|
3
|
Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020; 21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022] Open
Abstract
Background The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of different background matrices have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins. Results We tested both the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum); structural alignment based matrices, contact energy matrix, and matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect the divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependences of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and the evolutionary distance to which the substitution matrices correspond. Optimization of a combination of matrices and the penalty parameters was carried out for local and global alignment on the values of penalty function parameters. Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true. Conclusions This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent in matrices not only of evolutionary origin, but also of another background corresponding to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close enough to its value for evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.
Collapse
|
4
|
Pandey A, Braun EL. Phylogenetic Analyses of Sites in Different Protein Structural Environments Result in Distinct Placements of the Metazoan Root. BIOLOGY 2020; 9:E64. [PMID: 32231097 PMCID: PMC7235752 DOI: 10.3390/biology9040064] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/24/2019] [Revised: 03/09/2020] [Accepted: 03/20/2020] [Indexed: 12/23/2022]
Abstract
Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life; this could reflect, at least in part, the poor-fit of the models used to analyze heterogeneous datasets. Some of the heterogeneity may reflect the different patterns of selection on proteins based on their structures. To test that hypothesis, we developed a pipeline to divide phylogenomic protein datasets into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had distinct signals for the topology of the deepest branches in the metazoan tree. We focused on a dataset that appeared to have a mixture of signals and we found that the most striking difference in phylogenetic signal reflected relative solvent accessibility. Analyses of exposed sites (residues located on the surface of proteins) yielded a tree that placed ctenophores sister to all other animals whereas sites buried inside proteins yielded a tree with a sponge+ctenophore clade. These differences in phylogenetic signal were not ameliorated when we conducted analyses using a set of maximum-likelihood profile mixture models. These models are very similar to the Bayesian CAT model, which has been used in many analyses of deep metazoan phylogeny. In contrast, analyses conducted after recoding amino acids to limit the impact of deviations from compositional stationarity increased the congruence in the estimates of phylogeny for exposed and buried sites; after recoding amino acid trees estimated using the exposed and buried site both supported placement of ctenophores sister to all other animals. Although the central conclusion of our analyses is that sites in different structural environments yield distinct trees when analyzed using models of protein evolution, our amino acid recoding analyses also have implications for metazoan evolution. Specifically, our results add to the evidence that ctenophores are the sister group of all other animals and they further suggest that the placozoa+cnidaria clade found in some other studies deserves more attention. Taken as a whole, these results provide striking evidence that it is necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.
Collapse
Affiliation(s)
- Akanksha Pandey
- Department of Biology, University of Florida, Gainesville, FL 32611, USA;
| | - Edward L. Braun
- Department of Biology, University of Florida, Gainesville, FL 32611, USA;
- Genetics Institute, University of Florida, Gainesville, FL 32611, USA
| |
Collapse
|
5
|
Xie J, Trus I, Oh D, Kvisgaard LK, Rappe JCF, Ruggli N, Vanderheijden N, Larsen LE, Lefèvre F, Nauwynck HJ. A Triple Amino Acid Substitution at Position 88/94/95 in Glycoprotein GP2a of Type 1 Porcine Reproductive and Respiratory Syndrome Virus (PRRSV1) Is Responsible for Adaptation to MARC-145 Cells. Viruses 2019; 11:v11010036. [PMID: 30626009 PMCID: PMC6356402 DOI: 10.3390/v11010036] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2018] [Revised: 12/31/2018] [Accepted: 01/03/2019] [Indexed: 02/05/2023] Open
Abstract
The Meat Animal Research Center-145 (MARC-145) cell line has been proven to be valuable for viral attenuation regarding vaccine development and production. Cell-adaptation is necessary for the efficient replication of porcine reproductive and respiratory syndrome virus (PRRSV) in these cells. Multiple sequence analysis revealed consistent amino acid substitutions in GP2a (V88F, M94I, F95L) of MARC-145 cell-adapted strains. To investigate the putative effect of these substitutions, mutations at either position 88, 94, 95, and their combinations were introduced into two PRRSV1 (13V091 and IVI-1173) infectious clones followed by the recovery of viable recombinants. When comparing the replication kinetics in MARC-145 cells, a strongly positive effect on the growth characteristics of the 13V091 strain (+2.1 log10) and the IVI-1173 strain (+1.7 log10) compared to wild-type (WT) virus was only observed upon triple amino acid substitution at positions 88 (V88F), 94 (M94I), and 95 (F95L) of GP2a, suggesting that the triple mutation is a determining factor in PRRSV1 adaptation to MARC-145 cells.
Collapse
Affiliation(s)
- Jiexiong Xie
- Laboratory of Virology, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, B-9820 Merelbeke, Belgium.
| | - Ivan Trus
- Laboratory of Virology, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, B-9820 Merelbeke, Belgium.
| | - Dayoung Oh
- Laboratory of Virology, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, B-9820 Merelbeke, Belgium.
| | - Lise K Kvisgaard
- National Veterinary Institute, Technical University of Denmark, 2800 Lyngby, Denmark.
| | - Julie C F Rappe
- The Institute of Virology and Immunology IVI, 3147 Mittelhäusern and Bern, Switzerland.
- Department of Infectious Diseases and Pathobiology, University of Bern, 3012 Bern, Switzerland.
- Graduate School for Cellular and Biomedical Sciences, University of Bern, 3012 Bern, Switzerland.
| | - Nicolas Ruggli
- The Institute of Virology and Immunology IVI, 3147 Mittelhäusern and Bern, Switzerland.
- Department of Infectious Diseases and Pathobiology, University of Bern, 3012 Bern, Switzerland.
| | - Nathalie Vanderheijden
- Laboratory of Virology, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, B-9820 Merelbeke, Belgium.
| | - Lars E Larsen
- National Veterinary Institute, Technical University of Denmark, 2800 Lyngby, Denmark.
| | - François Lefèvre
- INRA, Molecular Immunology and Virology Unit, 78350 Jouy-en-Josas, France.
| | - Hans J Nauwynck
- Laboratory of Virology, Faculty of Veterinary Medicine, Ghent University, Salisburylaan 133, B-9820 Merelbeke, Belgium.
| |
Collapse
|
6
|
Casadio R, Vassura M, Tiwari S, Fariselli P, Luigi Martelli P. Correlating disease-related mutations to their effect on protein stability: a large-scale analysis of the human proteome. Hum Mutat 2011; 32:1161-70. [PMID: 21853506 DOI: 10.1002/humu.21555] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2011] [Accepted: 06/03/2011] [Indexed: 11/08/2022]
Abstract
Single residue mutations in proteins are known to affect protein stability and function. As a consequence, they can be disease associated. Available computational methods starting from protein sequence/structure can predict whether a mutated residue is or not disease associated and whether it is promoting instability of the protein-folded structure. However, the relationship among stability changes in proteins and their involvement in human diseases still needs to be fully exploited. Here, we try to rationalize in a nutshell the complexity of the question by generalizing over information already stored in public databases. For each single aminoacid polymorphysm (SAP) type, we derive the probability of being disease-related (Pd) and compute from thermodynamic data three indexes indicating the probability of decreasing (P-), increasing (P+), and perturbing the protein structure stability (Pp). Statistically validated analysis of the different P/Pd correlations indicate that Pd best correlates with Pp. Pp/Pd correlation values are as high as 0.49, and increase up to 0.67 when data variability is taken into consideration. This is indicative of a medium/good correlation among Pd and Pp and corroborates the assumption that protein stability changes can also be disease associated at the proteome level.
Collapse
Affiliation(s)
- Rita Casadio
- Laboratory of Biocomputing, Giorgio Prodi Center/CIRB/Department of Biology, University of Bologna, Bologna, Italy.
| | | | | | | | | |
Collapse
|
7
|
Kim JW, Choi HS, Kwon MG, Park MA, Hwang JY, Kim DH, Park CI. Molecular identification and expression analysis of a natural killer cell enhancing factor (NKEF) from rock bream Oplegnathus fasciatus and the biological activity of its recombinant protein. RESULTS IN IMMUNOLOGY 2011; 1:45-52. [PMID: 24371552 DOI: 10.1016/j.rinim.2011.08.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2011] [Revised: 08/05/2011] [Accepted: 08/22/2011] [Indexed: 01/17/2023]
Abstract
Natural killer cell enhancing factor (NKEF) belongs to the defined peroxiredoxin (Prx) family. Rock bream NKEF cDNA was identified by expressed sequence tag (EST) analysis of rock bream liver that was stimulated with the LPS. The full-length RbNKEF cDNA (1062 bp) contained an open reading frame (ORF) of 594 bp encoding 198 amino acids. RbNKEF was significantly expressed in the gill, liver, and intestine. mRNA expression of NKEF in the head kidney was examined under viral and bacterial challenge via real-time RT-PCR. Experimental challenge of rock bream with Edwardsiella tarda, Streptococcus iniae, and RSIV resulted in significant increases in RbNKEF mRNA in the head kidney. To obtain a recombinant NKEF, the RbNKEF ORF was expressed in Escherichia coli BL21 (DE3), and the purified soluble protein exhibited a single band corresponding to the predicted molecular mass. When kidney leucocytes were treated with a high concentration of rRbNKEF (10 μg/mL), they exhibited significantly enhanced cell proliferation and viability under oxidative stress.
Collapse
Affiliation(s)
- Ju-Won Kim
- Department of Marine Biology & Aquaculture, Institute of Marine Industry, College of Marine Science, Gyeongsang National University, 455, Tongyeong 650-160, Republic of Korea
| | - Hye-Sung Choi
- Pathology Division, National Fisheries Research and Development Institute, Busan 619-900, Republic of Korea
| | - Mun-Gyeong Kwon
- Pathology Division, National Fisheries Research and Development Institute, Busan 619-900, Republic of Korea
| | - Myoung-Ae Park
- Pathology Division, National Fisheries Research and Development Institute, Busan 619-900, Republic of Korea
| | - Jee-Youn Hwang
- Pathology Division, National Fisheries Research and Development Institute, Busan 619-900, Republic of Korea
| | - Do-Hyung Kim
- Department of Aqualife Medicine, Chonnam National University, San 96-1, Dun-Duk Dong, Yeosu, Republic of Korea ; Fish Health Center, Chonnam National University, San 96-1, Dun-Duk Dong, Yeosu, Republic of Korea
| | - Chan-Il Park
- Department of Marine Biology & Aquaculture, Institute of Marine Industry, College of Marine Science, Gyeongsang National University, 455, Tongyeong 650-160, Republic of Korea
| |
Collapse
|
8
|
Pokarowski P, Kloczkowski A, Nowakowski S, Pokarowska M, Jernigan RL, Kolinski A. Ideal amino acid exchange forms for approximating substitution matrices. Proteins 2009; 69:379-93. [PMID: 17623859 DOI: 10.1002/prot.21509] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
We have analyzed 29 published substitution matrices (SMs) and five statistical protein contact potentials (CPs) for comparison. We find that popular, 'classical' SMs obtained mainly from sequence alignments of globular proteins are mostly correlated by at least a value of 0.9. The BLOSUM62 is the central element of this group. A second group includes SMs derived from alignments of remote homologs or transmembrane proteins. These matrices correlate better with classical SMs (0.8) than among themselves (0.7). A third group consists of intermediate links between SMs and CPs - matrices and potentials that exhibit mutual correlations of at least 0.8. Next, we show that SMs can be approximated with a correlation of 0.9 by expressions c(0) + x(i)x(j) + y(i)y(j) + z(i)z(j), 1<or= i, j <or= 20, where c(0) is a constant and the vectors (x(i)), (y(i)), (z(i)) correlate highly with hydrophobicity, molecular volume and coil preferences of amino acids, respectively. The present paper is the continuation of our work (Pokarowski et al., Proteins 2005;59:49-57), where similar approximation were used to derive ideal amino acid interaction forms from CPs. Both approximations allow us to understand general trends in amino acid similarity and can help improve multiple sequence alignments using the fast Fourier transform (MAFFT), fast threading or another methods based on alignments of physicochemical profiles of protein sequences. The use of this approximation in sequence alignments instead of a classical SM yields results that differ by less than 5%. Intermediate links between SMs and CPs, new formulas for approximating these matrices, and the highly significant dependence of classical SMs on coil preferences are new findings.
Collapse
Affiliation(s)
- Piotr Pokarowski
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, Warsaw University, 02-097 Warsaw, Poland.
| | | | | | | | | | | |
Collapse
|
9
|
Le SQ, Lartillot N, Gascuel O. Phylogenetic mixture models for proteins. Philos Trans R Soc Lond B Biol Sci 2009; 363:3965-76. [PMID: 18852096 DOI: 10.1098/rstb.2008.0180] [Citation(s) in RCA: 142] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores in maximum-likelihood framework phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution.We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE.We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
Collapse
Affiliation(s)
- Si Quang Le
- Méthodes et Algorithmes pour Bioinformatique, LIRMM, CNRS - Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France
| | | | | |
Collapse
|
10
|
Abstract
We introduce a geometric framework suitable for studying the relationships among biological sequences. In contrast to previous works, our formulation allows asymmetric distances (quasi-metrics), originating from uneven weighting of strings, which may induce non-trivial partial orders on sets of biosequences. The distances considered are more general than traditional generalized string edit distances. In particular, our framework enables non-trivial conversion between sequence similarities, both local and global, and distances. Our constructions apply to a wide class of scoring schemes and require much less restrictive gap penalties than the ones regularly used. Numerous examples are provided to illustrate the concepts introduced and their potential applications.
Collapse
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | | |
Collapse
|
11
|
Quang LS, Gascuel O, Lartillot N. Empirical profile mixture models for phylogenetic reconstruction. ACTA ACUST UNITED AC 2008; 24:2317-23. [PMID: 18718941 DOI: 10.1093/bioinformatics/btn445] [Citation(s) in RCA: 207] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Previous studies have shown that accounting for site-specific amino acid replacement patterns using mixtures of stationary probability profiles offers a promising approach for improving the robustness of phylogenetic reconstructions in the presence of saturation. However, such profile mixture models were introduced only in a Bayesian context, and are not yet available in a maximum likelihood (ML) framework. In addition, these mixture models only perform well on large alignments, from which they can reliably learn the shapes of profiles, and their associated weights. RESULTS In this work, we introduce an expectation-maximization algorithm for estimating amino acid profile mixtures from alignment databases. We apply it, learning on the HSSP database, and observe that a set of 20 profiles is enough to provide a better statistical fit than currently available empirical matrices (WAG, JTT), in particular on saturated data.
Collapse
Affiliation(s)
- Le Si Quang
- Méthodes et Algorithmes pour la Bioinformatique, LIRMM, CNRS-UM2, Montpellier Cedex 5, France
| | | | | |
Collapse
|
12
|
Blanquart S, Lartillot N. A site- and time-heterogeneous model of amino acid replacement. Mol Biol Evol 2008; 25:842-58. [PMID: 18234708 DOI: 10.1093/molbev/msn018] [Citation(s) in RCA: 166] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We combined the category (CAT) mixture model (Lartillot N, Philippe H. 2004) and the nonstationary break point (BP) model (Blanquart S, Lartillot N. 2006) into a new model, CAT-BP, accounting for variations of the evolutionary process both along the sequence and across lineages. As in CAT, the model implements a mixture of distinct Markovian processes of substitution distributed among sites, thus accommodating site-specific selective constraints induced by protein structure and function. Furthermore, as in BP, these processes are nonstationary, and their equilibrium frequencies are allowed to change along lineages in a correlated way, through discrete shifts in global amino acid composition distributed along the phylogenetic tree. We implemented the CAT-BP model in a Bayesian Markov Chain Monte Carlo framework and compared its predictions with those of 3 simpler models, BP, CAT, and the site- and time-homogeneous general time-reversible (GTR) model, on a concatenation of 4 mitochondrial proteins of 20 arthropod species. In contrast to GTR, BP, and CAT, which all display a phylogenetic reconstruction artifact positioning the bees Apis mellifera and Melipona bicolor among chelicerates, the CAT-BP model is able to recover the monophyly of insects. Using posterior predictive tests, we further show that the CAT-BP combination yields better anticipations of site- and taxon-specific amino acid frequencies and that it better accounts for the homoplasies that are responsible for the artifact. Altogether, our results show that the joint modeling of heterogeneities across sites and along time results in a synergistic improvement of the phylogenetic inference, indicating that it is essential to disentangle the combined effects of both sources of heterogeneity, in order to overcome systematic errors in protein phylogenetic analyses.
Collapse
Affiliation(s)
- Samuel Blanquart
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, UMR 5506, CNRS-Université de Montpellier 2, Montpellier, France.
| | | |
Collapse
|
13
|
Miller GK, Fridell SL. A Forgotten Discrete Distribution? Reviving the Negative Hypergeometric Model. AM STAT 2007. [DOI: 10.1198/000313007x245140] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
14
|
Lartillot N, Brinkmann H, Philippe H. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 2007; 7 Suppl 1:S4. [PMID: 17288577 PMCID: PMC1796613 DOI: 10.1186/1471-2148-7-s1-s4] [Citation(s) in RCA: 416] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Thanks to the large amount of signal contained in genome-wide sequence alignments, phylogenomic analyses are converging towards highly supported trees. However, high statistical support does not imply that the tree is accurate. Systematic errors, such as the Long Branch Attraction (LBA) artefact, can be misleading, in particular when the taxon sampling is poor, or the outgroup is distant. In an otherwise consistent probabilistic framework, systematic errors in genome-wide analyses can be traced back to model mis-specification problems, which suggests that better models of sequence evolution should be devised, that would be more robust to tree reconstruction artefacts, even under the most challenging conditions. METHODS We focus on a well characterized LBA artefact analyzed in a previous phylogenomic study of the metazoan tree, in which two fast-evolving animal phyla, nematodes and platyhelminths, emerge either at the base of all other Bilateria, or within protostomes, depending on the outgroup. We use this artefactual result as a case study for comparing the robustness of two alternative models: a standard, site-homogeneous model, based on an empirical matrix of amino-acid replacement (WAG), and a site-heterogeneous mixture model (CAT). In parallel, we propose a posterior predictive test, allowing one to measure how well a model acknowledges sequence saturation. RESULTS Adopting a Bayesian framework, we show that the LBA artefact observed under WAG disappears when the site-heterogeneous model CAT is used. Using cross-validation, we further demonstrate that CAT has a better statistical fit than WAG on this data set. Finally, using our statistical goodness-of-fit test, we show that CAT, but not WAG, correctly accounts for the overall level of saturation, and that this is due to a better estimation of site-specific amino-acid preferences. CONCLUSION The CAT model appears to be more robust than WAG against LBA artefacts, essentially because it correctly anticipates the high probability of convergences and reversions implied by the small effective size of the amino-acid alphabet at each site of the alignment. More generally, our results provide strong evidence that site-specificities in the substitution process need be accounted for in order to obtain more reliable phylogenetic trees.
Collapse
Affiliation(s)
- Nicolas Lartillot
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, UMR 5506, CNRS-Université de Montpellier 2, 161, rue Ada, 34392 Montpellier Cedex 5, France
| | - Henner Brinkmann
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
| | - Hervé Philippe
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
| |
Collapse
|
15
|
Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schäffer AA, Yu YK. Protein database searches using compositionally adjusted substitution matrices. FEBS J 2005; 272:5101-9. [PMID: 16218944 PMCID: PMC1343503 DOI: 10.1111/j.1742-4658.2005.04945.x] [Citation(s) in RCA: 740] [Impact Index Per Article: 38.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of blast.
Collapse
Affiliation(s)
- Stephen F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
| | | | | | | | | | | | | |
Collapse
|
16
|
Abstract
MOTIVATION Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations and by assessing the ability of this algorithm to detect remote homologies. RESULTS Our analysis indicates that local correlations between substitutions are not strong on the average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
Collapse
Affiliation(s)
- Gavin E Crooks
- Department of Plant and Microbial Biology 111 Koshland Hall #3102 University of California, Berkeley, CA 94720-3102, USA.
| | | | | |
Collapse
|