1
Tinh NH, Dang CC, Vinh LS. nT4X and nT4M: Novel Time Non-reversible Mixture Amino Acid Substitution Models. J Mol Evol 2025:10.1007/s00239-024-10230-8. [PMID: 39832000 DOI: 10.1007/s00239-024-10230-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Accepted: 12/16/2024] [Indexed: 01/22/2025]
Abstract
One of the most important and difficult challenges in the research of molecular evolution is modeling the process of amino acid substitutions. Although single-matrix models, such as the LG model, are popular, their capability to properly capture the heterogeneity of the substitution process across sites is still questioned. Several mixture models with multiple matrices have been introduced and shown to offer advantages over single-matrix models. Current general mixture models assume the reversibility of the evolutionary process, implying that substitution rates between any two amino acids are equal in both forward and backward directions. This assumption is not based on biological properties but rather on computational simplicity. The well-known hypothesis is that more realistic models can yield more accurate evolutionary inferences; therefore, our aim is to estimate more biologically realistic models. To this end, we relax the assumption of reversibility and introduce two new general non-reversible 4-matrix mixture models, called nT4M and nT4X. Using alignments from HSSP and TreeBASE databases as data, our newly estimated models outperformed all single-matrix models and almost all reversible mixture models. Moreover, the new non-reversible mixture models enable us to infer rooted trees.
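The reversibility assumption described in this abstract is usually formalized as detailed balance: for stationary frequencies π and rate matrix Q, π_i q_ij = π_j q_ji for every pair of states. The sketch below checks that condition on toy 3-state matrices rather than real 20-state amino acid matrices; the function name `is_reversible` and both example matrices are illustrative assumptions, not taken from the paper.

```python
def is_reversible(Q, pi, tol=1e-12):
    """Return True if rate matrix Q satisfies detailed balance under pi."""
    n = len(Q)
    return all(
        abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )

# Toy reversible matrix: q_ij = r_ij * pi_j with symmetric r_ij (hypothetical values).
pi_rev = [0.5, 0.3, 0.2]
r = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
Q_rev = [[r[i][j] * pi_rev[j] for j in range(3)] for i in range(3)]
for i in range(3):
    # Diagonal entries make each row sum to zero.
    Q_rev[i][i] = -sum(Q_rev[i][j] for j in range(3) if j != i)

# Toy non-reversible matrix: substitutions only cycle 0 -> 1 -> 2 -> 0.
# The uniform distribution is stationary for it, yet detailed balance fails.
pi_cyc = [1 / 3, 1 / 3, 1 / 3]
Q_cyc = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]

reversible_ok = is_reversible(Q_rev, pi_rev)
cyclic_ok = is_reversible(Q_cyc, pi_cyc)
```

A non-reversible model drops this constraint, so forward and backward rates are free to differ, which is what makes the direction of time (and hence the root) identifiable in principle.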
Affiliation(s)
- Nguyen Huy Tinh
- University of Engineering and Technology, Vietnam National University, 144 Xuan Thuy, Cau Giay, 10000, Hanoi, Vietnam
- Cuong Cao Dang
- University of Engineering and Technology, Vietnam National University, 144 Xuan Thuy, Cau Giay, 10000, Hanoi, Vietnam
- Le Sy Vinh
- University of Engineering and Technology, Vietnam National University, 144 Xuan Thuy, Cau Giay, 10000, Hanoi, Vietnam.
2
Ren H, Wong TKF, Minh BQ, Lanfear R. MixtureFinder: Estimating DNA Mixture Models for Phylogenetic Analyses. Mol Biol Evol 2025; 42:msae264. [PMID: 39715360 PMCID: PMC11704958 DOI: 10.1093/molbev/msae264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 11/26/2024] [Accepted: 12/19/2024] [Indexed: 12/25/2024] Open
Abstract
In phylogenetic studies, both partitioned models and mixture models are used to account for heterogeneity in molecular evolution among the sites of DNA sequence alignments. Partitioned models require the user to specify the grouping of sites into subsets, and then assume that each subset of sites can be modeled by a single common process. Mixture models do not require users to prespecify subsets of sites, and instead calculate the likelihood of every site under every model, while co-estimating the model weights and parameters. While much research has gone into the optimization of partitioned models by merging user-specified subsets, there has been less attention paid to the optimization of mixture models for DNA sequence alignments. In this study, we first ask whether a key assumption of partitioned models-that each user-specified subset can be modeled by a single common process-is supported by the data. Having shown that this is not the case, we then design, implement, test, and apply an algorithm, MixtureFinder, to select the optimum number of classes for a mixture model of Q-matrices for the standard models of DNA sequence evolution. We show this algorithm performs well on simulated and empirical datasets and suggest that it may be useful for future empirical studies. MixtureFinder is available in IQ-TREE2, and a tutorial for using MixtureFinder can be found here: http://www.iqtree.org/doc/Complex-Models#mixture-models.
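The mixture-model likelihood described in this abstract can be sketched as a weighted sum over classes at each site, followed by a sum of logs over sites. This is a minimal illustration only: the per-site, per-class likelihoods are hypothetical numbers (in practice they come from a tree likelihood computation), and the function name is an assumption, not part of IQ-TREE or MixtureFinder.

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture model.

    site_likelihoods[i][k] is the likelihood of site i under class k;
    weights[k] is the mixture weight of class k (weights sum to 1).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class weights must sum to 1")
    total = 0.0
    for per_class in site_likelihoods:
        # Every site is evaluated under every class, then averaged by weight.
        site_l = sum(w * l for w, l in zip(weights, per_class))
        total += math.log(site_l)
    return total

# Two sites, two classes (hypothetical values).
site_likelihoods = [[0.02, 0.10], [0.30, 0.05]]
weights = [0.4, 0.6]
log_l = mixture_log_likelihood(site_likelihoods, weights)
```

Unlike a partitioned model, no site is assigned to a class in advance; the weights play that role probabilistically.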
Affiliation(s)
- Huaiyan Ren
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Thomas K F Wong
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Bui Quang Minh
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
3
Dotan E, Wygoda E, Ecker N, Alburquerque M, Avram O, Belinkov Y, Pupko T. BetaAlign: a deep learning approach for multiple sequence alignment. Bioinformatics 2024; 41:btaf009. [PMID: 39775454 PMCID: PMC11758787 DOI: 10.1093/bioinformatics/btaf009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Revised: 12/21/2024] [Accepted: 01/07/2025] [Indexed: 01/11/2025] Open
Abstract
MOTIVATION: Multiple sequence alignments (MSAs) are extensively used in biology, from phylogenetic reconstruction to structure and function prediction. Here, we suggest an out-of-the-box approach for the inference of MSAs, which relies on algorithms developed for processing natural languages. We show that our artificial intelligence (AI)-based methodology can be trained to align sequences by processing alignments that are generated via simulations, and thus different aligners can be easily generated for datasets with specific evolutionary dynamics attributes. We expect that natural language processing (NLP) solutions will replace or augment classic solutions for computing alignments, and more generally, challenging inference tasks in phylogenomics.
RESULTS: The MSA problem is a fundamental pillar in bioinformatics, comparative genomics, and phylogenetics. Here, we characterize and improve BetaAlign, the first deep learning aligner, which substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on NLP techniques and trains transformers to map a set of unaligned biological sequences to an MSA. We show that our approach is highly accurate, comparable to, and sometimes better than, state-of-the-art alignment tools. We characterize the performance of BetaAlign and the effect of various aspects on accuracy; for example, the size of the training data, the effect of different transformer architectures, and the effect of learning on a subspace of indel-model parameters (subspace learning). We also introduce a new technique that leads to improved performance compared to our previous approach. Our findings further uncover the potential of NLP-based methods for sequence alignment, highlighting that AI-based algorithms can substantially challenge classic approaches in phylogenomics and bioinformatics.
AVAILABILITY AND IMPLEMENTATION: Datasets used in this work are available on HuggingFace (Wolf et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. p.38-45. 2020) at: https://huggingface.co/dotan1111. Source code is available at: https://github.com/idotan286/SimulateAlignments.
Affiliation(s)
- Edo Dotan
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- The Henry and Marilyn Taub Faculty of Computer Science, Technion—Israel Institute of Technology, Haifa 3200003, Israel
- Elya Wygoda
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Noa Ecker
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Michael Alburquerque
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Oren Avram
- The Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, United States
- Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion—Israel Institute of Technology, Haifa 3200003, Israel
- Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
4
Hill-Terán G, Petrich J, Falcone Ferreyra ML, Aybar MJ, Coux G. Untangling Zebrafish Genetic Annotation: Addressing Complexities and Nomenclature Issues in Orthologous Evaluation of TCOF1 and NOLC1. J Mol Evol 2024; 92:744-760. [PMID: 39269459 DOI: 10.1007/s00239-024-10200-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 08/27/2024] [Indexed: 09/15/2024]
Abstract
Treacher Collins syndrome (TCS) is a genetic disorder affecting facial development, primarily caused by mutations in the TCOF1 gene. TCOF1, along with NOLC1, plays important roles in ribosomal RNA transcription and processing. Previously, a zebrafish model of TCS successfully recapitulated the main characteristics of the syndrome by knocking down the expression of a gene on chromosome 13 (coding for Uniprot ID B8JIY2), which was identified as the TCOF1 orthologue. However, database updates renamed this gene as nolc1 and the zebrafish database (ZFIN) identified a different gene on chromosome 14 as the TCOF1 orthologue (coding for Uniprot ID E7F9D9). NOLC1 and TCOF1 are large proteins with unstructured regions and repetitive sequences that complicate alignments and comparisons. The additional whole genome duplication of teleosts adds further difficulty. In this study, we present evidence that NOLC1 and TCOF1 are paralogs, and that the zebrafish gene on chromosome 14 is a low-complexity LisH domain-containing factor that displays homology to NOLC1 but lacks essential sequence features to accomplish TCOF1 nucleolar functions. Our analysis also supports the idea that zebrafish, as has been suggested for other non-tetrapod vertebrates, lack the TCOF1 gene that is associated with tripartite nucleolus. Using BLAST searches in a group of teleost genomes, we identified fish-specific sequences similar to E7F9D9 zebrafish protein. We propose naming them "LisH-containing Low Complexity Proteins" (LLCP). Interestingly, the gene on chromosome 13 (nolc1) displays the sequence features, developmental expression patterns, and phenotypic impact of depletion that are characteristic of TCOF1 functions. These findings suggest that in teleost fish, the nucleolar functions described for both NOLC1 and TCOF1, mediated by their repeated motifs, are carried out by a single gene, nolc1.
Our study, which is mainly based on computational tools available as free web-based algorithms, could help to solve similar conflicts regarding gene orthology in zebrafish.
Affiliation(s)
- Guillermina Hill-Terán
- Instituto Superior de Investigaciones Biológicas (INSIBIO, CONICET-UNT), CONICET-UNT, San Miguel de Tucumán, Tucumán, Argentina
- Julieta Petrich
- Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario, Suipacha 531, (S2002LRK), Rosario, Santa Fe., Argentina
- Centro de Estudios Fotosintéticos y Bioquímicos (CEFOBI), CONICET, Suipacha 531, (S2002LRK), Rosario, Santa Fe., Argentina
- Maria Lorena Falcone Ferreyra
- Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario, Suipacha 531, (S2002LRK), Rosario, Santa Fe., Argentina
- Centro de Estudios Fotosintéticos y Bioquímicos (CEFOBI), CONICET, Suipacha 531, (S2002LRK), Rosario, Santa Fe., Argentina
- Manuel J Aybar
- Instituto Superior de Investigaciones Biológicas (INSIBIO, CONICET-UNT), CONICET-UNT, San Miguel de Tucumán, Tucumán, Argentina
- Facultad de Bioquímica Química y Farmacia, Instituto de Biología "Dr. Francisco D. Barbieri", Universidad Nacional de Tucumán, San Miguel de Tucumán, Tucumán, Argentina
- Gabriela Coux
- Instituto de Biología Molecular y Celular de Rosario (IBR, CONICET-UNR), CONICET, CCT-Rosario CONICET, Ocampo y Esmeralda, (S2000EZP), Rosario, Argentina.
- Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario (UNR), Suipacha 531, (S2002LRK), Rosario, Santa Fe., Argentina.
5
Banos H, Wong TKF, Daneau J, Susko E, Minh BQ, Lanfear R, Brown MW, Eme L, Roger AJ. GTRpmix: A Linked General Time-Reversible Model for Profile Mixture Models. Mol Biol Evol 2024; 41:msae174. [PMID: 39158305 PMCID: PMC11371462 DOI: 10.1093/molbev/msae174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 06/25/2024] [Accepted: 08/12/2024] [Indexed: 08/20/2024] Open
Abstract
Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely optimal for profile mixture models. Here, we describe the GTRpmix model that allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic-supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
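The step of combining one shared exchangeability matrix with each profile can be sketched with the usual GTR-style parameterization, q_ij = r_ij π_j for i ≠ j, with diagonal entries chosen so that every row sums to zero. The example uses a toy 3-state alphabet and hypothetical values instead of 20 amino acids, omits rate normalization, and the function name `rate_matrix` is an assumption rather than anything from GTRpmix or IQ-TREE.

```python
def rate_matrix(exchange, profile):
    """Build Q from symmetric exchangeabilities and one frequency profile."""
    n = len(profile)
    Q = [
        [exchange[i][j] * profile[j] if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        # The diagonal is still 0.0 here, so this is minus the off-diagonal row sum.
        Q[i][i] = -sum(Q[i])
    return Q

# One exchangeability matrix shared ("linked") across all mixture classes.
exchange = [[0.0, 1.5, 0.4], [1.5, 0.0, 2.0], [0.4, 2.0, 0.0]]

# Two hypothetical profiles, each giving its own rate matrix.
profiles = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
rate_matrices = [rate_matrix(exchange, p) for p in profiles]
```

Because the exchangeabilities are symmetric, each resulting matrix is time-reversible with respect to its own profile; only the profiles differ between classes.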
Affiliation(s)
- Hector Banos
- Department of Mathematics, California State University San Bernardino, San Bernardino, CA, USA
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
- Thomas K F Wong
- School of Computing, College of Engineering and Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Justin Daneau
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
- Edward Susko
- Department of Mathematics and Statistics, Faculty of Science, Dalhousie University, Halifax, NS, Canada
- Bui Quang Minh
- School of Computing, College of Engineering and Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Matthew W Brown
- Department of Biological Sciences, Mississippi State University, Mississippi State, MS, USA
- Laura Eme
- Laboratoire d’Ecologie, systématique et Evolution, Université Paris-Saclay, Gif-sur-Yvette, France
- Andrew J Roger
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
6
Lax G, Park E, Na I, Jacko-Reynolds V, Kwong WK, House CSE, Trznadel M, Wakeman K, Leander BS, Keeling P. Phylogenomic diversity of archigregarine apicomplexans. Open Biol 2024; 14:240141. [PMID: 39317333 PMCID: PMC11500723 DOI: 10.1098/rsob.240141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 09/26/2024] Open
Abstract
Gregarines are a large and diverse subgroup of Apicomplexa, a lineage of obligate animal symbionts including pathogens such as Plasmodium, the malaria parasite. Unlike Plasmodium, however, gregarines are poorly studied, despite the fact that as early-branching apicomplexans they are crucial to our understanding of the origin and evolution of all apicomplexans and their parasitic lifestyle. Exemplifying this, the earliest branch of gregarines, the archigregarines, are particularly poorly studied: around 80 species have been described from marine invertebrates, but almost all of them were assigned to a single genus, Selenidium. Most are known only from light micrographs and largely unresolved rDNA phylogenies, where they exhibit a great deal of sequence variation, and fall into four subclades. To resolve the relationships within archigregarines, we sequenced 12 single-cell transcriptomes from species representing all four known subclades, as well as one blastogregarine (which frequently branch with Selenidium). A 190-gene phylogenomic tree confirmed four maximally supported individual clades of archigregarines and blastogregarines. These clades are discrete and distantly related, and also correlate with host identity. We propose the establishment of three novel genera of archigregarines to reflect their phylogenetic diversity and host range, and nine novel species isolated from a range of marine invertebrates.
Affiliation(s)
- Gordon Lax
- Department of Botany, University of British Columbia, Vancouver, Canada
- Eunji Park
- Department of Botany, University of British Columbia, Vancouver, Canada
- Ina Na
- Department of Botany, University of British Columbia, Vancouver, Canada
- Chloe S. E. House
- Department of Botany, University of British Columbia, Vancouver, Canada
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, Canada
- Morelia Trznadel
- Department of Botany, University of British Columbia, Vancouver, Canada
- Kevin Wakeman
- Institute for the Advancement of Higher Education, Hokkaido University, Sapporo, Hokkaido, Japan
- Brian S. Leander
- Department of Botany, University of British Columbia, Vancouver, Canada
- Department of Zoology, University of British Columbia, Vancouver, Canada
- Patrick Keeling
- Department of Botany, University of British Columbia, Vancouver, Canada
7
Bharti J, Verma R, Gupta I, Chakraborty P, Eashwaran M, Sony SK, Nehra M, Thangraj A, Kaul R, Fathy K, Kaul T. Functional characterization of novel mutations in the conserved region of EPSPS for herbicide resistance in pigeonpea: structure-based coherent design. J Biomol Struct Dyn 2024; 42:6065-6080. [PMID: 37652402 DOI: 10.1080/07391102.2023.2243522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 06/21/2023] [Indexed: 09/02/2023]
Abstract
Agroecosystems for the growth of crops provide a fertile, productive, and tropical environment, which attracts infestation by weedy plant species that compete with the primary crop plants. Infestation by weeds is a major biotic stress factor faced by pigeonpea that hampers the productivity of the crop. In the modern era, with the development of chemicals, the problem of weed infestation is dealt with using herbicides. The most widely utilized, post-emergent, broad-spectrum herbicide has an essential active ingredient called glyphosate. Glyphosate mechanistically inhibits a chloroplastic enzyme 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) by competitively interacting with the PEP binding site which hinders the shikimate pathway and the production of essential aromatic amino acids (Phe, Tyr, Trp) and other secondary metabolites in plants. Moreover, herbicide spray for weed management is lethal to both the primary crop and the weeds. Therefore, it is critical to develop herbicide-resistant crops for field purposes to reduce the associated yield and economic losses. In this study, the in-silico analysis drove the selection and validation of the point mutations in the conserved region of the EPSPS gene, which confer efficient herbicide resistance to the mutated-CcEPSPS enzyme along with the retention of the normal enzyme function. An optimized in-silico validation of the target mutation before the development of the genome-edited resistant plant lines is a prerequisite for testing their efficacy as a proof of concept. We validated the combination of GATIPS mutation for its no-cost effect at the enzyme level via molecular dynamics (MD) simulation. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Jyotsna Bharti
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Rachana Verma
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Isha Gupta
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Puja Chakraborty
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Murugesh Eashwaran
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Sonia Khan Sony
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Mamta Nehra
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Arulprakash Thangraj
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Rashmi Kaul
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Khaled Fathy
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Tanushri Kaul
- Nutritional Improvement of Crops Group, Plant Biology & Biotechnology, International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
8
Baños H, Susko E, Roger AJ. Is Over-parameterization a Problem for Profile Mixture Models? Syst Biol 2024; 73:53-75. [PMID: 37843172 PMCID: PMC11129589 DOI: 10.1093/sysbio/syad063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2022] [Revised: 09/12/2023] [Accepted: 10/13/2023] [Indexed: 10/17/2023] Open
Abstract
Biochemical constraints on the admissible amino acids at specific sites in proteins lead to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models associated with the many amino acid frequency vectors can adversely affect tree topology estimates because of over-parameterization. Here, we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, tree, amino acid frequencies, and other model parameters converge to true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments we explore the performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency vectors. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency vectors does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely negatively affect parameter estimation. 
Finally, we explore the effects of including in the profile mixture model an additional "F-class" representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.
Affiliation(s)
- Hector Baños
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Andrew J Roger
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
9
Zhao C, Wang S. AttCON: With better MSAs and attention mechanism for accurate protein contact map prediction. Comput Biol Med 2024; 169:107822. [PMID: 38091726 DOI: 10.1016/j.compbiomed.2023.107822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/19/2023] [Accepted: 12/04/2023] [Indexed: 02/08/2024]
Abstract
Protein contact map prediction is a critical and vital step in protein structure prediction, and its accuracy is highly contingent upon the feature representations of protein sequence information and the efficacy of deep learning models. In this paper, we propose an algorithm, DeepMSA+, to generate protein multiple sequence alignments (MSAs) and to construct feature representations based on co-evolutionary information and sequence information derived from MSAs. We also propose an improved deep learning model, AttCON, for training input features to predict protein contact map. The model incorporates an attention module, and by comparing different attention modules, we find a parameter-free attention module suitable for contact map prediction. Additionally, we use the Focal Loss function to better address the data imbalance issue in protein contact map. We also developed a weighted evaluation index (W score) for model evaluation, which takes into account a wide range of metrics. W score is comprehensive in its scope, with a particular focus on the precision of predictions for medium-range and long-range contacts. Experimental results show that AttCON achieves good precision results on datasets from CASP11 to CASP15. Compared to some state-of-the-art methods, it achieves an average improvement of over 5% in both medium-range and long-range predictions, and W score is improved by an average of 2 points.
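The Focal Loss mentioned in this abstract addresses class imbalance (most residue pairs are not in contact) by down-weighting examples the model already classifies well. A minimal per-example sketch of the standard binary form, FL = -α_t (1 - p_t)^γ log(p_t), is shown below; the values of `gamma` and `alpha` are illustrative and are not taken from the AttCON paper, which may use a different variant.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class (a contact),
    y is the true label (1 = contact, 0 = no contact).
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct contact contributes far less than a badly missed one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)

# With gamma = 0 the modulating factor vanishes and only weighted cross-entropy remains.
plain = focal_loss(0.8, 1, gamma=0.0, alpha=0.5)
```

Raising `gamma` shifts more of the training signal toward hard examples, which is typically the rare contacts in a contact map.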
Affiliation(s)
- Che Zhao
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
- Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, 650504, Yunnan, China.
10
Bowman J, Enard D, Lynch VJ. Phylogenomics reveals an almost perfect polytomy among the almost ungulates (Paenungulata). bioRxiv: The Preprint Server for Biology 2023:2023.12.07.570590. [PMID: 38106080 PMCID: PMC10723481 DOI: 10.1101/2023.12.07.570590] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/19/2023]
Abstract
Phylogenetic studies have resolved most relationships among Eutherian Orders. However, the branching order of elephants (Proboscidea), hyraxes (Hyracoidea), and sea cows (Sirenia) (i.e., the Paenungulata) has remained uncertain since at least 1758, when Linnaeus grouped elephants and manatees into a single Order (Bruta) to the exclusion of hyraxes. Subsequent morphological, molecular, and large-scale phylogenomic datasets have reached conflicting conclusions on the branching order within Paenungulates. We use a phylogenomic dataset of alignments from 13,388 protein-coding genes across 261 Eutherian mammals to infer phylogenetic relationships within Paenungulates. We find that gene trees almost equally support the three alternative resolutions of Paenungulate relationships and that despite strong support for a Proboscidea+Hyracoidea split in the multispecies coalescent (MSC) tree, there is significant evidence for gene tree uncertainty, incomplete lineage sorting, and introgression among Proboscidea, Hyracoidea, and Sirenia. Indeed, only 8-10% of genes have statistically significant phylogenetic signal to reject the hypothesis of a Paenungulate polytomy. These data indicate little support for any resolution for the branching order Proboscidea, Hyracoidea, and Sirenia within Paenungulata and suggest that Paenungulata may be as close to a real, or at least unresolvable, polytomy as possible.
Affiliation(s)
- Jacob Bowman
- Department of Biological Sciences, University at Buffalo, SUNY, 551 Cooke Hall, Buffalo, NY, USA
| | - David Enard
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Vincent J. Lynch
- Department of Biological Sciences, University at Buffalo, SUNY, 551 Cooke Hall, Buffalo, NY, USA
| |
|
11
|
Smith SA, Walker-Hale N, Parins-Fukuchi CT. Compositional shifts associated with major evolutionary transitions in plants. New Phytol 2023; 239:2404-2415. [PMID: 37381083 DOI: 10.1111/nph.19099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 06/04/2023] [Indexed: 06/30/2023]
Abstract
Heterogeneity in gene trees, morphological characters, and composition has been associated with several major plant clades. Here, we examine heterogeneity in composition across a large transcriptomic dataset of plants to better understand whether locations of shifts in composition are shared across gene regions and whether directions of shifts within clades are shared across gene regions. We estimate mixed models of composition for both nucleotide and amino acids across a recent large-scale transcriptomic dataset for plants. We find shifts in composition across both nucleotide and amino acid datasets, with more shifts detected in nucleotides. We find that Chlorophytes and lineages within experience the most shifts. However, many shifts occur at the origins of land, vascular, and seed plants. While genes in these clades do not typically share the same composition, they tend to shift in the same direction. We discuss potential causes of these patterns. Compositional heterogeneity has been highlighted as a potential problem for phylogenetic analysis, but the variation presented here highlights the need to further investigate these patterns for the signal of biological processes.
Affiliation(s)
- Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48103, USA
| |
|
12
|
Szánthó LL, Lartillot N, Szöllősi GJ, Schrempf D. Compositionally Constrained Sites Drive Long-Branch Attraction. Syst Biol 2023; 72:767-780. [PMID: 36946562 PMCID: PMC10405358 DOI: 10.1093/sysbio/syad013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2022] [Revised: 03/01/2023] [Accepted: 03/16/2023] [Indexed: 03/23/2023]
Abstract
Accurate phylogenies are fundamental to our understanding of the pattern and process of evolution. Yet, phylogenies at deep evolutionary timescales, with correspondingly long branches, have been fraught with controversy resulting from conflicting estimates from models with varying complexity and goodness of fit. Analyses of historical as well as current empirical datasets, such as alignments including Microsporidia, Nematoda, or Platyhelminthes, have demonstrated that inadequate modeling of across-site compositional heterogeneity, which is the result of biochemical constraints that lead to varying patterns of accepted amino acids along sequences, can lead to erroneous topologies that are strongly supported. Unfortunately, models that adequately account for across-site compositional heterogeneity remain computationally challenging or intractable for an increasing fraction of contemporary datasets. Here, we introduce "compositional constraint analysis," a method to investigate the effect of site-specific constraints on amino acid composition on phylogenetic inference. We show that more constrained sites with lower diversity and less constrained sites with higher diversity exhibit ostensibly conflicting signals under models ignoring across-site compositional heterogeneity that lead to long-branch attraction artifacts and demonstrate that more complex models accounting for across-site compositional heterogeneity can ameliorate this bias. We present CAT-posterior mean site frequencies (PMSF), a pipeline for diagnosing and resolving phylogenetic bias resulting from inadequate modeling of across-site compositional heterogeneity based on the CAT model. CAT-PMSF is robust against long-branch attraction in all alignments we have examined. We suggest using CAT-PMSF when convergence of the CAT model cannot be assured. 
We find evidence that compositionally constrained sites are driving long-branch attraction in two metazoan datasets and recover evidence for Porifera as the sister group to all other animals. [Animal phylogeny; cross-site heterogeneity; long-branch attraction; phylogenomics.]
Affiliation(s)
- Lénárd L Szánthó
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Nicolas Lartillot
- Laboratoire de Biométrie et Biologie Evolutive UMR 5558, CNRS, Université de Lyon, Villeurbanne, France
| | - Gergely J Szöllősi
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Dominik Schrempf
- Department of Biological Physics, Eötvös University, Budapest, Hungary
| |
|
13
|
Ishaq SE, Ahmad T, Liang L, Hou J, Dong Y, Yu T, Wang F. Mariluticola halotolerans gen. nov., sp. nov., a novel member of the family Devosiaceae isolated from South China Sea sediment. Int J Syst Evol Microbiol 2023; 73. [PMID: 37486324 DOI: 10.1099/ijsem.0.005972] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 07/25/2023]
Abstract
A novel member of class Alphaproteobacteria was isolated from marine sediment of the South China Sea. Cells of strain LMO-2T were Gram-stain negative, greyish in colour, motile, with a single lateral flagellum and short rod in shape with a slight curve. Strain LMO-2T was positive for oxidase and negative for catalase. The bacterium grew aerobically at 10-40 °C (optimum, 25-30 °C), pH 5.5-10.0 (optimum, pH 7.0) and 0-9 % NaCl (w/v; optimum, 2-3 %). Phylogenetic analysis of the 16S rRNA gene sequence and phylogenomic analysis of the whole genome sequence indicated that strain LMO-2T represents a new genus and a new species within the family Devosiaceae, class Alphaproteobacteria, phylum Pseudomonadota. Comparisons of the 16S rRNA gene sequences of strain LMO-2T showed 94.8 % similarity to its closest relative. The genome size is ~3.45 Mbp with a DNA G+C content of 58.17 mol%. The strain possesses potential capability for the degradation of complex organic matter, i.e. fatty acid and benzoate. The predominant cellular fatty acids (>10 %) were C16 : 0 and C18 : 1 ω7c 11-methyl. The sole respiratory quinone was ubiquinone-10. The major identified polar lipids were diphosphatidylglycerol, phosphatidylglycerol and phospholipid. Based on the polyphasic taxonomic data, strain LMO-2T represents a novel genus and a novel species for which the name Mariluticola halotolerans gen. nov., sp. nov., was proposed in the family Devosiaceae. The type strain is LMO-2T (=CGMCC 1.19273T=JCM 34934T).
Affiliation(s)
- Sidra Erum Ishaq
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Tariq Ahmad
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Lewen Liang
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Jialin Hou
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Yijing Dong
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Tiantian Yu
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| | - Fengping Wang
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, PR China
- School of Oceanography, Shanghai Jiao Tong University, Shanghai, 200240, PR China
| |
|
14
|
Zhao S, Zhang K, Lin C, Cheng M, Song J, Ru X, Wang Z, Wang W, Yang Q. Identification of a Novel Pleiotropic Transcriptional Regulator Involved in Sporulation and Secondary Metabolism Production in Chaetomium globosum. Int J Mol Sci 2022; 23:ijms232314849. [PMID: 36499180 PMCID: PMC9740612 DOI: 10.3390/ijms232314849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Revised: 11/13/2022] [Accepted: 11/16/2022] [Indexed: 12/03/2022]
Abstract
Chaetoglobosin A (CheA), a well-known macrocyclic alkaloid with prominent antimycotic, antiparasitic, and antitumor properties, is mainly produced by Chaetomium globosum. However, a limited understanding of the transcriptional regulation of CheA biosynthesis has hampered its application and commercialization in agriculture and biomedicine. Here, a comprehensive study of the CgXpp1 gene, which encodes a basic helix-loop-helix family regulator with a putative role in the regulation of fungal growth and CheA biosynthesis, was performed by employing CgXpp1-disruption and CgXpp1-complementation strategies in the biocontrol species C. globosum. The results suggest that the CgXpp1 gene could be an indirect negative regulator of CheA production. Interestingly, knockout of CgXpp1 considerably increased the transcription levels of key genes and related regulatory factors associated with CheA biosynthesis. Disruption of CgXpp1 led to a significant reduction in spore production and attenuation of cell development, which was consistent with metabolome analysis results. Taken together, an in-depth analysis of pleiotropic regulation influenced by transcription factors could provide insights into the unexplored metabolic mechanisms associated with primary and secondary metabolite production.
Affiliation(s)
- Qian Yang
- Correspondence: Tel.: +86-451-8640-2652
| |
|
15
|
Goremykin V. Assessment of Absolute Substitution Model Fit Accommodating Time-Reversible and Non-Time-Reversible Evolutionary Processes. Syst Biol 2022:6632685. [PMID: 35792853 DOI: 10.1093/sysbio/syac046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Revised: 06/16/2022] [Accepted: 06/24/2022] [Indexed: 11/13/2022]
Abstract
The loss of information accompanying assessment of absolute fit of substitution models to phylogenetic data negatively affects the discriminatory power of previous methods and can make them insensitive to lineage-specific changes in the substitution process. As an alternative, I propose evaluating absolute fit of substitution models based on a novel statistic which describes the observed data without information loss and which is unlikely to become zero-inflated with increasing numbers of taxa. This method can accommodate gaps and is sensitive to lineage-specific shifts in the substitution process. In simulation experiments, it exhibits greater discriminatory power than previous methods. The method can be implemented in both Bayesian and Maximum Likelihood phylogenetic analyses, and used to screen any set of models. Recently, it has been suggested that model selection may be an unnecessary step in phylogenetic inference. However, results presented here emphasize the importance of model fit assessment for reliable phylogenetic inference.
Affiliation(s)
- Vadim Goremykin
- Research and Innovation Centre, Fondazione Edmund Mach, 38010 San Michele all'Adige (TN), Italy
| |
|
16
|
Chu L, Li S, Dong Z, Zhang Y, Jin P, Ye L, Wang X, Xiang W. Mining and engineering exporters for titer improvement of macrolide biopesticides in Streptomyces. Microb Biotechnol 2022; 15:1120-1132. [PMID: 34437755 PMCID: PMC8966021 DOI: 10.1111/1751-7915.13883] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/04/2021] [Accepted: 06/21/2021] [Indexed: 11/27/2022]
Abstract
Exporter engineering is a promising strategy to construct high-yield Streptomyces for natural product pharmaceuticals in industrial biotechnology. However, available exporters are scarce, due to the limited knowledge of bacterial transporters. Here, we built a workflow for exporter mining and devised a tunable plug-and-play exporter (TuPPE) module to improve the production of macrolide biopesticides in Streptomyces. Combining genome analyses and experimental confirmations, we found three ATP-binding cassette transporters that contribute to milbemycin production in Streptomyces bingchenggensis. We then optimized the expression level of target exporters for milbemycin titer optimization by designing a TuPPE module with replaceable promoters and ribosome binding sites. Finally, broader applications of the TuPPE module were implemented in industrial S. bingchenggensis BC04, Streptomyces avermitilis NEAU12 and Streptomyces cyaneogriseus NMWT1, which led to optimal titer improvement of milbemycin A3/A4, avermectin B1a and nemadectin α by 24.2%, 53.0% and 41.0%, respectively. Our work provides useful exporters and a convenient TuPPE module for titer improvement of macrolide biopesticides in Streptomyces. More importantly, the feasible exporter mining workflow developed here might shed light on widespread applications of exporter engineering in Streptomyces to boost the production of other secondary metabolites.
Affiliation(s)
- Liyang Chu
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Shanshan Li
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Zhuoxu Dong
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Yanyan Zhang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Pinjiao Jin
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Lan Ye
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| | - Xiangjing Wang
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
| | - Wensheng Xiang
- School of Life Science, Northeast Agricultural University, No. 59 Mucai Street, Xiangfang District, Harbin, 150030, China
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
| |
|
17
|
Moody ERR, Mahendrarajah TA, Dombrowski N, Clark JW, Petitjean C, Offre P, Szöllősi GJ, Spang A, Williams TA. An estimate of the deepest branches of the tree of life from ancient vertically-evolving genes. eLife 2022; 11:66695. [PMID: 35190025 PMCID: PMC8890751 DOI: 10.7554/elife.66695] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 01/19/2021] [Accepted: 02/07/2022] [Indexed: 11/30/2022]
Abstract
Core gene phylogenies provide a window into early evolution, but different gene sets and analytical methods have yielded substantially different views of the tree of life. Trees inferred from a small set of universal core genes have typically supported a long branch separating the archaeal and bacterial domains. By contrast, recent analyses of a broader set of non-ribosomal genes have suggested that Archaea may be less divergent from Bacteria, and that estimates of inter-domain distance are inflated due to accelerated evolution of ribosomal proteins along the inter-domain branch. Resolving this debate is key to determining the diversity of the archaeal and bacterial domains, the shape of the tree of life, and our understanding of the early course of cellular evolution. Here, we investigate the evolutionary history of the marker genes key to the debate. We show that estimates of a reduced Archaea-Bacteria (AB) branch length result from inter-domain gene transfers and hidden paralogy in the expanded marker gene set. By contrast, analysis of a broad range of manually curated marker gene datasets from an evenly sampled set of 700 Archaea and Bacteria reveals that current methods likely underestimate the AB branch length due to substitutional saturation and poor model fit; that the best-performing phylogenetic markers tend to support longer inter-domain branch lengths; and that the AB branch lengths of ribosomal and non-ribosomal marker genes are statistically indistinguishable. Furthermore, our phylogeny inferred from the 27 highest-ranked marker genes recovers a clade of DPANN at the base of the Archaea and places the bacterial Candidate Phyla Radiation (CPR) within Bacteria as the sister group to the Chloroflexota.
Affiliation(s)
- Edmund R R Moody
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Tara A Mahendrarajah
- Department of Marine Microbiology and Biogeochemistry, Royal Netherlands Institute for Sea Research, Den Burg, Netherlands
| | - Nina Dombrowski
- Department of Marine Microbiology and Biogeochemistry, Royal Netherlands Institute for Sea Research, Den Burg, Netherlands
| | - James W Clark
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Celine Petitjean
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Pierre Offre
- Department of Marine Microbiology and Biogeochemistry, Royal Netherlands Institute for Sea Research, Den Burg, Netherlands
| | - Gergely J Szöllősi
- Department of Biological Physics, Eötvös Loránd University, Budapest, Hungary
| | - Anja Spang
- Department of Marine Microbiology and Biogeochemistry, Royal Netherlands Institute for Sea Research, Den Burg, Netherlands
| | - Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| |
|
18
|
Youssef N, Susko E, Roger AJ, Bielawski JP. Shifts in amino acid preferences as proteins evolve: A synthesis of experimental and theoretical work. Protein Sci 2021; 30:2009-2028. [PMID: 34322924 PMCID: PMC8442975 DOI: 10.1002/pro.4161] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/13/2021] [Revised: 07/19/2021] [Accepted: 07/26/2021] [Indexed: 11/08/2022]
Abstract
Amino acid preferences vary across sites and time. While variation across sites is widely accepted, the extent and frequency of temporal shifts are contentious. Our understanding of the drivers of amino acid preference change is incomplete: To what extent are temporal shifts driven by adaptive versus nonadaptive evolutionary processes? We review phenomena that cause preferences to vary (e.g., evolutionary Stokes shift, contingency, and entrenchment) and clarify how they differ. To determine the extent and prevalence of shifted preferences, we review experimental and theoretical studies. Analyses of natural sequence alignments often detect decreases in homoplasy (convergence and reversions) rates, and variation in replacement rates with time, signals that are consistent with temporally changing preferences. While approaches inferring shifts in preferences from patterns in natural alignments are valuable, they are indirect since multiple mechanisms (both adaptive and nonadaptive) could lead to the observed signal. Alternatively, site-directed mutagenesis experiments allow for a more direct assessment of shifted preferences. They corroborate evidence from multiple sequence alignments, revealing that the preference for an amino acid at a site varies depending on the background sequence. However, shifts in preferences are usually minor in magnitude and sites with significantly shifted preferences are low in frequency. The small yet consistent perturbations in preferences could, nevertheless, jeopardize the accuracy of inference procedures, which assume constant preferences. We conclude by discussing if and how such shifts in preferences might influence widely used time-homogeneous inference procedures and potential ways to mitigate such effects.
Affiliation(s)
- Noor Youssef
- Department of Biology, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Andrew J. Roger
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Joseph P. Bielawski
- Department of Biology, Dalhousie University, Halifax, Nova Scotia, Canada
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
| |
|
19
|
Schrempf D, Lartillot N, Szöllősi G. Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity. Mol Biol Evol 2021; 37:3616-3631. [PMID: 32877529 PMCID: PMC7743758 DOI: 10.1093/molbev/msaa145] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/19/2022]
Abstract
Biochemical demands constrain the range of amino acids acceptable at specific sites resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10–C60 models). Here, we present a new, scalable method EDCluster for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4,096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared with the C10–C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).
Affiliation(s)
- Dominik Schrempf
- Department of Biological Physics, Eötvös University, Budapest, Hungary
| | - Nicolas Lartillot
- Laboratoire de Biométrie et Biologie Evolutive UMR 5558, CNRS, Université de Lyon, Villeurbanne, France
| | - Gergely Szöllősi
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- ELTE-MTA "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary
- Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary
| |
|
20
|
Yourdkhani S, Allman ES, Rhodes JA. Parameter Identifiability for a Profile Mixture Model of Protein Evolution. J Comput Biol 2021; 28:570-586. [PMID: 33960831 DOI: 10.1089/cmb.2020.0315] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
A profile mixture (PM) model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend, in part, on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here, using algebraic methods, we show that a PM model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.
Affiliation(s)
- Samaneh Yourdkhani
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| | - Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| |
|
21
|
Williams TA, Schrempf D, Szöllősi GJ, Cox CJ, Foster PG, Embley TM. Inferring the deep past from molecular data. Genome Biol Evol 2021; 13:6192802. [PMID: 33772552 PMCID: PMC8175050 DOI: 10.1093/gbe/evab067] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Accepted: 03/22/2021] [Indexed: 12/17/2022]
Abstract
There is an expectation that analyses of molecular sequences might be able to distinguish between alternative hypotheses for ancient relationships, but the phylogenetic methods used and types of data analyzed are of critical importance in any attempt to recover historical signal. Here, we discuss some common issues that can influence the topology of trees obtained when using overly simple models to analyze molecular data that often display complicated patterns of sequence heterogeneity. To illustrate our discussion, we have used three examples of inferred relationships which have changed radically as models and methods of analysis have improved. In two of these examples, the sister-group relationship between thermophilic Thermus and mesophilic Deinococcus, and the position of long-branch Microsporidia among eukaryotes, we show that recovering what is now generally considered to be the correct tree is critically dependent on the fit between model and data. In the third example, the position of eukaryotes in the tree of life, the hypothesis that is currently supported by the best available methods is fundamentally different from the classical view of relationships between major cellular domains. Since heterogeneity appears to be pervasive and varied among all molecular sequence data, and even the best available models can still struggle to deal with some problems, the issues we discuss are generally relevant to phylogenetic analyses. It remains essential to maintain a critical attitude to all trees as hypotheses of relationship that may change with more data and better methods.
Affiliation(s)
- Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
| | - Dominik Schrempf
- Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary
| | - Gergely J Szöllősi
- Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary
- MTA-ELTE "Lendület" Evolutionary Genomics Research Group, 1117 Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, 1121 Budapest, Hungary
| | - Cymon J Cox
- Centro de Ciências do Mar, Universidade do Algarve, Gambelas, 8005-319 Faro, Portugal
| | - Peter G Foster
- Department of Life Sciences, Natural History Museum, London SW7 5BD, United Kingdom
| | - T Martin Embley
- Biosciences Institute, Centre for Bacterial Cell Biology, Newcastle University, Newcastle upon Tyne NE2 4AX, United Kingdom
| |
|
22
|
Evidence for sponges as sister to all other animals from partitioned phylogenomics with mixture models and recoding. Nat Commun 2021; 12:1783. [PMID: 33741994 PMCID: PMC7979703 DOI: 10.1038/s41467-021-22074-7] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 08/07/2020] [Accepted: 02/24/2021] [Indexed: 11/08/2022]
Abstract
Resolving the relationships between the major lineages in the animal tree of life is necessary to understand the origin and evolution of key animal traits. Sponges, characterized by their simple body plan, were traditionally considered the sister group of all other animal lineages, implying a gradual increase in animal complexity from unicellularity to complex multicellularity. However, the availability of genomic data has sparked tremendous controversy as some phylogenomic studies support comb jellies taking this position, requiring secondary loss or independent origins of complex traits. Here we show that incorporating site-heterogeneous mixture models and recoding into partitioned phylogenomics alleviates systematic errors that hamper commonly-applied phylogenetic models. Testing on real datasets, we show a great improvement in model-fit that attenuates branching artefacts induced by systematic error. We reanalyse key datasets and show that partitioned phylogenomics does not support comb jellies as sister to other animals at either the supermatrix or partition-specific level.
|
23
|
Minh BQ, Dang CC, Vinh LS, Lanfear R. QMaker: Fast and accurate method to estimate empirical models of protein evolution. Syst Biol 2021; 70:1046-1060. [PMID: 33616668 PMCID: PMC8357343 DOI: 10.1093/sysbio/syab010] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 08/24/2020] [Revised: 12/25/2020] [Accepted: 02/10/2021] [Indexed: 11/29/2022]
Abstract
Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible $Q$ matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies. [Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]
Affiliation(s)
- Bui Quang Minh
  - School of Computing, Australian National University, 145 Science Road, Acton, ACT 2601, Canberra, Australia
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, 145 Science Road, Acton, ACT 2601, Canberra, Australia
- Cuong Cao Dang
  - Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, 10000 Hanoi, Vietnam
- Le Sy Vinh
  - Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, 10000 Hanoi, Vietnam
- Robert Lanfear
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, 145 Science Road, Acton, ACT 2601, Canberra, Australia
- Bui Quang Minh and Cuong Cao Dang contributed equally to this article
- Correspondence to be sent to: University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, 10000 Hanoi, Vietnam; E-mail: and Department of Ecology and Evolution, Research School of Biology, Australian National University, 145 Science Road, Acton, ACT 2601, Canberra, Australia; E-mail:

24
Kapli P, Telford MJ. Topology-dependent asymmetry in systematic errors affects phylogenetic placement of Ctenophora and Xenacoelomorpha. Sci Adv 2020; 6:eabc5162. [PMID: 33310849 PMCID: PMC7732190 DOI: 10.1126/sciadv.abc5162] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Received: 04/28/2020] [Accepted: 10/27/2020] [Indexed: 05/21/2023]
Abstract
The evolutionary relationships of two animal phyla, Ctenophora and Xenacoelomorpha, have proved highly contentious. Ctenophora have been proposed as the most distant relatives of all other animals (Ctenophora-first rather than the traditional Porifera-first). Xenacoelomorpha may be primitively simple relatives of all other bilaterally symmetrical animals (Nephrozoa) or simplified relatives of echinoderms and hemichordates (Xenambulacraria). In both cases, one of the alternative topologies must be a result of errors in tree reconstruction. Here, using empirical data and simulations, we show that the Ctenophora-first and Nephrozoa topologies (but not Porifera-first and Ambulacraria topologies) are strongly supported by analyses affected by systematic errors. Accommodating this finding suggests that empirical studies supporting Ctenophora-first and Nephrozoa trees are likely to be explained by systematic error. This would imply that the alternative Porifera-first and Xenambulacraria topologies, which are supported by analyses designed to minimize systematic error, are the most credible current alternatives.
Affiliation(s)
- Paschalia Kapli
  - Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, Gower Street, London WC1E 6BT, UK
- Maximilian J Telford
  - Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, Gower Street, London WC1E 6BT, UK

25
Susko E, Roger AJ. On the Use of Information Criteria for Model Selection in Phylogenetics. Mol Biol Evol 2020; 37:549-562. [PMID: 31688943 DOI: 10.1093/molbev/msz228] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 11/15/2022]
Abstract
The information criteria Akaike information criterion (AIC), AICc, and Bayesian information criterion (BIC) are widely used for model selection in phylogenetics; however, their theoretical justification and performance have not been carefully examined in this setting. Here, we investigate these methods under simple and complex phylogenetic models. We show that AIC can give a biased estimate of its intended target, the expected predictive log likelihood (EPLnL) or, equivalently, expected Kullback-Leibler divergence between the estimated model and the true distribution for the data. Reasons for bias include commonly occurring issues such as small edge-lengths or, in mixture models, small weights. The use of partitioned models is another issue that can cause problems with information criteria. We show that for partitioned models, a different BIC correction is required for it to be a valid approximation to a Bayes factor. The commonly used AICc correction is not clearly defined in partitioned models and can actually create a substantial bias when the number of parameters gets large, as is the case with larger trees and partitioned models. Bias-corrected cross-validation corrections are shown to provide better approximations to EPLnL than AIC. We also illustrate how EPLnL, the estimation target of AIC, can sometimes favor an incorrect model and give reasons why selection of incorrectly under-partitioned models might be desirable in partitioned model settings.
Affiliation(s)
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada

26
Mingrone J, Susko E, Bielawski JP. ModL: exploring and restoring regularity when testing for positive selection. Bioinformatics 2020; 35:2545-2554. [PMID: 30541063 DOI: 10.1093/bioinformatics/bty1019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/07/2017] [Revised: 11/02/2018] [Accepted: 12/11/2018] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Likelihood ratio tests are commonly used to test for positive selection acting on proteins. They are usually applied with thresholds for declaring a protein under positive selection determined from a chi-square or mixture of chi-square distributions. Although it is known that such distributions are not strictly justified due to the statistical irregularity of the problem, the hope has been that the resulting tests are conservative and do not lose much power in comparison with the same test using the unknown, correct threshold. We show that commonly used thresholds need not yield conservative tests, but instead give larger than expected Type I error rates. Statistical regularity can be restored by using a modified likelihood ratio test.
RESULTS: We give theoretical results to prove that, if the number of sites is not too small, the modified likelihood ratio test gives approximately correct Type I error probabilities regardless of the parameter settings of the underlying null hypothesis. Simulations show that modification gives Type I error rates closer to those stated without a loss of power. The simulations also show that parameter estimation for mixture models of codon evolution can be challenging in certain data-generation settings with very different mixing distributions giving nearly identical site pattern distributions unless the number of taxa and tree length are large. Because mixture models are widely used for a variety of problems in molecular evolution, the challenges and general approaches to solving them presented here are applicable in a broader context.
AVAILABILITY AND IMPLEMENTATION: https://github.com/jehops/codeml_modl.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Joseph Mingrone
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, NS, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, NS, Canada
- Joseph P Bielawski
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, NS, Canada
  - Department of Biology, Dalhousie University, Halifax, NS, Canada

27
Cai C, Tihelka E, Pisani D, Donoghue PCJ. Data curation and modeling of compositional heterogeneity in insect phylogenomics: A case study of the phylogeny of Dytiscoidea (Coleoptera: Adephaga). Mol Phylogenet Evol 2020; 147:106782. [PMID: 32147574 DOI: 10.1016/j.ympev.2020.106782] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/03/2020] [Accepted: 02/26/2020] [Indexed: 10/24/2022]
Abstract
Diving beetles and their allies are an almost ubiquitous group of freshwater predators. Knowledge of the phylogeny of the adephagan superfamily Dytiscoidea has significantly improved since the advent of molecular phylogenetics. However, despite recent comprehensive phylogenomic studies, some phylogenetic relationships among the constituent families remain elusive. In particular, the position of the family Hygrobiidae remains uncertain. We address these issues by re-analyzing recently published phylogenomic datasets for Dytiscoidea, using approaches to reduce compositional heterogeneity and adopting a site-heterogeneous mixture model. We obtained a consistent, well-resolved, and strongly supported tree. Consistent with previous studies, our analyses support Aspidytidae as the monophyletic sister group of Amphizoidae, and more importantly, Hygrobiidae as the sister of the diverse Dytiscidae, in agreement with morphology-based phylogenies. Our analyses provide a backbone phylogeny of Dytiscoidea, which lays the foundation for better understanding the evolution of morphological characters, life habits, and feeding behaviors of dytiscoid beetles.
Affiliation(s)
- Chenyang Cai
  - State Key Laboratory of Palaeobiology and Stratigraphy, Nanjing Institute of Geology and Palaeontology, and Centre for Excellence in Life and Paleoenvironment, Chinese Academy of Sciences, Nanjing 210008, China; School of Earth Sciences, University of Bristol, Life Sciences Building, Tyndall Avenue, Bristol BS8 1TQ, UK
- Erik Tihelka
  - Department of Animal Science, Hartpury College, Hartpury GL19 3BE, UK
- Davide Pisani
  - School of Earth Sciences, University of Bristol, Life Sciences Building, Tyndall Avenue, Bristol BS8 1TQ, UK; School of Biological Sciences, University of Bristol, Life Sciences Building, Tyndall Avenue, Bristol BS8 1TQ, UK
- Philip C J Donoghue
  - School of Earth Sciences, University of Bristol, Life Sciences Building, Tyndall Avenue, Bristol BS8 1TQ, UK

28
Abstract
Knowing phylogenetic relationships among species is fundamental for many studies in biology. An accurate phylogenetic tree underpins our understanding of the major transitions in evolution, such as the emergence of new body plans or metabolism, and is key to inferring the origin of new genes, detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species. Although data are ever more plentiful and powerful analysis methods are available, there remain many challenges to reliable tree building. Here, we discuss the major steps of phylogenetic analysis, including identification of orthologous genes or proteins, multiple sequence alignment, and choice of substitution models and inference methodologies. Understanding the different sources of errors and the strategies to mitigate them is essential for assembling an accurate tree of life.

29
Del Amparo R, Vicens A, Arenas M. The influence of heterogeneous codon frequencies along sequences on the estimation of molecular adaptation. Bioinformatics 2020; 36:430-436. [PMID: 31304972 DOI: 10.1093/bioinformatics/btz558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2019] [Revised: 07/08/2019] [Accepted: 07/11/2019] [Indexed: 11/12/2022]
Abstract
MOTIVATION: The nonsynonymous/synonymous substitution rate ratio (dN/dS) is a commonly used parameter to quantify molecular adaptation in protein-coding data. It is known that the estimation of dN/dS can be biased if some evolutionary processes are ignored. In particular, common ML methods to estimate dN/dS assume invariable codon frequencies among sites, although this characteristic is rare in nature, and this assumption could bias the estimation of the parameter.
RESULTS: Here we studied the influence of variable codon frequencies among genetic regions on the estimation of dN/dS. We explored scenarios varying the number of genetic regions that differ in codon frequencies, the amount of variability of codon frequencies among regions and the nucleotide frequencies at each codon position among regions. We found that ignoring heterogeneous codon frequencies among regions overall leads to underestimation of dN/dS and that the bias increases with the level of heterogeneity of codon frequencies. Interestingly, we also found that varying nucleotide frequencies among regions at the first or second codon position leads to underestimation of dN/dS, while variation at the third codon position leads to overestimation of dN/dS. Next, we present a methodology to reduce this bias based on the analysis of partitions presenting similar codon frequencies, and we applied it to analyze four real datasets. We conclude that accounting for heterogeneous codon frequencies along sequences is required to obtain realistic estimates of molecular adaptation through this relevant evolutionary parameter.
AVAILABILITY AND IMPLEMENTATION: The applied frameworks for the computer simulations of protein-coding data and estimation of molecular adaptation are SGWE and PAML, respectively. Both are publicly available and referenced in the study.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Roberto Del Amparo
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Alberto Vicens
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain
- Miguel Arenas
  - Department of Biochemistry, Genetics and Immunology
  - Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain

30
Wang HC, Susko E, Roger AJ. The Relative Importance of Modeling Site Pattern Heterogeneity Versus Partition-Wise Heterotachy in Phylogenomic Inference. Syst Biol 2020; 68:1003-1019. [PMID: 31140564 DOI: 10.1093/sysbio/syz021] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/16/2018] [Revised: 02/04/2019] [Accepted: 04/09/2019] [Indexed: 12/18/2022]
Abstract
Large taxa-rich genome-scale data sets are often necessary for resolving ancient phylogenetic relationships. But accurate phylogenetic inference requires that they are analyzed with realistic models that account for the heterogeneity in substitution patterns amongst the sites, genes and lineages. Two kinds of adjustments are frequently used: models that account for heterogeneity in amino acid frequencies at sites in proteins, and partitioned models that accommodate the heterogeneity in rates (branch lengths) among different proteins in different lineages (protein-wise heterotachy). Although partitioned and site-heterogeneous models are both widely used in isolation, their relative importance to the inference of correct phylogenies has not been carefully evaluated. We conducted several empirical analyses and a large set of simulations to compare the relative performances of partitioned models, site-heterogeneous models, and combined partitioned site heterogeneous models. In general, site-homogeneous models (partitioned or not) performed worse than site heterogeneous, except in simulations with extreme protein-wise heterotachy. Furthermore, simulations using empirically-derived realistic parameter settings showed a marked long-branch attraction (LBA) problem for analyses employing protein-wise partitioning even when the generating model included partitioning. This LBA problem results from a small sample bias compounded over many single protein alignments. In some cases, this problem was ameliorated by clustering similarly-evolving proteins together into larger partitions using the PartitionFinder method. Similar results were obtained under simulations with larger numbers of taxa or heterogeneity in simulating topologies over genes. 
For an empirical Microsporidia test data set, all but one of the tested site-heterogeneous models (with or without partitioning) obtained the correct Microsporidia+Fungi grouping, whereas site-homogeneous models (with or without partitioning) did not. The single exception was the fully partitioned site-heterogeneous analysis that succumbed to the compounded small sample LBA bias. In general, unless protein-wise heterotachy effects are extreme, it is more important to model site-heterogeneity than protein-wise heterotachy in phylogenomic analyses. Complete protein-wise partitioning should be avoided as it can lead to a serious LBA bias. In cases of extreme protein-wise heterotachy, approaches that cluster similarly-evolving proteins together, coupled with site-heterogeneous models, work well for phylogenetic estimation.
Affiliation(s)
- Huai-Chun Wang
  - Department of Mathematics and Statistics, Dalhousie University, 6316 Coburg Road, Halifax, Nova Scotia B3H 4R2, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, 6316 Coburg Road, Halifax, Nova Scotia B3H 4R2, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
- Andrew J Roger
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
  - Department of Biochemistry and Molecular Biology, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada

31
Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, von Haeseler A. GHOST: Recovering Historical Signal from Heterotachously Evolved Sequence Alignments. Syst Biol 2019; 69:249-264. [DOI: 10.1093/sysbio/syz051] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 08/04/2017] [Revised: 07/18/2019] [Accepted: 07/22/2019] [Indexed: 01/01/2023]
Abstract
Molecular sequence data that have evolved under the influence of heterotachous evolutionary processes are known to mislead phylogenetic inference. We introduce the General Heterogeneous evolution On a Single Topology (GHOST) model of sequence evolution, implemented under a maximum-likelihood framework in the phylogenetic program IQ-TREE (http://www.iqtree.org). Simulations show that using the GHOST model, IQ-TREE can accurately recover the tree topology, branch lengths, and substitution model parameters from heterotachously evolved sequences. We investigate the performance of the GHOST model on empirical data by sampling phylogenomic alignments of varying lengths from a plastome alignment. We then carry out inference under the GHOST model on a phylogenomic data set composed of 248 genes from 16 taxa, where we find the GHOST model concurs with the currently accepted view, placing turtles as a sister lineage of archosaurs, in contrast to results obtained using traditional variable rates-across-sites models. Finally, we apply the model to a data set composed of a sodium channel gene of 11 fish taxa, finding that the GHOST model is able to elucidate a subtle component of the historical signal, linked to the previously established convergent evolution of the electric organ in two geographically distinct lineages of electric fish. We compare inference under the GHOST model to partitioning by codon position and show that, owing to the minimization of model constraints, the GHOST model offers unique biological insights when applied to empirical data.
Affiliation(s)
- Stephen M Crotty
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Bui Quang Minh
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Nigel G Bean
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, SA, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
- Jonathan Tuke
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, SA, Australia
- Lars S Jermiin
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - CSIRO Land & Water, Black Mountain Laboratories, Canberra, ACT 2601, Australia
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - Bioinformatics & Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria

32
Redmond AK, Zou J, Secombes CJ, Macqueen DJ, Dooley H. Discovery of All Three Types in Cartilaginous Fishes Enables Phylogenetic Resolution of the Origins and Evolution of Interferons. Front Immunol 2019; 10:1558. [PMID: 31354716 PMCID: PMC6640115 DOI: 10.3389/fimmu.2019.01558] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Received: 04/24/2019] [Accepted: 06/21/2019] [Indexed: 12/31/2022]
Abstract
Interferons orchestrate host antiviral responses in jawed vertebrates. They are categorized into three classes; IFN1 and IFN3 are the primary antiviral cytokine lineages, while IFN2 responds to a broader variety of pathogens. The evolutionary relationships within and between these three classes have proven difficult to resolve. Here, we reassess interferon evolution, considering key phylogenetic pitfalls including taxon sampling, alignment quality, model adequacy, and outgroup choice. We reveal that cartilaginous fishes, and hence the jawed vertebrate ancestor, possess(ed) orthologs of all three interferon classes. We show that IFN3 groups sister to IFN1, resolve the origins of the human IFN3 lineages, and find that intronless IFN3s emerged at least three times. IFN2 genes are highly conserved, except for IFN-γ-rel, which we confirm resulted from a teleost-specific duplication. Our analyses show that IFN1 phylogeny is highly sensitive to phylogenetic error. By accounting for this, we describe a new backbone IFN1 phylogeny that implies several IFN1 genes existed in the jawed vertebrate ancestor. One of these is represented by the intronless IFN1s of tetrapods, including mammalian-like repertoires of reptile IFN1s and a subset of amphibian IFN1s, in addition to newly-identified intron-containing shark IFN1 genes. IFN-f, previously only found in teleosts, likely represents another ancestral jawed vertebrate IFN1 family member, suggesting the current classification of fish IFN1s into two groups based on the number of cysteines may need revision. The provenance of the remaining fish IFN1s and the coelacanth IFN1s proved difficult to resolve, but they may also be ancestral jawed vertebrate IFN1 lineages. Finally, a large group of amphibian-specific IFN1s falls sister to all other IFN1s and was likely also present in the jawed vertebrate ancestor.
Our results verify that intronless IFN1s have evolved multiple times in amphibians and indicate that no one-to-one orthology exists between mammal and reptile IFN1s. Our data also imply that diversification of the multiple IFN1s present in the jawed vertebrate ancestor has occurred through a rapid birth-death process, consistent with functional maintenance over a 450-million-year host-pathogen arms race. In summary, this study reveals a new model of interferon evolution important to our understanding of jawed vertebrate antiviral immunity.
Affiliation(s)
- Anthony K Redmond
  - School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - Centre for Genome-Enabled Biology and Medicine, University of Aberdeen, Aberdeen, United Kingdom
  - Smurfit Institute of Genetics, Trinity College Dublin, University of Dublin, Dublin, Ireland
- Jun Zou
  - School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - Scottish Fish Immunology Research Centre, Institute of Biological and Environmental Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - Key Laboratory of Exploration and Utilization of Aquatic Genetic Resources, Ministry of Education, Shanghai Ocean University, Shanghai, China
- Christopher J Secombes
  - School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - Scottish Fish Immunology Research Centre, Institute of Biological and Environmental Sciences, University of Aberdeen, Aberdeen, United Kingdom
- Daniel J Macqueen
  - School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, United Kingdom
- Helen Dooley
  - School of Biological Sciences, University of Aberdeen, Aberdeen, United Kingdom
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, United States
  - Institute of Marine and Environmental Technology, Baltimore, MD, United States

33
Susko E, Lincker L, Roger AJ. Accelerated Estimation of Frequency Classes in Site-Heterogeneous Profile Mixture Models. Mol Biol Evol 2019; 35:1266-1283. [PMID: 29688541 DOI: 10.1093/molbev/msy026] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 01/02/2023]
Abstract
As a consequence of structural and functional constraints, proteins tend to have site-specific preferences for particular amino acids. Failing to adjust for heterogeneity of frequencies over sites can lead to artifacts in phylogenetic estimation. Site-heterogeneous mixture-models have been developed to address this problem. However, due to prohibitive computational times, maximum likelihood implementations utilize fixed component frequency vectors inferred from sequences in a database that are external to the alignment under analysis. Here, we propose a composite likelihood approach to estimation of component frequencies for a mixture model that directly uses the data from the alignment of interest. In the common case that the number of taxa under study is not large, several adjustments to the default composite likelihood are shown to be necessary. In simulations, the approach is shown to provide large improvements over hierarchical clustering. For empirical data, substantial improvements in likelihoods are found over mixtures using fixed components.
Affiliation(s)
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
- Léa Lincker
  - École Nationale Supérieure de Techniques Avancées, Palaiseau, France
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada

34
Hilton SK, Bloom JD. Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence. Virus Evol 2018; 4:vey033. [PMID: 30425841 PMCID: PMC6220371 DOI: 10.1093/ve/vey033] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/06/2023]
Abstract
Molecular phylogenetics is often used to estimate the time since the divergence of modern gene sequences. For highly diverged sequences, such phylogenetic techniques sometimes estimate surprisingly recent divergence times. In the case of viruses, independent evidence indicates that the estimates of deep divergence times from molecular phylogenetics are sometimes too recent. This discrepancy is caused in part by inadequate models of purifying selection leading to branch-length underestimation. Here we examine the effect on branch-length estimation of using models that incorporate experimental measurements of purifying selection. We find that models informed by experimentally measured site-specific amino-acid preferences estimate longer deep branches on phylogenies of influenza virus hemagglutinin. This lengthening of branches is due to more realistic stationary states of the models, and is mostly independent of the branch-length extension from modeling site-to-site variation in amino-acid substitution rate. The branch-length extension from experimentally informed site-specific models is similar to that achieved by other approaches that allow the stationary state to vary across sites. However, the improvements from all of these site-specific but time homogeneous and site independent models are limited by the fact that a protein’s amino-acid preferences gradually shift as it evolves. Overall, our work underscores the importance of modeling site-specific amino-acid preferences when estimating deep divergence times—but also shows the inherent limitations of approaches that fail to account for how these preferences shift over time.
Affiliation(s)
- Sarah K Hilton
- Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center; Department of Genome Sciences, University of Washington, USA
| | - Jesse D Bloom
- Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center; Department of Genome Sciences, University of Washington, USA; Howard Hughes Medical Institute, Seattle, WA, USA
| |
|
35
|
Brown MW, Heiss AA, Kamikawa R, Inagaki Y, Yabuki A, Tice AK, Shiratori T, Ishida KI, Hashimoto T, Simpson AGB, Roger AJ. Phylogenomics Places Orphan Protistan Lineages in a Novel Eukaryotic Super-Group. Genome Biol Evol 2018; 10:427-433. [PMID: 29360967 PMCID: PMC5793813 DOI: 10.1093/gbe/evy014] [Citation(s) in RCA: 76] [Impact Index Per Article: 10.9] [Accepted: 01/18/2018] [Indexed: 01/13/2023]
Abstract
Recent phylogenetic analyses position certain “orphan” protist lineages deep in the tree of eukaryotic life, but their exact placements are poorly resolved. We conducted phylogenomic analyses that incorporate deeply sequenced transcriptomes from representatives of collodictyonids (diphylleids), rigifilids, Mantamonas, and ancyromonads (planomonads). Analyses of 351 genes, using site-heterogeneous mixture models, strongly support a novel super-group-level clade that includes collodictyonids, rigifilids, and Mantamonas, which we name “CRuMs”. Further, they robustly place CRuMs as the closest branch to Amorphea (including animals and fungi). Ancyromonads are strongly inferred to be more distantly related to Amorphea than are CRuMs. They emerge either as sister to malawimonads, or as a separate deeper branch. CRuMs and ancyromonads represent two distinct major groups that branch deeply on the lineage that includes animals, near the most commonly inferred root of the eukaryote tree. This makes both groups crucial in examinations of the deepest-level history of extant eukaryotes.
Affiliation(s)
- Matthew W Brown
- Department of Biological Sciences, Mississippi State University, USA; Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, USA
| | - Aaron A Heiss
- Department of Biology, and Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia, Canada; Department of Invertebrate Zoology and Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, New York, USA
| | - Ryoma Kamikawa
- Graduate School of Human and Environmental Studies, Graduate School of Global Environmental Studies, Kyoto University, Japan
| | - Yuji Inagaki
- Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan; Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
| | - Akinori Yabuki
- Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Kanagawa, Japan
| | - Alexander K Tice
- Department of Biological Sciences, Mississippi State University, USA; Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, USA
| | - Takashi Shiratori
- Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan
| | - Ken-Ichiro Ishida
- Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan
| | - Tetsuo Hashimoto
- Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan; Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
| | - Alastair G B Simpson
- Department of Biology, and Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Andrew J Roger
- Department of Biochemistry and Molecular Biology, and Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia, Canada
| |
|
36
|
He H, Ye L, Li C, Wang H, Guo X, Wang X, Zhang Y, Xiang W. SbbR/SbbA, an Important ArpA/AfsA-Like System, Regulates Milbemycin Production in Streptomyces bingchenggensis. Front Microbiol 2018; 9:1064. [PMID: 29875761 PMCID: PMC5974925 DOI: 10.3389/fmicb.2018.01064] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 03/05/2018] [Accepted: 05/04/2018] [Indexed: 12/17/2022]
Abstract
Milbemycins, a group of 16-membered macrolide antibiotics, are used widely as insecticides and anthelmintics. Previously, a limited understanding of the transcriptional regulation of milbemycin biosynthesis has hampered efforts to enhance antibiotic production by engineering of regulatory genes. Here, a novel ArpA/AfsA-type system, SbbR/SbbA (SBI_08928/SBI_08929), has been identified to be involved in regulating milbemycin biosynthesis in the industrial strain S. bingchenggensis BC04. Inactivation of sbbR in BC04 resulted in markedly decreased production of milbemycin, while deletion of sbbA enhanced milbemycin production. Electrophoretic mobility shift assays (EMSAs) and DNase I footprinting studies showed that SbbR has a specific DNA-binding activity for the promoters of milR (the cluster-situated activator gene for milbemycin production) and the bidirectionally organized genes sbbR and sbbA. Transcriptional analysis suggested that SbbR directly activates the transcription of milR, while repressing its own transcription and that of sbbA. Moreover, 11 novel targets of SbbR were additionally found, including seven regulatory genes located in secondary metabolite biosynthetic gene clusters (e.g., sbi_08420, sbi_08432, sbi_09158, sbi_00827, sbi_01376, sbi_09325, and sig24sbh) and four well-known global regulatory genes (e.g., glnRsbh, wblAsbh, atrAsbh, and mtrA/Bsbh). These data suggest that SbbR is not only a direct activator of milbemycin production, but also a pleiotropic regulator that controls the expression of other cluster-situated regulatory genes and global regulatory genes. Overall, this study reveals the upper-layer regulatory system that controls milbemycin biosynthesis, which will not only expand our understanding of the complex regulation in milbemycin biosynthesis, but also provide a basis for an approach to improve milbemycin production via genetic manipulation of the SbbR/SbbA system.
Affiliation(s)
- Hairong He
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China; School of Life Sciences, Northeast Agricultural University, Harbin, China
| | - Lan Ye
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China; School of Life Sciences, Northeast Agricultural University, Harbin, China
| | - Chuang Li
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China; School of Life Sciences, Northeast Agricultural University, Harbin, China
| | - Haiyan Wang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xiaowei Guo
- School of Life Sciences, Northeast Agricultural University, Harbin, China
| | - Xiangjing Wang
- School of Life Sciences, Northeast Agricultural University, Harbin, China
| | - Yanyan Zhang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Wensheng Xiang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China; School of Life Sciences, Northeast Agricultural University, Harbin, China
| |
|
37
|
Wang HC, Minh BQ, Susko E, Roger AJ. Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic Estimation. Syst Biol 2018; 67:216-235. [PMID: 28950365 DOI: 10.1093/sysbio/syx068] [Citation(s) in RCA: 263] [Impact Index Per Article: 37.6] [Received: 10/21/2016] [Accepted: 08/02/2017] [Indexed: 11/14/2022]
Abstract
Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limit their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns a conditional mean amino acid frequency profile to each site calculated based on a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with $k$ classes, our implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the computation by approximately $k$/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, 6316 Coburg Road; Department of Biochemistry and Molecular Biology, 5850 College Street, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada; Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
| | - Bui Quang Minh
- Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Austria
| | - Edward Susko
- Department of Mathematics and Statistics, 6316 Coburg Road; Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
| | - Andrew J Roger
- Department of Biochemistry and Molecular Biology, 5850 College Street, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada; Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
| |
|
38
|
Leger MM, Eme L, Stairs CW, Roger AJ. Demystifying Eukaryote Lateral Gene Transfer (Response to Martin 2017 DOI: 10.1002/bies.201700115). Bioessays 2018; 40:e1700242. [DOI: 10.1002/bies.201700242] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 12/13/2017] [Revised: 02/06/2018] [Indexed: 12/28/2022]
Affiliation(s)
- Michelle M. Leger
- Institute of Evolutionary Biology (CSIC-UPF); Pg. Marítim de la Barceloneta, Barcelona ES 08003 Spain
| | - Laura Eme
- Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
| | - Courtney W. Stairs
- Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
| | - Andrew J. Roger
- Centre for Comparative Genomics and Evolutionary Bioinformatics; Department of Biochemistry and Molecular Biology; Dalhousie University; P.O. Box 15000, Halifax CAN B3H 4R2 Nova Scotia Canada
| |
|
39
|
Barlowe S, Coan HB, Youker RT. SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment. PeerJ 2017; 5:e3492. [PMID: 28674656 PMCID: PMC5490468 DOI: 10.7717/peerj.3492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/16/2017] [Accepted: 05/27/2017] [Indexed: 01/13/2023]
Abstract
Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information.
Affiliation(s)
- Scott Barlowe
- Department of Mathematics and Computer Science, Western Carolina University, Cullowhee, NC, United States of America
| | - Heather B Coan
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
| | - Robert T Youker
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
| |
|
40
|
Trifinopoulos J, Nguyen LT, von Haeseler A, Minh BQ. W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis. Nucleic Acids Res 2016; 44:W232-5. [PMID: 27084950 PMCID: PMC4987875 DOI: 10.1093/nar/gkw256] [Citation(s) in RCA: 2582] [Impact Index Per Article: 286.9] [Indexed: 11/22/2022]
Abstract
This article presents W-IQ-TREE, an intuitive and user-friendly web interface and server for IQ-TREE, an efficient phylogenetic software for maximum likelihood analysis. W-IQ-TREE supports multiple sequence types (DNA, protein, codon, binary and morphology) in common alignment formats and a wide range of evolutionary models including mixture and partition models. W-IQ-TREE performs fast model selection, partition scheme finding, efficient tree reconstruction, ultrafast bootstrapping, branch tests, and tree topology tests. All computations are conducted on a dedicated computer cluster and the users receive the results via URL or email. W-IQ-TREE is available at http://iqtree.cibiv.univie.ac.at. It is free and open to all users and there is no login requirement.
Affiliation(s)
- Jana Trifinopoulos
- Center for Integrative Bioinformatics, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, 1030 Vienna, Austria
| | - Lam-Tung Nguyen
- Center for Integrative Bioinformatics, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, 1030 Vienna, Austria
| | - Arndt von Haeseler
- Center for Integrative Bioinformatics, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, 1030 Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, 1090 Vienna, Austria
| | - Bui Quang Minh
- Center for Integrative Bioinformatics, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, 1030 Vienna, Austria
| |
|
41
|
Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nat Rev Genet 2016; 17:109-21. [PMID: 26781812 DOI: 10.1038/nrg.2015.18] [Citation(s) in RCA: 180] [Impact Index Per Article: 20.0] [Indexed: 12/13/2022]
Abstract
It has long been recognized that certain sites within a protein, such as sites in the protein core or catalytic residues in enzymes, are evolutionarily more conserved than other sites. However, our understanding of rate variation among sites remains surprisingly limited. Recent progress to address this includes the development of a wide array of reliable methods to estimate site-specific substitution rates from sequence alignments. In addition, several molecular traits have been identified that correlate with site-specific mutation rates, and novel mechanistic biophysical models have been proposed to explain the observed correlations. Nonetheless, current models explain, at best, approximately 60% of the observed variance, highlighting the limitations of current methods and models and the need for new research directions.
Affiliation(s)
- Julian Echave
- Escuela de Ciencia y Tecnología, Universidad Nacional de San Martín, 1650 San Martín, Buenos Aires, Argentina
| | - Stephanie J Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas 78712, USA
| | - Claus O Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas 78712, USA
| |
|
42
|
Huang Y, Wang X, Ge S, Rao GY. Divergence and adaptive evolution of the gibberellin oxidase genes in plants. BMC Evol Biol 2015; 15:207. [PMID: 26416509 PMCID: PMC4587577 DOI: 10.1186/s12862-015-0490-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 05/04/2015] [Accepted: 09/17/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The important phytohormone gibberellins (GAs) play key roles in various developmental processes. GA oxidases (GAoxs) are critical enzymes in the GA synthesis pathway, but their classification, evolutionary history and the forces driving the evolution of plant GAox genes remain poorly understood. RESULTS: This study provides the first large-scale evolutionary analysis of GAox genes in plants by using an extensive whole-genome dataset of 41 species, representing green algae, bryophytes, pteridophytes, and seed plants. We defined eight subfamilies under the GAox family, namely C19-GA2ox, C20-GA2ox, GA20ox, GA3ox, GAox-A, GAox-B, GAox-C and GAox-D. Of these, subfamilies GAox-A, GAox-B, GAox-C and GAox-D are described for the first time. On the basis of phylogenetic analyses and characteristic motifs of GAox genes, we demonstrated a rapid expansion and functional divergence of the GAox genes during the diversification of land plants. We also detected the subfamily-specific motifs and potential sites of some GAox genes, which might have evolved under positive selection. CONCLUSIONS: GAox genes originated very early, before the divergence of bryophytes and the vascular plants, and the diversification of GAox genes is associated with functional divergence and could be driven by positive selection. Our study not only provides information on the classification of GAox genes, but also facilitates the further functional characterization and analysis of GA oxidases.
Affiliation(s)
- Yuan Huang
- College of Life Sciences, Peking University, Beijing, 100871, China.
| | - Xi Wang
- College of Life Sciences, Peking University, Beijing, 100871, China.
| | - Song Ge
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China.
| | - Guang-Yuan Rao
- College of Life Sciences, Peking University, Beijing, 100871, China.
| |
|
43
|
Lartillot N. Probabilistic models of eukaryotic evolution: time for integration. Philos Trans R Soc Lond B Biol Sci 2015; 370:20140338. [PMID: 26323768 PMCID: PMC4571576 DOI: 10.1098/rstb.2014.0338] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Accepted: 06/03/2015] [Indexed: 11/12/2022]
Abstract
In spite of substantial work and recent progress, a global and fully resolved picture of the macroevolutionary history of eukaryotes is still under construction. This concerns not only the phylogenetic relations among major groups, but also the general characteristics of the underlying macroevolutionary processes, including the patterns of gene family evolution associated with endosymbioses, as well as their impact on the sequence evolutionary process. All these questions raise formidable methodological challenges, calling for a more powerful statistical paradigm. In this direction, model-based probabilistic approaches have played an increasingly important role. In particular, improved models of sequence evolution accounting for heterogeneities across sites and across lineages have led to significant, although insufficient, improvement in phylogenetic accuracy. More recently, one main trend has been to move away from simple parametric models and stepwise approaches, towards integrative models explicitly considering the intricate interplay between multiple levels of macroevolutionary processes. Such integrative models are in their infancy, and their application to the phylogeny of eukaryotes still requires substantial improvement of the underlying models, as well as additional computational developments.
Affiliation(s)
- Nicolas Lartillot
- Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, F-69622 Villeurbanne Cedex, France
| |
|
44
|
Doud MB, Ashenberg O, Bloom JD. Site-Specific Amino Acid Preferences Are Mostly Conserved in Two Closely Related Protein Homologs. Mol Biol Evol 2015; 32:2944-60. [PMID: 26226986 PMCID: PMC4626756 DOI: 10.1093/molbev/msv167] [Citation(s) in RCA: 63] [Impact Index Per Article: 6.3] [Indexed: 12/22/2022]
Abstract
Evolution drives changes in a protein’s sequence over time. The extent to which these changes in sequence lead to shifts in the underlying preference for each amino acid at each site is an important question with implications for comparative sequence-analysis methods, such as molecular phylogenetics. To quantify the extent that site-specific amino acid preferences shift during evolution, we performed deep mutational scanning on two homologs of human influenza nucleoprotein with 94% amino acid identity. We found that only a modest fraction of sites exhibited shifts in amino acid preferences that exceeded the noise in our experiments. Furthermore, even among sites that did exhibit detectable shifts, the magnitude tended to be small relative to differences between nonhomologous proteins. Given the limited change in amino acid preferences between these close homologs, we tested whether our measurements could inform site-specific substitution models that describe the evolution of nucleoproteins from more diverse influenza viruses. We found that site-specific evolutionary models informed by our experiments greatly outperformed nonsite-specific alternatives in fitting phylogenies of nucleoproteins from human, swine, equine, and avian influenza. Combining the experimental data from both homologs improved phylogenetic fit, partly because measurements in multiple genetic contexts better captured the evolutionary average of the amino acid preferences for sites with shifting preferences. Our results show that site-specific amino acid preferences are sufficiently conserved that measuring mutational effects in one protein provides information that can improve quantitative evolutionary modeling of nearby homologs.
Affiliation(s)
- Michael B Doud
- Division of Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA; Department of Genome Sciences, University of Washington; Medical Scientist Training Program, University of Washington School of Medicine
| | - Orr Ashenberg
- Division of Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
| | - Jesse D Bloom
- Division of Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA; Department of Genome Sciences, University of Washington
| |
|
45
|
Risso VA, Manssour-Triedo F, Delgado-Delgado A, Arco R, Barroso-delJesus A, Ingles-Prieto A, Godoy-Ruiz R, Gavira JA, Gaucher EA, Ibarra-Molero B, Sanchez-Ruiz JM. Mutational studies on resurrected ancestral proteins reveal conservation of site-specific amino acid preferences throughout evolutionary history. Mol Biol Evol 2014; 32:440-55. [PMID: 25392342 PMCID: PMC4298172 DOI: 10.1093/molbev/msu312] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Indexed: 01/05/2023]
Abstract
Local protein interactions (“molecular context” effects) dictate amino acid replacements and can be described in terms of site-specific, energetic preferences for any different amino acid. It has been recently debated whether these preferences remain approximately constant during evolution or whether, due to coevolution of sites, they change strongly. Such research highlights an unresolved and fundamental issue with far-reaching implications for phylogenetic analysis and molecular evolution modeling. Here, we take advantage of the recent availability of phenotypically supported laboratory resurrections of Precambrian thioredoxins and β-lactamases to experimentally address the change of site-specific amino acid preferences over long geological timescales. Extensive mutational analyses support the notion that evolutionary adjustment to a new amino acid may occur, but to a large extent this is insufficient to erase the primitive preference for amino acid replacements. Generally, site-specific amino acid preferences appear to remain conserved throughout evolutionary history despite local sequence divergence. We show such preference conservation to be readily understandable in molecular terms and we provide crystallographic evidence for an intriguing structural-switch mechanism: Energetic preference for an ancestral amino acid in a modern protein can be linked to reorganization upon mutation to the ancestral local structure around the mutated site. Finally, we point out that site-specific preference conservation naturally leads to one plausible evolutionary explanation for the existence of intragenic global suppressor mutations.
Affiliation(s)
- Valeria A Risso
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| | - Fadia Manssour-Triedo
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| | | | - Rocio Arco
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| | - Alicia Barroso-delJesus
- Unidad de Genómica, Instituto de Parasitología y Biomedicina López Neyra CSIC, PTS Granada, Granada, Spain
| | - Alvaro Ingles-Prieto
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| | - Raquel Godoy-Ruiz
- Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD
| | - Jose A Gavira
- Unidad de Genómica, Instituto de Parasitología y Biomedicina López Neyra CSIC, PTS Granada, Granada, Spain
| | - Eric A Gaucher
- Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD
| | - Beatriz Ibarra-Molero
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| | - Jose M Sanchez-Ruiz
- Departamento de Quimica Fisica, Facultad de Ciencias, Universidad de Granada, Granada, Spain
| |
|
46
|
Fu CJ, Sheikh S, Miao W, Andersson SGE, Baldauf SL. Missing genes, multiple ORFs, and C-to-U type RNA editing in Acrasis kona (Heterolobosea, Excavata) mitochondrial DNA. Genome Biol Evol 2014; 6:2240-57. [PMID: 25146648 PMCID: PMC4202320 DOI: 10.1093/gbe/evu180] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
Discoba (Excavata) is an ancient group of eukaryotes with great morphological and ecological diversity. Unlike the other major divisions of Discoba (Jakobida and Euglenozoa), little is known about the mitochondrial DNAs (mtDNAs) of Heterolobosea. We have assembled a complete mtDNA genome from the aggregating heterolobosean amoeba, Acrasis kona, which consists of a single circular highly AT-rich (83.3%) molecule of 51.5 kb. Unexpectedly, A. kona mtDNA is missing roughly 40% of the protein-coding genes and nearly half of the transfer RNAs found in the only other sequenced heterolobosean mtDNAs, those of Naegleria spp. Instead, over a quarter of A. kona mtDNA consists of novel open reading frames. Eleven of the 16 protein-coding genes missing from A. kona mtDNA were identified in its nuclear DNA and polyA RNA, and phylogenetic analyses indicate that at least 10 of these 11 putative nuclear-encoded mitochondrial (NcMt) proteins arose by direct transfer from the mitochondrion. Acrasis kona mtDNA also employs C-to-U type RNA editing, and 12 homologs of DYW-type pentatricopeptide repeat (PPR) proteins implicated in plant organellar RNA editing are found in A. kona nuclear DNA. A mapping of mitochondrial gene content onto a consensus phylogeny reveals a sporadic pattern of relative stasis and rampant gene loss in Discoba. Rampant loss occurred independently in the unique common lineage leading to Heterolobosea + Tsukubamonadida and later in the unique lineage leading to Acrasis. Meanwhile, mtDNA gene content appears to be remarkably stable in the Acrasis sister lineage leading to Naegleria and in their distant relatives Jakobida.
Affiliation(s)
- Cheng-Jie Fu
- Program in Systematic Biology, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Sweden
- Sanea Sheikh
- Program in Systematic Biology, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Sweden
- Wei Miao
- Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
- Siv G E Andersson
- Department of Molecular Evolution, Cell and Molecular Biology, Science for Life Laboratory, Biomedical Centre, Uppsala University, Sweden
- Sandra L Baldauf
- Program in Systematic Biology, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Sweden
47
Bloom JD. An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs. Mol Biol Evol 2014; 31:2753-69. [PMID: 25063439 PMCID: PMC4166927 DOI: 10.1093/molbev/msu220] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 12/24/2022]
Abstract
Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are not known a priori, making it challenging to create evolutionary models that adequately capture the heterogeneity of selection at different sites. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than commonly used existing models. Here, I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg E, Labonte JW, Gray JJ, Ostermeier M. 2014. A comprehensive, high-resolution map of a gene’s fitness landscape. Mol Biol Evol. 31:1581–1592) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than most common existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.
Affiliation(s)
- Jesse D Bloom
- Division of Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
48
Abstract
All modern approaches to molecular phylogenetics require a quantitative model for how genes evolve. Unfortunately, existing evolutionary models do not realistically represent the site-heterogeneous selection that governs actual sequence change. Attempts to remedy this problem have involved augmenting these models with a burgeoning number of free parameters. Here, I demonstrate an alternative: Experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here provide fundamentally new information that has the potential to transform the sensitivity of phylogenetic and genetic analyses.
Affiliation(s)
- Jesse D Bloom
- Division of Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
49
Wang HC, Susko E, Roger AJ. An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation. Mol Biol Evol 2014; 31:779-92. [PMID: 24441033 DOI: 10.1093/molbev/msu044] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Abstract
Standard protein phylogenetic models use fixed rate matrices of amino acid interchange derived from analyses of large databases. Differences between the stationary amino acid frequencies of these rate matrices from those of a data set of interest are typically adjusted for by matrix multiplication that converts the empirical rate matrix to an exchangeability matrix which is then postmultiplied by the amino acid frequencies in the alignment. The result is a time-reversible rate matrix with stationary amino acid frequencies equal to the data set frequencies. On the basis of population genetics principles, we develop an amino acid substitution-selection model that parameterizes the fitness of an amino acid as the logarithm of the ratio of the frequency of the amino acid to the frequency of the same amino acid under no selection. The model gives rise to a different sequence of matrix multiplications to convert an empirical rate matrix to one that has stationary amino acid frequencies equal to the data set frequencies. We incorporated the substitution-selection model with an improved amino acid class frequency mixture (cF) model to partially take into account site-specific amino acid frequencies in the phylogenetic models. We show that 1) the selection models fit data significantly better than corresponding models without selection for most of the 21 test data sets; 2) both cF and cF selection models favored the phylogenetic trees that were inferred under current sophisticated models and methods for three difficult phylogenetic problems (the positions of microsporidia and breviates in eukaryote phylogeny and the position of the root of the angiosperm tree); and 3) for data simulated under site-specific residue frequencies, the cF selection models estimated trees closer to the generating trees than a standard Γ model or cF without selection. We also explored several ways of estimating amino acid frequencies under neutral evolution that are required for these selection models. By better modeling the amino acid substitution process, the cF selection models will be valuable for phylogenetic inference and evolutionary studies.
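As a rough illustration of the standard frequency adjustment the abstract describes (and not of the authors' substitution-selection conversion, which the abstract does not spell out), the sketch below recovers exchangeabilities from a reversible rate matrix, postmultiplies them by new target frequencies, and computes the log-ratio fitness. All function names, the 3-state toy matrix, and the uniform "neutral" frequencies are assumptions made for this example; real amino acid models have 20 states.

```python
import math


def exchangeabilities(q, pi0):
    """Divide each off-diagonal rate Q[i][j] by pi0[j] to obtain S[i][j]."""
    n = len(q)
    return [[q[i][j] / pi0[j] if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def rate_matrix(s, pi):
    """Postmultiply S by the target frequencies, then set each diagonal
    entry so that its row sums to zero."""
    n = len(s)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    return q


def log_ratio_fitness(pi, pi_neutral):
    """Fitness of each state as log(observed frequency / neutral frequency)."""
    return [math.log(p / p0) for p, p0 in zip(pi, pi_neutral)]


# Toy 3-state example (hypothetical numbers, not taken from the paper).
s_toy = [[0.0, 1.0, 2.0],
         [1.0, 0.0, 3.0],
         [2.0, 3.0, 0.0]]
pi_database = [0.5, 0.3, 0.2]    # stationary frequencies of the empirical matrix
pi_alignment = [0.2, 0.3, 0.5]   # frequencies observed in the data set
pi_neutral = [1 / 3, 1 / 3, 1 / 3]  # assumed frequencies under no selection

q_database = rate_matrix(s_toy, pi_database)
s_recovered = exchangeabilities(q_database, pi_database)
q_alignment = rate_matrix(s_recovered, pi_alignment)

# Stationarity check: pi * Q should be (numerically) the zero vector.
flux = [sum(pi_alignment[i] * q_alignment[i][j] for i in range(3))
        for j in range(3)]
fitness = log_ratio_fitness(pi_alignment, pi_neutral)
```

Because the exchangeabilities are symmetric, the adjusted matrix satisfies detailed balance with respect to the alignment frequencies, which is why the result is time-reversible with the data set frequencies as its stationary distribution.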
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
50
Mutational effects on stability are largely conserved during protein evolution. Proc Natl Acad Sci U S A 2013; 110:21071-6. [PMID: 24324165 DOI: 10.1073/pnas.1314781111] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Indexed: 11/18/2022]
Abstract
Protein stability and folding are the result of cooperative interactions among many residues, yet phylogenetic approaches assume that sites are independent. This discrepancy has engendered concerns about large evolutionary shifts in mutational effects that might confound phylogenetic approaches. Here we experimentally investigate this issue by introducing the same mutations into a set of diverged homologs of the influenza nucleoprotein and measuring the effects on stability. We find that mutational effects on stability are largely conserved across the homologs. We reach qualitatively similar conclusions when we simulate protein evolution with molecular-mechanics force fields. Our results do not mean that proteins evolve without epistasis, which can still arise even when mutational stability effects are conserved. However, our findings indicate that large evolutionary shifts in mutational effects on stability are rare, at least among homologs with similar structures and functions. We suggest that properly describing the clearly observable and highly conserved amino acid preferences at individual sites is likely to be far more important for phylogenetic analyses than accounting for rare shifts in amino acid propensities due to site covariation.