1
Liu L, Yu L, Wu S, Arnold J, Whalen C, Davis C, Edwards S. Short branch attraction in phylogenomic inference under the multispecies coalescent. Front Ecol Evol 2023; 11:1134764. [PMID: 39233780; PMCID: PMC11372852; DOI: 10.3389/fevo.2023.1134764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/06/2024]
Abstract
Accurate reconstruction of species trees often relies on the quality of input gene trees estimated from molecular sequences. Previous studies suggested that if the sequence length is fixed, maximum likelihood may produce biased gene trees, which subsequently mislead inference of species trees. Two key questions need to be answered in this context: what are the scenarios that may result in consistently biased gene trees, and for those scenarios, are there any remedies that may remove or at least reduce the misleading effects of consistently biased gene trees? In this article, we establish a theoretical framework to address these questions. Considering a scenario where the true gene tree is a 4-taxon star tree $T^* = (S_1, S_2, S_3, S_4)$ with two short branches leading to the species $S_1$ and $S_2$, we demonstrate that maximum likelihood significantly favors the wrong bifurcating tree $((S_1, S_2), S_3, S_4)$ grouping the two species $S_1$ and $S_2$ with short branches. We name this inconsistent behavior short branch attraction, which may occur in real-world data involving a 4-taxon bifurcating gene tree with a short internal branch. If no mutation occurs along the internal branch, which is likely if the internal branch is short, the 4-taxon bifurcating tree is equivalent to the 4-taxon star tree and thus will suffer the same misleading effect of short branch attraction. Theoretical and simulation results further demonstrate that short branch attraction may occur in gene trees and species trees of arbitrary size. Moreover, short branch attraction is primarily caused by a lack of phylogenetic information in sequence data, suggesting that converting short internal branches to polytomies in the estimated gene trees can significantly reduce artifacts induced by short branch attraction.
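The remedy mentioned at the end of this abstract, converting short internal branches to polytomies, can be illustrated with a minimal Python sketch. The `Node` class, the function names, the branch lengths, and the threshold of 1e-4 are all hypothetical choices made for illustration; the abstract does not say how the authors pick a cutoff or which software they use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A rooted tree node; `length` is the branch leading to this node."""
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def collapse_short_branches(node: Node, threshold: float) -> Node:
    """Return a copy of the tree in which internal branches shorter than
    `threshold` are removed, so their children join a polytomy."""
    new_children: List[Node] = []
    for child in node.children:
        collapsed = collapse_short_branches(child, threshold)
        if collapsed.children and collapsed.length < threshold:
            # Internal branch is too short: attach its children directly,
            # adding the removed length so root-to-tip distances are kept.
            for grandchild in collapsed.children:
                grandchild.length += collapsed.length
                new_children.append(grandchild)
        else:
            new_children.append(collapsed)
    return Node(node.name, node.length, new_children)


def to_newick(node: Node) -> str:
    """Serialize the topology only (no branch lengths) for inspection."""
    if not node.children:
        return node.name or ""
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# Hypothetical estimated gene tree ((S1,S2),S3,S4) with a near-zero internal branch.
tree = Node(children=[
    Node(length=1e-6, children=[Node("S1", 0.01), Node("S2", 0.01)]),
    Node("S3", 0.2),
    Node("S4", 0.2),
])
star = collapse_short_branches(tree, threshold=1e-4)  # threshold is an assumption
print(to_newick(star))  # (S1,S2,S3,S4)
```

The function returns a new tree rather than editing the input, so the original estimate stays available for comparison.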
Affiliation(s)
- Liang Liu
  - Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA, United States
- Lili Yu
  - Department of Biostatistics, Georgia Southern University, Statesboro, GA, United States
- Shaoyuan Wu
  - Jiangsu Key Laboratory of Phylogenomics and Comparative Genomics, Jiangsu International Joint Center of Genomics, School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Jonathan Arnold
  - Department of Genetics, University of Georgia, Athens, GA, United States
- Christopher Whalen
  - Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, United States
- Charles Davis
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, United States
- Scott Edwards
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, United States
2
Young C, Meng S, Moshiri N. An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology. Viruses 2022; 14:v14040774. [PMID: 35458504; PMCID: PMC9032411; DOI: 10.3390/v14040774] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/03/2022] [Revised: 04/04/2022] [Accepted: 04/06/2022] [Indexed: 01/25/2023]
Abstract
The use of viral sequence data to inform public health intervention has become increasingly common in epidemiology. Such methods typically rely on multiple sequence alignments and phylogenies estimated from the sequence data. Like all estimation techniques, they are error prone, yet the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly used viral phylogenetic analysis workflows on simulated viral sequence data, modeling Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV), and Ebolavirus, and we computed multiple measures of accuracy, motivated by transmission-clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was the fastest by orders of magnitude, and when the other tools were used to optimize branch lengths along a fixed FastTree 2 topology, the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but at a fraction of the runtime.
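The abstract does not say which measure of topological accuracy was used. One widely used choice is the Robinson-Foulds (RF) distance, which counts the splits present in one tree but not the other. The sketch below is an illustration under that assumption only; the nested-tuple tree encoding and all function names are hypothetical.

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Collect the leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial splits of the leaf set induced by internal edges."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one canonical side per split
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        clade = leaves(node)
        # Store the side of the split that does not contain the anchor leaf.
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Unrooted Robinson-Foulds distance between two trees on the same leaves."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must have identical leaf sets")
    return len(bipartitions(t1) ^ bipartitions(t2))


true_tree = (("A", "B"), ("C", "D"), "E")
estimate = (("A", "C"), ("B", "D"), "E")
rerooted = ("A", "B", (("C", "D"), "E"))  # same unrooted topology as true_tree

print(rf_distance(true_tree, estimate))  # 4
print(rf_distance(true_tree, rerooted))  # 0
```

Because splits are compared as unrooted bipartitions, two trees that differ only in where they are rooted score zero.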
3
Dabert J, Mironov SV, Dabert M. The explosive radiation, intense host-shifts and long-term failure to speciate in the evolutionary history of the feather mite genus Analges (Acariformes: Analgidae) from European passerines. Zool J Linn Soc 2021. [DOI: 10.1093/zoolinnean/zlab057] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
Mites of the genus Analges (Acariformes: Analgidae) inhabit the down feathers of passeriform birds. The evolutionary history of Analges and the co-phylogenetic relationships between these mites and their hosts are unknown. Our phylogenetic analysis supported the monophyly of the genus, but it did not support previous taxonomic hypotheses subdividing the genus into the subgenera Analges and Analgopsis or arranging some species into the A. chelopus and A. passerinus species groups. Molecular data reveal seven new species inhabiting Eurasian passerines and support the existence of several multi-host species. According to molecular dating, the origin of Analges (c. 41 Mya) coincided with the Eocene diversification of Passerida into Sylvioidea and Muscicapoidea–Passeroidea. The initial diversification of Analges took place on the Muscicapoidea clade, while the remaining passerine superfamilies appear to have been colonized through host-switching. Co-speciation appears to be relatively common among Analges species and their hosts, but the most striking pattern in the co-phylogenetic scenario involves numerous complete host-switches, spreads and several failures to speciate. The mechanism of long-term gene flow among different populations of multi-host Analges species is enigmatic and difficult to resolve. In some cases, mites were probably transferred between birds via feathers used as nest material.
Affiliation(s)
- Jacek Dabert
  - Department of Animal Morphology, Faculty of Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego, Poznan, Poland
- Serge V Mironov
  - Zoological Institute of the Russian Academy of Sciences, Universitetskaya Embankment, St. Petersburg, Russia
- Miroslawa Dabert
  - Molecular Biology Techniques Laboratory, Faculty of Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego, Poznan, Poland
4
Su Z, Townsend JP. Utility of characters evolving at diverse rates of evolution to resolve quartet trees with unequal branch lengths: analytical predictions of long-branch effects. BMC Evol Biol 2015; 15:86. [PMID: 25968460; PMCID: PMC4429678; DOI: 10.1186/s12862-015-0364-7] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 03/15/2015] [Accepted: 04/29/2015] [Indexed: 11/30/2022]
Abstract
BACKGROUND: The detection and avoidance of "long-branch effects" in phylogenetic inference represent a longstanding challenge for molecular phylogenetic investigations. A consequence of parallelism and convergence, long-branch effects arise in phylogenetic inference when there is unequal molecular divergence among lineages, and they can positively mislead inference based on parsimony especially, but also inference based on maximum likelihood and Bayesian approaches. Long-branch effects have been exhaustively examined by simulation studies that have compared the performance of different inference methods in specific model trees and branch length spaces. RESULTS: In this paper, by generalizing the phylogenetic signal and noise analysis to quartets with uneven subtending branches, we quantify the utility of molecular characters for resolution of quartet phylogenies via parsimony. Our quantification incorporates contributions toward the correct tree from either signal or homoplasy (i.e. "the right result for either the right reason or the wrong reason"). We also characterize a highly conservative lower bound of utility that incorporates contributions to the correct tree only when they correspond to true, unobscured parsimony-informative sites (i.e. "the right result for the right reason"). We apply the generalized signal and noise analysis to classic quartet phylogenies in which long-branch effects can arise due to unequal rates of evolution or an asymmetrical topology. Application of the analysis leads to identification of branch length conditions in which inference will be inconsistent and reveals insights regarding how to improve sampling of molecular loci and taxa in order to correctly resolve phylogenies in which long-branch effects are hypothesized to exist. CONCLUSIONS: The generalized signal and noise analysis provides analytical prediction of the utility of characters evolving at diverse rates of evolution to resolve quartet phylogenies with unequal branch lengths. The analysis can be applied to identifying characters evolving at appropriate rates to resolve phylogenies in which long-branch effects are hypothesized to occur.
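The abstract refers to parsimony-informative sites in quartets. For four taxa, a site is parsimony-informative only when exactly two states each appear twice, and each such pattern supports one of the three possible quartet topologies. The Python sketch below counts that support on a toy alignment; it is not the authors' signal and noise analysis, the function name and sequences are hypothetical, and gaps or ambiguity codes are not handled.

```python
from typing import Dict


def quartet_parsimony_support(a: str, b: str, c: str, d: str) -> Dict[str, int]:
    """Count parsimony-informative sites supporting each quartet topology.

    Taxa are labeled A, B, C, D in argument order. A site supports AB|CD
    when A and B share one state and C and D share a different state,
    and likewise for the other two pairings.
    """
    if not (len(a) == len(b) == len(c) == len(d)):
        raise ValueError("sequences must be aligned to the same length")
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            support["AB|CD"] += 1
        elif w == y and x == z and w != x:
            support["AC|BD"] += 1
        elif w == z and x == y and w != x:
            support["AD|BC"] += 1
        # All other patterns (constant, singleton, three or four states)
        # do not favor any topology under parsimony.
    return support


# Toy alignment (hypothetical data): five sites for four taxa.
counts = quartet_parsimony_support("AACGT", "AACTT", "GACGA", "GATTA")
print(counts)  # {'AB|CD': 2, 'AC|BD': 1, 'AD|BC': 0}
```

Under parsimony the topology with the highest count is preferred, so homoplasy on two long branches can inflate the count for a wrong pairing.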
Affiliation(s)
- Zhuo Su
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, 06520, USA.
- Jeffrey P Townsend
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, 06520, USA.
  - Department of Biostatistics, Yale University, New Haven, CT, 06520, USA.
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.
  - Department of Biostatistics, Yale School of Public Health, 135 College St #222., New Haven, CT, 06511, United States of America.
5
Parks SL, Goldman N. Maximum likelihood inference of small trees in the presence of long branches. Syst Biol 2014; 63:798-811. [PMID: 24996414; PMCID: PMC6371681; DOI: 10.1093/sysbio/syu044] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 08/14/2013] [Accepted: 06/20/2014] [Indexed: 11/14/2022]
Abstract
The statistical basis of maximum likelihood (ML), its robustness, and the fact that it appears to suffer less from biases lead to it being one of the most popular methods for tree reconstruction. Despite its popularity, very few analytical solutions for ML exist, so biases suffered by ML are not well understood. One possible bias is long branch attraction (LBA), a regularly cited term generally used to describe a propensity for long branches to be joined together in estimated trees. Although initially mentioned in connection with inconsistency of parsimony, LBA has been claimed to affect all major phylogenetic reconstruction methods, including ML. Despite the widespread use of this term in the literature, exactly what LBA is and what may be causing it is poorly understood, even for simple evolutionary models and small model trees. Studies looking at LBA have focused on the effect of two long branches on tree reconstruction. However, to understand the effect of two long branches it is also important to understand the effect of just one long branch. If ML struggles to reconstruct one long branch, then this may have an impact on LBA. In this study, we look at the effect of one long branch on three-taxon tree reconstruction. We show that, counterintuitively, long branches are preferentially placed at the tips of the tree. This can be understood through the use of analytical solutions to the ML equation and distance matrix methods. We go on to look at the placement of two long branches on four-taxon trees, showing that there is no attraction between long branches, but that for extreme branch lengths long branches are joined together disproportionally often. These results illustrate that even small model trees are still interesting to help understand how ML phylogenetic reconstruction works, and that LBA is a complicated phenomenon that deserves further study.
Affiliation(s)
- Sarah L Parks
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom
6
Steel M, Linz S, Huson DH, Sanderson MJ. Identifying a species tree subject to random lateral gene transfer. J Theor Biol 2013; 322:81-93. [DOI: 10.1016/j.jtbi.2013.01.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 12/09/2012] [Revised: 01/09/2013] [Accepted: 01/10/2013] [Indexed: 11/26/2022]