Abstract
Over the past three decades, computational capabilities have grown so rapidly that they have given rise to many computationally intensive scientific fields, such as phylogenomics. As more genomes are sequenced across the three domains of life, larger and more species-complete phylogenetic reconstructions are leading to a better understanding of the tree of life and of evolutionary history in deep time. However, these large datasets pose unique modeling and computational challenges: accurately describing the evolutionary process of thousands of species is still beyond the capability of current models, while the computational burden limits our ability to test multiple hypotheses. It is therefore common practice to reduce the size of a dataset by selecting a subset of species to represent each clade (taxon sampling). Unfortunately, this process is subjective, and comparisons of large tree-of-life studies show that the choice and number of species included in a dataset can alter the topology obtained. Taxon sampling is thus itself a process that needs to be thoroughly investigated to determine its effect on phylogenetic stability. Here, we present the theory and practical application of an automated pipeline that can be easily implemented to explore the effect of taxon sampling on phylogenetic reconstructions. The application of this approach was recently discussed in a study of Terrabacteria, which illustrates its power in investigating the accuracy of deep nodes of a phylogeny.