1
|
Vasei H, Foroughmand-Araabi MH, Daneshgar A. Weighted centroid trees: a general approach to summarize phylogenies in single-labeled tumor mutation tree inference. Bioinformatics 2024; 40:btae120. [PMID: 38984735 DOI: 10.1093/bioinformatics/btae120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 02/19/2024] [Accepted: 07/09/2024] [Indexed: 07/11/2024] Open
Abstract
MOTIVATION Tumor trees, which depict the evolutionary process of cancer, provide a backbone for discovering recurring evolutionary processes in cancer. While they are not the primary information extracted from genomic data, they are valuable for this purpose. One such extraction method involves summarizing multiple trees into a single representative tree, such as consensus trees or supertrees. RESULTS We define the "weighted centroid tree problem" to find the centroid tree of a set of single-labeled rooted trees through the following steps: (i) mapping the given trees into the Euclidean space, (ii) computing the weighted centroid matrix of the mapped trees, and (iii) finding the nearest mapped tree (NMTP) to the centroid matrix. We show that this setup encompasses previously studied parent-child and ancestor-descendent metrics as well as the GraPhyC and TuELiP consensus tree algorithms. Moreover, we show that, while the NMTP problem is polynomial-time solvable for the adjacency embedding, it is NP-hard for ancestry and distance mappings. We introduce integer linear programs for NMTP in different setups where we also provide a new algorithm for the case of ancestry embedding called 2-AncL2, that uses a novel weighting scheme for ancestry signals. Our experimental results show that 2-AncL2 has a superior performance compared to available consensus tree algorithms. We also illustrate our setup's application on providing representative trees for a large real breast cancer dataset, deducing that the cluster centroid trees summarize reliable evolutionary information about the original dataset. AVAILABILITY AND IMPLEMENTATION https://github.com/vasei/WAncILP.
Collapse
Affiliation(s)
- Hamed Vasei
- Department of Mathematical Sciences, Sharif University of Technology, Tehran 111559415, Iran
| | | | - Amir Daneshgar
- Department of Mathematical Sciences, Sharif University of Technology, Tehran 111559415, Iran
| |
Collapse
|
2
|
Sashittal P, Chen V, Pasarkar A, Raphael BJ. Joint inference of cell lineage and mitochondrial evolution from single-cell sequencing data. Bioinformatics 2024; 40:i218-i227. [PMID: 38940122 PMCID: PMC11211840 DOI: 10.1093/bioinformatics/btae231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Eukaryotic cells contain organelles called mitochondria that have their own genome. Most cells contain thousands of mitochondria which replicate, even in nondividing cells, by means of a relatively error-prone process resulting in somatic mutations in their genome. Because of the higher mutation rate compared to the nuclear genome, mitochondrial mutations have been used to track cellular lineage, particularly using single-cell sequencing that measures mitochondrial mutations in individual cells. However, existing methods to infer the cell lineage tree from mitochondrial mutations do not model "heteroplasmy," which is the presence of multiple mitochondrial clones with distinct sets of mutations in an individual cell. Single-cell sequencing data thus provide a mixture of the mitochondrial clones in individual cells, with the ancestral relationships between these clones described by a mitochondrial clone tree. While deconvolution of somatic mutations from a mixture of evolutionarily related genomes has been extensively studied in the context of bulk sequencing of cancer tumor samples, the problem of mitochondrial deconvolution has the additional constraint that the mitochondrial clone tree must be concordant with the cell lineage tree. RESULTS We formalize the problem of inferring a concordant pair of a mitochondrial clone tree and a cell lineage tree from single-cell sequencing data as the Nested Perfect Phylogeny Mixture (NPPM) problem. We derive a combinatorial characterization of the solutions to the NPPM problem, and formulate an algorithm, MERLIN, to solve this problem exactly using a mixed integer linear program. We show on simulated data that MERLIN outperforms existing methods that do not model mitochondrial heteroplasmy nor the concordance between the mitochondrial clone tree and the cell lineage tree. We use MERLIN to analyze single-cell whole-genome sequencing data of 5220 cells of a gastric cancer cell line and show that MERLIN infers a more biologically plausible cell lineage tree and mitochondrial clone tree compared to existing methods. AVAILABILITY AND IMPLEMENTATION https://github.com/raphael-group/MERLIN.
Collapse
Affiliation(s)
- Palash Sashittal
- Department of Computer Science, Princeton University, Princeton, NJ 08540, United States
| | - Viola Chen
- Department of Computer Science, Princeton University, Princeton, NJ 08540, United States
| | - Amey Pasarkar
- Department of Computer Science, Princeton University, Princeton, NJ 08540, United States
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ 08540, United States
| |
Collapse
|
3
|
Qi Y, El-Kebir M. Consensus Tree Under the Ancestor-Descendant Distance is NP-Hard. J Comput Biol 2024; 31:58-70. [PMID: 38010616 DOI: 10.1089/cmb.2023.0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2023] Open
Abstract
Due to uncertainty in tumor phylogeny inference from sequencing data, many methods infer multiple, equally plausible phylogenies for the same cancer. To summarize the solution space T of tumor phylogenies, consensus tree methods seek a single best representative tree S under a specified pairwise tree distance function. One such distance function is the ancestor-descendant (AD) distance [Formula: see text] , which equals the size of the symmetric difference of the transitive closures of the edge sets [Formula: see text] and [Formula: see text] . Here, we show that finding a consensus tree S for tumor phylogenies T that minimizes the total AD distance [Formula: see text] is NP-hard.
Collapse
Affiliation(s)
- Yuanyuan Qi
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Mohammed El-Kebir
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Cancer Center at Illinois, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| |
Collapse
|
4
|
Guang Z, Smith-Erb M, Oesper L. A weighted distance-based approach for deriving consensus tumor evolutionary trees. Bioinformatics 2023; 39:i204-i212. [PMID: 37387177 DOI: 10.1093/bioinformatics/btad230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The acquisition of somatic mutations by a tumor can be modeled by a type of evolutionary tree. However, it is impossible to observe this tree directly. Instead, numerous algorithms have been developed to infer such a tree from different types of sequencing data. But such methods can produce conflicting trees for the same patient, making it desirable to have approaches that can combine several such tumor trees into a consensus or summary tree. We introduce The Weighted m-Tumor Tree Consensus Problem (W-m-TTCP) to find a consensus tree among multiple plausible tumor evolutionary histories, each assigned a confidence weight, given a specific distance measure between tumor trees. We present an algorithm called TuELiP that is based on integer linear programming which solves the W-m-TTCP, and unlike other existing consensus methods, allows the input trees to be weighted differently. RESULTS On simulated data we show that TuELiP outperforms two existing methods at correctly identifying the true underlying tree used to create the simulations. We also show that the incorporation of weights can lead to more accurate tree inference. On a Triple-Negative Breast Cancer dataset, we show that including confidence weights can have important impacts on the consensus tree identified. AVAILABILITY An implementation of TuELiP and simulated datasets are available at https://bitbucket.org/oesperlab/consensus-ilp/src/main/.
Collapse
Affiliation(s)
- Ziyun Guang
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| | - Matthew Smith-Erb
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| | - Layla Oesper
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| |
Collapse
|
5
|
Abstract
The Robinson-Foulds (RF) distance, one of the most widely used metrics for comparing phylogenetic trees, has the advantage of being intuitive, with a natural interpretation in terms of common splits, and it can be computed in linear time, but it has a very low resolution, and it may become trivial for phylogenetic trees with overlapping taxa, that is, phylogenetic trees that share some but not all of their leaf labels. In this article, we study the properties of the Generalized Robinson-Foulds (GRF) distance, a recently proposed metric for comparing any structures that can be described by multisets of multisets of labels, when applied to rooted phylogenetic trees with overlapping taxa, which are described by sets of clusters, that is, by sets of sets of labels. We show that the GRF distance has a very high resolution, it can also be computed in linear time, and it is not (uniformly) equivalent to the RF distance.
Collapse
Affiliation(s)
- Mercè Llabrés
- Department of Mathematics and Computer Science, University of the Balearic Islands, Palma de Mallorca, Spain
- Balearic Islands Health Research Institute (IdISBa), Palma, Spain
| | - Francesc Rosselló
- Department of Mathematics and Computer Science, University of the Balearic Islands, Palma de Mallorca, Spain
- Balearic Islands Health Research Institute (IdISBa), Palma, Spain
| | - Gabriel Valiente
- Department of Computer Science, Technical University of Catalonia, Barcelona, Spain
| |
Collapse
|
6
|
Utro F, Levovitz C, Rhrissorrakrai K, Parida L. A common methodological phylogenomics framework for intra-patient heteroplasmies to infer SARS-CoV-2 sublineages and tumor clones. BMC Genomics 2021; 22:518. [PMID: 34789161 PMCID: PMC8596094 DOI: 10.1186/s12864-021-07660-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2021] [Accepted: 04/28/2021] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND All diseases containing genetic material undergo genetic evolution and give rise to heterogeneity including cancer and infection. Although these illnesses are biologically very different, the ability for phylogenetic retrodiction based on the genomic reads is common between them and thus tree-based principles and assumptions are shared. Just as the different frequencies of tumor genomic variants presupposes the existence of multiple tumor clones and provides a handle to computationally infer them, we postulate that the different variant frequencies in viral reads offers the means to infer multiple co-infecting sublineages. RESULTS We present a common methodological framework to infer the phylogenomics from genomic data, be it reads of SARS-CoV-2 of multiple COVID-19 patients or bulk DNAseq of the tumor of a cancer patient. We describe the Concerti computational framework for inferring phylogenies in each of the two scenarios.To demonstrate the accuracy of the method, we reproduce some known results in both scenarios. We also make some additional discoveries. CONCLUSIONS Concerti successfully extracts and integrates information from multi-point samples, enabling the discovery of clinically plausible phylogenetic trees that capture the heterogeneity known to exist both spatially and temporally. These models can have direct therapeutic implications by highlighting "birth" of clones that may harbor resistance mechanisms to treatment, "death" of subclones with drug targets, and acquisition of functionally pertinent mutations in clones that may have seemed clinically irrelevant. Specifically in this paper we uncover new potential parallel mutations in the evolution of the SARS-CoV-2 virus. In the context of cancer, we identify new clones harboring resistant mutations to therapy.
Collapse
Affiliation(s)
- Filippo Utro
- IBM Research, T.J. Watson Research Center, Yorktown Heights, USA
| | - Chaya Levovitz
- IBM Research, T.J. Watson Research Center, Yorktown Heights, USA
| | | | - Laxmi Parida
- IBM Research, T.J. Watson Research Center, Yorktown Heights, USA
| |
Collapse
|
7
|
Silva AS, Wilkinson M. On Defining and Finding Islands of Trees and Mitigating Large Island Bias. Syst Biol 2021; 70:1282-1294. [PMID: 33749752 PMCID: PMC8513764 DOI: 10.1093/sysbio/syab015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2020] [Accepted: 02/24/2021] [Indexed: 11/12/2022] Open
Abstract
How best can we summarize sets of phylogenetic trees? Systematists have relied heavily on consensus methods, but if tree distributions can be partitioned into distinct subsets, it may be helpful to provide separate summaries of these rather than relying entirely upon a single consensus tree. How sets of trees can most helpfully be partitioned and represented leads to many open questions, but one natural partitioning is provided by the islands of trees found during tree searches. Islands that are of dissimilar size have been shown to yield majority-rule consensus trees dominated by the largest sets We illustrate this large island bias and approaches that mitigate its impact by revisiting a recent analysis of phylogenetic relationships of living and fossil amphibians. We introduce a revised definition of tree islands based on any tree-to-tree pairwise distance metric that usefully extends the notion to any set or multiset of trees, as might be produced by, for example, Bayesian or bootstrap methods, and that facilitates finding tree islands a posteriori. We extract islands from a tree distribution obtained in a Bayesian analysis of the amphibian data to investigate their impact in that context, and we compare the partitioning produced by tree islands with those resulting from some alternative approaches. Distinct subsets of trees, such as tree islands, should be of interest because of what they may reveal about evolution and/or our attempts to understand it, and are an important, sometimes overlooked, consideration when building and interpreting consensus trees. [Amphibia; Bayesian inference; consensus; parsimony; partitions; phylogeny; Chinlestegophis.].
Collapse
Affiliation(s)
- Ana Serra Silva
- Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
- School of Earth Sciences, University of Bristol, Bristol BS8 1RL, UK
| | - Mark Wilkinson
- Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
| |
Collapse
|
8
|
Ciccolella S, Bernardini G, Denti L, Bonizzoni P, Previtali M, Della Vedova G. Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics 2021; 37:178-184. [PMID: 32730595 PMCID: PMC8055217 DOI: 10.1093/bioinformatics/btaa676] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 06/29/2020] [Accepted: 07/22/2020] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. RESULTS To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data. AVAILABILITY AND IMPLEMENTATION An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Simone Ciccolella
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Giulia Bernardini
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Luca Denti
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Marco Previtali
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| |
Collapse
|
9
|
Hodzic E, Shrestha R, Malikic S, Collins CC, Litchfield K, Turajlic S, Sahinalp SC. Identification of conserved evolutionary trajectories in tumors. Bioinformatics 2021; 36:i427-i435. [PMID: 32657374 DOI: 10.1093/bioinformatics/btaa453] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION As multi-region, time-series and single-cell sequencing data become more widely available; it is becoming clear that certain tumors share evolutionary characteristics with others. In the last few years, several computational methods have been developed with the goal of inferring the subclonal composition and evolutionary history of tumors from tumor biopsy sequencing data. However, the phylogenetic trees that they report differ significantly between tumors (even those with similar characteristics). RESULTS In this article, we present a novel combinatorial optimization method, CONETT, for detection of recurrent tumor evolution trajectories. Our method constructs a consensus tree of conserved evolutionary trajectories based on the information about temporal order of alteration events in a set of tumors. We apply our method to previously published datasets of 100 clear-cell renal cell carcinoma and 99 non-small-cell lung cancer patients and identify both conserved trajectories that were reported in the original studies, as well as new trajectories. AVAILABILITY AND IMPLEMENTATION CONETT is implemented in C++ and available at https://github.com/ehodzic/CONETT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ermin Hodzic
- Department of Computing Science, Simon Fraser University, Burnaby, BC, Canada
| | - Raunak Shrestha
- Department of Radiation Oncology, University of California San Francisco, San Francisco, CA, USA
| | - Salem Malikic
- Department of Computer Science, Indiana University Bloomington, Bloomington, IN, USA
| | - Colin C Collins
- Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada.,aboratory for Advanced Genome Analysis, Vancouver Prostate Centre, Vancouver, BC, Canada
| | - Kevin Litchfield
- Cancer Dynamics Laboratory, the Francis Crick institute, Genome Instability Laboratory, Francis Crick Institute, London, UK
| | - Samra Turajlic
- Cancer Dynamics Laboratory, the Francis Crick institute, Genome Instability Laboratory, Francis Crick Institute, London, UK.,Skin and Renal Units, The royal Marsden NHS Foundation Trust, London, UK
| | - S Cenk Sahinalp
- Cancer Data Science Lab., National Cancer Institute, NIH, Bethesda, MD, USA
| |
Collapse
|
10
|
Christensen S, Kim J, Chia N, Koyejo O, El-Kebir M. Detecting evolutionary patterns of cancers using consensus trees. Bioinformatics 2021; 36:i684-i691. [PMID: 33381820 DOI: 10.1093/bioinformatics/btaa801] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. RESULTS To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. AVAILABILITY AND IMPLEMENTATION https://github.com/elkebir-group/RECAP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Juho Kim
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Nicholas Chia
- Microbiome Program, Center for Individualized Medicine.,Division of Surgical Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
| | | | | |
Collapse
|
11
|
Sundermann LK, Wintersinger J, Rätsch G, Stoye J, Morris Q. Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine. PLoS Comput Biol 2021; 17:e1008400. [PMID: 33465079 PMCID: PMC7845980 DOI: 10.1371/journal.pcbi.1008400] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2020] [Revised: 01/29/2021] [Accepted: 09/22/2020] [Indexed: 11/18/2022] Open
Abstract
Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. We show through extensive noise-free simulation, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one Gb of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine. Cancer cells accumulate mutations over time and consist of genetically distinct subpopulations. Their evolutionary history (as represented by tumor phylogenies) can be inferred from bulk cancer genome sequencing data. Current tumor phylogeny reconstruction methods have two main issues: they are slow, and they do not efficiently represent uncertainty in the reconstruction. To address these issues, we developed SubMARine, a fast algorithm that summarizes all valid phylogenies in an intuitive format. SubMARine solved all reconstruction problems in this manuscript in less than 70 seconds, orders of magnitude faster than other methods. These reconstruction problems included those with up to 50 subclones; problems that are too large for other algorithms to even attempt. SubMARine achieves these result because, unlike other algorithms, it performs its reconstruction by identifying an upper-bound on the solution set of trees and the amount of noise in the estimates of the subclonal frequencies. In the vast majority of cases we checked, i. e. an extensive noise-free simulation, a lung cancer and a prostate cancer dataset, this upper bound is tight: when only a single solution exists, SubMARine converges to it every time. When multiple solutions exist, our algorithm correctly recovers the uncertain relationships in 71% of cases. In addition to solving these two major challenges, we introduce some useful new concepts for and open research problems in the field of tumor phylogeny reconstruction. Specifically, we formalize the concept of a partial clone tree which provides a set of constraints on the solution set of clone trees; and provide a complete set of conditions under which a partial clone tree is valid. These conditions guarantee that all trees in the solution set satisfy the constraints implied by the partial clone tree.
Collapse
Affiliation(s)
- Linda K. Sundermann
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Jeff Wintersinger
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, Zurich, Switzerland
- Biomedical Informatics, University Hospital Zurich, Zurich, Zurich, Switzerland
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, North Rhine-Westphalia, Germany
| | - Quaid Morris
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York City, New York, United States of America
- * E-mail:
| |
Collapse
|
12
|
Tsyvina V, Zelikovsky A, Snir S, Skums P. Inference of mutability landscapes of tumors from single cell sequencing data. PLoS Comput Biol 2020; 16:e1008454. [PMID: 33253159 PMCID: PMC7728263 DOI: 10.1371/journal.pcbi.1008454] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Revised: 12/10/2020] [Accepted: 10/20/2020] [Indexed: 11/18/2022] Open
Abstract
One of the hallmarks of cancer is the extremely high mutability and genetic instability of tumor cells. Inherent heterogeneity of intra-tumor populations manifests itself in high variability of clone instability rates. Analogously to fitness landscapes, the instability rates of clonal populations form their mutability landscapes. Here, we present MULAN (MUtability LANdscape inference), a maximum-likelihood computational framework for inference of mutation rates of individual cancer subclones using single-cell sequencing data. It utilizes the partial information about the orders of mutation events provided by cancer mutation trees and extends it by inferring full evolutionary history and mutability landscape of a tumor. Evaluation of mutation rates on the level of subclones rather than individual genes allows to capture the effects of genomic interactions and epistasis. We estimate the accuracy of our approach and demonstrate that it can be used to study the evolution of genetic instability and infer tumor evolutionary history from experimental data. MULAN is available at https://github.com/compbel/MULAN.
Collapse
Affiliation(s)
- Viachaslau Tsyvina
- Department of Computer Science, Georgia State University, Atlanta, Georgia, United States of America
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, Georgia, United States of America
| | - Sagi Snir
- Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, Georgia, United States of America
| |
Collapse
|
13
|
Weber LL, Aguse N, Chia N, El-Kebir M. PhyDOSE: Design of follow-up single-cell sequencing experiments of tumors. PLoS Comput Biol 2020; 16:e1008240. [PMID: 33001973 PMCID: PMC7553321 DOI: 10.1371/journal.pcbi.1008240] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2020] [Revised: 10/13/2020] [Accepted: 08/12/2020] [Indexed: 01/07/2023] Open
Abstract
The combination of bulk and single-cell DNA sequencing data of the same tumor enables the inference of high-fidelity phylogenies that form the input to many important downstream analyses in cancer genomics. While many studies simultaneously perform bulk and single-cell sequencing, some studies have analyzed initial bulk data to identify which mutations to target in a follow-up single-cell sequencing experiment, thereby decreasing cost. Bulk data provide an additional untapped source of valuable information, composed of candidate phylogenies and associated clonal prevalence. Here, we introduce PhyDOSE, a method that uses this information to strategically optimize the design of follow-up single cell experiments. Underpinning our method is the observation that only a small number of clones uniquely distinguish one candidate tree from all other trees. We incorporate distinguishing features into a probabilistic model that infers the number of cells to sequence so as to confidently reconstruct the phylogeny of the tumor. We validate PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE's computed number of cells resolves tree ambiguity even in the presence of typical single-cell sequencing errors. We also conduct a retrospective analysis on an acute myeloid leukemia cohort, demonstrating the potential to achieve similar results with a significant reduction in the number of cells sequenced. In a prospective analysis, we demonstrate the advantage of selecting cells to sequence across multiple biopsies and that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient single-cell sequencing experiments that yield high-fidelity phylogenies, which will improve downstream analyses aimed at deepening our understanding of cancer biology.
Collapse
Affiliation(s)
- Leah L Weber
- Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Nuraini Aguse
- Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Nicholas Chia
- Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Division of Surgical Research, Department of Surgery, Mayo Clinic, Rochester, Minnesota, United States of America
| | - Mohammed El-Kebir
- Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| |
Collapse
|
14
|
Abstract
MOTIVATION The combination of genomic and epidemiological data holds the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data. RESULTS We formulate the direct transmission inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting version of the DTI problem. We introduce Transmission Tree Uniform Sampler (TiTUS), a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees that we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS's ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain. AVAILABILITY AND IMPLEMENTATION https://github.com/elkebir-group/TiTUS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Palash Sashittal
- Department of Aerospace Engineering, University of Illinois at Urbana-Champaign, Urbama, IL 61801, USA
| | - Mohammed El-Kebir
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbama, IL 61801, USA
| |
Collapse
|