1
|
Rajput J, Chandra G, Jain C. Co-linear chaining on pangenome graphs. Algorithms Mol Biol 2024; 19:4. [PMID: 38279113 DOI: 10.1186/s13015-024-00250-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 01/02/2024] [Indexed: 01/28/2024] Open
Abstract
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation ( https://github.com/at-cg/PanAligner ).
Collapse
Affiliation(s)
- Jyotshna Rajput
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
| | - Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India
| | - Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, Karnataka, India.
| |
Collapse
|
2
|
Chandra G, Jain C. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. J Comput Biol 2023; 30:1182-1197. [PMID: 37902967 DOI: 10.1089/cmb.2023.0186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/01/2023] Open
Abstract
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
Collapse
Affiliation(s)
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
| | - Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
| |
Collapse
|
3
|
Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O( m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Res 2023; 33:1175-1187. [PMID: 36990779 PMCID: PMC10538486 DOI: 10.1101/gr.277637.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2023] [Accepted: 03/16/2023] [Indexed: 03/31/2023]
Abstract
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(mn f (θ) log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than [Formula: see text] fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.
Collapse
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada;
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, Ontario M1C 1A4, Canada
| |
Collapse
|
4
|
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden.
| | - Thomas Baudeau
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Bastien Cazaux
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Camille Marchet
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.
| |
Collapse
|
5
|
Jain C, Gibney D, Thankachan SV. Algorithms for Colinear Chaining with Overlaps and Gap Costs. J Comput Biol 2022; 29:1237-1251. [PMID: 36351202 DOI: 10.1089/cmb.2022.0266] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022] Open
Abstract
Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.
Collapse
Affiliation(s)
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
| | - Daniel Gibney
- School of Computational Science and Engineering, Georgia Institute of Technology Atlanta, Georgia, USA
| | - Sharma V. Thankachan
- Department of Computer Science, University of Central Florida, Orlando, Florida, USA
| |
Collapse
|
6
|
Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020; 21:265. [PMID: 33066802 PMCID: PMC7568353 DOI: 10.1186/s13059-020-02168-z] [Citation(s) in RCA: 147] [Impact Index Per Article: 36.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Accepted: 09/23/2020] [Indexed: 12/21/2022] Open
Abstract
The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.
Collapse
Affiliation(s)
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA.
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA
| | - Chong Chu
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA
| |
Collapse
|
7
|
Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data. Bioinformatics 2019; 35:20-27. [PMID: 30561550 DOI: 10.1093/bioinformatics/bty544] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2017] [Accepted: 06/28/2018] [Indexed: 02/01/2023] Open
Abstract
Motivation Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity. Results We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint. Availability and implementation lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ehsan Haghshenas
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
| | - S Cenk Sahinalp
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada.,School of Informatics and Computing, Indiana University, Bloomington, IN, USA
| | - Faraz Hach
- Vancouver Prostate Centre, Vancouver, BC, Canada.,Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
| |
Collapse
|
8
|
Park J, Hulsman M, Arentshorst M, Breeman M, Alazi E, Lagendijk EL, Rocha MC, Malavazi I, Nitsche BM, van den Hondel CAMJJ, Meyer V, Ram AFJ. Transcriptomic and molecular genetic analysis of the cell wall salvage response of Aspergillus niger to the absence of galactofuranose synthesis. Cell Microbiol 2016; 18:1268-84. [PMID: 27264789 PMCID: PMC5129474 DOI: 10.1111/cmi.12624] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2016] [Revised: 05/16/2016] [Accepted: 05/30/2016] [Indexed: 12/11/2022]
Abstract
The biosynthesis of cell surface-located galactofuranose (Galf)-containing glycostructures such as galactomannan, N-glycans and O-glycans in filamentous fungi is important to secure the integrity of the cell wall. UgmA encodes an UDP-galactopyranose mutase, which is essential for the formation of Galf. Consequently, the ΔugmA mutant lacks Galf-containing molecules. Our previous work in Aspergillus niger work suggested that loss of function of ugmA results in activation of the cell wall integrity (CWI) pathway which is characterized by increased expression of the agsA gene, encoding an α-glucan synthase. In this study, the transcriptional response of the ΔugmA mutant was further linked to the CWI pathway by showing the induced and constitutive phosphorylation of the CWI-MAP kinase in the ΔugmA mutant. To identify genes involved in cell wall remodelling in response to the absence of galactofuranose biosynthesis, a genome-wide expression analysis was performed using RNAseq. Over 400 genes were higher expressed in the ΔugmA mutant compared to the wild-type. These include genes that encode enzymes involved in chitin (gfaB, gnsA, chsA) and α-glucan synthesis (agsA), and in β-glucan remodelling (bgxA, gelF and dfgC), and also include several glycosylphosphatidylinositol (GPI)-anchored cell wall protein-encoding genes. In silico analysis of the 1-kb promoter regions of the up-regulated genes in the ΔugmA mutant indicated overrepresentation of genes with RlmA, MsnA, PacC and SteA-binding sites. The importance of these transcription factors for survival of the ΔugmA mutant was analysed by constructing the respective double mutants. The ΔugmA/ΔrlmA and ΔugmA/ΔmsnA double mutants showed strong synthetic growth defects, indicating the importance of these transcription factors to maintain cell wall integrity in the absence of Galf biosynthesis.
Collapse
Affiliation(s)
- Joohae Park
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Mark Hulsman
- Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
| | - Mark Arentshorst
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Matthijs Breeman
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Ebru Alazi
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Ellen L Lagendijk
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Marina C Rocha
- Departamento de Genética e Evolução, Centro de Ciências Biológicas e da Saúde, Universidade Federal de São Carlos, São Paulo, Brazil
| | - Iran Malavazi
- Departamento de Genética e Evolução, Centro de Ciências Biológicas e da Saúde, Universidade Federal de São Carlos, São Paulo, Brazil
| | - Benjamin M Nitsche
- Applied and Molecular Microbiology, Institute of Biotechnology, Berlin University of Technology, Gustav-Meyer-Allee 25, 13355, Berlin, Germany
| | - Cees A M J J van den Hondel
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| | - Vera Meyer
- Applied and Molecular Microbiology, Institute of Biotechnology, Berlin University of Technology, Gustav-Meyer-Allee 25, 13355, Berlin, Germany
| | - Arthur F J Ram
- Leiden University, Institute of Biology Leiden, Molecular Microbiology and Biotechnology, Sylviusweg 72, 2333 BE, Leiden, The Netherlands
| |
Collapse
|