Groot Koerkamp R, Ivanov P. Exact global alignment using A* with chaining seed heuristic and match pruning.
Bioinformatics 2024;
40:btae032. [PMID:
38265119 PMCID:
PMC10932610 DOI:
10.1093/bioinformatics/btae032]
[Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 11/14/2023] [Accepted: 01/20/2024] [Indexed: 01/25/2024] Open
Abstract
MOTIVATION
Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time.
RESULTS
We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically.
On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n1.06, n≤107 bp). A similar scaling remains up to d=12% (best fit n1.24, n≤107 bp). For n=107 bp and d=4%, A*PA reaches >500× speedup compared to the leading exact aligners Edlib and BiWFA. The performance of A*PA is highly influenced by long gaps. On long (n>500kb) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 3× median speedup compared to Edlib and BiWFA. When the sequences come from different human samples, A*PA performs 1.7× faster than Edlib and BiWFA.
AVAILABILITY AND IMPLEMENTATION
github.com/RagnarGrootKoerkamp/astar-pairwise-aligner.
Collapse