Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021;38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023]

Cited by Other Article(s)

Park A, Koslicki D. Prokrustean Graph: A substring index for rapid k-mer size analysis. bioRxiv 2024:2023.11.21.568151. [PMID: 38853857 PMCID: PMC11160577 DOI: 10.1101/2023.11.21.568151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]

Abstract

Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
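
To make the idea of a k-mer-based object that changes with k concrete, the following is a minimal illustrative sketch that computes Jaccard similarity between the k-mer sets of two sequences for several values of k. It is a naive per-k recomputation, not the Prokrustean graph and not code from the linked repository; the function names `kmers`, `jaccard`, and `jaccard_by_k` are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct substrings of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    # When k exceeds len(seq), the range is empty and so is the set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|.

    Assumption for this sketch: two empty sets are treated as identical (1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def jaccard_by_k(x: str, y: str, ks: Iterable[int]) -> Dict[int, float]:
    """Recompute the k-mer sets from scratch for every k (naive approach)."""
    return {k: jaccard(kmers(x, k), kmers(y, k)) for k in ks}


if __name__ == "__main__":
    for k, sim in jaccard_by_k("ACGT", "ACGA", [1, 2, 3]).items():
        print(k, round(sim, 3))
```

Because each k triggers a full pass over both sequences, the cost of this sketch grows with the number of k values examined, which is the dependence on the range of k that the abstract says the framework avoids.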


Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024;40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]

Abstract

MOTIVATION

Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.

RESULTS

We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.

AVAILABILITY AND IMPLEMENTATION

The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.


Yang C, Zhang Z, Huang Y, Xie X, Liao H, Xiao J, Veldsman WP, Yin K, Fang X, Zhang L. LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome. Gigascience 2024;13:giae028. [PMID: 38869148 PMCID: PMC11170215 DOI: 10.1093/gigascience/giae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 03/15/2024] [Accepted: 05/09/2024] [Indexed: 06/14/2024]

Abstract

BACKGROUND

Linked-read sequencing technologies generate high-base quality short reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to 1 specific sequencing platform.

FINDINGS

To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically and provides users with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots.

CONCLUSIONS

LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.


Su R, Zhou H, Yang W, Moqir S, Ritu X, Liu L, Shi Y, Dong A, Bayier M, Letu Y, Manxi X, Chulu H, Nasenochir N, Meng H, Herrid M. Near telomere-to-telomere genome assembly of Mongolian cattle: implications for population genetic variation and beef quality. Gigascience 2024;13:giae099. [PMID: 39693631 DOI: 10.1093/gigascience/giae099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 09/29/2024] [Accepted: 11/10/2024] [Indexed: 12/20/2024]

Rajaby R, Liu DX, Au CH, Cheung YT, Lau AYT, Yang QY, Sung WK. INSurVeyor: improving insertion calling from short read sequencing data. Nat Commun 2023;14:3243. [PMID: 37277343 DOI: 10.1038/s41467-023-38870-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 08/29/2022] [Accepted: 05/18/2023] [Indexed: 06/07/2023]

Zhou Y, Yang L, Han X, Han J, Hu Y, Li F, Xia H, Peng L, Boschiero C, Rosen BD, Bickhart DM, Zhang S, Guo A, Van Tassell CP, Smith TPL, Yang L, Liu GE. Assembly of a pangenome for global cattle reveals missing sequences and novel structural variations, providing new insights into their diversity and evolutionary history. Genome Res 2022;32:1585-1601. [PMID: 35977842 PMCID: PMC9435747 DOI: 10.1101/gr.276550.122] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 01/02/2022] [Accepted: 07/21/2022] [Indexed: 02/03/2023]

Affiliation(s)

Yang Zhou, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Lv Yang, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Xiaotao Han, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Jiazheng Han, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Yan Hu, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Fan Li, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Han Xia, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Lingwei Peng, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Clarissa Boschiero, Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
Benjamin D Rosen, Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
Derek M Bickhart, Dairy Forage Research Center, ARS USDA, Madison, Wisconsin 53706, USA
Shujun Zhang, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
Aizhen Guo, The State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, China
Curtis P Van Tassell, Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
Timothy P L Smith, U.S. Meat Animal Research Center, ARS USDA, Clay Center, Nebraska 68933, USA
Liguo Yang, Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
George E Liu, Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA


Meleshko D, Yang R, Marks P, Williams S, Hajirasouliha I. Efficient detection and assembly of non-reference DNA sequences with synthetic long reads. Nucleic Acids Res 2022;50:e108. [PMID: 35924489 PMCID: PMC9561269 DOI: 10.1093/nar/gkac653] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/27/2022] [Revised: 06/10/2022] [Accepted: 08/01/2022] [Indexed: 11/14/2022]