1
Gao J, Xu Y. DNA sequences alignment method using sparse index on pan-genome graph. J Bioinform Comput Biol 2024; 22:2450019. [PMID: 39215522 DOI: 10.1142/s0219720024500197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
A sequence graph represents the genetic variation of a pan-genome more concisely and space-efficiently than multiple linear reference genomes. To accelerate the alignment of reads to the graph, an index of the graph-based reference genome is used to obtain candidate locations. However, the potential combinatorial explosion of nodes in the sequence graph considerably increases the index space and the peak memory usage of the alignment process, especially for large-scale datasets. Existing methods typically respond by pruning complex regions or by extending the seed length, which sacrifices the recall of the alignment algorithm while reducing space usage only slightly. We present the Sparse-index of Graph (SIG) and the alignment algorithm SIG-Aligner, which index and align at a lower memory cost. SIG builds a non-overlapping minimizer index inside the nodes of the sequence graph, and SIG-Aligner filters out most false-positive matches using a method based on the pigeonhole principle. Compared with Giraffe, computational experiments show that SIG reduces index memory space by 50% to 75% for human pan-genome graphs while preserving superior or comparable alignment accuracy and a faster alignment time.
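The pigeonhole principle mentioned in the abstract can be illustrated on a plain linear reference. The sketch below is not the SIG-Aligner implementation: it assumes substitution-only errors, a string reference instead of a sequence graph, and hypothetical helper names (`pigeonhole_seeds`, `candidate_positions`). The idea is that a read with at most k mismatches, cut into k + 1 non-overlapping seeds, must contain at least one seed that matches exactly, so any location without an exact seed hit can be discarded.

```python
def pigeonhole_seeds(read: str, max_errors: int) -> list[tuple[int, str]]:
    """Split a read into max_errors + 1 non-overlapping seeds as (offset, seed) pairs."""
    n_seeds = max_errors + 1
    base, extra = divmod(len(read), n_seeds)
    seeds = []
    start = 0
    for i in range(n_seeds):
        # Spread the remainder over the first `extra` seeds.
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(read: str, reference: str, max_errors: int) -> set[int]:
    """Reference start positions supported by at least one exact seed hit.

    Illustrative assumption: errors are substitutions only, so a seed found at
    reference position `hit` implies the read starts at `hit - offset`.
    """
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, max_errors):
        if not seed:
            continue  # read shorter than the number of seeds
        hit = reference.find(seed)
        while hit != -1:
            candidates.add(hit - offset)
            hit = reference.find(seed, hit + 1)
    return candidates


reference = "ACGTACGTTTGACCAGTAGGCA"
read = "ACCTTTGACCAG"  # reference[4:16] with one substitution at read offset 2
print(candidate_positions(read, reference, max_errors=1))
```

Only the surviving candidate positions would then need a full, more expensive alignment; a real graph aligner would look seeds up in an index rather than scanning the reference.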
Affiliation(s)
- Jia Gao
  - School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China
  - Key Laboratory on High Performance Computing, Anhui Province, P. R. China
- Yun Xu
  - School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China
  - Key Laboratory on High Performance Computing, Anhui Province, P. R. China
2
Owolabi P, Adam Y, Adebiyi E. Personalizing medicine in Africa: current state, progress and challenges. Front Genet 2023; 14:1233338. [PMID: 37795248 PMCID: PMC10546210 DOI: 10.3389/fgene.2023.1233338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Accepted: 09/11/2023] [Indexed: 10/06/2023] Open
Abstract
Personalized medicine has been identified as a powerful tool for addressing the myriad of health issues facing different health systems globally. Although recent studies have expanded our understanding of how different factors such as genetics and the environment play significant roles in affecting the health of individuals, there are still several other issues affecting their translation into personalizing health interventions globally. Since African populations have demonstrated huge genetic diversity, there is a significant need to apply the concepts of personalized medicine to overcome various African-specific health challenges. Thus, we review the current state, progress, and challenges facing the adoption of personalized medicine in Africa with a view to providing insights to critical stakeholders on the right approach to deploy.
Affiliation(s)
- Paul Owolabi
  - Covenant Applied Informatics and Communication, Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun State, Nigeria
  - Department of Computer and Information Science, Covenant University, Ota, Ogun State, Nigeria
- Yagoub Adam
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Ezekiel Adebiyi
  - Covenant Applied Informatics and Communication, Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun State, Nigeria
  - Department of Computer and Information Science, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
  - Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, Germany
3
Costa-Silva J, Domingues DS, Menotti D, Hungria M, Lopes FM. Temporal progress of gene expression analysis with RNA-Seq data: A review on the relationship between computational methods. Comput Struct Biotechnol J 2022; 21:86-98. [PMID: 36514333 PMCID: PMC9730150 DOI: 10.1016/j.csbj.2022.11.051] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/15/2022] [Revised: 11/25/2022] [Accepted: 11/25/2022] [Indexed: 12/03/2022] Open
Abstract
Analysis of differential gene expression from RNA-seq data has become a standard for several research areas. The steps for the computational analysis include many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps and the respective objectives, the principal methods available in each step, and their properties, therefore introducing an organized overview to this context. This review aims to address mainly the aspects involved in the differentially expressed gene (DEG) analysis from RNA sequencing data (RNA-seq), considering the computational methods. In addition, a timeline of the computational methods for DEG is shown and discussed, and the relationships existing between the most important computational tools are presented by an interaction network. A discussion on the challenges and gaps in DEG analysis is also highlighted in this review. This paper will serve as a tutorial for new entrants into the field and help established users update their analysis pipelines.
Affiliation(s)
- Juliana Costa-Silva
  - Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
- Douglas S. Domingues
  - Department of Genetics, “Luiz de Queiroz” College of Agriculture, University of São Paulo, Av. Pádua Dias, 11, 13418-900 Piracicaba, São Paulo, Brazil
- David Menotti
  - Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
- Mariangela Hungria
  - Department of Soil Biotechnology - Embrapa Soybean, Cx. Postal 231, 86000-970 Londrina, Paraná, Brazil
- Fabrício Martins Lopes
  - Department of Computer Science, Universidade Tecnológica Federal do Paraná – UTFPR, Av. Alberto Carazzai, 1640, 86300-000, Cornélio Procópio, Paraná, Brazil
4
Ji M, Kan Y, Kim D, Jung J, Yi G. cPlot: Contig-Plotting Visualization for the Analysis of Short-Read Nucleotide Sequence Alignments. Int J Mol Sci 2022; 23:11484. [PMID: 36232783 PMCID: PMC9570162 DOI: 10.3390/ijms231911484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 09/13/2022] [Accepted: 09/21/2022] [Indexed: 11/21/2022] Open
Abstract
Advances in the next-generation sequencing technology have led to a dramatic decrease in read-generation cost and an increase in read output. Reconstruction of short DNA sequence reads generated by next-generation sequencing requires a read alignment method that reconstructs a reference genome. In addition, it is essential to analyze the results of read alignments for a biologically meaningful inference. However, read alignment from vast amounts of genomic data from various organisms is challenging in that it involves repeated automatic and manual analysis steps. We, here, devised cPlot software for read alignment of nucleotide sequences, with automated read alignment and position analysis, which allows visual assessment of the analysis results by the user. cPlot compares sequence similarity of reads by performing multiple read alignments, with FASTA format files as the input. This application provides a web-based interface for the user for facile implementation, without the need for a dedicated computing environment. cPlot identifies the location and order of the sequencing reads by comparing the sequence to a genetically close reference sequence in a way that is effective for visualizing the assembly of short reads generated by NGS and rapid gene map construction.
Affiliation(s)
- Mingeun Ji
  - Department of Multimedia Engineering, Dongguk University, Seoul 04620, Korea
- Yejin Kan
  - Department of Multimedia Engineering, Dongguk University, Seoul 04620, Korea
- Dongyeon Kim
  - Department of Multimedia Engineering, Dongguk University, Seoul 04620, Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin-si 17058, Korea
- Gangman Yi
  - Department of Multimedia Engineering, Dongguk University, Seoul 04620, Korea
5
Tetikol HS, Turgut D, Narci K, Budak G, Kalay O, Arslan E, Demirkaya-Budak S, Dolgoborodov A, Kabakci-Zorlu D, Semenyuk V, Jain A, Davis-Dusenbery BN. Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis. Nat Commun 2022; 13:4384. [PMID: 35927245 PMCID: PMC9352875 DOI: 10.1038/s41467-022-31724-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/27/2021] [Accepted: 06/30/2022] [Indexed: 11/29/2022] Open
Abstract
Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference to represent the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based toolkits for NGS read alignment and variant calling, methods to curate genomic variants and subsequently construct genome graphs remain an understudied problem that inevitably determines the effectiveness of the overall bioinformatics pipeline. In this study, we discuss obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and demonstrate this approach on the whole-genome samples of African ancestry. Our results show that population-specific graphs, as more representative alternatives to linear or generic graph references, can achieve significantly lower read mapping errors and enhanced variant calling sensitivity, in addition to providing the improvements of joint variant calling without the need of computationally intensive post-processing steps.
Affiliation(s)
- Kubra Narci
  - Seven Bridges Genomics, Charlestown, MA, USA
- Ozem Kalay
  - Seven Bridges Genomics, Charlestown, MA, USA
- Elif Arslan
  - Seven Bridges Genomics, Charlestown, MA, USA
- Amit Jain
  - Seven Bridges Genomics, Charlestown, MA, USA
6
Van Der Merwe N, Ramesar R, De Vries J. Whole Exome Sequencing in South Africa: Stakeholder Views on Return of Individual Research Results and Incidental Findings. Front Genet 2022; 13:864822. [PMID: 35754817 PMCID: PMC9216214 DOI: 10.3389/fgene.2022.864822] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/28/2022] [Accepted: 03/30/2022] [Indexed: 11/17/2022] Open
Abstract
The use of whole exome sequencing (WES) in medical research is increasing in South Africa (SA), raising important questions about whether and which individual genetic research results, particularly incidental findings, should be returned to patients. Whilst some commentaries and opinions related to the topic have been published in SA, there is no qualitative data on the views of professional stakeholders on this topic. Seventeen participants including clinicians, genomics researchers, and genetic counsellors (GCs) were recruited from the Western Cape in SA. Semi-structured interviews were conducted, and the transcripts analysed using the framework approach for data analysis. Current roadblocks for the clinical adoption of WES in SA include a lack of standardised guidelines; complexities relating to variant interpretation due to lack of functional studies and underrepresentation of people of African ancestry in the reference genome, population and variant databases; lack of resources and skilled personnel for variant confirmation and follow-up. Suggestions to overcome these barriers include obtaining funding and buy-in from the private and public sectors and medical insurance companies; the generation of a locally relevant reference genome; training of health professionals in the field of genomics and bioinformatics; and multidisciplinary collaboration. Participants emphasised the importance of upscaling the accessibility to and training of GCs, as well as upskilling of clinicians and genetic nurses for return of genetic data in collaboration with GCs and medical geneticists. Future research could focus on exploring the development of stakeholder partnerships for increased access to trained specialists as well as community engagement and education, alongside the development of guidelines for result disclosure.
Affiliation(s)
- Nicole Van Der Merwe
  - UCT/MRC Genomic and Precision Medicine Research Unit, Division of Human Genetics, Institute for Infectious Diseases and Molecular Medicine, Department of Pathology, Faculty of Medicine and Health Sciences, University of Cape Town, Cape Town, South Africa
  - Department of Pathology, Faculty of Medicine and Health Sciences, Stellenbosch University, Tygerberg, South Africa
- Raj Ramesar
  - UCT/MRC Genomic and Precision Medicine Research Unit, Division of Human Genetics, Institute for Infectious Diseases and Molecular Medicine, Department of Pathology, Faculty of Medicine and Health Sciences, University of Cape Town, Cape Town, South Africa
- Jantina De Vries
  - Department of Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
  - Neuroscience Institute, Faculty of Health Sciences, University of Cape Town, Observatory, South Africa
7
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
  - Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
  - Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
- Paola Bonizzoni
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
- Gianluca Della Vedova
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Yuri Pirola
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Raffaella Rizzi
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Jouni Sirén
  - Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
8
Quan W, Liu B, Wang Y. Fast and SNP-aware short read alignment with SALT. BMC Bioinformatics 2021; 22:172. [PMID: 34433415 PMCID: PMC8386087 DOI: 10.1186/s12859-021-04088-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2021] [Accepted: 03/17/2021] [Indexed: 11/23/2022] Open
Abstract
Background DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variations in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that can store and index all genetic variants has a high cost in memory (RAM) space and leads to extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information will reduce the complexity of the index and improve the speed of sequence alignment. Results The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy. Conclusions Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. 
The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT.
Affiliation(s)
- Wei Quan
  - School of Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, China
9
Samaha G, Wade CM, Mazrier H, Grueber CE, Haase B. Exploiting genomic synteny in Felidae: cross-species genome alignments and SNV discovery can aid conservation management. BMC Genomics 2021; 22:601. [PMID: 34362297 PMCID: PMC8348863 DOI: 10.1186/s12864-021-07899-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Accepted: 07/14/2021] [Indexed: 11/10/2022] Open
Abstract
Background While recent advances in genomics have enabled vast improvements in the quantification of genome-wide diversity and the identification of adaptive and deleterious alleles in model species, wildlife and non-model species have largely not reaped the same benefits. This has been attributed to the resources and infrastructure required to develop essential genomic datasets such as reference genomes. In the absence of a high-quality reference genome, cross-species alignments can provide reliable, cost-effective methods for single nucleotide variant (SNV) discovery. Here, we demonstrated the utility of cross-species genome alignment methods in gaining insights into population structure and functional genomic features in cheetah (Acinonyx jubatus), snow leopard (Panthera uncia) and Sumatran tiger (Panthera tigris sumatrae), relative to the domestic cat (Felis catus). Results Alignment of big cats to the domestic cat reference assembly yielded nearly complete sequence coverage of the reference genome. From this, 38,839,061 variants in cheetah, 15,504,143 in snow leopard and 13,414,953 in Sumatran tiger were discovered and annotated. This method was able to delineate population structure but was limited in its ability to adequately detect rare variants. Enrichment analysis of fixed and species-specific SNVs revealed insights into adaptive traits, evolutionary history and the pathogenesis of heritable diseases. Conclusions The high degree of synteny among felid genomes enabled the successful application of the domestic cat reference in high-quality SNV detection. The datasets presented here provide a useful resource for future studies into population dynamics, evolutionary history and genetic and disease management of big cats. This cross-species method of variant discovery provides genomic context for identifying annotated gene regions essential to understanding adaptive and deleterious variants that can improve conservation outcomes.
Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07899-2.
Affiliation(s)
- Georgina Samaha
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, NSW, Australia
- Claire M Wade
  - School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW, Australia
- Hamutal Mazrier
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, NSW, Australia
- Catherine E Grueber
  - School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW, Australia
- Bianca Haase
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, NSW, Australia
10
Maarala AI, Arasalo O, Valenzuela D, Mäkinen V, Heljanko K. Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment. PLoS One 2021; 16:e0255260. [PMID: 34343181 PMCID: PMC8330939 DOI: 10.1371/journal.pone.0255260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Accepted: 07/12/2021] [Indexed: 11/19/2022] Open
Abstract
Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. 
Compressing the GenBank database (n = 13,375,031; 488 GB) to a BLAST index resulted in a CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) to the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were blasted to the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were blasted to the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
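As a reading aid for the compression ratios quoted above, the arithmetic can be sketched as follows. This assumes that a CR of X:1 means the uncompressed input size divided by the resulting index size, which the abstract does not state explicitly; the function name `index_size_gb` is hypothetical and the figure it prints is an implied estimate, not a number reported by the authors.

```python
def index_size_gb(original_gb: float, compression_ratio: float) -> float:
    """Implied index size, assuming CR = original size / index size."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return original_gb / compression_ratio


# GenBank example from the abstract: 488 GB input, 62:1 compression ratio.
print(f"{index_size_gb(488, 62):.1f} GB")  # prints "7.9 GB"
```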
Affiliation(s)
- Ossi Arasalo
  - Department of Computer Science, Aalto University, Espoo, Finland
- Daniel Valenzuela
  - Department of Computer Science, University of Helsinki, Espoo, Finland
- Veli Mäkinen
  - Department of Computer Science, University of Helsinki, Espoo, Finland
  - Helsinki Institute for Information Technology, Espoo, Finland
- Keijo Heljanko
  - Department of Computer Science, University of Helsinki, Espoo, Finland
  - Helsinki Institute for Information Technology, Espoo, Finland
11
Norri T, Cazaux B, Dönges S, Valenzuela D, Mäkinen V. Founder Reconstruction Enables Scalable and Seamless Pangenomic Analysis. Bioinformatics 2021; 37:4611-4619. [PMID: 34260702 PMCID: PMC8665761 DOI: 10.1093/bioinformatics/btab516] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/23/2020] [Revised: 05/29/2021] [Accepted: 07/08/2021] [Indexed: 11/14/2022] Open
Abstract
Motivation Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration to existing workflows remains a challenge. Results We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit into a founder multiple alignment, that is then indexed using a hybrid scheme that exploits general purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling. Availability and implementation Our open access tools and instructions how to reproduce our experiments are available at the following address: https://github.com/algbio/panvc-founders. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tuukka Norri
  - Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
- Bastien Cazaux
  - Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
- Saska Dönges
  - Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
- Daniel Valenzuela
  - Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
- Veli Mäkinen
  - Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
12
Boti MA, Adamopoulos PG, Tsiakanikas P, Scorilas A. Nanopore Sequencing Unveils Diverse Transcript Variants of the Epithelial Cell-Specific Transcription Factor Elf-3 in Human Malignancies. Genes (Basel) 2021; 12:839. [PMID: 34072506 PMCID: PMC8227732 DOI: 10.3390/genes12060839] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/30/2021] [Revised: 05/25/2021] [Accepted: 05/27/2021] [Indexed: 02/06/2023] Open
Abstract
The human E74-like ETS transcription factor 3 (Elf-3) is an epithelium-specific member of the ETS family, all members of which are characterized by a highly conserved DNA-binding domain. Elf-3 plays a crucial role in epithelial cell differentiation by participating in morphogenesis and terminal differentiation of the murine small intestinal epithelium, and also acts as an indispensable regulator of mesenchymal to epithelial transition, underlying its significant involvement in development and in pathological states, such as cancer. Although previous research works have deciphered the functional role of Elf-3 in normal physiology as well as in tumorigenesis, the present study highlights for the first time the wide spectrum of ELF3 mRNAs that are transcribed, providing an in-depth analysis of splicing events and exon/intron boundaries in a broad panel of human cell lines. The implementation of a versatile targeted nanopore sequencing approach led to the identification of 25 novel ELF3 mRNA transcript variants (ELF3 v.3–v.27) with new alternative splicing events, as well as two novel exons. Although the current study provides a qualitative transcriptional profile regarding ELF3, further studies must be conducted, so the biological function of all novel alternative transcript variants as well as the putative protein isoforms are elucidated.
|
13
|
Darby CA, Gaddipati R, Schatz MC, Langmead B. Vargas: heuristic-free alignment for assessing linear and graph read aligners. Bioinformatics 2020; 36:3712-3718. [PMID: 32321164 PMCID: PMC7320598 DOI: 10.1093/bioinformatics/btaa265] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2019] [Revised: 03/19/2020] [Accepted: 04/15/2020] [Indexed: 12/31/2022] Open
Abstract
Motivation Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-maximal exact match and vg to align more reads correctly. Availability and implementation Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Michael C Schatz
- Department of Computer Science.,Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA.,Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | | |
|
14
|
Büchler T, Ohlebusch E. An improved encoding of genetic variation in a Burrows-Wheeler transform. Bioinformatics 2020; 36:1413-1419. [PMID: 31613311 DOI: 10.1093/bioinformatics/btz782] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2019] [Revised: 09/04/2019] [Accepted: 10/11/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a 'natural encoding' of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. RESULTS In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361-i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222-233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. 
This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the 'marked chromosome'. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools. AVAILABILITY AND IMPLEMENTATION https://www.uni-ulm.de/in/theo/research/seqana/.
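For orientation, the standard (unmodified) backward search that BWT-based read mappers rely on can be sketched as below. This is a minimal illustration, not the variant-aware search proposed in the article: the function names `bwt_index` and `backward_search` are hypothetical, the suffix array is built naively, and the text is assumed not to contain the sentinel `$`.

```python
def bwt_index(text):
    """Build a naive suffix array plus the C and Occ tables for text + '$'."""
    text += "$"  # sentinel, assumed smaller than every other symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # C[c] = number of symbols in the text strictly smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of pattern, scanning it right to left."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = bwt_index("ACGTACGA")
print(backward_search("ACG", sa, C, occ))
```

Each pattern symbol narrows the suffix-array interval with two table lookups, which is the step the article describes modifying so that the search can cope with variants encoded in the BWT.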
Affiliation(s)
- Thomas Büchler
- Institute of Theoretical Computer Science, Ulm University, Ulm 89069, Germany
| | - Enno Ohlebusch
- Institute of Theoretical Computer Science, Ulm University, Ulm 89069, Germany
| |
|
15
|
Kumar S, Agarwal S, Ranvijay. Fast and memory efficient approach for mapping NGS reads to a reference genome. J Bioinform Comput Biol 2020; 17:1950008. [PMID: 31057068 DOI: 10.1142/s0219720019500082] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
New generation sequencing machines: Illumina and Solexa can generate millions of short reads from a given genome sequence on a single run. Alignment of these reads to a reference genome is a core step in Next-generation sequencing data analysis such as genetic variation and genome re-sequencing etc. Therefore there is a need of a new approach, efficient with respect to memory as well as time to align these enormous reads with the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge memory for whole reference genome indexing and reads alignment. Gapped alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach: WIT for reference genome indexing and reads alignment using Burrows-Wheeler Transform (BWT) and Wavelet Tree (WT) is proposed. Both exact and approximate alignments are possible by it. Experimental work shows that the proposed approach WIT performs the best in case of protein sequence indexing. For indexing, the reference genome space required by WIT is 0.6 N (N is the size of reference genome) whereas existing techniques BWA, Subread, Kart, and Minimap2 require space in between 1.25 N to 5 N. Experimentally, it is also observed that even using such small index size alignment time of proposed approach is comparable in comparison to BWA, Subread, Kart, and Minimap2. Other alignment parameters accuracy and confidentiality are also experimentally shown to be better than Minimap2. The source code of the proposed approach WIT is available at http://www.algorithm-skg.com/wit/home.html .
Affiliation(s)
| | | | - Ranvijay
- CSED, NIT Allahabad, 211004, India
| |
|
16
|
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 100] [Impact Index Per Article: 25.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Affiliation(s)
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jonas A Sibbesen
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Simon Heumos
- Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
| | - Ali Ghaffaari
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Xian Chang
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Josiah D Seaman
- Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom.,School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
| | - Robin Rounthwaite
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jana Ebler
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Shilpa Garg
- Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Erik Garrison
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| |
|
17
|
Jain C, Zhang H, Gao Y, Aluru S. On the Complexity of Sequence-to-Graph Alignment. J Comput Biol 2020. [DOI: 10.1089/cmb.2019.0066] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Affiliation(s)
- Chirag Jain
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| | - Haowen Zhang
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| | - Yu Gao
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| | - Srinivas Aluru
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| |
|
18
|
Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol 2020; 27:500-513. [PMID: 32181684 PMCID: PMC7185338 DOI: 10.1089/cmb.2019.0309] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
Affiliation(s)
- Alan Kuhnle
- Department of Computer Science, Florida State University, Tallahassee, Florida
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
- Address correspondence to: Taher Mun, PhD Candidate, Department of Computer Science, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218-2682
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- School of Computer Science and Telecommunications, Universidad Diego Portales and CeBiB, Santiago, Chile
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
| | - Giovanni Manzini
- Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
| |
|
19
|
Mokveld T, Linthorst J, Al-Ars Z, Holstege H, Reinders M. CHOP: haplotype-aware path indexing in population graphs. Genome Biol 2020; 21:65. [PMID: 32160922 PMCID: PMC7066762 DOI: 10.1186/s13059-020-01963-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2019] [Accepted: 02/18/2020] [Indexed: 12/20/2022] Open
Abstract
The practical use of graph-based reference genomes depends on the ability to align reads to them. Performing substring queries to paths through these graphs lies at the core of this task. The combination of increasing pattern length and encoded variations inevitably leads to a combinatorial explosion of the search space. Instead of heuristic filtering or pruning steps to reduce the complexity, we propose CHOP, a method that constrains the search space by exploiting haplotype information, bounding the search space to the number of haplotypes so that a combinatorial explosion is prevented. We show that CHOP can be applied to large and complex datasets, by applying it on a graph-based representation of the human genome encoding all 80 million variants reported by the 1000 Genomes Project.
Affiliation(s)
- Tom Mokveld
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
| | - Jasper Linthorst
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
- Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, Amsterdam, 1081 BT The Netherlands
| | - Zaid Al-Ars
- Computer Engineering, Delft University of Technology, Mekelweg 4, Delft, 2628 CD The Netherlands
| | - Henne Holstege
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
- Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, Amsterdam, 1081 BT The Netherlands
| | - Marcel Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
| |
|
20
|
Guo H, Liu B, Guan D, Fu Y, Wang Y. Fast read alignment with incorporation of known genomic variants. BMC Med Inform Decis Mak 2019; 19:265. [PMID: 31856811 PMCID: PMC6921400 DOI: 10.1186/s12911-019-0960-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Background Many genetic variants have been reported from sequencing projects due to decreasing experimental costs. Compared to the current typical paradigm, read mapping incorporating existing variants can improve the performance of subsequent analysis. This method is supposed to map sequencing reads efficiently to a graphical index with a reference genome and known variation to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation require costly RAM space. Methods Aligning reads to a graph model-based index including the whole set of variants is ultimately an NP-hard problem in theory. Here, we propose a variation-aware read alignment algorithm (VARA), which generates the alignment between read and multiple genomic sequences simultaneously utilizing the schema of the Landau-Vishkin algorithm. VARA dynamically extracts regional variants to construct a pseudo tree-based structure on-the-fly for seed extension without loading the whole genome variation into memory space. Results We developed the novel high-throughput sequencing read aligner deBGA-VARA by integrating VARA into deBGA. The deBGA-VARA is benchmarked both on simulated reads and the NA12878 sequencing dataset. The experimental results demonstrate that read alignment incorporating genetic variation knowledge can achieve high sensitivity and accuracy. Conclusions Due to its efficiency, VARA provides a promising solution for further improvement of variant calling while maintaining small memory footprints. The deBGA-VARA is available at: https://github.com/hitbc/deBGA-VARA.
Affiliation(s)
- Hongzhe Guo
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
| | - Bo Liu
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
| | - Dengfeng Guan
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
| | - Yilei Fu
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
| | - Yadong Wang
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China.
| |
|
21
|
Sirén J, Garrison E, Novak AM, Paten B, Durbin R. Haplotype-aware graph indexes. Bioinformatics 2019; 36:400-407. [PMID: 31406990 PMCID: PMC7223266 DOI: 10.1093/bioinformatics/btz575] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2019] [Revised: 05/29/2019] [Accepted: 07/18/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | - Erik Garrison
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Adam M Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
| | - Richard Durbin
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK,Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK
| |
|
22
|
Daykin J, Groult R, Guesnet Y, Lecroq T, Lefebvre A, Léonard M, Mouchard L, Prieur-Gaston É, Watson B. Efficient pattern matching in degenerate strings with the Burrows–Wheeler transform. INFORM PROCESS LETT 2019. [DOI: 10.1016/j.ipl.2019.03.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
23
|
Norri T, Cazaux B, Kosolobov D, Mäkinen V. Linear time minimum segmentation enables scalable founder reconstruction. Algorithms Mol Biol 2019; 14:12. [PMID: 31131017 PMCID: PMC6525415 DOI: 10.1186/s13015-019-0147-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2018] [Accepted: 05/04/2019] [Indexed: 11/15/2022] Open
Abstract
Background We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment $[a,b] \in P$ has length at least L and the number $d(a,b) = |\{R_i[a,b] : 1 \le i \le m\}|$ of distinct substrings at segment [a, b] is minimized over $[a,b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max \{ d(a,b) : [a,b] \in P \}$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. Results We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$. Conclusions Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.
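The segmentation problem defined above can be made concrete with a small dynamic program. The sketch below is a naive illustration under stated assumptions, not the O(mn) algorithm of the article: the names `distinct_count` and `min_segmentation` are hypothetical, segments are 0-based half-open intervals, and the running time is far worse than linear.

```python
from math import inf


def distinct_count(rows, a, b):
    """d(a, b): number of distinct substrings rows[i][a:b] over all rows."""
    return len({r[a:b] for r in rows})


def min_segmentation(rows, min_len):
    """Partition [0, n) into segments of length >= min_len, minimizing the
    maximum number of distinct substrings in any segment.

    Returns (best value, list of segments) or None if no valid partition exists.
    """
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("all rows must have the same length")

    best = [inf] * (n + 1)  # best[j]: optimal value for the prefix [0, j)
    back = [-1] * (n + 1)   # back[j]: start of the last segment ending at j
    best[0] = 0
    for j in range(min_len, n + 1):
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue
            cost = max(best[i], distinct_count(rows, i, j))
            if cost < best[j]:
                best[j], back[j] = cost, i

    if best[n] == inf:
        return None

    # Walk the back-pointers to recover the segments in left-to-right order.
    segments, j = [], n
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    segments.reverse()
    return best[n], segments


rows = ["AACC", "AAGG", "TTCC", "TTGG"]
print(min_segmentation(rows, 2))
```

In this toy input, splitting the four rows in the middle leaves two distinct blocks per segment, so two founder sequences suffice, whereas a single segment would need all four.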
|
24
|
Silva-Junior OB, Grattapaglia D, Novaes E, Collevatti RG. Design and evaluation of a sequence capture system for genome-wide SNP genotyping in highly heterozygous plant genomes: a case study with a keystone Neotropical hardwood tree genome. DNA Res 2019; 25:535-545. [PMID: 30020434 PMCID: PMC6191306 DOI: 10.1093/dnares/dsy023] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2017] [Accepted: 06/22/2018] [Indexed: 12/12/2022] Open
Abstract
Targeted sequence capture coupled to high-throughput sequencing has become a powerful method for the study of genome-wide sequence variation. Following our recent development of a genome assembly for the Pink Ipê tree (Handroanthus impetiginosus), a widely distributed Neotropical timber species, we now report the development of a set of 24,751 capture probes for single-nucleotide polymorphisms (SNPs) characterization and genotyping across 18,216 distinct loci, sampling more than 10 Mbp of the species genome. This system identifies nearly 200,000 SNPs located inside or in close proximity to almost 14,000 annotated protein-coding genes, generating quality genotypic data in populations spanning wide geographic distances across the species native range. To provide recommendations for future developments of similar systems for highly heterozygous plant genomes we investigated issues such as probe design, sequencing coverage and bioinformatics, including the evaluation of the capture efficiency and a reassessment of the technical reproducibility of the assay for SNPs recall and genotyping precision. Our results highlight the value of a detailed probe screening on a preliminary genome assembly to produce reliable data for downstream genetic studies. This work should inspire and assist the development of similar genomic resources for other orphan crops and forest trees with highly heterozygous genomes.
Affiliation(s)
- Orzenil Bonfim Silva-Junior
- EMBRAPA Recursos Genéticos e Biotecnologia, EPqB, Brasília, DF, Brazil.,Programa de Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, SGAN 916 Modulo B, Brasilia, DF, Brazil
| | - Dario Grattapaglia
- EMBRAPA Recursos Genéticos e Biotecnologia, EPqB, Brasília, DF, Brazil.,Programa de Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, SGAN 916 Modulo B, Brasilia, DF, Brazil
| | - Evandro Novaes
- Departamento de Biologia, Universidade Federal de Lavras, Lavras, MG, Brazil
| | - Rosane G Collevatti
- Laboratório de Genética & Biodiversidade, Instituto de Ciências Biológicas, Universidade Federal de Goiás, Goiânia, GO, Brazil
| |
|
25
|
Fast and accurate genomic analyses using genome graphs. Nat Genet 2019; 51:354-362. [PMID: 30643257 DOI: 10.1038/s41588-018-0316-4] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2017] [Accepted: 11/14/2018] [Indexed: 12/29/2022]
Abstract
The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
|
26
|
Bolger AM, Poorter H, Dumschott K, Bolger ME, Arend D, Osorio S, Gundlach H, Mayer KFX, Lange M, Scholz U, Usadel B. Computational aspects underlying genome to phenome analysis in plants. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2019; 97:182-198. [PMID: 30500991 PMCID: PMC6849790 DOI: 10.1111/tpj.14179] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 11/06/2018] [Accepted: 11/16/2018] [Indexed: 05/18/2023]
Abstract
Recent advances in genomics technologies have greatly accelerated the progress in both fundamental plant science and applied breeding research. Concurrently, high-throughput plant phenotyping is becoming widely adopted in the plant community, promising to alleviate the phenotypic bottleneck. While these technological breakthroughs are significantly accelerating quantitative trait locus (QTL) and causal gene identification, challenges to enable even more sophisticated analyses remain. In particular, care needs to be taken to standardize, describe and conduct experiments robustly while relying on plant physiology expertise. In this article, we review the state of the art regarding genome assembly and the future potential of pangenomics in plant research. We also describe the necessity of standardizing and describing phenotypic studies using the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) standard to enable the reuse and integration of phenotypic data. In addition, we show how deep phenotypic data might yield novel trait-trait correlations and review how to link phenotypic data to genomic data. Finally, we provide perspectives on the golden future of machine learning and their potential in linking phenotypes to genomic features.
Affiliation(s)
- Anthony M. Bolger
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
| | - Hendrik Poorter
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
- Department of Biological Sciences, Macquarie University, North Ryde, NSW 2109, Australia
| | - Kathryn Dumschott
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
| | - Marie E. Bolger
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
| | - Daniel Arend
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
| | - Sonia Osorio
- Department of Molecular Biology and Biochemistry, Instituto de Hortofruticultura Subtropical y Mediterránea “La Mayora”, Universidad de Málaga-Consejo Superior de Investigaciones Científicas, Campus de Teatinos, 29071 Málaga, Spain
| | - Heidrun Gundlach
- Plant Genome and Systems Biology (PGSB), Helmholtz Zentrum München (HMGU), Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
| | - Klaus F. X. Mayer
- Plant Genome and Systems Biology (PGSB), Helmholtz Zentrum München (HMGU), Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
| | - Matthias Lange
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
| | - Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
| | - Björn Usadel
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
| |
|
27
|
Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol 2018; 19:220. [PMID: 30558649 PMCID: PMC6296055 DOI: 10.1186/s13059-018-1595-x] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 05/12/2018] [Accepted: 11/26/2018] [Indexed: 12/30/2022]
Abstract
There is growing interest in using genetic variants to augment the reference genome into a graph genome, with alternative sequences, to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead.
Affiliation(s)
- Jacob Pritt
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| |
|
28
|
Minkin I, Pham S, Medvedev P. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics 2018; 33:4024-4032. [PMID: 27659452 DOI: 10.1093/bioinformatics/btw609] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 04/03/2016] [Accepted: 09/16/2016] [Indexed: 01/06/2023]
Abstract
Motivation: De Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: ium125@psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ilia Minkin
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
| | - Son Pham
- BioTuring Inc., San Diego, CA 92121, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Department of Biochemistry and Molecular Biology; Genomic Sciences Institute of the Huck, The Pennsylvania State University, University Park, PA 16802, USA
| |
|
29
|
Vo NS, Phan V. Leveraging known genomic variants to improve detection of variants, especially close-by Indels. Bioinformatics 2018; 34:2918-2926. [PMID: 29590294 DOI: 10.1093/bioinformatics/bty183] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/15/2017] [Accepted: 03/23/2018] [Indexed: 12/30/2022]
Abstract
Motivation: The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However, Indels, especially those close to each other, are still hard to detect accurately. Results: We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our finding suggests that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.
Availability and implementation: Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nam S Vo
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Vinhthuy Phan
- Department of Computer Science, The University of Memphis, Memphis, TN, USA
| |
|
30
|
Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. [PMID: 29554215 PMCID: PMC6061703 DOI: 10.1093/bioinformatics/bty157] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 06/08/2017] [Revised: 11/25/2017] [Accepted: 03/14/2018] [Indexed: 12/27/2022]
Abstract
Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally, we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information: Supplementary data are available at Bioinformatics online.
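As an illustration of the point about lost long-range information, the following is a minimal sketch of a plain de Bruijn graph, not the McCortex implementation or the LdBG. The function name `build_de_bruijn` and the toy reads are assumptions made for this example. Nodes are (k-1)-mers and each k-mer in a read contributes one edge.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two different toy reads (hypothetical data) that share the same set of 3-mers.
graph_a = build_de_bruijn(["ACGTAC"], k=3)
graph_b = build_de_bruijn(["CGTACG"], k=3)

for node in sorted(graph_a):
    print(node, "->", sorted(graph_a[node]))
print("identical graphs:", graph_a == graph_b)
```

Both reads produce the same four edges, so the graph alone cannot tell which read it was built from. This is the kind of ambiguity that link annotations layered on top of the graph are meant to resolve.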
Affiliation(s)
- Isaac Turner
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Kiran V Garimella
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | - Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
| |
|
31
|
Sibbesen JA, Maretty L, Krogh A. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet 2018; 50:1054-1059. [PMID: 29915429 DOI: 10.1038/s41588-018-0145-5] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 06/15/2016] [Accepted: 04/20/2018] [Indexed: 12/30/2022]
Abstract
Genotype estimates from short-read sequencing data are typically based on the alignment of reads to a linear reference, but reads originating from more complex variants (for example, structural variants) often align poorly, resulting in biased genotype estimates. This bias can be mitigated by first collecting a set of candidate variants across discovery methods, individuals and databases, and then realigning the reads to the variants and reference simultaneously. However, this realignment problem has proved computationally difficult. Here, we present a new method (BayesTyper) that uses exact alignment of read k-mers to a graph representation of the reference and variants to efficiently perform unbiased, probabilistic genotyping across the variation spectrum. We demonstrate that BayesTyper generally provides superior variant sensitivity and genotyping accuracy relative to existing methods when used to integrate variants across discovery approaches and individuals. Finally, we demonstrate that including a 'variation-prior' database containing already known variants significantly improves sensitivity.
Affiliation(s)
- Jonas Andreas Sibbesen
- The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Lasse Maretty
- The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | - Anders Krogh
- The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| |
|
32
|
Valenzuela D, Norri T, Välimäki N, Pitkänen E, Mäkinen V. Towards pan-genome read alignment to improve variation calling. BMC Genomics 2018; 19:87. [PMID: 29764365 PMCID: PMC5954285 DOI: 10.1186/s12864-018-4465-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 01/12/2023]
Abstract
Background: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole-genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. Results: We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation, a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. Conclusions: Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.
Affiliation(s)
- Daniel Valenzuela
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland
| | - Tuukka Norri
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland
| | - Niko Välimäki
- Department of Medical and Clinical Genetics, Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland
| | - Esa Pitkänen
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Veli Mäkinen
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland.
| |
|
33
|
Jandrasits C, Dabrowski PW, Fuchs S, Renard BY. seq-seq-pan: building a computational pan-genome data structure on whole genome alignment. BMC Genomics 2018; 19:47. [PMID: 29334898 PMCID: PMC5769345 DOI: 10.1186/s12864-017-4401-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 11/06/2017] [Accepted: 12/19/2017] [Indexed: 12/15/2022]
Abstract
Background: The increasing application of next generation sequencing technologies has led to the availability of thousands of reference genomes, often providing multiple genomes for the same or closely related species. The current approach to represent a species or a population with a single reference sequence and a set of variations cannot represent their full diversity and introduces bias towards the chosen reference. There is a need for the representation of multiple sequences in a composite way that is compatible with existing data sources for annotation and suitable for established sequence analysis methods. At the same time, this representation needs to be easily accessible and extendable to account for the constant change of available genomes. Results: We introduce seq-seq-pan, a framework that provides methods for adding or removing new genomes from a set of aligned genomes and uses these to construct a whole genome alignment. Throughout the sequential workflow the alignment is optimized for generating a representative linear presentation of the aligned set of genomes, that enables its usage for annotation and in downstream analyses. Conclusions: By providing dynamic updates and optimized processing, our approach enables the usage of whole genome alignment in the field of pan-genomics. In addition, the sequential workflow can be used as a fast alternative to existing whole genome aligners for aligning closely related genomes. seq-seq-pan is freely available at https://gitlab.com/rki_bioinformatics. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4401-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
| | | | - Stephan Fuchs
- Robert Koch Institute, Wernigerode Branch, Burgstraße 37, Wernigerode, 38855, Germany
| | | |
|
34
|
Abstract
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
|
35
|
Abstract
Computational pan-genome analysis has emerged from the rapid increase of available genome sequencing data. Starting from a microbial pan-genome, the concept has spread to a variety of species, such as plants or viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions between a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences, advantages, and disadvantages between the tools, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability and availability of the tools.
Affiliation(s)
- Tina Zekic
- Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
| | - Guillaume Holley
- Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology, Bielefeld University, Bielefeld, Germany.
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
- International Research Training Group 1906, Bielefeld University, Bielefeld, Germany.
| |
|
36
|
Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 2017; 49:1654-1660. [PMID: 28945251 DOI: 10.1038/ng.3964] [Citation(s) in RCA: 144] [Impact Index Per Article: 20.6] [Received: 06/20/2017] [Accepted: 09/01/2017] [Indexed: 12/30/2022]
Abstract
A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.
|
37
|
Novak AM, Garrison E, Paten B. A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 2017; 12:18. [PMID: 28702075 PMCID: PMC5505026 DOI: 10.1186/s13015-017-0109-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 12/20/2016] [Accepted: 06/17/2017] [Indexed: 01/23/2023]
Abstract
We present a generalization of the positional Burrows-Wheeler transform, or PBWT, to genome graphs, which we call the gPBWT. A genome graph is a collapsed representation of a set of genomes described as a graph. In a genome graph, a haplotype corresponds to a restricted form of walk. The gPBWT is a compressible representation of a set of these graph-encoded haplotypes that allows for efficient subhaplotype match queries. We give efficient algorithms for gPBWT construction and query operations. As a demonstration, we use the gPBWT to quickly count the number of haplotypes consistent with random walks in a genome graph, and with the paths taken by mapped reads; results suggest that haplotype consistency information can be practically incorporated into graph-based read mappers. We estimate that, with the gPBWT, on the order of 100,000 diploid genomes, including all forms of structural variation, could be stored and made searchable for haplotype queries using a single large compute node.
Collapse
Affiliation(s)
- Adam M. Novak
- Genomics Institute, University of California Santa Cruz, CBSE, 501C Engineering 2, MS: CBSE, 1156 High St., Santa Cruz, CA 95064 USA
| | - Erik Garrison
- Wellcome Trust Sanger Institute, Cambridge, CB10 1SA UK
| | - Benedict Paten
- Genomics Institute, University of California Santa Cruz, CBSE, 501C Engineering 2, MS: CBSE, 1156 High St., Santa Cruz, CA 95064 USA
| |
|
38
|
Gopalakrishnan S, Samaniego Castruita JA, Sinding MHS, Kuderna LFK, Räikkönen J, Petersen B, Sicheritz-Ponten T, Larson G, Orlando L, Marques-Bonet T, Hansen AJ, Dalén L, Gilbert MTP. The wolf reference genome sequence (Canis lupus lupus) and its implications for Canis spp. population genomics. BMC Genomics 2017; 18:495. [PMID: 28662691 PMCID: PMC5492679 DOI: 10.1186/s12864-017-3883-3] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Received: 12/12/2016] [Accepted: 06/20/2017] [Indexed: 12/20/2022]
Abstract
Background: An increasing number of studies are addressing the evolutionary genomics of dog domestication, principally through resequencing dog, wolf and related canid genomes. There is, however, only one de novo assembled canid genome currently available against which to map such data: that of a boxer dog (Canis lupus familiaris). We generated the first de novo wolf genome (Canis lupus lupus) as an additional choice of reference, and explored what implications may arise when previously published dog and wolf resequencing data are remapped to this reference. Results: Reassuringly, we find that regardless of the reference genome choice, most evolutionary genomic analyses yield qualitatively similar results, including those exploring the structure between the wolves and dogs using admixture and principal component analysis. However, we do observe differences in the genomic coverage of re-mapped samples, the number of variants discovered, and heterozygosity estimates of the samples. Conclusion: The choice of reference is dictated by the aims of the study being undertaken; if the study focuses on the differences between the different dog breeds or the fine structure among dogs, then using the boxer reference genome is appropriate, but if the aim of the study is to look at the variation within wolves and their relationships to dogs, then there are clear benefits to using the de novo assembled wolf reference genome. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3883-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Shyam Gopalakrishnan
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
| | - Jose A Samaniego Castruita
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
| | - Mikkel-Holger S Sinding
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark; Natural History Museum, University of Oslo, N-0318, Oslo, Norway
| | - Lukas F K Kuderna
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003, Barcelona, Spain; CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028, Barcelona, Spain
| | - Jannikke Räikkönen
- Department of Environmental Research and Monitoring, Swedish Museum of Natural History, Box 50007, 10405, Stockholm, Sweden
| | - Bent Petersen
- Department of Bio and Health Informatics, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
| | - Thomas Sicheritz-Ponten
- Department of Bio and Health Informatics, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
| | - Greger Larson
- Palaeogenomics & Bio-Archaeology Research Network, Research Laboratory for Archaeology and the History of Art, University of Oxford, OX1 3QY, Oxford, UK
| | - Ludovic Orlando
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
| | - Tomas Marques-Bonet
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003, Barcelona, Spain; CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028, Barcelona, Spain; Catalan Institution of Research and Advanced Studies (ICREA), Passeig de Lluís Companys, 23, 08010, Barcelona, Spain
| | - Anders J Hansen
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
| | - Love Dalén
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, 10405, Stockholm, Sweden
| | - M Thomas P Gilbert
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark; Trace and Environmental DNA Laboratory, Department of Environment and Agriculture, Curtin University, Perth, Western Australia, Australia; NTNU University Museum, Norwegian University of Science and Technology, Trondheim, Norway
| |
|
39
|
Abstract
The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures (which we collectively refer to as genome graphs) and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
Affiliation(s)
- Benedict Paten
- Genomics Institute, CBSE, 501C Engineering 2, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Adam M Novak
- Genomics Institute, CBSE, 501C Engineering 2, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Jordan M Eizenga
- Genomics Institute, CBSE, 501C Engineering 2, University of California Santa Cruz, Santa Cruz, California 95064, USA
| | - Erik Garrison
- Wellcome Trust Sanger Institute, Cambridge CB10 1SA, United Kingdom
| |
|
40
|
Beller T, Ohlebusch E. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms Mol Biol 2016; 11:20. [PMID: 27437028 PMCID: PMC4950428 DOI: 10.1186/s13015-016-0083-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 05/06/2016] [Accepted: 07/01/2016] [Indexed: 12/21/2022]
Abstract
BACKGROUND: Recently, Marcus et al. (Bioinformatics 30:3476-83, 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an O(n log g) time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Baier et al. (Bioinformatics 32:497-504, 2016) improved their result. RESULTS: In this paper, we propose a new space-efficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g. an allele, a variant form of a gene) within the pan-genome. The ability to search within the pan-genome graph is of utmost importance and is a design goal of pan-genome data structures.
Affiliation(s)
- Timo Beller
- Institute of Theoretical Computer Science, Ulm University, James-Franck-Ring O27/537, 89069 Ulm, Germany
| | - Enno Ohlebusch
- Institute of Theoretical Computer Science, Ulm University, James-Franck-Ring O27/537, 89069 Ulm, Germany
|
41
|
Liu B, Guo H, Brudno M, Wang Y. deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 2016; 32:3224-3232. [PMID: 27378303 DOI: 10.1093/bioinformatics/btw371] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 11/30/2015] [Accepted: 06/05/2016] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION As high-throughput sequencing (HTS) technology becomes ubiquitous and the volume of data continues to rise, HTS read alignment is becoming increasingly rate-limiting, which keeps pressing the development of novel read alignment approaches. Moreover, promising novel applications of HTS technology require aligning reads to multiple genomes instead of a single reference; however, it is still not viable for the state-of-the-art aligners to align large numbers of reads to multiple genomes. RESULTS We propose de Bruijn Graph-based Aligner (deBGA), an innovative graph-based seed-and-extension algorithm to align HTS reads to a reference genome that is organized and indexed using a de Bruijn graph. With its well-handling of repeats, deBGA is substantially faster than state-of-the-art approaches while maintaining similar or higher sensitivity and accuracy. This makes it particularly well-suited to handle the rapidly growing volumes of sequencing data. Furthermore, it provides a promising solution for aligning reads to multiple genomes and graph-based references in HTS applications. AVAILABILITY AND IMPLEMENTATION deBGA is available at: https://github.com/hitbc/deBGA CONTACT: ydwang@hit.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bo Liu
- Center for Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Hongzhe Guo
- Center for Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Michael Brudno
- Department of Computer Science, University of Toronto, ON M5S 3G4, Canada
- Genetics and Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 1L7, Canada
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON M5G 1L7, Canada
| | - Yadong Wang
- Center for Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
|
42
|
Limasset A, Cazaux B, Rivals E, Peterlongo P. Read mapping on de Bruijn graphs. BMC Bioinformatics 2016; 17:237. [PMID: 27306641 PMCID: PMC4910249 DOI: 10.1186/s12859-016-1103-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 12/09/2015] [Accepted: 05/26/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. CONCLUSIONS Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping, even for complex eukaryotic data.
Affiliation(s)
- Antoine Limasset
- IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
| | - Bastien Cazaux
- L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
- Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
| | - Eric Rivals
- L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
- Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
| | - Pierre Peterlongo
- IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
|
43
|
Holley G, Wittler R, Stoye J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol Biol 2016; 11:3. [PMID: 27087830 PMCID: PMC4832552 DOI: 10.1186/s13015-016-0066-8] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Received: 12/08/2015] [Accepted: 03/31/2016] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND High-throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. RESULTS In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure allows storing and compressing a set of colored k-mers, and also traversing the graph efficiently. The Bloom Filter Trie was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. CONCLUSION We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance compared to another state-of-the-art data structure. AVAILABILITY https://www.github.com/GuillaumeHolley/BloomFilterTrie.
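The definition of a colored de Bruijn graph given in this abstract can be illustrated with a minimal Python sketch. It is not the Bloom Filter Trie itself: it uses a plain dictionary in place of the paper's compressed trie, and the names `colored_kmers` and `successors` and the toy genomes are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map every k-mer to the set of genome names (its colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(kmer, colors, alphabet="ACGT"):
    """List the k-mers present in the graph that overlap `kmer` by k-1 characters."""
    return [kmer[1:] + c for c in alphabet if kmer[1:] + c in colors]


# Toy pan-genome of two sequences that differ at one position (assumed data).
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
graph = colored_kmers(genomes, k=3)
shared = graph["ACG"]          # k-mer present in both genomes
branch = successors("CGT", graph)  # the graph branches where the genomes differ
```

In this toy input the k-mer `ACG` carries both colors, while `CGT` has two successors, one per genome, which is where the variation shows up as a branch in the graph.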
|
44
|
Yorukoglu D, Yu YW, Peng J, Berger B. Compressive mapping for next-generation sequencing. Nat Biotechnol 2016; 34:374-6. [PMID: 27054987 PMCID: PMC5080835 DOI: 10.1038/nbt.3511] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 12/30/2022]
Affiliation(s)
- Deniz Yorukoglu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Yun William Yu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Jian Peng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
|
45
|
Kagale S, Koh C, Clarke WE, Bollina V, Parkin IAP, Sharpe AG. Analysis of Genotyping-by-Sequencing (GBS) Data. Methods Mol Biol 2016; 1374:269-284. [PMID: 26519412 DOI: 10.1007/978-1-4939-3167-5_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
The development of genotyping-by-sequencing (GBS) to rapidly detect nucleotide variation at the whole genome level, in many individuals simultaneously, has provided a transformative genetic profiling technique. GBS can be carried out in species with or without reference genome sequences and yields huge amounts of potentially informative data. One limitation of the approach is the paucity of tools to transform the raw data into a format that can be easily interrogated at the genetic level. In this chapter we describe bioinformatics tools developed to address this shortfall, together with experimental design considerations to fully leverage the power of GBS for genetic analysis.
Affiliation(s)
- Sateesh Kagale
- National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
| | - Chushin Koh
- National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
| | - Wayne E Clarke
- Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
| | - Venkatesh Bollina
- Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
| | - Isobel A P Parkin
- Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
| | - Andrew G Sharpe
- National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
|
46
|
Maciuca S, del Ojo Elias C, McVean G, Iqbal Z. A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. LECTURE NOTES IN COMPUTER SCIENCE 2016. [DOI: 10.1007/978-3-319-43681-4_18] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 12/29/2022]
|
47
|
Schröder J, Girirajan S, Papenfuss AT, Medvedev P. Improving the Power of Structural Variation Detection by Augmenting the Reference. PLoS One 2015; 10:e0136771. [PMID: 26322511 PMCID: PMC4556445 DOI: 10.1371/journal.pone.0136771] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 04/28/2015] [Accepted: 08/07/2015] [Indexed: 11/18/2022] Open
Abstract
The uses of the Genome Reference Consortium’s human reference sequence can be roughly divided into three related but distinct categories: as a representative species genome, as a coordinate system for identifying variants, and as an alignment reference for variation detection algorithms. However, the use of this reference sequence simultaneously as a representative species genome and as an alignment reference leads to unnecessary artifacts for structural variation detection algorithms and limits their accuracy. We show how decoupling these two references and developing a separate alignment reference can significantly improve the accuracy of structural variation detection, lead to improved genotyping of disease-related genes, and decrease the cost of studying polymorphism in a population.
Affiliation(s)
- Jan Schröder
- The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia; Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Santhosh Girirajan
- Genomic Sciences Institute of the Huck, The Pennsylvania State University, State College, United States of America; Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States of America
| | - Anthony T Papenfuss
- The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia; Department of Medical Biology, The University of Melbourne, Melbourne, Australia; Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Paul Medvedev
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, United States of America; Genomic Sciences Institute of the Huck, The Pennsylvania State University, State College, United States of America; Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States of America
|
48
|
Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. Improved genome inference in the MHC using a population reference graph. Nat Genet 2015; 47:682-8. [PMID: 25915597 PMCID: PMC4449272 DOI: 10.1038/ng.3257] [Citation(s) in RCA: 116] [Impact Index Per Article: 12.9] [Received: 07/02/2014] [Accepted: 03/03/2015] [Indexed: 12/21/2022]
Abstract
Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.
Affiliation(s)
- Alexander Dilthey
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Charles Cox
- Department of Quantitative Sciences, GlaxoSmithKline, Stevenage, UK
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Matthew R Nelson
- Department of Quantitative Sciences, GlaxoSmithKline, Research Triangle Park, North Carolina, USA
| | - Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
|
49
|
Ames SK, Gardner SN, Marti JM, Slezak TR, Gokhale MB, Allen JE. Using populations of human and microbial genomes for organism detection in metagenomes. Genome Res 2015; 25:1056-67. [PMID: 25926546 PMCID: PMC4484388 DOI: 10.1101/gr.184879.114] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/24/2014] [Accepted: 04/28/2015] [Indexed: 12/16/2022]
Abstract
Identifying causative disease agents in human patients from shotgun metagenomic sequencing (SMS) presents a powerful tool to apply when other targeted diagnostics fail. Numerous technical challenges remain, however, before SMS can move beyond the role of research tool. Accurately separating the known and unknown organism content remains difficult, particularly when SMS is applied as a last resort. The true amount of human DNA that remains in a sample after screening against the human reference genome and filtering nonbiological components left from library preparation has previously been underreported. In this study, we create the most comprehensive collection of microbial and reference-free human genetic variation available in a database optimized for efficient metagenomic search by extracting sequences from GenBank and the 1000 Genomes Project. The results reveal new human sequences found in individual Human Microbiome Project (HMP) samples. Individual samples contain up to 95% human sequence, and 4% of the individual HMP samples contain 10% or more human reads. Left unidentified, human reads can complicate and slow down further analysis and lead to inaccurately labeled microbial taxa and ultimately lead to privacy concerns as more human genome data is collected.
Affiliation(s)
- Sasha K Ames
- Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
| | - Shea N Gardner
- Global Security Computer Applications Division, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
| | - Tom R Slezak
- Global Security Computer Applications Division, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
| | - Maya B Gokhale
- Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
| | - Jonathan E Allen
- Global Security Computer Applications Division, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
|
50
|
Bloom Filter Trie – A Data Structure for Pan-Genome Storage. LECTURE NOTES IN COMPUTER SCIENCE 2015. [DOI: 10.1007/978-3-662-48221-6_16] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 12/17/2022]
|