1
Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2024]
Abstract
The De Bruijn graph (DBG) has been widely used in algorithms for indexing or organizing read and reference sequences in bioinformatics. However, a DBG model that can locate each node, edge and path on the sequence has not been proposed so far. Recently, the DBG has been used to represent reference sequences in read mapping tasks. In this setting, the paths of the DBG and the substrings of the reference sequence are not in one-to-one correspondence. This gives rise to false paths on the DBG, that is, paths that no substring of the reference produces. Moreover, if a candidate path of a read is true, we need to locate it and verify the candidate on the sequence. To solve these problems, we proposed a DBG model, called MiniDBG, which stores the position lists of a minimal set of edges. With the position lists, MiniDBG can locate any node, edge and path efficiently. We also proposed algorithms for generating MiniDBG from an original DBG and algorithms for locating edges or paths on the sequence. We designed and ran experiments on real datasets to compare them with BWT-based and position list-based methods. The experimental results show that MiniDBG can locate the edges and paths efficiently with lower memory costs.
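The false-path problem described in this abstract can be illustrated with a minimal sketch. It is not the MiniDBG method: it stores a full position list for every edge (k-mer) instead of a minimal set, and the function names (`edge_positions`, `is_path`, `locate_path`, `spell`) and the toy reference are assumptions made for illustration only.

```python
from collections import defaultdict


def edge_positions(ref, k):
    """Map every k-mer (an edge of the DBG) to its start positions in ref."""
    pos = defaultdict(list)
    for i in range(len(ref) - k + 1):
        pos[ref[i:i + k]].append(i)
    return dict(pos)


def spell(path):
    """Spell the string of a path given as a list of overlapping k-mers."""
    return path[0] + "".join(edge[-1] for edge in path[1:])


def is_path(path, pos):
    """A path is in the graph if every edge exists and neighbours overlap by k-1."""
    edges_exist = all(edge in pos for edge in path)
    overlaps = all(a[1:] == b[:-1] for a, b in zip(path, path[1:]))
    return edges_exist and overlaps


def locate_path(path, pos):
    """Return the start positions at which the whole path occurs in the reference."""
    starts = set(pos.get(path[0], []))
    for offset, edge in enumerate(path[1:], start=1):
        # The edge at index `offset` must start exactly `offset` bases later.
        starts &= {p - offset for p in pos.get(edge, [])}
    return sorted(starts)


if __name__ == "__main__":
    ref = "ACGTCGA"               # toy reference (hypothetical)
    pos = edge_positions(ref, 3)  # edges are 3-mers, nodes are 2-mers

    false_path = ["ACG", "CGA"]   # both edges exist and overlap at node "CG"
    print(is_path(false_path, pos))      # True: the path is in the graph
    print(spell(false_path) in ref)      # False: "ACGA" is not in the reference
    print(locate_path(false_path, pos))  # []: the position lists expose the false path

    true_path = ["ACG", "CGT"]
    print(locate_path(true_path, pos))   # [0]: "ACGT" occurs at position 0
```

Intersecting shifted position lists is what separates a true path from a false one here. According to the abstract, MiniDBG reaches the same goal while storing position lists for only a minimal set of edges, which this sketch does not attempt.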
2
Chin CS, Behera S, Khalak A, Sedlazeck FJ, Sudmant PH, Wagner J, Zook JM. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 2023; 20:1213-1221. [PMID: 37365340 PMCID: PMC10406601 DOI: 10.1038/s41592-023-01914-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 08/10/2022] [Accepted: 05/17/2023] [Indexed: 06/28/2023]
Abstract
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
Affiliation(s)
- Chen-Shan Chin
  - GeneDX, Stamford, CT, USA
  - Foundation of Biological Data Science, Belmont, CA, USA
- Sairam Behera
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Asif Khalak
  - Foundation of Biological Data Science, Belmont, CA, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Peter H Sudmant
  - Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
3
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 2022; 23:190. [PMID: 36076275 PMCID: PMC9454175 DOI: 10.1186/s13059-022-02743-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/17/2022] [Accepted: 08/01/2022] [Indexed: 11/13/2022]
Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17-23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54-58 h, using considerably more memory.
Affiliation(s)
- Jamshed Khan
  - Department of Computer Science, University of Maryland, College Park, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
- Marek Kokot
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Sebastian Deorowicz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
4
Yu C, Mao K, Zhao Y, Chang C, Wang G. StLiter: A Novel Algorithm to Iteratively Build the Compacted de Bruijn Graph From Many Complete Genomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2471-2483. [PMID: 33630738 DOI: 10.1109/tcbb.2021.3062068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
Recently, the compacted de Bruijn graph (cDBG) of complete genome sequences was successfully used in read mapping due to its ability to deal with the repetitions in genomes. However, current approaches are not flexible enough to support frequently rebuilding the graphs with different k-mer lengths. Instead of building the graph directly, how can we build the compacted de Bruijn graph of a longer k-mer based on that of a short k-mer? In this article, we present StLiter, a novel algorithm to build the compacted de Bruijn graph either directly from genome sequences or indirectly based on the graph of a short k-mer. For 100 simulated human genomes, StLiter can construct the graph of k-mer length 15-18 in 2.5-3.2 hours with a maximum of ∼70GB memory when the reverse complements of the reference genomes are not considered, and it takes 4.5-5.9 hours when the reverse complements are considered. In experiments, we compared StLiter with TwoPaCo, the state-of-the-art method for building the graph, on 4 datasets. For k-mer length 15-18, StLiter can build the graph 5-9 times faster than TwoPaCo with a lower maximal memory cost. For k-mer length larger than 18, given the graph of a short (k-x)-mer, such as x = 1-2, StLiter can also build the graph more efficiently than TwoPaCo building the graph directly. The source code of StLiter can be downloaded from https://github.com/BioLab-cz/StLiter.
5
Dufault‐Thompson K, Jiang X. Applications of de Bruijn graphs in microbiome research. iMeta 2022; 1:e4. [PMID: 38867733 PMCID: PMC10989854 DOI: 10.1002/imt2.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Revised: 01/24/2022] [Accepted: 01/24/2022] [Indexed: 06/14/2024]
Abstract
High-throughput sequencing has become an increasingly central component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the broader adoption of sequencing as part of biological studies. Recent advances in the construction and representation of de Bruijn graphs have led to new approaches that utilize the de Bruijn graph data structure to aid in different biological analyses. One type of application of these methods has been in alternative approaches to the assembly of sequencing data like gene-targeted assembly, where only gene sequences are assembled out of larger metagenomes, and differential assembly, where sequences that are differentially present between two samples are assembled. de Bruijn graphs have also been applied for comparative genomics where they can be used to represent large sets of multiple genomes or metagenomes where structural features in the graphs can be used to identify variants, indels, and homologous regions in sequences. These de Bruijn graph-based representations of sequencing data have even begun to be applied to whole sequencing databases for large-scale searches and experiment discovery. de Bruijn graphs have played a central role in how high-throughput sequencing data is worked with, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biology in the future.
Affiliation(s)
- Keith Dufault‐Thompson
  - Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Xiaofang Jiang
  - Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
6
Biderre-Petit C, Taib N, Gardon H, Hochart C, Debroas D. New insights into the pelagic microorganisms involved in the methane cycle in the meromictic Lake Pavin through metagenomics. FEMS Microbiol Ecol 2020; 95:5092586. [PMID: 30203066 DOI: 10.1093/femsec/fiy183] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/13/2018] [Accepted: 09/06/2018] [Indexed: 11/13/2022]
Abstract
Advances in metagenomics have given rise to the possibility of obtaining genome sequences from uncultured microorganisms, even for those poorly represented in the microbial community, thereby providing an important means to study their ecology and evolution. In this study, metagenomic sequencing was carried out at four sampling depths having different oxygen concentrations or environmental conditions in the water column of Lake Pavin. By analyzing the sequenced reads and matching the contigs to the proxy genomes of the closest cultivated relatives, we evaluated the metabolic potential of the dominant planktonic species involved in the methane cycle. We demonstrated that methane-producing communities were dominated by the genus Methanoregula while methane-consuming communities were dominated by the genus Methylobacter, thus confirming prior observations. Our work allowed the reconstruction of a draft of their core metabolic pathways. Hydrogenotrophs, the genes required for acetate activation in the methanogen genome, were also detected. Regarding methanotrophy, Methylobacter was present in the same areas as the non-methanotrophic, methylotrophic Methylotenera, which could suggest a relationship between these two groups. Furthermore, the presence of a large gene inventory for nitrogen metabolism (nitrate transport, denitrification, nitrite assimilation and nitrogen fixation, for instance) was detected in the Methylobacter genome.
Affiliation(s)
- Corinne Biderre-Petit
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Génome et Environnement, F-63000 Clermont-Ferrand, France
- Najwa Taib
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Génome et Environnement, F-63000 Clermont-Ferrand, France
- Hélène Gardon
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Génome et Environnement, F-63000 Clermont-Ferrand, France
- Corentin Hochart
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Génome et Environnement, F-63000 Clermont-Ferrand, France
- Didier Debroas
  - Université Clermont Auvergne, CNRS, Laboratoire Microorganismes: Génome et Environnement, F-63000 Clermont-Ferrand, France
7
Minkin I, Pham S, Medvedev P. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics 2018; 33:4024-4032. [PMID: 27659452 DOI: 10.1093/bioinformatics/btw609] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 04/03/2016] [Accepted: 09/16/2016] [Indexed: 01/06/2023]
Abstract
Motivation: De Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: ium125@psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ilia Minkin
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Son Pham
  - BioTuring Inc., San Diego, CA 92121, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
  - Department of Biochemistry and Molecular Biology
  - Genomic Sciences Institute of the Huck, The Pennsylvania State University, University Park, PA 16802, USA
8
Pinevich AV, Andronov EE, Pershina EV, Pinevich AA, Dmitrieva HY. Testing culture purity in prokaryotes: criteria and challenges. Antonie van Leeuwenhoek 2018; 111:1509-1521. [PMID: 29488181 DOI: 10.1007/s10482-018-1054-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/30/2017] [Accepted: 02/21/2018] [Indexed: 01/05/2023]
Abstract
Reliance on pure cultures was introduced at the beginning of microbiology as a discipline and has remained significant although their adaptive properties are essentially dissimilar from those of mixed cultures and environmental populations. They are needed for (i) taxonomic identification; (ii) diagnostics of pathogens; (iii) virulence and pathogenicity studies; (iv) elucidation of metabolic properties; (v) testing sensitivity to antibiotics; (vi) full-length genome assembly; (vii) strain deposition in microbial collections; and (viii) description of new species with name validation. Depending on the specific task there are alternative claims for culture purity, i.e., when conventional criteria are satisfied or when looking deeper is necessary. Conventional proof (microscopic and plating controls) has a low resolution and depends on the observer's personal judgement. Phenotypic criteria alone cannot prove culture purity and should be complemented with genomic criteria. We consider the possible use of DNA high-throughput culture sequencing data to define criteria for only one genospecies, axenic state detection panel and only one genome. The second and third of these are preferable, although their resolving capacity (depth) is limited. Because minor contaminants may go undetected, even with deep sequencing, the reliably pure culture would be a clonal culture launched from a single cell or trichome (multicellular bacterium). Although this type of culture is associated with technical difficulties and cannot be employed on a large scale (the corresponding inoculums may have low chances of growth when transferred to solid media), it is hoped that the high-throughput culturing methods introduced by 'culturomics' will overcome this obstacle.
Affiliation(s)
- Alexander V Pinevich
  - Saint Petersburg State University, Universitetskaya Quay, 7/9, P.O. Box 199034, St. Petersburg, Russia
- Eugeny E Andronov
  - All-Russia Research Institute for Agricultural Microbiology (ARRIAM), Russian Academy of Sciences, Podbelskogo Highway, 3, P.O. Box 196608, St. Petersburg-Pushkin, Russia
- Elizaveta V Pershina
  - All-Russia Research Institute for Agricultural Microbiology (ARRIAM), Russian Academy of Sciences, Podbelskogo Highway, 3, P.O. Box 196608, St. Petersburg-Pushkin, Russia
- Agnia A Pinevich
  - Saint Petersburg State University, Universitetskaya Quay, 7/9, P.O. Box 199034, St. Petersburg, Russia
- Helena Y Dmitrieva
  - Saint Petersburg State University, Universitetskaya Quay, 7/9, P.O. Box 199034, St. Petersburg, Russia
9
Limasset A, Cazaux B, Rivals E, Peterlongo P. Read mapping on de Bruijn graphs. BMC Bioinformatics 2016; 17:237. [PMID: 27306641 PMCID: PMC4910249 DOI: 10.1186/s12859-016-1103-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 12/09/2015] [Accepted: 05/26/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping, even for complex eukaryotic data.
Affiliation(s)
- Antoine Limasset
  - IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
- Bastien Cazaux
  - L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
  - Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
- Eric Rivals
  - L.I.R.M.M., UMR 5506, Université de Montpellier et CNRS, 860 rue de St Priest, Montpellier Cedex 5, F-34392, France
  - Institut Biologie Computationnelle, Université de Montpellier, Montpellier, F-34392, France
- Pierre Peterlongo
  - IRISA Inria Rennes Bretagne Atlantique, GenScale team, Campus de Beaulieu, Rennes, 35042, France
10
Bao G, Wang M, Doak TG, Ye Y. Strand-specific community RNA-seq reveals prevalent and dynamic antisense transcription in human gut microbiota. Front Microbiol 2015; 6:896. [PMID: 26388849 PMCID: PMC4555090 DOI: 10.3389/fmicb.2015.00896] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 04/28/2015] [Accepted: 08/17/2015] [Indexed: 01/17/2023]
Abstract
Metagenomics and other meta-omics approaches (including metatranscriptomics) provide insights into the composition and function of microbial communities living in different environments or animal hosts. Metatranscriptomics research provides an unprecedented opportunity to examine gene regulation for many microbial species simultaneously, and more importantly, for the majority that are unculturable microbial species, in their natural environments (or hosts). Current analyses of metatranscriptomic datasets focus on the detection of gene expression levels and the study of the relationship between changes of gene expression and changes of environment. As a demonstration of utilizing metatranscriptomics beyond these common analyses, we developed a computational and statistical procedure to analyze the antisense transcripts in strand-specific metatranscriptomic datasets. Antisense RNAs encoded on the DNA strand opposite a gene’s CDS have the potential to form extensive base-pairing interactions with the corresponding sense RNA, and can have important regulatory functions. Most studies of antisense RNAs in bacteria are rather recent, are mostly based on transcriptome analysis, and have been applied mainly to single bacterial species. Application of our approaches to human gut-associated metatranscriptomic datasets allowed us to survey antisense transcription for a large number of bacterial species associated with human beings. The ratio of protein coding genes with antisense transcription ranges from 0 to 35.8% (median = 10.0%) among 47 species. Our results show that antisense transcription is dynamic, varying between human individuals. Functional enrichment analysis revealed a preference of certain gene functions for antisense transcription, and transposase genes are among the most prominent ones (but we also observed antisense transcription in bacterial house-keeping genes).
Affiliation(s)
- Guanhui Bao
  - School of Informatics and Computing, Indiana University Bloomington, IN, USA
- Mingjie Wang
  - School of Informatics and Computing, Indiana University Bloomington, IN, USA
- Thomas G Doak
  - Department of Biology, Indiana University Bloomington, IN, USA
  - National Center for Genome Analysis Support, Indiana University Bloomington, IN, USA
- Yuzhen Ye
  - School of Informatics and Computing, Indiana University Bloomington, IN, USA
11
Ye Y, Tang H. Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 2015; 32:1001-8. [PMID: 26319390 PMCID: PMC4896364 DOI: 10.1093/bioinformatics/btv510] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 03/10/2015] [Accepted: 08/24/2015] [Indexed: 11/26/2022]
Abstract
Motivation: Metagenomics research has accelerated the studies of microbial organisms, providing insights into the composition and potential functionality of various microbial communities. Metatranscriptomics (studies of the transcripts from a mixture of microbial species) and other meta-omics approaches hold even greater promise for providing additional insights into functional and regulatory characteristics of the microbial communities. Current metatranscriptomics projects are often carried out without matched metagenomic datasets (of the same microbial communities). For the projects that produce both metatranscriptomic and metagenomic datasets, their analyses are often not integrated. Metagenome assemblies are far from perfect, partially explaining why metagenome assemblies are not used for the analysis of metatranscriptomic datasets. Results: Here, we report a read mapping algorithm for mapping short reads onto a de Bruijn graph of assemblies. A hash table of junction k-mers (k-mers spanning branching structures in the de Bruijn graph) is used to facilitate fast mapping of reads to the graph. We developed an application of this mapping algorithm: a reference-based approach to metatranscriptome assembly using graphs of metagenome assembly as the reference. Our results show that this new approach (called TAG) helps to assemble substantially more transcripts that otherwise would have been missed or truncated because of the fragmented nature of the reference metagenome. Availability and implementation: TAG was implemented in C++ and has been tested extensively on the Linux platform. It is available for download as open source at http://omics.informatics.indiana.edu/TAG. Contact: yye@indiana.edu
Affiliation(s)
- Yuzhen Ye
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
- Haixu Tang
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
12
Lee H, Popodi E, Foster PL, Tang H. Detection of structural variants involving repetitive regions in the reference genome. J Comput Biol 2014; 21:219-33. [PMID: 24552580 DOI: 10.1089/cmb.2013.0129] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 01/11/2023]
Abstract
Next-generation sequencing techniques are now commonly used to characterize structural variations (SVs) in population genomics and elucidate their associations with phenotypes. Many of the computational tools developed for detecting structural variations work by mapping paired-end reads to a reference genome and identifying the discordant read-pairs whose mapped loci in the reference genome deviate from the expected insert size and orientation. However, repetitive regions in the reference genome represent a major challenge in SV detection, because the paired-end reads from these regions may be mapped to multiple loci in the reference genome, resulting in spuriously discordant read-pairs. To address this issue, we have developed an algorithmic approach for read mapping and SV detection based on the framework of A-Bruijn graphs. Instead of mapping reads to a linear sequence of the reference genome, we propose to map reads onto the A-Bruijn graph constructed from the reference genome in which all instances of the same repeat are collapsed into a single edge. As a result, any given read, either from repetitive regions or not, will be mapped to a unique location in the A-Bruijn graph, and each discordant read-pair in the A-Bruijn graph indicates a potentially true SV event. We also developed a simple clustering algorithm to derive valid clusters of these discordant read-pairs, each supporting a different SV event. Finally, we demonstrate the performance of this approach, compared to existing approaches, by identifying transposition events of insertion sequence (IS) elements, a class of simple mobile genetic elements (MGEs), in E. coli by using simulated and real paired-end sequence data acquired from E. coli mutation accumulation lines.
Affiliation(s)
- Heewook Lee
  - School of Informatics and Computing, Indiana University, Bloomington, Indiana
13
Scherer A. Clinical and ethical considerations of massively parallel sequencing in transplantation science. World J Transplant 2013; 3:62-67. [PMID: 24392310 PMCID: PMC3879525 DOI: 10.5500/wjt.v3.i4.62] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/25/2013] [Revised: 08/16/2013] [Accepted: 10/12/2013] [Indexed: 02/05/2023]
Abstract
Massively parallel sequencing (MPS), alias next-generation sequencing, is making its way from research laboratories into applied sciences and clinics. MPS is a framework of experimental procedures which offer possibilities for genome research and genetics which could only be dreamed of until around 2005 when these technologies became available. Sequencing of a transcriptome, exome, even entire genomes is now possible within a time frame and precision that we could only hope for 10 years ago. Linking other experimental procedures with MPS enables researchers to study secondary DNA modifications across the entire genome, and protein binding sites, to name a few applications. How the advancements of sequencing technologies can contribute to transplantation science is subject of this discussion: immediate applications are in graft matching via human leukocyte antigen sequencing, as part of systems biology approaches which shed light on gene expression processes during immune response, as biomarkers of graft rejection, and to explore changes of microbiomes as a result of transplantation. Of considerable importance is the socio-ethical aspect of data ownership, privacy, informed consent, and result report to the study participant. While the technology is advancing rapidly, legislation is lagging behind due to the globalisation of data requisition, banking and sharing.
14
Jakupciak JP, Wells JM, Karalus RJ, Pawlowski DR, Lin JS, Feldman AB. Population-Sequencing as a Biomarker of Burkholderia mallei and Burkholderia pseudomallei Evolution through Microbial Forensic Analysis. J Nucleic Acids 2013; 2013:801505. [PMID: 24455204 PMCID: PMC3877622 DOI: 10.1155/2013/801505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2013] [Revised: 10/01/2013] [Accepted: 10/02/2013] [Indexed: 11/18/2022]
Abstract
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations.
Affiliation(s)
- Jeffrey S. Lin
  - The Johns Hopkins University, Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723, USA
- Andrew B. Feldman
  - The Johns Hopkins University, Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723, USA