1
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024]
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
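The role of rank queries in an FM index can be illustrated with a toy backward search. This is a minimal sketch, not Centrifuger's run-block compression: the BWT is built naively from sorted rotations, rank is a linear scan over the uncompressed BWT, and the function names are hypothetical.

```python
from collections import Counter

def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rank(bwt_str: str, c: str, i: int) -> int:
    """Occurrences of c in bwt_str[:i]; a real index answers this from a compressed structure."""
    return bwt_str[:i].count(c)

def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Backward search: count occurrences of pattern in the indexed text."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(bwt_str, c, lo)
        hi = first[c] + rank(bwt_str, c, hi)
        if lo >= hi:
            return 0
    return hi - lo

index = bwt("ACGTACGTTACG")
print(count_occurrences(index, "ACG"))
```

Every step of the search costs two rank queries, which is why the speed of rank over the compressed BWT matters for classification throughput.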
Affiliation(s)
- Li Song
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
  - Department of Computer Science, Dartmouth College, Hanover, NH, USA
  - Department of Microbiology and Immunology, Dartmouth College, Hanover, NH, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
2
Lin MJ, Iyer S, Chen NC, Langmead B. Measuring, visualizing, and diagnosing reference bias with biastools. Genome Biol 2024; 25:101. [PMID: 38641647 PMCID: PMC11027314 DOI: 10.1186/s13059-024-03240-8] [Received: 09/13/2023] [Accepted: 04/04/2024] [Indexed: 04/21/2024]
Abstract
Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.
Affiliation(s)
- Mao-Jan Lin
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Sheila Iyer
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
3
Bonnie JK, Ahmed OY, Langmead B. DandD: Efficient measurement of sequence growth and similarity. iScience 2024; 27:109054. [PMID: 38361606 PMCID: PMC10867639 DOI: 10.1016/j.isci.2024.109054] [Received: 12/29/2023] [Revised: 01/11/2024] [Accepted: 01/23/2024] [Indexed: 02/17/2024]
Abstract
Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool addressing how much new sequence is gained when a sequence collection grows. DandD can describe how much structural variation is discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression and chiefly dependent on k-mer counts. DandD rapidly estimates δ using genomic sketches. We propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, thereby avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard.
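A small worked example may help make δ concrete. The sketch below assumes the usual definition δ = max over k of d_k / k, where d_k is the number of distinct k-mers, and computes it exactly by brute force rather than from sketches as DandD does. The Jaccard variant simply substitutes δ for set cardinalities in (|A| + |B| - |A ∪ B|) / |A ∪ B|; that substitution and all function names are assumptions made for illustration.

```python
def distinct_kmers(seqs, k):
    """Number of distinct k-mers across a collection of sequences."""
    return len({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})

def delta(seqs):
    """Exact delta = max_k d_k / k, trying every k up to the longest sequence."""
    longest = max(len(s) for s in seqs)
    return max(distinct_kmers(seqs, k) / k for k in range(1, longest + 1))

def delta_jaccard(a, b):
    """Jaccard-style similarity with delta in place of k-mer set cardinalities."""
    d_union = delta(a + b)
    return (delta(a) + delta(b) - d_union) / d_union

collection = ["ACGTACGTTACG"]
# Adding an exact copy contributes no new sequence, so delta is unchanged.
print(delta(collection), delta(collection + ["ACGTACGTTACG"]))
```

Because δ takes the maximum over all k, no single k has to be chosen in advance.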
Affiliation(s)
- Jessica K. Bonnie
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
4
Liebhoff AM, Venkataraman T, Morgenlander WR, Na M, Kula T, Waugh K, Morrison C, Rewers M, Longman R, Round J, Elledge S, Ruczinski I, Langmead B, Larman HB. Efficient encoding of large antigenic spaces by epitope prioritization with Dolphyn. Nat Commun 2024; 15:1577. [PMID: 38383452 PMCID: PMC10881494 DOI: 10.1038/s41467-024-45601-8] [Received: 07/14/2023] [Accepted: 01/26/2024] [Indexed: 02/23/2024]
Abstract
We investigate a relatively underexplored component of the gut-immune axis by profiling the antibody response to gut phages using Phage Immunoprecipitation Sequencing (PhIP-Seq). To cover large antigenic spaces, we develop Dolphyn, a method that uses machine learning to select peptides from protein sets and compresses the proteome through epitope-stitching. Dolphyn reduces the size of a peptide library by 78% compared to traditional tiling, while increasing the fraction of antibody-reactive peptides from 10% to 31%. We find that the immune system develops antibodies to viruses that infect human gut bacteria, particularly E. coli-infecting Myoviridae. Cost-effective PhIP-Seq libraries designed with Dolphyn enable the assessment of a wider range of proteins in a single experiment, thus facilitating the study of the gut-immune axis.
Affiliation(s)
- Anna-Maria Liebhoff
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Institute of Cell Engineering, Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- Thiagarajan Venkataraman
  - Institute of Cell Engineering, Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- William R Morgenlander
  - Institute of Cell Engineering, Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- Miso Na
  - Institute of Cell Engineering, Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- Tomasz Kula
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Division of Genetics, Department of Medicine, Howard Hughes Medical Institute, Brigham and Women's Hospital, Boston, MA, USA
- Kathleen Waugh
  - Barbara Davis Center for Diabetes, University of Colorado Denver, Aurora, CO, USA
- Charles Morrison
  - Behavioral, Clinical and Epidemiologic Sciences, FHI 360, Durham, NC, USA
- Marian Rewers
  - Barbara Davis Center for Diabetes, University of Colorado Denver, Aurora, CO, USA
- Randy Longman
  - Jill Roberts Institute for Research in IBD, Division of Gastroenterology and Hepatology, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
- June Round
  - Department of Pathology, Division of Microbiology and Immunology, University of Utah School of Medicine, Salt Lake City, UT, USA
- Stephen Elledge
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Division of Genetics, Department of Medicine, Howard Hughes Medical Institute, Brigham and Women's Hospital, Boston, MA, USA
- Ingo Ruczinski
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- H Benjamin Larman
  - Institute of Cell Engineering, Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
5
Abstract
Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios, i.e. (a) when the donor's variants are known and reads are simulated, (b) when donor variants are known and reads are real, and (c) when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.
Affiliation(s)
- Mao-Jan Lin
  - Department of Computer Science, Johns Hopkins University
- Sheila Iyer
  - Department of Computer Science, Johns Hopkins University
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
6
Zakeri M, Brown NK, Ahmed OY, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. bioRxiv 2024:2023.11.04.565615. [PMID: 37961660 PMCID: PMC10635132 DOI: 10.1101/2023.11.04.565615] [Indexed: 11/15/2023]
Abstract
Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While Movi's index is larger than the r-index, it grows at a slower rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification (such as pseudo-matching lengths and backward search) at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
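The locality argument can be illustrated with a toy table that has one row per BWT run, where LF-mapping is a row lookup followed by a short forward scan. This is a hedged sketch and not Movi's implementation: the function names are hypothetical, the BWT is built naively, and the run-splitting step that bounds the scan length (needed for the O(1) guarantee) is omitted.

```python
from bisect import bisect_right
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def naive_lf(L):
    """LF-mapping for every BWT position, used to build and to check the table."""
    counts = Counter(L)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    seen, lf = Counter(), []
    for c in L:
        lf.append(first[c] + seen[c])
        seen[c] += 1
    return lf

def build_move_table(L):
    """One row per run: start, length, LF(start), and the run that contains LF(start)."""
    lf = naive_lf(L)
    starts = [i for i in range(len(L)) if i == 0 or L[i] != L[i - 1]]
    lengths = [b - a for a, b in zip(starts, starts[1:] + [len(L)])]
    dests = [lf[p] for p in starts]
    dest_runs = [bisect_right(starts, d) - 1 for d in dests]
    return starts, lengths, dests, dest_runs

def move(table, run, offset):
    """LF step for the position at (run, offset), returned as (run, offset)."""
    starts, lengths, dests, dest_runs = table
    pos = dests[run] + offset       # LF is contiguous within a run
    r = dest_runs[run]
    while pos >= starts[r] + lengths[r]:
        r += 1                      # scan forward to the run containing pos
    return r, pos - starts[r]

L = bwt("GATTACAGATTACAGATTACA")
table = build_move_table(L)
print(len(L), "BWT characters in", len(table[0]), "runs")
```

The table has r rows regardless of text length, and each query touches one row plus its immediate successors, which is the access pattern that keeps cache misses low.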
Affiliation(s)
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University
- Omar Y Ahmed
  - Department of Computer Science, Johns Hopkins University
- Travis Gagie
  - Faculty of Computer Science, Dalhousie University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
7
Chen NC, Paulin LF, Sedlazeck FJ, Koren S, Phillippy AM, Langmead B. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods 2024; 21:41-49. [PMID: 38036856 DOI: 10.1038/s41592-023-02069-6] [Received: 04/27/2022] [Accepted: 10/09/2023] [Indexed: 12/02/2023]
Abstract
Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, Genome Reference Consortium human build 38 (GRCh38)) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are of lower quality.
Affiliation(s)
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Luis F Paulin
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
8
Dyjack N, Baker DN, Braverman V, Langmead B, Hicks SC. A scalable and unbiased discordance metric with H. Biostatistics 2023; 25:188-202. [PMID: 36063544 PMCID: PMC10724244 DOI: 10.1093/biostatistics/kxac035] [Received: 02/02/2022] [Revised: 06/27/2022] [Accepted: 08/07/2022] [Indexed: 11/15/2022]
Abstract
A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the "scale-agnostic" $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the $\mathtt{fasthplus}$ R package.
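The difference between the two metrics can be seen in a brute-force computation. The sketch below assumes the definitions $G_{+} = s_{-} / (N_d (N_d - 1) / 2)$ and $H_{+} = s_{-} / (|W| \cdot |B|)$, where $W$ and $B$ are the within-cluster and between-cluster pairwise dissimilarities, $N_d = |W| + |B|$, and $s_{-}$ counts pairs with a within-cluster dissimilarity larger than a between-cluster one. It is an exact quadratic-time illustration with hypothetical names, not the scalable estimator in $\mathtt{fasthplus}$, and it assumes both $W$ and $B$ are non-empty.

```python
from itertools import combinations

def discordance(points, labels, dist):
    """Return (G+, H+) for a labelled set of points under a dissimilarity function."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    # s_minus: (within, between) pairs where the within-cluster distance is larger
    s_minus = sum(1 for w in within for b in between if w > b)
    n_d = len(within) + len(between)
    g_plus = s_minus / (n_d * (n_d - 1) / 2)
    h_plus = s_minus / (len(within) * len(between))
    return g_plus, h_plus

points = [0.0, 1.0, 10.0, 11.0]
abs_diff = lambda a, b: abs(a - b)
print(discordance(points, [0, 0, 1, 1], abs_diff))  # well-separated labels
print(discordance(points, [0, 1, 0, 1], abs_diff))  # labels that cut across the clusters
```

Under these definitions only the denominator differs: $|W| \cdot |B|$ is the number of comparisons actually made, so $H_{+}$ reads directly as the fraction of discordant comparisons.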
Affiliation(s)
- Nathan Dyjack
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
- Daniel N Baker
  - Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA
- Vladimir Braverman
  - Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
9
Abstract
Pangenome indexes reduce reference bias in sequencing data analysis. However, a greater reduction in bias can be achieved using a personalized reference, e.g. a diploid human reference constructed to match a donor individual's alleles. We present a novel impute-first alignment framework that combines elements of genotype imputation and pangenome alignment. It begins by genotyping the individual from a subsample of the input reads. It next uses a reference panel and efficient imputation algorithm to impute a personalized diploid reference. Finally, it indexes the personalized reference and applies a read aligner, which could be a linear or graph aligner, to align the full read set to the personalized reference. This framework has higher variant-calling recall (99.54% vs. 99.37%), precision (99.36% vs. 99.18%), and F1 (99.45% vs. 99.28%) compared to a graph-based pangenome. The personalized reference is also smaller and faster to query compared to a pangenome index, making it an overall advantageous choice for whole-genome DNA sequencing experiments.
Affiliation(s)
- Taher Mun
  - Department of Computer Science, Johns Hopkins University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
10
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. bioRxiv 2023:2023.11.15.567129. [PMID: 38014029 PMCID: PMC10680779 DOI: 10.1101/2023.11.15.567129] [Indexed: 11/29/2023]
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
Affiliation(s)
- Li Song
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD
11
Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Sigmoni: classification of nanopore signal with a compressed pangenome index. bioRxiv 2023:2023.08.15.553308. [PMID: 37645873 PMCID: PMC10462034 DOI: 10.1101/2023.08.15.553308] [Indexed: 08/31/2023]
Abstract
Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling, but past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbp. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes.
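The quantization step can be pictured as binning each current measurement into a letter, so that a signal becomes a string that a text index can search. The sketch below is a minimal illustration only: the range bounds, the number of bins, the uniform bin width, and the function name are all hypothetical and are not Sigmoni's actual parameters or preprocessing.

```python
def quantize(signal_pa, lo=60.0, hi=120.0, n_bins=6):
    """Map picoamp values to letters 'A', 'B', ... by uniform bins over [lo, hi).

    Values outside the range are clamped to the first or last bin.
    All defaults are illustrative placeholders.
    """
    width = (hi - lo) / n_bins
    letters = []
    for x in signal_pa:
        b = int((x - lo) // width)
        b = min(max(b, 0), n_bins - 1)  # clamp out-of-range values
        letters.append(chr(ord("A") + b))
    return "".join(letters)

print(quantize([60, 69.9, 70, 95, 119.9, 130, 10]))
```

Once signals are strings over a small alphabet, exact and approximate matching can reuse the same compressed-index machinery developed for nucleotide sequences.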
Affiliation(s)
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University
- Sam Kovaka
  - Department of Computer Science, Johns Hopkins University
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
12
Chao KH, Chen PW, Seshia SA, Langmead B. WGT: Tools and algorithms for recognizing, visualizing, and generating Wheeler graphs. iScience 2023; 26:107402. [PMID: 37575187 PMCID: PMC10415921 DOI: 10.1016/j.isci.2023.107402] [Received: 04/29/2023] [Revised: 06/29/2023] [Accepted: 07/12/2023] [Indexed: 08/15/2023]
Abstract
A Wheeler graph represents a collection of strings in a way that is particularly easy to index and query. Such a graph is a practical choice for representing a graph-shaped pangenome, and it is the foundation for current graph-based pangenome indexes. However, there are no practical tools to visualize or to check graphs that may have the Wheeler properties. Here, we present Wheelie, an algorithm that combines a renaming heuristic with a permutation solver (Wheelie-PR) or a Satisfiability Modulo Theory (SMT) solver (Wheelie-SMT) to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a variety of random and real-world graphs in far less time than any algorithm proposed to date. It can check a graph with 1,000s of nodes in seconds. We implement these algorithms together with complementary visualization tools in the WGT toolkit, available as open source software at https://github.com/Kuanhao-Chao/Wheeler_Graph_Toolkit.
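To make the Wheeler properties concrete, the sketch below checks them for a proposed node ordering and then tries every permutation of a tiny graph. It assumes the standard definition: nodes with in-degree zero come first; for edges (u, v, a) and (u', v', a'), a < a' implies v < v'; and a = a' with u < u' implies v ≤ v'. The exhaustive search is exponential and is not Wheelie's renaming heuristic or solver encoding; the function names are hypothetical.

```python
from itertools import permutations

def is_wheeler_order(order, edges):
    """Check the Wheeler properties for one node order; edges are (source, target, label)."""
    pos = {v: i for i, v in enumerate(order)}
    has_in = {v for _, v, _ in edges}
    no_in = [pos[v] for v in order if v not in has_in]
    with_in = [pos[v] for v in order if v in has_in]
    # nodes with no incoming edges must precede all other nodes
    if no_in and with_in and max(no_in) > min(with_in):
        return False
    for u, v, a in edges:
        for u2, v2, a2 in edges:
            if a < a2 and not pos[v] < pos[v2]:
                return False
            if a == a2 and pos[u] < pos[u2] and not pos[v] <= pos[v2]:
                return False
    return True

def find_wheeler_order(nodes, edges):
    """Brute force over all orderings; only feasible for a handful of nodes."""
    for order in permutations(nodes):
        if is_wheeler_order(list(order), edges):
            return list(order)
    return None

path = [(0, 1, "a"), (1, 2, "b")]
print(find_wheeler_order([0, 1, 2], path))
# A node with incoming edges of two different labels can never satisfy the properties.
print(find_wheeler_order([0, 1, 2], [(0, 2, "a"), (1, 2, "b")]))
```

Verifying a given order takes polynomial time; the hardness lies in finding an order, which is what motivates heuristics and solvers for larger graphs.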
Affiliation(s)
- Kuan-Hao Chao
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Pei-Wei Chen
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Sanjit A. Seshia
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
13
Liebhoff AM, Venkataraman T, Morgenlander WR, Na M, Kula T, Waugh K, Morrison C, Rewers M, Longman R, Round J, Elledge S, Ruczinski I, Langmead B, Larman HB. Efficient encoding of large antigenic spaces by epitope prioritization with Dolphyn. bioRxiv 2023:2023.07.30.551179. [PMID: 37577562 PMCID: PMC10418057 DOI: 10.1101/2023.07.30.551179] [Indexed: 08/15/2023]
Abstract
We investigated a relatively underexplored component of the gut-immune axis by profiling the antibody response to gut phages using Phage Immunoprecipitation Sequencing (PhIP-Seq). To enhance this approach, we developed Dolphyn, a novel method that uses machine learning to select peptides from protein sets and compresses the proteome through epitope-stitching. Dolphyn improves the fraction of gut phage library peptides bound by antibodies from 10% to 31% in healthy individuals, while also reducing the number of synthesized peptides by 78%. In our study on gut phages, we discovered that the immune system develops antibodies to bacteria-infecting viruses in the human gut, particularly E. coli-infecting Myoviridae. Cost-effective PhIP-Seq libraries designed with Dolphyn enable the assessment of a wider range of proteins in a single experiment, thus facilitating the study of the gut-immune axis.
Affiliation(s)
- Anna-Maria Liebhoff
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- William R Morgenlander
  - Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- Miso Na
  - Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
- Tomasz Kula
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Division of Genetics, Department of Medicine, Howard Hughes Medical Institute, Brigham and Women's Hospital, Boston, MA, USA
- Kathleen Waugh
  - Barbara Davis Center for Diabetes, University of Colorado Denver, Aurora, Colorado, USA
- Charles Morrison
  - Behavioral, Clinical and Epidemiologic Sciences, FHI 360, Durham, NC, USA
- Marian Rewers
  - Barbara Davis Center for Diabetes, University of Colorado Denver, Aurora, Colorado, USA
- Randy Longman
  - Jill Roberts Institute for Research in IBD, Division of Gastroenterology and Hepatology, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
- June Round
  - Department of Pathology, Division of Microbiology and Immunology, University of Utah School of Medicine, Salt Lake City, UT, USA
- Stephen Elledge
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Division of Genetics, Department of Medicine, Howard Hughes Medical Institute, Brigham and Women's Hospital, Boston, MA, USA
- Ingo Ruczinski
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- H Benjamin Larman
  - Division of Immunology, Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
14
Ahmed O, Rossi M, Boucher C, Langmead B. Efficient taxa identification using a pangenome index. Genome Res 2023; 33:1069-1077. [PMID: 37258301 PMCID: PMC10538492 DOI: 10.1101/gr.277642.123] [Received: 01/04/2023] [Accepted: 05/22/2023] [Indexed: 06/02/2023]
Abstract
Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
Affiliation(s)
- Omar Ahmed
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Massimiliano Rossi
  - Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
15
Baker DN, Langmead B. Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2. Genome Res 2023; 33:1218-1227. [PMID: 37414575 PMCID: PMC10538361 DOI: 10.1101/gr.277655.123] [Received: 01/06/2023] [Accepted: 06/30/2023] [Indexed: 07/08/2023]
Abstract
A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but replaces the leading-zero count with a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. It achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity compared with the original Dashing, in much less time and with the same-sized sketch. Dashing 2 is free, open source software.
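The idea of estimating Jaccard similarity from small sketches can be shown with a simpler, swapped-in technique: bottom-k MinHash. The sketch below is not SetSketch or ProbMinHash and ignores k-mer multiplicities; the hash choice, function names, and parameters are assumptions made for illustration, and inputs are assumed to be at least k characters long.

```python
import hashlib

def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """64-bit hash of a k-mer (BLAKE2b truncated to 8 bytes)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def sketch(seq, k, size):
    """Bottom-k sketch: the `size` smallest hash values among the k-mers."""
    return sorted(hash64(x) for x in kmers(seq, k))[:size]

def jaccard_estimate(sk_a, sk_b, size):
    """Fraction of the smallest `size` hashes of the union that occur in both sketches."""
    union_bottom = sorted(set(sk_a) | set(sk_b))[:size]
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)

def jaccard_exact(a, b, k):
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

a, b = "ACGTACGTGACCTGATTACA", "ACGTACGTGACGTGATTACA"
print(jaccard_exact(a, b, 4), jaccard_estimate(sketch(a, 4, 8), sketch(b, 4, 8), 8))
```

The estimate uses only the retained hash values, so comparison cost depends on sketch size rather than genome size; accuracy improves as the sketch grows.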
Affiliation(s)
- Daniel N Baker
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218-2683, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218-2683, USA
16
Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol 2023; 24:122. [PMID: 37202771 PMCID: PMC10197461 DOI: 10.1186/s13059-023-02958-1] [Received: 09/12/2022] [Accepted: 05/03/2023] [Indexed: 05/20/2023]
Abstract
Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
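The abstract credits minimizers for the smaller index. As a hedged illustration of the general idea only, the sketch below selects window minimizers from a sequence: in every window of `w` consecutive k-mers it keeps the lexicographically smallest one. The lexicographic ordering, the parameter names, and the tie-breaking rule are assumptions for the example; the abstract does not say how SPUMONI 2 orders k-mers or assembles its digests.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Each window covers w consecutive k-mers; the lexicographically
    smallest k-mer wins, with the leftmost occurrence breaking ties.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        chosen.add((start + offset, window[offset]))
    return sorted(chosen)

print(minimizers("ACGTTGCA", k=3, w=3))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTT'), (5, 'GCA')]
```

Adjacent windows often share a minimizer, so the selected k-mers are typically a fraction of all k-mers, which is where the space saving comes from.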
Affiliation(s)
- Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
- Massimiliano Rossi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
- Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, NS Canada
- Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
17
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt, available at https://github.com/alshai/rowbowt.
Affiliation(s)
- Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
18
Imada EL, Wilks C, Langmead B, Marchionni L. REPAC: analysis of alternative polyadenylation from RNA-sequencing data. Genome Biol 2023; 24:22. [PMID: 36759904 PMCID: PMC9912678 DOI: 10.1186/s13059-023-02865-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Accepted: 01/24/2023] [Indexed: 02/11/2023] Open
Abstract
Alternative polyadenylation (APA) is an important post-transcriptional mechanism that has major implications in biological processes and diseases. Although specialized sequencing methods for polyadenylation exist, availability of these data is limited compared to RNA-sequencing data. We developed REPAC, a framework for the analysis of APA from RNA-sequencing data. Using REPAC, we investigate the landscape of APA caused by activation of B cells. We also show that REPAC is faster than alternative methods by at least 7-fold and that it scales well to hundreds of samples. Overall, the REPAC method offers an accurate, easy, and convenient solution for the exploration of APA.
Affiliation(s)
- Eddie L. Imada
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, USA
- Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Luigi Marchionni
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, USA
19
Bonnie JK, Ahmed O, Langmead B. DandD: efficient measurement of sequence growth and similarity. bioRxiv 2023:2023.02.02.526837. [PMID: 36778393 PMCID: PMC9915590 DOI: 10.1101/2023.02.02.526837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2023]
Abstract
Genome assembly databases are growing rapidly. The sequence content in each new assembly can be largely redundant with previous ones, but this is neither conceptually nor algorithmically easy to measure. We propose new methods and a new tool called DandD that addresses the question of how much new sequence is gained when a sequence collection grows. DandD can describe how much human structural variation is being discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression. Computing δ directly requires counting k-mers, but DandD can rapidly estimate it using genomic sketches. We also propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard. DandD is open source software available at: https://github.com/jessicabonnie/dandd.
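The abstract notes that computing δ directly requires counting k-mers. The abstract itself does not give a formula; the sketch below assumes the usual definition from the data compression literature, δ = max over k of d_k / k, where d_k is the number of distinct k-mers. It counts k-mers exactly on a toy string, does not use sketches or canonical k-mers, and is not DandD's implementation. Function names are hypothetical.

```python
def distinct_kmers(seq, k):
    """Number of distinct substrings of length k in seq (d_k)."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def delta(seq):
    """Return (delta, k*) with delta = max_k d_k / k, for a non-empty seq.

    Exact, brute-force counting over every k; suitable for toy inputs only.
    """
    best_k = max(range(1, len(seq) + 1),
                 key=lambda k: distinct_kmers(seq, k) / k)
    return distinct_kmers(seq, best_k) / best_k, best_k

print(delta("AAAAAAAA"))   # (1.0, 1): a single repeated symbol
print(delta("ACGTACGT"))   # (4.0, 1): four distinct 1-mers dominate
```

Because the maximum is taken over all k, the measure does not depend on a single choice of k, which is the property the abstract relies on for its k-independent Jaccard.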
Affiliation(s)
- Omar Ahmed
- Department of Computer Science, Johns Hopkins University
- Ben Langmead
- Department of Computer Science, Johns Hopkins University
20
Chen H, Caffo B, Stein-O’Brien G, Liu J, Langmead B, Colantuoni C, Xiao L. Two-stage linked component analysis for joint decomposition of multiple biologically related data sets. Biostatistics 2022; 23:1200-1217. [PMID: 35358296 PMCID: PMC9566367 DOI: 10.1093/biostatistics/kxac005] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/31/2021] [Revised: 01/13/2022] [Accepted: 01/24/2022] [Indexed: 02/03/2023] Open
Abstract
Integrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high throughput biological data being generated. In particular such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.
Affiliation(s)
- Huan Chen
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Brian Caffo
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Jinrui Liu
- Department of Neurology, Johns Hopkins University, Baltimore, MD, 21287, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Carlo Colantuoni
- Department of Neuroscience, Johns Hopkins University, Baltimore, MD, 21205, USA, Department of Neurology, Johns Hopkins University, Baltimore, MD, 21287, USA and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Luo Xiao
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27607, USA
21
Ling JP, Bygrave AM, Santiago CP, Carmen-Orozco RP, Trinh VT, Yu M, Li Y, Liu Y, Bowden KD, Duncan LH, Han J, Taneja K, Dongmo R, Babola TA, Parker P, Jiang L, Leavey PJ, Smith JJ, Vistein R, Gimmen MY, Dubner B, Helmenstine E, Teodorescu P, Karantanos T, Ghiaur G, Kanold PO, Bergles D, Langmead B, Sun S, Nielsen KJ, Peachey N, Singh MS, Dalton WB, Rajaii F, Huganir RL, Blackshaw S. Cell-specific regulation of gene expression using splicing-dependent frameshifting. Nat Commun 2022; 13:5773. [PMID: 36182931 PMCID: PMC9526712 DOI: 10.1038/s41467-022-33523-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/14/2022] [Accepted: 09/21/2022] [Indexed: 01/29/2023] Open
Abstract
Precise and reliable cell-specific gene delivery remains technically challenging. Here we report a splicing-based approach for controlling gene expression whereby separate translational reading frames are coupled to the inclusion or exclusion of mutated, frameshifting cell-specific alternative exons. Candidate exons are identified by analyzing thousands of publicly available RNA sequencing datasets and filtering by cell specificity, conservation, and local intron length. This method, which we denote splicing-linked expression design (SLED), can be combined in a Boolean manner with existing techniques such as minipromoters and viral capsids. SLED can use strong constitutive promoters, without sacrificing precision, by decoupling the tradeoff between promoter strength and selectivity. AAV-packaged SLED vectors can selectively deliver fluorescent reporters and calcium indicators to various neuronal subtypes in vivo. We also demonstrate gene therapy utility by creating SLED vectors that can target PRPH2 and SF3B1 mutations. The flexibility of SLED technology enables creative avenues for basic and translational research.
Affiliation(s)
- Jonathan P Ling
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA.
- Alexei M Bygrave
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Clayton P Santiago
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Rogger P Carmen-Orozco
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Vickie T Trinh
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Minzhong Yu
- Department of Ophthalmic Research, Cole Eye Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
- Department of Ophthalmology, Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, Cleveland, OH, 44195, USA
- Yini Li
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Ying Liu
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kyra D Bowden
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Leighton H Duncan
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Jeong Han
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kamil Taneja
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Rochinelle Dongmo
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Travis A Babola
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Patrick Parker
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Lizhi Jiang
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Patrick J Leavey
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Jennifer J Smith
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Rachel Vistein
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Megan Y Gimmen
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Benjamin Dubner
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Eric Helmenstine
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Patric Teodorescu
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Theodoros Karantanos
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Gabriel Ghiaur
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Patrick O Kanold
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Dwight Bergles
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Ben Langmead
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Shuying Sun
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kristina J Nielsen
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Neal Peachey
- Department of Ophthalmic Research, Cole Eye Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
- Department of Ophthalmology, Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, Cleveland, OH, 44195, USA
- Research Service, Louis Stokes Cleveland VA Medical Center, Cleveland, OH, 44106, USA
- Mandeep S Singh
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- W Brian Dalton
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Fatemeh Rajaii
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Richard L Huganir
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
- Seth Blackshaw
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
- Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD, 21218, USA.
- Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
- Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
22
Lu J, Rincon N, Wood DE, Breitwieser FP, Pockrandt C, Langmead B, Salzberg SL, Steinegger M. Metagenome analysis using the Kraken software suite. Nat Protoc 2022; 17:2815-2839. [PMID: 36171387 DOI: 10.1038/s41596-022-00738-y] [Citation(s) in RCA: 85] [Impact Index Per Article: 42.5] [Received: 06/29/2021] [Accepted: 06/16/2022] [Indexed: 01/19/2023]
Abstract
Metagenomic experiments expose the wide range of microscopic organisms in any microbial environment through high-throughput DNA sequencing. The computational analysis of the sequencing data is critical for the accurate and complete characterization of the microbial community. To facilitate efficient and reproducible metagenomic analysis, we introduce a step-by-step protocol for the Kraken suite, an end-to-end pipeline for the classification, quantification and visualization of metagenomic datasets. Our protocol describes the execution of the Kraken programs, via a sequence of easy-to-use scripts, in two scenarios: (1) quantification of the species in a given metagenomics sample; and (2) detection of a pathogenic agent from a clinical sample taken from a human patient. The protocol, which is executed within 1-2 h, is targeted to biologists and clinicians working in microbiome or metagenomics analysis who are familiar with the Unix command-line environment.
Affiliation(s)
- Jennifer Lu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Natalia Rincon
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Derrick E Wood
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Florian P Breitwieser
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Christopher Pockrandt
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Martin Steinegger
- School of Biological Sciences and Institute of Molecular Biology & Genetics, Seoul National University, Seoul, Republic of Korea.
23
Mun T, Vaddadi NSK, Langmead B. Pangenomic Genotyping with the Marker Array. Algorithms Bioinform 2022; 242:19. [PMID: 36409181 PMCID: PMC9674407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods.
Affiliation(s)
- Taher Mun
- Johns Hopkins University, Baltimore MD, USA; Illumina, San Diego, USA
24
Imada EL, Wilks C, Langmead B, Marchionni L. Abstract 1219: Unraveling alternative polyadenylation in prostate cancer with CORE-PAD. Cancer Res 2022. [DOI: 10.1158/1538-7445.am2022-1219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Mechanisms that control gene expression at the RNA level are often referred to as post-transcriptional regulation (PTR) mechanisms. Splicing and polyadenylation (PA) are well-known examples of PTR that can regulate not only gene expression but also gene function. Alternative polyadenylation (APA) has already been shown to be essential to many biological processes (e.g., proliferation and cell differentiation) and has also been implicated in the development and progression of multiple diseases (e.g., cancer and hematological and immune disorders). Although several sequencing methods have been developed to sequence only the transcript termination site (TTS), the amount of publicly available data derived from these methods is extremely limited in comparison to traditional RNA-Seq data. To overcome this limitation, we created a new framework, Compositional Regression of Polyadenylation Differences (CORE-PAD), for the study of differential APA events using traditional bulk RNA-seq data. Using simulated data, we showed that CORE-PAD has higher accuracy than other methods (accuracy = 0.98) in detecting APA events. We applied CORE-PAD to prostate cancer (PCa) samples with impaired CDK12 and to “wild” samples. We found that multiple genes presented differential PA site usage, including DNA repair genes. Most notably, we noticed that many genes that exhibit differential APA were not differentially expressed at the gene level, meaning they are potentially “silent drivers” that cannot be captured through standard differential gene expression analysis. These findings highlight the importance of studying APA, which can help shed light on another layer of regulation occurring between transcription and translation. This is especially important since these events can be a source of neoantigens or targeted mRNA degradation, which could be explored for new treatments.
Finally, the CORE-PAD framework was designed to take advantage of our recently published recount3 resource making over 750,000 RNA-Seq samples of human and mouse origin readily available for analysis, enabling studies of APA across thousands of phenotypes in an accurate and accessible way.
Citation Format: Eddie L. Imada, Christopher Wilks, Ben Langmead, Luigi Marchionni. Unraveling alternative polyadenylation in prostate cancer with CORE-PAD [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1219.
25
Rossi M, Oliva M, Bonizzoni P, Langmead B, Gagie T, Boucher C. Finding Maximal Exact Matches Using the r-Index. J Comput Biol 2022; 29:188-194. [PMID: 35041518 PMCID: PMC8902461 DOI: 10.1089/cmb.2021.0445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023] Open
Abstract
Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in O(r) space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in O(r) space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
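To make the quantity r concrete, the following toy sketch builds the Burrows-Wheeler Transform of a short string by sorting its rotations and then counts the runs of equal characters. It is a quadratic-time illustration only and is unrelated to how MONI or the r-index are actually constructed; the function names and the `$` sentinel are assumptions for the example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy inputs only).

    The sentinel is assumed to be absent from `text` and to sort
    before every character in it.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    """Number of maximal runs of equal characters in s (the quantity r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

transformed = bwt("banana")
print(transformed)              # annb$aa
print(count_runs(transformed))  # 5
```

For collections of near-identical genomes, r tends to be far smaller than the total sequence length, which is why space bounds expressed in r are attractive.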
Affiliation(s)
- Massimiliano Rossi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA. Address correspondence to: Dr. Massimiliano Rossi, Department of Computer and Information Science and Engineering, P.O. Box 116120, University of Florida, Gainesville, FL 32611-6550, USA
- Marco Oliva
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
- Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
26
Abstract
Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
Affiliation(s)
- Massimiliano Rossi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA. Address correspondence to: Dr. Massimiliano Rossi, Department of Computer and Information Science and Engineering, P.O. Box 116120, University of Florida, Gainesville, FL 32611-6550, USA
- Marco Oliva
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
27
Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL, Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L, Hansen KD, Langmead B. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol 2021; 22:323. [PMID: 34844637 PMCID: PMC8628444 DOI: 10.1186/s13059-021-02533-6] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 07/12/2021] [Accepted: 10/29/2021] [Indexed: 12/12/2022] Open
Abstract
We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.
Affiliation(s)
- Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Shijie C Zheng
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
| | | | - Rone Charles
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Brad Solomon
- Thomas M. Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Jonathan P Ling
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, USA
| | - Eddie Luidy Imada
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
| | - David Zhang
- Institute of Child Health, University College London (UCL), London, UK
| | | | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
| | - Andrew E Jaffe
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
- Lieber Institute for Brain Development, Baltimore, USA
- Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, USA
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
| | - Abhinav Nellore
- Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
- Department of Surgery, Oregon Health & Science University, Portland, OR, USA
| | | | - Kasper D Hansen
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA.
- Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, USA.
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA.
| |
|
28
|
Abstract
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
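As a rough illustration of the k-means++ center finding mentioned in the abstract, the sketch below implements the textbook D² seeding rule with squared Euclidean distance. It is not minicore's vectorized weighted reservoir sampling; the function names `sq_dist` and `kmeanspp_centers` are hypothetical, and the code assumes the input contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_centers(points, k, seed=0):
    """Textbook k-means++ seeding (illustrative only, not minicore's code).

    Assumes `points` holds at least k distinct points; otherwise all
    sampling weights become zero and random.choices raises ValueError.
    """
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Draw the next center with probability proportional to d2.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Only the new center can shrink a point's nearest-center distance.
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers
```

Points that coincide with an already chosen center have weight zero, so the sampler spreads centers across distinct regions of the data before any clustering iterations run.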
Affiliation(s)
- Daniel N Baker
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Nathan Dyjack
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Vladimir Braverman
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Stephanie C Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
|
29
|
Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience 2021; 24:102696. [PMID: 34195571 PMCID: PMC8237286 DOI: 10.1016/j.isci.2021.102696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Revised: 05/06/2021] [Accepted: 06/04/2021] [Indexed: 11/24/2022] Open
Abstract
Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject “nontarget” DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing using efficient pan-genome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 16 and 4 times smaller than those of minimap2, respectively. This could enable accurate targeted sequencing even when the targeted strains have not necessarily been sequenced or assembled previously. SPUMONI uses an efficient pan-genome index to eject nontarget reads from the nanopore. Read classifications are highly accurate for typical nanopore sequencing error rates. For larger pan-genomes, SPUMONI is faster and uses less memory than minimap2. SPUMONI enables analyses for strains that are missing or poorly represented in databases.
Affiliation(s)
- Omar Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Massimiliano Rossi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, NS, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
|
30
|
Mun T, Chen NC, Langmead B. LevioSAM: Fast lift-over of variant-aware reference alignments. Bioinformatics 2021; 37:4243-4245. [PMID: 34037690 PMCID: PMC9502237 DOI: 10.1093/bioinformatics/btab396] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/23/2020] [Revised: 03/31/2021] [Accepted: 05/24/2021] [Indexed: 01/12/2023] Open
Abstract
Motivation: As more population genetics datasets and population-specific references become available, the task of translating (‘lifting’) read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck. Results: LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM is more than 7 times faster than an aligner when both are run with 16 threads. Availability and implementation: Software Package: https://github.com/alshai/levioSAM, Experiments: https://github.com/langmead-lab/levioSAM-experiments. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, 21218, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, 21218, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, 21218, USA
| |
|
31
|
Wilks C, Ahmed O, Baker DN, Zhang D, Collado-Torres L, Langmead B. Megadepth: efficient coverage quantification for BigWigs and BAMs. Bioinformatics 2021; 37:3014-3016. [PMID: 33693500 PMCID: PMC8528031 DOI: 10.1093/bioinformatics/btab152] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/10/2020] [Revised: 01/16/2021] [Accepted: 03/04/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. RESULTS: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19 000 GTExV8 BigWig files in approximately 1 h using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. AVAILABILITY AND IMPLEMENTATION: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/megadepth. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
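To make the idea of quantifying coverage within genomic intervals concrete, here is a minimal sketch using prefix sums over a per-base coverage array. It does not use Megadepth or read BigWig/BAM files; the function name `interval_coverage_sums` and the 0-based, half-open interval convention (as in BED files) are assumptions made for illustration.

```python
from itertools import accumulate


def interval_coverage_sums(coverage, intervals):
    """Sum per-base coverage inside each 0-based, half-open [start, end) interval.

    Illustrative only: `coverage` is a plain list of per-base depths for one
    chromosome, standing in for data that a real tool would read from a file.
    """
    # prefix[i] is the total coverage over bases 0 .. i-1.
    prefix = [0, *accumulate(coverage)]
    # Each interval sum is then a constant-time difference of two prefix values.
    return [prefix[end] - prefix[start] for start, end in intervals]
```

For example, with per-base depths `[0, 2, 2, 3, 0, 1]`, the interval `(1, 4)` sums to 7 and `(4, 6)` sums to 1.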
Affiliation(s)
- Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA. To whom correspondence should be addressed.
| | - Omar Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Daniel N Baker
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - David Zhang
- Department of Molecular Neuroscience, Institute of Neurology, University College London (UCL), London WC1E 6BT, UK; NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London WC1E 6BT, UK; Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London WC1E 6BT, UK
| | | | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA. To whom correspondence should be addressed.
| |
|
32
|
Boucher C, Gagie T, Tomohiro I, Köppl D, Langmead B, Manzini G, Navarro G, Pacheco A, Rossi M. PHONI: Streamed Matching Statistics with Multi-Genome References. Proc Data Compress Conf 2021; 2021:193-202. [PMID: 34778549 PMCID: PMC8583545 DOI: 10.1109/dcc50243.2021.00027] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/26/2023]
Abstract
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.
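For readers unfamiliar with the term, matching statistics can be defined by a short brute-force sketch: for each position i of the pattern, record the length of the longest prefix of pattern[i:] that occurs anywhere in the text. The code below is only that definition made executable; it is not the streaming, compressed-index algorithm the paper describes, and the function name `matching_statistics` is hypothetical.

```python
def matching_statistics(pattern: str, text: str) -> list[int]:
    """Brute-force matching statistics (definition only, quadratic or worse).

    ms[i] is the length of the longest prefix of pattern[i:] that occurs
    somewhere in text.
    """
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match by one character while the longer substring
        # still occurs in the text and the pattern has characters left.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms
```

With text `"GATTACA"` and pattern `"TACAG"`, the result is `[4, 3, 2, 1, 1]`: the suffix `TACA` of the text matches the first four pattern characters, and each later position matches progressively less.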
|
33
|
Abstract
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
Affiliation(s)
- Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Brad Solomon
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Sheila Iyer
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA.
| |
|
34
|
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol 2021; 22:8. [PMID: 33397413 PMCID: PMC7780692 DOI: 10.1186/s13059-020-02229-3] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 03/09/2020] [Accepted: 12/08/2020] [Indexed: 12/30/2022] Open
Abstract
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
Affiliation(s)
- Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Brad Solomon
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Sheila Iyer
- Department of Computer Science, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA.
| |
|
35
|
Kempa D, Langmead B. Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing. Leibniz Int Proc Inform 2021; 204:56. [PMID: 37906626 PMCID: PMC10613516 DOI: 10.4230/lipics.esa.2021.56] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003], is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: O(z log n), where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical. We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures.
Affiliation(s)
- Dominik Kempa
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
|
36
|
Darby CA, Gaddipati R, Schatz MC, Langmead B. Vargas: heuristic-free alignment for assessing linear and graph read aligners. Bioinformatics 2020; 36:3712-3718. [PMID: 32321164 PMCID: PMC7320598 DOI: 10.1093/bioinformatics/btaa265] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/23/2019] [Revised: 03/19/2020] [Accepted: 04/15/2020] [Indexed: 12/31/2022] Open
Abstract
Motivation: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM (maximal exact match) and vg to align more reads correctly. Availability and implementation: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Michael C Schatz
- Department of Computer Science; Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | | |
|
37
|
Abstract
The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align.
Affiliation(s)
- Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
| | - Alan Kuhnle
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada; Center for Biotechnology and Bioengineering, Santiago, Chile; School of Computer Science and Telecommunications, Universidad Diego Portales, Santiago, Chile
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
| | - Giovanni Manzini
- Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
| |
|
38
|
Abstract
Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
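The run-length compression of the BWT mentioned above can be illustrated with a small sketch. It builds the BWT naively by sorting all rotations (practical only for toy inputs, and not the construction method of the paper) and then collapses equal adjacent characters into runs. The function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for illustration.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Appends '$' as a sentinel, assuming it does not occur in `text` and
    sorts before every other character (true for ASCII letters).
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]
```

For `"banana"` the BWT is `"annb$aa"`, which encodes as five runs: `[("a", 1), ("n", 2), ("b", 1), ("$", 1), ("a", 2)]`. On collections of highly similar genomes the number of runs tends to grow much more slowly than the text length, which is what keeps the rank structure small.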
Affiliation(s)
- Alan Kuhnle
- Department of Computer Science, Florida State University, Tallahassee, Florida
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
- Address correspondence to: Taher Mun, PhD Candidate, Department of Computer Science, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218-2682
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
- School of Computer Science and Telecommunications, Universidad Diego Portales and CeBiB, Santiago, Chile
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
| | - Giovanni Manzini
- Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
| |
|
39
|
Imada EL, Sanchez DF, Collado-Torres L, Wilks C, Matam T, Dinalankara W, Stupnikov A, Lobo-Pereira F, Yip CW, Yasuzawa K, Kondo N, Itoh M, Suzuki H, Kasukawa T, Hon CC, de Hoon MJL, Shin JW, Carninci P, Jaffe AE, Leek JT, Favorov A, Franco GR, Langmead B, Marchionni L. Recounting the FANTOM CAGE-Associated Transcriptome. Genome Res 2020; 30:1073-1081. [PMID: 32079618 PMCID: PMC7397872 DOI: 10.1101/gr.254656.119] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 07/12/2019] [Accepted: 02/11/2020] [Indexed: 02/02/2023]
Abstract
Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.
Affiliation(s)
- Eddie Luidy Imada
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA; Departamento de Bioquímica e Imunologia, ICB, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, 31270-901, Brazil
| | - Diego Fernando Sanchez
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA
| | | | - Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | - Tejasvi Matam
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA
| | - Wikum Dinalankara
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA
| | - Aleksey Stupnikov
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA
| | - Francisco Lobo-Pereira
- Departamento de Biologia General, ICB, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, 31270-901, Brazil
| | - Chi-Wai Yip
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Kayoko Yasuzawa
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Naoto Kondo
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Masayoshi Itoh
- RIKEN, Preventive Medicine and Diagnostic Innovation Program, Yokohama, 351-0198, Japan
| | - Harukazu Suzuki
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Takeya Kasukawa
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Chung-Chau Hon
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | | | - Jay W Shin
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Piero Carninci
- RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
| | - Andrew E Jaffe
- Lieber Institute for Brain Development, Baltimore, Maryland 21205, USA; Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA
| | - Alexander Favorov
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA; Laboratory of Systems Biology and Computational Genetics, VIGG RAS, 117971 Moscow, Russia
| | - Gloria R Franco
- Departamento de Bioquímica e Imunologia, ICB, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, 31270-901, Brazil
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA
| | - Luigi Marchionni
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21827, USA
| |
|
40
|
Wulfridge P, Langmead B, Feinberg AP, Hansen KD. Analyzing whole genome bisulfite sequencing data from highly divergent genotypes. Nucleic Acids Res 2019; 47:e117. [PMID: 31392989 PMCID: PMC6821270 DOI: 10.1093/nar/gkz674] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 11/07/2018] [Revised: 07/03/2019] [Accepted: 07/25/2019] [Indexed: 11/13/2022] Open
Abstract
In the study of DNA methylation, genetic variation between species, strains or individuals can result in CpG sites that are exclusive to a subset of samples, and insertions and deletions can rearrange the spatial distribution of CpGs. How to account for this variation in an analysis of the interplay between sequence variation and DNA methylation is not well understood, especially when the number of CpG differences between samples is large. Here, we use whole-genome bisulfite sequencing data on two highly divergent mouse strains to study this problem. We show that alignment to personal genomes is necessary for valid methylation quantification. We introduce a method for including strain-specific CpGs in differential analysis, and show that this increases power. We apply our method to a human normal-cancer dataset, and show this improves accuracy and power, illustrating the broad applicability of our approach. Our method uses smoothing to impute methylation levels at strain-specific sites, thereby allowing strain-specific CpGs to contribute to the analysis, while accounting for differences in the spatial occurrences of CpGs. Our results have implications for joint analysis of genetic variation and DNA methylation using bisulfite-converted DNA, and unlock the use of personal genomes for addressing this question.
Affiliation(s)
- Phillip Wulfridge
- Center for Epigenetics, Johns Hopkins School of Medicine, 855 N. Wolfe St, Baltimore, MD 21205, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, 3400 N. Charles St, Baltimore, MD 21218, USA
| | - Andrew P Feinberg
- Center for Epigenetics, Johns Hopkins School of Medicine, 855 N. Wolfe St, Baltimore, MD 21205, USA; Department of Medicine, Johns Hopkins School of Medicine, 855 N. Wolfe St, Baltimore, MD 21205, USA; Department of Biomedical Engineering, Whiting School of Engineering, 3400 N. Charles St, Baltimore, MD 21218, USA; Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, 624 N. Broadway, MD 21205, USA
| | - Kasper D Hansen
- Center for Epigenetics, Johns Hopkins School of Medicine, 855 N. Wolfe St, Baltimore, MD 21205, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St, Baltimore, MD 21205, USA; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, 733 N. Broadway, Baltimore, MD 21205, USA
| |
|
41
|
Abstract
Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
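The HyperLogLog sketch named in the abstract can be illustrated in a few lines. The sketch below uses the original HyperLogLog estimator with a small-range correction, not the specialized union and intersection estimators that Dashing describes; the function names, the `p=10` default, and the use of BLAKE2b as the hash are all assumptions made for illustration.

```python
import hashlib
import math


def hll_sketch(items, p=10):
    """Build a HyperLogLog register array with 2**p registers (illustrative)."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # 64-bit hash of the item; any well-mixed hash would do here.
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - p)                  # top p bits pick the register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - p) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def hll_union(a, b):
    """The sketch of a set union is the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(regs):
    """Original HyperLogLog cardinality estimate with small-range correction."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * m and zeros:
        # Linear counting is more accurate while many registers are empty.
        return m * math.log(m / zeros)
    return raw
```

Because the union is a register-wise maximum, two sketches can be merged without revisiting the underlying sequences. An intersection size, and from it a Jaccard similarity, can then be approximated by inclusion-exclusion on the three estimates |A|, |B| and |A ∪ B|.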
Affiliation(s)
- Daniel N Baker
- Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA.
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA.
| |
|
42
|
Abstract
Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
Affiliation(s)
- Daniel N Baker
- Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, MD 21218, USA.
| |
|
43
|
Abstract
Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Affiliation(s)
- Derrick E Wood
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Lu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
|
44
|
Abstract
Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Affiliation(s)
- Derrick E Wood
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Lu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
|
45
|
Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 2019; 35:421-432. [PMID: 30020410 PMCID: PMC6361242 DOI: 10.1093/bioinformatics/bty648] [Citation(s) in RCA: 331] [Impact Index Per Article: 66.2] [Received: 02/10/2018] [Accepted: 07/17/2018] [Indexed: 12/27/2022] Open
Abstract
Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners.
Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie: http://bowtie-bio.sourceforge.net. Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2. HISAT: http://www.ccb.jhu.edu/software/hisat
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Valentin Antonescu
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Rone Charles
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
|
46
|
Darby CA, Fitch JR, Brennan PJ, Kelly BJ, Bir N, Magrini V, Leonard J, Cottrell CE, Gastier-Foster JM, Wilson RK, Mardis ER, White P, Langmead B, Schatz MC. Samovar: Single-Sample Mosaic Single-Nucleotide Variant Calling with Linked Reads. iScience 2019; 18:1-10. [PMID: 31271967 PMCID: PMC6609817 DOI: 10.1016/j.isci.2019.05.037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/09/2019] [Revised: 05/06/2019] [Accepted: 05/24/2019] [Indexed: 12/25/2022] Open
Abstract
Linked-read sequencing enables greatly improved haplotype assembly over standard paired-end analysis. The detection of mosaic single-nucleotide variants benefits from haplotype assembly when the model is informed by the mapping between constituent reads and linked reads. Samovar evaluates haplotype-discordant reads identified through linked-read sequencing, thus enabling phasing and mosaic variant detection across the entire genome. Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics. Samovar calls mosaic single-nucleotide variants (SNVs) within a single sample with accuracy comparable to what previously required trios or matched tumor/normal pairs, and it outperforms single-sample mosaic variant callers at minor allele frequency 5%-50% with at least 30X coverage. Samovar finds somatic variants in both tumor and normal whole-genome sequencing from 13 pediatric cancer cases that can be corroborated with high recall by whole-exome sequencing. Samovar is available open-source at https://github.com/cdarby/samovar under the MIT license.
Affiliation(s)
- Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - James R Fitch
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Patrick J Brennan
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Benjamin J Kelly
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Natalie Bir
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
| | - Vincent Magrini
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Jeffrey Leonard
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA; Department of Neurosurgery, Nationwide Children's Hospital, Columbus, OH, USA
| | - Catherine E Cottrell
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Julie M Gastier-Foster
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Richard K Wilson
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Elaine R Mardis
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Peter White
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Biology, Johns Hopkins University, Baltimore, MD, USA; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| |
|
47
|
Madugundu AK, Na CH, Nirujogi RS, Renuse S, Kim KP, Burns KH, Wilks C, Langmead B, Ellis SE, Collado‐Torres L, Halushka MK, Kim M, Pandey A. Integrated Transcriptomic and Proteomic Analysis of Primary Human Umbilical Vein Endothelial Cells. Proteomics 2019; 19:e1800315. [PMID: 30983154 PMCID: PMC6812510 DOI: 10.1002/pmic.201800315] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 08/08/2018] [Revised: 01/17/2019] [Indexed: 01/11/2023]
Abstract
Understanding the molecular profile of every human cell type is essential for understanding its role in normal physiology and disease. Technological advancements in DNA sequencing, mass spectrometry, and computational methods allow us to carry out multiomics analyses, although such approaches are not yet routine. Human umbilical vein endothelial cells (HUVECs) are a widely used model system to study pathological and physiological processes associated with the cardiovascular system. In this study, next-generation sequencing and high-resolution mass spectrometry are employed to profile the transcriptome and proteome of primary HUVECs. Analysis of 145 million paired-end reads from next-generation sequencing confirmed expression of 12 186 protein-coding genes (FPKM ≥0.1) and 439 novel long non-coding RNAs, and revealed 6089 novel isoforms that were not annotated in GENCODE. Proteomics analysis identifies 6477 proteins, including confirmation of N-termini for 1091 proteins, isoforms for 149 proteins, and 1034 phosphosites. A database search to specifically identify other post-translational modifications provides evidence for a number of modification sites on 117 proteins, including ubiquitylation, lysine acetylation, and mono-, di- and tri-methylation events. Evidence for 11 "missing proteins," which are proteins for which there was insufficient or no protein-level evidence, is provided. Peptides supporting missing proteins and novel events are validated by comparison of MS/MS fragmentation patterns with synthetic peptides. Finally, 245 variant peptides derived from 207 expressed proteins are identified, in addition to alternate translational start sites for seven proteins and evidence for novel proteoforms for five proteins resulting from alternative splicing. Overall, it is believed that the integrated approach employed in this study is widely applicable to the study of any primary cell type for deeper molecular characterization.
Affiliation(s)
- Anil K. Madugundu
- Center for Molecular Medicine, National Institute of Mental Health and Neurosciences, Hosur Road, Bangalore 560029, Karnataka, India
- Institute of Bioinformatics, International Technology Park, Bangalore 560066, Karnataka, India
- Manipal Academy of Higher Education, Manipal 576104, Karnataka, India
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN 55905, USA
| | - Chan Hyun Na
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Neurology, Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Raja Sekhar Nirujogi
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Santosh Renuse
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN 55905, USA
| | - Kwang Pyo Kim
- Department of Applied Chemistry, Kyung Hee University, Yongin, Gyeonggi 17104, Republic of Korea
| | - Kathleen H. Burns
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- High Throughput Biology Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Christopher Wilks
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Shannon E. Ellis
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Leonardo Collado‐Torres
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
| | - Marc K. Halushka
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| | - Min‐Sik Kim
- Department of Applied Chemistry, Kyung Hee University, Yongin, Gyeonggi 17104, Republic of Korea
- Department of New Biology, DGIST, Daegu 42988, Republic of Korea
| | - Akhilesh Pandey
- Center for Molecular Medicine, National Institute of Mental Health and Neurosciences, Hosur Road, Bangalore 560029, Karnataka, India
- McKusick‐Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN 55905, USA
- Neurology, Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Departments of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biological Chemistry, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| |
|
48
|
Imada EL, Sanchez DF, Matam T, Collado-Torres L, Wilks C, Dinalankara W, Stupnikov A, Langmead B, Lupold SE, Marchionni L. Abstract 908: Comprehensive analysis of alternative polyadenylation across cancer phenotypes. Cancer Res 2019. [DOI: 10.1158/1538-7445.am2019-908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The three-prime untranslated region (3'-UTR) of an mRNA influences its biological behavior, including its stability, post-transcriptional control through miRNAs, and availability for translation. Alternative polyadenylation (APA) can modulate 3' end site selection, and approximately 50% of coding genes are subject to it. Global transcript shortening has been reported in normal and cancer cells. APA can be seen as a regulatory step that controls differential expression of transcript isoforms; hence it can be analyzed similarly to gene expression, comparing relevant phenotypes (e.g., tumor vs. normal, survival) with appropriate statistical methods (e.g., generalized linear models, Cox proportional hazards models).
We analyzed APA across 16 cancer types, taking advantage of the following public domain resources: 1) recount2, an annotation-agnostic RNA expression database for over 72,000 human samples (Collado-Torres et al, 2017); 2) Snaptron, a search engine and database that enables one to summarize expression for specific genomic regions and features (Wilks et al, 2017); and 3) APADB, the largest database collection of human APA sites for coding and non-coding genes (Müller et al, 2014). We leveraged Snaptron to extract expression levels for 100-base-pair windows upstream and downstream of APA sites defined in APADB. We annotated these genomic features, corresponding to short and long transcript isoforms, using metadata from recount2. As a proof of concept, we analyzed differential APA isoform expression in TCGA, comparing tumor vs. normal samples and identifying APA events associated with recurrence and survival, as well as other well-defined clinical, morphologic and molecular classifications.
Our preliminary results show hundreds of genes switching PA sites to shorten or extend 3'-UTR length in primary tumors when compared to normal tissues. Some of these genes are associated with cell cycle and proliferation, indicating that PA sites are dynamically used in primary tumors as another mechanism to evade and modulate post-transcriptional control. Even more interestingly, a substantial fraction of these APA isoforms were associated with tumor recurrence and survival independently of standard clinical and pathological variables.
In conclusion, by leveraging public domain resources such as APADB, recount2, and Snaptron, we created a comprehensive resource that enables detection of dynamic usage of PA sites across cancer phenotypes. Furthermore, the association of many APA isoforms with tumor progression suggests that these could serve as clinically useful biomarkers. Most importantly, the comprehensive resource we have built accounts for over 72,000 human samples; hence it is not limited to the cancer phenotypes we explored in this study. Once released in the public domain, our APA expression atlas will empower the scientific community at large to explore APA across many other cancer and human disease phenotypes.
Citation Format: Eddie L. Imada, Diego F. Sanchez, Tejasvi Matam, Leonardo Collado-Torres, Christopher Wilks, Wikum Dinalankara, Alexey Stupnikov, Ben Langmead, Shawn E. Lupold, Luigi Marchionni. Comprehensive analysis of alternative polyadenylation across cancer phenotypes [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 908.
Affiliation(s)
- Eddie L. Imada
- Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| |
|
49
|
Abstract
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
Affiliation(s)
| | - Travis Gagie
- EIT, Diego Portales University, Santiago, Chile
- CeBiB, Santiago, Chile
| | - Alan Kuhnle
- CISE, University of Florida, Gainesville, FL USA
- Informatics Institute, Gainesville, FL USA
| | | | - Giovanni Manzini
- University of Eastern Piedmont, Alessandria, Italy
- IIT, CNR, Pisa, Italy
| | - Taher Mun
- Johns Hopkins University, Baltimore, MD USA
| |
|
50
|
Abstract
There is growing interest in using genetic variants to augment the reference genome into a graph genome, with alternative sequences, to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead.
Affiliation(s)
- Jacob Pritt
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA.
| |
|