1
|
Finding maximal exact matches in graphs. Algorithms Mol Biol 2024; 19:10. [PMID: 38468275 DOI: 10.1186/s13015-024-00255-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 01/30/2024] [Indexed: 03/13/2024] Open
Abstract
BACKGROUND We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least κ ( κ -MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. RESULTS In this paper we show an O ( n · L · d L - 1 + m + M κ , L ) -time algorithm finding all κ -MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, m = | Q | , and M κ , L is the number of output MEMs. We use this algorithm to develop a κ -MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time O ( n H 2 + m + M κ ) , where H is the maximum number of nodes in a block, and M κ is the total number of κ -MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. CONCLUSIONS We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems .
Collapse
|
2
|
Tackling reference bias in genotyping by using founder sequences with PanVC 3. BIOINFORMATICS ADVANCES 2024; 4:vbae027. [PMID: 38464975 PMCID: PMC10924279 DOI: 10.1093/bioadv/vbae027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Revised: 02/07/2024] [Accepted: 02/29/2024] [Indexed: 03/12/2024]
Abstract
Summary Overcoming reference bias and calling insertions and deletions are major challenges in genotyping. We present PanVC 3, a set of software that can be utilized as part of various variant calling workflows. We show that, by incorporating known genetic variants to a set of founder sequences to which reads are aligned, reference bias is reduced and precision of calling insertions and deletions is improved. Availability and implementation PanVC 3 and its source code are freely available at https://github.com/tsnorri/panvc3 and at https://anaconda.org/tsnorri/panvc3 under the MIT licence. The experiment scripts are available at https://github.com/algbio/panvc3-experiments.
Collapse
|
3
|
Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 2023; 39:btad460. [PMID: 37494467 PMCID: PMC10423031 DOI: 10.1093/bioinformatics/btad460] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2022] [Revised: 06/08/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. AVAILABILITY AND IMPLEMENTATION GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
Collapse
|
4
|
Abstract
Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, opening the possibility to sequence all colonies on selective plates with a single DNA extraction and sequencing step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on single-colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.
Collapse
|
5
|
Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment. PLoS One 2021; 16:e0255260. [PMID: 34343181 PMCID: PMC8330939 DOI: 10.1371/journal.pone.0255260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2021] [Accepted: 07/12/2021] [Indexed: 11/19/2022] Open
Abstract
Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing n = 13,375,031 (488 GB) GenBank database to BLAST index resulted in CR of 62:1 in 575 minutes. BLASTing 189,864 Crispr-Cas9 gRNA target sequences (23 MB in total) to the compressed index of human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB mixed bacterial sequences were (n = 599) were blasted to the compressed index of 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB mixed sequences (n = 4,167) were blasted to the compressed index of 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
Collapse
|
6
|
Accurate spliced alignment of long RNA sequencing reads. Bioinformatics 2021; 37:4643-4651. [PMID: 34302453 PMCID: PMC8665758 DOI: 10.1093/bioinformatics/btab540] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2021] [Revised: 06/29/2021] [Accepted: 07/20/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses are dependent on read alignments. However, the error rate and sequencing length of the reads create new challenges for accurately aligning them, particularly around small exons. RESULTS We present an alignment method uLTRA for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. We show that uLTRA produces higher accuracy over state-of-the-art aligners with substantially higher accuracy for small exons on simulated and synthetic data. On simulated data, uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 to 20 nucleotides. On biological data where true read location is unknown, we show several examples where uLTRA aligns to known and novel isoforms containing small exons that are not detected with other aligners. While uLTRA obtains its accuracy using annotations, it can also be used as a wrapper around minimap2 to align reads outside annotated regions. AVAILABILITY uLTRA is available at https://github.com/ksahlin/ultra. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
7
|
Founder Reconstruction Enables Scalable and Seamless Pangenomic Analysis. Bioinformatics 2021; 37:4611-4619. [PMID: 34260702 PMCID: PMC8665761 DOI: 10.1093/bioinformatics/btab516] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2020] [Revised: 05/29/2021] [Accepted: 07/08/2021] [Indexed: 11/14/2022] Open
Abstract
Motivation Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration to existing workflows remains a challenge. Results We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit into a founder multiple alignment, that is then indexed using a hybrid scheme that exploits general purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling. Availability and implementation Our open access tools and instructions how to reproduce our experiments are available at the following address: https://github.com/algbio/panvc-founders. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
|
8
|
Bit-parallel sequence-to-graph alignment. Bioinformatics 2020; 35:3599-3607. [PMID: 30851095 PMCID: PMC6761980 DOI: 10.1093/bioinformatics/btz162] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2018] [Revised: 01/19/2019] [Accepted: 03/07/2019] [Indexed: 01/16/2023] Open
Abstract
Motivation Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V|+⌈mw⌉|E| log w) for acyclic graphs and O(|V|+m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation https://github.com/maickrau/GraphAligner Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
|
9
|
MIPUP: minimum perfect unmixed phylogenies for multi-sampled tumors via branchings and ILP. Bioinformatics 2019; 35:769-777. [PMID: 30101335 PMCID: PMC6394401 DOI: 10.1093/bioinformatics/bty683] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2018] [Revised: 07/08/2018] [Accepted: 08/07/2018] [Indexed: 12/31/2022] Open
Abstract
Motivation Discovering the evolution of a tumor may help identify driver mutations and provide a more comprehensive view on the history of the tumor. Recent studies have tackled this problem using multiple samples sequenced from a tumor, and due to clinical implications, this has attracted great interest. However, such samples usually mix several distinct tumor subclones, which confounds the discovery of the tumor phylogeny. Results We study a natural problem formulation requiring to decompose the tumor samples into several subclones with the objective of forming a minimum perfect phylogeny. We propose an Integer Linear Programming formulation for it, and implement it into a method called MIPUP. We tested the ability of MIPUP and of four popular tools LICHeE, AncesTree, CITUP, Treeomics to reconstruct the tumor phylogeny. On simulated data, MIPUP shows up to a 34% improvement under the ancestor-descendant relations metric. On four real datasets, MIPUP’s reconstructions proved to be generally more faithful than those of LICHeE. Availability and implementation MIPUP is available at https://github.com/zhero9/MIPUP as open source. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
|
10
|
Linear time minimum segmentation enables scalable founder reconstruction. Algorithms Mol Biol 2019; 14:12. [PMID: 31131017 PMCID: PMC6525415 DOI: 10.1186/s13015-019-0147-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2018] [Accepted: 05/04/2019] [Indexed: 11/15/2022] Open
Abstract
Background We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${\mathcal {R}} = \{R_1, \ldots , R_m\}$$\end{document}R={R1,…,Rm} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$[a,b] \in P$$\end{document}[a,b]∈P has length at least L and the number \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$d(a,b)=|\{R_i[a,b] :1\le i \le m\}|$$\end{document}d(a,b)=|{Ri[a,b]:1≤i≤m}| of distinct substrings at segment [a, b] is minimized over \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$[a,b] \in P$$\end{document}[a,b]∈P. The distinct substrings in the segments represent founder blocks that can be concatenated to form \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\max \{ d(a,b) :[a,b] \in P \}$$\end{document}max{d(a,b):[a,b]∈P} founder sequences representing the original \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${\mathcal {R}}$$\end{document}R such that crossovers happen only at segment boundaries. Results We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$O(mn^2)$$\end{document}O(mn2). Conclusions Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.
Collapse
|
11
|
Abstract
Transcript prediction can be modeled as a graph problem where exons are modeled as nodes and reads spanning two or more exons are modeled as exon chains. Pacific Biosciences third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long-read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability: The simulated data and in-house scripts used for this article are available at http://www.cs.helsinki.fi/group/gsa/exon-chains/exon-chains-bib.tar.bz2.
Collapse
|
12
|
Abstract
Background Typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole-genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. Results We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. Conclusions Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.
Collapse
|
13
|
A safe and complete algorithm for metagenomic assembly. Algorithms Mol Biol 2018; 13:3. [PMID: 29445416 PMCID: PMC5802251 DOI: 10.1186/s13015-018-0122-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2017] [Accepted: 01/20/2018] [Indexed: 11/10/2022] Open
Abstract
Background Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. Approach We address this problem with the “safe and complete” framework of Tomescu and Medvedev (Research in computational Molecular biology—20th annual conference, RECOMB 9649:152–163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G. Results We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$O(m^2 + n^3)$$\end{document}O(m2+n3), and in the edge-covering case it runs in time \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$O(m^2n)$$\end{document}O(m2n); n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.
Collapse
|
14
|
Abstract
Although recent developments in DNA sequencing have allowed for great leaps in both the quality and quantity of genome assembly projects, de novo assemblies still lack the efficiency and accuracy required for studying genetic variation of individuals. Thus, efficient and accurate methods for calling and genotyping genetic variants are fundamental to studying the genomes of individuals. We study the problem of genotyping insertion variants. We assume that the location of the insertion is given, and the task is to find the insertion sequence. Insertions are the hardest structural variant to genotype, because the insertion sequence must be assembled from the reads, whereas genotyping other structural variants only requires transformations of the reference genome. The current methods for constructing insertion variants are mostly linked to variation calling methods and are only able to construct small insertions. A sub-problem in genome assembly, the gap filling problem, provides techniques that are readily applicable to insertion genotyping. Gap filling takes the context and length of a missing sequence in a genome assembly and attempts to assemble the intervening sequence. In this paper we show how tools and methods for gap filling can be used to assemble insertion variants by modeling the problem of insertion genotyping as filling gaps in the reference genome. We further give a general read filtering scheme to make the method scalable to large data sets. Our results show that gap filling methods are competitive against insertion genotyping tools. We further show that read filtering improves performance of insertion genotyping especially for long insertions. Our experiments show that on long insertions the new proposed method is the most accurate one, whereas on short insertions it has comparable performance as compared against existing tools.
Collapse
|
15
|
Abstract 5281: Fast and scalable software for comparative variant analysis and visualization of massive next-generation sequencing data. Cancer Res 2016. [DOI: 10.1158/1538-7445.am2016-5281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Abstract
Next-generation sequencing (NGS) techniques produce high quantities of DNA and RNA sequencing data, enabling variation analysis at a whole genome level. Due to the massive availability of NGS data, the processing and analysis now constitute a serious challenge for research in life sciences. This necessitates the development of novel, user-friendly and scalable software tools, which are able to jointly handle large NGS data sets in order to break the bottleneck that has already moved from data generation to data analysis in high-throughput biology.
Here we introduce a highly efficient NGS analysis and visualization software tool (Rikurator), which is designed for researchers who are struggling with NGS data analysis and may not have access to a dedicated bioinformatics team. Rikurator can be applied to a multitude of genomic analysis tasks, including discovery of pathogenic mutations in familial or sporadic sample sets, analysis of structural variants and somatic characterization of cancer genomes. To these ends the software implements interactive variant filtering and annotation methods based on sample comparison, control data (e.g. ExAC), amino acid changes and quality scores, which provide an intuitive view at the studied samples. Shared candidate genes in a set of dozens of individuals can be discovered in a matter of minutes with an ordinary laptop computer. Importantly, the software is able to analyze hundreds of whole-genome sequenced samples using only a modest amount of RAM. The tool also visualizes and integrates genes, variants (VCF file format), sequence reads (BAM) and user defined features (BED/GFF). In addition, Rikurator is not limited to human studies, supporting reference genomes and genome annotations in standard formats. The software supports targeted (e.g. exome), whole-genome, RNA, Oxford Nanopore and PacBio sequencing data.
Rikurator can be run on Windows, Linux and Mac, and it will be freely available after publication.
Citation Format: Riku Katainen, Veli Mäkinen, Lauri A. Aaltonen, Esa Pitkänen. Fast and scalable software for comparative variant analysis and visualization of massive next-generation sequencing data. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 5281.
Collapse
|
16
|
|
17
|
Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1345-1354. [PMID: 26671806 DOI: 10.1109/tcbb.2015.2418753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote ⟨G⟩, and give a dynamic programming algorithm running in time O(W(k)⟨G⟩(k)(⟨G⟩+ k)n) , where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, ⟨G⟩, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.
Collapse
|
18
|
Repeat- and error-aware comparison of deletions. Bioinformatics 2015; 31:2947-54. [PMID: 25979471 DOI: 10.1093/bioinformatics/btv304] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2014] [Accepted: 05/08/2015] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION The number of reported genetic variants is rapidly growing, empowered by ever faster accumulation of next-generation sequencing data. A major issue is comparability. Standards that address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities are missing. This decisively lowers the quality of 'consensus' callsets and hampers the removal of duplicate entries in variant databases, which can have deleterious effects in downstream analyses. RESULTS We introduce a sound framework for comparison of deletions that captures both tool-induced inaccuracies and repeat-induced ambiguities. We present a maximum matching algorithm that outputs virtual duplicates among two sets of predictions/annotations. We demonstrate that our approach is clearly superior over ad hoc criteria, like overlap, and that it can reduce the redundancy among callsets substantially. We also identify large amounts of duplicate entries in the Database of Genomic Variants, which points out the immediate relevance of our approach. AVAILABILITY AND IMPLEMENTATION Implementation is open source and available from https://bitbucket.org/readdi/readdi CONTACT roland.wittler@uni-bielefeld.de or t.marschall@mpi-inf.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
19
|
Abstract
Background Traditionally biological similarity search has been studied under the abstraction of a single string to represent each genome. The more realistic representation of diploid genomes, with two strings defining the genome, has so far been largely omitted in this context. With the development of sequencing techniques and better phasing routines through haplotype assembly algorithms, we are not far from the situation when individual diploid genomes could be represented in their full complexity with a pair-wise alignment defining the genome. Results We propose a generalization of global alignment that is designed to measure similarity between phased predictions of individual diploid genomes. This generalization takes into account that individual diploid genomes evolve through a mutation and recombination process, and that predictions may be erroneous in both dimensions. Even though our model is generic, we focus on the case where one wants to measure only the similarity of genome content allowing free recombination. This results into efficient algorithms for direct application in (i) evaluation of variation calling predictions and (ii) progressive multiple alignments based on labeled directed acyclic graphs (DAGs) to represent profiles. The latter may be of more general interest, in connection to covering alignment of DAGs. Extensions of our model and algorithms can be foreseen to have applications in evaluating phasing algorithms, as well as more fundamental role in phasing child genome based on parent genomes.
Collapse
|
20
|
SNV-PPILP: refined SNV calling for tumor data using perfect phylogenies and ILP. ACTA ACUST UNITED AC 2014; 31:1133-5. [PMID: 25398608 DOI: 10.1093/bioinformatics/btu755] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2014] [Accepted: 11/11/2014] [Indexed: 01/23/2023]
Abstract
MOTIVATION Recent studies sequenced tumor samples from the same progenitor at different development stages and showed that by taking into account the phylogeny of this development, single-nucleotide variant (SNV) calling can be improved. Accurate SNV calls can better reveal early-stage tumors, identify mechanisms of cancer progression or help in drug targeting. RESULTS We present SNV-PPILP, a fast and easy to use tool for refining GATK's Unified Genotyper SNV calls, for multiple samples assumed to form a phylogeny. We tested SNV-PPILP on simulated data, with a varying number of samples, SNVs, read coverage and violations of the perfect phylogeny assumption. We always match or improve the accuracy of GATK, with a significant improvement on low read coverage. AVAILABILITY AND IMPLEMENTATION SNV-PPILP, available at cs.helsinki.fi/gsa/snv-ppilp/, is written in Python and requires the free ILP solver lp_solve. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
21
|
Abstract
BACKGROUND Multi-assembly problems have gathered much attention in the last years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq), or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. RESULTS In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems where also a set of subpaths, or pairs of subpaths, of the graph have to be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial-time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial-time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems.
Collapse
|
22
|
The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nat Commun 2014; 5:4737. [PMID: 25189940 PMCID: PMC4164777 DOI: 10.1038/ncomms5737] [Citation(s) in RCA: 153] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2014] [Accepted: 07/17/2014] [Indexed: 12/30/2022] Open
Abstract
Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) of the Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a widely recognized model species in metapopulation biology and eco-evolutionary research, which has the putative ancestral karyotype of n=31. Using a phylogenetic analyses of Nymphalidae and of other Lepidoptera, combined with orthologue-level comparisons of chromosomes, we conclude that the ancestral lepidopteran karyotype has been n=31 for at least 140 My. We show that fusion chromosomes have retained the ancestral chromosome segments and very few rearrangements have occurred across the fusion sites. The same, shortest ancestral chromosomes have independently participated in fusion events in species with smaller karyotypes. The short chromosomes have higher rearrangement rate than long ones. These characteristics highlight distinctive features of the evolutionary dynamics of butterflies and moths. Butterflies and moths (Lepidoptera) vary in chromosome number. Here, the authors sequence the genome of the Glanville fritillary butterfly, Melitaea cinxia, show it has the ancestral lepidopteran karyotype and provide insight into how chromosomal fusions have shaped karyotype evolution in butterflies and moths.
Collapse
|
23
|
Indexing Graphs for Path Queries with Applications in Genome Research. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:375-88. [PMID: 26355784 DOI: 10.1109/tcbb.2013.2297101] [Citation(s) in RCA: 141] [Impact Index Per Article: 14.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
We propose a generic approach to replace the canonical sequence representation of genomes with graph representations, and study several applications of such extensions. We extend the Burrows-Wheeler transform (BWT) of strings to acyclic directed labeled graphs, to support path queries as an extension to substring searching. We develop, apply, and tailor this technique to a) read alignment on an extended BWT index of a graph representing pan-genome, i.e., reference genome and known variants of it; and b) split-read alignment on an extended BWT index of a splicing graph. Other possible applications include probe/primer design, alignments to assembly graphs, and alignments to phylogenetic tree of partial-order graphs. We report several experiments on the feasibility and applicability of the approach. Especially on highly-polymorphic genome regions our pan-genome index is making a significant improvement in alignment accuracy.
Collapse
|
24
|
Abstract
Background Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms), depending on the individual, on the tissue the cell is in, or in response to some stimuli. Recent RNA-Seq technology allows for new high-throughput ways for isoform identification and quantification based on short reads, and various methods have been put forward for this non-trivial problem. Results In this paper we propose a novel radically different method based on minimum-cost network flows. This has a two-fold advantage: on the one hand, it translates the problem as an established one in the field of network flows, which can be solved in polynomial time, with different existing solvers; on the other hand, it is general enough to encompass many of the previous proposals under the least sum of squares model. Our method works as follows: in order to find the transcripts which best explain, under a given fitness model, a splicing graph resulting from an RNA-Seq experiment, we find a min-cost flow in an offset flow network, under an equivalent cost model. Under very weak assumptions on the fitness model, the optimal flow can be computed in polynomial time. Parsimoniously splitting the flow back into few path transcripts can be done with any of the heuristics and approximations available from the theory of network flows. In the present implementation, we choose the simple strategy of repeatedly removing the heaviest path. Conclusions We proposed a new very general method based on network flows for a multiassembly problem arising from isoform identification and quantification with RNA-Seq. Experimental results on prediction accuracy show that our method is very competitive with popular tools such as Cufflinks and IsoLasso. Our tool, called Traph (Transcrips in gRAPHs), is available at: http://www.cs.helsinki.fi/gsa/traph/.
Collapse
|
25
|
Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics 2012; 13:255. [PMID: 23031320 PMCID: PMC3556137 DOI: 10.1186/1471-2105-13-255] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2012] [Accepted: 09/26/2012] [Indexed: 01/10/2023] Open
Abstract
Background For the development of genome assembly tools, some comprehensive and efficiently computable validation measures are required to assess the quality of the assembly. The mostly used N50 measure summarizes the assembly results by the length of the scaffold (or contig) overlapping the midpoint of the length-order concatenation of scaffolds (contigs). Especially for scaffold assemblies it is non-trivial to combine a correctness measure to the N50 values, and the current methods for doing this are rather involved. Results We propose a simple but rigorous normalized N50 assembly metric that combines N50 with such a correctness measure; assembly is split into as many parts as necessary to align each part to the reference. For scalability, we first compute maximal local approximate matches between scaffolds and reference in distributed manner, and then proceed with co-linear chaining to find a global alignment. Best alignment is removed from the scaffold and the process is iterated with the remaining scaffold content in order to split the scaffold into correctly aligning parts. The proposed normalized N50 metric is then the N50 value computed for the final correctly aligning parts. As a side result of independent interest, we show how to modify co-linear chaining to restrict gaps to produce a more sensible global alignment. Conclusions We propose and implement a comprehensive and efficient approach to compute a metric that summarizes scaffold assembly correctness and length. Our implementation can be downloaded from
http://www.cs.helsinki.fi/group/scaffold/normalizedN50/.
Collapse
|
26
|
Detection of Viruses in Sweetpotato from Honduras and Guatemala Augmented by Deep-Sequencing of Small-RNAs. PLANT DISEASE 2012; 96:1430-1437. [PMID: 30727310 DOI: 10.1094/pdis-03-12-0268-re] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Sweetpotato (Ipomoea batatas) plants become infected with over 30 RNA or DNA viruses in different parts of the world but little is known about viruses infecting sweetpotato crops in Central America, the center of sweetpotato domestication. Small-RNA deep-sequencing (SRDS) analysis was used to detect viruses in sweetpotato in Honduras and Guatemala, which detected Sweet potato feathery mottle virus strain RC and Sweet potato virus C (Potyvirus spp.), Sweet potato chlorotic stunt virus strain WA (SPCSV-WA; Crinivirus sp.), Sweet potato leaf curl Georgia virus (Begomovirus sp.), and Sweet potato pakakuy virus strain B (synonym: Sweet potato badnavirus B). Results were confirmed by polymerase chain reaction and sequencing of the amplicons. Four viruses were detected in a sweetpotato sample from the Galapagos Islands. Serological assays available to two of the five viruses gave results consistent with those obtained by SRDS, and were negative for six additional sweetpotato viruses tested. Plants coinfected with SPCSV-WA and one to two other viruses displayed severe foliar symptoms of epinasty and leaf malformation, purpling, vein banding, or chlorosis. The results suggest that SRDS is suitable for use as a universal, robust, and reliable method for detection of plant viruses, and especially useful for determining virus infections in crops infected with a wide range of unrelated viruses.
Collapse
|
27
|
Abstract
Motivation: Assembling genomes from short read data has become increasingly popular, but the problem remains computationally challenging especially for larger genomes. We study the scaffolding phase of sequence assembly where preassembled contigs are ordered based on mate pair data. Results: We present MIP Scaffolder that divides the scaffolding problem into smaller subproblems and solves these with mixed integer programming. The scaffolding problem can be represented as a graph and the biconnected components of this graph can be solved independently. We present a technique for restricting the size of these subproblems so that they can be solved accurately with mixed integer programming. We compare MIP Scaffolder to two state of the art methods, SOPRA and SSPACE. MIP Scaffolder is fast and produces better or as good scaffolds as its competitors on large genomes. Availability: The source code of MIP Scaffolder is freely available at http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/. Contact:leena.salmela@cs.helsinki.fi
Collapse
|
28
|
Abstract
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
Collapse
|
29
|
Abstract
Suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length
n
from alphabet Σ = {
A
,
C
,
G
,
T
} can be stored in
n
log |Σ| = 2
n
bits, its suffix tree occupies
O
(
n
log
n
) bits. In practice, the size difference easily reaches factor 50.
We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (2007). The compressed suffix tree occupies space proportional to the text size, that is,
O
(
n
log |Σ|) bits, and supports all typical suffix tree operations with at most log
n
factor slowdown. Our experiments show that, for example, on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of the normal suffix tree. At the same time, a representative algorithm is slowed down by factor 30.
Our implementation follows the original proposal in spirit, but some internal parts are tailored toward practical implementation. Our construction algorithm has time requirement
O
(
n
log
n
log |Σ|) and uses closely the same space as the final structure while constructing it: on the 10MB DNA sequence, the maximum space usage during construction is only 1.5 times the final product size. As by-products, we develop a method to create
Succinct Suffix Array
directly from Burrows-Wheeler transform and a space-efficient version of the suffixes-insertion algorithm to build balanced parentheses representation of suffix tree from LCP information.
Collapse
|
30
|
Glucose variability, blood pressure and arterial stiffness in type 1 diabetes. Diabetes Res Clin Pract 2008; 80:e4-7. [PMID: 18325620 DOI: 10.1016/j.diabres.2008.01.010] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/29/2007] [Accepted: 01/11/2008] [Indexed: 01/08/2023]
Abstract
AIMS Evidence suggests that chronic hyperglycaemia predicts not only microvascular disease but also macrovascular disease, however it is not known whether it is the glucose variability per se or the total glucose exposure that confers risk. The objective of this study was to investigate whether daily glucose variability influence blood pressure and arterial stiffness, an early sign of macrovascular disease, at baseline and during a hyperglycaemic clamp in patients with type 1 diabetes. METHODS Twenty-two non-smoking male patients with type 1 diabetes without any diabetic complications, participated in the study. The patients were monitored for 72-h using a continuous glucose monitoring system. Before and during a 2-h hyperglycaemic clamp, blood pressure as well as pulse wave analysis and pulse wave velocity (PWV) were performed to assess arterial stiffness. RESULTS No correlation was observed between mean amplitude of glycaemic excursions (MAGE) and arterial stiffness at baseline. There was a correlation between mean daily glucose and aortic PWV even after adjusting for BMI, HbA(1c), and duration of diabetes in a multiple regression analysis (r=0.48; P<0.01). MAGE (r=0.52; P<0.01) correlated independently with the change in aortic DBP during the clamp. CONCLUSIONS This study suggests that high mean daily blood glucose but not glucose variability per se is associated with arterial stiffness in patients with T1D. Daily glucose variability is positively associated with the change in central blood pressure during a hyperglycaemic clamp.
Collapse
|
31
|
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections. STRING PROCESSING AND INFORMATION RETRIEVAL 2008. [DOI: 10.1007/978-3-540-89097-3_17] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
32
|
Abstract
A peak is a pair of real values (x,y), where x is the time when peak of height y is registered. In the peak alignment problem, we are given two sequences of peaks, and our task is to align the sequences allowing some basic edit operations on the peaks. We study an instance of the peak alignment problem that arises in the analysis of Mass Spectrometry data in Systems Biology. There the measurement technique guarantees that two peaks (x,y), (x',y') can only be considered the same if x is close enough to x', and y is close enough to y'. We review some methods to do alignment under such restrictions on matches.
Collapse
|
33
|
The effects of aging on the processing of rising-intensity tonal and speech stimuli. ACTA ACUST UNITED AC 2007. [DOI: 10.1016/j.ics.2007.01.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
34
|
The cortical processing of rising-intensity tonal and speech stimuli in young adults: Effects of spectral complexity. ACTA ACUST UNITED AC 2007. [DOI: 10.1016/j.ics.2007.02.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
35
|
The effects of cortical stroke on the processing of spectrally impoverished and enriched auditory stimuli. ACTA ACUST UNITED AC 2007. [DOI: 10.1016/j.ics.2007.03.027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
36
|
Abstract
UNLABELLED Suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet sigma = {A, C, G, T} can be stored in n log absolute value(sigma) = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches factor 50. We provide an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log) absolute value(sigma)) bits, and supports all typical suffix tree operations with at most log n factor slowdown. Our experiments show that, e.g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of normal suffix tree. Typical operations are slowed down by factor 60. AVAILABILITY The C++ implementation under GNU license is available at http://www.cs.helsinki.fi/group/suds/cst/. An example program implementing a typical pattern discovery task is included. Experimental results in this note correspond to version 0.95.
Collapse
|
37
|
Dynamic Entropy-Compressed Sequences and Full-Text Indexes. COMBINATORIAL PATTERN MATCHING 2006. [DOI: 10.1007/11780441_28] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
38
|
Succinct Suffix Arrays Based on Run-Length Encoding. COMBINATORIAL PATTERN MATCHING 2005. [DOI: 10.1007/11496656_5] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
39
|
The auditory n100m response reflects changes in speech fundamental frequency. NEUROLOGY & CLINICAL NEUROPHYSIOLOGY : NCN 2004; 2004:49. [PMID: 16012605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
We studied the cortical activation underlying perception of variations in speech fundamental frequency (F0) as indexed by the amplitude, latency and source location of the auditory N100m response registered with magnetoencephalography (MEG). Ten subjects were presented with Finnish vowels with either a constant or an ascending/descending F0. We found that the human auditory cortex is sensitive to these time-varying changes in the F0 of speech: vowels with a constant F0 elicited more prominent N100m responses than did vowels with ascending or descending changes in F0. These results suggest that the speech-related behavior of the N100m arises out of cortical sensitivity to variations in the F0 and its harmonics which underlie the perception of pitch and intonation. The present observations are interpreted in terms of the interrelatedness of speech production and perception.
Collapse
|
40
|
Behavioral detection of spatial stimuli is reflected in auditory cortical dynamics. NEUROLOGY & CLINICAL NEUROPHYSIOLOGY : NCN 2004; 2004:50. [PMID: 16012685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
We studied the cortical processing of spatial stimuli by magnetoencephalographic (MEG) measurements using broadband noise bursts presented from eight sound source directions in the horizontal plane. The stimuli were individually created for each subject by using three-dimensional (3D) sound techniques. The subjects carried out a behavioral task where their accuracy for localizing the 3D stimuli was established. We found that the auditory N100m response was sensitive to the sound source direction, exhibiting contralaterally more preponderant responses in both the left and the right hemisphere. Generally, responses were more prominent in the right hemisphere. The behavioral performance of the subjects correlated positively with N100m amplitude organization, showing that the dynamics of auditory cortex predict behavioral sound detection.
Collapse
|
41
|
Periodic glottal excitation and formant frequencies in the perception of vowels. NEUROLOGY & CLINICAL NEUROPHYSIOLOGY : NCN 2004; 2004:103. [PMID: 16012623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Voiced speech is created by the fluctuating vocal folds generating the glottal pulseform. This excitation signal is the source of the speech fundamental frequency and its harmonic integer multiples. Periodic glottal excitation is required for the elicitation of speech-specific cortical processes indexed by the auditory N100m response. Here, we studied the cortical processing underlying the perception of the vowels /a/ and /u/ produced using normal and aperiodic phonation. The behavior of the N100m, registered with magnetoencephalography (MEG), was studied in 10 subjects. The amplitude and latency of the N100m as well as the center of gravity of the activated cortical areas varied as a function of stimulus periodicity. Further, the presence of glottal excitation had differential effects on the latency of the N100m elicited by the vowels /a/ and /u/. Thus, changes affecting the perceptual quality of speech signals without changing their phonetic content modify the dynamics of human auditory cortex.
Collapse
|
42
|
Cortical activity elicited by isolated vowels and diphthongs. NEUROLOGY & CLINICAL NEUROPHYSIOLOGY : NCN 2004; 2004:91. [PMID: 16021682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Cortical activity underlying speech perception has been studied mostly by using isolated vowels with constant formant frequencies. Speech, however, is characterized by formant transitions whereby formant frequencies change as a function of time. We used magnetoencephalography (MEG) to investigate cortical activity elicited by isolated vowels and diphthongs containing formant transitions. Ten subjects were presented with two isolated vowels /a/ and /u/ and diphthongs /au/ and /ua/. Stimulus duration was 200 ms, and the diphthongs started and ended with a 50-ms constant-formant period and included a 100-ms linear transition period. Apart from studying the auditory N100m response, we examined subsequent brain activity in a 500-ms poststimulus time window, as the transitions were expected to elicit activity also in later stages of cognitive processing. All the stimuli elicited prominent N100m responses. Thereafter, both the isolated vowels and diphthongs elicited sustained brain activity lasting up to 500 ms. The present observations indicate that identification of the speech sounds as well as changes in their identity are reflected in the auditory N100m. Notably, the stimuli appeared to elicit left-hemispheric activity resembling the N400, typically obtained by using more complicated speech stimuli such as words and sentences.
Collapse
|
43
|
Advantages of Backward Searching — Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays. ALGORITHMS AND COMPUTATION 2004. [DOI: 10.1007/978-3-540-30551-4_59] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
44
|
Visual attention to words of native versus later acquired languages: a magnetoencephalographic study in humans. Neurosci Lett 2001; 310:33-6. [PMID: 11524151 DOI: 10.1016/s0304-3940(01)02087-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
We recorded evoked magnetic fields while short words were visually presented in different languages with an oddball paradigm. The task was to count how many words were in a target language when most of the words were in another language and there were also non-target deviants in a third language. When native words (Finnish) were targets, they evoked a selection response at the latency of 300-600 ms. However, when the task was to count non-native words among native standards, in addition to the targets, also the non-target foreign words evoked the selection response. These results may reflect differences in the selection process for native versus non-native words brought about by different proficiency levels of the languages.
Collapse
|
45
|
Abstract
The effects of stimulus duration on the elicitation and equivalent current dipole (ECD) localization of the auditory N400(m) were studied in two subject groups, either familiar or unfamiliar with Finnish language, using a sentence-processing paradigm with incongruent ending words of either short or long duration. Long-duration words elicited a broad response at around 400 ms, the generator location(s) of which could not be reliably determined using ECD estimation. In contrast, short-duration words elicited a sharp, strong-amplitude response at about 400 ms latency and it's source location could be reliably determined as being in the vicinity of auditory cortex. Subjects unfamiliar with the Finnish language elicited no response at the 400 ms range. Thus, the use of short-duration words appears to be an important prerequisite for the elicitation and localization of N400m. The differential amplitude behaviour of the N400m between the two subject groups further suggests that comprehension of the semantic content of the speech message is also required.
Collapse
|
46
|
Sound localization in the human brain: neuromagnetic observations. Neuroreport 2000; 11:1535-8. [PMID: 10841372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/16/2023]
Abstract
Sound location processing in the human auditory cortex was studied with magnetoencephalography (MEG) by producing spatial stimuli using a modern stimulus generation methodology utilizing head-related transfer functions (HRTFs). The stimulus set comprised wideband noise bursts filtered through HRTFs in order to produce natural spatial sounds. Neuromagnetic responses for stimuli representing eight equally spaced sound source directions in the azimuthal plane were measured from 10 subjects. The most prominent response, the cortically generated N1m, was investigated above the left and right hemisphere. We found, firstly, that the HRTF-based stimuli presented from different directions elicited contralaterally prominent N1m responses. Secondly, we found that cortical activity reflecting the processing of spatial sound stimuli was more pronounced in the right than in the left hemisphere.
Collapse
|
47
|
Tolerance of acetate, propionate and sorbate by Saccharomyces cerevisiae and Torulopsis holmii. Food Microbiol 1984. [DOI: 10.1016/0740-0020(84)90019-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|