1
Xue H, Gallopin M, Marchet C, Nguyen HN, Wang Y, Lainé A, Bessiere C, Gautheret D. KaMRaT: a C++ toolkit for k-mer count matrix dimension reduction. Bioinformatics 2024; 40:btae090. [PMID: 38444086 PMCID: PMC10942800 DOI: 10.1093/bioinformatics/btae090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 12/26/2023] [Accepted: 03/04/2024] [Indexed: 03/07/2024]
Abstract
MOTIVATION: KaMRaT is designed for processing large k-mer count tables derived from multi-sample RNA-seq data. Its primary objective is to identify condition-specific or differentially expressed sequences, regardless of gene or transcript annotation. RESULTS: KaMRaT is implemented in C++. Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into contigs, and selecting k-mers based on their occurrence across specific samples. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are available via https://github.com/Transipedia/KaMRaT.
Affiliation(s)
- Haoliang Xue
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Mélina Gallopin
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Ha N Nguyen
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Yunfeng Wang
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Antoine Lainé
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
- Chloé Bessiere
  - IRMB, University of Montpellier, 34295 Montpellier, France
- Daniel Gautheret
  - I2BC, Université Paris-Saclay, CNRS, CEA, 91190 Gif-sur-Yvette, France
2
Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023; 39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures, which can query small signatures or variants while scaling to collections of up to 10 000 eukaryotic samples. RESULTS: Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6 fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query can require only a single random access and be performed in constant time. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days and the entire GenBank bacterial genome collection in a single day, for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour. AVAILABILITY AND IMPLEMENTATION: PAC's open-source software is available at https://github.com/Malfoy/PAC.
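As a rough illustration of what an approximate membership query over a collection of datasets looks like, the sketch below builds one plain Bloom filter per dataset and reports the datasets that (probably) contain the k-mers of a query. It is a generic textbook construction, not PAC's partitioned aggregated Bloom comb tree; the class and function names, the filter size, the number of hash functions, and the `min_fraction` threshold are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for salt in range(self.n_hashes):
            digest = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_index(datasets, k, n_bits=1 << 16, n_hashes=3):
    """Build one filter per dataset; datasets maps a name to a list of sequences."""
    index = {}
    for name, seqs in datasets.items():
        bf = BloomFilter(n_bits, n_hashes)
        for seq in seqs:
            for km in kmers(seq, k):
                bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k, min_fraction=1.0):
    """Names of datasets where at least min_fraction of the query k-mers test positive."""
    query_kmers = list(kmers(seq, k))
    hits = []
    for name, bf in index.items():
        found = sum(km in bf for km in query_kmers)
        if query_kmers and found / len(query_kmers) >= min_fraction:
            hits.append(name)
    return hits


index = build_index({"sample_A": ["ACGTACGTGG"], "sample_B": ["TTTTTTTTTT"]}, k=4)
hits = query(index, "ACGTACG", k=4)  # every query k-mer was inserted into sample_A
```

Because a Bloom filter never produces false negatives, `sample_A` is guaranteed to be among the hits; a dataset that does not contain the query can only appear through false positives, whose rate depends on the filter size and load.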
Affiliation(s)
- Camille Marchet
  - University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Antoine Limasset
  - University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
3
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
- Thomas Baudeau
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Bastien Cazaux
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
4
Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson G, Alser M, Antipov D, Beghini F, Bertrand D, Brito JJ, Brown CT, Buchmann J, Buluç A, Chen B, Chikhi R, Clausen PTLC, Cristian A, Dabrowski PW, Darling AE, Egan R, Eskin E, Georganas E, Goltsman E, Gray MA, Hansen LH, Hofmeyr S, Huang P, Irber L, Jia H, Jørgensen TS, Kieser SD, Klemetsen T, Kola A, Kolmogorov M, Korobeynikov A, Kwan J, LaPierre N, Lemaitre C, Li C, Limasset A, Malcher-Miranda F, Mangul S, Marcelino VR, Marchet C, Marijon P, Meleshko D, Mende DR, Milanese A, Nagarajan N, Nissen J, Nurk S, Oliker L, Paoli L, Peterlongo P, Piro VC, Porter JS, Rasmussen S, Rees ER, Reinert K, Renard B, Robertsen EM, Rosen GL, Ruscheweyh HJ, Sarwal V, Segata N, Seiler E, Shi L, Sun F, Sunagawa S, Sørensen SJ, Thomas A, Tong C, Trajkovski M, Tremblay J, Uritskiy G, Vicedomini R, Wang Z, Wang Z, Wang Z, Warren A, Willassen NP, Yelick K, You R, Zeller G, Zhao Z, Zhu S, Zhu J, Garrido-Oter R, Gastmeier P, Hacquard S, Häußler S, Khaledi A, Maechler F, Mesny F, Radutoiu S, Schulze-Lefert P, Smit N, Strowig T, Bremges A, Sczyrba A, McHardy AC. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 2022; 19:429-440. [PMID: 35396482 PMCID: PMC9007738 DOI: 10.1038/s41592-022-01431-4] [Citation(s) in RCA: 89] [Impact Index Per Article: 44.5] [Received: 07/22/2021] [Accepted: 02/14/2022] [Indexed: 12/20/2022]
Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses. This study presents the results of the second round of the Critical Assessment of Metagenome Interpretation challenges (CAMI II), which is a community-driven effort for comprehensively benchmarking tools for metagenomics data analysis.
Affiliation(s)
- Fernando Meyer
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Adrian Fritz
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
- Zhi-Luo Deng
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
- Till Robin Lesker
  - German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany; Helmholtz Centre for Infection Research, Braunschweig, Germany
- Gary Robertson
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Dmitry Antipov
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Jan Buchmann
  - Institute for Biological Data Science, Heinrich-Heine-University, Düsseldorf, Germany
- Aydin Buluç
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Bo Chen
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Philip T L C Clausen
  - National Food Institute, Division of Global Surveillance, Technical University of Denmark, Lyngby, Denmark
- Alexandru Cristian
  - Drexel University, Philadelphia, PA, USA; Google Inc., Philadelphia, PA, USA
- Piotr Wojciech Dabrowski
  - Robert Koch-Institut, Berlin, Germany; Hochschule für Technik und Wirtschaft Berlin, Berlin, Germany
- Rob Egan
  - DOE Joint Genome Institute, Berkeley, CA, USA; Lawrence Berkeley National Laboratories, Berkeley, CA, USA
- Eleazar Eskin
  - University of California, Los Angeles, Los Angeles, CA, USA
- Eugene Goltsman
  - DOE Joint Genome Institute, Berkeley, CA, USA; Lawrence Berkeley National Laboratories, Berkeley, CA, USA
- Melissa A Gray
  - Drexel University, Philadelphia, PA, USA; Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA
- Lars Hestbjerg Hansen
  - University of Copenhagen, Department of Plant and Environmental Science, Frederiksberg, Denmark
- Steven Hofmeyr
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Pingqin Huang
  - School of Computer Science, Fudan University, Shanghai, China
- Luiz Irber
  - University of California, Davis, Davis, CA, USA
- Huijue Jia
  - BGI-Shenzhen, Shenzhen, China; Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
- Tue Sparholt Jørgensen
  - Technical University of Denmark, Novo Nordisk Foundation Center for Biosustainability, Lyngby, Denmark; Aarhus University, Department of Environmental Science, Roskilde, Denmark
- Silas D Kieser
  - Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Swiss Institute of Bioinformatics, Geneva, Switzerland
- Axel Kola
  - Charité-Universitätsmedizin Berlin, Berlin, Germany
- Mikhail Kolmogorov
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
- Anton Korobeynikov
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia; Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia
- Jason Kwan
  - University of Wisconsin-Madison, Madison, WI, USA
- Chenhao Li
  - Genome Institute of Singapore, Singapore, Singapore
- Fabio Malcher-Miranda
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Vanessa R Marcelino
  - Sydney Medical School, The University of Sydney, Sydney, Australia; Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Australia
- Pierre Marijon
  - Department of Computer Science, Inria, University of Lille, CNRS, Lille, France
- Dmitry Meleshko
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Daniel R Mende
  - Amsterdam University Medical Center, Amsterdam, the Netherlands
- Alessio Milanese
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland; Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
- Niranjan Nagarajan
  - Genome Institute of Singapore, A*STAR, Singapore, Singapore; National University of Singapore, Singapore, Singapore
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Leonid Oliker
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Lucas Paoli
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Vitor C Piro
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Simon Rasmussen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Evan R Rees
  - University of Wisconsin-Madison, Madison, WI, USA
- Knut Reinert
  - Institute for Bioinformatics, FU Berlin, Berlin, Germany
- Bernhard Renard
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany; Bioinformatics Unit (MF1), Robert Koch Institute, Berlin, Germany
- Gail L Rosen
  - Drexel University, Philadelphia, PA, USA; Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA; Center for Biological Discovery from Big Data, Philadelphia, PA, USA
- Hans-Joachim Ruscheweyh
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Varuni Sarwal
  - University of California, Los Angeles, Los Angeles, CA, USA
- Nicola Segata
  - Department CIBIO, University of Trento, Trento, Italy
- Enrico Seiler
  - Institute for Bioinformatics, FU Berlin, Berlin, Germany
- Lizhen Shi
  - Florida Polytechnic University, Lakeland, FL, USA
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA, USA
- Shinichi Sunagawa
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Ashleigh Thomas
  - DOE Joint Genome Institute, Berkeley, CA, USA; University of British Columbia, Vancouver, British Columbia, Canada
- Mirko Trajkovski
  - Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Diabetes Center, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julien Tremblay
  - Energy, Mining and Environment, National Research Council Canada, Montreal, Quebec, Canada
- Zhengyang Wang
  - School of Computer Science, Fudan University, Shanghai, China
- Ziye Wang
  - School of Mathematical Sciences, Fudan University, Shanghai, China
- Zhong Wang
  - Department of Energy Joint Genome Institute, Berkeley, CA, USA; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; School of Natural Sciences, University of California at Merced, Merced, CA, USA
- Katherine Yelick
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Ronghui You
  - School of Computer Science, Fudan University, Shanghai, China
- Georg Zeller
  - Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
- Jie Zhu
  - BGI-Shenzhen, Shenzhen, China; Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
- Susanne Häußler
  - Helmholtz Centre for Infection Research, Braunschweig, Germany
- Ariane Khaledi
  - Helmholtz Centre for Infection Research, Braunschweig, Germany
- Fantin Mesny
  - Max Planck Institute for Plant Breeding Research, Köln, Germany
- Nathiana Smit
  - Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Strowig
  - Helmholtz Centre for Infection Research, Braunschweig, Germany
- Andreas Bremges
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
- Alexander Sczyrba
  - Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Alice Carolyn McHardy
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany; Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
5
Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics 2021; 37:2858-2865. [PMID: 33821954 DOI: 10.1093/bioinformatics/btab217] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/29/2020] [Revised: 02/18/2021] [Accepted: 04/01/2021] [Indexed: 02/02/2023]
Abstract
MOTIVATION: A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet, existing data structures are either unable to associate information to k-mers or are not lightweight enough. RESULTS: We present BLight, a static and exact data structure that associates unique identifiers to k-mers and determines their membership in a set without false positives, and that scales to huge k-mer sets with a low memory cost. This index combines an extremely compact representation with very fast queries. In addition, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also present how BLight can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. AVAILABILITY AND IMPLEMENTATION: We wrote the BLight index as an open source C++ library under the AGPL3 license, available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes with code usage samples.
Affiliation(s)
- Camille Marchet
  - University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
- Mael Kerbiriou
  - University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
- Antoine Limasset
  - University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
6
Lima L, Marchet C, Caboche S, Da Silva C, Istace B, Aury JM, Touzet H, Chikhi R. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Brief Bioinform 2021; 21:1164-1181. [PMID: 31232449 DOI: 10.1093/bib/bbz058] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 12/28/2018] [Revised: 04/05/2019] [Accepted: 04/22/2019] [Indexed: 12/13/2022]
Abstract
MOTIVATION: Nanopore long-read sequencing technology offers promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. RESULTS: In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long-read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. BENCHMARKING SOFTWARE: https://gitlab.com/leoisl/LR_EC_analyser.
Affiliation(s)
- Leandro Lima
  - Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR Villeurbanne, France; EPI ERABLE - Inria Grenoble, Rhône-Alpes, France; Università di Roma 'Tor Vergata', Roma, Italy
- Ségolène Caboche
  - Université de Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, UMR, Center for Infection and Immunity of Lille, Lille, France
- Corinne Da Silva
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Benjamin Istace
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Jean-Marc Aury
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Hélène Touzet
  - CNRS, Université de Lille, CRIStAL UMR, Lille, France
- Rayan Chikhi
  - CNRS, Université de Lille, CRIStAL UMR, Lille, France; Institut Pasteur, C3BI - USR 3756, 25-28 rue du Docteur Roux, Paris, France
7
Morisse P, Marchet C, Limasset A, Lecroq T, Lefebvre A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci Rep 2021; 11:761. [PMID: 33436980 PMCID: PMC7804095 DOI: 10.1038/s41598-020-80757-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/26/2020] [Accepted: 12/22/2020] [Indexed: 11/09/2022]
Abstract
Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies on both multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature for correcting raw assemblies. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
8
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Affiliation(s)
- Camille Marchet
  - Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
- Simon J Puglisi
  - Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
- Paul Medvedev
  - Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Mikaël Salson
  - Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Rayan Chikhi
  - Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
9
Abstract
MOTIVATION: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION: https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
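The idea of grouping k-mers of similar abundances can be pictured with a small sketch. It is a deliberate simplification and not REINDEER's monotig construction: it walks the consecutive k-mers of a single sequence (standing in for a path of the merged graph) and cuts a new group whenever the per-dataset count vector changes, so each vector is stored once per group rather than once per k-mer. Exact equality is assumed in place of "similar", and the function names and toy counts are hypothetical.

```python
def kmers(seq, k):
    """Return the list of consecutive k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def group_by_abundance(unitig, k, counts):
    """Split the consecutive k-mers of unitig into maximal runs sharing one count vector.

    counts maps each k-mer to a tuple of abundances, one value per dataset.
    Returns a list of (sequence_spelled_by_run, count_vector) pairs.
    """
    groups = []
    kms = kmers(unitig, k)
    start = 0
    for i in range(1, len(kms) + 1):
        if i == len(kms) or counts[kms[i]] != counts[kms[start]]:
            # k-mers start..i-1 overlap by k-1 bases, so they spell one substring.
            groups.append((unitig[start:i - 1 + k], counts[kms[start]]))
            start = i
    return groups


def build_lookup(groups, k):
    """Map every k-mer to the index of the group that contains it."""
    lookup = {}
    for gid, (seq, _) in enumerate(groups):
        for km in kmers(seq, k):
            lookup[km] = gid
    return lookup


# Toy counts over two datasets (made-up values).
counts = {"ACG": (5, 0), "CGT": (5, 0), "GTA": (2, 1), "TAC": (2, 1)}
groups = group_by_abundance("ACGTAC", 3, counts)
lookup = build_lookup(groups, 3)
abundance_of_cgt = groups[lookup["CGT"]][1]
```

Here four k-mers collapse into two groups, `("ACGT", (5, 0))` and `("GTAC", (2, 1))`, and a query for `CGT` resolves to its group's vector `(5, 0)`.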
Affiliation(s)
- Camille Marchet
  - CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
- Zamin Iqbal
  - European Bioinformatics Institute, Cambridge CB10 1SD, UK
- Daniel Gautheret
  - CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette 91190, France
- Mikaël Salson
  - CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
- Rayan Chikhi
  - Institut Pasteur, CNRS, C3BI – USR 3756, 75015 Paris, France
10
Marchet C, Morisse P, Lecompte L, Lefebvre A, Lecroq T, Peterlongo P, Limasset A. ELECTOR: evaluator for long reads correction methods. NAR Genom Bioinform 2019; 2:lqz015. [PMID: 33575566 PMCID: PMC7671326 DOI: 10.1093/nargab/lqz015] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/05/2019] [Revised: 09/24/2019] [Accepted: 10/16/2019] [Indexed: 12/19/2022]
Abstract
The error rates of third-generation sequencing data have been capped above 5%, mainly consisting of insertions and deletions. Consequently, an increasing number of diverse long-read correction methods have been proposed. The quality of the correction has a huge impact on downstream processes. Therefore, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools, and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long-read correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that can produce reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state of the art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
Affiliation(s)
- Camille Marchet
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France
- Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
| | - Pierre Morisse
- Normandie Université, UNIROUEN, INSA Rouen, LITIS, 76000 Rouen, France
| | - Lolita Lecompte
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France
| | | | | | | | - Antoine Limasset
- Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
| |
|
11
|
Marchet C, Lecompte L, Silva CD, Cruaud C, Aury JM, Nicolas J, Peterlongo P. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res 2019; 47:e2. [PMID: 30260405 PMCID: PMC6326815 DOI: 10.1093/nar/gky834] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/06/2018] [Accepted: 09/10/2018] [Indexed: 02/07/2023] Open
Abstract
Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented vision of the cellular transcriptome. However the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of the inherent high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution both proposes a new algorithm adapted to clustering of reads by gene and a practical and free access tool that allows to scale the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best approach for transcriptomics long reads. When a reference is available to enable mapping, we show that it stands as an alternative method that predicts complementary clusters.
Affiliation(s)
| | | | - Corinne Da Silva
- Commissariat à l'Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
| | - Corinne Cruaud
- Commissariat à l'Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
| | - Jean-Marc Aury
- Commissariat à l'Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
| | | | | |
|
12
|
Meng A, Marchet C, Corre E, Peterlongo P, Alberti A, Da Silva C, Wincker P, Pelletier E, Probert I, Decelle J, Le Crom S, Not F, Bittner L. A de novo approach to disentangle partner identity and function in holobiont systems. Microbiome 2018; 6:105. [PMID: 29885666 PMCID: PMC5994019 DOI: 10.1186/s40168-018-0481-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/15/2017] [Accepted: 05/13/2018] [Indexed: 05/27/2023]
Abstract
BACKGROUND Study of meta-transcriptomic datasets involving non-model organisms represents bioinformatic challenges. The production of chimeric sequences and our inability to distinguish the taxonomic origins of the sequences produced are inherent and recurrent difficulties in de novo assembly analyses. As the study of holobiont meta-transcriptomes is affected by challenges invoked above, we propose an innovative bioinformatic approach to tackle such difficulties and tested it on marine models as a proof of concept. RESULTS We considered three holobiont models, of which two transcriptomes were previously published and a yet unpublished transcriptome, to analyze and sort their raw reads using Short Read Connector, a k-mer based similarity method. Before assembly, we thus defined four distinct categories for each holobiont meta-transcriptome: host reads, symbiont reads, shared reads, and unassigned reads. Afterwards, we observed that independent de novo assemblies for each category led to a diminution of the number of chimeras compared to classical assembly methods. Moreover, the separation of each partner's transcriptome offered the independent and comparative exploration of their functional diversity in the holobiont. Finally, our strategy allowed to propose new functional annotations for two well-studied holobionts (a Cnidaria-Dinophyta, a Porifera-Bacteria) and a first meta-transcriptome from a planktonic Radiolaria-Dinophyta system forming widespread symbiotic association for which our knowledge is considerably limited. CONCLUSIONS In contrast to classical assembly approaches, our bioinformatic strategy generates less de novo assembled chimera and allows biologists to study separately host and symbiont data from a holobiont mixture. 
The pre-assembly separation of reads using an efficient tool such as Short Read Connector is an effective way to tackle meta-transcriptomic challenges and offers bright perspectives to study holobiont systems composed of either well-studied or poorly characterized symbiotic lineages and ultimately expand our knowledge about these associations.
Affiliation(s)
- Arnaud Meng
- Sorbonne Université, Univ Antilles, CNRS, Evolution Paris Seine - Institut de Biologie Paris Seine (EPS - IBPS), F-75005 Paris, France
| | - Camille Marchet
- Univ Rennes, CNRS, Inria, IRISA - UMR 6074, F-35000 Rennes, France
| | - Erwan Corre
- Sorbonne Universités, CNRS - FR2424, ABiMS, Station biologique de Roscoff, Place Georges Teissier, 29680 Roscoff, France
| | | | - Adriana Alberti
- Institut de biologie François Jacob, GENOSCOPE, 2 rue Gaston Crémieux, 91057 Evry, France
| | - Corinne Da Silva
- Institut de biologie François Jacob, GENOSCOPE, 2 rue Gaston Crémieux, 91057 Evry, France
| | - Patrick Wincker
- Institut de biologie François Jacob, GENOSCOPE, 2 rue Gaston Crémieux, 91057 Evry, France
- UMR8030, CNRS, Evry, France
| | - Eric Pelletier
- Institut de biologie François Jacob, GENOSCOPE, 2 rue Gaston Crémieux, 91057 Evry, France
- UMR8030, CNRS, Evry, France
| | - Ian Probert
- Sorbonne Université, CNRS - FR2424, Roscoff Culture Collection, Station Biologique de Roscoff, Place Georges Teissier, 29682 Roscoff, France
| | - Johan Decelle
- Helmholtz Centre for Environmental Research – UFZ, Department of Isotope Biogeochemistry, Permoserstraße 15, 04318 Leipzig, Germany
| | - Stéphane Le Crom
- Sorbonne Université, Univ Antilles, CNRS, Evolution Paris Seine - Institut de Biologie Paris Seine (EPS - IBPS), F-75005 Paris, France
| | - Fabrice Not
- Sorbonne Université, CNRS - UMR7144 - Ecology of Marine Plankton Group, Station Biologique de Roscoff, Place Georges Teissier, 29680 Roscoff, France
| | - Lucie Bittner
- Sorbonne Université, Univ Antilles, CNRS, Evolution Paris Seine - Institut de Biologie Paris Seine (EPS - IBPS), F-75005 Paris, France
| |
|
13
|
Benoit-Pilven C, Marchet C, Chautard E, Lima L, Lambert MP, Sacomoto G, Rey A, Cologne A, Terrone S, Dulaurier L, Claude JB, Bourgeois CF, Auboeuf D, Lacroix V. Complementarity of assembly-first and mapping-first approaches for alternative splicing annotation and differential analysis from RNAseq data. Sci Rep 2018; 8:4307. [PMID: 29523794 PMCID: PMC5844962 DOI: 10.1038/s41598-018-21770-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 09/15/2017] [Accepted: 01/30/2018] [Indexed: 11/08/2022] Open
Abstract
Genome-wide analyses estimate that more than 90% of multi-exonic human genes produce at least two transcripts through alternative splicing (AS). Various bioinformatics methods are available to analyze AS from RNAseq data. Most methods start by mapping the reads to an annotated reference genome, but some start by a de novo assembly of the reads. In this paper, we present a systematic comparison of a mapping-first approach (FARLINE) and an assembly-first approach (KISSPLICE). We applied these methods to two independent RNAseq datasets and found that the predictions of the two pipelines overlapped (70% of exon skipping events were common), but with noticeable differences. The assembly-first approach allowed to find more novel variants, including novel unannotated exons and splice sites. It also predicted AS in recently duplicated genes. The mapping-first approach allowed to find more lowly expressed splicing variants, and splice variants overlapping repeats. This work demonstrates that annotating AS with a single approach leads to missing out a large number of candidates, many of which are differentially regulated across conditions and can be validated experimentally. We therefore advocate for the combined use of both mapping-first and assembly-first approaches for the annotation and differential analysis of AS from RNAseq datasets.
Affiliation(s)
- Clara Benoit-Pilven
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Camille Marchet
- IRISA Inria Rennes Bretagne Atlantique CNRS UMR 6074, Université Rennes 1, GenScale team, Rennes, 263 Avenue Général Leclerc, Rennes, France
| | - Emilie Chautard
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, EPI ERABLE - Inria Grenoble, Rhône-Alpes, France
| | - Leandro Lima
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, EPI ERABLE - Inria Grenoble, Rhône-Alpes, France
| | - Marie-Pierre Lambert
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Gustavo Sacomoto
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, EPI ERABLE - Inria Grenoble, Rhône-Alpes, France
| | - Amandine Rey
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Audric Cologne
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, EPI ERABLE - Inria Grenoble, Rhône-Alpes, France
| | - Sophie Terrone
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Louis Dulaurier
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Jean-Baptiste Claude
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Cyril F Bourgeois
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France
| | - Didier Auboeuf
- Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France.
| | - Vincent Lacroix
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622, Villeurbanne, EPI ERABLE - Inria Grenoble, Rhône-Alpes, France.
| |
|
14
|
Lima L, Sinaimeri B, Sacomoto G, Lopez-Maestre H, Marchet C, Miele V, Sagot MF, Lacroix V. Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads. Algorithms Mol Biol 2017; 12:2. [PMID: 28250805 PMCID: PMC5322684 DOI: 10.1186/s13015-017-0091-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 07/27/2016] [Accepted: 01/27/2017] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. RESULTS The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and when assemblers try to traverse these regions, they can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. 
The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and not read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.
Affiliation(s)
- Leandro Lima
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Blerina Sinaimeri
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Gustavo Sacomoto
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Helene Lopez-Maestre
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Camille Marchet
- IRISA Inria Rennes Bretagne Atlantique; GenScale Team, Université Rennes 1, 263, Avenue Général Leclerc, 35042 Rennes, France
| | - Vincent Miele
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Marie-France Sagot
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Vincent Lacroix
- Inria Grenoble, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| |
|
15
|
Lopez-Maestre H, Brinza L, Marchet C, Kielbassa J, Bastien S, Boutigny M, Monnin D, Filali AE, Carareto CM, Vieira C, Picard F, Kremer N, Vavre F, Sagot MF, Lacroix V. SNP calling from RNA-seq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence. Nucleic Acids Res 2016; 44:e148. [PMID: 27458203 PMCID: PMC5100560 DOI: 10.1093/nar/gkw655] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 12/22/2015] [Accepted: 07/11/2016] [Indexed: 11/14/2022] Open
Abstract
SNPs (Single Nucleotide Polymorphisms) are genetic markers whose precise identification is a prerequisite for association studies. Methods to identify them are currently well developed for model species, but rely on the availability of a (good) reference genome, and therefore cannot be applied to non-model species. They are also mostly tailored for whole genome (re-)sequencing experiments, whereas in many cases, transcriptome sequencing can be used as a cheaper alternative which already enables to identify SNPs located in transcribed regions. In this paper, we propose a method that identifies, quantifies and annotates SNPs without any reference genome, using RNA-seq data only. Individuals can be pooled prior to sequencing, if not enough material is available from one individual. Using pooled human RNA-seq data, we clarify the precision and recall of our method and discuss them with respect to other methods which use a reference genome or an assembled transcriptome. We then validate experimentally the predictions of our method using RNA-seq data from two non-model species. The method can be used for any species to annotate SNPs and predict their impact on the protein sequence. We further enable to test for the association of the identified SNPs with a phenotype of interest.
Affiliation(s)
- Hélène Lopez-Maestre
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - Lilia Brinza
- PT Génomique et Transcriptomique, BIOASTER, Lyon, France
| | - Camille Marchet
- Université de Rennes, F-35000 Rennes; équipe GenScale, IRISA, Rennes
| | - Janice Kielbassa
- Synergie-Lyon-Cancer, Universite Lyon 1, Centre Leon Berard, Lyon, France
| | - Sylvère Bastien
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - Mathilde Boutigny
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - David Monnin
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
| | - Adil El Filali
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
| | - Claudia Marcia Carareto
- Department of Biology, UNESP - São Paulo State University, São José do Rio Preto, São Paulo, Brazil
| | - Cristina Vieira
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - Franck Picard
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
| | - Natacha Kremer
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
| | - Fabrice Vavre
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - Marie-France Sagot
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| | - Vincent Lacroix
- Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France
- EPI ERABLE - Inria Grenoble, Rhône-Alpes
| |
|
16
|
Le Bras Y, Collin O, Monjeaud C, Lacroix V, Rivals É, Lemaitre C, Miele V, Sacomoto G, Marchet C, Cazaux B, Zine El Aabidine A, Salmela L, Alves-Carvalho S, Andrieux A, Uricaru R, Peterlongo P. Colib'read on galaxy: a tools suite dedicated to biological information extraction from raw NGS reads. Gigascience 2016; 5:9. [PMID: 26870323 PMCID: PMC4750246 DOI: 10.1186/s13742-015-0105-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/25/2014] [Accepted: 12/07/2015] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. FINDINGS Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. CONCLUSIONS With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.
Affiliation(s)
- Yvan Le Bras
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
| | - Olivier Collin
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
| | - Cyril Monjeaud
- GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex France
| | - Vincent Lacroix
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
| | - Éric Rivals
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
| | - Claire Lemaitre
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
| | - Vincent Miele
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
| | - Gustavo Sacomoto
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
| | - Camille Marchet
- BAMBOO team, INRIA Grenoble Rhône-Alpes & Laboratoire Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Claude Bernard (Lyon 1), Campus de la Doua, 43 Boulevard du 11 Novembre 1918, Villeurbanne Cedex, 69622 France
| | - Bastien Cazaux
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
| | - Amal Zine El Aabidine
- MAB team, UMR5506 CNRS, Université Montpellier II, Sciences et techniques, Université Montpellier 2 LIRMM UMR 5506 CC477 161 rue Ada, Montpellier, 34095 Cedex 5 France
| | - Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, FI-00014 Finland
| | - Susete Alves-Carvalho
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
| | - Alexan Andrieux
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
| | - Raluca Uricaru
- University of Bordeaux, LaBRI/CNRS, Talence, F-33405 France
- University of Bordeaux, CBiB, Bordeaux, F-33000 France
| | - Pierre Peterlongo
- INRIA/IRISA, Genscale team, UMR6074 IRISA CNRS/INRIA/Université de Rennes 1, Campus de Beaulieu, Rennes, 35042 Cedex France
| |
|