1
Donaire L, Aranda MA. Computational Pipeline for the Detection of Plant RNA Viruses Using High-Throughput Sequencing. Methods Mol Biol 2024; 2724:1-20. [PMID: 37987894 DOI: 10.1007/978-1-0716-3485-1_1]
Abstract
In this chapter, we describe a computational pipeline for the in silico detection of plant viruses by high-throughput sequencing (HTS) of total RNA samples. The pipeline is designed for the analysis of short reads generated on an Illumina platform and relies on freely available software tools. First, we provide advice on high-quality total RNA purification, library preparation, and sequencing. The bioinformatics pipeline starts from the raw reads produced by the sequencing machine and performs several curation steps to obtain long contigs. Contigs are then compared by BLAST against a local database of reference viral nucleotide sequences to identify the viruses present in the samples, and the search is refined by applying specific filters. We also provide code to re-map the short reads against the viruses found, yielding sequencing depth and read coverage for each virus. No previous bioinformatics background is required, but basic knowledge of the Unix command line and the R language is recommended.
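The re-mapping step lends itself to a small illustration. The sketch below is not the chapter's code (which wraps standard aligners and utilities); it is a hypothetical stand-in that computes the two quantities the pipeline reports, per-virus sequencing depth and read coverage (breadth), from read alignment intervals:

```python
def coverage_stats(mappings, ref_lengths):
    """Per-reference depth and coverage breadth from read alignments given
    as (ref_id, start, end) intervals, 0-based half-open -- a stand-in for
    parsing the output of a re-mapping tool such as `samtools depth`."""
    depth = {ref: [0] * length for ref, length in ref_lengths.items()}
    for ref, start, end in mappings:
        for pos in range(start, min(end, ref_lengths[ref])):
            depth[ref][pos] += 1
    stats = {}
    for ref, track in depth.items():
        covered = sum(1 for d in track if d > 0)
        stats[ref] = {
            "breadth": covered / len(track),      # fraction of genome covered
            "mean_depth": sum(track) / len(track) # average depth over genome
        }
    return stats

# Toy example: one 100-nt viral reference covered by three 40-nt reads.
stats = coverage_stats(
    [("virus_X", 0, 40), ("virus_X", 20, 60), ("virus_X", 50, 90)],
    {"virus_X": 100},
)
```

With the three overlapping reads above, positions 0-89 are covered, so breadth is 0.9 and mean depth is 1.2.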
Affiliation(s)
- Livia Donaire
- Abiopep S.L., Parque Científico de Murcia, Complejo de Espinardo, Murcia, Spain.
- Department of Stress Biology and Plant Pathology, Centro de Edafología y Biología Aplicada del Segura (CEBAS)-CSIC, Murcia, Spain.
- Miguel A Aranda
- Department of Stress Biology and Plant Pathology, Centro de Edafología y Biología Aplicada del Segura (CEBAS)-CSIC, Murcia, Spain
2
Abstract
The detection and quantification of transposable elements (TEs) are notoriously challenging despite their relevance in evolutionary genomics and molecular ecology. The main hurdle is that numerous tools depend on genome assemblies, whose level of completion directly affects the comparability of results across species or populations. dnaPipeTE, whose use is demonstrated here, tackles this issue by performing TE detection, classification, and quantification directly from unassembled short reads. This chapter details all the steps required for a comparative analysis of TE content between two related species, from the installation of a recently containerized version of the program to the post-processing of its outputs.
Affiliation(s)
- Clément Goubert
- Canadian Centre for Computational Genomics, McGill University, Montreal, QC, Canada.
- McGill Genome Centre, Montreal, QC, Canada.
- Human Genetics, McGill University, Montreal, QC, Canada.
3
McKerrow W. Quantification of LINE-1 RNA Expression from Bulk RNA-seq Using L1EM. Methods Mol Biol 2023; 2607:115-126. [PMID: 36449161 DOI: 10.1007/978-1-0716-2883-6_7]
Abstract
LINE-1 retrotransposons have the potential to cause DNA damage, contribute to genome instability, and induce an interferon response. Thus, accurate measurements of their expression, especially in disease contexts where genome instability and the interferon response are relevant, are of particular importance. Illumina-based bulk RNA sequencing remains the most abundant datatype for measuring gene expression. However, "active" expression from its own internal promoter is only one source of LINE-1 aligning reads in an RNA-seq experiment. With about half a million LINE-1 sequences scattered throughout the genome, many are incorporated into other transcripts that have nothing to do with LINE-1 activity. We call this "passive" co-transcription. Here we will describe how to use L1EM, a computational method that separates active from passive LINE-1 expression at the locus-specific level.
Affiliation(s)
- Wilson McKerrow
- Institute for Systems Genetics, NYU Langone Health, New York, NY, USA.
4
Hu Y, Mangal S, Zhang L, Zhou X. Automated filtering of genome-wide large deletions through an ensemble deep learning framework. Methods 2022; 206:77-86. [PMID: 36038049 DOI: 10.1016/j.ymeth.2022.08.001]
Abstract
Computational methods based on whole-genome linked reads and short reads have been successful in genome assembly and in the detection of structural variants (SVs). Numerous variant callers relying on linked reads and short reads can detect genetic variation, including SVs. A shortcoming of existing tools is a propensity to overestimate SVs, especially deletions. Fully exploiting linked-read and short-read sequencing technologies would thus benefit from an additional step that effectively identifies and eliminates false positive large deletions. Here, we introduce a novel tool, AquilaDeepFilter, which automatically filters genome-wide false positive large deletions. Our approach transforms sequencing data into an image and then uses convolutional neural networks to classify candidate deletions as true or false. The input data incorporate multiple alignment signals, including read depth, split reads and discordant read pairs. We tested the performance of AquilaDeepFilter on five linked-read and short-read libraries sequenced from the well-studied NA24385 sample, validated against the Genome in a Bottle benchmark. To demonstrate its filtering ability, we used the SV calls of three upstream SV detection tools, Aquila, Aquila_stLFR and Delly, as the baseline. We showed that AquilaDeepFilter increased precision while preserving the recall of all three tools. The overall F1-score improved by an average of 20% on linked-read data and by an average of 15% on short-read data. AquilaDeepFilter also compared favorably to existing deep learning based methods for SV filtering, such as DeepSVFilter. AquilaDeepFilter is thus an effective SV refinement framework that can improve SV calling for both linked-read and short-read data.
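The image-encoding idea can be sketched compactly. The toy function below is not AquilaDeepFilter's actual preprocessing (its exact encoding is not described here); it simply shows how three per-position alignment signals might be stacked into channels of a small integer "image" for a CNN:

```python
def signals_to_image(depth, split_reads, discordant, max_val=255):
    """Stack three per-position alignment signals as rows (channels) of a
    small integer 'image'; each channel is scaled independently to
    0..max_val so the network sees comparable magnitudes regardless of
    library depth."""
    def scale(track):
        peak = max(track) or 1  # avoid division by zero on an all-zero track
        return [round(v * max_val / peak) for v in track]
    assert len(depth) == len(split_reads) == len(discordant)
    return [scale(depth), scale(split_reads), scale(discordant)]

# A candidate deletion: read depth drops inside the event while split
# reads and discordant pairs pile up around the breakpoints.
img = signals_to_image(
    depth=[30, 28, 2, 1, 3, 29],
    split_reads=[0, 5, 6, 5, 6, 0],
    discordant=[0, 4, 4, 4, 4, 0],
)
```

A true deletion and a false positive produce visibly different "images" under this encoding, which is what makes the classification amenable to a CNN.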
Affiliation(s)
- Yunfei Hu
- Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, 37235 Nashville, USA
- Sanidhya Mangal
- Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, 37235 Nashville, USA
- Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Room R708, Sir Run Run Shaw Building, Kowloon Tong, Hong Kong
- Xin Zhou
- Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, 37235 Nashville, USA; Department of Biomedical Engineering, Vanderbilt University, 2301 Vanderbilt Place, 37235, Nashville, USA; Data Science Institute, Vanderbilt University, Sony Building, 1400 18th Ave S Building, Suite 2000, 37212 Nashville, USA.
5
Abstract
Structural variants (SVs) underlie genomic variation but are often overlooked because they are difficult to detect from short reads. Most algorithms have been tested on humans, and it remains unclear how applicable they are to other organisms. To address this, we develop perSVade (personalized structural variation detection), a sample-tailored pipeline that provides optimally called SVs together with their inferred accuracy, as well as small variants and copy number variants. perSVade increases SV calling accuracy on a benchmark of six eukaryotes. We find no universal set of optimal parameters, underscoring the need for sample-specific parameter optimization. perSVade will facilitate the detection and study of SVs across diverse organisms.
Affiliation(s)
- Miquel Àngel Schikora-Tamarit
- Barcelona Supercomputing Centre (BSC-CNS), Plaça Eusebi Güell, 1-3, 08034, Barcelona, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028, Barcelona, Spain
- Toni Gabaldón
- Barcelona Supercomputing Centre (BSC-CNS), Plaça Eusebi Güell, 1-3, 08034, Barcelona, Spain.
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028, Barcelona, Spain.
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain.
- Centro Investigación Biomédica En Red de Enfermedades Infecciosas, Barcelona, Spain.
6
Nieuwenhuijse DF, van der Linden A, Kohl RHG, Sikkema RS, Koopmans MPG, Oude Munnink BB. Towards reliable whole genome sequencing for outbreak preparedness and response. BMC Genomics 2022; 23:569. [PMID: 35945497 PMCID: PMC9361258 DOI: 10.1186/s12864-022-08749-5]
Abstract
BACKGROUND Genomic epidemiology is increasingly advocated for understanding the dynamics of infectious diseases, creating a need for the rapid generation of genetic sequences during outbreaks to support public health decision making. Here, we explore the use of metagenomic sequencing compared with specific amplicon- and capture-based sequencing, on both the Nanopore and Illumina platforms, for generating whole genomes of Usutu virus, Zika virus, West Nile virus, and Yellow Fever virus. RESULTS We show that amplicon-based Nanopore sequencing can rapidly yield whole genome sequences from samples with a viral load up to Ct 33, and that capture-based Illumina sequencing is the most sensitive method for initial virus determination. CONCLUSIONS The choice of sequencing approach and platform is important for laboratories wishing to start whole genome sequencing, and the best choice can differ depending on the purpose of sequencing. The insights presented in this work and the observed differences in data characteristics can guide laboratories toward a well-informed choice.
Affiliation(s)
- Robert H G Kohl
- Department of Virology of the Vaccination Programme, RIVM, Bilthoven, the Netherlands
- Reina S Sikkema
- Viroscience Department, Erasmus Medical Center, Rotterdam, the Netherlands
- Bas B Oude Munnink
- Viroscience Department, Erasmus Medical Center, Rotterdam, the Netherlands.
7
Nikolić V, Afshinfard A, Chu J, Wong J, Coombe L, Nip KM, Warren RL, Birol I. RResolver: efficient short-read repeat resolution within ABySS. BMC Bioinformatics 2022; 23:246. [PMID: 35729491 PMCID: PMC9215042 DOI: 10.1186/s12859-022-04790-z]
Abstract
BACKGROUND De novo genome assembly is essential to modern genomics studies. Because it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, in which nodes are sequences of equal length k, also known as k-mers. Edges are established between nodes that overlap by k-1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing its value is a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values provide good connectivity in regions of lower coverage and higher values increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. RESULTS Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads, which is used to evaluate assembly graph path support at branching points and to remove paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB of memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. CONCLUSIONS RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms benefit greatly from working with a more accurate and less complex representation of the genome.
The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .
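The Bloom-filter check at branching points can be sketched in a few lines. This is an illustrative toy reimplementation of the idea (tiny filter, toy k), not RResolver's actual code: index the k-mers of the reads in a Bloom filter, then keep only candidate paths whose spanning k-mers were all observed in the reads.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter with hash functions derived from SHA-256."""
    def __init__(self, size=8192, n_hashes=3):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May report rare false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))

def path_supported(path_seq, read_kmers, k):
    """A candidate path through a branching point is supported only if
    every k-mer spanning it was observed in the sequencing reads."""
    return all(path_seq[i:i + k] in read_kmers
               for i in range(len(path_seq) - k + 1))

# Index the k-mers of a single toy read, then test two candidate paths.
read, k = "ACGTACGTGG", 4
bf = BloomFilter()
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])

real_path = path_supported("ACGTACGTGG", bf, k)  # spelled out by the read
fake_path = path_supported("ACGTTTTTGG", bf, k)  # no read spans this walk
```

Because membership queries are approximate, a production version must size the filter so that false positives are rare; here the filter is far larger than the handful of indexed k-mers.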
Affiliation(s)
- Vladimir Nikolić
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada; The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
- Amirhossein Afshinfard
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada; The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
- Justin Chu
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada; The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
- Johnathan Wong
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- Lauren Coombe
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada; The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
- René L. Warren
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- Inanç Birol
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada; The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
8
Lavrichenko K, Johansson S, Jonassen I. Comprehensive characterization of copy number variation (CNV) called from array, long- and short-read data. BMC Genomics 2021; 22:826. [PMID: 34789167 DOI: 10.1186/s12864-021-08082-3]
Abstract
BACKGROUND SNP arrays and short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well known but not all of which are thoroughly quantified. RESULTS We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then performed cross-technology comparisons of their ability to call CNVs. Unlike other studies, we refrained from designating a gold standard; instead, we attempted to validate the CNV calls against the raw data of each technology. CONCLUSIONS Our study confirms that long-read platforms enable calling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV across different pipelines within each technology is strongly linked to other measures of CNV evidence. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on the technology the database was built on.
9
Sarantopoulou D, Brooks TG, Nayak S, Mrčela A, Lahens NF, Grant GR. Comparative evaluation of full-length isoform quantification from RNA-Seq. BMC Bioinformatics 2021; 22:266. [PMID: 34034652 PMCID: PMC8145802 DOI: 10.1186/s12859-021-04198-1]
Abstract
Background Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics and has been an area of active development since the field began. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short. Results Here we use simulated benchmarking data that reflect many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome-, transcriptome- and pseudo-alignment-based methods are included, along with a simple approach as a baseline control. Conclusions Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, but on more realistic data they do not perform dramatically better than the simple approach. We determine that the structural parameters with the greatest impact on quantification accuracy are transcript length and sequence compression complexity, rather than the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform-level differential expression analysis should still be employed selectively. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04198-1.
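The abstract does not specify the "simple approach" used as a baseline. One common minimal strategy, shown below as an assumed stand-in rather than the authors' method, counts uniquely compatible reads fully and splits reads compatible with several isoforms equally among them, with no EM and no length correction:

```python
def baseline_isoform_counts(read_compat, isoforms):
    """Naive quantification: each read contributes a total weight of 1,
    split equally among all isoforms it is compatible with."""
    counts = {iso: 0.0 for iso in isoforms}
    for compat in read_compat:          # compat = set of compatible isoforms
        share = 1.0 / len(compat)
        for iso in compat:
            counts[iso] += share
    return counts

# Four reads over two isoforms; one read is ambiguous between A and B.
counts = baseline_isoform_counts(
    [{"A"}, {"A", "B"}, {"B"}, {"B"}],
    ["A", "B"],
)
```

Methods such as Salmon, kallisto and RSEM replace the equal split with likelihood-weighted assignment estimated by EM, which is where their advantage on idealized data comes from.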
Affiliation(s)
- Dimitra Sarantopoulou
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Soumyashant Nayak
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
10
Delage WJ, Thevenon J, Lemaitre C. Towards a better understanding of the low recall of insertion variants with short-read based variant callers. BMC Genomics 2020; 21:762. [PMID: 33148192 PMCID: PMC7640490 DOI: 10.1186/s12864-020-07125-5]
Abstract
BACKGROUND Since 2009, numerous tools have been developed to detect structural variants from short-read technologies. Insertions >50 bp are among the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools. RESULTS In this work, we performed an in-depth analysis of these unprecedented insertion callsets to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature of the inserted sequence, its size, the genomic context of the insertion site, and the breakpoint junction complexity. Because these layers are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions, and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the most impactful factor was the insertion type rather than the genomic context, with the tested structural variant callers handling the various difficulties differently, and they highlighted the lack of sequence resolution in most insertion calls.
CONCLUSIONS Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.
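Two of the characterization layers, breakpoint junction homology and tandem-repeat expansion, are easy to illustrate. The helpers below are hypothetical simplifications of the classification idea, not the authors' pipeline:

```python
def junction_homology(inserted, right_flank):
    """Length of exact homology between the start of the inserted sequence
    and the right flank of the insertion site; homologies larger than
    10 bp are among the features the study flags as hard for SV callers."""
    n = 0
    for a, b in zip(inserted, right_flank):
        if a != b:
            break
        n += 1
    return n

def is_tandem_expansion(inserted, left_flank):
    """True if the insertion is an exact expansion of a motif that already
    ends the left flank (e.g. extra CAG copies next to a CAG run)."""
    for m in range(1, min(len(inserted), len(left_flank)) + 1):
        unit = left_flank[-m:]
        if len(inserted) % m == 0 and inserted == unit * (len(inserted) // m):
            return True
    return False
```

For example, inserting `CAGCAG` immediately after a flank ending in `CAG` is classified as a tandem repeat expansion, the category with the lowest reported recall (6%).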
Affiliation(s)
- Julien Thevenon
- Inserm U1209, CNRS UMR 5309, Univ. Grenoble Alpes, Institute for Advanced Biosciences, Grenoble, France; Genetics, Genomics and Reproduction Service, Centre Hospitalo-Universitaire Grenoble-Alpes, Grenoble, France
11
Abstract
Background Next-generation sequencing has been used to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared with conventional sequencing, the error rates of next-generation sequencing are often higher, which affects downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach for estimating the error rates of next-generation sequencing data, based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data, so it is necessary to estimate error rates in a more reliable way that does not assume linearity. We propose an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows. Results We performed simulation studies using a frequency-based approach to generate read and shadow counts directly, mimicking the structure of real sequence count data. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimates than shadow linear regression in all scenarios tested. We also applied the proposed approach to assess the error rates of sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples.
Conclusions The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1052-3) contains supplementary material, which is available to authorized users.
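The contrast between the two estimators is simple to state in code. The sketch below compares a global least-squares slope (the linear shadow-regression assumption) with a crude windowed local rate; the latter is a stand-in for the paper's smoothing-spline estimator, not a reimplementation of it:

```python
def shadow_slope(reads, shadows):
    """Least-squares slope through the origin: the expected number of
    shadows per read under the linear shadow-regression assumption."""
    sxy = sum(x * y for x, y in zip(reads, shadows))
    sxx = sum(x * x for x in reads)
    return sxy / sxx

def local_shadow_rate(reads, shadows, x0, bandwidth):
    """Average shadow/read ratio among tags whose read count lies within
    `bandwidth` of x0 -- a crude windowed analogue of a smoothing spline,
    requiring no global linearity assumption."""
    ratios = [s / r for r, s in zip(reads, shadows)
              if r > 0 and abs(r - x0) <= bandwidth]
    return sum(ratios) / len(ratios) if ratios else None

reads, shadows = [10, 20, 30, 40], [1, 2, 3, 4]  # perfectly linear toy data
global_rate = shadow_slope(reads, shadows)
window_rate = local_shadow_rate(reads, shadows, x0=20, bandwidth=10)
```

On perfectly linear data the two agree; when the read-shadow relationship bends, the local estimate tracks the curvature while the global slope averages it away.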
Affiliation(s)
- Xuan Zhu
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Jian Wang
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Bo Peng
- Department of Bioinformatics & Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Sanjay Shete
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA; Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
12
Pan W, Chen B, Xu Y. MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing. Interdiscip Sci 2015; 7:405-13. [PMID: 26293485 DOI: 10.1007/s12539-015-0281-x]
Abstract
Read classification is a fundamental problem in metagenomics. With the development of next-generation sequencing, metagenome samples can be generated with far less money and time. However, the short reads produced by next-generation sequencing make read classification much more difficult than before. None of the existing tools can accurately assign NGS short reads to individual genomes, which limits their use in real applications. Fortunately, in many applications it is unnecessary to separate all the species in a metagenome sample from one another, because we usually focus only on certain specified species and do not care about the others. No existing tool is designed specifically for obtaining specified species from short metagenome reads generated by next-generation sequencing. In this paper, we propose a tool named MetaObtainer to obtain specified species from next-generation sequencing short reads. The tool combines several recent techniques for short-read processing and therefore performs better than other tools. It can (1) handle next-generation sequencing reads shorter than 100 bp with very high accuracy (both precision and recall above 90%); (2) find unknown species using the reference genomes of similar species; (3) perform well when reads of the specified species are very scarce in the dataset; (4) handle genomes at both similar and different abundance levels (1:10); and (5) obtain multiple species categories from a metagenome sample.
13
Abstract
Background Categorizing protein-coding sequences into one family when the proteins they encode perform the same biochemical function, and then tabulating the relative abundances of all families, is a widely adopted practice for the functional profiling of a metagenomic sample. By homology searching of metagenomic sequencing reads against a protein database, the relative abundance of a family can be represented by the number of reads aligned to its members. However, it has been observed that, for short reads generated by next-generation sequencing platforms, some reads may be erroneously assigned to functional families they are not associated with. This common phenomenon is termed cross-annotation. Current methods for the functional profiling of a metagenomic sample either use empirical cutoff values to select alignments, ignoring the cross-annotation problem, or employ a summary equation to make a simple adjustment. Results By introducing latent variables, we use Probabilistic Latent Semantic Analysis to model the proportions of reads assigned to functional families in a metagenomic sample. The approach can be applied to a metagenomic sample once the list of true functional families has been obtained or estimated. We implemented it on metagenomic samples functionally characterized with the database of Clusters of Orthologous Groups of proteins, and it successfully addressed the cross-annotation issue on in vitro-simulated samples, samples simulated with bioinformatics tools, and real-world data. Conclusions Correcting cross-annotation increases the accuracy of the functional profiling of a metagenome generated from short reads and will further benefit differential abundance analysis of metagenomic samples under different conditions.
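The latent-variable correction can be illustrated with a miniature EM. Assumed simplification: here the cross-annotation (confusion) probabilities are taken as known, whereas the paper estimates the latent structure with PLSA; the sketch only shows how observed read counts are redistributed back to true families:

```python
def em_true_proportions(observed_counts, confusion, n_iter=200):
    """EM estimate of true family proportions pi, where
    confusion[i][j] = P(a read from true family i is annotated to family j)
    and observed_counts[j] is the number of reads annotated to family j."""
    k = len(confusion)
    total = sum(observed_counts)
    pi = [1.0 / k] * k                      # uniform starting point
    for _ in range(n_iter):
        new = [0.0] * k
        for j, n_j in enumerate(observed_counts):
            denom = sum(pi[i] * confusion[i][j] for i in range(k))
            if denom == 0.0:
                continue
            for i in range(k):              # E-step: posterior share of family i
                new[i] += n_j * pi[i] * confusion[i][j] / denom
        pi = [v / total for v in new]       # M-step: renormalize
    return pi

# 10% of family-0 reads are mis-annotated to family 1, and 20% leak the
# other way; observed counts [550, 450] are consistent with equal true
# abundances of the two families.
pi = em_true_proportions([550, 450], [[0.9, 0.1], [0.2, 0.8]])
```

Naively reading the observed counts would report a 55/45 split; the corrected estimate recovers the true 50/50 proportions.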
Affiliation(s)
- Ruofei Du
- Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA; Department of Agricultural and Bio-systems Engineering, University of Arizona, Tucson, Arizona, USA
- Donald Mercante
- Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA
- Lingling An
- Department of Agricultural and Bio-systems Engineering, University of Arizona, Tucson, Arizona, USA
- Zhide Fang
- Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA