1
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper reinforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
2
Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R, Turner K, Pevzner PA. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. Microbiome 2021; 9:149. [PMID: 34183047 PMCID: PMC8240309 DOI: 10.1186/s40168-021-01092-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/26/2021] [Accepted: 05/11/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacterium Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered across multiple contigs. As a result, existing gene prediction tools (which analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. METHODS Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. RESULTS We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of greater diversity of IPGs from (meta)genomes. CONCLUSIONS We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify, for phenotypic testing, even more novel IPGs that were previously inaccessible to traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.
Affiliation(s)
- Tatiana Dvorkina
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Anton Bankevich
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA USA
- Alexei Sorokin
  - Université Paris-Saclay, INRAE, Micalis Institute, AgroParisTech, 78350 Jouy-en-Josas, France
- Fan Yang
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
  - Ascus Biosciences, San Diego, CA USA
- Boahemaa Adu-Oppong
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
  - Thermo Fisher Scientific, Carlsbad, CA USA
- Ryan Williams
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
- Keith Turner
  - Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
- Pavel A. Pevzner
  - Department of Computer Science and Engineering, University of California San Diego, San Diego, CA USA
3
Fu S, Chang PL, Friesen ML, Teakle NL, Tarone AM, Sze SH. Identifying similar transcripts in a related organism from de Bruijn graphs of RNA-Seq data, with applications to the study of salt and waterlogging tolerance in Melilotus. BMC Genomics 2019; 20:425. [PMID: 31167652 PMCID: PMC6551239 DOI: 10.1186/s12864-019-5702-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
Background: A popular strategy to study alternative splicing in non-model organisms starts from sequencing the entire transcriptome, then assembling the reads by using de novo transcriptome assembly algorithms to obtain predicted transcripts. A similarity search algorithm is then applied to a related organism to infer possible function of these predicted transcripts. While some of these predictions may be inaccurate and transcripts with low coverage are often missed, we observe that it is possible to obtain a more complete set of transcripts to facilitate possible functional assignments by starting the search from the intermediate de Bruijn graph that contains all branching possibilities. Results: We develop an algorithm to extract similar transcripts in a related organism by starting the search from the de Bruijn graph that represents the transcriptome instead of from predicted transcripts. We show that our algorithm is able to recover more similar transcripts than existing algorithms, with large improvements in obtaining longer transcripts and a finer resolution of isoforms. We apply our algorithm to study salt and waterlogging tolerance in two Melilotus species by constructing new RNA-Seq libraries. Conclusions: We have developed an algorithm to identify paths in the de Bruijn graph that correspond to similar transcripts in a related organism directly. Our strategy bypasses the transcript prediction step in RNA-Seq data and makes use of support from evolutionary information.
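To illustrate why the intermediate de Bruijn graph keeps alternatives that a final set of predicted transcripts may drop, here is a minimal Python sketch that builds a toy graph and lists its branching nodes. The function names, the reads, and the choice of k are hypothetical and are not taken from the authors' implementation.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Return nodes with more than one successor, i.e. where paths diverge."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)

# Two toy sequences that share a prefix and then diverge (hypothetical data).
reads = ["ACGTTGCA", "ACGTTCCA"]
graph = build_de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

In this toy case the only branching node is the shared 3-mer just before the two sequences diverge. A search that starts from the graph can follow either branch, whereas a search over already-predicted transcripts sees only the branches the assembler chose to report.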
Affiliation(s)
- Shuhua Fu
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, 77843, TX, USA
- Peter L Chang
  - Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, 90089, CA, USA
- Maren L Friesen
  - Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, 90089, CA, USA
  - Department of Crop and Soil Sciences, Washington State University, Pullman, 99164, WA, USA
  - Department of Plant Pathology, Washington State University, Pullman, 99164, WA, USA
- Natasha L Teakle
  - Centre for Ecohydrology, The University of Western Australia, 35 Stirling Highway, Crawley, 6009, WA, Australia
  - School of Plant Biology (M084), Faculty of Natural and Agricultural Sciences, The University of Western Australia, 35 Stirling Highway, Crawley, 6009, WA, Australia
- Aaron M Tarone
  - Department of Entomology, Texas A&M University, College Station, 77843, TX, USA
- Sing-Hoi Sze
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, 77843, TX, USA
  - Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, TX, USA
4
Abstract
Metagenomic data, which contain sequenced DNA reads of uncultured microbial species from environmental samples, provide a unique opportunity to thoroughly analyze microbial species that have never been identified before. Reconstructing 16S ribosomal RNA, a phylogenetic marker gene, is usually required to analyze the composition of the metagenomic data. However, the massive volume of data, high sequence similarity between related species, skewed microbial abundance and lack of reference genes make 16S rRNA reconstruction difficult. Generic de novo assembly tools are not optimized for assembling 16S rRNA genes. In this work, we introduce a targeted rRNA assembly tool, REAGO (REconstruct 16S ribosomal RNA Genes from metagenOmic data). It addresses the above challenges by combining secondary structure-aware homology search, properties of rRNA genes and de novo assembly. Our experimental results show that our tool can correctly recover more rRNA genes than several popular generic metagenomic assembly tools and specially designed rRNA construction tools. AVAILABILITY AND IMPLEMENTATION The source code of REAGO is freely available at https://github.com/chengyuan/reago.
Affiliation(s)
- Cheng Yuan
  - Computer Science and Engineering, Michigan State University, 428 South Shaw Rd, East Lansing, MI 48824, USA
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824, USA
- Jikai Lei
  - Computer Science and Engineering, Michigan State University, 428 South Shaw Rd, East Lansing, MI 48824, USA
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824, USA
- James Cole
  - Computer Science and Engineering, Michigan State University, 428 South Shaw Rd, East Lansing, MI 48824, USA
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824, USA
- Yanni Sun
  - Computer Science and Engineering, Michigan State University, 428 South Shaw Rd, East Lansing, MI 48824, USA
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824, USA
5
Identification and Resolution of Microdiversity through Metagenomic Sequencing of Parallel Consortia. Appl Environ Microbiol 2015; 82:255-67. [PMID: 26497460 DOI: 10.1128/aem.02274-15] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/17/2015] [Accepted: 10/16/2015] [Indexed: 01/02/2023] Open
Abstract
To gain a predictive understanding of the interspecies interactions within microbial communities that govern community function, the genomic complement of every member population must be determined. Although metagenomic sequencing has enabled the de novo reconstruction of some microbial genomes from environmental communities, microdiversity confounds current genome reconstruction techniques. To overcome this issue, we performed short-read metagenomic sequencing on parallel consortia, defined as consortia cultivated under the same conditions from the same natural community with overlapping species composition. The differences in species abundance between the two consortia allowed reconstruction of near-complete (at an estimated >85% of gene complement) genome sequences for 17 of the 20 detected member species. Two Halomonas spp. indistinguishable by amplicon analysis were found to be present within the community. In addition, comparison of metagenomic reads against the consensus scaffolds revealed within-species variation for one of the Halomonas populations, one of the Rhodobacteraceae populations, and the Rhizobiales population. Genomic comparison of these representative instances of inter- and intraspecies microdiversity suggests differences in functional potential that may result in the expression of distinct roles in the community. In addition, isolation and complete genome sequence determination of six member species allowed an investigation into the sensitivity and specificity of genome reconstruction processes, demonstrating robustness across a wide range of sequence coverage (9× to 2,700×) within the metagenomic data set.
6
Ye Y, Tang H. Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 2015; 32:1001-8. [PMID: 26319390 PMCID: PMC4896364 DOI: 10.1093/bioinformatics/btv510] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 03/10/2015] [Accepted: 08/24/2015] [Indexed: 11/26/2022] Open
Abstract
Motivation: Metagenomics research has accelerated the studies of microbial organisms, providing insights into the composition and potential functionality of various microbial communities. Metatranscriptomics (studies of the transcripts from a mixture of microbial species) and other meta-omics approaches hold even greater promise for providing additional insights into functional and regulatory characteristics of the microbial communities. Current metatranscriptomics projects are often carried out without matched metagenomic datasets (of the same microbial communities). For the projects that produce both metatranscriptomic and metagenomic datasets, their analyses are often not integrated. Metagenome assemblies are far from perfect, which partially explains why they are not used for the analysis of metatranscriptomic datasets. Results: Here, we report a read-mapping algorithm for mapping short reads onto a de Bruijn graph of assemblies. A hash table of junction k-mers (k-mers spanning branching structures in the de Bruijn graph) is used to facilitate fast mapping of reads to the graph. We developed an application of this mapping algorithm: a reference-based approach to metatranscriptome assembly using graphs of metagenome assembly as the reference. Our results show that this new approach (called TAG) helps to assemble substantially more transcripts that otherwise would have been missed or truncated because of the fragmented nature of the reference metagenome. Availability and implementation: TAG was implemented in C++ and has been tested extensively on the Linux platform. It is available for download as open source at http://omics.informatics.indiana.edu/TAG. Contact: yye@indiana.edu
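The idea of a hash table of junction k-mers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not TAG's C++ implementation: contigs are plain strings, linked contigs share a fixed `overlap`, and a junction k-mer is taken to be any k-mer of the joined sequence that is contained in neither contig alone. All names and data are hypothetical.

```python
def junction_kmers(contigs, links, overlap, k):
    """Index k-mers that straddle the join between two linked contigs."""
    table = {}
    for src, dst in links:
        a, b = contigs[src], contigs[dst]
        merged = a + b[overlap:]  # join the contigs on their shared overlap
        # A k-mer starting at p lies fully in `a` if p + k <= len(a) and
        # fully in `b` if p >= len(a) - overlap; keep only those in between.
        start = max(0, len(a) - k + 1)
        stop = min(len(a) - overlap, len(merged) - k + 1)
        for p in range(start, stop):
            table.setdefault(merged[p:p + k], []).append((src, dst))
    return table

def junctions_hit(read, table, k):
    """Return the contig links whose junction k-mers occur in the read."""
    hits = []
    for i in range(len(read) - k + 1):
        for link in table.get(read[i:i + k], []):
            if link not in hits:
                hits.append(link)
    return hits

# Contig A branches into B or C (toy data).
contigs = {"A": "ACGTAC", "B": "ACGGT", "C": "ACTTA"}
links = [("A", "B"), ("A", "C")]
table = junction_kmers(contigs, links, overlap=2, k=4)
print(junctions_hit("GTACTT", table, k=4))
```

A read that contains a junction k-mer immediately identifies which branch it follows, so only reads without such a hit need a slower lookup against individual contigs.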
Affiliation(s)
- Yuzhen Ye
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
- Haixu Tang
  - School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
7
Sim M, Kim J. Metagenome assembly through clustering of next-generation sequencing data using protein sequences. J Microbiol Methods 2015; 109:180-7. [PMID: 25572018 DOI: 10.1016/j.mimet.2015.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2014] [Revised: 01/03/2015] [Accepted: 01/03/2015] [Indexed: 11/16/2022]
Abstract
The study of environmental microbial communities, called metagenomics, has gained a lot of attention because of the recent advances in next-generation sequencing (NGS) technologies. Microbes play a critical role in changing their environments, and the mode of their effect can be elucidated by investigating metagenomes. However, the complexity of metagenomes, such as the mixture of multiple microbes and differing species abundances, makes metagenome assembly challenging. In this paper, we developed a new metagenome assembly method by utilizing protein sequences, in addition to the NGS read sequences. Our method (i) builds read clusters by using mapping information against available protein sequences, and (ii) creates contig sequences by finding consensus sequences through probabilistic choices from the read clusters. By using simulated NGS read sequences from real microbial genome sequences, we evaluated our method in comparison with four existing assembly programs. We found that our method could generate relatively long and accurate metagenome assemblies, indicating that the idea of using protein sequences as a guide for the assembly is promising.
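The two steps in this abstract, clustering reads by the protein they map to and then deriving a consensus per cluster, can be sketched roughly as follows. The sketch is an assumption-laden simplification: mapping records are taken as given `(read, protein_id, offset)` tuples, and a per-column majority vote stands in for the paper's probabilistic choice, whose details the abstract does not give. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_reads(mappings):
    """Group (read, protein_id, offset) records by the protein they map to."""
    clusters = defaultdict(list)
    for read, protein_id, offset in mappings:
        clusters[protein_id].append((offset, read))
    return clusters

def consensus(placed_reads):
    """Per-column majority vote over reads placed at nucleotide offsets.

    Columns not covered by any read are skipped, so this toy version
    assumes the reads tile the region without gaps.
    """
    columns = defaultdict(Counter)
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))

# Three toy reads mapped to the same protein; the third carries one error.
mappings = [
    ("ACGTAC", "protX", 0),
    ("GTACGG", "protX", 2),
    ("GTTCGG", "protX", 2),
]
clusters = cluster_reads(mappings)
print(consensus(clusters["protX"]))
```

Replacing the majority vote with sampling in proportion to the column counts would move the sketch closer to a probabilistic choice, at the cost of a non-deterministic result.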
Affiliation(s)
- Mikang Sim
  - Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
- Jaebum Kim
  - Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
8
Zhang Y, Sun Y, Cole JR. A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data. PLoS Comput Biol 2014; 10:e1003737. [PMID: 25122209 PMCID: PMC4133164 DOI: 10.1371/journal.pcbi.1003737] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 10/28/2013] [Accepted: 06/05/2014] [Indexed: 11/21/2022] Open
Abstract
Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have a high memory footprint. In this work, we introduce a targeted gene assembly program SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.
Author Summary: Next-generation sequencing (NGS) provides an efficient and affordable way to sequence the genomes or transcriptomes of a large number of organisms. With fast accumulation of the sequencing data from various NGS projects, the bottleneck is to efficiently mine useful knowledge from the data. As NGS platforms usually generate short and fragmented sequences (reads), one key step to annotate NGS data is to assemble short reads into longer contigs, which are then used to recover functional elements such as protein-coding genes. Short read assembly remains one of the most difficult computational problems in genomics. In particular, the performance of existing assembly tools is not satisfactory on complicated NGS data sets. They cannot reliably separate genes of high similarity or recover under-represented genes, and they incur high computational time and memory usage. Hence, we propose a targeted gene assembly tool, SAT-Assembler, to assemble genes of interest directly from NGS data with low memory usage and high accuracy. Our experimental results on a transcriptomic data set and two microbial community data sets showed that SAT-Assembler used less memory and recovered more target genes with better accuracy than existing tools.
Affiliation(s)
- Yuan Zhang
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Yanni Sun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- James R. Cole
  - Center for Microbial Ecology, Michigan State University, East Lansing, Michigan, United States of America
9
Chakraborty S. A fragmented alignment method detects a putative phosphorylation site and a putative BRC repeat in the Drosophila melanogaster BRCA2 protein. F1000Res 2013; 2:143. [PMID: 24627786 PMCID: PMC3924952 DOI: 10.12688/f1000research.2-143.v2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/07/2013] [Indexed: 11/28/2022] Open
Abstract
Mutations in the BRCA2 tumor suppressor protein leave individuals susceptible to breast, ovarian and other cancers. The BRCA2 protein is a critical component of the DNA repair pathways in eukaryotes, and also plays an integral role in fostering genomic variability through meiotic recombination. Although present in many eukaryotes, as a whole the BRCA2 gene is weakly conserved. Conserved fragments of 30 amino acids (BRC repeats), which mediate interactions with the recombinase RAD51, helped detect orthologs of this protein in other organisms. The carboxy-terminal of the human BRCA2 has been shown to be phosphorylated by checkpoint kinases (Chk1/Chk2) at T3387, which regulate the sequestration of RAD51 on DNA damage. However, apart from three BRC repeats, the Drosophila melanogaster gene has not been annotated and associated with other functionally relevant sequence fragments in human BRCA2. In the current work, the carboxy-terminal phosphorylation threonine site (E=9.1e-4) and a new BRC repeat (E=17e-4) in D. melanogaster have been identified using a fragmented alignment methodology (FRAGAL). In a similar study, FRAGAL has also identified a novel half-a-tetratricopeptide (HAT) motif (E=11e-4), a helical repeat motif implicated in various aspects of RNA metabolism, in Utp6 from yeast. The characteristic three aromatic residues with conserved spacing are observed in this new HAT repeat, further strengthening my claim. The reference and target sequences are sliced into overlapping fragments of equal parameterized lengths. All pairs of fragments in the reference and target proteins are aligned, and the gap penalties are adjusted to discourage gaps in the middle of the alignment. The results of the best matches are sorted based on differing criteria to aid the detection of known and putative sequences. The source code for FRAGAL results on these sequences is available at https://github.com/sanchak/FragalCode, while the database can be accessed at www.sanchak.com/fragal.html.
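The slicing-and-pairing step described above can be illustrated with a short Python sketch. It is not the FRAGAL code: an ungapped identity count stands in for the gap-penalized alignment, and the function names, sequences, and fragment length are hypothetical.

```python
def fragments(seq, length, step=1):
    """Slice a sequence into overlapping fragments of equal length."""
    return [(i, seq[i:i + length]) for i in range(0, len(seq) - length + 1, step)]

def identity(a, b):
    """Count matching positions between two equal-length fragments."""
    return sum(x == y for x, y in zip(a, b))

def best_fragment_pairs(reference, target, length, top=2):
    """Score every reference/target fragment pair and return the best ones."""
    scored = [
        (identity(rf, tf), ri, ti)
        for ri, rf in fragments(reference, length)
        for ti, tf in fragments(target, length)
    ]
    # Highest score first; ties are broken by fragment start positions.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scored[:top]

# Toy protein sequences sharing a short conserved stretch.
reference = "MKTAYIAKQR"
target = "GGTAYIAKGG"
print(best_fragment_pairs(reference, target, length=5))
```

Each result is a `(score, reference_start, target_start)` tuple. Comparing all fragment pairs is quadratic in the number of fragments, which is acceptable for single proteins but would need pruning for database-scale searches.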
Affiliation(s)
- Sandeep Chakraborty
  - Department of Biological Sciences, Tata Institute of Fundamental Research, Mumbai, 400 005, India
10
Howison M, Zapata F, Dunn CW. Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics 2013; 29:2959-63. [PMID: 24021385 DOI: 10.1093/bioinformatics/btt525] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Draft de novo genome assemblies are now available for many organisms. These assemblies are point estimates of the true genome sequences. Each is a specific hypothesis, drawn from among many alternative hypotheses, of the sequence of a genome. Assembly uncertainty, the inability to distinguish between multiple alternative assembly hypotheses, can be due to real variation between copies of the genome in the sample, errors and ambiguities in the sequenced data and assumptions and heuristics of the assemblers. Most assemblers select a single assembly according to ad hoc criteria, and do not yet report and quantify the uncertainty of their outputs. Those assemblers that do report uncertainty take different approaches to describing multiple assembly hypotheses and the support for each. RESULTS Here we review and examine the problem of representing and measuring uncertainty in assemblies. A promising recent development is the implementation of assemblers that are built according to explicit statistical models. Some new assembly methods, for example, estimate and maximize assembly likelihood. These advances, combined with technical advances in the representation of alternative assembly hypotheses, will lead to a more complete and biologically relevant understanding of assembly uncertainty. This will in turn facilitate the interpretation of downstream analyses and tests of specific biological hypotheses.
Affiliation(s)
- Mark Howison
  - Center for Computation and Visualization and Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912, USA