1
Wu E, Mallawaarachchi V, Zhao J, Yang Y, Liu H, Wang X, Shen C, Lin Y, Qiao L. Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics. Microbiome 2024; 12:58. [PMID: 38504332 PMCID: PMC10949615 DOI: 10.1186/s40168-024-01775-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024]
Abstract
BACKGROUND Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation has not been reported, which, however, is extremely important for accurate metaproteomic analysis. RESULTS Herein, we proposed an accurate taxonomic annotation pipeline for genes from metagenomic data, namely contigs directed gene annotation (ConDiGA), and used the method to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA or MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes were directly annotated against the whole bacterial genome database; in MD2, contigs were annotated against the whole bacterial genome database and the taxonomic information of contigs was assigned to the genes; in MD3, the most confident species from the contigs annotation results were taken as reference to annotate genes. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. Based on a synthetic microbial community of 12 species, it was found that Kaiju with the MD3 pipeline outperformed the others in the construction of protein sequence database from metagenomic data. Similar performance was also observed with a fecal sample, as well as in silico mixed datasets of the simulated microbial community and the fecal sample. CONCLUSIONS Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. 
Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence databases and can promote in-depth metaproteomic analysis of the microbiome. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The ConDiGA code is openly available on GitHub for the analysis of microbiota samples.
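The difference between the MD2 and MD3 strategies described in this abstract can be illustrated with a minimal Python sketch. It is not the ConDiGA implementation: the function names, the input dictionaries, and the `min_contigs` support criterion for "most confident species" are all hypothetical, and filtering gene hits after the fact only approximates annotating genes against a reduced reference database.

```python
from collections import defaultdict


def md2_assign(gene_to_contig, contig_taxon):
    """MD2-style: each gene inherits the taxon of its parent contig."""
    return {gene: contig_taxon.get(contig) for gene, contig in gene_to_contig.items()}


def confident_species(contig_taxon, min_contigs=2):
    """Select species supported by at least `min_contigs` contigs.

    The support criterion is an assumption made for illustration only.
    """
    counts = defaultdict(int)
    for species in contig_taxon.values():
        if species is not None:
            counts[species] += 1
    return {species for species, n in counts.items() if n >= min_contigs}


def md3_assign(gene_hits, reference_species):
    """MD3-style: keep the best-scoring gene hit among the reference species.

    `gene_hits` maps a gene to a list of (species, score) pairs.
    """
    result = {}
    for gene, hits in gene_hits.items():
        allowed = [(score, species) for species, score in hits
                   if species in reference_species]
        result[gene] = max(allowed)[1] if allowed else None
    return result


contig_taxon = {"c1": "A", "c2": "A", "c3": "B"}
refs = confident_species(contig_taxon)
print(md3_assign({"g1": [("B", 90), ("A", 80)], "g2": [("B", 70)]}, refs))
```

In this toy input, species B is supported by a single contig, so it is excluded from the reference set and gene `g1` is assigned to species A even though its raw best hit was B.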
Affiliation(s)
- Enhui Wu
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Vijini Mallawaarachchi
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, SA, 5042, Australia
- Jinzhi Zhao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Yi Yang
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Hebin Liu
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Xiaoqing Wang
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Chengpin Shen
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Yu Lin
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
- Liang Qiao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
2
Taj B, Adeolu M, Xiong X, Ang J, Nursimulu N, Parkinson J. MetaPro: a scalable and reproducible data processing and analysis pipeline for metatranscriptomic investigation of microbial communities. Microbiome 2023; 11:143. [PMID: 37370188 DOI: 10.1186/s40168-023-01562-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2022] [Accepted: 04/28/2023] [Indexed: 06/29/2023]
Abstract
BACKGROUND Whole microbiome RNASeq (metatranscriptomics) has emerged as a powerful technology to functionally interrogate microbial communities. A key challenge is how best to process, analyze, and interpret these complex datasets. In a typical application, a single metatranscriptomic dataset may comprise from tens to hundreds of millions of sequence reads. These reads must first be processed and filtered for low quality and potential contaminants, before being annotated with taxonomic and functional labels and subsequently collated to generate global bacterial gene expression profiles. RESULTS Here, we present MetaPro, a flexible, massively scalable metatranscriptomic data analysis pipeline that is cross-platform compatible through its implementation within a Docker framework. MetaPro starts with raw sequence read input (single-end or paired-end reads) and processes them through a tiered series of filtering, assembly, and annotation steps. In addition to yielding a final list of bacterial genes and their relative expression, MetaPro delivers a taxonomic breakdown based on the consensus of complementary prediction algorithms, together with a focused breakdown of enzymes, readily visualized through the Cytoscape network visualization tool. We benchmark the performance of MetaPro against two current state-of-the-art pipelines and demonstrate improved performance and functionality. CONCLUSIONS MetaPro represents an effective integrated solution for the processing and analysis of metatranscriptomic datasets. Its modular architecture allows new algorithms to be deployed as they are developed, ensuring its longevity. To aid user uptake of the pipeline, MetaPro, together with an established tutorial that has been developed for educational purposes, is made freely available at https://github.com/ParkinsonLab/MetaPro . The software is freely available under the GNU general public license v3.
Affiliation(s)
- Billy Taj
  - Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Mobolaji Adeolu
  - Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Xuejian Xiong
  - Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Jordan Ang
  - Department of Chemical and Physical Sciences, University of Toronto, Mississauga, ON, L5L 1C6, Canada
- Nirvana Nursimulu
  - Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 3G4, Canada
- John Parkinson
  - Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 3G4, Canada
  - Department of Biochemistry, University of Toronto, Toronto, ON, M5S 3G4, Canada
3
Tamames J, Cobo-Simón M, Puente-Sánchez F. Assessing the performance of different approaches for functional and taxonomic annotation of metagenomes. BMC Genomics 2019; 20:960. [PMID: 31823721 DOI: 10.1186/s12864-019-6289-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 01/16/2019] [Accepted: 11/14/2019] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Metagenomes can be analysed using different approaches and tools. One of the most important distinctions is the way taxonomic and functional assignment is performed: either by using assembly algorithms or by directly analysing raw sequence reads through homology searching, k-mer analysis, or detection of marker genes. Many instances of each approach can be found in the literature, but to the best of our knowledge no evaluation of their different performances has been carried out, and we question whether their results are comparable. RESULTS We have analysed several real and mock metagenomes using different methodologies and tools, and compared the resulting taxonomic and functional profiles. Our results show that database completeness (the representation of diverse organisms and taxa in it) is the main factor determining the performance of the methods relying on direct read assignment either by homology, k-mer composition or similarity to marker genes, while methods relying on assembly and assignment of predicted genes are most influenced by metagenomic size, which in turn determines the completeness of the assembly (the percentage of reads that were assembled). CONCLUSIONS Although differences exist, taxonomic profiles are rather similar between raw read assignment and assembly assignment methods, while they are more divergent for methods based on k-mers and marker genes. Regarding functional annotation, analysis of raw reads retrieves more functions, but it also makes a substantial number of over-predictions. Assembly methods are more advantageous as the size of the metagenome grows.
4
Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM. A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ 2018; 6:e5126. [PMID: 29967752 PMCID: PMC6025149 DOI: 10.7717/peerj.5126] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/16/2018] [Accepted: 06/07/2018] [Indexed: 01/01/2023] Open
Abstract
Metabarcoding is a popular application which warrants continued methods optimization. To maximize barcoding inferences, hierarchy-based sequence classification methods are increasingly common. We present methods for the construction and curation of a database designed for hierarchical classification of a 157 bp barcoding region of the arthropod cytochrome c oxidase subunit I (COI) locus. We produced a comprehensive arthropod COI amplicon dataset including annotated arthropod COI sequences and COI sequences extracted from arthropod whole mitochondrion genomes, the latter of which provided the only source of representation for Zoraptera, Callipodida and Holothyrida. The database contains extracted sequences of the target amplicon from all major arthropod clades, including all insect orders, all arthropod classes and Onychophora, Tardigrada and Mollusca outgroups. During curation, we extracted the COI region of interest from approximately 81 percent of the input sequences, corresponding to 73 percent of the genus-level diversity found in the input data. Further, our analysis revealed a high degree of sequence redundancy within the NCBI nucleotide database, with a mean of approximately 11 sequence entries per species in the input data. The curated, low-redundancy database is included in the Metaxa2 sequence classification software (http://microbiology.se/software/metaxa2/). Using this database with the Metaxa2 classifier, we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold which minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database. 
Our work will help researchers design and evaluate classification databases and conduct metabarcoding on arthropods and alternate taxa.
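As a rough illustration of how a reliability score threshold trades sensitivity against false discoveries, the following Python sketch evaluates cross-validation results at candidate thresholds. It does not use Metaxa2 itself: the `(score, correct)` input format, the function names, and the exact definitions of sensitivity and false discovery rate used here are assumptions made for the example.

```python
def evaluate_threshold(results, threshold):
    """Return (sensitivity, false discovery rate) at a score threshold.

    `results` is a list of (reliability_score, is_correct) pairs from a
    hypothetical cross-validation run. Classifications scoring below the
    threshold are treated as unclassified.
    """
    retained = [correct for score, correct in results if score >= threshold]
    true_pos = sum(retained)
    # Sensitivity: correct retained classifications over all test sequences.
    sensitivity = true_pos / len(results) if results else 0.0
    # False discovery rate: incorrect share of the retained classifications.
    fdr = (len(retained) - true_pos) / len(retained) if retained else 0.0
    return sensitivity, fdr


def sweep(results, thresholds):
    """Evaluate several thresholds so the trade-off can be inspected."""
    return [(t, *evaluate_threshold(results, t)) for t in thresholds]


cv_results = [(90, True), (85, True), (70, False), (60, True), (40, False)]
for t, sens, fdr in sweep(cv_results, [50, 65, 80]):
    print(t, round(sens, 2), round(fdr, 2))
```

Raising the threshold lowers the false discovery rate in this toy data but also discards correct classifications, which is why a threshold is usually chosen from the whole curve and not from error alone.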
Affiliation(s)
- Rodney T Richardson
  - Department of Entomology, Ohio State University, Columbus, OH, United States of America
- Johan Bengtsson-Palme
  - Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
  - Center for Antibiotic Resistance Research (CARe), University of Gothenburg, Gothenburg, Sweden
- Mary M Gardiner
  - Department of Entomology, Ohio State University, Columbus, OH, United States of America
- Reed M Johnson
  - Department of Entomology, Ohio State University, Wooster, OH, United States of America