1
Lim PK, Wang R, Mutwil M. LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes. Physiologia Plantarum 2024; 176:e14407. [PMID: 38973613 DOI: 10.1111/ppl.14407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Accepted: 05/28/2024] [Indexed: 07/09/2024]
Abstract
Despite the abundance of species with transcriptomic data, a significant number of species still lack sequenced genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often only feature samples of arbitrarily selected or similar experimental conditions, which might fail to capture condition-specific transcripts. We developed the Large-Scale Transcriptome Assembly Pipeline for de novo assembled transcripts (LSTrAP-denovo) to automatically generate transcriptome atlases of eukaryotic species. Specifically, given an NCBI TaxID, LSTrAP-denovo can (1) filter undesirable RNA-seq accessions based on read data, (2) select RNA-seq accessions via unsupervised machine learning to construct a sample-balanced dataset for download, (3) assemble transcripts via over-assembly, (4) functionally annotate coding sequences (CDS) from assembled transcripts and (5) generate transcriptome atlases in the form of expression matrices for downstream transcriptomic analyses. LSTrAP-denovo is easy to implement, written in Python, and is freely available at https://github.com/pengkenlim/LSTrAP-denovo/.
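As an illustration of step (2) above, the following is a minimal sketch of how a sample-balanced selection could be drawn once accessions have been grouped into clusters. It is not LSTrAP-denovo's actual implementation: the function name `select_balanced`, the accession IDs, and the cluster labels are all hypothetical, and the clustering itself is assumed to have been done beforehand by some unsupervised method.

```python
from collections import defaultdict
from itertools import zip_longest


def select_balanced(cluster_of, total):
    """Pick up to `total` accessions, cycling through clusters so that
    no single cluster dominates the selection.

    `cluster_of` maps an accession ID to its (precomputed) cluster label.
    """
    by_cluster = defaultdict(list)
    for accession, cluster in sorted(cluster_of.items()):
        by_cluster[cluster].append(accession)

    picked = []
    # Take the first member of every cluster, then the second, and so on.
    ordered = (by_cluster[c] for c in sorted(by_cluster))
    for round_members in zip_longest(*ordered):
        for accession in round_members:
            if accession is not None and len(picked) < total:
                picked.append(accession)
    return picked


# Hypothetical accession IDs and cluster labels, for illustration only.
clusters = {"RUN_A": 0, "RUN_B": 0, "RUN_C": 0, "RUN_D": 1, "RUN_E": 2}
selected = select_balanced(clusters, 4)
print(selected)
```

With a budget of four accessions, every cluster contributes one member before the largest cluster contributes a second, which is the sense of "balanced" assumed in this sketch.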
Affiliation(s)
- Peng Ken Lim
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Ruoxi Wang
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Marek Mutwil
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
2
Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024; 21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]
Abstract
Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them by the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
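The general idea of indexing short sequence matches and ranking subjects by how informative those matches are can be sketched as follows. This is a toy k-mer index, not Pebblescout's algorithm: it indexes every k-mer rather than a sample, the value of `K` is arbitrary, and the rarity-based weight (a log of the inverse subject frequency) is an assumption chosen for illustration.

```python
import math
from collections import defaultdict

K = 5  # k-mer length; an illustrative value only


def kmers(seq, k=K):
    """Return the set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(subjects):
    """Map each k-mer to the set of subject names that contain it."""
    index = defaultdict(set)
    for name, seq in subjects.items():
        for kmer in kmers(seq):
            index[kmer].add(name)
    return index


def search(index, n_subjects, query):
    """Rank subjects by summed weights of shared k-mers.

    K-mers found in fewer subjects get a higher weight (assumed scoring).
    """
    scores = defaultdict(float)
    for kmer in kmers(query):
        hits = index.get(kmer, ())
        if hits:
            weight = math.log(1 + n_subjects / len(hits))
            for name in hits:
                scores[name] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical subjects standing in for runs or assemblies.
subjects = {
    "run1": "ACGTACGTTAGC",
    "run2": "TTTTTACGTAGG",
    "run3": "GGGGGCCCCC",
}
index = build_index(subjects)
ranked = search(index, len(subjects), "ACGTACGT")
print([name for name, _ in ranked])
```

Here `run1` shares all four query k-mers and `run2` shares two, so `run1` ranks first and `run3`, which shares none, is not reported.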
Affiliation(s)
- Sergey A Shiryev
  - Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Richa Agarwala
  - Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
3
Shen L, Zhang Z, Wang R, Wu S, Wang Y, Fu S. Metatranscriptomic data mining together with microfluidic card uncovered the potential pathogens and seasonal RNA viral ecology in a drinking water source. J Appl Microbiol 2024; 135:lxad310. [PMID: 38130237 DOI: 10.1093/jambio/lxad310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 10/26/2023] [Accepted: 12/20/2023] [Indexed: 12/23/2023]
Abstract
AIMS: Despite metatranscriptomics becoming an emerging tool for pathogen surveillance, very little is known about the feasibility of this approach for understanding the fate of human-derived pathogens in drinking water sources.
METHODS AND RESULTS: We applied multiplexed microfluidic cards and metatranscriptomic sequencing to the drinking water source in a border city of North Korea in four seasons. The microfluidic card detected norovirus, hepatitis B virus (HBV), enterovirus, and Vibrio cholerae in the water. Phylogenetic analyses showed that environmental-derived sequences from norovirus GII.17, genotype C of HBV, and coxsackievirus A6 (CA6) were genetically related to the local clinical isolates. Meanwhile, metatranscriptomic assembly suggested that several bacterial pathogens, including Acinetobacter johnsonii and V. cholerae, might be prevalent in the studied region. Metatranscriptomic analysis recovered 349 species-level groups with substantial viral diversity without detection of norovirus, HBV, and CA6. Seasonally distinct virus communities were also found. Specifically, 126, 73, 126, and 457 types of viruses were identified in spring, summer, autumn, and winter, respectively. The viromes were dominated by the Pisuviricota phylum, including members from the Marnaviridae, Dicistroviridae, Luteoviridae, Potyviridae, Picornaviridae, Astroviridae, and Picobirnaviridae families. Further phylogenetic analyses of RNA-dependent RNA polymerase (RdRp) sequences showed a diverse set of picorna-like viruses associated with shellfish, of which several novel picorna-like viruses were also identified. Additionally, potential animal pathogens, including infectious bronchitis virus, Bat dicibavirus, Bat nodavirus, Bat picornavirus 2, infectious bursal disease virus, and Macrobrachium rosenbergii nodavirus, were also identified.
CONCLUSIONS: Our data illustrate the divergence between microfluidic cards and metatranscriptomics, highlighting that the combination of both methods facilitates the source tracking of human viruses in challenging settings without sufficient clinical surveillance.
Affiliation(s)
- Lixin Shen
  - Key Laboratory of Resource Biology and Biotechnology in Western China, Ministry of Education, Department of Microbiology, College of Life Sciences, Northwest University, Xi'an 710069, China
- Ziqiang Zhang
  - Key Laboratory of Resource Biology and Biotechnology in Western China, Ministry of Education, Department of Microbiology, College of Life Sciences, Northwest University, Xi'an 710069, China
- Rui Wang
  - College of Marine Science and Environment, Dalian Ocean University, Dalian 116023, China
- Shuang Wu
  - College of Food Technology and Sciences, Shanghai Ocean University, Shanghai 200093, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Yongjie Wang
  - College of Food Technology and Sciences, Shanghai Ocean University, Shanghai 200093, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
  - Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture, Shanghai 200093, China
- Songzhe Fu
  - Key Laboratory of Resource Biology and Biotechnology in Western China, Ministry of Education, Department of Microbiology, College of Life Sciences, Northwest University, Xi'an 710069, China
4
Yang P, Yang J, Long H, Huang K, Ji L, Lin H, Jiang X, Wang AK, Tian G, Ning K. MicroEXPERT: Microbiome profiling platform with cross-study metagenome-wide association analysis functionality. iMeta 2023; 2:e131. [PMID: 38868224 PMCID: PMC10989818 DOI: 10.1002/imt2.131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 06/06/2023] [Accepted: 07/10/2023] [Indexed: 06/14/2024]
Abstract
The framework of the MicroEXPERT platform. The platform comprises five modules. Data management module: Users upload raw data and metadata to the system using a guided workflow. Data processing module: Uploaded data are processed to generate taxonomical distribution and functional composition results. Metagenome-wide association studies (MWAS) module: Various methods, including biomarker analysis, PCA, co-occurrence networks, and sample classification, are employed using metadata. Data search module: Users can query nucleotide sequences to retrieve information in the MicroEXPERT database. Data visualization module: Visualization tools are used to illustrate the metagenome analysis results.
Affiliation(s)
- Pengshuo Yang
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - Institute of Medical Genomics, Biomedical Sciences College, Shandong First Medical University, Jinan, Shandong, China
- Jialiang Yang
  - Department of Sciences, Geneis Beijing Co., Ltd., Beijing, China
  - Department of Sciences, Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
  - Department of Sciences, Academician Workstation, Changsha Medical University, Changsha, China
- Haixia Long
  - Department of Information Science Technology, Hainan Normal University, Haikou, China
- Kaimei Huang
  - Department of Mathematics, Zhejiang Normal University, Jinhua, China
- Lei Ji
  - Department of Sciences, Geneis Beijing Co., Ltd., Beijing, China
  - Department of Sciences, Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Hanyang Lin
  - Department of Sciences, Sequenxe Biological Technology Co., Ltd., Xiamen, China
- Xiuli Jiang
  - Department of Sciences, Sequenxe Biological Technology Co., Ltd., Xiamen, China
- Geng Tian
  - Department of Sciences, Geneis Beijing Co., Ltd., Beijing, China
  - Department of Sciences, Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Kang Ning
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
5
Lu F, Wu S, Ni Y, Yu Y, Fu S, Wang Y. Metavirome-assembled genome sequence of a new aquatic RNA virus expands the genus Locarnavirus. Arch Virol 2023; 168:279. [PMID: 37878110 DOI: 10.1007/s00705-023-05908-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 08/23/2023] [Indexed: 10/26/2023]
Abstract
RNA viruses in marine environments have long been recognized as major players in the virosphere. In this study, the complete genome sequence of an RNA virus from Yangshan Harbor, named marine RNA virus Yangshan-LWW (YS-LWW), was obtained based on metavirome assembly. The genome of YS-LWW is 8839 nt in length and contains two open reading frames (ORFs). Both RdRP- and whole-genome-based phylogenetic analysis showed that YS-LWW, together with 45 viral isolates with sequences in public datasets, represents a new species in the genus Locarnavirus of the family Marnaviridae. PCR and public dataset mining indicate that YS-LWW and YS-LWW-like viruses have been widely detected in coastal and freshwater environments, suggesting that they might play a significant role in aquatic ecosystems.
Affiliation(s)
- Fangxin Lu
  - College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Shuang Wu
  - College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Yimin Ni
  - College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Yongxin Yu
  - College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Songzhe Fu
  - Key Laboratory of Environment Controlled Aquaculture (KLECA), Ministry of Education, Dalian Ocean University, Dalian, China
- Yongjie Wang
  - College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture, Shanghai, China
6
Leal JL, Milesi P, Salojärvi J, Lascoux M. Phylogenetic Analysis of Allotetraploid Species Using Polarized Genomic Sequences. Syst Biol 2023; 72:372-390. [PMID: 36932679 PMCID: PMC10275558 DOI: 10.1093/sysbio/syad009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2022] [Revised: 10/14/2022] [Accepted: 03/10/2023] [Indexed: 03/19/2023] [Open Access]
Abstract
Phylogenetic analysis of polyploid hybrid species has long posed a formidable challenge as it requires the ability to distinguish between alleles of different ancestral origins in order to disentangle their individual evolutionary history. This problem has been previously addressed by conceiving phylogenies as reticulate networks, using a two-step phasing strategy that first identifies and segregates homoeologous loci and then, during a second phasing step, assigns each gene copy to one of the subgenomes of an allopolyploid species. Here, we propose an alternative approach, one that preserves the core idea behind phasing (to produce separate nucleotide sequences that capture the reticulate evolutionary history of a polyploid) while vastly simplifying its implementation by reducing a complex multistage procedure to a single phasing step. While most current methods used for phylogenetic reconstruction of polyploid species require sequencing reads to be pre-phased using experimental or computational methods (usually an expensive, complex, and/or time-consuming endeavor), phasing executed using our algorithm is performed directly on the multiple-sequence alignment (MSA), a key change that allows for the simultaneous segregation and sorting of gene copies. We introduce the concept of genomic polarization that, when applied to an allopolyploid species, produces nucleotide sequences that capture the fraction of a polyploid genome that deviates from that of a reference sequence, usually one of the other species present in the MSA. We show that if the reference sequence is one of the parental species, the polarized polyploid sequence has a close resemblance (high pairwise sequence identity) to the second parental species.
This knowledge is harnessed to build a new heuristic algorithm where, by replacing the allopolyploid genomic sequence in the MSA by its polarized version, it is possible to identify the phylogenetic position of the polyploid's ancestral parents in an iterative process. The proposed methodology can be used with long-read and short-read high-throughput sequencing data and requires only one representative individual for each species to be included in the phylogenetic analysis. In its current form, it can be used in the analysis of phylogenies containing tetraploid and diploid species. We test the newly developed method extensively using simulated data in order to evaluate its accuracy. We show empirically that the use of polarized genomic sequences allows for the correct identification of both parental species of an allotetraploid with up to 97% certainty in phylogenies with moderate levels of incomplete lineage sorting (ILS) and 87% in phylogenies containing high levels of ILS. We then apply the polarization protocol to reconstruct the reticulate histories of Arabidopsis kamchatica and Arabidopsis suecica, two allopolyploids whose ancestry has been well documented. [Allopolyploidy; Arabidopsis; genomic polarization; homoeologs; incomplete lineage sorting; phasing; polyploid phylogenetics; reticulate evolution.].
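To make the idea of polarizing a sequence against a reference more concrete, here is a toy sketch. It is an interpretation for illustration only, not the authors' algorithm: it assumes the polyploid is represented as a consensus sequence in which two-base IUPAC ambiguity codes mark sites where the subgenomes differ, and it handles only aligned, equal-length sequences. The function name `polarize` and the example sequences are hypothetical.

```python
# Two-base IUPAC ambiguity codes and the nucleotides they stand for.
IUPAC_PAIRS = {
    "R": {"A", "G"},
    "Y": {"C", "T"},
    "S": {"C", "G"},
    "W": {"A", "T"},
    "K": {"G", "T"},
    "M": {"A", "C"},
}


def polarize(polyploid, reference):
    """Resolve ambiguous sites toward the base that differs from the reference.

    At a site coded as two bases, if the reference carries one of them,
    keep the other one. All other sites are left unchanged.
    """
    if len(polyploid) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for poly_base, ref_base in zip(polyploid, reference):
        alleles = IUPAC_PAIRS.get(poly_base)
        if alleles is not None and ref_base in alleles:
            (other,) = alleles - {ref_base}
            out.append(other)
        else:
            out.append(poly_base)
    return "".join(out)


# Hypothetical aligned sequences, for illustration only.
polarized = polarize("ARGYT", "AAGCT")
print(polarized)
```

In this example the ambiguous sites `R` (A or G) and `Y` (C or T) are resolved to `G` and `T`, the bases the reference does not carry, which mirrors the abstract's description of capturing the fraction of the polyploid sequence that deviates from the reference.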
Affiliation(s)
- J Luis Leal
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Pascal Milesi
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
  - Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
- Jarkko Salojärvi
  - Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Martin Lascoux
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
  - Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
7
Rouchka EC, de Almeida C, House RB, Daneshmand JC, Chariker JH, Saraswat-Ohri S, Gomes C, Sharp M, Shum-Siu A, Cesarz GM, Petruska JC, Magnuson DS. Construction of a searchable database for gene expression changes in spinal cord injury experiments. bioRxiv 2023:2023.02.01.526630. [PMID: 36778366 PMCID: PMC9915599 DOI: 10.1101/2023.02.01.526630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/06/2023]
Abstract
Spinal cord injury (SCI) is a debilitating disease resulting in an estimated 18,000 new cases in the United States on an annual basis. Significant behavioral research on animal models has led to a large amount of data, some of which has been catalogued in the Open Data Commons for Spinal Cord Injury (ODC-SCI). More recently, high throughput sequencing experiments have been utilized to understand molecular mechanisms associated with SCI, with nearly 6,000 samples from over 90 studies available in the Sequence Read Archive. However, to date, no resource is available for efficiently mining high throughput sequencing data from SCI experiments. Therefore, we have developed a protocol for processing RNA-Seq samples from high-throughput sequencing experiments related to SCI resulting in both raw and normalized data that can be efficiently mined for comparisons across studies as well as homologous discovery across species. We have processed 1,196 publicly available RNA-seq samples from 50 bulk RNA-Seq studies across nine different species, resulting in an SQLite database that can be used by the SCI research community for further discovery. We provide both the database as well as a web-based front-end that can be used to query the database for genes of interest, differential gene expression, genes with high variance, and gene set enrichments.
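A query for genes with high variance against an SQLite expression table might look like the sketch below. The schema (`expression` with `gene`, `sample`, and `value` columns), the gene names, and the values are all hypothetical and are not taken from the authors' database. Because SQLite has no built-in variance aggregate, the population variance is computed as the mean of squares minus the square of the mean.

```python
import sqlite3

# In-memory database with a hypothetical long-format expression table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, sample TEXT, value REAL)")
rows = [
    ("GeneA", "s1", 1.0), ("GeneA", "s2", 9.0),
    ("GeneB", "s1", 5.0), ("GeneB", "s2", 5.0),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)

# Population variance per gene: E[x^2] - (E[x])^2, highest first.
high_variance = conn.execute(
    """
    SELECT gene,
           AVG(value * value) - AVG(value) * AVG(value) AS variance
    FROM expression
    GROUP BY gene
    ORDER BY variance DESC
    """
).fetchall()
print(high_variance)
conn.close()
```

The same pattern extends to lookups for genes of interest by adding a `WHERE gene = ?` clause with a bound parameter.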
Affiliation(s)
- Eric C. Rouchka
  - Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine, University of Louisville, Louisville, KY USA
  - Kentucky IDeA Networks of Biomedical Research Excellence (KY INBRE) Bioinformatics Core, University of Louisville School of Medicine, 522 East Gray Street, Louisville, KY USA 40202
  - Bioinformatics Program, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY
- Carlos de Almeida
  - Translational Neuroscience Program, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
- Randi B. House
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Bioengineering, Speed School of Engineering, University of Louisville, Louisville, KY
- Jonah C. Daneshmand
  - Bioinformatics Program, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY
- Julia H. Chariker
  - Kentucky IDeA Networks of Biomedical Research Excellence (KY INBRE) Bioinformatics Core, University of Louisville School of Medicine, 522 East Gray Street, Louisville, KY USA 40202
  - Department of Neuroscience Training, School of Medicine, University of Louisville, Louisville, KY
- Sujata Saraswat-Ohri
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Neurological Surgery, School of Medicine, University of Louisville, Louisville, KY USA
- Cynthia Gomes
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Anatomical Sciences and Neurobiology, School of Medicine, University of Louisville, Louisville, KY
- Morgan Sharp
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Neurological Surgery, School of Medicine, University of Louisville, Louisville, KY USA
- Alice Shum-Siu
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Neurological Surgery, School of Medicine, University of Louisville, Louisville, KY USA
- Greta M. Cesarz
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
- Jeffrey C. Petruska
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Neurological Surgery, School of Medicine, University of Louisville, Louisville, KY USA
  - Department of Anatomical Sciences and Neurobiology, School of Medicine, University of Louisville, Louisville, KY
- David S.K. Magnuson
  - Translational Neuroscience Program, School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY
  - Kentucky Spinal Cord Injury Research Center, School of Medicine, University of Louisville, Louisville, KY
  - Department of Neurological Surgery, School of Medicine, University of Louisville, Louisville, KY USA
  - Department of Anatomical Sciences and Neurobiology, School of Medicine, University of Louisville, Louisville, KY
8
Tjaden B. Escherichia coli transcriptome assembly from a compendium of RNA-seq data sets. RNA Biol 2023; 20:77-84. [PMID: 36920168 PMCID: PMC10392735 DOI: 10.1080/15476286.2023.2189331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 01/09/2023] [Accepted: 01/27/2023] [Indexed: 03/16/2023] [Open Access]
Abstract
Owing to the complexities of bacterial RNA biology, the transcriptomes of even the best studied bacteria are not fully understood. To help elucidate the transcriptional landscape of E. coli, we compiled a compendium of 3,376 RNA-seq data sets composed of more than 7 trillion sequenced bases, which we evaluate with a transcript assembly pipeline. We report expression profiles for all annotated E. coli genes as well as 5,071 other transcripts. Additionally, we observe hundreds of instances of co-transcribed genes that are novel with respect to existing operon databases. By integrating data from a large number of sequencing experiments corresponding to a wide range of conditions, we are able to obtain a comprehensive view of the E. coli transcriptome.
Affiliation(s)
- Brian Tjaden
  - Department of Computer Science, Wellesley College, Wellesley, MA, USA
9
Forensic Analysis of Novel SARS2r-CoV Identified in Game Animal Datasets in China Shows Evolutionary Relationship to Pangolin GX CoV Clade and Apparent Genetic Experimentation. Appl Microbiol 2022. [DOI: 10.3390/applmicrobiol2040068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
Pangolins are the only animals other than bats proposed to have been infected with SARS-CoV-2 related coronaviruses (SARS2r-CoVs) prior to the COVID-19 pandemic. Here, we examine the novel SARS2r-CoV we previously identified in game animal metatranscriptomic datasets sequenced by the Nanjing Agricultural University in 2022, and find that sections of the partial genome phylogenetically group with Guangxi pangolin CoVs (GX PCoVs), while the full RdRp sequence groups with bat-SL-CoVZC45. While the novel SARS2r-CoV is found in 6 pangolin datasets, it is also found in 10 additional NGS datasets from 5 separate mammalian species and is likely related to contamination by a laboratory researched virus. Absence of bat mitochondrial sequences from the datasets, the fragmentary nature of the virus sequence and the presence of a partial sequence of a cloning vector attached to a SARS2r-CoV read suggests that it has been cloned. We find that NGS datasets containing the novel SARS2r-CoV are contaminated with significant Homo sapiens genetic material, and numerous viruses not associated with the host animals sampled. We further identify the dominant human haplogroup of the contaminating H. sapiens genetic material to be F1c1a1, which is of East Asian provenance. The association of this novel SARS2r-CoV with both bat CoV and the GX PCoV clades is an important step towards identifying the origin of the GX PCoVs.