1
Çiftçi B, Tekin R. Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences. Comput Biol Chem 2024; 112:108114. [PMID: 38852362 DOI: 10.1016/j.compbiolchem.2024.108114] [Received: 12/30/2023] [Revised: 05/06/2024] [Accepted: 05/25/2024] [Indexed: 06/11/2024]
Abstract
There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. 
This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
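The K-Mer coding step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `kmer_counts` and `kmer_vector`, the `ACGT` alphabet, and the frequency normalisation are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Return relative k-mer frequencies in a fixed, sorted vocabulary order.

    Assumption: words containing symbols outside `alphabet` (e.g. N) are ignored.
    """
    counts = kmer_counts(sequence, k)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total if total else 0.0 for w in vocab]


if __name__ == "__main__":
    print(kmer_counts("ACGTAC", 2))
    print(len(kmer_vector("ACGTAC", 2)))
```

A vector of this kind has 4^k entries, so the word length k controls the feature dimension that a downstream classifier would see.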
Affiliation(s)
- Bahar Çiftçi
- Batman University, Institute of Graduate Studies, Department of Electrical and Electronic Engineering, Turkey; Siirt University, Distance Education Application and Research Center, Turkey.
- Ramazan Tekin
- Batman University, Faculty of Engineering and Architecture, Department of Computer Engineering, Turkey.
2
Ni Y, Chu T, Yan S, Wang Y. Forty-nine metagenomic-assembled genomes from an aquatic virome expand Caudoviricetes by 45 potential new families and the newly uncovered Gossevirus of Bamfordvirae. J Gen Virol 2024; 105. [PMID: 38446011 DOI: 10.1099/jgv.0.001967] [Indexed: 03/07/2024] Open
Abstract
Twenty complete genomes (29-63 kb) and 29 genomes with an estimated completeness of over 90 % (30-90 kb) were identified for novel dsDNA viruses in the Yangshan Harbor metavirome. These newly discovered viruses contribute to the expansion of viral taxonomy by introducing 46 potential new families. Except for one virus, all others belong to the class Caudoviricetes. The exception is a novel member of the recently characterized viral group known as Gossevirus. Fifteen viruses were predicted to be temperate. The predicted hosts for the viruses appear to be involved in various aspects of the nitrogen cycle, including nitrogen fixation, oxidation and denitrification. Two viruses were identified to have a host of Flavobacterium and Tepidimonas fonticaldi, respectively, by matching CRISPR spacers with viral protospacers. Our findings provide an overview for characterizing and identifying specific viruses from Yangshan Harbor. The Gossevirus-like virus uncovered emphasizes the need for further comprehensive isolation and investigation of polinton-like viruses.
Affiliation(s)
- Yimin Ni
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, PR China
- Ting Chu
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, PR China
- Shuling Yan
- Entwicklungsgenetik und Zellbiologie der Tiere, Philipps-Universität Marburg, Marburg, Germany
- Yongjie Wang
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, PR China
- Laboratory for Marine Biology and Biotechnology, Qingdao Marine Science and Technology Center, Qingdao, PR China
- Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation, Ministry of Agriculture and Rural Affairs, Shanghai, PR China
3
Donaire L, Aranda MA. Computational Pipeline for the Detection of Plant RNA Viruses Using High-Throughput Sequencing. Methods Mol Biol 2024; 2724:1-20. [PMID: 37987894 DOI: 10.1007/978-1-0716-3485-1_1] [Indexed: 11/22/2023]
Abstract
In this chapter, we describe a computational pipeline for the in silico detection of plant viruses by high-throughput sequencing (HTS) from total RNA samples. The pipeline is designed for the analysis of short reads generated using an Illumina platform and freely available software tools. First, we provide advice for high-quality total RNA purification, library preparation, and sequencing. The bioinformatics pipeline begins with the raw reads obtained from the sequencing machine and performs some curation steps to obtain long contigs. Contigs are blasted against a local database of reference nucleotide viral sequences to identify the viruses in the samples. Then, the search is refined by applying specific filters. We also provide the code to re-map the short reads against the viruses found to get information on sequencing depth and read coverage for each virus. No previous bioinformatics background is required, but basic knowledge of the Unix command line and R language is recommended.
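As an illustration of the depth and coverage summary mentioned at the end of this abstract, the sketch below computes mean depth and breadth of coverage from per-position depths. It is not the chapter's code (which the abstract says relies on Unix tools and R); the function name `coverage_summary`, its parameters, and the input layout are hypothetical.

```python
def coverage_summary(depth_by_pos, ref_length, min_depth=1):
    """Summarise read mapping against one reference sequence.

    depth_by_pos: dict mapping 1-based position -> read depth; positions
                  missing from the dict are treated as depth 0 (assumption).
    ref_length:   length of the reference sequence in nucleotides.
    Returns (mean depth over the whole reference, fraction of positions
    covered by at least `min_depth` reads).
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    covered = sum(1 for d in depth_by_pos.values() if d >= min_depth)
    mean_depth = sum(depth_by_pos.values()) / ref_length
    breadth = covered / ref_length
    return mean_depth, breadth


if __name__ == "__main__":
    # Toy reference of length 4 with no reads at positions 3 and 4.
    print(coverage_summary({1: 10, 2: 20, 4: 0}, ref_length=4))
```

Dividing by the reference length rather than by the number of reported positions keeps uncovered regions from inflating the mean depth.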
Affiliation(s)
- Livia Donaire
- Abiopep S.L., Parque Científico de Murcia, Complejo de Espinardo, Murcia, Spain.
- Department of Stress Biology and Plant Pathology, Centro de Edafología y Biología Aplicada del Segura (CEBAS)-CSIC, Murcia, Spain.
- Miguel A Aranda
- Department of Stress Biology and Plant Pathology, Centro de Edafología y Biología Aplicada del Segura (CEBAS)-CSIC, Murcia, Spain
4
Rollin J, Rong W, Massart S. Cont-ID: detection of sample cross-contamination in viral metagenomic data. BMC Biol 2023; 21:217. [PMID: 37833740 PMCID: PMC10576407 DOI: 10.1186/s12915-023-01708-w] [Received: 09/14/2022] [Accepted: 09/20/2023] [Indexed: 10/15/2023] Open
Abstract
BACKGROUND High-throughput sequencing (HTS) technologies completed by the bioinformatic analysis of the generated data are becoming an important detection technique for virus diagnostics. They have the potential to replace or complement the current PCR-based methods thanks to their improved inclusivity and analytical sensitivity, as well as their overall good repeatability and reproducibility. Cross-contamination is a well-known phenomenon in molecular diagnostics and corresponds to the exchange of genetic material between samples. Cross-contamination management was a key drawback during the development of PCR-based detection and is now adequately monitored in routine diagnostics. HTS technologies are facing similar difficulties due to their very high analytical sensitivity. As a single viral read could be detected in millions of sequencing reads, it is mandatory to fix a detection threshold that will be informed by estimated cross-contamination. Cross-contamination monitoring should therefore be a priority when detecting viruses by HTS technologies. RESULTS We present Cont-ID, a bioinformatic tool designed to check for cross-contamination by analysing the relative abundance of virus sequencing reads identified in sequence metagenomic datasets and their duplication between samples. It can be applied when the samples in a sequencing batch have been processed in parallel in the laboratory and with at least one specific external control called Alien control. Using 273 real datasets, including 68 virus species from different hosts (fruit tree, plant, human) and several library preparation protocols (Ribodepleted total RNA, small RNA and double-stranded RNA), we demonstrated that Cont-ID classifies with high accuracy (91%) viral species detection into (true) infection or (cross) contamination. 
This classification raises confidence in the detection and facilitates the downstream interpretation and confirmation of the results by prioritising the virus detections that should be confirmed. CONCLUSIONS Cross-contamination between samples when detecting viruses using HTS (Illumina technology) can be monitored and highlighted by Cont-ID (provided an alien control is present). Cont-ID is based on a flexible methodology relying on the output of bioinformatics analyses of the sequencing reads and considering the contamination pattern specific to each batch of samples. The Cont-ID method is adaptable so that each laboratory can optimise it before its validation and routine use.
Affiliation(s)
- Johan Rollin
- Plant Pathology Laboratory, Gembloux Agro-Bio Tech, University of Liège, 5030, Gembloux, Belgium
- DNAVision, 6041, Gosselies, Belgium
- Wei Rong
- Plant Pathology Laboratory, Gembloux Agro-Bio Tech, University of Liège, 5030, Gembloux, Belgium
- Sébastien Massart
- Plant Pathology Laboratory, Gembloux Agro-Bio Tech, University of Liège, 5030, Gembloux, Belgium.
5
Candresse T, Svanella-Dumas L, Marais A, Depasse F, Faure C, Lefebvre M. Identification of Seven Additional Genome Segments of Grapevine-Associated Jivivirus 1. Viruses 2022; 15:39. [PMID: 36680079 PMCID: PMC9862270 DOI: 10.3390/v15010039] [Received: 11/09/2022] [Revised: 12/19/2022] [Accepted: 12/20/2022] [Indexed: 12/25/2022] Open
Abstract
Jiviviruses are a group of recently described viruses characterized by a tripartite genome and having affinities with Virgaviridae (RNA1 and 2) and Flaviviridae (RNA3). Using a combination of high-throughput sequencing, datamining and RT-PCR approaches, we demonstrate here that in grapevine samples infected by grapevine-associated jivivirus 1 (GaJV-1) up to 7 additional molecules can be consistently detected with conserved 5' and 3' non-coding regions in common with the three previously identified GaJV-1 genomic RNAs. RNA4, RNA5, RNA6, RNA7, RNA8 and RNA10, together with a recombinant RNArec7-8, are all members of a family sharing a previously unrecognized conserved protein domain, while RNA9 is part of a distinct family characterized by another conserved motif. Datamining of pecan (Carya illinoinensis) public transcriptomic data allowed the identification of two further jiviviruses and the identification of supplementary genomic RNAs with homologies to those of GaJV-1. Taken together, these results reshape our vision of the divided genome of jiviviruses and raise novel questions about the function(s) of the proteins encoded by the supplementary RNAs of jiviviruses.
Affiliation(s)
- Thierry Candresse
- INRAE, UMR BFP, University of Bordeaux, CS20032, CEDEX, 33882 Villenave d’Ornon, France
6
Roy T, Sharma K, Dhall A, Patiyal S, Raghava GPS. In silico method for predicting infectious strains of influenza A virus from its genome and protein sequences. J Gen Virol 2022; 103. [PMID: 36318663 DOI: 10.1099/jgv.0.001802] [Indexed: 06/16/2023] Open
Abstract
Influenza A is a contagious viral disease responsible for four pandemics in the past and a major public health concern. Being zoonotic in nature, the virus can cross the species barrier and transmit from wild aquatic bird reservoirs to humans via intermediate hosts. In this study, we have developed a computational method for the prediction of human-associated and non-human-associated influenza A virus sequences. The models were trained and validated on proteins and genome sequences of influenza A virus. Firstly, we have developed prediction models for 15 types of influenza A proteins using composition-based and one-hot-encoding features. We have achieved a highest AUC of 0.98 for HA protein on a validation dataset using dipeptide composition-based features. Of note, we obtained a maximum AUC of 0.99 using one-hot-encoding features for protein-based models on a validation dataset. Secondly, we built models using whole genome sequences which achieved an AUC of 0.98 on a validation dataset. In addition, we showed that our method outperforms a similarity-based approach (i.e., blast) on the same validation dataset. Finally, we integrated our best models into a user-friendly web server 'FluSPred' (https://webs.iiitd.edu.in/raghava/fluspred/index.html) and a standalone version (https://github.com/raghavagps/FluSPred) for the prediction of human-associated/non-human-associated influenza A virus strains.
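The dipeptide composition features named in this abstract can be sketched as follows. This is a generic illustration, not the FluSPred implementation: the function name `dipeptide_composition`, the 20-letter alphabet ordering, and the percentage scaling are assumptions based on one common definition of the feature.

```python
from itertools import product

# The 20 standard amino acids; the ordering is an arbitrary choice made here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(protein):
    """Return a 400-dimensional vector of dipeptide percentages.

    Each entry is 100 * (count of that adjacent residue pair) / (L - 1),
    where L is the sequence length. Pairs containing non-standard residues
    (e.g. X) are skipped but still count towards the denominator (assumption).
    """
    seq = protein.upper()
    total = len(seq) - 1
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    features = dipeptide_composition("MKAIIVLLMVVTSNA")
    print(len(features), round(sum(features), 2))
```

Because the vector length is fixed at 400 regardless of protein length, features of this kind can be fed directly to standard classifiers.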
Affiliation(s)
- Trinita Roy
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Khushal Sharma
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra Pal Singh Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India