1
Yin X, Dong Q, Fan S, Yang L, Li H, Jin Y, Laurentinah MR, Chen X, Sysa A, Fang H, Lyu J, Yu Y, Wang Y. A novel pathogenic mitochondrial DNA variant m.4344T>C in tRNA Gln causes developmental delay. J Hum Genet 2024; 69:381-389. [PMID: 38730005 DOI: 10.1038/s10038-024-01254-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Revised: 03/26/2024] [Accepted: 04/17/2024] [Indexed: 05/12/2024]
Abstract
Mitochondrial diseases are a group of genetic diseases caused by mutations in mitochondrial DNA or nuclear DNA. However, the genetic spectrum of these diseases is not yet complete. In this study, we identified a novel variant, m.4344T>C, in mitochondrial tRNA(Gln) from a patient with developmental delay. The mutant loads of m.4344T>C were 95% and 89% in the patient's blood and oral epithelial cells, respectively. Multiple sequence alignment showed high evolutionary conservation of this nucleotide. TrRosettaRNA predicted that the m.4344T>C variant would introduce an additional hydrogen bond and alter the conformation of the T-loop. The transmitochondrial cybrid-based study demonstrated that the m.4344T>C variant impaired the steady-state level of mitochondrial tRNA(Gln) and decreased the contents of mitochondrial OXPHOS complexes I, III, and IV, resulting in defective mitochondrial respiration, elevated mitochondrial ROS production, reduced mitochondrial membrane potential and decreased mitochondrial ATP levels. Altogether, this is the first report of a patient carrying the m.4344T>C variant. Our data uncover the pathogenesis of the m.4344T>C variant and expand the genetic mutation spectrum of mitochondrial diseases, thus contributing to the clinical diagnosis of mitochondrial diseases associated with mitochondrial tRNA(Gln) gene variants.
Affiliation(s)
- Xiaojie Yin
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Qiyu Dong
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Shuanglong Fan
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Lina Yang
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Hao Li
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Yijun Jin
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Mahlatsi Refiloe Laurentinah
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Xiandan Chen
  - International Sakharov Environmental Institute of Belarusian State University, Minsk, 220070, Republic of Belarus
- Aliaksei Sysa
  - International Sakharov Environmental Institute of Belarusian State University, Minsk, 220070, Republic of Belarus
- Hezhi Fang
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
- Jianxin Lyu
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
  - Laboratory Medicine Center, Department of Clinical Laboratory, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, 310053, Zhejiang, China
- Yongguo Yu
  - Department of Pediatric Endocrinology and Genetics, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Institute for Pediatric Research, Shanghai, 200092, China
  - Shanghai Key Laboratory of Pediatric Gastroenterology and Nutrition, Shanghai, 200092, China
- Ya Wang
  - Key Laboratory of Laboratory Medicine, Ministry of Education, Zhejiang Provincial Key Laboratory of Medical Genetics, School of Laboratory Medicine and Life Sciences, Wenzhou Medical University, Wenzhou, 325035, Zhejiang, China
2
Mumtaz Z, Rashid Z, Saif R, Yousaf MZ. Deep learning guided prediction modeling of dengue virus evolving serotype. Heliyon 2024; 10:e32061. [PMID: 38882365 PMCID: PMC11177124 DOI: 10.1016/j.heliyon.2024.e32061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 05/28/2024] [Accepted: 05/28/2024] [Indexed: 06/18/2024] Open
Abstract
Evolution remains an incessant process in viruses, allowing them to elude the host immune response and induce severe diseases, impacting diagnostic and vaccine effectiveness. Emerging and re-emerging diseases are among the significant public health concerns globally. The revival of dengue is mainly due to the potential for naturally arising mutations to induce genotypic alterations in serotypes. These transformations could lead to future outbreaks, underscoring the significance of studying DENV evolution in endemic regions. Predicting the emerging Dengue Virus (DENV) genome is crucial as the virus disrupts host cells, leading to fatal outcomes. Although deep learning has been applied to predict dengue fever cases, there has been relatively little emphasis on its significance in forecasting emerging DENV serotypes. While Recurrent Neural Networks (RNN) were initially designed for modeling temporal sequences, our proposed DL-DVE generative and classification model, trained on complete genome data of DENV, transcends traditional approaches by learning semantic relationships between nucleotides in a continuous vector space instead of representing the contextual meaning of nucleotide characters. Leveraging 2000 publicly available DENV complete genome sequences, our Long Short-Term Memory (LSTM) based generative and Feedforward Neural Network (FNN) based classification DL-DVE model showcases proficiency in learning intricate patterns and generating sequences for an emerging serotype of DENV. The generated sequences were analyzed along with available DENV serotype sequences to find conserved motifs in the genome through MEME Suite (version 5.5.5). The generative model showed an accuracy of 93%, and the classification model provided insight into the specific serotype label, corroborated by BLAST search verification. Evaluation metrics, including a ROC-AUC value of 0.818 and accuracy, precision, recall and F1 score all around 99.00%, demonstrate the classification model's reliability. Our model classified the generated sequences as DENV-4, exhibiting 65.99% similarity to DENV-4 and around 63-65% similarity with other serotypes, indicating notable distinction from other serotypes. Moreover, the intra-serotype divergence of sequences with a minimum of 90% similarity underscored their uniqueness.
Affiliation(s)
- Zilwa Mumtaz
  - KAM School of Life Sciences, Forman Christian College University, Ferozpur Road, Lahore, Pakistan
- Zubia Rashid
  - Department of Biomedical Engineering, Faculty of Engineering, Science, Technology and Management, Ziauddin University, Karachi, Pakistan
- Rashid Saif
  - Department of Biotechnology, Qarshi University, Lahore, Pakistan
- Muhammad Zubair Yousaf
  - KAM School of Life Sciences, Forman Christian College University, Ferozpur Road, Lahore, Pakistan
3
Chen G, Jiang J, Sun Y. RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes. Gigascience 2024; 13:giae059. [PMID: 39172545 PMCID: PMC11340644 DOI: 10.1093/gigascience/giae059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 05/29/2024] [Accepted: 07/23/2024] [Indexed: 08/24/2024] Open
Abstract
BACKGROUND: High-throughput sequencing technologies have revolutionized the identification of novel RNA viruses. Given that viruses are infectious agents, identifying hosts of these new viruses carries significant implications for public health and provides valuable insights into the dynamics of the microbiome. However, determining the hosts of these newly discovered viruses is not always straightforward, especially in the case of viruses detected in environmental samples. Even for host-associated samples, it is not always correct to assign the sample origin as the host of the identified viruses. The process of assigning hosts to RNA viruses remains challenging due to their high mutation rates and vast diversity. RESULTS: In this study, we introduce RNAVirHost, a machine learning-based tool that predicts the hosts of RNA viruses solely based on viral genomes. RNAVirHost is a hierarchical classification framework that predicts hosts at different taxonomic levels. We demonstrate the superior accuracy of RNAVirHost in predicting hosts of RNA viruses through comprehensive comparisons with various state-of-the-art techniques. When applied to viruses from novel genera, RNAVirHost achieved the highest accuracy of 84.3%, outperforming the alignment-based strategy by 12.1%. CONCLUSIONS: The application of machine learning models has proven beneficial in predicting hosts of RNA viruses. By integrating genomic traits and sequence homologies, RNAVirHost provides a cost-effective and efficient strategy for host prediction. We believe that RNAVirHost can greatly assist in RNA virus analyses and contribute to pandemic surveillance.
Affiliation(s)
- Guowei Chen
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), China
- Jingzhe Jiang
  - Key Laboratory of South China Sea Fishery Resources Exploitation & Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), China
4
Lin Y, Pascall DJ. Characterisation of putative novel tick viruses and zoonotic risk prediction. Ecol Evol 2024; 14:e10814. [PMID: 38259958 PMCID: PMC10800298 DOI: 10.1002/ece3.10814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 11/02/2023] [Accepted: 11/24/2023] [Indexed: 01/24/2024] Open
Abstract
Tick-associated viruses remain a substantial zoonotic risk worldwide, so knowledge of the diversity of tick viruses has potential health consequences. Despite their importance, large numbers of sequences in public data sets from tick meta-genomic and -transcriptomic projects remain unannotated, and these sequence data could contain undocumented viruses. Through data mining and bioinformatic analysis of more than 37,800 public meta-genomic and -transcriptomic data sets, we found 83 unannotated contigs exhibiting high identity with known tick viruses. These putative viral contigs were classified into three RNA viral families (Alphatetraviridae, Orthomyxoviridae and Chuviridae) and one DNA viral family (Asfarviridae). After manual checking of quality and dissimilarity towards other sequences in the data set, these 83 contigs were reduced to five contigs in the Alphatetraviridae from four putative viruses, four in the Orthomyxoviridae from two putative viruses and one in the Chuviridae, which clustered with known tick-associated viruses, forming a separate clade within the viral families. We further attempted to assess which previously known tick viruses likely represent zoonotic risks and thus deserve further investigation. We ranked the human infection potential of 133 known tick-associated viruses using a genome composition-based machine learning model. We found five high-risk tick-associated viruses (Langat virus, Lonestar tick chuvirus 1, Grotenhout virus, Taggert virus and Johnston Atoll virus) that are not known to infect humans and two viral families (Nairoviridae and Phenuiviridae) that contain a large proportion of potentially zoonotic tick-associated viruses. This adds to the knowledge of tick virus diversity and highlights the importance of surveillance of newly emerging tick-associated diseases.
Affiliation(s)
- Yuting Lin
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
  - Royal Veterinary College, University of London, London, UK
5
Hufsky F, Abecasis AB, Babaian A, Beck S, Brierley L, Dellicour S, Eggeling C, Elena SF, Gieraths U, Ha AD, Harvey W, Jones TC, Lamkiewicz K, Lovate GL, Lücking D, Machyna M, Nishimura L, Nocke MK, Renard BY, Sakaguchi S, Sakellaridi L, Spangenberg J, Tarradas-Alemany M, Triebel S, Vakulenko Y, Wijesekara RY, González-Candelas F, Krautwurst S, Pérez-Cataluña A, Randazzo W, Sánchez G, Marz M. The International Virus Bioinformatics Meeting 2023. Viruses 2023; 15:2031. [PMID: 37896809 PMCID: PMC10612056 DOI: 10.3390/v15102031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/08/2023] [Accepted: 09/14/2023] [Indexed: 10/29/2023] Open
Abstract
The 2023 International Virus Bioinformatics Meeting was held in Valencia, Spain, from 24 to 26 May 2023, attracting approximately 180 participants from around the world. The primary objective of the conference was to establish a dynamic scientific environment conducive to discussion, collaboration, and the generation of novel research ideas. As the first in-person event following the SARS-CoV-2 pandemic, the meeting facilitated highly interactive exchanges among attendees. It served as a pivotal gathering for gaining insights into the current status of virus bioinformatics research and engaging with leading researchers and emerging scientists. The event comprised eight invited talks, 19 contributed talks, and 74 poster presentations across eleven sessions spanning three days. Topics covered included machine learning, bacteriophages, virus discovery, virus classification, virus visualization, viral infection, viromics, molecular epidemiology, phylodynamic analysis, RNA viruses, viral sequence analysis, viral surveillance, and metagenomics. This report provides rewritten abstracts of the presentations, a summary of the key research findings, and highlights shared during the meeting.
Affiliation(s)
- Franziska Hufsky
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Ana B. Abecasis
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Global Health and Tropical Medicine, GHTM, Associate Laboratory in Translation and Innovation towards Global Health, LA-REAL, Instituto de Higiene e Medicina Tropical, IHMT, Universidade NOVA de Lisboa, UNL, Rua da Junqueira 100, 1349-008 Lisboa, Portugal
- Artem Babaian
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada
- Sebastian Beck
  - Leibniz Institute of Virology, Department Viral Zoonoses – One Health, 20251 Hamburg, Germany
- Liam Brierley
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Health Data Science, University of Liverpool, Liverpool L69 3GF, UK
- Simon Dellicour
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, CP160/12, 50 av. FD Roosevelt, 1050 Bruxelles, Belgium
  - Laboratory for Clinical and Epidemiological Virology, Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, University of Leuven, 3000 Leuven, Belgium
- Christian Eggeling
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Applied Optics and Biophysics, Friedrich Schiller University Jena, Max-Wien-Platz 1, 07743 Jena, Germany
- Santiago F. Elena
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute for Integrative Systems Biology (I2SysBio), CSIC-Universitat de Valencia, Catedratico Agustin Escardino 9, 46980 Valencia, Spain
- Udo Gieraths
  - Institute of Virology, Charité, Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
- Anh D. Ha
  - Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Will Harvey
  - The Roslin Institute, University of Edinburgh, Edinburgh EH25 9RG, UK
- Terry C. Jones
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Virology, Charité, Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
  - Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK
- Kevin Lamkiewicz
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Gabriel L. Lovate
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Dominik Lücking
  - Max-Planck Institute for Marine Microbiology, Celsiusstraße 1, 28359 Bremen, Germany
- Martin Machyna
  - Paul-Ehrlich-Institut, Host-Pathogen-Interactions, 63225 Langen, Germany
- Luca Nishimura
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
  - Human Genetics Laboratory, National Institute of Genetics, Mishima 411-8540, Japan
- Maximilian K. Nocke
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department for Molecular & Medical Virology, Ruhr University Bochum, 44801 Bochum, Germany
- Bernard Y. Renard
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Digital Engineering Faculty, Hasso Plattner Institute, University of Potsdam, 14482 Potsdam, Germany
- Shoichi Sakaguchi
  - Department of Microbiology and Infection Control, Faculty of Medicine, Osaka Medical and Pharmaceutical University, Osaka 569-8686, Japan
- Lygeri Sakellaridi
  - Institute for Virology and Immunobiology, University of Würzburg, Versbacher Str. 7, 97078 Würzburg, Germany
- Jannes Spangenberg
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Maria Tarradas-Alemany
  - Computational Genomics Lab., Department of Genetics, Microbiology and Statistics, Institut de Biomedicina UB (IBUB), Universitat de Barcelona (UB), 08028 Barcelona, Spain
- Sandra Triebel
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Yulia Vakulenko
  - Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, Sechenov First Moscow State Medical University, 119991 Moscow, Russia
- Rajitha Yasas Wijesekara
  - Institute for Bioinformatics, University of Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany
- Fernando González-Candelas
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute for Integrative Systems Biology (I2SysBio), CSIC-Universitat de Valencia, Catedratico Agustin Escardino 9, 46980 Valencia, Spain
  - Joint Research Unit “Infection and Public Health” FISABIO, University of Valencia, 46010 Valencia, Spain
- Sarah Krautwurst
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Alba Pérez-Cataluña
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - VISAFELab, Department of Preservation and Food Safety Technologies, Institute of Agrochemistry and Food Technology, IATA-CSIC, 46980 Valencia, Spain
- Walter Randazzo
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - VISAFELab, Department of Preservation and Food Safety Technologies, Institute of Agrochemistry and Food Technology, IATA-CSIC, 46980 Valencia, Spain
- Gloria Sánchez
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - VISAFELab, Department of Preservation and Food Safety Technologies, Institute of Agrochemistry and Food Technology, IATA-CSIC, 46980 Valencia, Spain
- Manja Marz
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
  - German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
  - Michael Stifel Center Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07745 Jena, Germany
  - Leibniz Institute for Age Research – Fritz Lipmann Institute, 07745 Jena, Germany
6
Gonzalez-Isunza G, Jawaid MZ, Liu P, Cox DL, Vazquez M, Arsuaga J. Using machine learning to detect coronaviruses potentially infectious to humans. Sci Rep 2023; 13:9319. [PMID: 37291260 PMCID: PMC10248971 DOI: 10.1038/s41598-023-35861-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 05/24/2023] [Indexed: 06/10/2023] Open
Abstract
Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS-related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS-related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 with a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.
Affiliation(s)
- M Zaki Jawaid
  - Department of Physics, University of California, Davis, USA
- Pengyu Liu
  - Department of Microbiology and Molecular Genetics, University of California, Davis, CA, USA
- Daniel L Cox
  - Department of Physics, University of California, Davis, USA
- Mariel Vazquez
  - Department of Microbiology and Molecular Genetics, University of California, Davis, CA, USA
  - Department of Mathematics, University of California, Davis, CA, USA
- Javier Arsuaga
  - Department of Molecular and Cellular Biology, University of California, Davis, CA, USA
  - Department of Mathematics, University of California, Davis, CA, USA
7
Elbasir A, Ye Y, Schäffer DE, Hao X, Wickramasinghe J, Tsingas K, Lieberman PM, Long Q, Morris Q, Zhang R, Schäffer AA, Auslander N. A deep learning approach reveals unexplored landscape of viral expression in cancer. Nat Commun 2023; 14:785. [PMID: 36774364 PMCID: PMC9922274 DOI: 10.1038/s41467-023-36336-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/20/2022] [Accepted: 01/25/2023] [Indexed: 02/13/2023] Open
Abstract
About 15% of human cancer cases are attributed to viral infections. To date, virus expression in tumor tissues has been mostly studied by aligning tumor RNA sequencing reads to databases of known viruses. To allow identification of divergent viruses and rapid characterization of the tumor virome, we develop viRNAtrap, an alignment-free pipeline to identify viral reads and assemble viral contigs. We utilize viRNAtrap, which is based on a deep learning model trained to discriminate viral RNAseq reads, to explore viral expression in cancers and apply it to 14 cancer types from The Cancer Genome Atlas (TCGA). Using viRNAtrap, we uncover expression of unexpected and divergent viruses that have not previously been implicated in cancer and disclose human endogenous viruses whose expression is associated with poor overall survival. The viRNAtrap pipeline provides a way forward to study viral infections associated with different clinical conditions.
Affiliation(s)
- Ying Ye
  - The Wistar Institute, Philadelphia, PA, 19104, USA
- Daniel E Schäffer
  - The Wistar Institute, Philadelphia, PA, 19104, USA
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Xue Hao
  - The Wistar Institute, Philadelphia, PA, 19104, USA
- Konstantinos Tsingas
  - The Wistar Institute, Philadelphia, PA, 19104, USA
  - University of Pennsylvania, Philadelphia, PA, USA
- Qi Long
  - University of Pennsylvania, Philadelphia, PA, USA
- Quaid Morris
  - Computational and Systems Biology, Sloan Kettering Institute, New York City, NY, 10065, USA
- Rugang Zhang
  - The Wistar Institute, Philadelphia, PA, 19104, USA
- Alejandro A Schäffer
  - Cancer Data Science Laboratory (CDSL), National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
8
Iuchi H, Kawasaki J, Kubo K, Fukunaga T, Hokao K, Yokoyama G, Ichinose A, Suga K, Hamada M. Bioinformatics approaches for unveiling virus-host interactions. Comput Struct Biotechnol J 2023; 21:1774-1784. [PMID: 36874163 PMCID: PMC9969756 DOI: 10.1016/j.csbj.2023.02.044] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/17/2022] [Revised: 02/22/2023] [Accepted: 02/22/2023] [Indexed: 03/03/2023] Open
Abstract
The coronavirus disease-2019 (COVID-19) pandemic has exposed major limitations in the capacity of medical and research institutions to appropriately manage emerging infectious diseases. We can improve our understanding of infectious diseases by unveiling virus-host interactions through host range prediction and protein-protein interaction prediction. Although many algorithms have been developed to predict virus-host interactions, numerous issues remain to be solved, and the entire network remains veiled. In this review, we comprehensively survey algorithms used to predict virus-host interactions. We also discuss the current challenges, such as dataset biases toward highly pathogenic viruses, and the potential solutions. The complete prediction of virus-host interactions remains difficult; however, bioinformatics can contribute to progress in research on infectious diseases and human health.
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan.,Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
| | - Junna Kawasaki
- Faculty of Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
| | - Kento Kubo
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan.,School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Nishi Waseda, Shinjuku-ku, Tokyo 169-0051, Japan
| | - Koki Hokao
- School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
| | - Gentaro Yokoyama
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan.,School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
| | - Akiko Ichinose
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Kanta Suga
- School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
| | - Michiaki Hamada
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan.,Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan.,School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan.,Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
|
9
|
Enveloped viruses show increased propensity to cross-species transmission and zoonosis. Proc Natl Acad Sci U S A 2022; 119:e2215600119. [PMID: 36472956 PMCID: PMC9897429 DOI: 10.1073/pnas.2215600119] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 12/12/2022] Open
Abstract
The transmission of viruses between different host species is a major source of emerging diseases and is of particular concern in the case of zoonotic transmission from mammals to humans. Several zoonosis risk factors have been identified, but it is currently unclear which viral traits primarily determine this process as previous work has focused on a few hundred viruses that are not representative of actual viral diversity. Here, we investigate fundamental virological traits that influence cross-species transmissibility and zoonotic propensity by interrogating a database of over 12,000 mammalian virus-host associations. Our analysis reveals that enveloped viruses tend to infect more host species and are more likely to be zoonotic than nonenveloped viruses, while other viral traits such as genome composition, structure, size, or the viral replication compartment play a less obvious role. This contrasts with the previous notion that viral envelopes did not significantly impact or even reduce zoonotic risk and should help better prioritize outbreak prevention efforts. We suggest several mechanisms by which viral envelopes could promote cross-species transmissibility, including structural flexibility of receptor-binding proteins and evasion of viral entry barriers.
|
10
|
Sherif FF, Ahmed KS. Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder. Journal of Engineering and Applied Science 2022. [PMCID: PMC9383682 DOI: 10.1186/s44147-022-00125-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
SARS-CoV-2’s population structure might have a substantial impact on public health management and diagnostics if it can be identified. It is critical to rapidly monitor and characterize its lineages circulating globally for a more accurate diagnosis, improved care, and faster treatment. For a clearer picture of the SARS-CoV-2 population structure, clustering the sequencing data is essential. Here, deep clustering techniques were used to automatically group 29,017 different strains of SARS-CoV-2 into clusters. We aim to identify the main clusters of SARS-CoV-2 population structure based on a convolutional autoencoder (CAE) trained with numerical feature vectors mapped from coronavirus Spike peptide sequences. Our clustering findings revealed that there are six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). These clusters contained 43 unique lineages in which the 29,017 publicly accessible strains were dispersed. In all six resulting clusters, the genetic distances within the same cluster (intra-cluster distances) are smaller than the distances between clusters (inter-cluster distances) (P-value 0.0019, Wilcoxon rank-sum test). This indicates substantial evidence of a connection between each cluster’s lineages. Furthermore, the K-means and hierarchical clustering methods were compared against the proposed deep learning clustering method. The intra-cluster genetic distances of the proposed method were smaller than those of K-means alone and hierarchical clustering methods. We used T-distributed stochastic-neighbor embedding (t-SNE) to show the outcomes of the deep learning clustering. The strains were separated correctly between clusters in the t-SNE plot. Our results showed that the C5 cluster includes only the Gamma lineage (P.1), suggesting that strains of P.1 in C5 are more diversified than those in the other clusters.
Our study indicates that the genetic similarity between strains in the same cluster enables a better understanding of the major features of the unknown population lineages when compared to some of the more prevalent viral isolates. This information helps researchers figure out how the virus changed over time and spread to people all over the world.
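The intra-cluster versus inter-cluster comparison described in this abstract can be illustrated with a minimal sketch. It assumes equal-length aligned sequences and uses Hamming distance as the genetic distance; the abstract does not state which distance the authors used, and the function names `hamming` and `intra_inter_distances` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def intra_inter_distances(seqs, labels):
    """Split all pairwise distances into same-cluster and cross-cluster lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        # A pair is intra-cluster when both sequences carry the same label.
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return intra, inter


# Toy data (illustrative only, not from the study).
seqs = ["AAAA", "AAAT", "GGGG", "GGGC"]
labels = [0, 0, 1, 1]
intra, inter = intra_inter_distances(seqs, labels)
print(sum(intra) / len(intra), sum(inter) / len(inter))
```

The two lists could then be passed to a rank-sum test to check whether intra-cluster distances are systematically smaller.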
|
11
|
Bartoszewicz JM, Nasri F, Nowicka M, Renard BY. Detecting DNA of novel fungal pathogens using ResNets and a curated fungi-hosts data collection. Bioinformatics 2022; 38:ii168-ii174. [PMID: 36124807 DOI: 10.1093/bioinformatics/btac495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/08/2022] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data remain comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No prediction method working for agents across all three groups is available, even though the cause of an infection is often difficult to identify from symptoms alone. RESULTS We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats. CONCLUSIONS The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task. AVAILABILITY AND IMPLEMENTATION The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jakub M Bartoszewicz
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.,Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
| | - Ferdous Nasri
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.,Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
| | - Melania Nowicka
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.,Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
| | - Bernhard Y Renard
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
|
12
|
Krishnamoorthy M, Ranjan P, Erb-Downward JR, Dickson RP, Wiens J. AMAISE: a machine learning approach to index-free sequence enrichment. Commun Biol 2022; 5:568. [PMID: 35681015 PMCID: PMC9184628 DOI: 10.1038/s42003-022-03498-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 05/18/2022] [Indexed: 11/21/2022] Open
Abstract
Metagenomics holds potential to improve clinical diagnostics of infectious diseases, but DNA from clinical specimens is often dominated by host-derived sequences. To address this, researchers employ host-depletion methods. Laboratory-based host-depletion methods, however, are costly in terms of time and effort, while computational host-depletion methods rely on memory-intensive reference index databases and struggle to accurately classify noisy sequence data. To solve these challenges, we propose an index-free tool, AMAISE (A Machine Learning Approach to Index-Free Sequence Enrichment). Applied to the task of separating host from microbial reads, AMAISE achieves over 98% accuracy. Applied prior to metagenomic classification, AMAISE results in a 14-18% decrease in memory usage compared to using metagenomic classification alone. Our results show that a reference-independent machine learning approach to host depletion allows for accurate and efficient sequence detection.
Affiliation(s)
- Meera Krishnamoorthy
- Division of Computer Science and Engineering, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
| | - Piyush Ranjan
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
| | - John R Erb-Downward
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| | - Robert P Dickson
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
- Max Harry Weil Institute for Critical Care Research and Innovation, University of Michigan, Ann Arbor, MI, USA
| | - Jenna Wiens
- Division of Computer Science and Engineering, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA.
|
13
|
Albery GF, Becker DJ, Brierley L, Brook CE, Christofferson RC, Cohen LE, Dallas TA, Eskew EA, Fagre A, Farrell MJ, Glennon E, Guth S, Joseph MB, Mollentze N, Neely BA, Poisot T, Rasmussen AL, Ryan SJ, Seifert S, Sjodin AR, Sorrell EM, Carlson CJ. The science of the host-virus network. Nat Microbiol 2021; 6:1483-1492. [PMID: 34819645 DOI: 10.1038/s41564-021-00999-5] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 09/30/2020] [Accepted: 10/18/2021] [Indexed: 01/21/2023]
Abstract
Better methods to predict and prevent the emergence of zoonotic viruses could support future efforts to reduce the risk of epidemics. We propose a network science framework for understanding and predicting human and animal susceptibility to viral infections. Related approaches have so far helped to identify basic biological rules that govern cross-species transmission and structure the global virome. We highlight ways to make modelling both accurate and actionable, and discuss the barriers that prevent researchers from translating viral ecology into public health policies that could prevent future pandemics.
Affiliation(s)
- Gregory F Albery
- Department of Biology, Georgetown University, Washington DC, USA.
| | - Daniel J Becker
- Department of Biology, University of Oklahoma, Norman, OK, USA
| | - Liam Brierley
- Institute of Translational Medicine, University of Liverpool, Liverpool, UK
| | - Cara E Brook
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | | | - Lily E Cohen
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Tad A Dallas
- Department of Biological Sciences, University of South Carolina, Columbia, SC, USA
| | - Evan A Eskew
- Department of Biology, Pacific Lutheran University, Tacoma, WA, USA
| | - Anna Fagre
- Department of Microbiology, Immunology and Pathology, Colorado State University, Fort Collins, CO, USA
| | - Maxwell J Farrell
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
| | - Emma Glennon
- Disease Dynamics Unit, Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
| | - Sarah Guth
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Maxwell B Joseph
- Earth Lab, Cooperative Institute for Research in Environmental Science, University of Colorado Boulder, Boulder, CO, USA
| | - Nardus Mollentze
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, UK.,MRC - University of Glasgow Centre for Virus Research, Glasgow, UK
| | - Benjamin A Neely
- National Institute of Standards and Technology, Charleston, SC, USA
| | - Timothée Poisot
- Québec Centre for Biodiversity Sciences, Montréal, Québec, Canada.,Département de Sciences Biologiques, Université de Montréal, Montréal, Québec, Canada
| | - Angela L Rasmussen
- Vaccine and Infectious Disease Organization, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.,Department of Biochemistry, Microbiology, and Immunology, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| | - Sadie J Ryan
- Department of Geography, University of Florida, Gainesville, FL, USA.,Emerging Pathogens Institute, University of Florida, Gainesville, FL, USA.,School of Life Sciences, University of KwaZulu-Natal, Durban, South Africa
| | - Stephanie Seifert
- Paul G. Allen School for Global Health, Washington State University, Pullman, WA, USA
| | - Anna R Sjodin
- Department of Biological Sciences, University of Idaho, Moscow, ID, USA
| | - Erin M Sorrell
- Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC, USA.,Department of Microbiology and Immunology, Georgetown University Medical Center, Washington, DC, USA
| | - Colin J Carlson
- Center for Global Health Science and Security, Georgetown University Medical Center, Washington, DC, USA. .,Department of Microbiology and Immunology, Georgetown University Medical Center, Washington, DC, USA.
|
14
|
Löchel HF, Heider D. Chaos game representation and its applications in bioinformatics. Comput Struct Biotechnol J 2021; 19:6263-6271. [PMID: 34900136 PMCID: PMC8636998 DOI: 10.1016/j.csbj.2021.11.008] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 08/26/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 11/18/2022] Open
Abstract
Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool for alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
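As a rough illustration of the construction summarized above, the sketch below computes CGR points for a DNA sequence and an FCGR count matrix of size 2^k x 2^k. The corner layout (A, C, G, T), the function names, and the restriction to unambiguous A/C/G/T input are assumptions made for this example, not details taken from the review.

```python
# Corner assignment is a convention; this particular layout is an assumption.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """CGR trajectory: each point lies halfway between the previous point
    and the corner belonging to the current base."""
    x, y = 0.5, 0.5  # start in the centre of the unit square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError for symbols other than A/C/G/T
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def fcgr(seq, k):
    """FCGR: count every k-mer in the grid cell that its CGR point falls into.
    The result is always 2^k x 2^k, whatever the sequence length."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        col = row = 0
        # The most recent base selects the coarsest half of the square,
        # so it supplies the most significant bit of the cell index.
        for base in reversed(seq[i:i + k]):
            cx, cy = CORNERS[base]
            col = col * 2 + cx
            row = row * 2 + cy
        grid[row][col] += 1
    return grid


# Two sequences of different lengths yield matrices of the same shape.
short, long_ = fcgr("ACGTAC", 2), fcgr("ACGTACGTTGCA", 2)
print(len(short), len(long_))
```

Computing the cell index from the k-mer bits rather than from floating-point coordinates avoids rounding problems on long sequences; both routes place a k-mer in the same cell.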
Affiliation(s)
- Hannah Franziska Löchel
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
| | - Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
|
15
|
Correcting the Estimation of Viral Taxa Distributions in Next-Generation Sequencing Data after Applying Artificial Neural Networks. Genes (Basel) 2021; 12:genes12111755. [PMID: 34828361 PMCID: PMC8624964 DOI: 10.3390/genes12111755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 10/25/2021] [Accepted: 10/27/2021] [Indexed: 11/16/2022] Open
Abstract
Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network to classify test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
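One way to picture the correction step described above is an iterative re-estimation of class proportions from the classifier's predicted counts and the conditional probabilities P(predicted class | true class) measured on labelled reads. The sketch below uses a simple expectation-maximization update; this is an assumed stand-in for illustration and not necessarily the procedure used in the paper, and the function name and toy numbers are hypothetical.

```python
def correct_class_frequencies(pred_counts, cond_prob, n_iter=200):
    """Estimate true class proportions from predicted class counts.

    pred_counts[i]  : number of reads the classifier assigned to class i
    cond_prob[j][i] : P(predicted = i | true = j), estimated on labelled reads
    """
    n_classes = len(pred_counts)
    prior = [1.0 / n_classes] * n_classes  # start from a uniform guess
    for _ in range(n_iter):
        expected = [0.0] * n_classes
        for i, count in enumerate(pred_counts):
            # Posterior over the true class j for a read predicted as class i.
            weights = [prior[j] * cond_prob[j][i] for j in range(n_classes)]
            norm = sum(weights)
            if count == 0 or norm == 0.0:
                continue
            for j in range(n_classes):
                expected[j] += count * weights[j] / norm
        total = sum(expected)
        prior = [e / total for e in expected]
    return prior


# Toy example: true proportions 0.3 / 0.7 produce 410 / 590 predicted reads
# under the confusion probabilities below (illustrative numbers only).
cond_prob = [[0.9, 0.1],
             [0.2, 0.8]]
print(correct_class_frequencies([410, 590], cond_prob))
```

The raw predicted proportions (0.41 / 0.59) are biased by misclassification; the corrected estimate moves back toward the proportions that generated them.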
|
16
|
Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples. Viruses 2021; 13:v13102006. [PMID: 34696436 PMCID: PMC8541124 DOI: 10.3390/v13102006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/30/2021] [Revised: 09/30/2021] [Accepted: 10/02/2021] [Indexed: 12/27/2022] Open
Abstract
According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in the genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.
|
17
|
Mollentze N, Babayan SA, Streicker DG. Identifying and prioritizing potential human-infecting viruses from their genome sequences. PLoS Biol 2021; 19:e3001390. [PMID: 34582436 PMCID: PMC8478193 DOI: 10.1371/journal.pbio.3001390] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 11/30/2020] [Accepted: 08/10/2021] [Indexed: 11/18/2022] Open
Abstract
Determining which animal viruses may be capable of infecting humans is currently intractable at the time of their discovery, precluding prioritization of high-risk viruses for early investigation and outbreak preparedness. Given the increasing use of genomics in virus discovery and the otherwise sparse knowledge of the biology of newly discovered viruses, we developed machine learning models that identify candidate zoonoses solely using signatures of host range encoded in viral genomes. Within a dataset of 861 viral species with known zoonotic status, our approach outperformed models based on the phylogenetic relatedness of viruses to known human-infecting viruses (area under the receiver operating characteristic curve [AUC] = 0.773), distinguishing high-risk viruses within families that contain a minority of human-infecting species and identifying putatively undetected or so far unrealized zoonoses. Analyses of the underpinnings of model predictions suggested the existence of generalizable features of viral genomes that are independent of virus taxonomic relationships and that may preadapt viruses to infect humans. Our model reduced a second set of 645 animal-associated viruses that were excluded from training to 272 high and 41 very high-risk candidate zoonoses and showed significantly elevated predicted zoonotic risk in viruses from nonhuman primates, but not other mammalian or avian host groups. A second application showed that our models could have identified Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) as a relatively high-risk coronavirus strain and that this prediction required no prior knowledge of zoonotic Severe Acute Respiratory Syndrome (SARS)-related coronaviruses. Genome-based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven virus surveillance and increases the feasibility of downstream biological and ecological characterization of viruses.
Affiliation(s)
- Nardus Mollentze
- Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Simon A. Babayan
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Daniel G. Streicker
- Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
|
18
|
Dasari CM, Bhukya R. Explainable deep neural networks for novel viral genome prediction. Appl Intell 2021; 52:3002-3017. [PMID: 34764607 PMCID: PMC8232563 DOI: 10.1007/s10489-021-02572-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 05/26/2021] [Indexed: 11/27/2022]
Abstract
Viral infection causes a wide variety of human diseases including cancer and COVID-19. Viruses invade host cells and associate with host molecules, potentially disrupting normal host function and leading to fatal diseases. Novel viral genome prediction is crucial for understanding complex viral diseases such as AIDS and Ebola. While most existing computational techniques classify viral genomes, the efficiency of the classification depends solely on the structural features extracted. State-of-the-art DNN models have achieved excellent performance through automatic extraction of classification features, but the degree of model explainability is relatively poor. During model training for viral prediction, the proposed CNN and CNN-LSTM based methods (EdeepVPP and EdeepVPP-hybrid) automatically extract features. EdeepVPP also supports model interpretation, using learned filters to extract the patterns most important for identifying viral genomes. It is an interpretable CNN model that extracts vital biologically relevant patterns (features) from feature maps of viral sequences. The EdeepVPP-hybrid predictor outperforms all the existing methods by achieving 0.992 mean AUC-ROC and 0.990 AUC-PR on 19 human metagenomic contig experiment datasets using 10-fold cross-validation. We evaluate the ability of CNN filters to detect patterns across high average activation values. To further assess the robustness of the EdeepVPP model, we perform leave-one-experiment-out cross-validation. It can work as a recommendation system to further analyze the raw sequences labeled as ‘unknown’ by alignment-based methods. We show that our interpretable model can extract patterns that are considered to be the most important features for predicting virus sequences through learned filters.
Affiliation(s)
| | - Raju Bhukya
- National Institute of Technology, Warangal, Telangana 506004 India
|
19
|
Brierley L, Fowler A. Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning. PLoS Pathog 2021; 17:e1009149. [PMID: 33878118 PMCID: PMC8087038 DOI: 10.1371/journal.ppat.1009149] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/01/2020] [Revised: 04/30/2021] [Accepted: 04/09/2021] [Indexed: 12/21/2022] Open
Abstract
The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, and identifying such origins is a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases, in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating evolutionary signal in spike proteins to be just as informative as whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
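To make the notion of a dinucleotide composition bias concrete, the sketch below computes one common definition: the observed frequency of each dinucleotide divided by the frequency expected if bases occurred independently. The exact feature definitions used in the study are not given in the abstract, so this formula and the function name `dinucleotide_bias` are assumptions for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides.

    Expected frequency assumes independent bases: f(X) * f(Y).
    Pairs containing ambiguous symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in BASES)
    pair_counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())
    if n_bases == 0 or n_pairs == 0:
        raise ValueError("sequence contains no usable dinucleotides")

    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
        observed = pair_counts[x + y] / n_pairs
        # Undefined when one of the two bases never occurs in the sequence.
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias


# Toy sequence (illustrative only): values above 1 mean over-representation.
features = dinucleotide_bias("ACGTACGT")
print(round(features["AC"], 3), features["AA"])
```

A vector of these 16 ratios per sequence is the kind of numerical input a classifier such as a random forest can be trained on.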
Affiliation(s)
- Liam Brierley
- Department of Health Data Science, University of Liverpool, Brownlow Street, Liverpool, United Kingdom
| | - Anna Fowler
- Department of Health Data Science, University of Liverpool, Brownlow Street, Liverpool, United Kingdom
|
20
|
Bergner LM, Mollentze N, Orton RJ, Tello C, Broos A, Biek R, Streicker DG. Characterizing and Evaluating the Zoonotic Potential of Novel Viruses Discovered in Vampire Bats. Viruses 2021; 13:252. [PMID: 33562073 PMCID: PMC7914986 DOI: 10.3390/v13020252] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 11/17/2020] [Revised: 01/25/2021] [Accepted: 02/03/2021] [Indexed: 12/17/2022] Open
Abstract
The contemporary surge in metagenomic sequencing has transformed knowledge of viral diversity in wildlife. However, evaluating which newly discovered viruses pose sufficient risk of infecting humans to merit detailed laboratory characterization and surveillance remains largely speculative. Machine learning algorithms have been developed to address this imbalance by ranking the relative likelihood of human infection based on viral genome sequences, but are not yet routinely applied to viruses at the time of their discovery. Here, we characterized viral genomes detected through metagenomic sequencing of feces and saliva from common vampire bats (Desmodus rotundus) and used these data as a case study in evaluating zoonotic potential using molecular sequencing data. Of 58 detected viral families, including 17 which infect mammals, the only known zoonosis detected was rabies virus; however, additional genomes were detected from the families Hepeviridae, Coronaviridae, Reoviridae, Astroviridae and Picornaviridae, all of which contain human-infecting species. In phylogenetic analyses, novel vampire bat viruses most frequently grouped with other bat viruses that are not currently known to infect humans. In agreement, machine learning models built from only phylogenetic information ranked all novel viruses similarly, yielding little insight into zoonotic potential. In contrast, genome composition-based machine learning models estimated different levels of zoonotic potential, even for closely related viruses, categorizing one out of four detected hepeviruses and two out of three picornaviruses as having high priority for further research. We highlight the value of evaluating zoonotic potential beyond ad hoc consideration of phylogeny and provide surveillance recommendations for novel viruses in a wildlife host which has frequent contact with humans and domestic animals.
Affiliation(s)
- Laura M. Bergner
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK; (N.M.); (R.B.); (D.G.S.)
- MRC–University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK; (R.J.O.); (A.B.)
| | - Nardus Mollentze
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK; (N.M.); (R.B.); (D.G.S.)
- MRC–University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK; (R.J.O.); (A.B.)
| | - Richard J. Orton
- MRC–University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK; (R.J.O.); (A.B.)
| | - Carlos Tello
- Association for the Conservation and Development of Natural Resources, Lima 15037, Peru;
- Yunkawasi, Lima 15049, Peru
| | - Alice Broos
- MRC–University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK; (R.J.O.); (A.B.)
| | - Roman Biek
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK; (N.M.); (R.B.); (D.G.S.)
| | - Daniel G. Streicker
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK; (N.M.); (R.B.); (D.G.S.)
- MRC–University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK; (R.J.O.); (A.B.)
|