1
Yin R, Luo Z, Zhuang P, Zeng M, Li M, Lin Z, Kwoh CK. ViPal: A framework for virulence prediction of influenza viruses with prior viral knowledge using genomic sequences. J Biomed Inform 2023; 142:104388. [PMID: 37178781 PMCID: PMC10602211 DOI: 10.1016/j.jbi.2023.104388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Revised: 04/30/2023] [Accepted: 05/07/2023] [Indexed: 05/15/2023]
Abstract
Influenza viruses pose great threats to public health and cause enormous economic losses every year. Previous work has revealed the viral factors associated with the virulence of influenza viruses in mammals. However, taking prior viral knowledge represented by heterogeneous categorical and discrete information into account to explore virus virulence is scarce in the existing work. How to make full use of the preceding domain knowledge in virulence study is challenging but beneficial. This paper proposes a general framework named ViPal for virulence prediction in mice that incorporates discrete prior viral mutation and reassortment information based on all eight influenza segments. The posterior regularization technique is leveraged to transform prior viral knowledge into constraint features and integrated into the machine learning models. Experimental results on influenza genomic datasets validate that our proposed framework can improve virulence prediction performance over baselines. The comparison between ViPal and other existing methods shows the computational efficiency of our framework with comparable or superior performance. Moreover, the interpretable analysis through SHAP (SHapley Additive exPlanations) identifies the scores of constraint features contributing to the prediction. We hope this framework could provide assistance for the accurate detection of influenza virulence and facilitate flu surveillance.
Affiliation(s)
- Rui Yin
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, USA; School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore
- Zihan Luo
  - School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
- Pei Zhuang
  - Brigham and Women's Hospital, Harvard Medical School, Boston, USA
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Zhuoyi Lin
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore

2
Rashid S, Ng TA, Kwoh CK. Jupytope: computational extraction of structural properties of viral epitopes. Brief Bioinform 2022; 23:6696137. [PMID: 36094101 DOI: 10.1093/bib/bbac362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 07/29/2022] [Accepted: 08/02/2022] [Indexed: 12/14/2022] Open
Abstract
Epitope residues located on viral surface proteins are of immense interest in immunology and related applications such as vaccine development, disease diagnosis and drug design. Most tools rely on sequence-based statistical comparisons, such as information entropy of residue positions in aligned columns to infer location and properties of epitope sites. To facilitate cross-structural comparisons of epitopes on viral surface proteins, a python-based extraction tool implemented with Jupyter notebook is presented (Jupytope). Given a viral antigen structure of interest, a list of known epitope sites and a reference structure, the corresponding epitope structural properties can quickly be obtained. The tool integrates biopython modules for commonly used software such as NACCESS, DSSP as well as residue depth and outputs a list of structure-derived properties such as dihedral angles, solvent accessibility, residue depth and secondary structure that can be saved in several convenient data formats. To ensure correct spatial alignment, Jupytope takes a list of given epitope sites and their corresponding reference structure and aligns them before extracting the desired properties. Examples are demonstrated for epitopes of Influenza and severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) viral strains. The extracted properties assist detection of two Influenza subtypes and show potential in distinguishing between four major clades of SARS-CoV2, as compared with randomized labels. The tool will facilitate analytical and predictive works on viral epitopes through the extracted structural information. Jupytope and extracted datasets are available at https://github.com/shamimarashid/Jupytope.
Affiliation(s)
- Shamima Rashid
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore
- Teng Ann Ng
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore

3
Agapito G, Milano M, Cannataro M. A Python Clustering Analysis Protocol of Genes Expression Data Sets. Genes (Basel) 2022; 13:genes13101839. [PMID: 36292724 PMCID: PMC9601308 DOI: 10.3390/genes13101839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 10/05/2022] [Accepted: 10/08/2022] [Indexed: 11/16/2022] Open
Abstract
Gene expression and SNPs data hold great potential for a new understanding of disease prognosis, drug sensitivity, and toxicity evaluations. Cluster analysis is used to analyze data that do not contain any specific subgroups. The goal is to use the data itself to recognize meaningful and informative subgroups. In addition, cluster investigation helps data reduction purposes, exposes hidden patterns, and generates hypotheses regarding the relationship between genes and phenotypes. Cluster analysis could also be used to identify bio-markers and yield computational predictive models. The methods used to analyze microarrays data can profoundly influence the interpretation of the results. Therefore, a basic understanding of these computational tools is necessary for optimal experimental design and meaningful data analysis. This manuscript provides an analysis protocol to effectively analyze gene expression data sets through the K-means and DBSCAN algorithms. The general protocol enables analyzing omics data to identify subsets of features with low redundancy and high robustness, speeding up the identification of new bio-markers through pathway enrichment analysis. In addition, to demonstrate the effectiveness of our clustering analysis protocol, we analyze a real data set from the GEO database. Finally, the manuscript provides some best practice and tips to overcome some issues in the analysis of omics data sets through unsupervised learning.
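As a rough illustration of the K-means step named in this abstract, the following is a minimal sketch of Lloyd's algorithm in plain Python. It is not the authors' protocol: the function name `kmeans`, the fixed seed, and the toy expression profiles are all assumptions made for this example, and DBSCAN and pathway enrichment are not covered.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Generic Lloyd's algorithm on a list of equal-length numeric tuples.

    Hypothetical helper for illustration; not taken from the cited protocol.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Toy data (assumed): six genes measured in two samples, forming two groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
          (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On real expression matrices, the choice of `k`, the scaling of the features, and the initialisation all affect the result, which is part of why the abstract stresses understanding the tools before interpreting clusters.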
Affiliation(s)
- Giuseppe Agapito
  - Department of Law, Economics and Social Sciences, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
- Marianna Milano
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Department of Medical and Clinical Surgery, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
- Mario Cannataro
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Department of Medical and Clinical Surgery, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy

4
Zhang Y, Eskridge KM, Zhang S, Lu G. Identifying host-specific amino acid signatures for influenza A viruses using an adjusted entropy measure. BMC Bioinformatics 2022; 23:333. [PMID: 35962315 PMCID: PMC9372975 DOI: 10.1186/s12859-022-04885-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Accepted: 08/02/2022] [Indexed: 11/29/2022] Open
Abstract
Background: Influenza A viruses (IAV) exhibit vast genetic mutability and have great zoonotic potential to infect avian and mammalian hosts and are known to be responsible for a number of pandemics. A key computational issue in influenza prevention and control is the identification of molecular signatures with cross-species transmission potential. We propose an adjusted entropy-based host-specific signature identification method that uses a similarity coefficient to incorporate the amino acid substitution information and improve the identification performance. Mutations in the polymerase genes (e.g., PB2) are known to play a major role in avian influenza virus adaptation to mammalian hosts. We thus focus on the analysis of PB2 protein sequences and identify host specific PB2 amino acid signatures. Results: Validation with a set of H5N1 PB2 sequences from 1996 to 2006 results in adjusted entropy having a 40% false negative discovery rate compared to a 60% false negative rate using unadjusted entropy. Simulations across different levels of sequence divergence show a false negative rate of no higher than 10% while unadjusted entropy ranged from 9 to 100%. In addition, under all levels of divergence adjusted entropy never had a false positive rate higher than 9%. Adjusted entropy also identifies important mutations in H1N1pdm PB2 previously identified in the literature that explain changes in divergence between 2008 and 2009 which unadjusted entropy could not identify. Conclusions: Based on these results, adjusted entropy provides a reliable and widely applicable host signature identification approach useful for IAV monitoring and vaccine development.
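To make the unadjusted entropy baseline mentioned in this abstract concrete, here is a minimal sketch of per-position Shannon entropy over aligned sequences. The similarity-coefficient adjustment is not specified in the abstract and is deliberately not reproduced; the function names and the three-residue toy alignments are assumptions for illustration only.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # p * log2(1/p) summed over observed residues; zero for a conserved column.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def per_position_entropy(aligned_seqs):
    """Entropy of every column of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]

# Toy alignments (assumed data, not from the paper).
avian = ["MEK", "MEK", "MEK"]
human = ["MKK", "MKK", "MKK"]

within_avian = per_position_entropy(avian)   # every column conserved
within_human = per_position_entropy(human)   # every column conserved
pooled = per_position_entropy(avian + human) # second column splits by host
print(within_avian, within_human, pooled)
```

In this toy case the second column is conserved within each host group but split evenly once the groups are pooled, which is one simple way a host-associated position can show up in plain entropy.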
Affiliation(s)
- Yixiang Zhang
  - Department of Statistics, University of Nebraska - Lincoln, Lincoln, NE, USA
- Kent M Eskridge
  - Department of Statistics, University of Nebraska - Lincoln, Lincoln, NE, USA
- Shunpu Zhang
  - Department of Statistics, University of Central Florida, Orlando, USA
- Guoqing Lu
  - Department of Biology, University of Nebraska - Omaha, Omaha, NE, USA

5
Yin R, Zhu X, Zeng M, Wu P, Li M, Kwoh CK. A framework for predicting variable-length epitopes of human-adapted viruses using machine learning methods. Brief Bioinform 2022; 23:6645487. [PMID: 35849093 DOI: 10.1093/bib/bbac281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Revised: 06/16/2022] [Accepted: 06/17/2022] [Indexed: 11/14/2022] Open
Abstract
The coronavirus disease 2019 pandemic has alerted people of the threat caused by viruses. Vaccine is the most effective way to prevent the disease from spreading. The interaction between antibodies and antigens will clear the infectious organisms from the host. Identifying B-cell epitopes is critical in vaccine design, development of disease diagnostics and antibody production. However, traditional experimental methods to determine epitopes are time-consuming and expensive, and the predictive performance using the existing in silico methods is not satisfactory. This paper develops a general framework to predict variable-length linear B-cell epitopes specific for human-adapted viruses with machine learning approaches based on Protvec representation of peptides and physicochemical properties of amino acids. QR decomposition is incorporated during the embedding process that enables our models to handle variable-length sequences. Experimental results on large immune epitope datasets validate that our proposed model's performance is superior to the state-of-the-art methods in terms of AUROC (0.827) and AUPR (0.831) on the testing set. Moreover, sequence analysis also provides the results of the viral category for the corresponding predicted epitopes with high precision. Therefore, this framework is shown to reliably identify linear B-cell epitopes of human-adapted viruses given protein sequences and could provide assistance for potential future pandemics and epidemics.
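The AUROC figure quoted in this abstract can be read as the probability that a randomly chosen true epitope receives a higher score than a randomly chosen non-epitope. The sketch below computes it that way by direct pairwise comparison; the function name `auroc` and the four example scores are assumptions for illustration and have no connection to the paper's data or models.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 ground-truth classes.
    scores: iterable of predicted scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (assumed): two negatives followed by two positives.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example)  # 3 of the 4 positive/negative pairs are ranked correctly
```

The pairwise form is quadratic in the number of examples, so a rank-based formulation is preferable on large test sets; it is used here only because it mirrors the definition most directly.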
Affiliation(s)
- Rui Yin
  - Department of Biomedical Informatics, Harvard Medical School, Boston, USA
- Xianghe Zhu
  - Department of Statistics, University of Oxford, Oxford, UK
- Min Zeng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Pengfei Wu
  - Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore

6
Borkenhagen LK, Allen MW, Runstadler JA. Influenza virus genotype to phenotype predictions through machine learning: a systematic review. Emerg Microbes Infect 2021; 10:1896-1907. [PMID: 34498543 PMCID: PMC8462836 DOI: 10.1080/22221751.2021.1978824] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
Background: There is great interest in understanding the viral genomic predictors of phenotypic traits that allow influenza A viruses to adapt to or become more virulent in different hosts. Machine learning techniques have demonstrated promise in addressing this critical need for other pathogens because the underlying algorithms are especially well equipped to uncover complex patterns in large datasets and produce generalizable predictions for new data. As the body of research where these techniques are applied for influenza A virus phenotype prediction continues to grow, it is useful to consider the strengths and weaknesses of these approaches to understand what has prevented these models from seeing widespread use by surveillance laboratories and to identify gaps that are underexplored with this technology. Methods and Results: We present a systematic review of English literature published through 15 April 2021 of studies employing machine learning methods to generate predictions of influenza A virus phenotypes from genomic or proteomic input. Forty-nine studies were included in this review, spanning the topics of host discrimination, human adaptability, subtype and clade assignment, pandemic lineage assignment, characteristics of infection, and antiviral drug resistance. Conclusions: Our findings suggest that biases in model design and a dearth of wet laboratory follow-up may explain why these models often go underused. We, therefore, offer guidance to overcome these limitations, aid in improving predictive models of previously studied influenza A virus phenotypes, and extend those models to unexplored phenotypes in the ultimate pursuit of tools to enable the characterization of virus isolates across surveillance laboratories.
Affiliation(s)
- Laura K Borkenhagen
  - Department of Infectious Disease and Global Health, Cummings School of Veterinary Medicine, Tufts University, North Grafton, MA, USA
- Martin W Allen
  - Department of Computer Science, School of Engineering, Tufts University, Medford, MA, USA
- Jonathan A Runstadler
  - Department of Infectious Disease and Global Health, Cummings School of Veterinary Medicine, Tufts University, North Grafton, MA, USA

7
Tarasova O, Poroikov V. Machine Learning in Discovery of New Antivirals and Optimization of Viral Infections Therapy. Curr Med Chem 2021; 28:7840-7861. [PMID: 33949929 DOI: 10.2174/0929867328666210504114351] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/10/2020] [Revised: 02/13/2021] [Accepted: 02/24/2021] [Indexed: 11/22/2022]
Abstract
Nowadays, computational approaches play an important role in the design of new drug-like compounds and optimization of pharmacotherapeutic treatment of diseases. The emerging growth of viral infections, including those caused by the Human Immunodeficiency Virus (HIV), Ebola virus, recently detected coronavirus, and some others, leads to many newly infected people with a high risk of death or severe complications. A huge amount of chemical, biological, clinical data is at the disposal of the researchers. Therefore, there are many opportunities to find the relationships between the particular features of chemical data and the antiviral activity of biologically active compounds based on machine learning approaches. Biological and clinical data can also be used for building models to predict relationships between viral genotype and drug resistance, which might help determine the clinical outcome of treatment. In the current study, we consider machine-learning approaches in the antiviral research carried out during the past decade. We overview in detail the application of machine-learning methods for the design of new potential antiviral agents and vaccines, drug resistance prediction, and analysis of virus-host interactions. Our review also covers the perspectives of using the machine-learning approaches for antiviral research, including Dengue, Ebola viruses, Influenza A, Human Immunodeficiency Virus, coronaviruses, and some others.
Affiliation(s)
- Olga Tarasova
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russian Federation
- Vladimir Poroikov
  - Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russian Federation

8
Yin R, Zhou X, Rashid S, Kwoh CK. HopPER: an adaptive model for probability estimation of influenza reassortment through host prediction. BMC Med Genomics 2020; 13:9. [PMID: 31973709 PMCID: PMC6979075 DOI: 10.1186/s12920-019-0656-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/01/2019] [Accepted: 12/26/2019] [Indexed: 12/29/2022] Open
Abstract
Background: Influenza reassortment, a mechanism where influenza viruses exchange their RNA segments by co-infecting a single cell, has been implicated in several major pandemics since 19th century. Owing to the significant impact on public health and social stability, great attention has been received on the identification of influenza reassortment. Methods: We proposed a novel computational method named HopPER (Host-prediction-based Probability Estimation of Reassortment), that sturdily estimates reassortment probabilities through host tropism prediction using 147 new features generated from seven physicochemical properties of amino acids. We conducted the experiments on a range of real and synthetic datasets and compared HopPER with several state-of-the-art methods. Results: It is shown that 280 out of 318 candidate reassortants have been successfully identified. Additionally, not only can HopPER be applied to complete genomes but its effectiveness on incomplete genomes is also demonstrated. The analysis of evolutionary success of avian, human and swine viruses generated through reassortment across different years using HopPER further revealed the reassortment history of the influenza viruses. Conclusions: Our study presents a novel method for the prediction of influenza reassortment. We hope this method could facilitate rapid reassortment detection and provide novel insights into the evolutionary patterns of influenza viruses.
Affiliation(s)
- Rui Yin
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
- Xinrui Zhou
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
- Shamima Rashid
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore

9
Li J, Nakai K, Zheng Y, Sato K, Wong L. Introduction to Selected Papers from GIW2018. J Bioinform Comput Biol 2019; 16:1802005. [PMID: 30616475 DOI: 10.1142/s0219720018020055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Affiliation(s)
- Jinyan Li
  - 1 University of Technology Sydney, Australia
- Yun Zheng
  - 3 Kunming University of Science and Technology, China