1
Li Z, Zhang Y, Bai Y, Xie X, Zeng L. IMC-MDA: Prediction of miRNA-disease association based on induction matrix completion. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:10659-10674. [PMID: 37322953 DOI: 10.3934/mbe.2023471] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/17/2023]
Abstract
To comprehend the etiology and pathogenesis of many illnesses, it is essential to identify disease-associated microRNAs (miRNAs). However, current computational approaches face a number of challenges, such as the lack of "negative samples", that is, confirmed irrelevant miRNA-disease pairs, and poor performance in predicting miRNAs related to "isolated diseases", i.e. illnesses with no known associated miRNAs, which creates a need for novel computational methods. In this study, an inductive matrix completion model, referred to as IMC-MDA, was designed to predict associations between diseases and miRNAs. In IMC-MDA, the prediction score for each miRNA-disease pair is calculated by combining the known miRNA-disease associations with the integrated disease similarities and miRNA similarities. Under leave-one-out cross-validation (LOOCV), IMC-MDA achieved an AUC of 0.8034, which shows better performance than previous methods. Furthermore, experiments have validated the prediction of disease-related miRNAs for three major human diseases: colon cancer, kidney cancer, and lung cancer.
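The abstract does not give the IMC-MDA objective or its training procedure. As a rough illustration only, the general inductive matrix completion idea of scoring a pair from side features can be sketched as a bilinear form. The function name `imc_score`, the factor matrices `W` and `H`, and all numbers below are hypothetical, and the factors are assumed to have been learned elsewhere.

```python
def imc_score(x, W, H, y):
    """Bilinear score x^T W H^T y for one (miRNA, disease) pair.

    x: miRNA feature vector (for example, a row of a similarity matrix)
    y: disease feature vector
    W, H: hypothetical low-rank factors, shaped len(x) x k and len(y) x k
    """
    k = len(W[0])
    # Project both feature vectors into the shared k-dimensional latent space.
    u = [sum(x[a] * W[a][c] for a in range(len(x))) for c in range(k)]
    v = [sum(y[b] * H[b][c] for b in range(len(y))) for c in range(k)]
    # The score is the inner product of the two projections.
    return sum(uc * vc for uc, vc in zip(u, v))


# Toy values, purely illustrative.
x = [1.0, 0.5]               # miRNA features
y = [0.2, 1.0, 0.0]          # disease features
W = [[1.0], [2.0]]           # 2 x 1 factor
H = [[0.5], [1.0], [3.0]]    # 3 x 1 factor
score = imc_score(x, W, H, y)
```

Because the score depends only on feature vectors, a disease with no known associations can still be scored as long as its similarity features are available, which is the property that makes inductive formulations attractive for "isolated diseases".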
Affiliation(s)
- Zejun Li
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Yuxiang Zhang
  - School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, Henan, 450001, China
- Yuting Bai
  - College of Information Science and Engineering, Hunan University, Changsha 410082, Hunan, China
- Xiaohui Xie
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Lijun Zeng
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
2
Zhou J, Li X, Ma Y, Wu Z, Xie Z, Zhang Y, Wei Y. Optimal modeling of anti-breast cancer candidate drugs screening based on multi-model ensemble learning with imbalanced data. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:5117-5134. [PMID: 36896538 DOI: 10.3934/mbe.2023237] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/18/2023]
Abstract
Imbalanced data can seriously bias machine learning models, which leads to false positives when screening therapeutic drugs for breast cancer. To deal with this problem, a multi-model ensemble framework based on a tree model, a linear model, and a deep-learning model is proposed. Using this methodology, we screened the 20 most critical molecular descriptors from 729 molecular descriptors of 1974 anti-breast cancer drug candidates. To measure the pharmacokinetic properties and safety of the drug candidates, the screened molecular descriptors were then used for subsequent bioactivity, absorption, distribution, metabolism, excretion, toxicity, and other prediction tasks. The results show that the method constructed in this study is superior to and more stable than the individual models used in the ensemble approach.
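How the three model families are combined is not stated in the abstract. One common way to combine heterogeneous classifiers is soft voting, which averages their predicted probabilities. The sketch below assumes that scheme; `soft_vote`, `classify`, the weights, and the probabilities are all hypothetical.

```python
def soft_vote(model_probs, weights=None):
    """Weighted average of per-sample positive-class probabilities."""
    n_models = len(model_probs)
    if weights is None:
        # Assumption: equal weight for every model.
        weights = [1.0 / n_models] * n_models
    n_samples = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs))
        for i in range(n_samples)
    ]


def classify(probs, threshold=0.5):
    """Turn averaged probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


# Made-up outputs of a tree model, a linear model, and a deep model.
tree_probs = [0.9, 0.2, 0.6]
linear_probs = [0.7, 0.4, 0.3]
deep_probs = [0.8, 0.6, 0.3]
ensemble = soft_vote([tree_probs, linear_probs, deep_probs])
labels = classify(ensemble)
```

The decision threshold is left as a parameter because, with imbalanced classes, it is often tuned on validation data rather than fixed at 0.5.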
Affiliation(s)
- Juan Zhou
  - School of Software, East China Jiaotong University, Nanchang 330013, China
- Xiong Li
  - School of Software, East China Jiaotong University, Nanchang 330013, China
- Yuanting Ma
  - School of Economics and Management, East China Jiaotong University, Nanchang 330013, China
- Zejiu Wu
  - School of Science, East China Jiaotong University, Nanchang 330013, China
- Ziruo Xie
  - School of Software, East China Jiaotong University, Nanchang 330013, China
- Yuqi Zhang
  - School of Foreign Languages, East China Jiaotong University, Nanchang 330013, China
- Yiming Wei
  - School of Software, East China Jiaotong University, Nanchang 330013, China
3
Forghani M, Firstkov AL, Alyannezhadi MM, Danilenko DM, Komissarov AB. Reduced amino acid alphabet-based encoding and its impact on modeling influenza antigenic evolution. RUSSIAN JOURNAL OF INFECTION AND IMMUNITY 2022. [DOI: 10.15789/2220-7619-raa-1968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Currently, vaccination is one of the most efficient ways to control and prevent influenza infection. Vaccine production largely relies on the results of laboratory assays, including hemagglutination inhibition and microneutralization assays, which are time-consuming and laborious. Viruses can escape the immune response, which results in the need to revise and update vaccines biannually. The hemagglutination inhibition assay can measure how effectively antibodies against a reference strain bind and block an antigen of the test strain. Various computer-aided models have been developed to optimize candidate vaccine strain selection. A general problem in modeling antigenic evolution is the representation of genetic sequences for input into the research model. Our motivation stems from the well-known problem of encoding genetic information for modeling antigenic evolution. This paper introduces a two-fold encoding approach based on a reduced amino acid alphabet and the amino acid index database called AAindex. We propose to apply a simplified amino acid alphabet in modeling antigenic evolution. In a simplified alphabet, also called a sub-alphabet or reduced amino acid alphabet, the 20 amino acids are clustered and divided into amino acid groups. The proposed encoding allows mutations to be redefined in terms of the amino acid groups of a reduced alphabet. We investigated 40 reduced amino acid sets and their performance in modeling antigenic evolution. The experimental results indicate that the proposed reduced amino acid alphabets can match the accuracy of the standard alphabet. Moreover, these alphabets provide deeper insight into various aspects of the relationship between mutation and antigenic variation. By checking identified high-impact sites in the Influenza Research Database, we found that not only antigenic sites have a significant influence on antigenicity, but also other amino acids located in close proximity.
The results indicate that all selected non-antigenic sites are related to immune responses. According to the Influenza Research Database, these have been experimentally determined to be T-cell epitopes, B-cell epitopes, and MHC-binding epitopes of different classes. This highlighted a caveat: while simulating antigenic evolution, the model should consider not only the genetic information on antigenic sites, but also that of neighboring positions, as they may indirectly impact antigenicity. Additionally, our findings indicate that structural and charge characteristics are the most beneficial in modeling antigenic evolution, which is in agreement with previous studies.
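The 40 reduced alphabets studied in the paper are not listed in the abstract. To make the idea concrete, the sketch below uses one hypothetical six-group clustering by coarse physicochemical class (not taken from the paper) and shows how a substitution is counted as a mutation only when it crosses a group boundary. The names `GROUPS`, `reduce_sequence`, and `group_mutations` are illustrative.

```python
# Hypothetical grouping of the 20 amino acids; an assumption, not the paper's.
GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar uncharged
    "+": "KRH",     # positively charged
    "-": "DE",      # negatively charged
    "s": "GP",      # special backbone conformations
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(AA_TO_GROUP[aa] for aa in seq)


def group_mutations(seq_a, seq_b):
    """Positions where two aligned sequences differ at the group level."""
    ra, rb = reduce_sequence(seq_a), reduce_sequence(seq_b)
    return [(i, ga, gb) for i, (ga, gb) in enumerate(zip(ra, rb)) if ga != gb]


# A->V and D->E stay within their groups, so no group-level mutation is seen.
silent = group_mutations("AKD", "VKE")
# A->F crosses from the hydrophobic group to the aromatic group.
visible = group_mutations("AKD", "FKD")
```

Under this toy grouping, `silent` is empty while `visible` reports position 0, which illustrates how a reduced alphabet collapses conservative substitutions.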
4
Yin R, Thwin NN, Zhuang P, Lin Z, Kwoh CK. IAV-CNN: A 2D Convolutional Neural Network Model to Predict Antigenic Variants of Influenza A Virus. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3497-3506. [PMID: 34469306 DOI: 10.1109/tcbb.2021.3108971] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/13/2023]
Abstract
The rapid evolution of influenza viruses constantly leads to the emergence of novel influenza strains that are capable of escaping from population immunity. The timely determination of antigenic variants is critical to vaccine design. Empirical experimental methods like hemagglutination inhibition (HI) assays are time-consuming and labor-intensive, requiring live viruses. Recently, many computational models have been developed to predict antigenic variants, but they do not explicitly model the interdependencies between the channels of feature maps. Moreover, influenza sequences with a similar distribution of residues have high degrees of similarity, which affects the prediction outcome. Consequently, it is challenging but vital to determine the importance of different residue sites and enhance the predictive performance for influenza antigenicity. We have proposed a 2D convolutional neural network (CNN) model to infer influenza antigenic variants (IAV-CNN). Specifically, we apply a new distributed representation of amino acids, named ProtVec, that can be applied to a variety of downstream proteomic machine learning tasks. After splitting and embedding the influenza strains, a 2D squeeze-and-excitation CNN architecture is constructed that enables networks to focus on informative residue features by fusing both spatial and channel-wise information within local receptive fields at each layer. Experimental results on three influenza datasets show that IAV-CNN achieves state-of-the-art performance by combining the new distributed representation with our proposed architecture. It outperforms both traditional machine learning algorithms with the same feature representations and the majority of existing models on the independent test data. Therefore, we believe that our model can serve as a reliable and robust tool for the prediction of antigenic variants.
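The channel reweighting performed by a squeeze-and-excitation block can be illustrated with a small dependency-free sketch. This is the generic block only (global average pooling, a bottleneck layer with ReLU, a sigmoid gate, then rescaling), with biases omitted for brevity. It is not the IAV-CNN implementation, and `se_block`, `w1`, and `w2` are hypothetical names with toy weights.

```python
import math


def se_block(feature_maps, w1, w2):
    """Generic squeeze-and-excitation over a list of 2D channels.

    feature_maps: C channels, each a list of rows
    w1: bottleneck weights, shape hidden x C
    w2: expansion weights, shape C x hidden
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates per channel.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]
    return scaled, gates


# Two channels of size 1 x 2, with toy weights chosen so the gates are 0.5.
maps = [[[1.0, 3.0]], [[0.0, 0.0]]]
scaled, gates = se_block(maps, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
```

With learned weights the gates differ per channel, so informative channels are amplified relative to the others.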
5
Makau DN, Lycett S, Michalska-Smith M, Paploski IAD, Cheeran MCJ, Craft ME, Kao RR, Schroeder DC, Doeschl-Wilson A, VanderWaal K. Ecological and evolutionary dynamics of multi-strain RNA viruses. Nat Ecol Evol 2022; 6:1414-1422. [PMID: 36138206 DOI: 10.1038/s41559-022-01860-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Accepted: 07/28/2022] [Indexed: 11/09/2022]
Abstract
Potential interactions among co-circulating viral strains in host populations are often overlooked in the study of virus transmission. However, these interactions probably shape transmission dynamics by influencing host immune responses or altering the relative fitness among co-circulating strains. In this Review, we describe multi-strain dynamics from ecological and evolutionary perspectives, outline scales in which multi-strain dynamics occur and summarize important immunological, phylogenetic and mathematical modelling approaches used to quantify interactions among strains. We also discuss how host-pathogen interactions influence the co-circulation of pathogens. Finally, we highlight outstanding questions and knowledge gaps in the current theory and study of ecological and evolutionary dynamics of multi-strain viruses.
Affiliation(s)
- Dennis N Makau
  - Department of Veterinary Population Medicine, University of Minnesota, St. Paul, MN, USA
- Igor A D Paploski
  - Department of Veterinary Population Medicine, University of Minnesota, St. Paul, MN, USA
- Maxim C-J Cheeran
  - Department of Veterinary Population Medicine, University of Minnesota, St. Paul, MN, USA
- Meggan E Craft
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, MN, USA
- Rowland R Kao
  - Roslin Institute, University of Edinburgh, Edinburgh, UK
- Declan C Schroeder
  - Department of Veterinary Population Medicine, University of Minnesota, St. Paul, MN, USA
  - School of Biological Sciences, University of Reading, Reading, UK
- Kimberly VanderWaal
  - Department of Veterinary Population Medicine, University of Minnesota, St. Paul, MN, USA
6
Yang C, Yin J, Liu J, Liu J, Chen Q, Yang H, Ni Y, Li B, Li Y, Lin J, Zhou Z, Li Z. The roles of primary care doctors in the COVID-19 pandemic: consistency and influencing factors of doctor's perception and actions and nominal definitions. BMC Health Serv Res 2022; 22:1143. [PMID: 36085066 PMCID: PMC9462892 DOI: 10.1186/s12913-022-08487-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Accepted: 08/23/2022] [Indexed: 11/24/2022] Open
Abstract
Background: At the end of 2019, the Coronavirus Disease 2019 (COVID-19) pandemic broke out. As front-line health professionals, primary care doctors play a significant role in screening for SARS-CoV-2 infection and transferring suspected cases. However, the performance of primary care doctors is influenced by their knowledge and role perception. A web-based cross-sectional survey was conducted to assess the consistency between primary care doctors' role perception and the expert advice in the guidelines (regulatory definition), and the factors influencing it. Methods: We designed the questionnaire using the “Wenjuanxing” platform, distributed and collected it through the WeChat social platform, and surveyed 1758 primary care doctors from 11 community health service stations, community health service centers and primary hospitals in Zhejiang Province, China. After the questionnaires were collected, descriptive statistics were computed for the characteristics of participants, and univariate and multivariate analyses were used to determine the factors affecting their role perception. Results: For reporting and referring suspected cases and for patients receiving treatment, most participants' perception of their roles was consistent with the requirements of the guidelines. However, 49.54% and 61.43% of participating doctors were not in line with the government guidelines for diagnosing and classifying COVID-19 and for treating suspected cases, respectively. Having a middle or senior professional title and participating in front-line COVID-19 prevention and control work were associated with more accurate role perception regarding the diagnosis and classification of COVID-19, the reporting and transfer of suspected cases, and the treatment of suspected cases. Conclusions: Primary care doctors' role perceptions in the COVID-19 pandemic are not always consistent with government guidelines in some aspects, such as transferring and diagnosing suspected cases.
Therefore, it is essential to guide primary care doctors in performing their duties, especially those with lower professional titles. Supplementary Information: The online version contains supplementary material available at 10.1186/s12913-022-08487-0.
7
Informative SNP Selection Based on a Fuzzy Clustering and Improved Binary Particle Swarm Optimization Algorithm. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:3837579. [PMID: 35756402 PMCID: PMC9225903 DOI: 10.1155/2022/3837579] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/17/2022] [Revised: 04/14/2022] [Accepted: 04/30/2022] [Indexed: 12/04/2022]
Abstract
Single-nucleotide polymorphism (SNP) involves the replacement of a single nucleotide in a deoxyribonucleic acid (DNA) sequence and is often linked to the development of specific diseases. Although current genotyping methods can tag SNP loci within biological samples to provide accurate genetic information for an associated disease, they have limited prediction accuracy. Furthermore, they are complex to perform and may result in the prediction of an excessive number of tag SNP loci, which may not always be associated with the disease. Therefore, in this manuscript, we aimed to evaluate the impact of a newly optimized fuzzy clustering and binary particle swarm optimization algorithm (FCBPSO) on the accuracy and running time of informative SNP selection. Fuzzy clustering and FCBPSO were first applied to identify the equivalence relation and the candidate tag SNP set to reduce the redundancy between loci. The FCBPSO algorithm was then optimized and used to obtain the final tag SNP set. The prediction performance and running time of the newly developed model were compared with other traditional methods, including NMC, SPSO, and MCMR. The prediction accuracy of the FCBPSO algorithm was generally higher than that of the other algorithms, especially as the number of tag SNPs increased. When the number of tag SNPs was low, however, the prediction accuracy of FCBPSO was slightly lower than that of MCMR. The running time of the FCBPSO algorithm was always lower than that of MCMR. FCBPSO not only reduced the size and dimension of the optimization problem but also simplified the training of the prediction model. This improved the prediction accuracy of the model and reduced the running time when compared with other traditional methods.
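The abstract does not spell out the update rule of the improved FCBPSO. As a rough point of reference, one step of the standard binary particle swarm scheme (velocity update followed by a sigmoid transfer function that gives the probability of each bit being 1) can be sketched as below. Each bit is read here as "this SNP is selected as a tag SNP". The function name `bpso_step` and the parameter values are assumptions, not the paper's settings.

```python
import math
import random


def bpso_step(position, velocity, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    """One standard binary PSO update for a single particle."""
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        # Pull the velocity toward the personal best and the global best.
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        # Clamp so the sigmoid below never saturates completely.
        v = max(-vmax, min(vmax, v))
        # The sigmoid maps velocity to the probability that the bit is 1.
        prob_one = 1.0 / (1.0 + math.exp(-v))
        new_pos.append(1 if rng.random() < prob_one else 0)
        new_vel.append(v)
    return new_pos, new_vel


rng = random.Random(0)  # seeded only to make the example repeatable
pos, vel = bpso_step([0, 1, 0], [0.0, 0.0, 0.0], [1, 1, 0], [1, 0, 0], rng)
```

In a full optimizer this step would run inside a loop that evaluates a fitness function (for tag SNP selection, typically prediction accuracy on the untagged loci) and updates `pbest` and `gbest` after each iteration.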
8
Li G, Wang D, Zhang Y, Liang C, Xiao Q, Luo J. Using Graph Attention Network and Graph Convolutional Network to Explore Human CircRNA-Disease Associations Based on Multi-Source Data. Front Genet 2022; 13:829937. [PMID: 35198012 PMCID: PMC8859418 DOI: 10.3389/fgene.2022.829937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2021] [Accepted: 01/10/2022] [Indexed: 11/13/2022] Open
Abstract
A growing body of research has verified that multiple circRNAs are closely associated with pathogenic mechanisms at the cellular level. Exploring human circRNA-disease relationships is important for deciphering pathogenic mechanisms and informing treatment plans. At present, several computational models have been designed to infer potential relationships between diseases and circRNAs. However, the majority of existing approaches cannot effectively utilize multisource data and perform poorly on sparse networks. In this study, we develop an advanced method, GATGCN, using a graph attention network (GAT) and a graph convolutional network (GCN) to detect potential circRNA-disease relationships. First, several sources of biomedical information are fused via the centered kernel alignment model (CKA), which calculates the corresponding weight of different kernels. Second, we adopt the graph attention network to learn latent representations of diseases and circRNAs. Third, the graph convolutional network is deployed to effectively extract features of associations by aggregating feature vectors of neighbors. GATGCN achieves an AUC of 0.951 under leave-one-out cross-validation and an AUC of 0.932 under 5-fold cross-validation. Furthermore, case studies on lung cancer, diabetic retinopathy, and prostate cancer verify the reliability of GATGCN for detecting latent circRNA-disease pairs.
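The neighbor aggregation step can be made concrete with the widely used graph convolution rule, which averages each node's features with those of its neighbors through a symmetrically normalized adjacency matrix with self-loops, then applies a linear map and ReLU. The sketch below shows that generic layer only. Whether GATGCN uses exactly this normalization is not stated in the abstract, and `gcn_layer` and the toy inputs are hypothetical.

```python
import math


def gcn_layer(adj, feats, weight):
    """One generic GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weight), len(weight[0])
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)] for i in range(n)]
    return [
        [max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_in))) for o in range(n_out)]
        for i in range(n)
    ]


# Two connected nodes with one feature each and an identity weight.
out = gcn_layer(adj=[[0, 1], [1, 0]], feats=[[1.0], [3.0]], weight=[[1.0]])
```

In this toy graph both nodes end up with the average of the two input features, which is the smoothing effect that lets sparsely annotated nodes borrow information from their neighbors.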
Affiliation(s)
- Guanghui Li
  - School of Information Engineering, East China Jiaotong University, Nanchang, China
- Diancheng Wang
  - School of Information Engineering, East China Jiaotong University, Nanchang, China
- Yuejin Zhang
  - School of Information Engineering, East China Jiaotong University, Nanchang, China
- Cheng Liang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Qiu Xiao
  - College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
9
Forghani M, Khachay M. Convolutional Neural Network Based Approach to in Silico Non-Anticipating Prediction of Antigenic Distance for Influenza Virus. Viruses 2020; 12:E1019. [PMID: 32932748 PMCID: PMC7551508 DOI: 10.3390/v12091019] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/28/2020] [Revised: 09/06/2020] [Accepted: 09/08/2020] [Indexed: 12/18/2022] Open
Abstract
Evaluating the degree of antigenic similarity between strains of the influenza virus is highly important for vaccine production. The conventional way to measure this degree is to perform hemagglutination inhibition (HI) immunological assays: the antigenic distance between two strains is calculated on the basis of HI assay results. Usually, such distances are visualized using some kind of antigenic cartography method. The known drawback of the HI assay is that it is rather time-consuming and expensive. In this paper, we propose a novel approach for antigenic distance approximation based on deep learning in the feature spaces induced by hemagglutinin protein sequences and Convolutional Neural Networks (CNNs). To apply a CNN to compare the protein sequences, we utilize an encoding based on the physical and chemical characteristics of amino acids. By varying (hyper)parameters of the CNN architecture design, we find the most robust network. Further, we provide insight into the relationship between approximated antigenic distance and antigenicity by evaluating the network on the HI assay database for the H1N1 subtype. The results indicate that the best-trained network gives a high-precision approximation of the ground-truth antigenic distances, and can be used as a good exploratory tool in practical tasks.
10
Skarlupka AL, Handel A, Ross TM. Influenza hemagglutinin antigenic distance measures capture trends in HAI differences and infection outcomes, but are not suitable predictive tools. Vaccine 2020; 38:5822-5830. [PMID: 32682618 DOI: 10.1016/j.vaccine.2020.06.042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/30/2019] [Revised: 05/28/2020] [Accepted: 06/16/2020] [Indexed: 01/24/2023]
Abstract
Vaccination is the most effective method to combat influenza. Vaccine effectiveness is influenced by the antigenic distance between the vaccine strain and the actual circulating virus. Amino acid sequence based methods of quantifying the antigenic distance were designed to predict influenza vaccine effectiveness in humans. The use of these antigenic distance measures has been proposed as an additive method for seasonal vaccine selection. In this report, several antigenic distance measures were evaluated as predictors of hemagglutination inhibition titer differences and clinical outcomes following influenza vaccination or infection in mice or ferrets. The antigenic distance measures described the increasing trend in the change of HAI titer, lung viral titer and percent weight loss in mice and ferrets. However, the variability of outcome variables produced wide prediction intervals for any given antigenic distance value. The amino acid substitution based antigenic distance measures were no better predictors of viral load and weight loss than HAI titer differences, the current predictive measure of immunological correlate of protection for clinical signs after challenge.
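The specific sequence-based measures evaluated in the report are not named in the abstract. As a minimal illustration of the general idea, the simplest such measure is the fraction of aligned positions at which two hemagglutinin sequences differ, optionally restricted to a chosen set of sites. The function `p_distance` and the toy sequences are hypothetical and are not one of the paper's measures.

```python
def p_distance(seq_a, seq_b, sites=None):
    """Fraction of compared positions where two aligned sequences differ.

    sites: optional list of 0-based positions (for example, a set of
    epitope positions); when omitted, every position is compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    idx = list(range(len(seq_a))) if sites is None else list(sites)
    diffs = sum(1 for i in idx if seq_a[i] != seq_b[i])
    return diffs / len(idx)


# One substitution (D -> E) in a four-residue toy alignment.
whole = p_distance("KDGA", "KEGA")
# Restricting to positions 1 and 2 doubles the apparent distance.
subset = p_distance("KDGA", "KEGA", sites=[1, 2])
```

A single number of this kind summarizes a trend but carries no information about which substitution occurred, which is consistent with the wide prediction intervals the report describes.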
Affiliation(s)
- Amanda L Skarlupka
  - Center for Vaccines and Immunology, University of Georgia, Athens, GA, USA
- Andreas Handel
  - Department of Epidemiology and Biostatistics, University of Georgia, Athens, GA, USA
- Ted M Ross
  - Center for Vaccines and Immunology, University of Georgia, Athens, GA, USA; Department of Infectious Diseases, University of Georgia, Athens, GA, USA
11
Paton DJ, Reeve R, Capozzo AV, Ludi A. Estimating the protection afforded by foot-and-mouth disease vaccines in the laboratory. Vaccine 2019; 37:5515-5524. [PMID: 31405637 DOI: 10.1016/j.vaccine.2019.07.102] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 06/28/2019] [Revised: 07/28/2019] [Accepted: 07/31/2019] [Indexed: 10/26/2022]
Abstract
Foot-and-mouth disease (FMD) vaccines must be carefully selected and their application closely monitored to optimise their effectiveness. This review covers serological techniques for FMD vaccine quality control, including potency testing, vaccine matching and post-vaccination monitoring. It also discusses alternative laboratory procedures, such as antigen quantification and nucleotide sequencing, and briefly compares the approaches for FMD with those for measuring protection against influenza virus, where humoral immunity is also important. Serology is widely used to predict the protection afforded by vaccines and has great practical utility but also limitations. Animals differ in their responses to vaccines and in the protective mechanisms that they develop. Antibodies have a variety of properties and tests differ in what they measure. Antibody-virus interactions may vary between virus serotypes and strains and protection may be affected by the vaccination regime and the nature and timing of field virus challenge. Finally, tests employing biological reagents are difficult to standardise, whilst cross-protection data needed for test calibration and validation are scarce. All of this is difficult to reconcile with the desire for simple and universal criteria and thresholds for evaluating vaccines and vaccination responses and means that oversimplification of test procedures and their interpretation can lead to poor predictions. A holistic approach is therefore recommended, considering multiple sources of field, experimental and laboratory data. New antibody avidity and isotype tests seem promising alternatives to evaluate cross-protective, post-vaccination serological responses, taking account of vaccine potency as well as match. After choosing appropriate serological tests or test combinations and cut-offs, results should be interpreted cautiously and in context. 
Since opportunities for experimental challenge studies of cross-protection are limited and the approaches incompletely reflect real life, more field studies are needed to quantify cross-protection and its correlation to in vitro measurements.
Affiliation(s)
- D J Paton
  - The Pirbright Institute, Ash Road, Pirbright, Surrey GU24 0NF, UK
- R Reeve
  - Boyd Orr Centre for Population and Ecosystem Health, Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK
- A V Capozzo
  - Instituto de Virología, CICVyA, INTA, N Repetto y De Los Reseros s/n, Hurlingham (1686), Buenos Aires, Argentina; Consejo Nacional de Investigaciones Científicas y Tecnológicas, CONICET, Godoy Cruz 2290 (C1454FQB), Buenos Aires, Argentina
- A Ludi
  - The Pirbright Institute, Ash Road, Pirbright, Surrey GU24 0NF, UK
12
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by the feature representation algorithm based on sequence information (CCPA) with those extracted by the feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, with the strongest correlation to the class labels and low redundancy between features. Given that the random forest classification algorithm randomly selects both the samples and the features used to split each node, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
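The 10-fold cross-validation protocol mentioned above can be sketched without any machine learning library: the samples are partitioned into ten folds, and each fold serves once as the held-out test set while the rest are used for training. The helper `k_fold_indices` is a hypothetical name, and the split is deterministic (no shuffling or stratification), which is a simplification rather than the paper's setup.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Deal the sample indices round-robin into k folds of near-equal size.
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# Example: 25 samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
```

A classifier would be trained on each `train_idx` and scored on the matching `test_idx`, with the reported accuracy being the average over folds. Setting `k` equal to the number of samples turns the same helper into leave-one-out cross-validation.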
Affiliation(s)
- Xiaoqing Ru
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Lihong Li
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China