1
|
Patiyal S, Dhall A, Bajaj K, Sahu H, Raghava GPS. Prediction of RNA-interacting residues in a protein using CNN and evolutionary profile. Brief Bioinform 2023; 24:6901899. [PMID: 36516298 DOI: 10.1093/bib/bbac538] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2022] [Revised: 09/28/2022] [Accepted: 11/08/2022] [Indexed: 12/15/2022] Open
Abstract
This paper describes a method Pprint2, which is an improved version of Pprint developed for predicting RNA-interacting residues in a protein. Training and independent/validation datasets used in this study comprises of 545 and 161 non-redundant RNA-binding proteins, respectively. All models were trained on training dataset and evaluated on the validation dataset. The preliminary analysis reveals that positively charged amino acids such as H, R and K, are more prominent in the RNA-interacting residues. Initially, machine learning based models have been developed using binary profile and obtain maximum area under curve (AUC) 0.68 on validation dataset. The performance of this model improved significantly from AUC 0.68 to 0.76, when evolutionary profile is used instead of binary profile. The performance of our evolutionary profile-based model improved further from AUC 0.76 to 0.82, when convolutional neural network has been used for developing model. Our final model based on convolutional neural network using evolutionary information achieved AUC 0.82 with Matthews correlation coefficient of 0.49 on the validation dataset. Our best model outperforms existing methods when evaluated on the independent/validation dataset. A user-friendly standalone software and web-based server named 'Pprint2' has been developed for predicting RNA-interacting residues (https://webs.iiitd.edu.in/raghava/pprint2 and https://github.com/raghavagps/pprint2).
Collapse
Affiliation(s)
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Khushboo Bajaj
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Harshita Sahu
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| |
Collapse
|
2
|
Herrera-Bravo J, Farías JG, Contreras FP, Herrera-Belén L, Norambuena JA, Beltrán JF. VirVACPRED: A Web Server for Prediction of Protective Viral Antigens. Int J Pept Res Ther 2021; 28:35. [PMID: 34934411 PMCID: PMC8679566 DOI: 10.1007/s10989-021-10345-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/07/2021] [Indexed: 11/25/2022]
Abstract
Viral antigens are key in the development of vaccines that prevent or eradicate infections caused by these pathogens. Bioinformatics tools are modern alternatives that facilitate the discovery of viral antigens, reducing the costs of experimental assays. We developed a bioinformatics tool called VirVACPRED, which is highly efficient in predicting viral antigens. In this study, we obtained a model based on the gradient boosting classifier, which showed high performance during the training, leave-one-out cross-validation (accuracy = 0.7402, sensitivity = 0.7319, precision = 0.7503, F1 = 0.7251, kappa = 0.4774, Matthews correlation coefficient = 0.4981) and testing (accuracy = 0.8889, sensitivity = 1.0, precision = 0.8276, F1 = 0.9057, kappa = 0.7734, Matthews correlation coefficient = 0.7941). VirVACPRED is a robust tool that can be of great help in the search and proposal of new viral antigens, which can be considered in the development of future vaccines against infections caused by viruses.
Collapse
Affiliation(s)
- Jesús Herrera-Bravo
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Santiago, Chile
- Center of Molecular Biology and Pharmacogenetics, Scientific and Technological Bioresource Nucleus, Universidad de La Frontera, Temuco, Chile
| | - Jorge G. Farías
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Fernanda Parraguez Contreras
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Lisandra Herrera-Belén
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Juan-Alejandro Norambuena
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Program on Natural Resources Sciences, Universidad de La Frontera, Avenida Francisco Salazar, 01145, P.O. Box 54-D, 4780000 Temuco, Chile
| | - Jorge F. Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| |
Collapse
|
3
|
Wang YG, Huang SY, Wang LN, Zhou ZY, Qiu JD. Accurate prediction of species-specific 2-hydroxyisobutyrylation sites based on machine learning frameworks. Anal Biochem 2020; 602:113793. [DOI: 10.1016/j.ab.2020.113793] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2020] [Revised: 04/25/2020] [Accepted: 05/20/2020] [Indexed: 12/17/2022]
|
4
|
Wekesa JS, Meng J, Luan Y. A deep learning model for plant lncRNA-protein interaction prediction with graph attention. Mol Genet Genomics 2020; 295:1091-1102. [DOI: 10.1007/s00438-020-01682-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2020] [Accepted: 05/01/2020] [Indexed: 02/06/2023]
|
5
|
Wekesa JS, Meng J, Luan Y. Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction. Genomics 2020; 112:2928-2936. [PMID: 32437848 DOI: 10.1016/j.ygeno.2020.05.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Revised: 04/22/2020] [Accepted: 05/05/2020] [Indexed: 12/28/2022]
Abstract
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms including binding to RNA binding proteins. The majority of plant lncRNAs are functionally uncharacterized, thus, accurate prediction of plant lncRNA-protein interaction is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI. Its uniqueness is that it predicts by multi-feature fusion. Structural and four groups of sequence features are used, including tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained 0.9820 and 0.9652 area under precision/recall curve (AUPRC) respectively. The proposed method shows significant enhancement in the prediction performance compared with existing state-of-the-art methods.
Collapse
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China; School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
| |
Collapse
|
6
|
Su M, Lyles JT, Petit III RA, Peterson J, Hargita M, Tang H, Solis-Lemus C, Quave CL, Read TD. Genomic analysis of variability in Delta-toxin levels between Staphylococcus aureus strains. PeerJ 2020; 8:e8717. [PMID: 32231873 PMCID: PMC7100594 DOI: 10.7717/peerj.8717] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2019] [Accepted: 02/10/2020] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND The delta-toxin (δ-toxin) of Staphylococcus aureus is the only hemolysin shown to cause mast cell degranulation and is linked to atopic dermatitis, a chronic inflammatory skin disease. We sought to characterize variation in δ-toxin production across S. aureus strains and identify genetic loci potentially associated with differences between strains. METHODS A set of 124 S. aureus strains was genome-sequenced and δ-toxin levels in stationary phase supernatants determined by high performance liquid chromatography (HPLC). SNPs and kmers were associated with differences in toxin production using four genome-wide association study (GWAS) methods. Transposon mutations in candidate genes were tested for their δ-toxin levels. We constructed XGBoost models to predict toxin production based on genetic loci discovered to be potentially associated with the phenotype. RESULTS The S. aureus strain set encompassed 40 sequence types (STs) in 23 clonal complexes (CCs). δ-toxin production ranged from barely detectable levels to >90,000 units, with a median of >8,000 units. CC30 had significantly lower levels of toxin production than average while CC45 and CC121 were higher. MSSA (methicillin sensitive) strains had higher δ-toxin production than MRSA (methicillin resistant) strains. Through multiple GWAS approaches, 45 genes were found to be potentially associated with toxicity. Machine learning models using loci discovered through GWAS as features were able to predict δ-toxin production (as a high/low binary phenotype) with a precision of .875 and specificity of .990 but recall of .333. We discovered that mutants in the carA gene, encoding the small chain of carbamoyl phosphate synthase, completely abolished toxin production and toxicity in Caenorhabditis elegans. CONCLUSIONS The amount of stationary phase production of the toxin is a strain-specific phenotype likely affected by a complex interaction of number of genes with different levels of effect. We discovered new candidate genes that potentially play a role in modulating production. We report for the first time that the product of the carA gene is necessary for δ-toxin production in USA300. This work lays a foundation for future work on understanding toxin regulation in S. aureus and prediction of phenotypes from genomic sequences.
Collapse
Affiliation(s)
- Michelle Su
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - James T. Lyles
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
| | - Robert A. Petit III
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - Jessica Peterson
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - Michelle Hargita
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - Huaqiao Tang
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
| | - Claudia Solis-Lemus
- Department of Human Genetics, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - Cassandra L. Quave
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
- Department of Dermatology, School of Medicine, Emory University, Atlanta, GA, United States of America
| | - Timothy D. Read
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- Department of Human Genetics, School of Medicine, Emory University, Atlanta, GA, United States of America
| |
Collapse
|
7
|
Hu X, Feng Z, Zhang X, Liu L, Wang S. The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility. Front Genet 2020; 11:214. [PMID: 32265982 PMCID: PMC7096583 DOI: 10.3389/fgene.2020.00214] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 02/24/2020] [Indexed: 11/13/2022] Open
Abstract
Many proteins realize their special functions by binding with specific metal ion ligands during a cell's life cycle. The ability to correctly identify metal ion ligand-binding residues is valuable for the human health and the design of molecular drug. Precisely identifying these residues, however, remains challenging work. We have presented an improved computational approach for predicting the binding residues of 10 metal ion ligands (Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Ca2+, Mg2+, Mn2+, Na+, and K+) by adding reclassified relative solvent accessibility (RSA). The best accuracy of fivefold cross-validation was higher than 77.9%, which was about 16% higher than the previous result on the same dataset. It was found that different reclassification of the RSA information can make different contributions to the identification of specific ligand binding residues. Our study has provided an additional understanding of the effect of the RSA on the identification of metal ion ligand binding residues.
Collapse
Affiliation(s)
| | - Zhenxing Feng
- College of Sciences, Inner Mongolla University of Technology, Hohhot, China
| | - Xiaojin Zhang
- College of Sciences, Inner Mongolla University of Technology, Hohhot, China
| | | | | |
Collapse
|
8
|
Zhang P, Shen L, Yang W. Solvation Free Energy Calculations with Quantum Mechanics/Molecular Mechanics and Machine Learning Models. J Phys Chem B 2019; 123:901-908. [PMID: 30557020 PMCID: PMC6448400 DOI: 10.1021/acs.jpcb.8b11905] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
For exploration of chemical and biological systems, the combined quantum mechanics and molecular mechanics (QM/MM) and machine learning (ML) models have been developed recently to achieve high accuracy and efficiency for molecular dynamics (MD) simulations. Despite its success on reaction free energy calculations, how to identify new configurations on insufficiently sampled regions during MD and how to update the current ML models with the growing database on the fly are both very important but still challenging. In this article, we apply the QM/MM ML method to solvation free energy calculations and address these two challenges. We employ three approaches to detect new data points and introduce the gradient boosting algorithm to reoptimize efficiently the ML model during ML-based MD sampling. The solvation free energy calculations on several typical organic molecules demonstrate that our developed method provides a systematic, robust, and efficient way to explore new chemistry using ML-based QM/MM MD simulations.
Collapse
Affiliation(s)
- Pan Zhang
- Department of Chemistry, Duke University, Durham, North Carolina 27708, United States
| | - Lin Shen
- Department of Chemistry, Duke University, Durham, North Carolina 27708, United States
| | - Weitao Yang
- Department of Chemistry and Department of Physics, Duke University, Durham, NC 27708, United States
- Key laboratory of Theoretical Chemistry of Environment, Ministry of Education, School of Chemistry and Environment, South China Normal University, Guangzhou 510006, P.R.China
| |
Collapse
|
9
|
Liu G, Chen Z, Danilova IG, Bolkov MA, Tuzankina IA, Liu G. Identification of miR-200c and miR141-Mediated lncRNA-mRNA Crosstalks in Muscle-Invasive Bladder Cancer Subtypes. Front Genet 2018; 9:422. [PMID: 30323832 PMCID: PMC6172409 DOI: 10.3389/fgene.2018.00422] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2018] [Accepted: 09/10/2018] [Indexed: 11/30/2022] Open
Abstract
Basal and luminal subtypes of muscle-invasive bladder cancer (MIBC) have distinct molecular profiles and heterogeneous clinical behaviors. The interactions between mRNAs and lncRNAs, which might be regulated by miRNAs, have crucial roles in many cancers. However, the miRNA-dependent crosstalk between lncRNA and mRNA in specific MIBC subtypes still remains unclear. In this study, we first classified MIBC into two conservative subtypes using miRNA, mRNA and lncRNA expression data derived from The Cancer Genome Atlas. Then we investigated subtype-related biological pathways and evaluated the subtype classification performance using Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost). At last, we explored potential miRNA-mediated lncRNA-mRNA crosstalks based on co-expression analysis. Our results show that: (1) the luminal subtype is primarily characterized by upregulation of metabolism-related pathways while the basal subtype is predominantly characterized by upregulation of epithelial-mesenchymal transition, metastasis, and immune system process-related pathways; (2) the XGBoost prediction model is consistently robust for classification of the molecular subtypes of MIBC across four datasets (The area under the ROC curve > 0.9); (3) the expression levels of the molecules in the miR-200c and miR141-mediated lncRNA-mRNA crosstalks differ considerably between the two subtypes and have close relationships with the prognosis of MIBC. The miR-200c and miR-141-dependent mRNA-lncRNA crosstalks might be of great significance in tumorigenesis and tumor progression and may serve as the novel prognostic predictors and classification markers of MIBC subtypes.
Collapse
Affiliation(s)
- Guojun Liu
- Institute of Natural Sciences and Mathematics, Ural Federal University, Yekaterinburg, Russia
| | - Zihao Chen
- Department of Urology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Irina G Danilova
- Institute of Natural Sciences and Mathematics, Ural Federal University, Yekaterinburg, Russia
| | - Mikhail A Bolkov
- Institute of Immunology and Physiology, Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia
| | - Irina A Tuzankina
- Institute of Immunology and Physiology, Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia
| | - Guoqing Liu
- School of Life Sciences and Technology, Inner Mongolia University of Science and Technology, Baotou, China
| |
Collapse
|