1
Yadav AK, Gupta PK, Singh TR. PMTPred: machine-learning-based prediction of protein methyltransferases using the composition of k-spaced amino acid pairs. Mol Divers 2024:10.1007/s11030-024-10937-2. [PMID: 39033257 DOI: 10.1007/s11030-024-10937-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 07/10/2024] [Indexed: 07/23/2024]
Abstract
Protein methyltransferases (PMTs) are a group of enzymes that catalyze the transfer of a methyl group to their substrates. These enzymes play an important role in epigenetic regulation and can methylate various substrates, including DNA, RNA, proteins, and small-molecule secondary metabolites. Dysregulation of methyltransferases is implicated in various human cancers. Given the well-recognized significance of PMTs, reliable and efficient identification methods are essential. In the present work, we propose a machine-learning-based method for the identification of PMTs. Various sequence-based features were calculated, and prediction models were trained with several machine-learning algorithms using tenfold cross-validation. After evaluating each model on the dataset, the SVM-based CKSAAP model achieved the highest prediction accuracy with balanced sensitivity and specificity. This SVM model also outperformed deep-learning algorithms for the prediction of PMTs. In addition, cross-database validation was performed to assess the robustness of the model. Feature importance was assessed using Shapley additive explanations (SHAP) values, providing insights into the contributions of different features to the model's predictions. Finally, the SVM-based CKSAAP model was implemented in a standalone tool, PMTPred, owing to its consistent performance during independent testing and cross-database evaluation. We believe that PMTPred will be a useful and efficient tool for the identification of PMTs. PMTPred is freely available for download at https://github.com/ArvindYadav7/PMTPred and http://www.bioinfoindia.org/PMTPred/home.html for research and academic use.
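The CKSAAP descriptor named in this abstract can be illustrated with a short sketch. It counts amino acid pairs separated by k residues for k = 0 to k_max and divides each count by the number of pair positions. The function name `cksaap`, the default `k_max=3`, and the normalization follow the usual definition of the descriptor and are assumptions here, not details taken from PMTPred.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence, k_max=3):
    """Return CKSAAP features as a dict keyed by (k, pair).

    Illustrative sketch only; k_max and the normalization are assumptions.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = {}
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        windows = len(sequence) - k - 1  # number of residue pairs with gap k
        for i in range(max(windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip non-standard residues such as X
                counts[pair] += 1
        for pair in pairs:
            features[(k, pair)] = counts[pair] / windows if windows > 0 else 0.0
    return features


features = cksaap("ACAC", k_max=1)
```

Each gap value contributes 400 features (20 x 20 pairs), so the vector length grows linearly with `k_max`; a classifier such as an SVM would then be trained on these vectors.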
Affiliation(s)
- Arvind Kumar Yadav
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India
- Pradeep Kumar Gupta
- Department of Computer Science and Engineering, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India
- School of Computing, Department of Data Science and Engineering, Mohan Babu University, Tirupati- 517102, Andhra Pradesh, India
- Tiratha Raj Singh
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India.
- Centre of Excellence in Healthcare Technologies and Informatics (CHETI), Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India.
2
Pandey D, Singhal N, Kumar M. β-LacFamPred: An online tool for prediction and classification of β-lactamase class, subclass, and family. Front Microbiol 2023; 13:1039687. [PMID: 36713195 PMCID: PMC9878453 DOI: 10.3389/fmicb.2022.1039687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023]
Abstract
β-Lactams are a broad class of antimicrobial agents with a high safety profile, making them the most widely used class in clinical, agricultural, and veterinary settings. The widespread use of β-lactams has induced the extensive spread of β-lactam-hydrolyzing enzymes known as β-lactamases (BLs). To neutralize the effect of β-lactamases, newer generations of β-lactams have been developed, which ultimately led to the evolution of a highly diverse family of BLs. Based on sequence homology, BLs are categorized into four classes, A-D, in Ambler's classification system. Each class is further subdivided into families; class B is first divided into subclasses B1-B3, and each subclass is then divided into families. The class to which a BL belongs gives considerable insight into its hydrolytic profile. Traditional methods of determining the hydrolytic profile of BLs and classifying them are time-consuming and resource-intensive. Hence we developed a machine-learning-based in silico method, named β-LacFamPred, for the prediction and annotation of Ambler's class, subclass, and 96 families of BLs. During leave-one-out cross-validation, all but one of the β-LacFamPred model HMMs showed 100% accuracy. Benchmarking against other BL family prediction methods showed β-LacFamPred to be the most accurate. Out of 60 penicillin-binding proteins (PBPs) and 57 glyoxalase II proteins, β-LacFamPred correctly predicted 56 PBPs and none of the glyoxalase II sequences as non-BLs. Proteome-wide annotation of BLs by β-LacFamPred yielded far fewer false-positive predictions than the recently developed BL class prediction tool DeepBL. β-LacFamPred is available both as a web server and as a standalone tool at http://proteininformatics.org/mkumar/blacfampred and the GitHub repository https://github.com/mkubiophysics/B-LacFamPred, respectively.
3
Ahmadzadeh M, Cosco TD, Best JR, Christie GJ, DiPaola S. Predictors of the rate of cognitive decline in older adults using machine learning. PLoS One 2023; 18:e0280029. [PMID: 36867596 PMCID: PMC9983884 DOI: 10.1371/journal.pone.0280029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Accepted: 12/20/2022] [Indexed: 03/04/2023]
Abstract
BACKGROUND The longitudinal rates of cognitive decline among aging populations are heterogeneous. Few studies have investigated the possibility of implementing prognostic models to predict cognitive changes with the combination of categorical and continuous data from multiple domains. OBJECTIVE To implement a robust multivariate model to predict longitudinal cognitive changes over 12 years among older adults and to identify the most significant predictors of cognitive changes using machine learning techniques. METHOD In total, data from 2733 participants aged 50-85 years from the English Longitudinal Study of Ageing were included. Two categories of cognitive change were determined: minor cognitive decliners (2361 participants, 86.4%) and major cognitive decliners (372 participants, 13.6%) over 12 years from wave 2 (2004-2005) to wave 8 (2016-2017). Machine learning methods were used to implement the predictive models and to identify the predictors of cognitive decline using 43 baseline features from seven domains: sociodemographic, social engagement, health, physical functioning, psychological, health-related behaviors, and baseline cognitive tests. RESULTS The model distinguished future major cognitive decliners from those with minor cognitive decline with relatively high performance. The overall AUC, sensitivity, and specificity of prediction were 72.84%, 78.23%, and 67.41%, respectively. Furthermore, the seven top-ranked features for predicting major vs minor cognitive decliners were age, employment status, socioeconomic status, self-rated memory changes, immediate word recall, the feeling of loneliness, and vigorous physical activity. In contrast, the five least important baseline features were smoking, instrumental activities of daily living, eye disease, life satisfaction, and cardiovascular disease. CONCLUSION The present study indicated the possibility of identifying individuals at high risk of future major cognitive decline as well as potential risk/protective factors of cognitive decline among older adults. The findings could assist in improving interventions to delay cognitive decline among aging populations.
Affiliation(s)
- Maryam Ahmadzadeh
- School of Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada
- Theodore David Cosco
- Gerontology Research Center, Simon Fraser University, Vancouver, BC, Canada
- Oxford Institute of Population Ageing, University of Oxford, Oxford, United Kingdom
- John R. Best
- Gerontology Research Center, Simon Fraser University, Vancouver, BC, Canada
- Steve DiPaola
- School of Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada
- * E-mail:
4
Park J, Kim J, Ryu D, Choi HY. Factors related to steroid treatment responsiveness in thyroid eye disease patients and application of SHAP for feature analysis with XGBoost. Front Endocrinol (Lausanne) 2023; 14:1079628. [PMID: 36817584 PMCID: PMC9928572 DOI: 10.3389/fendo.2023.1079628] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/31/2022] [Accepted: 01/12/2023] [Indexed: 02/03/2023]
Abstract
INTRODUCTION The primary treatment for active thyroid eye disease (TED) is immunosuppressive therapy with intravenous steroids. In this study, we attempted to predict responsiveness to steroid treatment in TED patients using eXtreme Gradient Boosting (XGBoost). Factors associated with steroid responsiveness were also statistically evaluated. METHODS Clinical characteristics and laboratory results of 89 patients with TED who received steroid treatment were retrospectively reviewed. XGBoost was used to explore responsiveness to steroid treatment, and the diagnostic performance was evaluated. Factors contributing to the model output were investigated using SHapley Additive exPlanations (SHAP), and the treatment response was investigated statistically using SPSS software. RESULTS The XGBoost model showed high performance, with an accuracy of 0.861. According to the SHAP analysis, thyroid-stimulating hormone (TSH), thyroid-stimulating immunoglobulin (TSI), and low-density lipoprotein (LDL) cholesterol had the highest impact on the model. Multivariate logistic regression analysis showed that less extraocular muscle limitation and high TSI levels were associated with a high risk of poor intravenous methylprednisolone treatment response. CONCLUSION TSI, extraocular muscle limitation, and LDL cholesterol levels may be useful in predicting steroid treatment response in patients with TED. In terms of machine learning, XGBoost showed relatively robust and reliable results for small datasets. The machine-learning model can assist in decision-making for further treatment of patients with TED.
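To make the idea behind SHAP attributions concrete, the sketch below computes exact Shapley values for a toy three-feature function by enumerating all feature subsets. Everything in it is hypothetical: the `toy_model` coefficients, the feature values, and the all-zero baseline are chosen for illustration and have no connection to the study's data. A SHAP analysis of a real XGBoost model would rely on a tree-specific algorithm instead of this brute-force enumeration, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features outside the subset are set to their baseline value.
        point = {f: (x[f] if f in subset else baseline[f]) for f in names}
        return model(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


# Hypothetical toy model; the coefficients are illustrative only.
def toy_model(p):
    return 2.0 * p["TSI"] + 1.0 * p["LDL"] + 0.5 * p["TSI"] * p["TSH"]


x = {"TSH": 1.0, "TSI": 2.0, "LDL": 3.0}
baseline = {"TSH": 0.0, "TSI": 0.0, "LDL": 0.0}
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two features involved; this additivity is what allows per-feature impacts to be ranked.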
Affiliation(s)
- Jungyul Park
- Department of Ophthalmology, Pusan National University Hospital, Busan, Republic of Korea
- Biomedical Research Institute, Pusan National University Hospital, Busan, Republic of Korea
- Jaehyun Kim
- Department of Ophthalmology, Pusan National University Hospital, Busan, Republic of Korea
- Dongman Ryu
- Medical Research Institute, Pusan National University, Busan, Republic of Korea
- Hee-young Choi
- Department of Ophthalmology, Pusan National University Hospital, Busan, Republic of Korea
- Biomedical Research Institute, Pusan National University Hospital, Busan, Republic of Korea
- Department of Ophthalmology, School of Medicine, Pusan National University, Busan, Republic of Korea
- *Correspondence: Hee-young Choi,
5
Sikander R, Arif M, Ghulam A, Worachartcheewan A, Thafar MA, Habib S. Identification of the ubiquitin–proteasome pathway domain by hyperparameter optimization based on a 2D convolutional neural network. Front Genet 2022; 13:851688. [PMID: 35937990 PMCID: PMC9355632 DOI: 10.3389/fgene.2022.851688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 06/29/2022] [Indexed: 11/13/2022]
Abstract
The major mechanism of proteolysis in the cytosol and nucleus is the ubiquitin–proteasome pathway (UPP). The highly controlled UPP affects a wide range of cellular processes and substrates, and flaws in the system can lead to the pathogenesis of a number of serious human diseases. Knowledge about UPPs provides useful hints for understanding cellular processes and for drug discovery. The exponential growth of next-generation sequencing wet-lab approaches has accelerated the accumulation of unannotated data in online databases, making UPP characterization and analysis more challenging. Computational methods are therefore used as an alternative for fast and accurate identification of UPPs. To this end, we developed a novel deep learning-based predictor named “2DCNN-UPP” for identifying UPPs with a low error rate. The proposed method combines a two-dimensional convolutional neural network (2D-CNN) with dipeptide deviation features. To avoid overfitting, a genetic algorithm is employed to select the optimal features. Finally, the optimized attribute set is fed as input to the 2D-CNN learning engine to build the model. The predictor was trained with 10-fold cross-validation and then evaluated on an independent test set, and it performed better than other state-of-the-art methods for discriminating UPPs from non-UPPs. In cases where experimentally validated ubiquitination sites emerge, a proteomics-based predictor of ubiquitination must be devised. We also evaluated the generalization power of our trained model via the independent test and obtained 0.862 accuracy, 0.921 sensitivity, 0.803 specificity, and a Matthews correlation coefficient (MCC) of 0.730. Four approaches were used on the sequences, and the calculated physical properties were combined. Under 10-fold cross-validation, 2DCNN-UPP obtained an AUC (ROC) value of 0.862. We analyzed the relationship between the predicted scores of UPP and non-UPP proteins. This research could support large-scale analysis of the relationship between UPP proteins and non-UPP proteins in particular, and of other protein problems in general, and might benefit computational biology research. The model framework and Dipeptide Deviation from Expected Mean (DDE)-based features could also be used for predicting the structure and function of proteins and of other molecules, such as DNA and RNA.
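The accuracy, sensitivity, specificity, and MCC figures quoted in this abstract all derive from a binary confusion matrix. A minimal sketch of those standard definitions follows; the function name `binary_metrics` and the example counts are hypothetical and are not the study's data.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts.

    Assumes both classes are present, so tp + fn > 0 and tn + fp > 0.
    """
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts for a balanced test set of 200 proteins.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

On a balanced test set, accuracy equals the mean of sensitivity and specificity, which gives a quick consistency check on reported figures.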
Affiliation(s)
- Rahu Sikander
- School of Computer Science and Technology, Xidian University, Xi’an, China
- *Correspondence: Rahu Sikander, ; Apilak Worachartcheewan,
- Muhammad Arif
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- Ali Ghulam
- Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Apilak Worachartcheewan
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- *Correspondence: Rahu Sikander, ; Apilak Worachartcheewan,
- Maha A. Thafar
- Department of Computer Science, College of Computer and Information Technology, Taif University, Taif, Saudi Arabia
- Shabana Habib
- Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia
6
Yan L, Jin Y, Zhang B, Xu Y, Peng X, Qin S, Chen L. Diverse Aquatic Animal Matrices Play a Key Role in Survival and Potential Virulence of Non-O1/O139 Vibrio cholerae Isolates. Front Microbiol 2022; 13:896767. [PMID: 35801116 PMCID: PMC9255913 DOI: 10.3389/fmicb.2022.896767] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/15/2022] [Accepted: 05/04/2022] [Indexed: 11/13/2022]
Abstract
Vibrio cholerae can cause pandemic cholera in humans. The waterborne bacterium is frequently isolated from aquatic products worldwide. However, current literature on the impact of aquatic product matrices on the survival and pathogenicity of V. cholerae is scarce. In this study, the growth of eleven non-O1/O139 V. cholerae isolates recovered from eight species of commonly consumed fish and shellfish was determined for the first time in the eight aquatic animal matrices, most of which greatly increased the bacterial biomass compared with routine trypsin soybean broth (TSB) medium. Secretomes of the V. cholerae isolates (draft genome size: 3,852,021–4,144,013 bp) were determined using two-dimensional gel electrophoresis (2DE-GE) and liquid chromatography-tandem mass spectrometry (LC-MS/MS) techniques. Comparative secretomic analyses revealed 74 differential extracellular proteins, including several virulence- and resistance-associated proteins secreted by the V. cholerae isolates when grown in the eight matrices. Meanwhile, a total of 8,119 intracellular proteins were identified, including 83 virulence- and 8 resistance-associated proteins, of which 61 virulence-associated proteins were absent from proteomes of these isolates when grown in the TSB medium. Additionally, comparative genomic and proteomic analyses revealed several strain-specific proteins with unknown functions in the V. cholerae isolates. Taken together, the results of this study demonstrate that distinct secretomes and proteomes induced by the aquatic animal matrices facilitate V. cholerae resistance in the edible aquatic animals and enhance the pathogenicity of the leading waterborne pathogen worldwide.
Affiliation(s)
- Lili Yan
- Key Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture and Rural Affairs of the People's Republic of China, Shanghai, China
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Yinzhe Jin
- Key Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture and Rural Affairs of the People's Republic of China, Shanghai, China
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Beiyu Zhang
- Key Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture and Rural Affairs of the People's Republic of China, Shanghai, China
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Yingwei Xu
- Key Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture and Rural Affairs of the People's Republic of China, Shanghai, China
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Xu Peng
- Department of Biology, Archaea Centre, University of Copenhagen, Copenhagen, Denmark
- Si Qin
- Key Laboratory for Food Science and Biotechnology of Hunan Province, College of Food Science and Technology, Hunan Agricultural University, Changsha, China
- *Correspondence: Si Qin
- Lanming Chen
- Key Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture and Rural Affairs of the People's Republic of China, Shanghai, China
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Lanming Chen
7
Bioinformatic Analyses of Peroxiredoxins and RF-Prx: A Random Forest-Based Predictor and Classifier for Prxs. Methods Mol Biol 2022; 2499:155-176. [PMID: 35696080 PMCID: PMC9844236 DOI: 10.1007/978-1-0716-2317-6_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Abstract
Peroxiredoxins (Prxs) are a protein superfamily, present in all organisms, that play a critical role in protecting cellular macromolecules from oxidative damage but also regulate intracellular and intercellular signaling processes involving redox-regulated proteins and pathways. Bioinformatic approaches using computational tools that focus on active site-proximal sequence fragments (known as active site signatures) and iterative clustering and searching methods (referred to as TuLIP and MISST) have recently enabled the recognition of over 38,000 peroxiredoxins, as well as their classification into six functionally relevant groups. With these data providing so many examples of Prxs in each class, machine learning approaches offer an opportunity to extract additional information about features characteristic of these protein groups. In this study, we developed a novel computational method named "RF-Prx" based on a random forest (RF) approach integrated with K-space amino acid pairs (KSAAP) to identify peroxiredoxins and classify them into one of six subgroups. Our method outperformed other machine learning classifiers. The RF approach integrated with K-space amino acid pairs thus enabled the detection of class-specific conserved sequences that lie outside the known functional centers and are of potential importance. For example, drugs designed to target Prx proteins would likely suffer from cross-reactivity among distinct Prxs if targeted to conserved active sites, but this may be avoidable if remote, class-specific regions could be targeted instead.
8
βLact-Pred: A Predictor Developed for Identification of Beta-Lactamases Using Statistical Moments and PseAAC via 5-Step Rule. Computational Intelligence and Neuroscience 2021; 2021:8974265. [PMID: 34956358 PMCID: PMC8709780 DOI: 10.1155/2021/8974265] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/22/2021] [Accepted: 11/22/2021] [Indexed: 12/02/2022]
Abstract
Beta-lactamase (β-lactamase) produced by different bacteria confers resistance against β-lactam-containing drugs. The gene encoding β-lactamase is plasmid-borne and can easily be transferred from one bacterium to another during conjugation. Through such transfer, the recipient also acquires resistance against drugs of the β-lactam family. β-Lactam antibiotics play a vital role in the clinical treatment of diseases such as soft tissue infections, gonorrhoea, skin infections, urinary tract infections, and bronchitis. Herein, we report a prediction classifier named βLact-Pred for the identification of β-lactamase proteins. The computational model takes the primary amino acid sequence as its input. Various metrics are derived from the primary structure to form a feature vector. Experimentally determined data of positive and negative beta-lactamases are collected and transformed into feature vectors. An artificial neural network is trained by integrating position-relative features and sequence statistical moments in PseAAC. The proposed computational model was validated using several approaches, i.e., self-consistency testing, jackknife testing, cross-validation, and independent testing, for which the overall accuracy of the predictor was 99.76%, 96.07%, 94.20%, and 91.65%, respectively. The experimental results demonstrated that the proposed predictor “βLact-Pred” surpassed the existing methods.
9
Chen PN, Lee CC, Liang CM, Pao SI, Huang KH, Lin KF. General deep learning model for detecting diabetic retinopathy. BMC Bioinformatics 2021; 22:84. [PMID: 34749634 PMCID: PMC8576963 DOI: 10.1186/s12859-021-04005-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 02/08/2021] [Indexed: 01/04/2023]
Abstract
BACKGROUND Doctors can detect symptoms of diabetic retinopathy (DR) early by using retinal ophthalmoscopy, and they can improve diagnostic efficiency with the assistance of deep learning to select treatments and support personnel workflow. Conventionally, most deep learning methods for DR diagnosis split retinal ophthalmoscopy images into training and validation data sets according to the 80/20 rule, and they use the synthetic minority oversampling technique (SMOTE) in data processing (e.g., rotating, scaling, and translating training images) to increase the number of training samples. Training on oversampled data may lead to overfitting of the model, so untrained or unverified images can yield erroneous predictions. Although the reported prediction accuracy is 90%-99%, this overfitting to the training data may distort training module variables. RESULTS This study uses a 2-stage training method to address the overfitting problem. In the training phase, Learning module 1 was used to distinguish DR from no-DR, and Learning module 2 was trained on SMOTE synthetic datasets to classify mild NPDR, moderate NPDR, severe NPDR, and proliferative DR. Both modules also used early stopping and data-dividing methods to reduce the overfitting caused by oversampling. In the test phase, we used the DIARETDB0, DIARETDB1, eOphtha, MESSIDOR, and DRIVE datasets to evaluate the performance of the trained network. The prediction accuracy reached 85.38%, 84.27%, 85.75%, 86.73%, and 92.5%, respectively. CONCLUSIONS Based on the experiment, a general deep learning model for detecting DR was developed, and it could be used with all DR databases. We provided a simple method of addressing the imbalance of DR databases, and this method can be used with other medical images.
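For readers unfamiliar with SMOTE, the sketch below shows the interpolation step by which the technique is usually defined: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority-class neighbors. This is a generic feature-space illustration under assumed parameters (`k=2`, a fixed seed, two-dimensional toy points); how the cited study applied oversampling to retinal images is not specified here, and the function is not taken from any particular library.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen point (excluding itself)
        candidates = [p for p in minority if p is not base]
        neighbors = sorted(candidates, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies between two existing minority samples, the new data adds no information from outside the observed region, which is one reason heavy oversampling can encourage the overfitting the abstract describes.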
Affiliation(s)
- Ping-Nan Chen
- Department of Biomedical Engineering, National Defense Medical Center, Taipei, 114, Taiwan, ROC.
- Chia-Chiang Lee
- Graduate Institute of Applied Science and Technology, National Taiwan University of Science and Technology, Taipei, 106, Taiwan, ROC
- Chang-Min Liang
- Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Shu-I Pao
- Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Ke-Hao Huang
- Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Ke-Feng Lin
- Graduate Institute of Applied Science and Technology, National Taiwan University of Science and Technology, Taipei, 106, Taiwan, ROC
- Department of Medical Records, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
10
Li Y, Xu Z, Han W, Cao H, Umarov R, Yan A, Fan M, Chen H, Duarte CM, Li L, Ho PL, Gao X. HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes. Microbiome 2021; 9:40. [PMID: 33557954 PMCID: PMC7871585 DOI: 10.1186/s40168-021-01002-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 11/11/2020] [Accepted: 01/08/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND The spread of antibiotic resistance has become one of the most urgent threats to global health and is estimated to cause 700,000 deaths each year globally. Its surrogates, antibiotic resistance genes (ARGs), are highly transmissible among food, water, animals, and humans, and they reduce the efficacy of antibiotics. Accurately identifying ARGs is thus an indispensable step toward understanding the ecology and transmission of ARGs between environmental and human-associated reservoirs. Unfortunately, previous computational methods for identifying ARGs are mostly based on sequence alignment, which cannot identify novel ARGs, and their applications are limited by currently incomplete knowledge about ARGs. RESULTS Here, we propose an end-to-end Hierarchical Multi-task Deep learning framework for ARG annotation (HMD-ARG). Taking raw sequence encoding as input, HMD-ARG can identify, without querying against existing sequence databases, multiple ARG properties simultaneously, including whether the input protein sequence is an ARG and, if so, what antibiotic family it is resistant to, what resistance mechanism the ARG uses, and whether the ARG is intrinsic or acquired. In addition, if the predicted antibiotic family is beta-lactamase, HMD-ARG further predicts the subclass of beta-lactamase that the ARG is resistant to. Comprehensive experiments, including cross-fold validation, third-party dataset validation in human gut microbiota, wet-experimental functional validation, and structural investigation of predicted conserved sites, demonstrate not only the superior performance of our method over state-of-the-art methods, but also the effectiveness and robustness of the proposed method. CONCLUSIONS We propose a hierarchical multi-task method, HMD-ARG, which is based on deep learning and can provide detailed annotations of ARGs from three important aspects: resistant antibiotic class, resistance mechanism, and gene mobility. We believe that HMD-ARG can serve as a powerful tool to identify antibiotic resistance genes and therefore mitigate their global threat. Our method and the constructed database are available at http://www.cbrc.kaust.edu.sa/HMDARG/.
Affiliation(s)
- Yu Li
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People's Republic of China
- Zeling Xu
- School of Biological Sciences, The University of Hong Kong, Hong Kong, People's Republic of China
- Wenkai Han
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Huiluo Cao
- Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Hong Kong, People's Republic of China
- Ramzan Umarov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Aixin Yan
- School of Biological Sciences, The University of Hong Kong, Hong Kong, People's Republic of China
- Ming Fan
- Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, People's Republic of China
- Huan Chen
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Zhejiang Institute of Microbiology, Hangzhou, People's Republic of China
| | - Carlos M Duarte
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Lihua Li
- Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, People's Republic of China
| | - Pak-Leung Ho
- Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Hong Kong, People's Republic of China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.
|
11
|
Wang Y, Li F, Bharathwaj M, Rosas NC, Leier A, Akutsu T, Webb GI, Marquez-Lago TT, Li J, Lithgow T, Song J. DeepBL: a deep learning-based approach for in silico discovery of beta-lactamases. Brief Bioinform 2020; 22:5992357. [PMID: 33212503 DOI: 10.1093/bib/bbaa301] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/16/2020] [Revised: 10/05/2020] [Accepted: 10/09/2020] [Indexed: 01/14/2023]
Abstract
Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial for understanding beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach that incorporates sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening for all reviewed bacterial protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.
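To illustrate what a sequence redundancy threshold does to a benchmark dataset, here is a rough sketch of greedy redundancy reduction. It is not the procedure used for DeepBL: the `similarity` measure below is a crude stand-in built on Python's `difflib`, whereas real pipelines typically rely on alignment-based identity from dedicated clustering tools such as CD-HIT. The function names and the example threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences: List[str], threshold: float) -> List[str]:
    """Greedily keep sequences that stay below `threshold` against every kept one."""
    kept: List[str] = []
    # Visit the longest sequences first so each group of similar sequences
    # is represented by its longest member.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(similarity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept
```

Lowering the threshold removes more near-duplicates, which shrinks the dataset but makes the evaluation less prone to rewarding memorized homologs.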
Affiliation(s)
- Yanan Wang
  - Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology at Monash University, Australia
- Fuyi Li
  - Bioinformatics from Monash University, Australia
- Manasa Bharathwaj
  - Department of Microbiology at the Biomedicine Discovery Institute, Monash University, Australia
- Natalia C Rosas
  - Department of Microbiology at the Biomedicine Discovery Institute, Monash University, Australia
- André Leier
  - Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
- Tatiana T Marquez-Lago
  - Department of Genetics and the Department of Cell, Developmental and Integrative Biology, UAB School of Medicine, USA
- Jian Li
  - Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Australia
- Trevor Lithgow
  - Department of Microbiology at Monash University, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
|
12
|
Chen C, Zhang Q, Yu B, Yu Z, Lawrence PJ, Ma Q, Zhang Y. Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Comput Biol Med 2020; 123:103899. [DOI: 10.1016/j.compbiomed.2020.103899] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 01/23/2020] [Revised: 06/28/2020] [Accepted: 06/28/2020] [Indexed: 10/23/2022]
|
13
|
Xu Y, Verma D, Sheridan RP, Liaw A, Ma J, Marshall NM, McIntosh J, Sherer EC, Svetnik V, Johnston JM. Deep Dive into Machine Learning Models for Protein Engineering. J Chem Inf Model 2020; 60:2773-2790. [PMID: 32250622 DOI: 10.1021/acs.jcim.0c00073] [Citation(s) in RCA: 90] [Impact Index Per Article: 22.5] [Indexed: 12/28/2022]
Abstract
Protein redesign and engineering have become important tasks in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that convolutional neural network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.
Affiliation(s)
- Yuting Xu
  - Biometrics Research, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Deeptak Verma
  - Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Robert P Sheridan
  - Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Andy Liaw
  - Biometrics Research, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Junshui Ma
  - Early Oncology Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- John McIntosh
  - Process Research & Development, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Edward C Sherer
  - Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Vladimir Svetnik
  - Biometrics Research, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Jennifer M Johnston
  - Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
|
14
|
RF-MaloSite and DL-Malosite: Methods based on random forest and deep learning to identify malonylation sites. Comput Struct Biotechnol J 2020; 18:852-860. [PMID: 32322367 PMCID: PMC7160427 DOI: 10.1016/j.csbj.2020.02.012] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/16/2019] [Revised: 01/27/2020] [Accepted: 02/19/2020] [Indexed: 12/19/2022]
Abstract
Malonylation, which has recently emerged as an important lysine modification, regulates diverse biological activities and has been implicated in several pervasive disorders, including cardiovascular disease and cancer. However, conventional global proteomics analysis using tandem mass spectrometry can be time-consuming, expensive and technically challenging. Therefore, to complement and extend existing experimental methods for malonylation site identification, we developed two novel computational methods for malonylation site prediction based on random forest and deep learning algorithms, RF-MaloSite and DL-MaloSite, respectively. DL-MaloSite requires the primary amino acid sequence as an input and RF-MaloSite utilizes a diverse set of biochemical, physicochemical and sequence-based features. While systematic assessment of performance metrics suggests that both ‘RF-MaloSite’ and ‘DL-MaloSite’ perform well in all metrics tested, our methods perform particularly well in the areas of accuracy, sensitivity and overall method performance (assessed by the Matthews correlation coefficient, MCC). For instance, RF-MaloSite exhibited MCC scores of 0.42 and 0.40 using 10-fold cross-validation and an independent test set, respectively. Meanwhile, DL-MaloSite was characterized by MCC scores of 0.51 and 0.49 based on 10-fold cross-validation and an independent set, respectively. Importantly, both methods exhibited efficiency scores that were on par with or better than those achieved by existing malonylation site prediction methods. The identification of these sites may also provide important insights into the mechanisms of crosstalk between malonylation and other lysine modifications, such as acetylation, glutarylation and succinylation. To facilitate their use, both methods have been made freely available to the research community at https://github.com/dukkakc/DL-MaloSite-and-RF-MaloSite.
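For readers unfamiliar with the Matthews correlation coefficient reported above, the following sketch computes it from the four cells of a binary confusion matrix using the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name is an illustrative choice, and returning 0 for an empty marginal is a common convention rather than something stated in the abstract.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when a row or column of the confusion matrix is empty.
    return numerator / denominator if denominator else 0.0
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and it remains informative when positive and negative classes differ greatly in size, which is typical for modification-site datasets.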
|
15
|
Huang KY, Hsu JBK, Lee TY. Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method. Sci Rep 2019; 9:16175. [PMID: 31700141 PMCID: PMC6838336 DOI: 10.1038/s41598-019-52552-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/09/2019] [Accepted: 10/18/2019] [Indexed: 12/14/2022]
Abstract
Succinylation is a type of protein post-translational modification (PTM) that plays important roles in a variety of cellular processes. Owing to the increasing number of site-specific succinylated peptides obtained from high-throughput mass spectrometry (MS), various tools have been developed for computationally identifying succinylated sites on proteins. However, most of these tools predict succinylation sites using traditional machine learning methods. Hence, this work aimed to carry out succinylation site prediction with a deep learning model. The abundance of MS-verified succinylated peptides enabled the investigation of substrate site specificity of succinylation sites through sequence-based attributes, such as position-specific amino acid composition, the composition of k-spaced amino acid pairs (CKSAAP), and position-specific scoring matrix (PSSM). Additionally, maximal dependence decomposition (MDD) was adopted to detect the substrate signatures of lysine succinylation sites by dividing all succinylated sequences into several groups with conserved substrate motifs. According to the results of ten-fold cross-validation, the deep learning model trained using PSSM and informative CKSAAP attributes reached the best predictive performance and also performed better than traditional machine-learning methods. Moreover, an independent testing dataset that did not overlap with the training dataset was used to compare the proposed method with six existing prediction tools. The testing dataset comprised 218 positive and 2621 negative instances, and the proposed model yielded a promising performance with 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and an MCC value of 0.489. Finally, the proposed method has been implemented as a web-based prediction tool (CNN-SuccSite), which is now freely accessible at http://csb.cse.yzu.edu.tw/CNN-SuccSite/.
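The CKSAAP encoding mentioned in this abstract can be sketched as below, following the commonly used definition: for each gap k, count every ordered residue pair separated by k positions and divide by the number of such pairs in the sequence. This is an illustrative sketch rather than the authors' code; the function name, the feature-key format, and the default `k_max` of 2 are assumptions.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence: str, k_max: int = 2) -> Dict[str, float]:
    """Frequencies of residue pairs separated by k = 0..k_max positions."""
    features: Dict[str, float] = {}
    for k in range(k_max + 1):
        # 400 possible ordered pairs for each gap size.
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        windows = len(sequence) - k - 1  # number of pairs with gap k
        for i in range(max(windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # ignore pairs with non-standard residues
                counts[pair] += 1
        for pair, count in counts.items():
            features[f"{pair}.gap{k}"] = count / windows if windows > 0 else 0.0
    return features
```

Each gap contributes 400 features, so `k_max = 2` yields a 1200-dimensional vector. For the toy sequence `"ACAC"`, the adjacent pairs are AC, CA, AC, giving `AC.gap0 = 2/3`, and the pairs one residue apart are AA and CC, giving `AA.gap1 = 0.5`.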
Affiliation(s)
- Kai-Yao Huang
  - Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu city, 300, Taiwan
- Justin Bo-Kai Hsu
  - Department of Medical Research, Taipei Medical University Hospital, Taipei city, 110, Taiwan
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 518172, China
  - School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, 518172, China
|
16
|
AL-barakati HJ, Saigo H, Newman RH, KC DB. RF-GlutarySite: a random forest based predictor for glutarylation sites. Mol Omics 2019; 15:189-204. [DOI: 10.1039/c9mo00028c] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 12/12/2022]
Abstract
Glutarylation, which is a newly identified posttranslational modification that occurs on lysine residues, has recently emerged as an important regulator of several metabolic and mitochondrial processes. Here, we describe the development of RF-GlutarySite, a random forest-based predictor designed to predict glutarylation sites based on protein primary amino acid sequence.
Affiliation(s)
- Hussam J. AL-barakati
  - Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, USA
- Hiroto Saigo
  - Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
- Robert H. Newman
  - Department of Biology, North Carolina Agricultural & Technical State University, Greensboro, USA
- Dukka B. KC
  - Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, USA
|
17
|
Schönbach C, Li J, Ma L, Horton P, Sjaugi MF, Ranganathan S. A bioinformatics potpourri. BMC Genomics 2018; 19:920. [PMID: 29363432 PMCID: PMC5780851 DOI: 10.1186/s12864-017-4326-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
The 16th International Conference on Bioinformatics (InCoB) was held at Tsinghua University, Shenzhen from September 20 to 22, 2017. The annual conference of the Asia-Pacific Bioinformatics Network featured six keynotes, two invited talks, a panel discussion on big data driven bioinformatics and precision medicine, and 66 oral presentations of accepted research articles or posters. Fifty-seven articles comprising a topic assortment of algorithms, biomolecular networks, cancer and disease informatics, drug-target interactions and drug efficacy, gene regulation and expression, imaging, immunoinformatics, metagenomics, next generation sequencing for genomics and transcriptomics, ontologies, post-translational modification, and structural bioinformatics are the subject of this editorial for the InCoB2017 supplement issues in BMC Genomics, BMC Bioinformatics, BMC Systems Biology and BMC Medical Genomics. New Delhi will be the location of InCoB2018, scheduled for September 26-28, 2018.
Affiliation(s)
- Christian Schönbach
  - International Research Center for Medical Sciences, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, 860-0811 Japan
- Jinyan Li
  - The Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007 Australia
- Lan Ma
  - Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055 People’s Republic of China
- Paul Horton
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, 135-0064 Japan
- Shoba Ranganathan
  - Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109 Australia
|