1
Al-Zubayer MA, Alam K, Shanto HH, Maniruzzaman M, Majumder UK, Ahammed B. Machine learning models for prediction of double and triple burdens of non-communicable diseases in Bangladesh. J Biosoc Sci 2024; 56:426-444. [PMID: 38505939 DOI: 10.1017/s0021932024000063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/21/2024]
Abstract
Increasing prevalence of non-communicable diseases (NCDs) has become the leading cause of death and disability in Bangladesh. Therefore, this study aimed to measure the prevalence of and risk factors for double and triple burden of NCDs (DBNCDs and TBNCDs), considering diabetes, hypertension, and overweight and obesity as well as establish a machine learning approach for predicting DBNCDs and TBNCDs. A total of 12,151 respondents from the 2017 to 2018 Bangladesh Demographic and Health Survey were included in this analysis, where 10%, 27.4%, and 24.3% of respondents had diabetes, hypertension, and overweight and obesity, respectively. Chi-square test and multilevel logistic regression (LR) analysis were applied to select factors associated with DBNCDs and TBNCDs. Furthermore, six classifiers including decision tree (DT), LR, naïve Bayes (NB), k-nearest neighbour (KNN), random forest (RF), and extreme gradient boosting (XGBoost) with three cross-validation protocols (K2, K5, and K10) were adopted to predict the status of DBNCDs and TBNCDs. The classification accuracy (ACC) and area under the curve (AUC) were computed for each protocol and repeated 10 times to make them more robust, and then the average ACC and AUC were computed. The prevalence of DBNCDs and TBNCDs was 14.3% and 2.3%, respectively. The findings of this study revealed that DBNCDs and TBNCDs were significantly influenced by age, sex, marital status, wealth index, education and geographic region. Compared to other classifiers, the RF-based classifier provides the highest ACC and AUC for both DBNCDs (ACC = 81.06% and AUC = 0.93) and TBNCDs (ACC = 88.61% and AUC = 0.97) for the K10 protocol. A combination of considered two-step factor selections and RF-based classifier can better predict the burden of NCDs. The findings of this study suggested that decision-makers might adopt suitable decisions to control and prevent the burden of NCDs using RF classifiers.
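The evaluation scheme described in this abstract (K2, K5, and K10 partition protocols, each repeated 10 times, with the scores averaged) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names, the toy data, and the majority-class stand-in classifier are all assumptions, and any real classifier (such as a random forest) would take the stand-in's place.

```python
import random
from statistics import mean


def kfold_splits(n_samples, k, rng):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold partition."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def majority_classifier(train_labels):
    """Hypothetical stand-in model: always predicts the most common training label."""
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return lambda _features: majority


def repeated_cv_accuracy(features, labels, k, repeats=10, seed=0):
    """Average fold accuracy over `repeats` independent k-fold partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for train_idx, test_idx in kfold_splits(len(labels), k, rng):
            predict = majority_classifier([labels[j] for j in train_idx])
            correct = sum(predict(features[j]) == labels[j] for j in test_idx)
            scores.append(correct / len(test_idx))
    return mean(scores)


# Toy data (not the survey data): 15 samples of class 0 and 5 of class 1.
toy_features = [[i] for i in range(20)]
toy_labels = [0] * 15 + [1] * 5
for k in (2, 5, 10):  # the K2, K5, and K10 protocols
    print(f"K{k}: mean ACC = {repeated_cv_accuracy(toy_features, toy_labels, k):.3f}")
```

Repeating the partitioning with fresh shuffles reduces the dependence of the reported score on any single random split, which is presumably what the abstract means by making the results "more robust".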
Affiliation(s)
- Khorshed Alam
  - School of Business, University of Southern Queensland, Toowoomba, QLD, Australia
  - Centre for Health Research, University of Southern Queensland, Toowoomba, QLD, Australia
- Md Maniruzzaman
  - Statistics Discipline, Khulna University, Khulna, Bangladesh
- Benojir Ahammed
  - Statistics Discipline, Khulna University, Khulna, Bangladesh
2
Lone IM, Midlej K, Nun NB, Iraqi FA. Intestinal cancer development in response to oral infection with high-fat diet-induced Type 2 diabetes (T2D) in collaborative cross mice under different host genetic background effects. Mamm Genome 2023; 34:56-75. [PMID: 36757430 DOI: 10.1007/s00335-023-09979-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/12/2022] [Accepted: 01/20/2023] [Indexed: 02/10/2023]
Abstract
Type 2 diabetes (T2D) is a metabolic disease with an imbalance in blood glucose concentration. Several studies currently show an association between T2D and intestinal cancer development. High-fat diet (HFD) plays a part in the development of T2D, intestinal cancer and infectious diseases through many biological mechanisms, including but not limited to inflammation. Understanding the system genetics of the multimorbidity of these diseases will provide important knowledge and a platform for dissecting their complexity. Furthermore, in this study we used some machine learning (ML) models to explore more aspects of diabetes mellitus. The ultimate aim of this project is to study the genetic factors that underlie T2D development, associated with intestinal cancer in response to HFD consumption and oral coinfection, jointly or separately, on the same host genetic background. A cohort of 307 mice of eight different CC mouse lines in the four experimental groups was assessed. The mice were maintained on either HFD or chow diet (CHD) for a 12-week period, while half of each dietary group was either coinfected with oral bacteria or uninfected. Host response to a glucose load and clearance was assessed using the intraperitoneal glucose tolerance test (IPGTT) at two time points (weeks 6 and 12) during the experiment period and, subsequently, was translated to area under curve (AUC) values. At week 5 of the experiment, mice in groups two and four were coinfected with Porphyromonas gingivalis (Pg) and Fusobacterium nucleatum (Fn) strains, three times a week, while the other, uninfected mice were kept as a control group. At week 12, mice were killed, the small intestines and colon were extracted, and the polyp counts were subsequently assessed; intestine length and size were also measured.
Our results have shown that there is a significant variation in polyp number among the CC lines, with a spectrum between 2.5 and 12.8 total polyps on average. There was a significant correlation between area under curve (AUC) and intestine measurements, including polyp counts, length and size. In addition, our results have shown a significant sex effect on polyp development and glucose tolerance ability, with males more susceptible to HFD than females, as shown by a higher AUC in the glucose tolerance test. The ML results showed that classification with random forest could reach the highest accuracy when all the attributes were used. These results provide an excellent platform for proceeding toward understanding the nature of the genes involved in resistance and rate of development of intestinal cancer and T2D induced by HFD and oral coinfection. Once obtained, such data can be used to predict individual risk for developing these diseases and to establish a genetically based strategy for their prevention and treatment.
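The abstract states that glucose tolerance test readings were translated to AUC values but does not say how. A common choice is the trapezoidal rule, sketched below under that assumption; the function name, the sampling times, and the readings are hypothetical and are not data from the study.

```python
def glucose_auc(times_min, glucose_mg_dl):
    """Trapezoidal area under a glucose-time curve (units: mg/dL * min)."""
    if len(times_min) != len(glucose_mg_dl) or len(times_min) < 2:
        raise ValueError("need at least two matching time/glucose points")
    area = 0.0
    for i in range(1, len(times_min)):
        width = times_min[i] - times_min[i - 1]
        # Area of the trapezoid between two consecutive readings.
        area += width * (glucose_mg_dl[i] + glucose_mg_dl[i - 1]) / 2.0
    return area


# Hypothetical IPGTT readings at 0, 15, 30, 60 and 120 minutes.
times = [0, 15, 30, 60, 120]
glucose = [100, 250, 300, 200, 120]
print(glucose_auc(times, glucose))  # 23850.0
```

A higher AUC means glucose stayed elevated for longer after the load, which is how the abstract interprets the difference between males and females.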
Affiliation(s)
- Iqbal M Lone
  - Department of Clinical Microbiology and Immunology, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv, 69978, Tel-Aviv, Israel
- Kareem Midlej
  - Department of Clinical Microbiology and Immunology, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv, 69978, Tel-Aviv, Israel
- Nadav Ben Nun
  - Department of Clinical Microbiology and Immunology, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv, 69978, Tel-Aviv, Israel
- Fuad A Iraqi
  - Department of Clinical Microbiology and Immunology, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv, 69978, Tel-Aviv, Israel
3
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
SNARE proteins are of great importance, and their loss of function can lead to a variety of diseases. SNARE proteins are membrane fusion proteins that are crucial for mediating vesicle fusion. Their identification therefore requires an accurate method. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor (GHKNN) model. The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample them. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the GHKNN-based binary classification model with other classifiers using four metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
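The four metrics named in this abstract have standard definitions in terms of a binary confusion matrix: SN (sensitivity), SP (specificity), ACC (accuracy), and MCC (Matthews correlation coefficient). The sketch below computes them from raw counts. The function name and the example counts are illustrative only and do not come from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return SN, SP, ACC and MCC from confusion-matrix counts.

    Assumes both classes are present, so the SN and SP denominators are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity: true-positive rate
    sp = tn / (tn + fp)                      # specificity: true-negative rate
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when its denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for 100 test sequences.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name} = {value:.3f}")
```

MCC is often reported alongside ACC for imbalanced protein datasets because it stays near zero for a classifier that simply predicts the majority class.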
Affiliation(s)
- Xingyue Gu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pengfeng Xiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Tao He
  - Beidahuang Industry Group General Hospital, Harbin, China
4
Alzahrani E, Alghamdi W, Ullah MZ, Khan YD. Identification of stress response proteins through fusion of machine learning models and statistical paradigms. Sci Rep 2021; 11:21767. [PMID: 34741132 PMCID: PMC8571424 DOI: 10.1038/s41598-021-99083-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/18/2021] [Accepted: 09/13/2021] [Indexed: 11/08/2022] Open
Abstract
Proteins are a vital component of cells that perform physiological functions to ensure smooth operations of bodily functions. Identification of a protein's function involves a detailed understanding of the structure of proteins. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic lineages and demonstrate varied crucial functional activities inside a cell. The in-vivo, ex-vivo, and in-vitro identification of stress proteins is a time-consuming and costly task. This study is aimed at the identification of stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results with 91.1% accuracy, while models based on neural network and support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on evaluation results, it was concluded that the random-forest-based classifier surpassed all other predictors and is suitable for use in practical applications for the identification of stress proteins. A live web server is available at http://biopred.org/stressprotiens, while the webserver code is available at https://github.com/abdullah5naveed/SRP_WebServer.git.
Affiliation(s)
- Ebraheem Alzahrani
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah, 21589, Saudi Arabia
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P. O. Box 80221, Jeddah, 21589, Saudi Arabia
- Malik Zaka Ullah
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah, 21589, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, University of Management and Technology, Lahore, 54770, Pakistan
5
Qiu W, Lv Z, Xiao X, Shao S, Lin H. EMCBOW-GPCR: A method for identifying G-protein coupled receptors based on word embedding and wordbooks. Comput Struct Biotechnol J 2021; 19:4961-4969. [PMID: 34527200 PMCID: PMC8437786 DOI: 10.1016/j.csbj.2021.08.044] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Revised: 08/07/2021] [Accepted: 08/27/2021] [Indexed: 11/15/2022] Open
Abstract
A computational method was developed to identify G-protein coupled receptors. Three word-embedding models and a bag-of-words model are used to extract original features. High accuracy was achieved by using fusion information. A powerful tool was established.
G Protein-Coupled Receptors (GPCRs) are one of the largest membrane protein receptor families in humans and are also important targets for many drugs. Hence, it is of great significance to judge whether a protein is a GPCR or not. However, identifying GPCRs by experimental methods is very expensive and time-consuming. As more and more GPCR primary sequences accumulate, it is feasible to develop a computational model to predict GPCRs precisely and quickly. In this paper, a novel method called EMCBOW-GPCR is proposed to improve the accuracy of identifying GPCRs based on natural language processing (NLP). To represent GPCRs, three word-embedding models and a bag-of-words model are used to extract original features. The original features are then passed to a deep-learning algorithm to extract further features and reduce the dimension. Finally, the obtained features are fed into Extreme Gradient Boosting. As the comparison of results shows, the overall prediction metrics of EMCBOW-GPCR are higher than those of the state of the art. To make EMCBOW-GPCR convenient for more researchers to use, the method and source code have been released on GitHub at https://github.com/454170054/EMCBOW-GPCR, and a user-friendly web server for EMCBOW-GPCR has been established at http://www.jci-bioinfo.cn/emcbowgpcr.
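To make the bag-of-words idea concrete, a protein sequence can be treated as text whose "words" are short residue fragments (k-mers), with word counts as the feature vector. The sketch below makes that assumption explicitly: the abstract does not describe how the wordbook is built, so the overlapping k-mer tokenisation, the function name, and the toy sequence are illustrative only.

```python
from collections import Counter


def kmer_bag_of_words(sequence, k=2):
    """Count overlapping k-mers ("words") in a protein sequence."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return Counter(words)


# Toy sequence, not a real GPCR.
counts = kmer_bag_of_words("MKMKA", k=2)
print(dict(counts))  # {'MK': 2, 'KM': 1, 'KA': 1}
```

Such count vectors lose word order, which is one reason a method might combine them with embedding-based features, as this abstract describes.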
Affiliation(s)
- Wangren Qiu
  - School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Zhe Lv
  - School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xuan Xiao
  - School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Shuai Shao
  - School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Hao Lin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
6
Karimi S, Ahmadi M, Goudarzi F, Ferdousi R. A computational model for GPCR-ligand interaction prediction. J Integr Bioinform 2020; 18:155-165. [PMID: 34171942 PMCID: PMC7790179 DOI: 10.1515/jib-2019-0084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2019] [Accepted: 11/25/2020] [Indexed: 11/25/2022] Open
Abstract
G protein-coupled receptors (GPCRs) play an essential role in critical human activities, and they are considered targets for a wide range of drugs. Because of these crucial roles, GPCRs are a major focus of pharmaceutical research, and there are many investigations of GPCRs. Experimental laboratory research is very costly in terms of time and expense, and accordingly, there is a marked tendency to use computational methods as an alternative. In this study, a prediction model based on machine learning (ML) approaches was developed to predict GPCR-ligand interactions. Decision tree (DT), random forest (RF), multilayer perceptron (MLP), support vector machine (SVM), and Naive Bayes (NB) were the algorithms investigated in this study. After several optimization steps, the receiver operating characteristic (ROC) values for the DT, RF, MLP, SVM, and NB algorithms were 95.2, 98.1, 96.3, 95.5, and 97.3, respectively. Accordingly, the final model was based on the RF algorithm. Compared with other computational studies, this one focused on a specific and important type of protein (GPCR) interaction and examined different types of sequence-based features to obtain more accurate results. Drug science researchers could widely use the prediction model developed in this study. The developed predictor was applied to 16,132 GPCR-ligand pairs, and about 6778 potential interactions were predicted.
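The ROC figures quoted in this abstract are presumably areas under the ROC curve expressed as percentages, although the abstract does not say so explicitly. Under that assumption, the sketch below shows one standard way to compute the area: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name, labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical interaction labels (1 = interacting pair) and classifier scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(f"AUC = {100 * roc_auc(toy_labels, toy_scores):.1f}%")
```

The pairwise loop is quadratic in the number of examples, so for large sets such as the 16,132 pairs mentioned above, a rank-based implementation would be the more practical choice.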
Affiliation(s)
- Shiva Karimi
  - Health Information Management Department, Paramedical School, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Maryam Ahmadi
  - Department of Health Information Management, School of Management and Medical Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Farjam Goudarzi
  - Regenerative Medicine Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Reza Ferdousi
  - Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran
7
Paki R, Nourani E, Farajzadeh D. Classification of G protein-coupled receptors using attention mechanism. GENE REPORTS 2020. [DOI: 10.1016/j.genrep.2020.100882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
8
Feng Z, Liang T, Wang S, Chen M, Hou T, Zhao J, Chen H, Zhou Y, Xie XQ. Binding Characterization of GPCRs-Modulator by Molecular Complex Characterizing System (MCCS). ACS Chem Neurosci 2020; 11:3333-3345. [PMID: 32941011 PMCID: PMC10063373 DOI: 10.1021/acschemneuro.0c00457] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 02/07/2023] Open
Abstract
Increasing attention has been devoted to allosteric modulators as the preferred therapeutic agents for their colossal advantages such as higher selectivity, fewer side effects, and lower toxicity since they bind at allosteric sites that are topographically distinct from the classic orthosteric sites. However, the allosteric binding pockets are not conserved and there are no cogent methods to comprehensively characterize the features of allosteric sites with the binding of modulators. To overcome this limitation, our lab has developed a novel algorithm that can quantitatively characterize the receptor-ligand binding feature named Molecular Complex Characterizing System (MCCS). To illustrate the methodology and application of MCCS, we take G protein coupled receptors (GPCRs) as an example. First, we summarized and analyzed the reported allosteric binding pockets of class A GPCRs using MCCS. Sequentially, a systematic study was conducted between cannabinoid receptor type 1 (CB1) and its allosteric modulators, where we used MCCS to analyze the residue energy contribution and the interaction pattern. Finally, we validated the predicted allosteric binding site in CB2 via MCCS in combination with molecular dynamics (MD) simulation. Our results demonstrate that the MCCS program is advantageous in recapitulating the allosteric regulation pattern of class A GPCRs of the reported pockets as well as in predicting potential allosteric binding pockets. This MCCS program can serve as a valuable tool for the discovery of small-molecule allosteric modulators for class A GPCRs.
Affiliation(s)
- Zhiwei Feng
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Tianjian Liang
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Siyi Wang
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Maozi Chen
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Tianling Hou
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Jack Zhao
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Hui Chen
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Yuehan Zhou
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Xiang-Qun Xie
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research; Drug Discovery Institute; Departments of Computational Biology and Structural Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
9
Sampa MB, Hossain MN, Hoque MR, Islam R, Yokota F, Nishikitani M, Ahmed A. Blood Uric Acid Prediction With Machine Learning: Model Development and Performance Comparison. JMIR Med Inform 2020; 8:e18331. [PMID: 33030442 PMCID: PMC7582147 DOI: 10.2196/18331] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/20/2020] [Revised: 07/16/2020] [Accepted: 08/10/2020] [Indexed: 02/06/2023] Open
Abstract
Background Uric acid is associated with noncommunicable diseases such as cardiovascular diseases, chronic kidney disease, coronary artery disease, stroke, diabetes, metabolic syndrome, vascular dementia, and hypertension. Therefore, uric acid is considered to be a risk factor for the development of noncommunicable diseases. Most studies on uric acid have been performed in developed countries, and the application of machine-learning approaches in uric acid prediction in developing countries is rare. Different machine-learning algorithms will work differently on different types of data in various diseases; therefore, a different investigation is needed for different types of data to identify the most accurate algorithms. Specifically, no study has yet focused on the urban corporate population in Bangladesh, despite the high risk of developing noncommunicable diseases for this population. Objective The aim of this study was to develop a model for predicting blood uric acid values based on basic health checkup test results, dietary information, and sociodemographic characteristics using machine-learning algorithms. The prediction of health checkup test measurements can be very helpful to reduce health management costs. Methods Various machine-learning approaches were used in this study because clinical input data are not completely independent and exhibit complex interactions. Conventional statistical models have limitations to consider these complex interactions, whereas machine learning can consider all possible interactions among input data. We used boosted decision tree regression, decision forest regression, Bayesian linear regression, and linear regression to predict personalized blood uric acid based on basic health checkup test results, dietary information, and sociodemographic characteristics. We evaluated the performance of these five widely used machine-learning models using data collected from 271 employees in the Grameen Bank complex of Dhaka, Bangladesh. 
Results The mean uric acid level was 6.63 mg/dL, indicating a borderline result for the majority of the sample (normal range <7.0 mg/dL). Therefore, these individuals should be monitoring their uric acid regularly. The boosted decision tree regression model showed the best performance among the models tested based on the root mean squared error of 0.03, which is also better than that of any previously reported model. Conclusions A uric acid prediction model was developed based on personal characteristics, dietary information, and some basic health checkup measurements. This model will be useful for improving awareness among high-risk individuals and populations, which can help to save medical costs. A future study could include additional features (eg, work stress, daily physical activity, alcohol intake, eating red meat) in improving prediction.
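The regression models in this abstract are compared by root mean squared error (RMSE), the square root of the average squared difference between predicted and measured values. A minimal sketch follows. The function name and the uric acid values are hypothetical and are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between measured and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical uric acid measurements and model predictions, in mg/dL.
measured = [6.0, 7.0, 5.0, 8.0]
predicted = [6.5, 6.5, 5.5, 7.5]
print(rmse(measured, predicted))  # 0.5
```

RMSE is expressed in the same unit as the target, so a value can be read directly against the clinical range quoted in the abstract.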
Affiliation(s)
- Masuda Begum Sampa
  - Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
- Md Nazmul Hossain
  - Department of Marketing, Faculty of Business Studies, University of Dhaka, Dhaka, Bangladesh
- Md Rakibul Hoque
  - School of Business, Emporia State University, Kansas, KS, United States
- Rafiqul Islam
  - Medical Information Center, Kyushu University Hospital, Fukuoka, Japan
- Fumihiko Yokota
  - Institute of Decision Science for a Sustainable Society, Kyushu University, Fukuoka, Japan
- Ashir Ahmed
  - Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
10
Tripathi V, Tripathi P. Detecting antimicrobial peptides by exploring the mutual information of their sequences. J Biomol Struct Dyn 2020; 38:5037-5043. [PMID: 31760879 DOI: 10.1080/07391102.2019.1695667] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
Abstract
The rise of antibiotic resistance in pathogenic bacteria is a growing concern for every part of the world. The present study shows the prediction efficiency of mutual information for the classification of antimicrobial peptides. The proven role of antimicrobial peptides (AMPs) to fight against multidrug-resistant pathogens and AMP's low toxic properties laid the foundation of computational methods to play their role in detecting AMPs from non-AMPs. Mutual information vectors (MIV) were created for AMP/non-AMP sequences and then fed to different machine learning classifiers out of which a random forest (RF) classifier showed best results for predicting AMPs. Random forest classifiers were evaluated on benchmark datasets by 10-fold cross-validation. The proposed MIV-RF method showed better prediction accuracy, MCC (Matthews correlation coefficient), and AUC-ROC (Area Under The Curve-Receiver Operating Characteristics) than available methods for detecting AMPs. Communicated by Ramaswamy H. Sarma.
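The abstract does not spell out how the mutual information vectors are constructed. As an illustration of the underlying quantity only, the sketch below estimates the mutual information between the residues at adjacent positions of a single sequence, I = Σ p(a,b) · log2[p(a,b) / (p(a) · p(b))]. Restricting the estimate to adjacent positions, the function name, and the toy sequences are assumptions made here and are not the authors' method.

```python
import math
from collections import Counter


def adjacent_mutual_information(sequence):
    """Mutual information (in bits) between residues at positions i and i+1."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        raise ValueError("sequence must contain at least two residues")
    total = len(pairs)
    joint = Counter(pairs)                    # counts of (first, second) pairs
    first = Counter(a for a, _ in pairs)      # marginal counts of the first residue
    second = Counter(b for _, b in pairs)     # marginal counts of the second residue
    mi = 0.0
    for (a, b), n in joint.items():
        p_ab = n / total
        mi += p_ab * math.log2(p_ab / ((first[a] / total) * (second[b] / total)))
    return mi


# Toy sequences, not real peptides.
print(adjacent_mutual_information("ABABABAB"))  # each residue fully determines the next
print(adjacent_mutual_information("AAAA"))      # no variation, so 0 bits
```

A feature vector could be assembled by repeating the estimate for several position gaps, but whether the cited method does this is not stated in the abstract.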
Affiliation(s)
- Vijay Tripathi
  - Department of Molecular and Cellular Engineering, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture, Technology and Sciences, Prayagraj, India
- Pooja Tripathi
  - Department of Computational Biology & Bioinformatics, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture, Technology and Sciences, Prayagraj, India
11
Chen W, Nie F, Ding H. Recent Advances of Computational Methods for Identifying Bacteriophage Virion Proteins. Protein Pept Lett 2020; 27:259-264. [PMID: 30968770 DOI: 10.2174/0929866526666190410124642] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/20/2019] [Revised: 03/07/2019] [Accepted: 04/01/2019] [Indexed: 01/09/2023]
Abstract
Phage Virion Proteins (PVP) are essential components of bacteriophages and participate in a series of biological processes. Accurate identification of phage virion proteins helps in understanding the mechanism of interaction between the phage and its host bacteria. Since experimental methods are labor-intensive and time-consuming, many computational approaches have been proposed in the past few years to identify phage virion proteins. To help researchers select appropriate methods, it is necessary to give a comprehensive review and comparison of existing computational methods for identifying phage virion proteins. In this review, we summarized the existing computational methods for identifying phage virion proteins and also assessed their performance on an independent dataset. Finally, challenges and future perspectives for identifying phage virion proteins were presented. Taken together, we hope that this review can provide guidance for research on phage virion proteins.
Affiliation(s)
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
- Fulei Nie
  - Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
- Hui Ding
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
12
Gu X, Chen Z, Wang D. Prediction of G Protein-Coupled Receptors With CTDC Extraction and MRMD2.0 Dimension-Reduction Methods. Front Bioeng Biotechnol 2020; 8:635. [PMID: 32671038 PMCID: PMC7329982 DOI: 10.3389/fbioe.2020.00635] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/21/2020] [Accepted: 05/26/2020] [Indexed: 11/13/2022] Open
Abstract
The G Protein-Coupled Receptor (GPCR) family consists of more than 800 different members. In this article, we attempt to use the physicochemical properties of Composition, Transition, Distribution (CTD) to represent GPCRs. The dimensionality reduction method of MRMD2.0 filters the physicochemical properties of GPCR redundancy. Matplotlib plots the coordinates to distinguish GPCRs from other protein sequences. The chart data show a clear distinction effect, and there is a well-defined boundary between the two. The experimental results show that our method can predict GPCRs.
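The composition part of a CTD descriptor (the "C" in CTDC) can be illustrated as the fraction of residues falling into each of three classes for a given physicochemical property. The sketch below uses a three-class hydrophobicity grouping that is commonly seen in CTD implementations. Whether the paper uses this exact grouping is an assumption, and the function name and the toy sequence are illustrative.

```python
# Assumed three-class hydrophobicity grouping; the paper's exact classes may differ.
HYDROPHOBICITY_GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def ctd_composition(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Fraction of residues belonging to each property class."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(residue in members for residue in sequence) / length
        for name, members in groups.items()
    }


# Toy sequence with two residues from each class; not a real GPCR.
print(ctd_composition("RKGALV"))
```

Repeating the calculation for several properties yields a fixed-length vector regardless of sequence length, which is what makes such descriptors convenient inputs for dimension reduction.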
Affiliation(s)
- Xingyue Gu
  - Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Zhihua Chen
  - Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
- Donghua Wang
  - Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
13
|
Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020; 8:7. [PMID: 31949894 DOI: 10.1007/s13755-019-0095-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/21/2019] [Accepted: 12/21/2019] [Indexed: 12/19/2022]
Abstract
Background and objectives: Diabetes is a chronic disease characterized by high blood sugar. It may cause many complications, such as stroke, kidney failure, and heart attack. About 422 million people worldwide were affected by diabetes in 2014, and the figure is expected to reach 642 million in 2040. The main objective of this study is to develop a machine learning (ML)-based system for predicting diabetic patients. Materials and methods: Logistic regression (LR) is used to identify the risk factors for diabetes based on p value and odds ratio (OR). We adopted four classifiers, naïve Bayes (NB), decision tree (DT), AdaBoost (AB), and random forest (RF), to predict diabetic patients. Three types of partition protocols (K2, K5, and K10) were also adopted, and each protocol was repeated in 20 trials. The performance of these classifiers is evaluated using accuracy (ACC) and area under the curve (AUC). Results: We used a diabetes dataset, collected in 2009-2012, derived from the National Health and Nutrition Examination Survey. The dataset consists of 6561 respondents, with 657 diabetic patients and 5904 controls. The LR model demonstrates that 7 of the 14 factors (age, education, BMI, systolic BP, diastolic BP, direct cholesterol, and total cholesterol) are risk factors for diabetes. The overall ACC of the ML-based system is 90.62%. The combination of LR-based feature selection and the RF-based classifier gives 94.25% ACC and 0.95 AUC for the K10 protocol. Conclusion: The combination of LR-based feature selection and the RF-based classifier performs better. This combination will be very helpful for predicting diabetic patients.
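The partition protocols named in this abstract (K2, K5, and K10, each repeated over several trials with the scores averaged) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions: the toy data are invented, the helper names are hypothetical, a 1-nearest-neighbour rule stands in for the NB, DT, AdaBoost, and RF classifiers of the study, and only accuracy (not AUC) is computed.

```python
import random
from statistics import mean


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in classifier (1-NN); an assumption, not the study's model."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def repeated_kfold_accuracy(X, y, k, repeats, seed=0):
    """Average accuracy over `repeats` independent k-fold partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        correct = 0
        for test_idx in kfold_indices(len(X), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            for i in test_idx:
                if nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]:
                    correct += 1
        scores.append(correct / len(X))
    return mean(scores)


# Hypothetical toy data: two well-separated clusters, six samples per class.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.3, 5.2), (5.1, 4.8), (4.8, 5.0)]
y = [0] * 6 + [1] * 6

# K2, K5 and K10 protocols, each repeated over 20 trials as in the abstract.
results = {k: repeated_kfold_accuracy(X, y, k, repeats=20) for k in (2, 5, 10)}
for k, acc in results.items():
    print(f"K{k}: mean accuracy over 20 trials = {acc:.3f}")
```

In practice a stratified splitter would usually be preferred for a dataset as imbalanced as the one described (657 cases versus 5904 controls), so that every fold contains both classes.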
|
14
|
Xue M, Su Y, Li C, Wang S, Yao H. Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework. J Diabetes Res 2020; 2020:6873891. [PMID: 33029536 PMCID: PMC7532405 DOI: 10.1155/2020/6873891] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/12/2020] [Revised: 08/01/2020] [Accepted: 09/02/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND An estimated 425 million people globally have diabetes, accounting for 12% of the world's health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in remote, underserved areas. METHODS A total of 584,168 adult subjects who participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratio, using logistic regression (LR) based on variables of physical measurement and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the variables' importance scores for T2DM. RESULTS The results indicated that XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F-1 = 0.906, and AUC = 0.968). The variables' importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). CONCLUSIONS We proposed a classifier based on LR-XGBoost which used fourteen easily obtained, noninvasive patient variables as predictors to identify potential incidents of T2DM. The classifier can accurately screen the risk of diabetes in the early phase, and the variables' importance scores give a clue to preventing diabetes occurrence.
Affiliation(s)
- Mingyue Xue
  - Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
  - College of Public Health, Xinjiang Medical University, Urumqi, China
- Yinxia Su
  - College of Public Health, Xinjiang Medical University, Urumqi, China
- Chen Li
  - The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- Shuxia Wang
  - Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
- Hua Yao
  - Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
|
15
|
Lu C, Liu Z, Zhang E, He F, Ma Z, Wang H. MPLs-Pred: Predicting Membrane Protein-Ligand Binding Sites Using Hybrid Sequence-Based Features and Ligand-Specific Models. Int J Mol Sci 2019; 20:ijms20133120. [PMID: 31247932 PMCID: PMC6651575 DOI: 10.3390/ijms20133120] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/08/2019] [Revised: 06/23/2019] [Accepted: 06/23/2019] [Indexed: 02/07/2023]
Abstract
Membrane proteins (MPs) are involved in many essential biomolecule mechanisms as a pivotal factor in enabling the small molecule and signal transport between the two sides of the biological membrane; this is the reason that a large portion of modern medicinal drugs target MPs. Therefore, accurately identifying the membrane protein-ligand binding sites (MPLs) will significantly improve drug discovery. In this paper, we propose a sequence-based MPLs predictor called MPLs-Pred, where evolutionary profiles, topology structure, physicochemical properties, and primary sequence segment descriptors are combined as features applied to a random forest classifier, and an under-sampling scheme is used to enhance the classification capability with imbalanced samples. Additional ligand-specific models were taken into consideration in refining the prediction. The corresponding experimental results based on our method achieved an appreciable performance, with 0.63 MCC (Matthews correlation coefficient) as the overall prediction precision, and those values were 0.604, 0.7, and 0.692, respectively, for the three main types of ligands: drugs, metal ions, and biomacromolecules. MPLs-Pred is freely accessible at http://icdtools.nenu.edu.cn/.
Affiliation(s)
- Chang Lu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Zhe Liu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Enju Zhang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Fei He
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Zhiqiang Ma
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Han Wang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
  - Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
|
16
|
Wei HH, Yang W, Tang H, Lin H. The Development of Machine Learning Methods in Cell-Penetrating Peptides Identification: A Brief Review. Curr Drug Metab 2019; 20:217-223. [DOI: 10.2174/1389200219666181010114750] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/19/2018] [Revised: 05/21/2018] [Accepted: 08/02/2018] [Indexed: 11/22/2022]
Abstract
Background: Cell-penetrating peptides (CPPs) are important short peptides that facilitate cellular intake or uptake of various molecules. CPPs can transport drug molecules through the plasma membrane and send these molecules to different cellular organelles. Thus, CPP identification and related mechanisms have been extensively explored. In order to reveal the penetration mechanisms of a large number of CPPs, it is necessary to develop convenient and fast methods for CPP identification. Methods: Biochemical experiments can provide precise details for accurately identifying CPPs, but these methods are expensive and laborious. To overcome these disadvantages, several computational methods have been developed to identify CPPs. We have performed a review of the development of machine learning methods in CPP identification. This review provides an insight into CPP identification. Results: We summarized the machine learning-based CPP identification methods and compared the construction strategies of 11 different computational methods. Furthermore, we pointed out the limitations and difficulties in predicting CPPs. Conclusion: In this review, the latest studies on CPP identification using machine learning methods were reported. We also discussed the future development direction of CPP recognition with computational methods.
Affiliation(s)
- Huan-Huan Wei
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wuritu Yang
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou, China
- Hao Lin
  - Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
17
|
Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting Diabetes Mellitus With Machine Learning Techniques. Front Genet 2018; 9:515. [PMID: 30459809 PMCID: PMC6232260 DOI: 10.3389/fgene.2018.00515] [Citation(s) in RCA: 188] [Impact Index Per Article: 31.3] [Received: 07/29/2018] [Accepted: 10/12/2018] [Indexed: 12/30/2022]
Abstract
Diabetes mellitus is a chronic disease characterized by hyperglycemia. It may cause many complications. According to the growing morbidity in recent years, the world’s diabetic patients will reach 642 million in 2040, which means that one in ten adults will be suffering from diabetes. There is no doubt that this alarming figure needs great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest and neural network to predict diabetes mellitus. The dataset is the hospital physical examination data in Luzhou, China. It contains 14 attributes. Five-fold cross validation was used to examine the models. In order to verify the universal applicability of the methods, we chose some of the better-performing methods to conduct independent test experiments. We randomly selected 68994 healthy people and diabetic patients’ data, respectively, as the training set. Because the data were imbalanced, we randomly extracted data five times, and the result is the average of these five experiments. We used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest could reach the highest accuracy (ACC = 0.8084) when all the attributes were used.
Affiliation(s)
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Kaiyang Qu
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
- Yamei Luo
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, China
- Dehui Yin
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, China
- Ying Ju
  - School of Information Science and Technology, Xiamen University, Xiamen, China
- Hua Tang
  - Department of Pathophysiology, School of Basic Medicine, Southwest Medical University, Luzhou, China
|
18
|
Uddin R, Jamil F. Prioritization of potential drug targets against P. aeruginosa by core proteomic analysis using computational subtractive genomics and Protein-Protein interaction network. Comput Biol Chem 2018; 74:115-122. [DOI: 10.1016/j.compbiolchem.2018.02.017] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 09/03/2017] [Revised: 01/06/2018] [Accepted: 02/22/2018] [Indexed: 01/12/2023]
|
19
|
Representation Learning for Class C G Protein-Coupled Receptors Classification. Molecules 2018; 23:molecules23030690. [PMID: 29562690 PMCID: PMC6017523 DOI: 10.3390/molecules23030690] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/27/2018] [Revised: 03/14/2018] [Accepted: 03/15/2018] [Indexed: 11/17/2022]
Abstract
G protein-coupled receptors (GPCRs) are integral cell membrane proteins of relevance for pharmacology. The complete tertiary structure including both extracellular and transmembrane domains has not been determined for any member of class C GPCRs. An alternative way to work on GPCR structural models is the investigation of their functionality through the analysis of their primary structure. For this, sequence representation is a key factor for the GPCRs' classification context, where usually, feature engineering is carried out. In this paper, we propose the use of representation learning to acquire the features that best represent the class C GPCR sequences and at the same time to obtain a model for classification automatically. Deep learning methods in conjunction with amino acid physicochemical property indices are then used for this purpose. Experimental results assessed by the classification accuracy, Matthews' correlation coefficient and the balanced error rate show that using a hydrophobicity index and a restricted Boltzmann machine (RBM) can achieve performance results (accuracy of 92.9%) similar to those reported in the literature. As a second proposal, we combine two or more physicochemical property indices instead of only one as the input for a deep architecture in order to add information from the sequences. Experimental results show that using three hydrophobicity-related index combinations helps to improve the classification performance (accuracy of 94.1%) of an RBM better than those reported in the literature for class C GPCRs without using feature selection methods.
|
20
|
Bhadra P, Yan J, Li J, Fong S, Siu SWI. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018; 8:1697. [PMID: 29374199 PMCID: PMC5785966 DOI: 10.1038/s41598-018-19752-w] [Citation(s) in RCA: 149] [Impact Index Per Article: 24.8] [Received: 08/08/2017] [Accepted: 01/03/2018] [Indexed: 02/05/2023]
Abstract
Antimicrobial peptides (AMPs) are promising candidates in the fight against multidrug-resistant pathogens owing to AMPs’ broad range of activities and low toxicity. Nonetheless, identification of AMPs through wet-lab experiments is still expensive and time consuming. Here, we propose an accurate computational method for AMP prediction by the random forest algorithm. The prediction model is based on the distribution patterns of amino acid properties along the sequence. Using our collection of large and diverse sets of AMP and non-AMP data (3268 and 166791 sequences, respectively), we evaluated 19 random forest classifiers with different positive:negative data ratios by 10-fold cross-validation. Our optimal model, AmPEP with the 1:3 data ratio, showed high accuracy (96%), Matthew’s correlation coefficient (MCC) of 0.9, area under the receiver operating characteristic curve (AUC-ROC) of 0.99, and the Kappa statistic of 0.9. Descriptor analysis of AMP/non-AMP distributions by means of Pearson correlation coefficients revealed that reduced feature sets (from a full-featured set of 105 to a minimal-feature set of 23) can result in comparable performance in all respects except for some reductions in precision. Furthermore, AmPEP outperformed existing methods in terms of accuracy, MCC, and AUC-ROC when tested on benchmark datasets.
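Several entries in this list report accuracy, the Matthews correlation coefficient (MCC), and the Kappa statistic side by side. As a minimal sketch of how these three scores follow from a binary confusion matrix, the functions below use the standard textbook definitions; the function names and the example counts are hypothetical and are not taken from the AmPEP study.

```python
from math import sqrt


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and labels.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical counts with a 1:3 positive:negative ratio (100 vs 300 samples).
counts = dict(tp=90, fp=15, tn=285, fn=10)
print("ACC  :", round(accuracy(**counts), 4))
print("MCC  :", round(mcc(**counts), 4))
print("Kappa:", round(cohen_kappa(**counts), 4))
```

Unlike accuracy, MCC and kappa both account for class imbalance, which is presumably why studies with skewed positive:negative ratios report them alongside accuracy.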
Affiliation(s)
- Pratiti Bhadra
  - Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Jielu Yan
  - Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Jinyan Li
  - Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Simon Fong
  - Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Shirley W I Siu
  - Department of Computer and Information Science, University of Macau, Taipa, Macau, China
|
21
|
Zhao YW, Su ZD, Yang W, Lin H, Chen W, Tang H. IonchanPred 2.0: A Tool to Predict Ion Channels and Their Types. Int J Mol Sci 2017; 18:ijms18091838. [PMID: 28837067 PMCID: PMC5618487 DOI: 10.3390/ijms18091838] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 08/07/2017] [Revised: 08/21/2017] [Accepted: 08/21/2017] [Indexed: 12/11/2022]
Abstract
Ion channels (IC) are ion-permeable protein pores located in the lipid membranes of all cells. Different ion channels have unique functions in different biological processes. Due to the rapid development of high-throughput mass spectrometry, proteomic data are rapidly accumulating and provide us an opportunity to systematically investigate and predict ion channels and their types. In this paper, we constructed a support vector machine (SVM)-based model to quickly predict ion channels and their types. By considering the residue sequence information and their physicochemical properties, a novel feature-extraction method which combined dipeptide composition with the physicochemical correlation between two residues was employed. A feature selection strategy was used to improve the performance of the model. Comparison results in jackknife cross-validation demonstrated that our method was superior to other methods for predicting ion channels and their types. Based on the model, we built a web server called IonchanPred which can be freely accessed from http://lin.uestc.edu.cn/server/IonchanPredv2.0.
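The dipeptide-composition part of the feature set mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual definition (the frequency of each of the 400 ordered pairs of adjacent standard residues); it does not include the physicochemical correlation terms or the feature selection step of the paper, and the function and constant names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies.

    Pairs containing a non-standard residue (for example 'X') are skipped,
    which is a simplifying assumption of this sketch.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical toy sequence: the adjacent pairs are AC, CA, AC.
vector = dipeptide_composition("ACAC")
print(len(vector))                      # 400 features
print(vector[DIPEPTIDES.index("AC")])   # frequency of the pair AC
```

A vector of this kind would then be passed to a classifier such as an SVM, typically after feature selection to drop uninformative pairs.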
Affiliation(s)
- Ya-Wei Zhao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen-Dong Su
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Development and Planning Department, Inner Mongolia University, Hohhot 010021, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
|
22
|
Dao FY, Yang H, Su ZD, Yang W, Wu Y, Hui D, Chen W, Tang H, Lin H. Recent Advances in Conotoxin Classification by Using Machine Learning Methods. Molecules 2017; 22:molecules22071057. [PMID: 28672838 PMCID: PMC6152242 DOI: 10.3390/molecules22071057] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 05/17/2017] [Revised: 06/12/2017] [Accepted: 06/19/2017] [Indexed: 11/16/2022]
Abstract
Conotoxins are disulfide-rich small peptides, which are invaluable peptides that target ion channel and neuronal receptors. Conotoxins have been demonstrated as potent pharmaceuticals in the treatment of a series of diseases, such as Alzheimer's disease, Parkinson's disease, and epilepsy. In addition, conotoxins are also ideal molecular templates for the development of new drug lead compounds and play important roles in neurobiological research as well. Thus, the accurate identification of conotoxin types will provide key clues for the biological research and clinical medicine. Generally, conotoxin types are confirmed when their sequence, structure, and function are experimentally validated. However, it is time-consuming and costly to acquire the structure and function information by using biochemical experiments. Therefore, it is important to develop computational tools for efficiently and effectively recognizing conotoxin types based on sequence information. In this work, we reviewed the current progress in computational identification of conotoxins in the following aspects: (i) construction of benchmark dataset; (ii) strategies for extracting sequence features; (iii) feature selection techniques; (iv) machine learning methods for classifying conotoxins; (v) the results obtained by these methods and the published tools; and (vi) future perspectives on conotoxin classification. The paper provides the basis for in-depth study of conotoxins and drug therapy research.
Affiliation(s)
- Fu-Ying Dao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen-Dong Su
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Development and Planning Department, Inner Mongolia University, Hohhot 010021, China
- Yun Wu
  - College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
- Ding Hui
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
23
|
Abstract
Classification problems from different domains vary in complexity, size, and imbalance of the number of samples from different classes. Although several classification models have been proposed, selecting the right model and parameters for a given classification task to achieve good performance is not trivial. Therefore, there is a constant interest in developing novel robust and efficient models suitable for a great variety of data. Here, we propose OmniGA, a framework for the optimization of omnivariate decision trees based on a parallel genetic algorithm, coupled with deep learning structure and ensemble learning methods. The performance of the OmniGA framework is evaluated on 12 different datasets taken mainly from biomedical problems and compared with the results obtained by several robust and commonly used machine-learning models with optimized parameters. The results show that OmniGA systematically outperformed these models for all the considered datasets, reducing the F1 score error in the range from 100% to 2.25%, compared to the best performing model. This demonstrates that OmniGA produces robust models with improved performance. OmniGA code and datasets are available at www.cbrc.kaust.edu.sa/omniga/.
Affiliation(s)
- Arturo Magana-Mora
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center, Thuwal, 23955-6900, Saudi Arabia
- Vladimir B Bajic
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center, Thuwal, 23955-6900, Saudi Arabia
|
24
|
Asako Y, Uesawa Y. High-Performance Prediction of Human Estrogen Receptor Agonists Based on Chemical Structures. Molecules 2017; 22:molecules22040675. [PMID: 28441746 PMCID: PMC6154693 DOI: 10.3390/molecules22040675] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 03/16/2017] [Revised: 04/16/2017] [Accepted: 04/19/2017] [Indexed: 12/20/2022]
Abstract
Many agonists for the estrogen receptor are known to disrupt endocrine functioning. We have developed a computational model that predicts agonists for the estrogen receptor ligand-binding domain in an assay system. Our model was entered into the Tox21 Data Challenge 2014, a computational toxicology competition organized by the National Center for Advancing Translational Sciences. This competition aims to find high-performance predictive models for various adverse-outcome pathways, including the estrogen receptor. Our predictive model, which is based on the random forest method, delivered the best performance in its competition category. In the current study, the predictive performance of the random forest models was improved by strictly adjusting the hyperparameters to avoid overfitting. The random forest models were optimized from 4000 descriptors simultaneously applied to 10,000 activity assay results for the estrogen receptor ligand-binding domain, which have been measured and compiled by Tox21. Owing to the correlation between our model's and the challenge's results, we consider that our model currently possesses the highest predictive power on agonist activity of the estrogen receptor ligand-binding domain. Furthermore, analysis of the optimized model revealed some important features of the agonists, such as the number of hydroxyl groups in the molecules.
Affiliation(s)
- Yuki Asako
  - Department of Clinical Pharmaceutics, Meiji Pharmaceutical University, 2-522-1 Noshio, Kiyose, Tokyo 204-8588, Japan
- Yoshihiro Uesawa
  - Department of Clinical Pharmaceutics, Meiji Pharmaceutical University, 2-522-1 Noshio, Kiyose, Tokyo 204-8588, Japan
|
25
|
Identification of DEP domain-containing proteins by a machine learning method and experimental analysis of their expression in human HCC tissues. Sci Rep 2016; 6:39655. [PMID: 28000796 PMCID: PMC5175133 DOI: 10.1038/srep39655] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 09/28/2016] [Accepted: 11/24/2016] [Indexed: 12/23/2022]
Abstract
The Dishevelled/EGL-10/Pleckstrin (DEP) domain-containing (DEPDC) proteins have seven members. However, whether this superfamily can be distinguished from other proteins based only on the amino acid sequences, remains unknown. Here, we describe a computational method to segregate DEPDCs and non-DEPDCs. First, we examined the Pfam numbers of the known DEPDCs and used the longest sequences for each Pfam to construct a phylogenetic tree. Subsequently, we extracted 188-dimensional (188D) and 20D features of DEPDCs and non-DEPDCs and classified them with random forest classifier. We also mined the motifs of human DEPDCs to find the related domains. Finally, we designed experimental verification methods of human DEPDC expression at the mRNA level in hepatocellular carcinoma (HCC) and adjacent normal tissues. The phylogenetic analysis showed that the DEPDCs superfamily can be divided into three clusters. Moreover, the 188D and 20D features can both be used to effectively distinguish the two protein types. Motif analysis revealed that the DEP and RhoGAP domain was common in human DEPDCs, human HCC and the adjacent tissues that widely expressed DEPDCs. However, their regulation was not identical. In conclusion, we successfully constructed a binary classifier for DEPDCs and experimentally verified their expression in human HCC tissues.
|
26
|
Li Y, Song T, Yang J, Zhang Y, Yang J. An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids. PLoS One 2016; 11:e0167430. [PMID: 27918587 PMCID: PMC5137889 DOI: 10.1371/journal.pone.0167430] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/30/2016] [Accepted: 11/14/2016] [Indexed: 11/30/2022]
Abstract
In this paper, we propose a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset, consisting of the ND5 protein sequences of 9 species, and to the F10 and G11 datasets, representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW on the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector.
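The encoding described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything here is an assumption: adjacent-pair relative frequencies stand in for the Pseudo-Markov transition probabilities, the position ratio is taken as the mean normalized position of each residue, and all function names are hypothetical. It is not the authors' implementation.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def transition_frequencies(seq):
    """400-D vector: relative frequency of each ordered adjacent residue pair.

    Assumption: a simple stand-in for the paper's Pseudo-Markov
    transition probabilities, whose exact definition is not in the abstract.
    """
    counts = [0] * 400
    for a, b in zip(seq, seq[1:]):
        counts[INDEX[a] * 20 + INDEX[b]] += 1
    total = max(len(seq) - 1, 1)
    return [c / total for c in counts]


def content_ratio(seq):
    """20-D vector: fraction of the sequence made up by each amino acid."""
    counts = [0] * 20
    for aa in seq:
        counts[INDEX[aa]] += 1
    return [c / len(seq) for c in counts]


def position_ratio(seq):
    """20-D vector: mean normalized position of each residue, 0 if absent.

    Assumption: one plausible reading of "position ratio".
    """
    sums = [0.0] * 20
    counts = [0] * 20
    for pos, aa in enumerate(seq, start=1):
        sums[INDEX[aa]] += pos / len(seq)
        counts[INDEX[aa]] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def encode(seq):
    """Concatenate the three parts into a 440-D feature vector."""
    return transition_frequencies(seq) + content_ratio(seq) + position_ratio(seq)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

Under these assumptions, `euclidean_distance(encode(s1), encode(s2))` gives a dissimilarity score for two non-empty sequences over the 20 standard residues. Non-standard letters such as `X` would raise a `KeyError` in this sketch.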
Affiliation(s)
- Yushuang Li: School of Science, Yanshan University, Qinhuangdao, China
- Tian Song: School of Science, Yanshan University, Qinhuangdao, China
- Jiasheng Yang: Department of Civil and Environmental Engineering, National University of Singapore, Singapore
- Yi Zhang: Department of Mathematics, Hebei University of Science and Technology, Shijiazhuang, Hebei, China
- Jialiang Yang: School of Mathematics and Information Science, Henan Polytechnic University, Henan, China
|
27
|
Identifying the Types of Ion Channel-Targeted Conotoxins by Incorporating New Properties of Residues into Pseudo Amino Acid Composition. BioMed Research International 2016; 2016:3981478. [PMID: 27631006 PMCID: PMC5008028 DOI: 10.1155/2016/3981478] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/13/2016] [Accepted: 07/31/2016] [Indexed: 12/31/2022]
Abstract
Conotoxins are a kind of neurotoxin that can specifically interact with potassium, sodium, and calcium channels. They have become potential drug candidates to treat diseases such as chronic pain, epilepsy, and cardiovascular diseases. Thus, correctly identifying the types of ion channel-targeted conotoxins will provide important clues for understanding their function and finding potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion channel-targeted conotoxins. Three kinds of new residue properties were proposed for use in pseudo amino acid composition to formulate conotoxin samples. A support vector machine was used as the classifier. A feature selection technique based on the F-score was used to optimize features. Jackknife cross-validation showed that an overall accuracy of 94.6% was achieved, which is higher than other published results, demonstrating that the proposed method is superior to published methods. Hence the current method may play a complementary role to other existing methods for recognizing the types of ion channel-targeted conotoxins.
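The F-score feature ranking mentioned in this abstract can be sketched as follows. This is a minimal Python illustration of the widely used two-class F-score (between-class separation of the means divided by the sum of within-class sample variances). The abstract does not state which variant the authors used or how it was extended to several conotoxin types, so the two-class form and the function names `f_score` and `rank_features` are assumptions.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """Two-class F-score of one feature.

    Numerator: squared deviations of the class means from the overall mean.
    Denominator: sum of the sample variances of the two classes.
    Each class needs at least two values for the sample variance.
    """
    overall = mean(list(pos_values) + list(neg_values))
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)
    if within == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from most to least discriminative.

    pos_rows and neg_rows are lists of equal-length feature vectors.
    """
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

A typical use would be to keep the top-ranked features and retrain the classifier on that subset, repeating for several subset sizes and keeping the size with the best cross-validated accuracy.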
|