1. Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024]
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell gives insights into tailored drug design for a disease. As the gold standard for validation, conventional wet-lab approaches use fluorescence microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags to identify protein subcellular locations. However, the booming era of proteomics and high-throughput sequencing generates vast numbers of newly discovered proteins, making it impractical to determine protein subcellular localization by wet-lab experiments alone. To tackle this concern, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area over the past decades. In this article, we review the latest advances in AI-based method development across three typical types of approaches: sequence-based, knowledge-based, and image-based methods. We also discuss in detail the existing challenges and future directions for AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
- Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
- College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
- Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
- Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
2. Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Affiliation(s)
- Congjing Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yifei Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Pengju Ding
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Shan Li
- School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Xu Yu
- Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
- Bin Yu
- School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
3. Cao J, Xu Y. Predicting cysteine reactivity changes upon phosphorylation using XGBoost. FEBS Open Bio 2024; 14:51-62. [PMID: 37964470 PMCID: PMC10761938 DOI: 10.1002/2211-5463.13737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 10/11/2023] [Accepted: 10/27/2023] [Indexed: 11/16/2023]
Abstract
Cysteine reactivity serves as a significant indicator of protein function and can be affected by phosphorylation events. Experimental approaches have been developed to investigate this effect, but the scale is still relatively limited. Machine-learning approaches promise to accelerate the investigation of these phenomena. In this study, protein sequence information, distances to the closest phosphorylation sites, and the membership score of the intrinsically disordered region were used to represent the cysteine. Following the feature selection using an elastic net model, two groups of binary classifiers based on XGBoost were built to predict the occurrence and the direction of the reactivity change as a response to phosphorylation events, respectively. In addition, function enrichment analysis was performed on proteins/genes predicted to have reactivity changes. XGBoost performed the best in the independent test with AUC of 0.8192 and 0.9203 for the prediction of the change's occurrence and direction, respectively. The use of two binary classifiers successively resulted in an accuracy of 0.7568 in predicting whether reactivity would be unchanged, increased, or decreased. The enrichment analysis revealed the association of proteins carrying reactivity-changed cysteine residues with various disease-related pathways, particularly cancer, autosomal dominant diseases, and viral infections. Changes in cysteine reactivity influenced by phosphorylation are site-specific and can be predicted by XGBoost algorithms. Our model provides an efficient alternative way to explore the cysteine reactivity upon phosphorylation at the proteome-wide level, facilitating the investigation of protein functions and their clinical insights. Our code is available on GitHub (https://github.com/DarinaOsamu/predictors-of-cysteine-reactivity-changes).
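The successive use of two binary classifiers described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name, the 0.5 threshold, and the example probabilities are all hypothetical, and the two probabilities stand in for the outputs of the occurrence and direction classifiers.

```python
def combine_two_stage(p_change: float, p_increase: float, threshold: float = 0.5) -> str:
    """Map the outputs of two successive binary classifiers to three classes.

    p_change   -- stage 1 output: probability that reactivity changes at all
    p_increase -- stage 2 output: probability that a change is an increase
    threshold  -- decision cut-off (0.5 is an assumption, not from the abstract)
    """
    # Stage 1 decides whether any change occurs.
    if p_change < threshold:
        return "unchanged"
    # Stage 2 is consulted only for sites predicted to change.
    return "increased" if p_increase >= threshold else "decreased"


# Hypothetical (p_change, p_increase) pairs for three cysteine sites.
sites = [(0.20, 0.90), (0.85, 0.70), (0.85, 0.10)]
labels = [combine_two_stage(p_change, p_increase) for p_change, p_increase in sites]
print(labels)
```

Chaining the classifiers this way means a stage 1 error cannot be corrected by stage 2, which may be one reason the three-class accuracy is lower than either binary AUC.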
Affiliation(s)
- Jing Cao
- Department of Statistics, University of Science and Technology Beijing, China
- Yan Xu
- Department of Statistics, University of Science and Technology Beijing, China
4. Wu D, Fang X, Luan K, Xu Q, Lin S, Sun S, Yang J, Dong B, Manavalan B, Liao Z. Identification of SH2 domain-containing proteins and motifs prediction by a deep learning method. Comput Biol Med 2023; 162:107065. [PMID: 37267826 DOI: 10.1016/j.compbiomed.2023.107065] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/04/2023] [Revised: 04/30/2023] [Accepted: 05/27/2023] [Indexed: 06/04/2023]
Abstract
The Src Homology 2 (SH2) domain plays an important role in signal transmission in organisms. It mediates protein-protein interactions through the binding between phosphotyrosine and motifs in the SH2 domain. In this study, we designed a deep learning method to distinguish SH2 domain-containing proteins from non-SH2 domain-containing proteins. First, we collected SH2 and non-SH2 domain-containing protein sequences from multiple species. After data preprocessing, we built six deep learning models through DeepBIO and compared their performance. Second, we selected the model with the strongest overall performance, trained and tested it separately again, and analyzed the results visually. A 288-dimensional (288D) feature was found to distinguish the two types of proteins effectively. Finally, motif analysis identified the specific motif YKIR and revealed its function in signal transduction. In summary, we successfully distinguished SH2 domain-containing and non-SH2 domain-containing proteins with a deep learning method and obtained the best-performing 288D features. In addition, we found a new motif, YKIR, in the SH2 domain and analyzed its function, which helps to further understand signaling mechanisms within the organism.
Affiliation(s)
- Duanzhi Wu
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Xin Fang
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China; Laboratory of Non-communicable Chronic Disease Control, Fujian Provincial Center for Disease Control and Prevention, Fuzhou, 350012, China
- Kai Luan
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Qijin Xu
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Shiqi Lin
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Shiying Sun
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Jiaying Yang
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Bingying Dong
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
- Zhijun Liao
- School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
5. Qin L, Qi Q, Aikeliyaer A, Hou WQ, Zuo CX, Ma X. Machine learning algorithm can provide assistance for the diagnosis of non-ST-segment elevation myocardial infarction. Postgrad Med J 2023; 99:442-454. [PMID: 37294714 DOI: 10.1136/postgradmedj-2021-141329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Accepted: 01/28/2022] [Indexed: 11/04/2022]
Abstract
INTRODUCTION Our aim was to use the constructed machine learning (ML) models as auxiliary diagnostic tools to improve the diagnostic accuracy of non-ST-elevation myocardial infarction (NSTEMI). MATERIALS AND METHODS A total of 2878 patients were included in this retrospective study, including 1409 patients with NSTEMI and 1469 patients with unstable angina pectoris. The clinical and biochemical characteristics of the patients were used to construct the initial attribute set. The SelectKBest algorithm was used to determine the most important features. A feature engineering method was applied to create new, strongly correlated features for training the ML models. Based on the experimental dataset, ML models using extreme gradient boosting, support vector machine, random forest, naïve Bayes, gradient boosting machine and logistic regression were constructed. Each model was verified on test set data, and the diagnostic performance of each model was comprehensively evaluated. RESULTS The six ML models based on the training set all play an auxiliary role in the diagnosis of NSTEMI. Although performance differed across the models compared, the extreme gradient boosting model performed best in terms of accuracy (0.95±0.014), precision (0.94±0.011), recall (0.98±0.003) and F1 score (0.96±0.007) for NSTEMI. CONCLUSIONS The ML model constructed from clinical data can be used as an auxiliary tool to improve the accuracy of NSTEMI diagnosis. According to our comprehensive evaluation, the extreme gradient boosting model performed best.
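As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the mean precision and recall quoted for the extreme gradient boosting model; the function name is illustrative, and the ± spreads are ignored.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall reported in the abstract for the best model.
f1 = f1_score(0.94, 0.98)
print(round(f1, 2))  # 0.96
```

The result agrees with the F1 score of 0.96 reported in the abstract.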
Affiliation(s)
- Lian Qin
- Department of Cardiology, Xinjiang Medical University Affiliated First Hospital, Urumqi, Xinjiang, China
- Quan Qi
- College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang, China
- Ainiwaer Aikeliyaer
- Department of Cardiology, Xinjiang Medical University Affiliated First Hospital, Urumqi, Xinjiang, China
- Wen Qing Hou
- College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang, China
- Chang Xin Zuo
- College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang, China
- Xiang Ma
- Department of Cardiology, Xinjiang Medical University Affiliated First Hospital, Urumqi, Xinjiang, China
6. Li W, Tan L, Peng M, Chen H, Tan C, Zhao E, Zhang L, Peng H, Liang Y. The spatial distribution of phytoliths and phytolith-occluded carbon in wheat (Triticum aestivum L.) ecosystem in China. The Science of the Total Environment 2022; 850:158005. [PMID: 35964741 DOI: 10.1016/j.scitotenv.2022.158005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2022] [Revised: 08/07/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
Phytoliths are a form of SiO2 in plants. Carbon can be sequestered as phytolith-occluded carbon (PhytOC) during the formation of phytoliths. Protected by phytoliths, PhytOC is highly resistant to temperature, oxidation and decomposition and can be stored in the soil for thousands of years. Soil is also a huge PhytOC sink; however, most studies focus on PhytOC storage in straw and other residues. Wheat is a major, widely distributed staple food crop that accumulates a high content of Si, but its potential for PhytOC sequestration is not clear. At present, estimates of PhytOC storage consider only average values and not the relationship between ecological factors and the spatial distribution of PhytOC sequestration. Climatic factors and soil physicochemical properties together affect the formation process and stability of phytoliths. In our study, we collected wheat straw and soil samples from 95 sites in five provinces to extract phytoliths and PhytOC. We constructed an XGBoost model to predict the spatial distribution of phytoliths and PhytOC across the country using the national soil testing and formula fertilization nutrient dataset and climate data. The results show that soil physicochemical factors such as available silicon (Siavail), total carbon (Ctot) and total nitrogen (Ntot), together with climate factors related to temperature and precipitation, have a strong positive impact on the production of phytoliths and PhytOC. Meanwhile, PhytOC storage in wheat ecosystems was estimated to be 7.59 × 10^6 t, which is equivalent to 27.83 Tg of CO2. In China, phytoliths and PhytOC in wheat straw and soil show a decreasing trend from south to north. He'nan Province is the largest wheat production area, producing approximately 1.59 × 10^6 t PhytOC per year. Therefore, PhytOC is a stable CO2 sink pathway in agricultural ecosystems, which is of great importance for mitigating climate warming.
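The CO2 equivalence quoted in this abstract can be reproduced with a short calculation. The sketch below assumes the storage figure is read as 7.59 × 10^6 t of carbon and that the conversion uses the rounded molar-mass ratio of CO2 to C (44/12); the abstract does not state the factor, so that ratio and the function name are assumptions.

```python
# Assumed conversion factor: molar mass of CO2 (44) over molar mass of C (12).
CO2_PER_CARBON = 44 / 12

TONNES_PER_TERAGRAM = 1e6  # 1 Tg = 10^12 g = 10^6 t


def carbon_to_co2_tg(carbon_tonnes: float) -> float:
    """Convert a mass of carbon in tonnes to the equivalent mass of CO2 in Tg."""
    return carbon_tonnes * CO2_PER_CARBON / TONNES_PER_TERAGRAM


co2_tg = carbon_to_co2_tg(7.59e6)
print(round(co2_tg, 2))  # 27.83
```

Under these assumptions the result matches the 27.83 Tg of CO2 given in the abstract, which also supports reading the storage figure as a power of ten.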
Affiliation(s)
- Wenjuan Li
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Li Tan
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Miao Peng
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Hao Chen
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Che Tan
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Enqiang Zhao
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Lei Zhang
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Hongyun Peng
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Liang
- Ministry of Education Key Laboratory of Environment Remediation and Ecological Health, College of Environmental & Resource Sciences, Zhejiang University, Hangzhou 310058, China
7. Suha SA, Islam MN. An extended machine learning technique for polycystic ovary syndrome detection using ovary ultrasound image. Sci Rep 2022; 12:17123. [PMID: 36224353 PMCID: PMC9556522 DOI: 10.1038/s41598-022-21724-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2022] [Accepted: 09/30/2022] [Indexed: 01/04/2023]
Abstract
Polycystic ovary syndrome (PCOS) is the most prevalent endocrinological abnormality and one of the primary causes of anovulatory infertility in women globally. The detection of multiple cysts using ovary ultrasonography (USG) scans is one of the most reliable approaches for making an accurate diagnosis of PCOS and creating an appropriate treatment plan for patients with this syndrome. Instead of depending on error-prone manual identification, an intelligent computer-aided cyst detection system can be a viable approach. Therefore, in this research, an extended machine learning classification technique for PCOS prediction has been proposed, trained and tested over 594 ovary USG images. A Convolutional Neural Network (CNN) incorporating different state-of-the-art techniques and transfer learning is employed for feature extraction from the images; a stacking ensemble machine learning technique, using conventional models as base learners and a bagging or boosting ensemble model as meta-learner, is then applied to the reduced feature set to classify PCOS and non-PCOS ovaries. The proposed technique significantly enhances accuracy while also reducing training execution time compared with other existing ML-based techniques. With the proposed extended technique, the best results are obtained by using the "VGGNet16" pre-trained model with the CNN architecture as feature extractor and a stacking ensemble model with "XGBoost" as meta-learner as image classifier, with a classification accuracy of 99.89%.
Affiliation(s)
- Sayma Alam Suha
- Military Institute of Science and Technology, Department of Computer Science and Technology, Dhaka, 1216 Bangladesh
- Muhammad Nazrul Islam
- Military Institute of Science and Technology, Department of Computer Science and Technology, Dhaka, 1216 Bangladesh
8. Ruzicka D, Kondo T, Fujimoto G, Craig AP, Kim SW, Mikamo H. Development of a clinical prediction model for recurrence and mortality outcomes after Clostridioides difficile infection using a machine learning approach. Anaerobe 2022; 77:102628. [PMID: 35985607 DOI: 10.1016/j.anaerobe.2022.102628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2021] [Revised: 06/29/2022] [Accepted: 08/10/2022] [Indexed: 11/26/2022]
Abstract
OBJECTIVES Clostridioides difficile infection (CDI) is associated with a large burden of morbidity and mortality worldwide. Previous studies have developed models for predicting recurrence and mortality following CDI, but no machine learning predictive models have been developed specifically using data from Japanese patients. METHODS Using a database of records from acute care hospitals in Japan, we extracted records from January 2012 to September 2016 (plus a 60-day lookback window). A total of 19,159 patients were included. We used a machine learning approach, XGBoost, and compared it to a traditional unregularized logistic regression model. The first 80% of the dataset (by patient index date) was used to optimize model hyperparameters and train the final models, and evaluation was performed on the remaining 20%. We measured model performance by the area under the receiver operating characteristic curve and assessed feature importance using Shapley additive explanations. RESULTS Performance was similar between the machine learning approach and the classical logistic regression model. Logistic regression performed slightly better than XGBoost for predicting mortality. CONCLUSION XGBoost performed slightly better than logistic regression for predicting recurrence, but it was not competitive with existing published models. Despite this, a future machine learning-based application provided in a bedside setting at low cost might be a clinically useful tool.
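The chronological 80/20 split by patient index date described in the methods can be sketched as follows. This is an illustration only: the record layout, field names, and toy dates are hypothetical and not taken from the study database.

```python
from datetime import date


def chronological_split(records, train_fraction=0.8):
    """Order records by index date and cut once, so the test set is strictly later data."""
    ordered = sorted(records, key=lambda record: record["index_date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hypothetical patients with made-up index dates.
records = [
    {"patient_id": i, "index_date": date(2012 + i % 5, 1 + i % 12, 1)}
    for i in range(10)
]
train, test = chronological_split(records)
print(len(train), len(test))
```

Splitting by time rather than at random keeps later patients out of training, so the evaluation mimics applying the model to future admissions.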
Affiliation(s)
- Daniel Ruzicka
- Medical Affairs, MSD K.K., Tokyo, Japan, Kitanomaru Square, 1-13-12 Kudan-kita, Chiyoda-ku, Tokyo, 102-8667, Japan
- Takayuki Kondo
- Medical Affairs, MSD K.K., Tokyo, Japan, Kitanomaru Square, 1-13-12 Kudan-kita, Chiyoda-ku, Tokyo, 102-8667, Japan
- Go Fujimoto
- Medical Affairs, MSD K.K., Tokyo, Japan, Kitanomaru Square, 1-13-12 Kudan-kita, Chiyoda-ku, Tokyo, 102-8667, Japan
- Andrew P Craig
- Real World Evidence Solutions, IQVIA Solutions Japan K.K., Takanawa 4-10-18, Minato-ku, Tokyo, 108-0074, Japan
- Seok-Won Kim
- Real World Evidence Solutions, IQVIA Solutions Japan K.K., Takanawa 4-10-18, Minato-ku, Tokyo, 108-0074, Japan
- Hiroshige Mikamo
- Department of Clinical Infectious Diseases, Aichi Medical University Graduate School of Medicine, 1-1, Yazakokarimata, Nagakute, Aichi, 480-1195, Japan
9. A hybrid machine learning/deep learning COVID-19 severity predictive model from CT images and clinical data. Sci Rep 2022; 12:4329. [PMID: 35288579 PMCID: PMC8919158 DOI: 10.1038/s41598-022-07890-1] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 05/25/2021] [Accepted: 02/22/2022] [Indexed: 01/08/2023]
Abstract
COVID-19 clinical presentation and prognosis are highly variable, ranging from asymptomatic and paucisymptomatic cases to acute respiratory distress syndrome and multi-organ involvement. We developed a hybrid machine learning/deep learning model to classify patients into two outcome categories, non-ICU and ICU (intensive care admission or death), using 558 patients admitted to a hospital in northern Italy in February/May 2020. A fully 3D patient-level CNN classifier on baseline CT images is used as feature extractor. The extracted features, together with laboratory and clinical data, are fed into a Boruta algorithm with SHAP game-theoretic values for feature selection. A classifier is built on the reduced feature space using the CatBoost gradient boosting algorithm, reaching a probabilistic AUC of 0.949 on the holdout test set. The model aims to provide clinical decision support to medical doctors, with the probability score of belonging to an outcome class and with case-based SHAP interpretation of feature importance.
10. Sikander R, Wang Y, Ghulam A, Wu X. Identification of Enzymes-specific Protein Domain Based on DDE, and Convolutional Neural Network. Front Genet 2021; 12:759384. [PMID: 34917128 PMCID: PMC8670239 DOI: 10.3389/fgene.2021.759384] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/16/2021] [Accepted: 10/25/2021] [Indexed: 11/21/2022]
Abstract
Predicting the protein sequence information of enzymes and non-enzymes is an important but very challenging task. Existing methods use either protein geometric structures or protein sequences alone to predict enzymatic functions, and their prediction results are therefore unsatisfactory. In this paper, we propose a novel approach for predicting the amino acid sequences of enzymes and non-enzymes via a Convolutional Neural Network (CNN). In the CNN, the roles of enzymes are predicted from multiple kinds of biological information, including information on sequences and structures. We propose the use of two-dimensional data via 2DCNN to predict enzyme and non-enzyme proteins using the same fivefold cross-validation procedure. We also use an independent dataset to test the performance of our model, and the results demonstrate that we are able to solve the overfitting problem. We used the CNN model proposed herein to demonstrate the superiority of our model across a set of filter configurations (32, 64, and 128 filters), with the fivefold validation test set as the independent classification. Via the Dipeptide Deviation from Expected Mean (DDE) matrix, mutation information is extracted from amino acid sequences, and structural information with the distance and angle of amino acids is conveyed. The derived feature maps are then encoded in DDE exploitation. The results on the independent datasets are then compared with two other methods, namely GRU and XGBoost. All analyses were conducted using 32, 64 and 128 filters on our proposed CNN method. The cross-validation datasets achieved an accuracy of 0.8762, whereas the accuracy on the independent datasets was 0.7621. In addition, the ROC AUC with fivefold cross-validation reached 0.95. The performance of our model and that of other models was compared in terms of sensitivity (0.9028) and specificity (0.8497). The overall accuracy of our model was 0.9133 compared with 0.8310 for the other model.
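The abstract names the Dipeptide Deviation from Expected Mean (DDE) descriptor without defining it. The sketch below follows the definition commonly given for DDE in the feature-encoding literature, which is an assumption here rather than something stated in this abstract: observed dipeptide composition, a theoretical mean derived from codon counts in the standard genetic code, and a theoretical variance. The function name and the example peptide are illustrative.

```python
from itertools import product
from math import sqrt

# Number of codons encoding each amino acid in the standard genetic code.
# The counts sum to 61, the number of sense codons.
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}
SENSE_CODONS = sum(CODONS.values())
DIPEPTIDES = ["".join(pair) for pair in product(sorted(CODONS), repeat=2)]


def dde(sequence: str) -> list:
    """Return the 400-dimensional DDE vector for a sequence of at least two residues."""
    n_pairs = len(sequence) - 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    features = []
    for pair in DIPEPTIDES:
        observed = counts[pair] / n_pairs                        # dipeptide composition
        mean = (CODONS[pair[0]] / SENSE_CODONS) * (CODONS[pair[1]] / SENSE_CODONS)
        variance = mean * (1 - mean) / n_pairs                   # theoretical variance
        features.append((observed - mean) / sqrt(variance))
    return features


vector = dde("MKVLAAGIVGLK")  # hypothetical peptide
print(len(vector))
```

Each sequence maps to a fixed-length vector regardless of its length, which is what allows it to be reshaped (for example to 20 × 20) and fed to a 2D convolutional network.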
Affiliation(s)
- Rahu Sikander
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuping Wang
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Ali Ghulam
- Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Xianjuan Wu
- School of Computer Science and Technology, Xidian University, Xi'an, China
11. Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021; 19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022]
Abstract
The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Affiliation(s)
- Yuexu Jiang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Weiwei Wang
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu
- Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
12. Asad E, Mollah AF. Biomarker Identification From Gene Expression Based on Symmetrical Uncertainty. International Journal of Intelligent Information Technologies 2021. [DOI: 10.4018/ijiit.289966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
In this paper, we present an effective information-theoretic feature selection method, Symmetrical Uncertainty, to classify gene expression microarray data and detect biomarkers from it. Here, Information Gain and Symmetrical Uncertainty contribute to ranking the features. Based on the computed values of Symmetrical Uncertainty, features were sorted from most informative to least informative. Then, the top features from the sorted list are passed to Random Forest, Logistic Regression and other well-known classifiers with Leave-One-Out cross-validation to construct the best classification model(s) and accordingly select the most important genes from microarray datasets. The results obtained on Leukemia and Colon cancer datasets, in terms of classification accuracy, running time, root mean square error and other parameters, demonstrate the effectiveness of the proposed approach. The proposed method is much faster than many other wrapper or ensemble methods.
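The ranking step can be sketched with the standard definition of Symmetrical Uncertainty, SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)), where IG is the information gain (mutual information) and H is Shannon entropy. The sketch assumes expression values have already been discretized; the abstract does not say how the authors handle that, and the gene names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def symmetrical_uncertainty(feature, labels) -> float:
    """SU = 2 * IG / (H(feature) + H(labels)), which lies between 0 and 1."""
    h_feature, h_labels = entropy(feature), entropy(labels)
    if h_feature + h_labels == 0:
        return 0.0  # both variables are constant
    h_joint = entropy(list(zip(feature, labels)))
    info_gain = h_feature + h_labels - h_joint
    return 2 * info_gain / (h_feature + h_labels)


# Toy data: class labels and three discretized "genes".
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "gene_a": [0, 0, 0, 1, 1, 1],  # perfectly informative
    "gene_b": [0, 1, 0, 1, 0, 1],  # nearly uninformative
    "gene_c": [0, 0, 1, 1, 1, 1],  # partially informative
}
ranking = sorted(genes, key=lambda g: symmetrical_uncertainty(genes[g], labels), reverse=True)
print(ranking)
```

Because each feature is scored independently of any classifier, the ranking costs one pass per gene, which is consistent with the speed advantage over wrapper methods claimed in the abstract.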
13. Chen YZ, Wang ZZ, Wang Y, Ying G, Chen Z, Song J. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021; 22:6277413. [PMID: 34002774 DOI: 10.1093/bib/bbab146] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 02/22/2021] [Revised: 03/18/2021] [Accepted: 03/25/2021] [Indexed: 12/20/2022]
Abstract
Lysine crotonylation (Kcr) is a newly discovered type of protein post-translational modification and has been reported to be involved in various pathophysiological processes. High-resolution mass spectrometry is the primary approach for identifying Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and expensive compared with computational approaches. To date, several predictors for Kcr site prediction have been developed, most of which are capable of predicting crotonylation sites on either histones alone or mixed histone and nonhistone proteins together. These methods exhibit high diversity in their algorithms, encoding schemes, feature selection techniques and performance assessment strategies. However, none of them was designed for predicting Kcr sites on nonhistone proteins. Therefore, it is desirable to develop an effective predictor for identifying Kcr sites from the large amount of nonhistone sequence data. For this purpose, we first provide a comprehensive review of six methods for predicting crotonylation sites. Second, we develop a novel deep learning-based computational framework termed CNNrgb for Kcr site prediction on nonhistone proteins by integrating different types of features. We benchmark its performance against multiple commonly used machine learning classifiers (including random forest, logitboost, naïve Bayes and logistic regression) by performing both 10-fold cross-validation and an independent test. The results show that the proposed CNNrgb framework achieves the best performance with high computational efficiency on large datasets. Moreover, to facilitate users' efforts to investigate Kcr sites on human nonhistone proteins, we implement an online server called nhKcr and compare it with other existing tools to illustrate the utility and robustness of our method. The nhKcr web server and all the datasets utilized in this study are freely accessible at http://nhKcr.erc.monash.edu/.
Affiliation(s)
- Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | | | | | - Guoguang Ying
- Laboratory of Tumor Cell Biology in Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia
|
14
|
Cohen S, Rokach L, Motro Y, Moran-Gilad J, Veksler-Lublinsky I. minMLST: machine learning for optimization of bacterial strain typing. Bioinformatics 2021; 37:303-311. [PMID: 32804993 DOI: 10.1093/bioinformatics/btaa724] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/10/2020] [Revised: 07/08/2020] [Accepted: 08/10/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: High-resolution microbial strain typing is essential for various clinical purposes, including disease outbreak investigation, tracking of microbial transmission events and epidemiological surveillance of bacterial infections. The widely used approach for multilocus sequence typing (MLST) that is based on the core genome, cgMLST, has the advantage of a high level of typeability and maximal discriminatory power. Yet, the transition from a seven loci-based scheme to cgMLST involves several challenges, including the need by some users to maintain backward compatibility, growing difficulties in the day-to-day communication within the microbiology community with respect to nomenclature and ontology, issues with typeability, especially if a more stringent approach to loci presence is used, and computational requirements concerning laboratory data management and sharing with end-users. Hence, methods for optimizing cgMLST schemes through careful reduction of the number of loci are expected to be beneficial for practical needs in different settings.
RESULTS: We present a new machine learning-based methodology, minMLST, for minimizing the number of genes in cgMLST schemes by identifying subsets of informative genes and analyzing the trade-off between gene reduction and typing performance. The results achieved with minMLST over eight bacterial species show that despite a reduction in the number of genes by up to a factor of 10, the typing performance remains very high and significant, with an Adjusted Rand Index that ranges between 0.4 and 0.93 in different species and a P-value < 10⁻³. The identification of such optimized MLST schemes for bacterial strain typing is expected to improve the implementation of cgMLST by improving interlaboratory agreement and communication.
AVAILABILITY AND IMPLEMENTATION: The Python package minMLST is available at https://PyPi.org/project/minmlst/PyPI and supported on Linux and Windows.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shani Cohen
- Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel
| | - Lior Rokach
- Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel
| | - Yair Motro
- Department of Health Systems Management, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel
| | - Jacob Moran-Gilad
- Department of Health Systems Management, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel
| | - Isana Veksler-Lublinsky
- Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel
|
15
|
Wang L, Niu D, Zhao X, Wang X, Hao M, Che H. A Comparative Analysis of Novel Deep Learning and Ensemble Learning Models to Predict the Allergenicity of Food Proteins. Foods 2021; 10:809. [PMID: 33918556 PMCID: PMC8069377 DOI: 10.3390/foods10040809] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/17/2021] [Revised: 04/02/2021] [Accepted: 04/06/2021] [Indexed: 11/16/2022] Open
Abstract
Traditional food allergen identification relies mainly on in vivo and in vitro experiments, which are often time-consuming and costly. Artificial intelligence (AI)-driven rapid food allergen identification has addressed some of these drawbacks and is becoming an efficient auxiliary tool. To overcome the lower accuracy of traditional machine learning models in predicting the allergenicity of food proteins, this work introduces a deep learning model, a transformer with a self-attention mechanism, and ensemble learning models, represented by Light Gradient Boosting Machine (LightGBM) and eXtreme Gradient Boosting (XGBoost). To highlight the advantages of the proposed method, the study also selected various commonly used machine learning models as baseline classifiers. The results of 5-fold cross-validation showed that the area under the receiver operating characteristic curve (AUC) of the deep model was the highest (0.9578), better than those of the ensemble learning and baseline algorithms. However, the deep model needs to be pre-trained, and its training time is the longest. Comparing the characteristics of the transformer model and the boosting models shows that each model has its own advantages, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.
Affiliation(s)
- Liyang Wang
- Key Laboratory of Precision Nutrition and Food Quality, The Ministry of Education, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China; (L.W.); (X.W.); (M.H.)
| | - Dantong Niu
- College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China;
| | - Xinjie Zhao
- College of Humanities and Development Studies, China Agricultural University, Beijing 100083, China;
| | - Xiaoya Wang
- Key Laboratory of Precision Nutrition and Food Quality, The Ministry of Education, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China; (L.W.); (X.W.); (M.H.)
| | - Mengzhen Hao
- Key Laboratory of Precision Nutrition and Food Quality, The Ministry of Education, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China; (L.W.); (X.W.); (M.H.)
| | - Huilian Che
- Key Laboratory of Precision Nutrition and Food Quality, The Ministry of Education, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China; (L.W.); (X.W.); (M.H.)
|
16
|
DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment. Int J Mol Sci 2020; 21:5710. [PMID: 32784927 PMCID: PMC7460811 DOI: 10.3390/ijms21165710] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/05/2020] [Revised: 08/05/2020] [Accepted: 08/07/2020] [Indexed: 12/18/2022] Open
Abstract
Mitochondrial proteins are physiologically active in different compartments, and their abnormal location can trigger the pathogenesis of human mitochondrial pathologies. Correctly identifying submitochondrial locations can provide information for disease pathogenesis and drug design. A mitochondrion has four submitochondrial compartments: the matrix, the outer membrane, the inner membrane, and the intermembrane space. However, various existing studies ignored the intermembrane space. The majority of researchers used traditional machine learning methods for predicting mitochondrial protein localization. Those predictors required expert-level knowledge of biology to be encoded as features rather than allowing the underlying predictor to extract features through a data-driven procedure. Besides, few researchers have considered the imbalance in datasets. In this paper, we propose a novel end-to-end predictor employing deep neural networks, DeepPred-SubMito, for protein submitochondrial location prediction. First, we utilize random over-sampling to decrease the influence of unbalanced datasets. Next, we train a multi-channel bilayer convolutional neural network on multiple subsequences to learn high-level features. Third, the prediction result is output through the fully connected layer. The performance of the predictor is measured by 10-fold cross-validation and 5-fold cross-validation on the SM424-18 dataset and the SubMitoPred dataset, respectively. Experimental results show that the predictor outperforms state-of-the-art predictors. In addition, the prediction results on the M983 dataset also confirmed its effectiveness in predicting submitochondrial locations.
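The first step named in this abstract, random over-sampling, can be sketched as follows. This is a generic illustration, not the DeepPred-SubMito code: the function name, the `seed` parameter, and the choice to grow every class to the size of the largest one are assumptions made for the example.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    are as large as the largest class. Original samples are all kept."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Over-sampling would normally be applied only to the training folds, so that duplicated sequences do not leak into the validation data.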
|
17
|
Bouziane H, Chouarfia A. Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment. J Integr Bioinform 2020; 18:51-79. [PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/08/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022] Open
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigation by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for elucidating protein function. However, SCL prediction remains a challenging task, especially for multi-site proteins, which are known to shuttle between cell compartments to perform their biological functions, and for proteins that lack significant homology to proteins of known subcellular location. Owing to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, helped by the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem using different protein sequence features pertaining to subcellular localization; however, the overwhelming majority of them focus on single localization and cover very few cellular locations. Prediction was basically established on sorting signals, amino acid compositions, and homology. To improve prediction quality, the focus is currently on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotations, which have recently become widely adopted and essential information for learning systems. To deal with this problem, in the present study we treated the SCL prediction task as a multi-label learning problem and tried to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both GO terms of protein homologs and PSI-BLAST profiles.
The experiments using 5-fold cross-validation tests on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Affiliation(s)
- Hafida Bouziane
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
| | - Abdallah Chouarfia
- Département d’Informatique, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf, USTO-MB BP 1505, El M’Naouer, 31000, Oran, Algeria
|
18
|
Computational Identification and Analysis of Ubiquinone-Binding Proteins. Cells 2020; 9:520. [PMID: 32102444 PMCID: PMC7072731 DOI: 10.3390/cells9020520] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/21/2020] [Revised: 02/21/2020] [Accepted: 02/21/2020] [Indexed: 12/15/2022] Open
Abstract
Ubiquinone is an important cofactor that plays vital and diverse roles in many biological processes. Ubiquinone-binding proteins (UBPs) are receptor proteins that dock with ubiquinones. Analyzing and identifying UBPs via a computational approach will provide insights into the pathways associated with ubiquinones. In this work, we were the first to propose a UBP predictor (UBPs-Pred). The optimal feature subset selected from three categories of sequence-derived features was fed into the extreme gradient boosting (XGBoost) classifier, and the parameters of XGBoost were tuned by multi-objective particle swarm optimization (MOPSO). The results on the independent validation set demonstrated considerable prediction performance, with a Matthews correlation coefficient (MCC) of 0.517. After that, we analyzed the UBPs using bioinformatics methods, including the statistics of the binding domain motifs and protein distribution, as well as an enrichment analysis of the gene ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.
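For readers unfamiliar with the metric reported above, the Matthews correlation coefficient for a binary classifier can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator is zero are choices made for this example, not details from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (empty row or column) is reported as 0.0.
    return numerator / denominator if denominator else 0.0
```

Unlike plain accuracy, MCC stays informative when positive and negative classes are of very different sizes, which is why it is commonly reported for binding-protein predictors.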
|
19
|
Yoo TK, Ryu IH, Choi H, Kim JK, Lee IS, Kim JS, Lee G, Rim TH. Explainable Machine Learning Approach as a Tool to Understand Factors Used to Select the Refractive Surgery Technique on the Expert Level. Transl Vis Sci Technol 2020; 9:8. [PMID: 32704414 PMCID: PMC7346876 DOI: 10.1167/tvst.9.2.8] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 07/25/2019] [Accepted: 11/18/2019] [Indexed: 12/23/2022] Open
Abstract
Purpose: Recently, laser refractive surgery options, including laser epithelial keratomileusis, laser in situ keratomileusis, and small incision lenticule extraction, have successfully improved patients' quality of life. Evidence-based recommendation of an optimal surgery technique is valuable for increasing patient satisfaction. We developed an interpretable multiclass machine learning model that selects the laser surgery option at the expert level.
Methods: A multiclass XGBoost model was constructed to classify patients into four categories: laser epithelial keratomileusis, laser in situ keratomileusis, small incision lenticule extraction, and contraindication groups. The analysis included 18,480 subjects who intended to undergo refractive surgery at the B&VIIT Eye center. Training (n = 10,561) and internal validation (n = 2640) were performed using subjects who visited between 2016 and 2017. The model was trained on the clinical decisions of highly experienced experts and ophthalmic measurements. External validation (n = 5279) was conducted using subjects who visited in 2018. The SHapley Additive exPlanations technique was adopted to explain the output of the XGBoost model.
Results: The multiclass XGBoost model exhibited accuracies of 81.0% and 78.9% on the internal and external validation datasets, respectively. The SHapley Additive exPlanations results were consistent with prior knowledge from ophthalmologists. The explanations from one-versus-one and one-versus-rest XGBoost classifiers helped users easily understand the multicategorical classification problem.
Conclusions: This study suggests an expert-level multiclass machine learning model for selecting the refractive surgery for patients. It also provides clinical understanding of a multiclass problem based on an explainable artificial intelligence technique.
Translational Relevance: Explainable machine learning exhibits a promising future for increasing the practical use of artificial intelligence in ophthalmic clinics.
Affiliation(s)
- Tae Keun Yoo
- Department of Ophthalmology, Aerospace Medical Center, Republic of Korea Air Force, Cheongju, South Korea
| | | | | | | | | | | | | | - Tyler Hyungtaek Rim
- Singapore Eye Research Institute, Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore
|
20
|
A XGBoost Model with Weather Similarity Analysis and Feature Engineering for Short-Term Wind Power Forecasting. Applied Sciences-Basel 2019. [DOI: 10.3390/app9153019] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/17/2022]
Abstract
Large-scale wind power access may cause a series of safety and stability problems. Wind power forecasting (WPF) supports dispatching in advance. In this paper, a new extreme gradient boosting (XGBoost) model with weather similarity analysis and feature engineering is proposed for short-term wind power forecasting. Based on the similarity among historical days' weather, the k-means clustering algorithm is used to divide the samples into several categories. We also create some time features and drop unimportant features through feature engineering. For each category, we make predictions using XGBoost. The results of the proposed model are compared with those of a back propagation neural network (BPNN), classification and regression tree (CART), random forests (RF), support vector regression (SVR), and a single XGBoost model. The proposed model produces the highest forecasting accuracy among all these models.
|