1
Shazia, Ullah FUM, Rho S, Lee MY. Predictive modeling for ubiquitin proteins through advanced machine learning technique. Heliyon 2024; 10:e32517. [PMID: 38975176 PMCID: PMC11225741 DOI: 10.1016/j.heliyon.2024.e32517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 06/05/2024] [Indexed: 07/09/2024] Open
Abstract
Ubiquitination is an essential post-translational modification in which a ubiquitin protein is bonded to a substrate protein. It is crucial in a variety of physiological activities, including cell survival and differentiation and innate and adaptive immunity. Alterations in the ubiquitin system lead to the development of various human diseases. Numerous studies show that the ubiquitin system is highly reversible and dynamic, which makes experimental identification difficult. To address this issue, this article develops a machine learning model that aims to improve the precision of ubiquitin protein prediction. We investigate ubiquitination data that are processed through different feature extraction methods, followed by classification. The evaluation uses Jackknife tests and 10-fold cross-validation. The proposed method achieved accuracies of 100 %, 99.88 %, and 99.84 % on Dataset-I, Dataset-II, and Dataset-III, respectively. Using the Jackknife test, the method achieves 100 %, 99.91 %, and 99.99 % on Dataset-I, Dataset-II, and Dataset-III, respectively. The analysis concludes that the proposed method outperforms state-of-the-art methods in identifying ubiquitination sites and may be helpful in the development of clinical therapies. The source code and datasets will be made available on GitHub.
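The two evaluation protocols named in this abstract differ only in how samples are partitioned. The following is a minimal sketch of that partitioning, not the authors' code: the function names are hypothetical, the folds are contiguous and unshuffled, and no stratification by class is attempted (a real evaluation would usually shuffle and stratify).

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def jackknife_indices(n_samples):
    """Jackknife (leave-one-out): every sample is the test set exactly once."""
    return k_fold_indices(n_samples, n_samples)
```

With 25 samples and k = 10, the first five folds hold three test samples each and the remaining five hold two. The Jackknife test is the limiting case k = n, so it trains n models and is correspondingly more expensive than 10-fold cross-validation.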
Affiliation(s)
- Shazia
- Mardan College of Nursing, Bacha Khan Medical College, Mardan, Pakistan
- Fath U Min Ullah
- Department of Computing, School of Engineering and Computing, University of Central Lancashire, Preston, United Kingdom
- Seungmin Rho
- Department of Industrial Security, Chung-Ang University, Seoul 06974, Republic of Korea
- Mi Young Lee
- Chung-Ang University, Seoul 06974, Republic of Korea
2
Brahmi Z, Mahyoob M, Al-Sarem M, Algaraady J, Bousselmi K, Alblwi A. Exploring the Role of Machine Learning in Diagnosing and Treating Speech Disorders: A Systematic Literature Review. Psychol Res Behav Manag 2024; 17:2205-2232. [PMID: 38835654 PMCID: PMC11149643 DOI: 10.2147/prbm.s460283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Accepted: 05/07/2024] [Indexed: 06/06/2024] Open
Abstract
Purpose: Speech disorders profoundly impact the overall quality of life by impeding social operations and hindering effective communication. This study addresses the gap in systematic reviews concerning machine learning-based assistive technology for individuals with speech disorders. The overarching purpose is to offer a comprehensive overview of the field through a Systematic Literature Review (SLR) and provide valuable insights into the landscape of ML-based solutions and related studies. Methods: The research employs a Systematic Literature Review (SLR) methodology. The study extensively examines the existing literature on machine learning-based assistive technology for speech disorders. Specific attention is given to ML techniques, characteristics of the datasets used in the training phase, speaker languages, feature extraction techniques, and the features employed by ML algorithms. Originality: This study contributes to the existing literature by systematically exploring the machine learning landscape in assistive technology for speech disorders. The originality lies in the focused investigation of ML-based speech recognition for users with speech disorders over ten years (2014-2023). The emphasis on systematic research questions related to ML techniques, dataset characteristics, languages, feature extraction techniques, and feature sets adds a unique and comprehensive perspective to the current discourse. Findings: The systematic literature review identifies significant trends and critical studies published between 2014 and 2023. In the analysis of the 65 papers from prestigious journals, support vector machines and neural networks (CNN, DNN) were the most utilized ML techniques (20% and 16.92%, respectively), with the most studied disorder being dysarthria (35/65 studies, 54%). Furthermore, an upsurge in the use of neural network-based architectures, mainly CNN and DNN, was observed after 2018. Almost half of the included studies were published between 2021 and 2022.
Affiliation(s)
- Zaki Brahmi
- Department of Computer Science, Taibah University, Madina, Kingdom of Saudi Arabia
- Mohammad Mahyoob
- Department of Languages and Translation, Taibah University, Madina, Kingdom of Saudi Arabia
- Mohammed Al-Sarem
- Department of Computer Science, Taibah University, Madina, Kingdom of Saudi Arabia
- Khadija Bousselmi
- Department of Computer Science, LISTIC, University of Savoie Mont Blanc, Chambéry, France
- Abdulaziz Alblwi
- Department of Computer Science, Taibah University, Madina, Kingdom of Saudi Arabia
3
Chikkanayakanahalli Mukunda D, Rodrigues J, Chandra S, Mazumder N, Vitkin A, Kishore Mahato K. Protein classification by autofluorescence spectral shape analysis using machine learning. Talanta 2024; 267:125167. [PMID: 37714041 DOI: 10.1016/j.talanta.2023.125167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 08/23/2023] [Accepted: 09/04/2023] [Indexed: 09/17/2023]
Abstract
Depending on the relative numbers and spatial arrangement of Tryptophan (Trp; W) and Tyrosine (Tyr; Y) residues, different proteins produce distinct autofluorescence (AF) spectral shapes when excited at ∼280 nm. Yet, considering the vast number and heterogeneous forms in nature, visual analysis and precise identification of proteins based on their AF spectra is challenging and further compounded in cases when different proteins produce substantially similar AF spectral shapes. There is, thus, a serious need to develop a methodology to address this problem. The current study proposes a practical technology to quickly identify proteins using machine learning (ML) algorithms based on their AF spectra. Specifically, AF spectra of fifteen different standard proteins of varying origin with distinct structural and Trp/Tyr compositions were recorded; based on the spectral features selected by the Minimum-Redundancy-Maximum-Relevance (mRMR) algorithm, a multiclass Support Vector Machine (SVM) learning model with Radial Basis Function (RBF), Polynomial, and Linear kernels classified the proteins with high accuracy of 99.06%, 99.03%, and 98.29% respectively. Since protein identification is the key to understand biological functions and disease diagnosis, the proposed methodology could offer a viable alternative to and improve the existing protein identification techniques.
Affiliation(s)
- Jackson Rodrigues
- Department of Biophysics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, 576104, Karnataka, India
- Subhash Chandra
- Department of Biophysics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, 576104, Karnataka, India
- Nirmal Mazumder
- Department of Biophysics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, 576104, Karnataka, India
- Alex Vitkin
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, M5G 1L7, Canada
- Krishna Kishore Mahato
- Department of Biophysics, Manipal School of Life Sciences, Manipal Academy of Higher Education, Manipal, 576104, Karnataka, India
4
Arya N, Mathur A, Saha S, Saha S. Proposal of SVM Utility Kernel for Breast Cancer Survival Estimation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1372-1383. [PMID: 35994556 DOI: 10.1109/tcbb.2022.3198879] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
The advancement of medical research in the field of cancer prognosis and diagnosis using various modalities has put oncologists under tremendous stress. The complexity and heterogeneity involved in multiple modalities and their significantly varied clinical outcomes make it difficult to analyze the disease and provide the correct treatment. Breast cancer is the major concern among all cancers worldwide, specifically for females. To help oncologists and cancer patients, research for breast cancer survival estimation has been proposed. It ranges from complex deep neural networks to simple and interpretable architectures. We propose a utility kernel for a support vector machine (SVM) in this article. It is a simple yet powerful function, which performs better than other popular machine learning algorithms and deep neural networks in the task of breast cancer survival prediction using the TCGA-BRCA dataset. This study validates the proposed utility kernel using four different modalities (gene expression, copy number variation, clinical, and histopathological tissue images) and their multi-modal combinations. The SVM based on our utility kernel empirically proves its efficacy by achieving the highest value on various performance measures, whereas advanced deep neural networks fail to train on small and highly imbalanced breast cancer data.
5
Ju Z, Wang SY. Prediction of lysine HMGylation sites using multiple feature extraction and fuzzy support vector machine. Anal Biochem 2023; 663:115032. [PMID: 36592921 DOI: 10.1016/j.ab.2022.115032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 12/25/2022] [Indexed: 12/31/2022]
Abstract
Protein 3-hydroxyl-3-methylglutarylation (HMGylation) is a newly discovered lysine acylation modification in the mitochondrion. The accurate identification of HMGylation sites is the premise and key to further exploring the molecular mechanisms of HMGylation. In this study, a novel bioinformatics tool named HMGPred is developed to predict HMGylation sites. Multiple effective features, including amino acid composition, amino acid factors, binary encoding, and the composition of k-spaced amino acid pairs, are integrated to encode HMGylation sites. F-score feature ranking with incremental feature selection is used to eliminate redundant features. Moreover, a fuzzy support vector machine algorithm is used to reduce the influence of noise by assigning different fuzzy membership degrees to different samples. As illustrated by 10-fold cross-validation, HMGPred achieves a satisfactory performance, with an area under the receiver operating characteristic curve of 0.9110. Feature analysis indicates that some k-spaced amino acid pair features, such as 'KxxxT' and 'DxxxE', play a critical role in the prediction of HMGylation sites. The results of prediction and analysis might be helpful for investigating the mechanisms of HMGylation. For the convenience of experimental researchers, HMGPred is implemented as a web server at http://123.206.31.171/HMGPred/.
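To make the k-spaced amino acid pair encoding concrete: a feature such as 'KxxxT' counts how often K and T occur with exactly three residues between them (k = 3). The sketch below is an illustration only, not HMGPred's implementation; the function name is hypothetical, and normalizing by the number of pair positions is an assumption taken from the common form of this descriptor.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_spaced_pair_composition(sequence, k):
    """Return the frequency of every residue pair separated by k residues."""
    n_positions = len(sequence) - k - 1  # number of (i, i + k + 1) index pairs
    if k < 0 or n_positions <= 0:
        raise ValueError("sequence is too short for this spacing")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(n_positions):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    return {pair: n / n_positions for pair, n in counts.items()}


# 'KxxxT' corresponds to the "KT" entry of the k = 3 feature vector.
features = k_spaced_pair_composition("KAAATKCCCT", k=3)
```

Each value of k contributes a 400-dimensional vector; concatenating several spacings yields the kind of high-dimensional encoding that a feature selection step is then needed to prune.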
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China
- Shi-Yun Wang
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China
6
Li W, Wang J, Luo Y, Bezabih TT. Multi-dimensional feature recognition model based on capsule network for ubiquitination site prediction. PeerJ 2022; 10:e14427. [PMID: 36523471 PMCID: PMC9745908 DOI: 10.7717/peerj.14427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 10/30/2022] [Indexed: 12/12/2022] Open
Abstract
Ubiquitination is an important post-translational modification of proteins that regulates many cellular activities. Traditional experimental methods for identification are costly and time-consuming, so many researchers have proposed computational methods for ubiquitination site prediction in recent years. However, traditional machine learning methods focus on feature engineering and are not suitable for large-scale proteomic data. In addition, deep learning methods are mostly based on convolutional neural networks and fuse multiple coding approaches to achieve classification prediction. This cannot effectively identify potential fine-grained features of the input data and has limitations in the representation of dependencies between low-level features and high-level features. A multi-dimensional feature recognition model based on a capsule network (MDCapsUbi) was proposed to predict protein ubiquitination sites. The proposed module consisting of convolution operations and channel attention was used to recognize coarse-grained features in the sequence dimension and the feature map dimension. The capsule network module consisting of capsule vectors was used to identify fine-grained features and classify ubiquitinated sites. With ten-fold cross-validation, the MDCapsUbi achieved 91.82% accuracy, 91.39% sensitivity, 92.24% specificity, 0.837 MCC, 0.918 F-Score and 0.97 AUC. Experimental results indicated that the proposed method outperformed other ubiquitination site prediction technologies.
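The performance measures listed in this abstract (accuracy, sensitivity, specificity, MCC, F-score) can all be derived from the four counts of a binary confusion matrix. A minimal sketch using the standard definitions follows; the function name is hypothetical and the counts in the usage line are made up for illustration, not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts.

    Assumes both classes are present and at least one positive prediction
    exists, so that none of the ratios below divides by zero.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "mcc": mcc,
    }


metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when positive and negative sites are unevenly represented.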
Affiliation(s)
- Weimin Li
- School of Computer Engineering and Science, Shanghai University, Shanghai, China
- Jie Wang
- School of Computer Engineering and Science, Shanghai University, Shanghai, China
- Yin Luo
- School of Life Sciences, East China Normal University, Shanghai, China
7
Prediction of anti-inflammatory peptides by a sequence-based stacking ensemble model named AIPStack. iScience 2022; 25:104967. [PMID: 36093066 PMCID: PMC9449674 DOI: 10.1016/j.isci.2022.104967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 08/09/2022] [Accepted: 08/12/2022] [Indexed: 11/23/2022] Open
Abstract
Accurate and efficient identification of anti-inflammatory peptides (AIPs) is crucial for the treatment of inflammation. Here, we proposed a two-layer stacking ensemble model, AIPStack, to effectively predict AIPs. At first, we constructed a new dataset for model building and validation. Then, peptide sequences were represented by hybrid features, which were fused by two amino acid composition descriptors. Next, the stacking ensemble model was constructed by random forest and extremely randomized tree as the base-classifiers and logistic regression as the meta-classifier to receive the outputs from the base-classifiers. AIPStack achieved an AUC of 0.819, accuracy of 0.755, and MCC of 0.510 on the independent set 3, which were higher than other AIP predictors. Furthermore, the essential sequence features were highlighted by the Shapley Additive exPlanation (SHAP) method. It is anticipated that AIPStack could be used for AIP prediction in a high-throughput manner and facilitate the hypothesis-driven experimental design. AIPStack model was developed for the prediction of anti-inflammatory peptides. The hybrid features were used to describe the peptide sequences. The proposed model AIPStack outperformed existing ones. SHAP was used to highlight the essential features required for AIP prediction.
8
Song J, Li Z, Yao G, Wei S, Li L, Wu H. Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis. PLoS One 2022; 17:e0273383. [PMID: 35984833 PMCID: PMC9390903 DOI: 10.1371/journal.pone.0273383] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/21/2022] [Accepted: 08/08/2022] [Indexed: 11/18/2022] Open
Abstract
Neonatal necrotizing enterocolitis (NEC) occurs worldwide and is a major source of neonatal morbidity and mortality. Researchers have developed many methods for predicting NEC diagnosis and prognosis. However, most people use statistical methods to select features, which may ignore the correlation between features. In addition, because they consider a small dimension of characteristics, they neglect some laboratory parameters such as white blood cell count, lymphocyte percentage, and mean platelet volume, which could be potentially influential factors affecting the diagnosis and prognosis of NEC. To address these issues, we include more perinatal, clinical, and laboratory information, including anemia—red blood cell transfusion and feeding strategies, and propose a ridge regression and Q-learning strategy based bee swarm optimization (RQBSO) metaheuristic algorithm for predicting NEC diagnosis and prognosis. Finally, a linear support vector machine (linear SVM), which specializes in classifying high-dimensional features, is used as a classifier. In the NEC diagnostic prediction experiment, the area under the receiver operating characteristic curve (AUROC) of dataset 1 (feeding intolerance + NEC) reaches 94.23%. In the NEC prognostic prediction experiment, the AUROC of dataset 2 (medical NEC + surgical NEC) reaches 91.88%. Additionally, the classification accuracy of the RQBSO algorithm on the NEC dataset is higher than the other feature selection algorithms. Thus, the proposed approach has the potential to identify predictors that contribute to the diagnosis of NEC and stratification of disease severity in a clinical setting.
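The AUROC values reported above have a direct probabilistic reading: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by counting such pairs; it is a generic illustration with a hypothetical function name and toy inputs, not the evaluation code of the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (positive) or 0 (negative); ties between a positive and a
    negative score count as half a correctly ordered pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    ordered = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ordered += 1.0
            elif p == n:
                ordered += 0.5
    return ordered / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
value = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 0.75
```

The double loop is quadratic in the number of samples, which is acceptable for small clinical cohorts; a rank-based formulation gives the same value in O(n log n).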
Affiliation(s)
- Jianfei Song
- College of Communication Engineering, Jilin University, Changchun, Jilin, PR China
- Zhenyu Li
- Department of Neonatology, Jilin University First Hospital, Changchun, Jilin, PR China
- Guijin Yao
- College of Communication Engineering, Jilin University, Changchun, Jilin, PR China
- Songping Wei
- College of Communication Engineering, Jilin University, Changchun, Jilin, PR China
- Ling Li
- College of Communication Engineering, Jilin University, Changchun, Jilin, PR China
- * Corresponding author (LL)
- Hui Wu
- Department of Neonatology, Jilin University First Hospital, Changchun, Jilin, PR China
- * Corresponding author (HW)
9
Huang X, Chen X, Chen X, Wang W. Screening of Serum miRNAs as Diagnostic Biomarkers for Lung Cancer Using the Minimal-Redundancy-Maximal-Relevance Algorithm and Random Forest Classifier Based on a Public Database. Public Health Genomics 2022; 25:1-9. [PMID: 35917800 DOI: 10.1159/000525316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Accepted: 05/12/2022] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Lung cancer is one of the deadliest cancers, and early diagnosis can efficiently improve patient survival. We aimed to screen serum miRNAs as diagnostic biomarkers for patients with lung cancer. METHODS A total of 416 markedly differentially expressed miRNAs were acquired using the limma package, and a feature ranking was then derived by the minimal-redundancy-maximal-relevance method. An incremental feature selection algorithm with a random forest (RF) classifier was used to choose the top 5 miRNA combination with the optimum predictive performance. The performance of the RF classifier built on the top 5 miRNAs was analyzed using the receiver operating characteristic (ROC) curve. Afterward, the classification effect of the 5-miRNA combination was validated through principal component analysis and hierarchical clustering analysis. The expression of the top 5 miRNAs in lung cancer patients and normal people was compared based on the GSE137140 dataset, and their expression was validated by qPCR. Hierarchical clustering analysis was used to analyze the similarity of the expression profiles of the 5 miRNAs. ROC analysis was undertaken on each miRNA. RESULTS We finally obtained the top 5 miRNAs, with a Matthews correlation coefficient of 0.988 and an area under the curve (AUC) of 0.996. The 5 feature miRNAs were capable of distinguishing most cancer patients from normal people. Furthermore, except for miR-6875-5p, which was lowly expressed in lung cancer tissue, the other 4 miRNAs were all highly expressed in cancer patients. Performance analysis revealed that their AUC values were 0.92, 0.96, 0.94, 0.95, and 0.93, respectively. CONCLUSION By and large, the 5 feature miRNAs screened here are expected to be effective biomarkers for lung cancer.
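Incremental feature selection, as used in the METHODS above, takes an already ranked feature list and evaluates the top-1, top-2, ... subsets, keeping the one with the best score. The sketch below shows only that loop. The function name, the placeholder feature names, and the toy scores are all hypothetical; in a study like this one the scoring callback would train and cross-validate a classifier (here, a random forest) on the given subset.

```python
def incremental_feature_selection(ranked_features, score_subset, max_features=None):
    """Evaluate growing prefixes of a ranked feature list and keep the best.

    score_subset is a callable that maps a list of feature names to a score
    (higher is better), e.g. a cross-validated MCC.
    """
    if not ranked_features:
        raise ValueError("ranked_features must not be empty")
    limit = len(ranked_features)
    if max_features is not None:
        limit = min(limit, max_features)
    best_subset, best_score, history = [], float("-inf"), []
    for size in range(1, limit + 1):
        subset = list(ranked_features[:size])
        score = score_subset(subset)
        history.append((size, score))
        # Strict comparison: on ties the smaller subset found earlier is kept.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, history


# Toy stand-in for a cross-validated score that depends only on subset size.
toy_scores = {1: 0.80, 2: 0.91, 3: 0.95, 4: 0.95, 5: 0.93}
subset, score, history = incremental_feature_selection(
    ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"],
    lambda features: toy_scores[len(features)],
)
```

Because the ranking is fixed beforehand (for example by minimal-redundancy-maximal-relevance), only as many models are trained as there are candidate subset sizes, rather than one per possible feature combination.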
Affiliation(s)
- Xiaoyan Huang
- Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Xiong Chen
- Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Xi Chen
- Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Wenling Wang
- Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
10
Sikander R, Arif M, Ghulam A, Worachartcheewan A, Thafar MA, Habib S. Identification of the ubiquitin–proteasome pathway domain by hyperparameter optimization based on a 2D convolutional neural network. Front Genet 2022; 13:851688. [PMID: 35937990 PMCID: PMC9355632 DOI: 10.3389/fgene.2022.851688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 06/29/2022] [Indexed: 11/13/2022] Open
Abstract
The major mechanism of proteolysis in the cytosol and nucleus is the ubiquitin–proteasome pathway (UPP). The highly controlled UPP affects a wide range of cellular processes and substrates, and flaws in the system can lead to the pathogenesis of a number of serious human diseases. Knowledge about UPPs provides useful hints for understanding cellular processes and for drug discovery. The exponential growth of next-generation sequencing wet-lab approaches has accelerated the accumulation of unannotated data in online databases, making UPP characterization and analysis more challenging. Computational methods are therefore used as an alternative for fast and accurate identification of UPPs. To this end, we develop a novel deep learning-based predictor named “2DCNN-UPP” for identifying UPPs with a low error rate. The proposed method uses a two-dimensional convolutional neural network (2D-CNN) with dipeptide deviation features. To avoid overfitting, a genetic algorithm is employed to select the optimal features. Finally, the optimized attribute set is fed as input to the 2D-CNN learning engine to build the model. The model was trained using 10-fold cross-validation and then evaluated on an independent test set. Empirical results demonstrate that the proposed predictor achieved superior overall accuracy and AUC (ROC) values compared with other state-of-the-art methods for UPP classification. As experimentally validated ubiquitination sites emerge, a proteomics-based predictor of ubiquitination needs to be devised. We also evaluated the generalization power of our trained model via the independent test and obtained 0.862 accuracy, 0.921 sensitivity, 0.803 specificity, and 0.730 Matthews correlation coefficient (MCC). Four approaches were applied to the sequences, and the calculated physical properties were combined. With 10-fold cross-validation, 2DCNN-UPP obtained an AUC (ROC) value of 0.862. We analyzed the relationship between the predicted scores of UPP and non-UPP proteins. Last but not least, this research could support large-scale analysis of the relationship between UPP proteins and non-UPP proteins in particular and of other protein problems in general, and it might improve computational biology research. The latest features in our model framework, together with Dipeptide Deviation from Expected Mean (DDE)-based features, could therefore be used for the prediction of protein structure and function and for different molecules, such as DNA and RNA.
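The abstract names Dipeptide Deviation from Expected Mean (DDE) features without defining them. The sketch below assumes the definition commonly used in the literature, which may differ in detail from what this paper implemented: the observed dipeptide frequency is compared with a theoretical mean derived from the number of codons encoding each residue (61 sense codons in the standard genetic code) and scaled by the theoretical standard deviation. The function and constant names are hypothetical.

```python
import math
from itertools import product

# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2,
    "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4,
    "W": 1, "Y": 2,
}
TOTAL_CODONS = sum(CODON_COUNTS.values())  # 61 sense codons


def dde_features(sequence):
    """Return a 400-entry dict of DDE values for a protein sequence."""
    n_pairs = len(sequence) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    observed = {}
    for i in range(n_pairs):
        dipeptide = sequence[i:i + 2]
        observed[dipeptide] = observed.get(dipeptide, 0) + 1
    features = {}
    for r, s in product(CODON_COUNTS, repeat=2):
        dc = observed.get(r + s, 0) / n_pairs            # observed composition
        tm = (CODON_COUNTS[r] / TOTAL_CODONS) * (CODON_COUNTS[s] / TOTAL_CODONS)
        tv = tm * (1 - tm) / n_pairs                      # theoretical variance
        features[r + s] = (dc - tm) / math.sqrt(tv)
    return features


features = dde_features("MKKLA")
```

Dipeptides that occur more often than codon usage alone would predict receive positive values and absent ones receive negative values. A 400-element vector of this kind can be reshaped into a 20 x 20 grid, which is one plausible way to feed it to a two-dimensional convolutional network, although the abstract does not state the input layout.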
Affiliation(s)
- Rahu Sikander
- School of Computer Science and Technology, Xidian University, Xi’an, China
- *Correspondence: Rahu Sikander; Apilak Worachartcheewan
- Muhammad Arif
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- Ali Ghulam
- Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Apilak Worachartcheewan
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- *Correspondence: Rahu Sikander; Apilak Worachartcheewan
- Maha A. Thafar
- Department of Computer Science, College of Computer and Information Technology, Taif University, Taif, Saudi Arabia
- Shabana Habib
- Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia
11
Yu L, Qiu W, Lin W, Cheng X, Xiao X, Dai J. HGDTI: predicting drug-target interaction by using information aggregation based on heterogeneous graph neural network. BMC Bioinformatics 2022; 23:126. [PMID: 35413800 PMCID: PMC9004085 DOI: 10.1186/s12859-022-04655-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/01/2021] [Accepted: 03/28/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In research on new drug discovery, the traditional wet experiment has a long period. Predicting drug-target interaction (DTI) in silico can greatly narrow the scope of search of candidate medications. Excellent algorithm model may be more effective in revealing the potential connection between drug and target in the bioinformatics network composed of drugs, proteins and other related data. RESULTS In this work, we have developed a heterogeneous graph neural network model, named as HGDTI, which includes a learning phase of network node embedding and a training phase of DTI classification. This method first obtains the molecular fingerprint information of drugs and the pseudo amino acid composition information of proteins, then extracts the initial features of nodes through Bi-LSTM, and uses the attention mechanism to aggregate heterogeneous neighbors. In several comparative experiments, the overall performance of HGDTI significantly outperforms other state-of-the-art DTI prediction models, and the negative sampling technology is employed to further optimize the prediction power of model. In addition, we have proved the robustness of HGDTI through heterogeneous network content reduction tests, and proved the rationality of HGDTI through other comparative experiments. These results indicate that HGDTI can utilize heterogeneous information to capture the embedding of drugs and targets, and provide assistance for drug development. CONCLUSIONS The HGDTI based on heterogeneous graph neural network model, can utilize heterogeneous information to capture the embedding of drugs and targets, and provide assistance for drug development. For the convenience of related researchers, a user-friendly web-server has been established at http://bioinfo.jcu.edu.cn/hgdti .
Affiliation(s)
- Liyi Yu
- School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Wangren Qiu
- School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Weizhong Lin
- School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xiang Cheng
- School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xuan Xiao
- School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Jiexia Dai
- School of Foreign Languages, Jingdezhen University, Jingdezhen, China
12
Arya N, Saha S. Multi-Modal Classification for Human Breast Cancer Prognosis Prediction: Proposal of Deep-Learning Based Stacked Ensemble Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1032-1041. [PMID: 32822302 DOI: 10.1109/tcbb.2020.3018467] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 06/11/2023]
Abstract
Breast cancer is a highly aggressive type of cancer that generally forms in the cells of the breast. Despite significant advances in the treatment of primary breast cancer in the last decade, there is a dire need for an accurate predictive model for breast cancer prognosis prediction. Researchers from various disciplines are working together to develop methods to save people from this fatal disease. A good predictive model can help in correct prognosis prediction of breast cancer. This accurate prediction can have several benefits, such as detecting cancer at an early stage and sparing patients unnecessary treatment and the related medical expenses. Previous works rely mostly on uni-modal data (selected gene expression) for predictive model design. In recent years, however, multi-modal cancer data sets have become available (gene expression, copy number alteration and clinical). Motivated by the enhancement of deep-learning based models, in the current study, we propose to use some deep-learning based predictive models in a stacked ensemble framework to improve the prognosis prediction of breast cancer from available multi-modal data sets. One of the unique advantages of the proposed approach lies in the architecture of the model. It is a two-stage model. Stage one uses a convolutional neural network for feature extraction, while stage two uses the extracted features as input to the stack-based ensemble model. The predictive performance, evaluated using different performance measures, shows that this model produces better results than existing approaches. The model achieves an AUC value of 0.93 and an accuracy of 90.2 percent at a medium stringency level (specificity = 95 percent and threshold = 0.45). Keras 2.2.1, along with TensorFlow 1.12, is used to implement the model. The source code can be downloaded from GitHub: https://github.com/nikhilaryan92/BreastCancer.
13
Wang C, Tan X, Tang D, Gou Y, Han C, Ning W, Lin S, Zhang W, Chen M, Peng D, Xue Y. GPS-Uber: a hybrid-learning framework for prediction of general and E3-specific lysine ubiquitination sites. Brief Bioinform 2022; 23:6509047. [PMID: 35037020 DOI: 10.1093/bib/bbab574] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/03/2021] [Revised: 12/11/2021] [Accepted: 12/14/2021] [Indexed: 12/13/2022] Open
Abstract
As an important post-translational modification, lysine ubiquitination participates in numerous biological processes and is involved in human diseases, whereas the site specificity of ubiquitination is mainly decided by ubiquitin-protein ligases (E3s). Although numerous ubiquitination predictors have been developed, computational prediction of E3-specific ubiquitination sites is still a great challenge. Here, we carefully reviewed the existing tools for the prediction of general ubiquitination sites. Also, we developed a tool named GPS-Uber for the prediction of general and E3-specific ubiquitination sites. From the literature, we manually collected 1311 experimentally identified site-specific E3-substrate relations, which were classified into different clusters based on corresponding E3s at different levels. To predict general ubiquitination sites, we integrated 10 types of sequence and structure features, as well as three types of algorithms including penalized logistic regression, deep neural network and convolutional neural network. Compared with other existing tools, the general model in GPS-Uber exhibited a highly competitive accuracy, with an area under the curve (AUC) value of 0.7649. Then, transfer learning was adopted for each E3 cluster to construct E3-specific models, and in total 112 individual E3-specific predictors were implemented. Using GPS-Uber, we conducted a systematic prediction of human cancer-associated ubiquitination events, which could be helpful for further experimental consideration. GPS-Uber will be regularly updated, and its online service is free for academic research at http://gpsuber.biocuckoo.cn/.
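Several entries in this list compare predictors by AUC. As a rough illustration of what that number measures, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive site scores higher than a randomly chosen negative one, counting ties as one half. The scores and labels are toy values, and this is not the evaluation code of GPS-Uber.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value such as 0.7649 sits between the two.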
Affiliation(s)
- Chenwei Wang, Xiaodan Tan, Dachao Tang, Yujie Gou, Cheng Han, Wanshan Ning, Shaofeng Lin, Weizhi Zhang, Miaomiao Chen, Di Peng, Yu Xue
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, Center for Artificial Intelligence Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
14
|
Chen Z, Liu X, Li F, Li C, Marquez-Lago T, Leier A, Webb GI, Xu D, Akutsu T, Song J. Systematic Characterization of Lysine Post-translational Modification Sites Using MUscADEL. Methods Mol Biol 2022; 2499:205-219. [PMID: 35696083 DOI: 10.1007/978-1-0716-2317-6_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Among various types of protein post-translational modifications (PTMs), lysine PTMs play an important role in regulating a wide range of functions and biological processes. Due to the generation and accumulation of an enormous amount of protein sequence data by ongoing whole-genome sequencing projects, systematic identification of different types of lysine PTM substrates and their specific PTM sites in the entire proteome is increasingly important and has therefore received much attention. Accordingly, a variety of computational methods for lysine PTM identification have been developed based on the combination of various handcrafted sequence features and machine-learning techniques. In this chapter, we first briefly review existing computational methods for lysine PTM identification and then introduce a recently developed deep learning-based method, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs). Specifically, MUscADEL employs bidirectional long short-term memory (BiLSTM) recurrent neural networks and is capable of predicting eight major types of lysine PTMs in both the human and mouse proteomes. The web server of MUscADEL is publicly available at http://muscadel.erc.monash.edu/ for the research community to use.
Affiliation(s)
- Zhen Chen
- Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou, China
- Xuhan Liu
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Chen Li
- Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
- Tatiana Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- Dakang Xu
- Faculty of Medical Laboratory Science, Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Department of Molecular and Translational Science, Faculty of Medicine, Hudson Institute of Medical Research, Monash University, Melbourne, VIC, Australia
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
- Jiangning Song
- Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
15
|
Automatic Diagnosis of Epileptic Seizures in EEG Signals Using Fractal Dimension Features and Convolutional Autoencoder Method. BIG DATA AND COGNITIVE COMPUTING 2021. [DOI: 10.3390/bdcc5040078] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/16/2022]
Abstract
This paper proposes a new method for epileptic seizure detection in electroencephalography (EEG) signals using nonlinear features based on fractal dimension (FD) and a deep learning (DL) model. The Bonn and Freiburg datasets were used to perform the experiments. The Bonn dataset consists of binary and multi-class classification problems, and the Freiburg dataset consists of two-class EEG classification problems. In the preprocessing step, all datasets were filtered using a Butterworth band-pass filter with 0.5–60 Hz cut-off frequencies. Then, the EEG signals were segmented into different time windows, and the dual-tree complex wavelet transform (DT-CWT) was used to decompose them into different sub-bands. For feature extraction, various FD techniques were used, including Higuchi (HFD), Katz (KFD), Petrosian (PFD), Hurst exponent (HE), detrended fluctuation analysis (DFA), Sevcik, box counting (BC), multiresolution box-counting (MBC), Maragos-Sun (MSFD), multifractal DFA (MF-DFA), and recurrence quantification analysis (RQA). In the next step, the minimum redundancy maximum relevance (mRMR) technique was used for feature selection. Finally, k-nearest neighbors (KNN), support vector machine (SVM), and convolutional autoencoder (CNN-AE) classifiers were used for the classification step, and K-fold cross-validation with k = 10 was employed to demonstrate the effectiveness of the classifiers. The experimental results show that the proposed CNN-AE method achieved an accuracy of 99.736% and 99.176% for the Bonn and Freiburg datasets, respectively.
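Of the fractal-dimension features listed above, the Petrosian FD is among the simplest to compute. The sketch below uses the commonly cited formulation, PFD = log10(N) / (log10(N) + log10(N / (N + 0.4·NΔ))), where N is the number of samples and NΔ is the number of sign changes in the first difference of the signal. It is an illustration only; the paper's own implementation and windowing are not shown in the abstract.

```python
import math


def petrosian_fd(signal):
    """Petrosian fractal dimension based on sign changes of the first difference."""
    n = len(signal)
    if n < 3:
        raise ValueError("need at least three samples")
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    # Count how often consecutive differences change sign (local extrema).
    sign_changes = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return math.log10(n) / (
        math.log10(n) + math.log10(n / (n + 0.4 * sign_changes))
    )


smooth = petrosian_fd([1, 2, 3, 4, 5])   # monotonic ramp, no sign changes
rough = petrosian_fd([0, 1] * 50)        # alternating signal, many sign changes
```

A monotonic ramp has no sign changes and yields exactly 1.0, while a rapidly alternating signal yields a larger value, which is the property that makes the feature sensitive to irregular EEG activity.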
16
|
Wang ZH, Xiao XL, Zhang ZT, He K, Hu F. A Radiomics Model for Predicting Early Recurrence in Grade II Gliomas Based on Preoperative Multiparametric Magnetic Resonance Imaging. Front Oncol 2021; 11:684996. [PMID: 34540662 PMCID: PMC8443788 DOI: 10.3389/fonc.2021.684996] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2021] [Accepted: 08/12/2021] [Indexed: 12/23/2022] Open
Abstract
Objective: This study aimed to develop a radiomics model to predict early recurrence (<1 year) in grade II glioma after the first resection. Methods: The pathological, clinical, and magnetic resonance imaging (MRI) data of patients diagnosed with grade II glioma who underwent surgery and had a recurrence between 2017 and 2020 in our hospital were retrospectively analyzed. After a rigorous selection, 64 patients were eligible and enrolled in the study. Twenty-two cases had a pathologically confirmed recurrent glioma. The cases were randomly assigned in a ratio of 7:3 to either the training set or the validation set. T1-weighted images (T1WI), T2-weighted images (T2WI), and contrast-enhanced T1-weighted images (T1CE) were acquired. The minimum-redundancy-maximum-relevancy (mRMR) method, alone or in combination with univariate logistic analysis, was used to identify the optimal predictive features from the three image sequences. Multivariate logistic regression analysis was then used to develop a predictive model using the screened features. The performance of each model in both training and validation datasets was assessed using a receiver operating characteristic (ROC) curve, calibration curve, and decision curve analysis (DCA). Results: A total of 396 radiomics features were initially extracted from each image sequence. After running the mRMR and univariate logistic analysis, nine predictive features were identified and used to build the multiparametric radiomics model. The model had a higher AUC than the univariate models in both the training and validation data sets, with an AUC of 0.966 (95% confidence interval: 0.949–0.99) and 0.930 (95% confidence interval: 0.905–0.973), respectively. The calibration curves indicated good agreement between the predicted and the actual probability of developing recurrence. The DCA demonstrated that the predictive value of the model improved when combining the three MRI sequences.
Conclusion: Our multiparametric radiomics model could be used as an efficient and accurate tool for predicting the recurrence of grade II glioma.
Affiliation(s)
- Zhen-Hua Wang, Xin-Lan Xiao, Zhao-Tao Zhang, Keng He, Feng Hu
- Department of Radiology, The Second Affiliated Hospital of Nanchang University, Nanchang, China
17
|
Liu X, Shen Y, Zhang Y, Liu F, Ma Z, Yue Z, Yue Y. IdentPMP: identification of moonlighting proteins in plants using sequence-based learning models. PeerJ 2021; 9:e11900. [PMID: 34434652 PMCID: PMC8351581 DOI: 10.7717/peerj.11900] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/17/2020] [Accepted: 07/13/2021] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND: A moonlighting protein is a protein that can perform two or more functions. Current moonlighting protein prediction tools mainly focus on proteins in animals and microorganisms, and the differences in cells and proteins between animals and plants may cause these tools to predict plant moonlighting proteins inaccurately. Hence, a benchmark data set and a prediction tool specific to plant moonlighting proteins are necessary. METHODS: This study used protein feature classes from a data set constructed in house to develop a web-based prediction tool. First, we built a data set of plant proteins and removed redundant sequences. We then performed feature selection, feature normalization and feature dimensionality reduction on the training data. Next, machine learning methods for preliminary modeling were used to select the feature classes that performed best in plant moonlighting protein prediction. The selected feature classes were incorporated into the final plant protein prediction tool. After that, we compared five machine learning methods and used grid searching to optimize parameters, and the most suitable method was chosen as the final model. RESULTS: The prediction results indicated that eXtreme Gradient Boosting (XGBoost) performed best, and it was used as the algorithm to construct the prediction tool, called IdentPMP (Identification of Plant Moonlighting Proteins). The results on the independent test set show that the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUC) of IdentPMP are 0.43 and 0.68, which are 19.44% (0.43 vs. 0.36) and 13.33% (0.68 vs. 0.60) higher than state-of-the-art non-plant specific methods, respectively. This further demonstrates that a benchmark data set and a plant-specific prediction tool are required for plant moonlighting protein studies.
Finally, we implemented the tool as a web version, which users can access freely through the URL: http://identpmp.aielab.net/.
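The percentages quoted in the results are relative improvements over the baseline scores rather than absolute differences. A short check of that arithmetic, using only the four values stated in the abstract (the function name is a hypothetical helper):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values taken from the abstract: IdentPMP vs. non-plant-specific methods.
auprc_gain = relative_gain(0.43, 0.36)   # about 0.1944, i.e. 19.44%
auc_gain = relative_gain(0.68, 0.60)     # about 0.1333, i.e. 13.33%
```

In absolute terms the gains are 0.07 AUPRC and 0.08 AUC.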
Affiliation(s)
- Xinyi Liu, Yueyue Shen, Youhua Zhang, Fei Liu, Zhiyu Ma, Zhenyu Yue, Yi Yue
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui, China
18
|
Liu Y, Jin S, Song L, Han Y, Yu B. Prediction of protein ubiquitination sites via multi-view features based on eXtreme gradient boosting classifier. J Mol Graph Model 2021; 107:107962. [PMID: 34198216 DOI: 10.1016/j.jmgm.2021.107962] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/04/2020] [Revised: 05/03/2021] [Accepted: 06/02/2021] [Indexed: 01/29/2023]
Abstract
Ubiquitination is a common and reversible post-translational protein modification that regulates apoptosis and plays an important role in protein degradation and cell diseases. However, experimental identification of protein ubiquitination sites is usually time-consuming and labor-intensive, so it is necessary to establish effective predictors. In this study, we propose a ubiquitination site prediction method based on multi-view features, namely UbiSite-XGBoost. Firstly, we use seven single-view feature-encoding methods to convert protein sequence fragments into numerical information. Secondly, the least absolute shrinkage and selection operator (LASSO) is applied to remove redundant information and obtain the optimal feature subsets. Finally, these features are fed into the eXtreme gradient boosting (XGBoost) classifier to predict ubiquitination sites. Five-fold cross-validation shows that the AUC values of the Set1-Set6 datasets are 0.8258, 0.7592, 0.7853, 0.8345, 0.8979 and 0.8901, respectively. The synthetic minority oversampling technique (SMOTE) is employed on the unbalanced Set4-Set6 datasets, and the resulting AUC values are 0.9777, 0.9782 and 0.9860, respectively. In addition, we have constructed three independent test datasets, for which the AUC values are 0.8007, 0.6897 and 0.7280, respectively. The results show that the proposed method UbiSite-XGBoost is superior to other ubiquitination prediction methods and provides new guidance for the identification of ubiquitination sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/UbiSite-XGBoost/.
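The SMOTE step mentioned above balances a data set by creating synthetic minority-class samples. The following is a minimal pure-Python sketch of the core idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. The function name, the default parameters and the toy points are assumptions for illustration; the authors' repository should be consulted for the implementation they actually used.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        mate = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> mate, in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
minority_class = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_class, n_new=6)
```

Because every synthetic point lies between two real minority samples, oversampling should be applied to training folds only; applying it before cross-validation splits can inflate the measured AUC.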
Affiliation(s)
- Yushuang Liu, Shuping Jin, Lili Song, Yu Han
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China
19
|
Arya N, Saha S. Multi-modal advanced deep learning architectures for breast cancer survival prediction. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106965] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 12/28/2022]
20
|
The Blood Gene Expression Signature for Kawasaki Disease in Children Identified with Advanced Feature Selection Methods. BIOMED RESEARCH INTERNATIONAL 2021; 2020:6062436. [PMID: 32685506 PMCID: PMC7327570 DOI: 10.1155/2020/6062436] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/26/2020] [Accepted: 06/12/2020] [Indexed: 01/22/2023]
Abstract
Kawasaki disease (KD) is an acute vasculitis, accompanied by coronary artery aneurysm, coronary artery dilatation, arrhythmia, and other serious cardiovascular diseases. The etiology of KD remains unclear, so it is necessary to study the molecular mechanism and related factors of KD. In this study, we analyzed the expression profiles of 75 DB (identifying bacteria), 122 DV (identifying virus), 71 HC (healthy control), and 311 KD (Kawasaki disease) samples. A total of 332 key genes related to KD and pathogen infections were identified using a combination of advanced feature selection methods: (1) Boruta, (2) Monte-Carlo Feature Selection (MCFS), and (3) Incremental Feature Selection (IFS). The number of signature genes was narrowed down step by step. Subsequently, their functions were revealed by KEGG and GO enrichment analyses. Our results provide clues to the potential molecular mechanisms of KD and are helpful for KD detection and treatment.
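The final step of the pipeline above, Incremental Feature Selection, can be sketched generically: given features already ranked by importance, evaluate the top-1, top-2, ... top-N prefixes and keep the prefix with the best score. In the sketch below the gene names and the toy scoring function are hypothetical stand-ins; in practice the evaluation would be a cross-validated classifier metric, which the abstract does not specify.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return the best-scoring prefix of a ranked feature list and its score."""
    best_score = float("-inf")
    best_subset = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:  # strict comparison keeps the smallest best prefix
            best_score, best_subset = score, list(subset)
    return best_subset, best_score


# Hypothetical ranking and scorer: three informative genes, with a small
# penalty per feature so that uninformative genes lower the score.
informative = {"g1", "g2", "g3"}


def toy_score(subset):
    return sum(1 for g in subset if g in informative) - 0.1 * len(subset)


ranked = ["g1", "g2", "g3", "g4", "g5"]
selected, selected_score = incremental_feature_selection(ranked, toy_score)
```

Here the score peaks at the first three genes, so the selection stops growing once additional genes no longer help.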
21
|
Zhang ZM, Guan ZX, Wang F, Zhang D, Ding H. Application of Machine Learning Methods in Predicting Nuclear Receptors and their Families. Med Chem 2021; 16:594-604. [PMID: 31584374 DOI: 10.2174/1573406415666191004125551] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/20/2019] [Revised: 06/18/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
Nuclear receptors (NRs) are a superfamily of ligand-dependent transcription factors that are closely related to cell development, differentiation, reproduction, homeostasis, and metabolism. According to the alignments of the conserved domains, NRs are classified into seven or eight subfamilies: (1) NR1: thyroid hormone like (thyroid hormone, retinoic acid, RAR-related orphan receptor, peroxisome proliferator activated, vitamin D3-like), (2) NR2: HNF4-like (hepatocyte nuclear factor 4, retinoic acid X, tailless-like, COUP-TF-like, USP), (3) NR3: estrogen-like (estrogen, estrogen-related, glucocorticoid-like), (4) NR4: nerve growth factor IB-like (NGFI-B-like), (5) NR5: fushi tarazu-F1 like (fushi tarazu-F1 like), (6) NR6: germ cell nuclear factor like (germ cell nuclear factor), and (7) NR0: knirps like (knirps, knirps-related, embryonic gonad protein, ODR7, trithorax) and DAX like (DAX, SHP), or, when NR0 is divided, (7) NR7: knirps like and (8) NR8: DAX like. Different NR families have different structural features and functions. Since the function of an NR is closely correlated with the subfamily it belongs to, it is highly desirable to identify NRs and their subfamilies rapidly and effectively. The knowledge acquired is essential for a proper understanding of normal and abnormal cellular mechanisms. With the advent of the post-genomics era, the number of proteins with known sequences has increased explosively. Conventional methods for accurately classifying the family of NRs are experimental, with high cost and low efficiency. This has created a greater need for bioinformatics tools to effectively recognize NRs and their subfamilies for the purpose of understanding their biological function. In this review, we summarize the application of machine learning methods in the prediction of NRs from different aspects. We hope that this review will provide a reference for further research on the classification of NRs and their families.
Affiliation(s)
- Zi-Mei Zhang, Zheng-Xing Guan, Fang Wang, Dan Zhang, Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
22
|
Yu X, Pan X, Zhang S, Zhang YH, Chen L, Wan S, Huang T, Cai YD. Identification of Gene Signatures and Expression Patterns During Epithelial-to-Mesenchymal Transition From Single-Cell Expression Atlas. Front Genet 2021; 11:605012. [PMID: 33584803 PMCID: PMC7876317 DOI: 10.3389/fgene.2020.605012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/19/2020] [Accepted: 12/21/2020] [Indexed: 11/13/2022] Open
Abstract
Cancer, which refers to abnormal cell proliferative diseases with systematic pathogenic potential, is one of the leading threats to human health. The final causes for patients’ deaths are usually cancer recurrence, metastasis, and drug resistance against continuing therapy. Epithelial-to-mesenchymal transition (EMT), which is the transformation of tumor cells (TCs), is a prerequisite for pathogenic cancer recurrence, metastasis, and drug resistance. Conventional biomarkers can only define and recognize large tissues with obvious EMT markers but cannot accurately monitor detailed EMT processes. In this study, a systematic workflow was established integrating effective feature selection, multiple machine learning models [Random forest (RF), Support vector machine (SVM)], rule learning, and functional enrichment analyses to find new biomarkers and their functional implications for distinguishing single-cell isolated TCs with unique epithelial or mesenchymal markers using public single-cell expression profiling. Our discovered signatures may provide an effective and precise transcriptomic reference to monitor EMT progression at the single-cell level and contribute to the exploration of detailed tumorigenesis mechanisms during EMT.
Affiliation(s)
- Xiangtian Yu
- Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China
- XiaoYong Pan
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- ShiQi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
- Yu-Hang Zhang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China; Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, China
- Sibao Wan
- School of Life Sciences, Shanghai University, Shanghai, China
- Tao Huang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
23
|
Meng F, Liang Z, Zhao K, Luo C. Drug design targeting active posttranslational modification protein isoforms. Med Res Rev 2020; 41:1701-1750. [PMID: 33355944 DOI: 10.1002/med.21774] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 09/24/2020] [Revised: 11/29/2020] [Accepted: 12/03/2020] [Indexed: 12/11/2022]
Abstract
Modern drug design aims to discover novel lead compounds with attractive chemical profiles to enable further exploration of the intersection of chemical space and biological space. Identification of small molecules with good ligand efficiency, high activity, and selectivity is crucial for developing effective and safe drugs. However, finding this intersection is one of the most challenging tasks in the pharmaceutical industry, as chemical space is almost infinite and continuous, whereas biological space is very limited and discrete. This bottleneck potentially limits the discovery of molecules with desirable properties for lead optimization. Herein, we present a new direction that leverages the target space of posttranslational modification (PTM) protein isoforms to inspire drug design, termed "Post-translational Modification Inspired Drug Design (PTMI-DD)." PTMI-DD aims to extend the intersection of chemical space and biological space. We further rationalize and highlight the importance of PTM protein isoforms and their roles in various diseases and biological functions. We then lay out a few directions for applying PTMI-DD in drug design, including discovering covalent binding inhibitors that mimic PTMs, targeting PTM protein isoforms whose binding sites are distinct from those of their wild-type counterparts, targeting protein-protein interactions involving PTMs, and hijacking protein degradation by ubiquitination for PTM protein isoforms. These directions will lead to a significant expansion of the biological space and/or increase the tractability of compounds, primarily by precisely targeting PTM protein isoforms or complexes that are highly relevant to biological functions. Importantly, this new avenue will further enrich personalized treatment opportunities through precision medicine targeting PTM isoforms.
Affiliation(s)
- Fanwang Meng
- Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China; Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario, Canada
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou, China
- Kehao Zhao
- School of Pharmacy, Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai University), Ministry of Education, Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, Yantai University, Yantai, China
- Cheng Luo
- Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China

24
A Comparative Analysis of Machine Learning classifiers for Dysphonia-based classification of Parkinson’s Disease. International Journal of Data Science and Analytics 2020. [DOI: 10.1007/s41060-020-00234-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 10/23/2022]

25
Wang H, Wang Z, Li Z, Lee TY. Incorporating Deep Learning With Word Embedding to Identify Plant Ubiquitylation Sites. Front Cell Dev Biol 2020; 8:572195. [PMID: 33102477 PMCID: PMC7554246 DOI: 10.3389/fcell.2020.572195] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 06/13/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022] Open
Abstract
Protein ubiquitylation is an important posttranslational modification (PTM), which is involved in diverse biological processes and plays an essential role in the regulation of physiological mechanisms and diseases. The Protein Lysine Modifications Database (PLMD) has accumulated abundant ubiquitylated proteins with their substrate sites for more than 20 kinds of species. Numerous works have consequently developed a variety of ubiquitylation site prediction tools across all species, mainly relying on the predefined sequence features and machine learning algorithms. However, the difference in ubiquitylated patterns between these species stays unclear. In this work, the sequence-based characterization of ubiquitylated substrate sites has revealed remarkable differences among plants, animals, and fungi. Then an improved word-embedding scheme based on the transfer learning strategy was incorporated with the multilayer convolutional neural network (CNN) for identifying protein ubiquitylation sites. For the prediction of plant ubiquitylation sites, the proposed deep learning scheme could outperform the machine learning-based methods, with the accuracy of 75.6%, precision of 73.3%, recall of 76.7%, F-score of 0.7493, and 0.82 AUC on the independent testing set. Although the ubiquitylated specificity of substrate sites is complicated, this work has demonstrated that the application of the word-embedding method can enable the extraction of informative features and help the identification of ubiquitylated sites. To accelerate the investigation of protein ubiquitylation, the data sets and source code used in this study are freely available at https://github.com/wang-hong-fei/DL-plant-ubsites-prediction.
Affiliation(s)
- Hongfei Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Zhuo Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Zhongyan Li
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
- Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China

26
Bi Y, Xiang D, Ge Z, Li F, Jia C, Song J. An Interpretable Prediction Model for Identifying N7-Methylguanosine Sites Based on XGBoost and SHAP. Molecular Therapy. Nucleic Acids 2020; 22:362-372. [PMID: 33230441 PMCID: PMC7533297 DOI: 10.1016/j.omtn.2020.08.022] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 07/14/2020] [Accepted: 08/20/2020] [Indexed: 12/19/2022]
Abstract
Recent studies have increasingly shown that the chemical modification of mRNA plays an important role in the regulation of gene expression. N7-methylguanosine (m7G) is a type of positively-charged mRNA modification that plays an essential role for efficient gene expression and cell viability. However, the research on m7G has received little attention to date. Bioinformatics tools can be applied as auxiliary methods to identify m7G sites in transcriptomes. In this study, we develop a novel interpretable machine learning-based approach termed XG-m7G for the differentiation of m7G sites using the XGBoost algorithm and six different types of sequence-encoding schemes. Both 10-fold and jackknife cross-validation tests indicate that XG-m7G outperforms iRNA-m7G. Moreover, using the powerful SHAP algorithm, this new framework also provides desirable interpretations of the model performance and highlights the most important features for identifying m7G sites. XG-m7G is anticipated to serve as a useful tool and guide for researchers in their future studies of mRNA modification sites.
Affiliation(s)
- Yue Bi
- School of Science, Dalian Maritime University, Dalian 116026, China
- Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia

27
Liu Y, Li A, Zhao XM, Wang M. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods 2020; 192:103-111. [PMID: 32791338 DOI: 10.1016/j.ymeth.2020.08.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 06/11/2020] [Revised: 07/17/2020] [Accepted: 08/06/2020] [Indexed: 11/16/2022] Open
Abstract
Ubiquitination is one of the most important post-translational modifications and is involved in many biological processes. Because mass spectrometry-based ubiquitination site identification methods are costly and time consuming, computational approaches provide alternative ways to determine ubiquitination sites. Although machine learning based methods can effectively predict ubiquitination sites, most of them rely on feature engineering, which may lead to biased or incomplete features. Recently, deep learning has achieved great success in the prediction of post-translational modification sites. However, deep learning methods have not been explored in the prediction of species-specific ubiquitination sites. In this paper, we propose a novel deep transfer learning method, named DeepTL-Ubi, for predicting ubiquitination sites of multiple species. DeepTL-Ubi enhances the performance of species-specific ubiquitination site prediction by transferring common knowledge from the large amount of human data to other species, which effectively solves the problem of insufficient training data for other species. We train and test our model on ubiquitination sites collected for multiple species from several sources. Experimental results show that our transfer learning technique can effectively improve the predictive performance for species with small sample sizes, and that DeepTL-Ubi is superior to existing tools in many species. The source code and training data of DeepTL-Ubi are publicly deposited at https://github.com/USTC-HIlab/DeepTL-Ubi.
Affiliation(s)
- Yu Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.
- Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
- Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China.
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.

28
Wang K, Zhou Z, Wang R, Chen L, Zhang Q, Sher D, Wang J. A multi-objective radiomics model for the prediction of locoregional recurrence in head and neck squamous cell cancer. Med Phys 2020; 47:5392-5400. [DOI: 10.1002/mp.14388] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 02/18/2020] [Revised: 05/11/2020] [Accepted: 07/02/2020] [Indexed: 02/05/2023] Open
Affiliation(s)
- Kai Wang
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- Zhiguo Zhou
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- School of Computer Science and Mathematics, University of Central Missouri, Warrensburg, MO 64093, USA
- Rongfang Wang
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Liyuan Chen
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- Qiongwen Zhang
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- State Key Laboratory of Biotherapy and Cancer Center, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Department of Head and Neck Cancer, West China Hospital, Chengdu 610041, China
- David Sher
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA
- Jing Wang
- Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75390, USA

29
Song C, Yang B. Use Chou’s 5-Step Rule to Classify Protein Modification Sites with Neural Network. Scientific Programming 2020; 2020:1-7. [DOI: 10.1155/2020/8894633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
Lysine malonylation is a novel type of protein post-translational modification and plays essential roles in many biological activities. A good knowledge of malonylation sites can provide guidance in disease prevention, drug discovery, and other related fields. There are several experimental approaches to identify modification sites in the field of biology. However, these methods seem to be expensive. In this study, we proposed malNet, which employed a neural network and utilized several novel and effective feature description methods. It was shown that the ANN performs better than other models. Furthermore, we trained the classifiers according to an original cross-validation method named Split to Equal validation (SEV). The results achieved an AUC value of 0.6684, accuracy of 54.93%, and MCC of 0.1045, which is a great improvement over previous results.
Affiliation(s)
- Chuandong Song
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong 277160, China
- Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong 277160, China

30
Wang L, Zhang R. Towards Computational Models of Identifying Protein Ubiquitination Sites. Curr Drug Targets 2020; 20:565-578. [PMID: 30246637 DOI: 10.2174/1389450119666180924150202] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/31/2018] [Revised: 08/29/2018] [Accepted: 09/04/2018] [Indexed: 12/25/2022]
Abstract
Ubiquitination is an important post-translational modification (PTM) process for the regulation of protein functions, which is associated with cancer, cardiovascular and other diseases. Recent initiatives have focused on the detection of potential ubiquitination sites with the aid of physicochemical test approaches in conjunction with the application of computational methods. The identification of ubiquitination sites using laboratory tests is especially susceptible to the temporality and reversibility of the ubiquitination processes, and is also costly and time-consuming. It has been demonstrated that computational methods are effective in extracting potential rules or inferences from biological sequence collections. Up to the present, the computational strategy has been one of the critical research approaches that have been applied for the identification of ubiquitination sites, and currently, there are numerous state-of-the-art computational methods that have been developed from machine learning and statistical analysis to undertake such work. In the present study, the construction of benchmark datasets is summarized, together with feature representation methods, feature selection approaches and the classifiers involved in several previous publications. In an attempt to explore pertinent development trends for the identification of ubiquitination sites, an independent test dataset was constructed and the predicting results obtained from five prediction tools are reported here, together with some related discussions.
Affiliation(s)
- Lidong Wang
- College of Science, Dalian Maritime University, Dalian, China
- Ruijun Zhang
- College of Science, Dalian Maritime University, Dalian, China

31
Mosharaf MP, Hassan MM, Ahmed FF, Khatun MS, Moni MA, Mollah MNH. Computational prediction of protein ubiquitination sites mapping on Arabidopsis thaliana. Comput Biol Chem 2020; 85:107238. [DOI: 10.1016/j.compbiolchem.2020.107238] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 12/16/2018] [Revised: 01/22/2020] [Accepted: 02/18/2020] [Indexed: 02/06/2023]

32
Arif M, Ahmad S, Ali F, Fang G, Li M, Yu DJ. TargetCPP: accurate prediction of cell-penetrating peptides from optimized multi-scale features using gradient boost decision tree. J Comput Aided Mol Des 2020; 34:841-856. [PMID: 32180124 DOI: 10.1007/s10822-020-00307-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 11/12/2019] [Accepted: 03/09/2020] [Indexed: 02/08/2023]
Abstract
Cell-penetrating peptides (CPPs) are short, permeable peptides that have emerged as a tool for delivering therapeutic agents, including genetic materials and macromolecules, into cells. Recently, CPPs have become a hotspot in life science research and have paved a new way of treating disease without harming cell viability, owing to their nontoxic character. Therefore, the correct identification of CPPs will provide hints for medical applications. Considering the shortcomings of traditional experimental CPP identification, an intelligent predictor is urgently needed for accurate identification of CPPs among large-scale uncharacterized sequences. We develop a novel computational method, called TargetCPP, to discriminate CPPs from non-CPPs with improved accuracy. In TargetCPP, the peptide sequences are first formulated with four distinct encoding methods, i.e., composite protein sequence representation, composition transition and distribution, split amino acid composition, and information theory features. These feature vectors were fused, and a minimum redundancy and maximum relevancy feature selection method was applied to choose an optimal subset of features. Finally, the predictive model is learned through different classification algorithms on the optimized features. Among these classifiers, the gradient boost decision tree algorithm achieved excellent performance throughout the experiments. Notably, the TargetCPP tool attained high prediction accuracy of 93.54% and 88.28% using jackknife and independent tests, respectively. Empirical outcomes show the superiority of the proposed bioinformatics method over state-of-the-art methods. It is anticipated that the outcomes of this study will provide a strong background for large-scale prediction of CPPs and instructive guidance in clinical therapy and medical applications.
Affiliation(s)
- Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Saeed Ahmad
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Ge Fang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Min Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.

33
Wang M, Cui X, Yu B, Chen C, Ma Q, Zhou H. SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04792-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 01/08/2023]

34
Huang G, Zheng Y, Wu YQ, Han GS, Yu ZG. An Information Entropy-Based Approach for Computationally Identifying Histone Lysine Butyrylation. Front Genet 2020; 10:1325. [PMID: 32117407 PMCID: PMC7033570 DOI: 10.3389/fgene.2019.01325] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/12/2019] [Accepted: 12/05/2019] [Indexed: 12/14/2022] Open
Abstract
Butyrylation plays a crucial role in cellular processes. Owing to technical limitations, identifying histone butyrylation sites on a large scale is a challenging task. To fill the gap, we propose an approach based on information entropy and machine learning for computationally identifying histone butyrylation sites. The proposed method achieves an area under the receiver operating characteristic (ROC) curve of 0.92 on the training set by 3-fold cross validation and 0.80 on the testing set by independent test. Feature analysis implies that amino acid residues downstream and upstream of butyrylation sites exhibit specific sequence motifs to a certain extent. Functional analysis suggests that histone butyrylation is most likely associated with four pathways (systemic lupus erythematosus, alcoholism, viral carcinogenesis and transcriptional misregulation in cancer), is involved in binding with other molecules and in processes of biosynthesis, assembly, arrangement or disassembly, and is located in complexes consisting of DNA, RNA, protein, etc. The proposed method is useful for predicting histone butyrylation sites. Analysis of features and function improves understanding of histone butyrylation and increases knowledge of the functions of butyrylated histones.
Affiliation(s)
- Guohua Huang
- Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
- Yang Zheng
- Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
- Yao-Qun Wu
- Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Guo-Sheng Han
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD, Australia

35
Rajab M, Wang D. Practical Challenges and Recommendations of Filter Methods for Feature Selection. Journal of Information & Knowledge Management 2020. [DOI: 10.1142/s0219649220400195] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/08/2023]
Abstract
Feature selection, the process of identifying relevant features to be incorporated into a proposed model, is one of the significant steps of the learning process. It removes noise from the data to increase the learning performance while reducing the computational complexity. The literature review indicated that most previous studies had focused on improving the overall classifier performance or reducing the training-time costs of building the classifiers. However, in this era of big data, there is an urgent need to deal with more complex issues that make feature selection, especially using filter-based methods, more challenging in terms of dimensionality, data structures, data format, domain experts’ availability, data sparsity, and result discrepancies, among others. Filter methods use mathematical models to identify the informative features of a given dataset, from which various predictive models can be built. This paper takes a new route in an attempt to pinpoint recent practical challenges associated with filter methods and discusses potential areas of development to yield better performance. Several practical recommendations, based on recent studies, are made to overcome the identified challenges and make the feature selection process simpler and more efficient.
Affiliation(s)
- Mohammed Rajab
- Department of Computer Science, The University of Sheffield, Sheffield, UK
- Dennis Wang
- Department of Computer Science, The University of Sheffield, Sheffield, UK
- Sheffield Institute for Translational Neuroscience, Sheffield, UK
- NIHR Sheffield Biomedical Research Centre, Sheffield, UK

36
Zhang H, Jin Z, Cheng L, Zhang B. Integrative Analysis of Methylation and Gene Expression in Lung Adenocarcinoma and Squamous Cell Lung Carcinoma. Front Bioeng Biotechnol 2020; 8:3. [PMID: 32117905 PMCID: PMC7019569 DOI: 10.3389/fbioe.2020.00003] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/31/2019] [Accepted: 01/03/2020] [Indexed: 12/18/2022] Open
Abstract
Lung cancer is a highly prevalent type of cancer with a poor 5-year survival rate of about 4-17%. Eighty percent of lung cancers are non-small-cell lung cancer (NSCLC). For a long time, the treatment of NSCLC has been mostly guided by tumor stage, and there has been no significant difference between the therapy strategies for lung adenocarcinoma (LUAD) and squamous cell lung carcinoma (SCLC), the two major subtypes of NSCLC. In recent years, important molecular differences between LUAD and SCLC have been increasingly identified, indicating that targeted therapy will become more and more histologically specific in the future. To investigate the differences between LUAD and SCLC on a multi-omics scale, we analyzed the methylation and gene expression data together. Using the Boruta method to remove irrelevant features and the MCFS (Monte Carlo Feature Selection) method to identify the significantly important features, we identified 113 key methylation features and 23 key gene expression features. HNF1B and TP63 were found to be dysfunctional on both methylation and gene expression levels. The experimentally determined interaction network suggested that TP63 may play an important role in connecting methylation genes and expression genes. Many of the discovered signature genes are supported by the literature. Our results may provide directions for precision diagnosis and therapy of LUAD and SCLC.
Affiliation(s)
- Hao Zhang
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
- Zhou Jin
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China; Department of Respiration, Hospital of Traditional Chinese Medicine of Zhenhai, Ningbo, China
- Ling Cheng
- Shanghai Engineering Research Center of Pharmaceutical Translation, Shanghai, China
- Bin Zhang
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China

37
Qiu WR, Xu A, Xu ZC, Zhang CH, Xiao X. Identifying Acetylation Protein by Fusing Its PseAAC and Functional Domain Annotation. Front Bioeng Biotechnol 2019; 7:311. [PMID: 31867311 PMCID: PMC6908504 DOI: 10.3389/fbioe.2019.00311] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/12/2019] [Accepted: 10/22/2019] [Indexed: 11/13/2022] Open
Abstract
Acetylation is a post-translational modification (PTM), which often reacts with acetic acid and brings an acetyl radical to an organic compound. Correctly identifying acetylation proteins is helpful for understanding the mechanism of acetylation in biological systems. Although many acetylation sites have been identified by high-throughput experimental studies via mass spectrometry, many acetylation sites remain to be discovered. Computational methods have shown their power for identifying acetylation sites with informatics techniques, which usually reduce experimental cost and improve effectiveness and efficiency. An approach that can distinguish acetylated proteins from non-acetylated ones would be a meaningful and effective contribution to this problem. Here, we proposed a novel computational method for identifying acetylation proteins by extracting features from sequence conservation information via a gray system model and from KNN scores based on functional domain annotation and subcellular localization. The authors performed 5-fold cross-validation on three datasets, along with extensive feature analysis and the Relief feature selection algorithm. The obtained accuracies are all satisfactory: as mean performance, the accuracy is 77.10%, the Matthew's correlation coefficient is 0.5457, and the AUC value is 0.8389. This work might provide useful insights for related experimental validation and for further studies of other PTM processes. For the convenience of related researchers, the web-server named “iACetyP” was established and is accessible at http://www.jci-bioinfo.cn/iAcetyP.
Affiliation(s)
- Wang-Ren Qiu
- School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China; School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Ao Xu
- School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Zhao-Chun Xu
- School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Chun-Hua Zhang
- School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xuan Xiao
- School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China

38
Qiu W, Xu C, Xiao X, Xu D. Computational Prediction of Ubiquitination Proteins Using Evolutionary Profiles and Functional Domain Annotation. Curr Genomics 2019; 20:389-399. [PMID: 32476995 PMCID: PMC7235393 DOI: 10.2174/1389202919666191014091250] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/26/2019] [Revised: 07/14/2019] [Accepted: 08/29/2019] [Indexed: 11/22/2022] Open
Abstract
Background: Ubiquitination, as a post-translational modification, is a crucial biological process in cell signaling, apoptosis, and localization. Identification of ubiquitination proteins is of fundamental importance for understanding the molecular mechanisms in biological systems and diseases. Although high-throughput experimental studies using mass spectrometry have identified many ubiquitination proteins and ubiquitination sites, the vast majority of ubiquitination proteins remain undiscovered, even in well-studied model organisms. Objective: To reduce experimental costs, computational methods have been introduced to predict ubiquitination sites, but the accuracy is unsatisfactory. If it can be predicted whether a protein can be ubiquitinated or not, it will help in predicting ubiquitination sites. However, all the computational methods so far can only predict ubiquitination sites. Methods: In this study, the first computational method for predicting ubiquitination proteins without relying on ubiquitination site prediction has been developed. The method extracts features from sequence conservation information through a grey system model, as well as functional domain annotation and subcellular localization. Results: Together with the feature analysis and application of the relief feature selection algorithm, the results of 5-fold cross-validation on three datasets achieved a high accuracy of 90.13%, with Matthew’s correlation coefficient of 80.34%. The predicted results on an independent test data achieved 87.71% as accuracy and 75.43% of Matthew’s correlation coefficient, better than the prediction from the best ubiquitination site prediction tool available. Conclusion: Our study may guide experimental design and provide useful insights for studying the mechanisms and modulation of ubiquitination pathways. The code is available at: https://github.com/Chunhuixu/UBIPredic_QWRCHX
Affiliation(s)
- Wangren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333046, China
| | - Chunhui Xu
- Informatics Institute, University of Missouri, Columbia, MO 65201, USA
| | - Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333046, China
| | - Dong Xu
- Informatics Institute, University of Missouri, Columbia, MO 65201, USA; Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA
|
39
|
Chen Z, Liu X, Li F, Li C, Marquez-Lago T, Leier A, Akutsu T, Webb GI, Xu D, Smith AI, Li L, Chou KC, Song J. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Brief Bioinform 2019; 20:2267-2290. [PMID: 30285084 PMCID: PMC6954452 DOI: 10.1093/bib/bby089] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 06/14/2018] [Revised: 08/17/2018] [Accepted: 08/18/2018] [Indexed: 12/22/2022] Open
Abstract
Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. 
The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide a practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.
Affiliation(s)
- Zhen Chen
- School of Basic Medical Science, Qingdao University, Dengzhou Road, Qingdao, Shandong, China
| | - Xuhan Liu
- Medicinal Chemistry, Leiden Academic Centre for Drug Research, Einsteinweg, Leiden, The Netherlands
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia
- Institute of Molecular Systems Biology, ETH Zürich, Auguste-Piccard-Hof, Zürich, Switzerland
| | - Tatiana Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - Dakang Xu
- Faculty of Medical Laboratory Science, Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Department of Molecular and Translational Science, Faculty of Medicine, Hudson Institute of Medical Research, Monash University, Melbourne, VIC, Australia
| | - Alexander Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
| | - Lei Li
- School of Basic Medical Science, Qingdao University, Dengzhou Road, Qingdao, Shandong, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
|
40
|
Chen L, Li D, Shao Y, Wang H, Liu Y, Zhang Y. Identifying Microbiota Signature and Functional Rules Associated With Bacterial Subtypes in Human Intestine. Front Genet 2019; 10:1146. [PMID: 31803234 PMCID: PMC6872643 DOI: 10.3389/fgene.2019.01146] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/17/2019] [Accepted: 10/21/2019] [Indexed: 12/12/2022] Open
Abstract
Gut microbiomes are integral microflora located in the human intestine with particular symbiosis. Among all microorganisms in the human intestine, bacteria are the most significant subgroup that contains many unique and functional species. The distribution patterns of bacteria in the human intestine not only reflect the different microenvironments in different sections of the intestine but also indicate that bacteria may have unique biological functions corresponding to their proper regions of the intestine. However, describing the functional differences between the bacterial subgroups and their distributions in different individuals is difficult using traditional computational approaches. Here, we first attempted to introduce four effective sets of bacterial features from independent databases. We then presented a novel computational approach to identify potential distinctive features among bacterial subgroups based on a systematic dataset on the gut microbiome from approximately 1,500 human gut bacterial strains. We also established a group of quantitative rules for explaining such distinctions. The results may reveal the microstructural characteristics of the intestinal flora and deepen our understanding of the regulatory role of bacterial subgroups in the human intestine.
Affiliation(s)
- Lijuan Chen
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Daojie Li
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Ye Shao
- School of Medicine, Huaqiao University, Quanzhou, China
| | - Hui Wang
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Yuqing Liu
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, China
| | - Yunhua Zhang
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, China
|
41
|
Sun Z, Li Y, Wang Y, Fan X, Xu K, Wang K, Li S, Zhang Z, Jiang T, Liu X. Radiogenomic analysis of vascular endothelial growth factor in patients with diffuse gliomas. Cancer Imaging 2019; 19:68. [PMID: 31639060 PMCID: PMC6805458 DOI: 10.1186/s40644-019-0256-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/22/2019] [Accepted: 09/25/2019] [Indexed: 01/02/2023] Open
Abstract
OBJECTIVE To predict vascular endothelial growth factor (VEGF) expression in patients with diffuse gliomas using radiomic analysis. MATERIALS AND METHODS Preoperative magnetic resonance images were retrospectively obtained from 239 patients with diffuse gliomas (World Health Organization grades II-IV). The patients were randomly assigned to a training group (n = 160) or a validation group (n = 79) at a 2:1 ratio. For each patient, a total of 431 radiomic features were extracted. The minimum redundancy maximum relevance (mRMR) algorithm was used for feature selection. A machine-learning model for predicting VEGF status was then developed using the selected features and a support vector machine classifier. The predictive performance of the model was evaluated in both groups using receiver operating characteristic curve analysis, and correlations between selected features were assessed. RESULTS Nine radiomic features were selected to generate a VEGF-associated radiomic signature of diffuse gliomas based on the mRMR algorithm. This radiomic signature consisted of two first-order statistics or related wavelet features (Entropy and Minimum) and seven textural features or related wavelet features (including Cluster Tendency and Long Run Low Gray Level Emphasis). The predictive efficiencies measured by the area under the curve were 74.1% in the training group and 70.2% in the validation group. The overall correlations between the 9 radiomic features were low in both groups. CONCLUSIONS Radiomic analysis facilitated efficient prediction of VEGF status in diffuse gliomas, suggesting that using tumor-derived radiomic features for predicting genomic information is feasible.
Affiliation(s)
- Zhiyan Sun
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China
| | - Yiming Li
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China
| | - Yinyan Wang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
| | - Xing Fan
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China
| | - Kaibin Xu
- Chinese Academy of Sciences, Institute of Automation, Beijing, China
| | - Kai Wang
- Department of Nuclear Medicine, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
| | - Shaowu Li
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China
| | - Zhong Zhang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
| | - Tao Jiang
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China; Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China; Center of Brain Tumor, Beijing Institute for Brain Disorders, Beijing, China; China National Clinical Research Center for Neurological Diseases, Beijing, China; Chinese Glioma Genome Atlas Network (CGGA) and Asian Glioma Genome Atlas Network (AGGA), Beijing, China
| | - Xing Liu
- Beijing Neurosurgical Institute, Capital Medical University, 6 Tiantanxili, Beijing, 100050, China.
|
42
|
Chen J, Zhao J, Yang S, Chen Z, Zhang Z. Prediction of Protein Ubiquitination Sites in Arabidopsis thaliana. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190311141647] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 11/22/2022]
Abstract
Background: As one of the most important reversible protein post-translational modification types, ubiquitination plays a significant role in the regulation of many biological processes, such as cell division, signal transduction, apoptosis and immune response. Protein ubiquitination usually occurs when a ubiquitin molecule is attached to a lysine on a target protein, which is also known as “lysine ubiquitination”. Objective: In order to investigate the molecular mechanisms of ubiquitination-related biological processes, the crucial first step is the identification of ubiquitination sites. However, conventional experimental methods for detecting ubiquitination sites are often time-consuming, and a large number of ubiquitination sites remain unidentified. In this study, a ubiquitination site prediction method for Arabidopsis thaliana was developed using a Support Vector Machine (SVM). Methods: We collected 3009 experimentally validated ubiquitination sites on 1607 proteins in A. thaliana to construct the training set. Three feature encoding schemes were used to characterize the sequence patterns around ubiquitination sites, including AAC, Binary and CKSAAP. The Maximum Relevance and Minimum Redundancy (mRMR) feature selection method was employed to reduce the dimensionality of input features. Five-fold cross-validation and independent tests were used to evaluate the performance of the established models. Results: The combination of AAC and CKSAAP encoding schemes yielded the best performance, with an accuracy and AUC of 81.35% and 0.868 in the independent test. We also generated an online predictor termed AraUbiSite, which is freely accessible at: http://systbio.cau.edu.cn/araubisite. Conclusion: We developed a well-performing prediction tool for large-scale ubiquitination site identification in A. thaliana. It is hoped that the current work will speed up the identification of ubiquitination sites in A. thaliana and help to further elucidate the molecular mechanisms of ubiquitination in plants.
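The AAC and CKSAAP encodings named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the window length or the range of gaps, so the example segment and the `k_max` default are hypothetical, and residues outside the 20 standard amino acids are simply ignored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(segment):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    residues = [r for r in segment if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / total for a in AMINO_ACIDS]


def cksaap(segment, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        n_windows = len(segment) - k - 1  # number of residue pairs with gap k
        for i in range(max(n_windows, 0)):
            pair = segment[i] + segment[i + k + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
        denom = max(n_windows, 1)
        features.extend(counts[p] / denom for p in pairs)
    return features


# Hypothetical segment centred on a lysine (K); not taken from the paper.
segment = "MKTAYIAKQRQISFVKSHFSRQ"
print(len(aac(segment)))              # 20
print(len(cksaap(segment, k_max=2)))  # 1200
```

Each gap value contributes 400 pair frequencies, so the feature vector grows quickly with `k_max`, which is one reason a feature selection step such as mRMR is typically applied afterwards.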
Affiliation(s)
- Jiajing Chen
- National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
| | - Jianan Zhao
- National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
| | - Shiping Yang
- National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
| | - Zhen Chen
- National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
| | - Ziding Zhang
- National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
|
43
|
Kumar VS, Vellaichamy A. Sequence and structure-based characterization of ubiquitination sites in human and yeast proteins using Chou's sample formulation. Proteins 2019; 87:646-657. [DOI: 10.1002/prot.25689] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/26/2018] [Revised: 02/20/2019] [Accepted: 04/04/2019] [Indexed: 12/29/2022]
|
44
|
Fu H, Yang Y, Wang X, Wang H, Xu Y. DeepUbi: a deep learning framework for prediction of ubiquitination sites in proteins. BMC Bioinformatics 2019; 20:86. [PMID: 30777029 PMCID: PMC6379983 DOI: 10.1186/s12859-019-2677-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 11/08/2018] [Accepted: 02/12/2019] [Indexed: 01/22/2023] Open
Abstract
Background Protein ubiquitination occurs when the ubiquitin protein binds to a lysine (K) residue of a target protein, and it is an important regulator of many cellular functions, such as signal transduction, cell division, and immune reactions, in eukaryotes. Experimental and clinical studies have shown that ubiquitination plays a key role in several human diseases, and recent advances in proteomic technology have spurred interest in identifying ubiquitination sites. However, most current computing tools for predicting target sites are based on small-scale data and shallow machine learning algorithms. Results As more experimentally validated ubiquitination sites emerge, we need to design a predictor that can identify lysine ubiquitination sites in large-scale proteome data. In this work, we propose a deep learning predictor, DeepUbi, based on convolutional neural networks. Four different features are adopted from the sequences and physicochemical properties. In a 10-fold cross validation, DeepUbi obtains an AUC (area under the Receiver Operating Characteristic curve) of 0.9, and the accuracy, sensitivity and specificity exceed 85%. The more comprehensive indicator, MCC, reaches 0.78. We also develop a software package that can be freely downloaded from https://github.com/Sunmile/DeepUbi. Conclusion Our results show that DeepUbi has excellent performance in predicting ubiquitination based on large data.
Affiliation(s)
- Hongli Fu
- Department of Information and Computing Science, University of Science and Technology Beijing, Beijing, 100083, China
| | - Yingxi Yang
- Department of Information and Computing Science, University of Science and Technology Beijing, Beijing, 100083, China
| | - Xiaobo Wang
- Department of Information and Computing Science, University of Science and Technology Beijing, Beijing, 100083, China
| | - Hui Wang
- Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
| | - Yan Xu
- Department of Information and Computing Science, University of Science and Technology Beijing, Beijing, 100083, China; Beijing Key Laboratory for Magneto-photoelectrical Composite and Interface Science, University of Science and Technology Beijing, Beijing, 100083, China
|
45
|
Kabir M, Ahmad S, Iqbal M, Hayat M. iNR-2L: A two-level sequence-based predictor developed via Chou's 5-steps rule and general PseAAC for identifying nuclear receptors and their families. Genomics 2019; 112:276-285. [PMID: 30779939 DOI: 10.1016/j.ygeno.2019.02.006] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 12/02/2018] [Revised: 01/09/2019] [Accepted: 02/07/2019] [Indexed: 12/25/2022]
Abstract
Nuclear receptor proteins (NRPs) perform a vital role in regulating gene expression. With the rapid growth of NRPs in the post-genomic era, it is highly desirable to identify NRPs and their sub-families accurately from their primary sequences. Several conventional methods have been used to discriminate NRPs and their sub-families, but they did not achieve satisfactory results. Consequently, a new two-level computational model, "iNR-2L", is developed. Two discrete methods, namely Dipeptide Composition and Tripeptide Composition, were used to formulate NRP sequences. Further, both descriptor spaces were merged to construct a hybrid space. Furthermore, the minimum redundancy maximum relevance feature selection technique was employed to select salient features and to reduce noise and redundancy. The experimental outcomes showed that the proposed model iNR-2L achieved outstanding results. It is anticipated that the proposed computational model might be a practical and effective tool for academia and the research community.
Affiliation(s)
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
| | - Saeed Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
| | - Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
|
46
|
Wang S, Li J, Sun X, Zhang YH, Huang T, Cai Y. Computational Method for Identifying Malonylation Sites by Using Random Forest Algorithm. Comb Chem High Throughput Screen 2018; 23:304-312. [PMID: 30588879 DOI: 10.2174/1386207322666181227144318] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/28/2018] [Revised: 09/03/2018] [Accepted: 12/04/2018] [Indexed: 12/12/2022]
Abstract
BACKGROUND As a newly uncovered post-translational modification on the ε-amino group of lysine residue, protein malonylation was found to be involved in metabolic pathways and certain diseases. Apart from experimental approaches, several computational methods based on machine learning algorithms were recently proposed to predict malonylation sites. However, previous methods failed to address imbalanced data sizes between positive and negative samples. OBJECTIVE In this study, we identified the significant features of malonylation sites in a novel computational method which applied machine learning algorithms and balanced data sizes by applying synthetic minority over-sampling technique. METHOD Four types of features, namely, amino acid (AA) composition, position-specific scoring matrix (PSSM), AA factor, and disorder were used to encode residues in protein segments. Then, a two-step feature selection procedure including maximum relevance minimum redundancy and incremental feature selection, together with random forest algorithm, was performed on the constructed hybrid feature vector. RESULTS An optimal classifier was built from the optimal feature subset, which featured an F1-measure of 0.356. Feature analysis was performed on several selected important features. CONCLUSION Results showed that certain types of PSSM and disorder features may be closely associated with malonylation of lysine residues. Our study contributes to the development of computational approaches for predicting malonyllysine and provides insights into molecular mechanism of malonylation.
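The maximum relevance minimum redundancy (mRMR) step mentioned in this abstract, and in several other entries in this list, can be illustrated with a small greedy ranking. This is a simplified sketch, not the procedure used in any of the cited papers: it assumes discrete features, uses mutual information with the difference criterion (relevance minus mean redundancy), and the feature names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr(features, labels, n_select):
    """Greedy mRMR ranking: maximize relevance minus mean redundancy."""
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected, remaining = [], list(features)
    while remaining and len(selected) < n_select:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumption): feat_b nearly duplicates feat_a, while feat_c is
# weakly informative about the label but independent of feat_a.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "feat_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "feat_b": [0, 0, 0, 0, 1, 1, 1, 0],
    "feat_c": [0, 0, 1, 1, 0, 0, 1, 1],
}
print(mrmr(features, labels, n_select=2))  # ['feat_a', 'feat_c']
```

A ranking by relevance alone would pick `feat_a` and then `feat_b`; the redundancy penalty is what promotes the less relevant but complementary `feat_c`. In an incremental feature selection step, a classifier would then be evaluated on growing prefixes of this ranking.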
Affiliation(s)
- ShaoPeng Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - JiaRui Li
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Xijun Sun
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China
|
47
|
Chen L, Zhang YH, Pan X, Liu M, Wang S, Huang T, Cai YD. Tissue Expression Difference between mRNAs and lncRNAs. Int J Mol Sci 2018; 19:ijms19113416. [PMID: 30384456 PMCID: PMC6274976 DOI: 10.3390/ijms19113416] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 09/12/2018] [Revised: 10/26/2018] [Accepted: 10/28/2018] [Indexed: 12/15/2022] Open
Abstract
Messenger RNA (mRNA) and long noncoding RNA (lncRNA) are two main subgroups of RNAs participating in transcription regulation. With the development of next generation sequencing, an increasing number of lncRNAs have been identified. Many hidden functions of lncRNAs have also been revealed. However, the differences between lncRNAs and mRNAs are still unclear. For example, we need to determine whether lncRNAs have stronger tissue specificity than mRNAs and which tissues have more lncRNAs expressed. To investigate such tissue expression difference between mRNAs and lncRNAs, we encoded 9339 lncRNAs and 14,294 mRNAs with 71 expression features, including 69 maximum expression features for 69 types of cells, one feature for the maximum expression in all cells, and one expression specificity feature that was measured as Chao-Shen-corrected Shannon's entropy. With advanced feature selection methods, such as maximum relevance minimum redundancy, incremental feature selection methods, and random forest algorithm, 13 features were found to capture the dissimilarity of lncRNAs and mRNAs. The 11 cell subtype features indicated which cell types of the lncRNAs and mRNAs had the largest expression difference. Such cell subtypes may be the potential cell models for lncRNA identification and function investigation. The expression specificity feature suggested that the cell types expressing mRNAs and lncRNAs were different. The maximum expression feature suggested that the maximum expression levels of mRNAs and lncRNAs were different. In addition, a rule learning algorithm, repeated incremental pruning to produce error reduction (RIPPER), was also employed to produce effective classification rules for classifying lncRNAs and mRNAs, which gave competitive results compared with random forest and could give a clearer picture of different expression patterns between lncRNAs and mRNAs.
Results not only revealed the heterogeneous expression pattern of lncRNA and mRNA, but also gave rise to the development of a new tool to identify the potential biological functions of such RNA subgroups.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai 200241, China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus MC, 3000 CA Rotterdam, The Netherlands.
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Shaopeng Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
|
48
|
Ju Z, Wang SY. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou's general pseudo amino acid composition. Gene 2018; 664:78-83. [DOI: 10.1016/j.gene.2018.04.055] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Received: 09/30/2017] [Revised: 03/23/2018] [Accepted: 04/18/2018] [Indexed: 01/09/2023]
|
49
|
Islam MM, Saha S, Rahman MM, Shatabda S, Farid DM, Dehzangi A. iProtGly-SS: Identifying protein glycation sites using sequence and structure based features. Proteins 2018; 86:777-789. [PMID: 29675975 DOI: 10.1002/prot.25511] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 09/27/2017] [Revised: 02/27/2018] [Accepted: 04/14/2018] [Indexed: 12/20/2022]
Abstract
Glycation is a chemical reaction by which a sugar molecule bonds with a protein without the help of enzymes. It is often a cause of many diseases, and knowledge about glycation is therefore very important. In this paper, we present iProtGly-SS, a protein lysine glycation site identification method based on features extracted from sequence and secondary structural information. In the experiments, we found the best combination of feature groups: Amino Acid Composition, Secondary Structure Motifs, and Polarity. We used a support vector machine classifier to train our model and selected an optimal set of features using a group-based forward feature selection technique. On standard benchmark datasets, our method is able to significantly outperform existing methods for glycation prediction. A web server for iProtGly-SS is implemented and publicly available at: http://brl.uiu.ac.bd/iprotgly-ss/.
Affiliation(s)
- Md Mofijul Islam
- Department of CSE, University of Dhaka, Dhaka, Bangladesh; Department of CSE, United International University, Dhaka, Bangladesh
| | - Sanjay Saha
- Department of CSE, United International University, Dhaka, Bangladesh
| | | | - Swakkhar Shatabda
- Department of CSE, United International University, Dhaka, Bangladesh
| | - Dewan Md Farid
- Department of CSE, United International University, Dhaka, Bangladesh
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, 21251, USA
|
50
|
Shen S, Gui T, Ma C. Identification of molecular biomarkers for pancreatic cancer with mRMR shortest path method. Oncotarget 2018; 8:41432-41439. [PMID: 28611293 PMCID: PMC5522256 DOI: 10.18632/oncotarget.18186] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 02/10/2017] [Accepted: 04/20/2017] [Indexed: 12/20/2022] Open
Abstract
The high mortality rate of pancreatic cancer makes it one of the most studied diseases among all cancer types. Many studies have been conducted to understand the mechanisms underlying the emergence and pathogenesis of this disease. Here, using the minimum-redundancy-maximum-relevance (mRMR) method, we studied a set of transcriptome data of pancreatic cancer. By gradually adding features to achieve the most accurate Jackknife classification results, a gene set of 9 genes was identified: NHS, SCML2, LAMC2, S100P, COL17A1, AMIGO2, PTPRR, KPNA7 and KCNN4. Through STRING 2.0 protein-protein interaction (PPI) analysis, 40 proteins were identified in the shortest paths between genes in the gene set; 30 of them passed the permutation test, which indicated they were hubs in the background network. The genes in the protein-protein interaction network were enriched in 37 functional modules, such as negative regulation of transcription from RNA polymerase II promoter, negative regulation of ERK1 and ERK2 cascade, and the BMP signaling pathway. Our study indicated a new mechanism of pancreatic cancer, suggesting potential therapeutic targets for further study.
Affiliation(s)
- Shuhua Shen
- Zhejiang Provincial Hospital of Traditional Chinese Medicine, Hangzhou, China
| | - Tuantuan Gui
- Shanghai Smartquerier Biotechnology Co., Ltd, Shanghai, China
| | - Chengcheng Ma
- CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; Shanghai Center for Bioinformatics Technology, Shanghai, China
|