1
|
Zhong G, Liu H, Deng L. Ensemble Machine Learning and Predicted Properties Promote Antimicrobial Peptide Identification. Interdiscip Sci 2024:10.1007/s12539-024-00640-z. [PMID: 38972032 DOI: 10.1007/s12539-024-00640-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2024] [Revised: 06/04/2024] [Accepted: 06/07/2024] [Indexed: 07/08/2024]
Abstract
The emergence of antibiotic-resistant microbes raises a pressing demand for novel alternative treatments. One promising alternative is the antimicrobial peptides (AMPs), a class of innate immunity mediators within the therapeutic peptide realm. AMPs offer salient advantages such as high specificity, cost-effective synthesis, and reduced toxicity. Although some computational methodologies have been proposed to identify potential AMPs with the rapid development of artificial intelligence techniques, there is still ample room to improve their performance. This study proposes a predictive framework which ensembles deep learning and statistical learning methods to screen peptides with antimicrobial activity. We integrate multiple LightGBM classifiers and convolution neural networks which leverages various predicted sequential, structural and physicochemical properties from their residue sequences extracted by diverse machine learning paradigms. Comparative experiments exhibit that our method outperforms other state-of-the-art approaches on an independent test dataset, in terms of representative capability measures. Besides, we analyse the discrimination quality under different varieties of attribute information and it reveals that combination of multiple features could improve prediction. In addition, a case study is carried out to illustrate the exemplary favorable identification effect. We establish a web application at http://amp.denglab.org to provide convenient usage of our proposal and make the predictive framework, source code, and datasets publicly accessible at https://github.com/researchprotein/amp .
Collapse
Affiliation(s)
- Guolun Zhong
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Hui Liu
- College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China.
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
| |
Collapse
|
2
|
Zhong G, Deng L. ACPScanner: Prediction of Anticancer Peptides by Integrated Machine Learning Methodologies. J Chem Inf Model 2024; 64:1092-1104. [PMID: 38277774 DOI: 10.1021/acs.jcim.3c01860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2024]
Abstract
Novel therapeutic alternatives for cancer treatment are increasingly attracting global research attention. Although chemotherapy remains a primary clinical solution, it often results in significant side effects for patients. In recent years, anticancer peptides (ACPs) have emerged as promising candidates for highly specific anticancer drugs, and a number of computational approaches have been developed to identify ACPs. However, existing methods do not recognize specific types of anticancer function. In this article, we propose ACPScanner, an integrated approach to predict ACPs and non-ACPs at first and then predict several specific activity types for potential ACPs. We incorporate sequential, physicochemical properties, secondary structural information, and deep representation learning embeddings which are generated from artificial intelligence methods to build feature space. Customized deep learning and statistical learning methods are combined to form an integral architecture for the comprehensive two-level prediction task. To the best of our knowledge, ACPScanner is the first approach for specific ACP activity prediction. The comparative evaluation illustrates that ACPScanner achieves competitive prediction performance in both prediction phases in independent testings. We establish a web server at http://acpscanner.denglab.org to provide convenient usage of ACPScanner and make the predictive framework, source code, and data sets publicly available.
Collapse
Affiliation(s)
- Guolun Zhong
- School of Computer Science and Engineering, Central South University, Changsha 410000, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha 410000, China
| |
Collapse
|
3
|
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
There is a great deal of importance to SNARE proteins, and their absence from function can lead to a variety of diseases. The SNARE protein is known as a membrane fusion protein, and it is crucial for mediating vesicle fusion. The identification of SNARE proteins must therefore be conducted with an accurate method. Through extensive experiments, we have developed a model based on graph-regularized k-local hyperplane distance nearest neighbor model (GHKNN) binary classification. In this, the model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample protein sequence features. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the model based on GHKNN binary classification with other classifiers and measure them using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
Collapse
Affiliation(s)
- Xingyue Gu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| |
Collapse
|
4
|
Liu S, Cui C, Chen H, Liu T. Ensemble learning-based feature selection for phosphorylation site detection. Front Genet 2022; 13:984068. [PMID: 36338976 PMCID: PMC9634105 DOI: 10.3389/fgene.2022.984068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Accepted: 10/05/2022] [Indexed: 11/18/2022] Open
Abstract
SARS-COV-2 is prevalent all over the world, causing more than six million deaths and seriously affecting human health. At present, there is no specific drug against SARS-COV-2. Protein phosphorylation is an important way to understand the mechanism of SARS -COV-2 infection. It is often expensive and time-consuming to identify phosphorylation sites with specific modified residues through experiments. A method that uses machine learning to make predictions about them is proposed. As all the methods of extracting protein sequence features are knowledge-driven, these features may not be effective for detecting phosphorylation sites without a complete understanding of the mechanism of protein. Moreover, redundant features also have a great impact on the fitting degree of the model. To solve these problems, we propose a feature selection method based on ensemble learning, which firstly extracts protein sequence features based on knowledge, then quantifies the importance score of each feature based on data, and finally uses the subset of important features as the final features to predict phosphorylation sites.
Collapse
Affiliation(s)
- Songbo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Chengmin Cui
- Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
| | - Huipeng Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Tong Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
5
|
A Novel Ensemble-Based Technique for the Preemptive Diagnosis of Rheumatoid Arthritis Disease in the Eastern Province of Saudi Arabia Using Clinical Data. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:2339546. [PMID: 36158117 PMCID: PMC9492338 DOI: 10.1155/2022/2339546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Revised: 07/20/2022] [Accepted: 08/02/2022] [Indexed: 11/17/2022]
Abstract
Rheumatoid arthritis (RA) is a chronic inflammatory disease caused by numerous genetic and environmental factors leading to musculoskeletal system pain. RA may damage other tissues and organs, causing complications that severely reduce patients' quality of life. According to the World Health Organization (WHO), over 1.71 billion individuals worldwide had musculoskeletal problems in 2021. Rheumatologists face challenges in the early detection of RA since its symptoms are similar to other illnesses, and there is no definitive test to diagnose the disease. Accordingly, it is preferable to profit from the power of computational intelligence techniques that can identify hidden patterns to diagnose RA early. Although multiple studies were conducted to diagnose RA early, they showed unsatisfactory performance, with the highest accuracy of 87.5% using imaging data. Yet, imaging data requires diagnostic tools that are challenging to collect and examine and are more costly. Recent studies indicated that neither a blood test nor a physical finding could early confirm the diagnosis. Therefore, this study proposes a novel ensemble technique for the preemptive prediction of RA and investigates the possibility of diagnosing the disease using clinical data before the symptoms appear. Two datasets were obtained from King Fahad University Hospital (KFUH), Dammam, Saudi Arabia, including 446 patients, with 251 positive cases of RA and 195 negative cases of RA. Two experiments were conducted where the former was developed without upsampling the dataset, and the latter was carried out using an upsampled dataset. Multiple machine learning (ML) algorithms were utilized to assemble the novel voting ensemble, including support vector machine (SVM), logistic regression (LR), and adaptive boosting (Adaboost). The results indicated that clinical laboratory tests fed to the proposed voting ensemble technique could accurately diagnose RA preemptively with an accuracy, recall, and precision of 94.03%, 96.00%, and 93.51%, respectively, with 30 clinical features when utilizing the original data and sequential forward feature selection (SFFS) technique. It is concluded that deploying the proposed model in local hospitals can contribute to introducing a method that aids medical specialists in preemptively diagnosing RA and stopping or delaying the course using clinical laboratory tests.
Collapse
|
6
|
Zhao S, Meng J, Kang Q, Luan Y. Identifying LncRNA-Encoded Short Peptides Using Optimized Hybrid Features and Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2873-2881. [PMID: 34383651 DOI: 10.1109/tcbb.2021.3104288] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Long non-coding RNA (lncRNA) contains short open reading frames (sORFs), and sORFs-encoded short peptides (SEPs) have become the focus of scientific studies due to their crucial role in life activities. The identification of SEPs is vital to further understanding their regulatory function. Bioinformatics methods can quickly identify SEPs to provide credible candidate sequences for verifying SEPs by biological experimenrts. However, there is a lack of methods for identifying SEPs directly. In this study, a machine learning method to identify SEPs of plant lncRNA (ISPL) is proposed. Hybrid features including sequence features and physicochemical features are extracted manually or adaptively to construct different modal features. In order to keep the stability of feature selection, the non-linear correction applied in Max-Relevance-Max-Distance (nocRD) feature selection method is proposed, which integrates multiple feature ranking results and uses the iterative random forest for different modal features dimensionality reduction. Classification models with different modal features are constructed, and their outputs are combined for ensemble classification. The experimental results show that the accuracy of ISPL is 89.86% percent on the independent test set, which will have important implications for further studies of functional genomic.
Collapse
|
7
|
Zhou H, Wang H, Ding Y, Tang J. Multivariate Information Fusion for Identifying Antifungal Peptides with
Hilbert-Schmidt Independence Criterion. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210727161003] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Antifungal Peptides (AFP) have been found to be effective against many fungal
infections.
Objective:
However, it is difficult to identify AFP. Therefore, it is great practical significance to identify
AFP via machine learning methods (with sequence information).
Method:
In this study, a Multi-Kernel Support Vector Machine (MKSVM) with Hilbert-Schmidt Independence
Criterion (HSIC) is proposed. Proteins are encoded with five types of features (188-bit,
AAC, ASDC, CKSAAP, DPC), and then construct kernels using Gaussian kernel function. HSIC are
used to combine kernels and multi-kernel SVM model is built.
Results:
Our model performed well on three AFPs datasets and the performance is better than or comparable
to other state-of-art predictive models.
Conclusion:
Our method will be a useful tool for identifying antifungal peptides.
Collapse
Affiliation(s)
- Haohao Zhou
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin,
300354, China
| | - Hao Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin,
300354, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou,
215009, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of
China, Quzhou, 324000, China
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055,
China
| |
Collapse
|
8
|
Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med 2021; 19:449. [PMID: 34706730 PMCID: PMC8554859 DOI: 10.1186/s12967-021-03084-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2021] [Accepted: 09/16/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Cancer is one of the most serious diseases threatening human health. Cancer immunotherapy represents the most promising treatment strategy due to its high efficacy and selectivity and lower side effects compared with traditional treatment. The identification of tumor T cell antigens is one of the most important tasks for antitumor vaccines development and molecular function investigation. Although several machine learning predictors have been developed to identify tumor T cell antigen, more accurate tumor T cell antigen identification by existing methodology is still challenging. METHODS In this study, we used a non-redundant dataset of 592 tumor T cell antigens (positive samples) and 393 tumor T cell antigens (negative samples). Four types feature encoding methods have been studied to build an efficient predictor, including amino acid composition, global protein sequence descriptors and grouped amino acid and peptide composition. To improve the feature representation ability of the hybrid features, we further employed a two-step feature selection technique to search for the optimal feature subset. The final prediction model was constructed using random forest algorithm. RESULTS Finally, the top 263 informative features were selected to train the random forest classifier for detecting tumor T cell antigen peptides. iTTCA-RF provides satisfactory performance, with balanced accuracy, specificity and sensitivity values of 83.71%, 78.73% and 88.69% over tenfold cross-validation as well as 73.14%, 62.67% and 83.61% over independent tests, respectively. The online prediction server was freely accessible at http://lab.malab.cn/~acy/iTTCA . CONCLUSIONS We have proven that the proposed predictor iTTCA-RF is superior to the other latest models, and will hopefully become an effective and useful tool for identifying tumor T cell antigens presented in the context of major histocompatibility complex class I.
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Huannan Guo
- Department of Oncology, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
Collapse
|
9
|
Su R, Hu J, Zou Q, Manavalan B, Wei L. Empirical comparison and analysis of web-based cell-penetrating peptide prediction tools. Brief Bioinform 2021; 21:408-420. [PMID: 30649170 DOI: 10.1093/bib/bby124] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2018] [Revised: 11/30/2018] [Accepted: 11/30/2018] [Indexed: 12/16/2022] Open
Abstract
Cell-penetrating peptides (CPPs) facilitate the delivery of therapeutically relevant molecules, including DNA, proteins and oligonucleotides, into cells both in vitro and in vivo. This unique ability explores the possibility of CPPs as therapeutic delivery and its potential applications in clinical therapy. Over the last few decades, a number of machine learning (ML)-based prediction tools have been developed, and some of them are freely available as web portals. However, the predictions produced by various tools are difficult to quantify and compare. In particular, there is no systematic comparison of the web-based prediction tools in performance, especially in practical applications. In this work, we provide a comprehensive review on the biological importance of CPPs, CPP database and existing ML-based methods for CPP prediction. To evaluate current prediction tools, we conducted a comparative study and analyzed a total of 12 models from 6 publicly available CPP prediction tools on 2 benchmark validation sets of CPPs and non-CPPs. Our benchmarking results demonstrated that a model from the KELM-CPPpred, namely KELM-hybrid-AAC, showed a significant improvement in overall performance, when compared to the other 11 prediction models. Moreover, through a length-dependency analysis, we find that existing prediction tools tend to more accurately predict CPPs and non-CPPs with the length of 20-25 residues long than peptides in other length ranges.
Collapse
Affiliation(s)
- Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jie Hu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | | | - Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
Collapse
|
10
|
Abstract
Background:
Bioluminescence is a unique and significant phenomenon in nature.
Bioluminescence is important for the lifecycle of some organisms and is valuable in biomedical
research, including for gene expression analysis and bioluminescence imaging technology. In recent
years, researchers have identified a number of methods for predicting bioluminescent proteins
(BLPs), which have increased in accuracy, but could be further improved.
Method:
In this study, a new bioluminescent proteins prediction method, based on a voting
algorithm, is proposed. Four methods of feature extraction based on the amino acid sequence were
used. 314 dimensional features in total were extracted from amino acid composition,
physicochemical properties and k-spacer amino acid pair composition. In order to obtain the highest
MCC value to establish the optimal prediction model, a voting algorithm was then used to build the
model. To create the best performing model, the selection of base classifiers and vote counting rules
are discussed.
Results:
The proposed model achieved 93.4% accuracy, 93.4% sensitivity and
91.7% specificity in the test set, which was better than any other method. A previous prediction of
bioluminescent proteins in three lineages was also improved using the model building method,
resulting in greatly improved accuracy.
Collapse
Affiliation(s)
- Shulin Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba Science City, Japan
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
11
|
Nath A, Leier A. Improved cytokine-receptor interaction prediction by exploiting the negative sample space. BMC Bioinformatics 2020; 21:493. [PMID: 33129275 PMCID: PMC7603689 DOI: 10.1186/s12859-020-03835-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 10/23/2020] [Indexed: 01/19/2023] Open
Abstract
Background Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine–receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases—notably autoimmune, inflammatory and infectious diseases—and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. “Gold Standard” negative datasets are still lacking and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. Results We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and the performance of machine learning classifiers. By using the anomaly detection capabilities of deep autoencoders we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling for selecting non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (loocv) over ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), mcc (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold cv and training/testing splits confirm the competitive performance. Conclusions A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections—with RF seemingly benefiting most in our particular setting. Our findings on the sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India.
| | - André Leier
- Department of Genetics, Department of Cell Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA.
| |
Collapse
|
12
|
Dou L, Li X, Zhang L, Xiang H, Xu L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J Proteome Res 2020; 20:191-201. [PMID: 33090794 DOI: 10.1021/acs.jproteome.0c00314] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Lysine glutarylation is a newly reported post-translational modification (PTM) that plays significant roles in regulating metabolic and mitochondrial processes. Accurate identification of protein glutarylation is the primary task to better investigate molecular functions and various applications. Due to the common disadvantages of the time-consuming and expensive nature of traditional biological sequencing techniques as well as the explosive growth of protein data, building precise computational models to rapidly diagnose glutarylation is a popular and feasible solution. In this work, we proposed a novel AdaBoost-based predictor called iGlu_AdaBoost to distinguish glutarylation and non-glutarylation sequences. Here, the top 37 features were chosen from a total of 1768 combined features using Chi2 following incremental feature selection (IFS) to build the model, including 188D, the composition of k-spaced amino acid pairs (CKSAAP), and enhanced amino acid composition (EAAC). With the help of the hybrid-sampling method SMOTE-Tomek, the AdaBoost algorithm was performed with satisfactory recall, specificity, and AUC values of 87.48%, 72.49%, and 0.89 over 10-fold cross validation as well as 72.73%, 71.92%, and 0.63 over independent test, respectively. Further feature analysis inferred that positively charged amino acids RK play critical roles in glutarylation recognition. Our model presented the well generalization ability and consistency of the prediction results of positive and negative samples, which is comparable to four published tools. The proposed predictor is an efficient tool to find potential glutarylation sites and provides helpful suggestions for further research on glutarylation mechanisms and concerned disease treatments.
Collapse
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiaoling Li
- Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150000, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
| | - Huaikun Xiang
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| |
Collapse
|
13
|
Nguyen TTD, Le NQK, Ho QT, Phan DV, Ou YY. TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings. BMC Med Genomics 2020; 13:155. [PMID: 33087125 PMCID: PMC7579990 DOI: 10.1186/s12920-020-00779-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
Background Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes including immunity regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors have association with the regulation of a various biological processes such as proliferation and differentiation of cells, apoptosis, lipid metabolism, and coagulation. The implication of these cytokines can also be seen in various diseases such as insulin resistance, autoimmune diseases, and cancer. Considering the interdependence between this kind of cytokine and others, classifying tumor necrosis factors from other cytokines is a challenge for biological scientists. Methods In this research, we employed a word embedding technique to create hybrid features which was proved to efficiently identify tumor necrosis factors given cytokine sequences. We segmented each protein sequence into protein words and created corresponding word embedding for each word. Then, word embedding-based vector for each sequence was created and input into machine learning classification models. When extracting feature sets, we not only diversified segmentation sizes of protein sequence but also conducted different combinations among split grams to find the best features which generated the optimal prediction. Furthermore, our methodology follows a well-defined procedure to build a reliable classification tool. Results With our proposed hybrid features, prediction models obtain more promising performance compared to seven prominent sequenced-based feature kinds. Results from 10 independent runs on the surveyed dataset show that on an average, our optimal models obtain an area under the curve of 0.984 and 0.998 on 5-fold cross-validation and independent test, respectively. Conclusions These results show that biologists can use our model to identify tumor necrosis factors from other cytokines efficiently. Moreover, this study proves that natural language processing techniques can be applied reasonably to help biologists solve bioinformatics problems efficiently.
Collapse
Affiliation(s)
| | - Nguyen-Quoc-Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.,Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan
| | - Dinh-Van Phan
- University of Economics, The University of Danang, Danang, 550000, Vietnam
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
| |
Collapse
|
14
|
Gu X, Chen Z, Wang D. Prediction of G Protein-Coupled Receptors With CTDC Extraction and MRMD2.0 Dimension-Reduction Methods. Front Bioeng Biotechnol 2020; 8:635. [PMID: 32671038 PMCID: PMC7329982 DOI: 10.3389/fbioe.2020.00635] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 05/26/2020] [Indexed: 11/13/2022] Open
Abstract
The G Protein-Coupled Receptor (GPCR) family consists of more than 800 different members. In this article, we attempt to use the physicochemical properties of Composition, Transition, Distribution (CTD) to represent GPCRs. The dimensionality reduction method of MRMD2.0 filters the physicochemical properties of GPCR redundancy. Matplotlib plots the coordinates to distinguish GPCRs from other protein sequences. The chart data show a clear distinction effect, and there is a well-defined boundary between the two. The experimental results show that our method can predict GPCRs.
Collapse
Affiliation(s)
- Xingyue Gu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Zhihua Chen
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| |
Collapse
|
15
|
Hou R, Wang L, Wu YJ. Predicting ATP-Binding Cassette Transporters Using the Random Forest Method. Front Genet 2020; 11:156. [PMID: 32269586 PMCID: PMC7109328 DOI: 10.3389/fgene.2020.00156] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2019] [Accepted: 02/11/2020] [Indexed: 12/21/2022] Open
Abstract
ATP-binding cassette (ABC) proteins play important roles in a wide variety of species. These proteins are involved in absorbing nutrients, exporting toxic substances, and regulating potassium channels, and they contribute to drug resistance in cancer cells. Therefore, the identification of ABC transporters is an urgent task. The present study used 188D as the feature extraction method, which is based on sequence information and physicochemical properties. We also visualized the feature extracted by t-Distributed Stochastic Neighbor Embedding (t-SNE). The sample based on the features extracted by 188D may be separated. Further, random forest (RF) is an efficient classifier to identify proteins. Under the 10-fold cross-validation of the model proposed here for a training set, the average accuracy rate of 10 training sets was 89.54%. We obtained values of 0.87 for specificity, 0.92 for sensitivity, and 0.79 for MCC. In the testing set, the accuracy achieved was 89%. These results suggest that the model combining 188D with RF is an optimal tool to identify ABC transporters.
Collapse
Affiliation(s)
- Ruiyan Hou
- Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.,College of Life Science, University of Chinese Academy of Sciences, Beijing, China
| | - Lida Wang
- Department of Scientific Research, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| | - Yi-Jun Wu
- Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| |
Collapse
|
16
|
iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features. MOLECULAR THERAPY-NUCLEIC ACIDS 2019; 18:80-87. [PMID: 31536883 PMCID: PMC6796744 DOI: 10.1016/j.omtn.2019.08.008] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/07/2019] [Revised: 07/17/2019] [Accepted: 08/02/2019] [Indexed: 11/23/2022]
Abstract
Promoters are short regions at specific locations of DNA sequences, which are playing key roles in directing gene transcription. They can be grouped into six types (σ24,σ28,σ32,σ38,σ54,σ70). Recently, a predictor called "iPromoter-2L" was constructed to predict the promoters and their six types, which is the first approach to predict all the six types of promoters. However, its predictive quality still needs to be further improved for real-world application requirement. In this study, we proposed the smoothing cutting window algorithm to find the window fragments of the DNA sequences based on the conservation scores to capture the sequence patterns of promoters. For each window fragment, the discriminative features were extracted by using kmer and PseKNC. Combined with support vector machines (SVMs), different predictors were constructed and then clustered into several groups based on their distances. Finally, a new predictor called iPromoter-2L2.0 was constructed to identify the promoters and their six types, which was developed by ensemble learning based on the key predictors selected from the cluster groups. The results showed that iPromoter-2L2.0 outperformed other existing methods for both promoter prediction and identification of their six types, indicating that iPromoter-2L2.0 will be helpful for genomics analysis.
Collapse
|
17
|
Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1264-1273. [PMID: 28222000 DOI: 10.1109/tcbb.2017.2670558] [Citation(s) in RCA: 124] [Impact Index Per Article: 24.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/09/2023]
Abstract
Protein methylation, an important post-translational modification, plays crucial roles in many cellular processes. The accurate prediction of protein methylation sites is fundamentally important for revealing the molecular mechanisms undergoing methylation. In recent years, computational prediction based on machine learning algorithms has emerged as a powerful and robust approach for identifying methylation sites, and much progress has been made in predictive performance improvement. However, the predictive performance of existing methods is not satisfactory in terms of overall accuracy. Motivated by this, we propose a novel random-forest-based predictor called MePred-RF, integrating several discriminative sequence-based feature descriptors and improving feature representation capability using a powerful feature selection technique. Importantly, unlike other methods based on multiple, complex information inputs, our proposed MePred-RF is based on sequence information alone. Comparative studies on benchmark datasets via vigorous jackknife tests indicate that our proposed MePred-RF method remarkably outperforms other state-of-the-art predictors, leading by a 4.5 percent average in terms of overall accuracy. A user-friendly webserver that implements the proposed method has been established for researchers' convenience, and is now freely available for public use through http://server.malab.cn/MePred-RF. We anticipate our research tool to be useful for the large-scale prediction and analysis of protein methylation sites.
Collapse
|
18
|
Lin Y, Cai Y, Liu J, Lin C, Liu X. An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies. BMC Bioinformatics 2019; 20:291. [PMID: 31182007 PMCID: PMC6557738 DOI: 10.1186/s12859-019-2766-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Antimicrobial peptides (AMPs) are essential components of the innate immune system and can protect the host from various pathogenic bacteria. The marine environment is known to be one of the richest sources for AMPs. Effective usage of AMPs and their derivatives can greatly improve the immunity and breeding survival rate of aquatic products. It is highly desirable to develop computational tools for rapidly and accurately identifying AMPs and their functional types, for the purpose of helping design new and more effective antimicrobial agents. RESULTS In this study, we made an attempt to develop an advanced machine learning based computational approach, MAMPs-Pred, for identification of AMPs and its function types. Initially, SVM-prot 188-D features were extracted that were subsequently used as input to a two-layer multi-label classifier. In specific, the first layer is to identify whether it is an AMP by applying RF classifier, and the second layer addresses the multi-type problem by identifying the activites or function types of AMPs by applying PS-RF and LC-RF classifiers. To benchmark the methods,the MAMPs-Pred method is also compared with existing best-performing methods in literature and has shown an improved identification accuracy. CONCLUSIONS The results reported in this study indicate that the MAMP-Pred method achieves high performance for identifying AMPs and its functional types.The proposed approach is believed to supplement the tools and techniques that have been developed in the past for predicting AMPs and their function types.
Collapse
Affiliation(s)
- Yuan Lin
- Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005 China
- Sparebanken Vest, Jonsvollsgaten 2, 5011 Bergen, Bergen, 5058 Norway
| | - Yinyin Cai
- Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005 China
| | - Juan Liu
- Department of Instrumental and Electrical Engineering, School of Aerospace Engineering, Xiamen University, Xiamen, 361005 China
| | - Chen Lin
- Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005 China
| | - Xiangrong Liu
- Department of Computer Science, School of Information Science and Technology, Xiamen University, Xiamen, 361005 China
| |
Collapse
|
19
|
Ru X, Li L, Zou Q. Incorporating Distance-Based Top-n-gram and Random Forest To Identify Electron Transport Proteins. J Proteome Res 2019; 18:2931-2939. [DOI: 10.1021/acs.jproteome.9b00250] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Affiliation(s)
- Xiaoqing Ru
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Lihong Li
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
20
|
Han K, Wang M, Zhang L, Wang Y, Guo M, Zhao M, Zhao Q, Zhang Y, Zeng N, Wang C. Predicting Ion Channels Genes and Their Types With Machine Learning Techniques. Front Genet 2019; 10:399. [PMID: 31130983 PMCID: PMC6510169 DOI: 10.3389/fgene.2019.00399] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2019] [Accepted: 04/12/2019] [Indexed: 02/01/2023] Open
Abstract
Motivation: The number of ion channels is increasing rapidly. As many of them are associated with diseases, they are the targets of more than 700 drugs. The discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. Methods: We used the SVMProt and the k-skip-n-gram methods to extract the feature vectors of ion channels, and obtained 188- and 400-dimensional features, respectively. The 188- and 400-dimensional features were combined to obtain 588-dimensional features. We then employed the maximum-relevance-maximum-distance method to reduce the dimensions of the 588-dimensional features. Finally, the support vector machine and random forest methods were used to build the prediction models to evaluate the classification effect. Results: Different methods were employed to extract various feature vectors, and after effective dimensionality reduction, different classifiers were used to classify the ion channels. We extracted the ion channel data from the Universal Protein Resource (UniProt, http://www.uniprot.org/) and Ligand-Gated Ion Channel databases (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php), and then verified the performance of the classifiers after screening. The findings of this study could inform the research and development of drugs.
Collapse
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Miao Wang
- Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
| | - Lei Zhang
- Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
| | - Ying Wang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
| | - Mian Guo
- Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Ming Zhao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Qian Zhao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Yu Zhang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Nianyin Zeng
- Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
21
|
Gao YC, Zhou XH, Zhang W. An Ensemble Strategy to Predict Prognosis in Ovarian Cancer Based on Gene Modules. Front Genet 2019; 10:366. [PMID: 31068972 PMCID: PMC6491874 DOI: 10.3389/fgene.2019.00366] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2019] [Accepted: 04/05/2019] [Indexed: 12/15/2022] Open
Abstract
Due to the high heterogeneity and complexity of cancer, it is still a challenge to predict the prognosis of cancer patients. In this work, we used a clustering algorithm to divide patients into different subtypes in order to reduce the heterogeneity of the cancer patients in each subtype. Based on the hypothesis that the gene co-expression network may reveal relationships among genes, some communities in the network could influence the prognosis of cancer patients and all the prognosis-related communities could fully reveal the prognosis of cancer patients. To predict the prognosis for cancer patients in each subtype, we adopted an ensemble classifier based on the gene co-expression network of the corresponding subtype. Using the gene expression data of ovarian cancer patients in TCGA (The Cancer Genome Atlas), three subtypes were identified. Survival analysis showed that patients in different subtypes had different survival risks. Three ensemble classifiers were constructed for each subtype. Leave-one-out and independent validation showed that our method outperformed control and literature methods. Furthermore, the function annotation of the communities in each subtype showed that some communities were cancer-related. Finally, we found that the current drug targets can partially support our method.
Collapse
Affiliation(s)
| | - Xiong-Hui Zhou
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Wen Zhang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| |
Collapse
|
22
|
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. In order to more fully represent protein information, this paper is more comprehensive and effective by combining the features extracted by the feature information representation algorithm based on sequence information (CCPA) and the feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set with the strongest correlation between class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and the randomness features for producing each node variable, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Collapse
Affiliation(s)
- Xiaoqing Ru
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Lihong Li
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
23
|
Li Y, Niu M, Zou Q. ELM-MHC: An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J Proteome Res 2019; 18:1392-1401. [DOI: 10.1021/acs.jproteome.9b00012] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Affiliation(s)
- Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
24
|
Wei L, Hu J, Li F, Song J, Su R, Zou Q. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief Bioinform 2018; 21:106-119. [PMID: 30383239 DOI: 10.1093/bib/bby107] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2018] [Revised: 09/18/2018] [Accepted: 10/05/2018] [Indexed: 12/11/2022] Open
Abstract
Quorum-sensing peptides (QSPs) are the signal molecules that are closely associated with diverse cellular processes, such as cell-cell communication, and gene expression regulation in Gram-positive bacteria. It is therefore of great importance to identify QSPs for better understanding and in-depth revealing of their functional mechanisms in physiological processes. Machine learning algorithms have been developed for this purpose, showing the great potential for the reliable prediction of QSPs. In this study, several sequence-based feature descriptors for peptide representation and machine learning algorithms are comprehensively reviewed, evaluated and compared. To effectively use existing feature descriptors, we used a feature representation learning strategy that automatically learns the most discriminative features from existing feature descriptors in a supervised way. Our results demonstrate that this strategy is capable of effectively capturing the sequence determinants to represent the characteristics of QSPs, thereby contributing to the improved predictive performance. Furthermore, wrapping this feature representation learning strategy, we developed a powerful predictor named QSPred-FL for the detection of QSPs in large-scale proteomic data. Benchmarking results with 10-fold cross validation showed that QSPred-FL is able to achieve better performance as compared to the state-of-the-art predictors. In addition, we have established a user-friendly webserver that implements QSPred-FL, which is currently available at http://server.malab.cn/QSPred-FL. We expect that this tool will be useful for the high-throughput prediction of QSPs and the discovery of important functional mechanisms of QSPs.
Collapse
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Jie Hu
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Fuyi Li
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Jiangning Song
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Ran Su
- School of Computer Software, Tianjin University, Tianjin, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
25
|
Niu M, Li Y, Wang C, Han K. RFAmyloid: A Web Server for Predicting Amyloid Proteins. Int J Mol Sci 2018; 19:ijms19072071. [PMID: 30013015 PMCID: PMC6073578 DOI: 10.3390/ijms19072071] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2018] [Revised: 07/10/2018] [Accepted: 07/12/2018] [Indexed: 12/22/2022] Open
Abstract
Amyloid is an insoluble fibrous protein and its mis-aggregation can lead to some diseases, such as Alzheimer’s disease and Creutzfeldt–Jakob’s disease. Therefore, the identification of amyloid is essential for the discovery and understanding of disease. We established a novel predictor called RFAmy based on random forest to identify amyloid, and it employed SVMProt 188-D feature extraction method based on protein composition and physicochemical properties and pse-in-one feature extraction method based on amino acid composition, autocorrelation pseudo acid composition, profile-based features and predicted structures features. In the ten-fold cross-validation test, RFAmy’s overall accuracy was 89.19% and F-measure was 0.891. Results were obtained by comparison experiments with other feature, classifiers, and existing methods. This shows the effectiveness of RFAmy in predicting amyloid protein. The RFAmy proposed in this paper can be accessed through the URL http://server.malab.cn/RFAmyloid/.
Collapse
Affiliation(s)
- Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
| | - Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China.
| | - Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150040, China.
| |
Collapse
|
26
|
Kumar P, Joy J, Pandey A, Gupta D. PRmePRed: A protein arginine methylation prediction tool. PLoS One 2017; 12:e0183318. [PMID: 28813517 PMCID: PMC5557562 DOI: 10.1371/journal.pone.0183318] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2017] [Accepted: 08/02/2017] [Indexed: 12/16/2022] Open
Abstract
Protein methylation is an important Post-Translational Modification (PTMs) of proteins. Arginine methylation carries out and regulates several important biological functions, including gene regulation and signal transduction. Experimental identification of arginine methylation site is a daunting task as it is costly as well as time and labour intensive. Hence reliable prediction tools play an important task in rapid screening and identification of possible methylation sites in proteomes. Our preliminary assessment using the available prediction methods on collected data yielded unimpressive results. This motivated us to perform a comprehensive data analysis and appraisal of features relevant in the context of biological significance, that led to the development of a prediction tool PRmePRed with better performance. The PRmePRed perform reasonably well with an accuracy of 84.10%, 82.38% sensitivity, 83.77% specificity, and Matthew’s correlation coefficient of 66.20% in 10-fold cross-validation. PRmePRed is freely available at http://bioinfo.icgeb.res.in/PRmePRed/
Collapse
Affiliation(s)
- Pawan Kumar
- Translational Bioinformatics Group, ICGEB, New Delhi, India
| | - Joseph Joy
- Translational Bioinformatics Group, ICGEB, New Delhi, India
| | | | - Dinesh Gupta
- Translational Bioinformatics Group, ICGEB, New Delhi, India
- * E-mail:
| |
Collapse
|
27
|
Xu ZC, Wang P, Qiu WR, Xiao X. iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder. Sci Rep 2017; 7:8222. [PMID: 28811565 PMCID: PMC5557945 DOI: 10.1038/s41598-017-08523-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2017] [Accepted: 07/10/2017] [Indexed: 12/13/2022] Open
Abstract
Gene splicing is one of the most significant biological processes in eukaryotic gene expression, such as RNA splicing, which can cause a pre-mRNA to produce one or more mature messenger RNAs containing the coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant for both the bio-medical research and the discovery of new drugs. However, it is expensive and time consuming based only on experimental technique, so new computational methods are needed. To identify the splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on minimum error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. In this paper, five-fold cross-validation test results based on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can be also studied by this approach. To implement classification accurately and quickly, an easy-to-use web-server for identifying slicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
Collapse
Affiliation(s)
- Zhao-Chun Xu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
| | - Peng Wang
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China
| | - Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
- Department of Computer Science and Bond Life Science Center, University of Missouri, Columbia, MO, USA.
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
- Gordon Life Science Institute, Boston, Massachusetts, 02478, United States of America.
| |
Collapse
|
28
|
Wei L, Xing P, Su R, Shi G, Ma ZS, Zou Q. CPPred-RF: A Sequence-based Predictor for Identifying Cell-Penetrating Peptides and Their Uptake Efficiency. J Proteome Res 2017; 16:2044-2053. [PMID: 28436664 DOI: 10.1021/acs.jproteome.7b00019] [Citation(s) in RCA: 129] [Impact Index Per Article: 18.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Cell-penetrating peptides (CPPs), have been proven as important drug-delivery vehicles, demonstrating the potential as therapeutic candidates. The past decade has witnessed a rapid growth in CPP-based research. Recently, many computational efforts have been made to develop machine-learning-based methods for identifying CPPs. Although much progress has been made, existing methods still suffer low feature representation capability that limits further performance improvement. In this study, we propose a novel predictor called CPPred-RF, in which we integrate multiple sequence-based feature descriptors to sufficiently explore distinct information embedded in CPPs, employ a well-established feature selection technique to improve the feature representation, and, for the first time, construct a two-layer prediction framework based on the random forest algorithm. The jackknife results on benchmark data sets show that the proposed CPPred-RF is at least competitive with the state-of-the-art predictors. Moreover, we establish the first online Web server in terms of predicting CPPs and their uptake efficiency simultaneously. It is freely available at http://server.malab.cn/CPPred-RF .
Collapse
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University , Tianjin 300072, China
| | - PengWei Xing
- School of Computer Science and Technology, Tianjin University , Tianjin 300072, China
| | - Ran Su
- School of Software, Tianjin University , Tianjin 300354, China
| | - Gaotao Shi
- School of Computer Science and Technology, Tianjin University , Tianjin 300072, China
| | - Zhanshan Sam Ma
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences , Kunming 650223, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University , Tianjin 300072, China.,State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences , Kunming 650223, China
| |
Collapse
|
29
|
Wei L, Tang J, Zou Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.06.026] [Citation(s) in RCA: 196] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
30
|
Detecting N 6-methyladenosine sites from RNA transcriptomes using ensemble Support Vector Machines. Sci Rep 2017; 7:40242. [PMID: 28079126 PMCID: PMC5227715 DOI: 10.1038/srep40242] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2016] [Accepted: 12/05/2016] [Indexed: 12/22/2022] Open
Abstract
As one of the most abundant RNA post-transcriptional modifications, N6-methyladenosine (m6A) involves in a broad spectrum of biological and physiological processes ranging from mRNA splicing and stability to cell differentiation and reprogramming. However, experimental identification of m6A sites is expensive and laborious. Therefore, it is urgent to develop computational methods for reliable prediction of m6A sites from primary RNA sequences. In the current study, a new method called RAM-ESVM was developed for detecting m6A sites from Saccharomyces cerevisiae transcriptome, which employed ensemble support vector machine classifiers and novel sequence features. The jackknife test results show that RAM-ESVM outperforms single support vector machine classifiers and other existing methods, indicating that it would be a useful computational tool for detecting m6A sites in S. cerevisiae. Furthermore, a web server named RAM-ESVM was constructed and could be freely accessible at http://server.malab.cn/RAM-ESVM/.
Collapse
|
31
|
Zou Q, Wan S, Ju Y, Tang J, Zeng X. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC SYSTEMS BIOLOGY 2016; 10:114. [PMID: 28155714 PMCID: PMC5259984 DOI: 10.1186/s12918-016-0353-5] [Citation(s) in RCA: 135] [Impact Index Per Article: 16.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
Background It is necessary and essential to discovery protein function from the novel primary sequences. Wet lab experimental procedures are not only time-consuming, but also costly, so predicting protein structure and function reliably based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA binding protein, which plays a key role in the transcription regulation. Our study proposed an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method would guide for the special protein identification with computational intelligence strategies. Results Firstly, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Secondly, hierarchical features dimensionality reduction strategies were employed to improve the performance furthermore. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods. Conclusions The experiments demonstrate that our method could greatly improve the prediction accuracy and speed, thus allowing large-scale NGS data prediction to be practical. A web server is developed to facilitate the other researchers, which can be accessed at http://server.malab.cn/preTata/.
Collapse
Affiliation(s)
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Shixiang Wan
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, 518055, China
| | - Ying Ju
- School of Information Science and Engineering, Xiamen University, Xiamen, China
| | - Jijun Tang
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,School of Computational Science and Engineering, University of South Carolina, Columbia, USA
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Xiamen University, Xiamen, China.
| |
Collapse
|
32
|
Cai Y, Liao Z, Ju Y, Liu J, Mao Y, Liu X. Resistance gene identification from Larimichthys crocea with machine learning techniques. Sci Rep 2016; 6:38367. [PMID: 27922074 PMCID: PMC5138596 DOI: 10.1038/srep38367] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Accepted: 11/08/2016] [Indexed: 12/11/2022] Open
Abstract
The research on resistance genes (R-gene) plays a vital role in bioinformatics as it has the capability of coping with adverse changes in the external environment, which can form the corresponding resistance protein by transcription and translation. It is meaningful to identify and predict R-gene of Larimichthys crocea (L.Crocea). It is friendly for breeding and the marine environment as well. Large amounts of L.Crocea's immune mechanisms have been explored by biological methods. However, much about them is still unclear. In order to break the limited understanding of the L.Crocea's immune mechanisms and to detect new R-gene and R-gene-like genes, this paper came up with a more useful combination prediction method, which is to extract and classify the feature of available genomic data by machine learning. The effectiveness of feature extraction and classification methods to identify potential novel R-gene was evaluated, and different statistical analyzes were utilized to explore the reliability of prediction method, which can help us further understand the immune mechanisms of L.Crocea against pathogens. In this paper, a webserver called LCRG-Pred is available at http://server.malab.cn/rg_lc/.
Collapse
Affiliation(s)
- Yinyin Cai
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- State Key Laboratory of Large Yellow Croaker Breeding, Ningde Fufa Fisheries Company Limited, Ningde, 352000, China
| | - Zhijun Liao
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, 350122, China
| | - Ying Ju
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
| | - Juan Liu
- School of Aerospace Engineering, Xiamen University, Xiamen, Fujian 361005, China
| | - Yong Mao
- State Key Laboratory of Large Yellow Croaker Breeding, Ningde Fufa Fisheries Company Limited, Ningde, 352000, China
- College of Ocean and Earth Sciences, Xiamen University, Xiamen, 361102, China
| | - Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- State Key Laboratory of Large Yellow Croaker Breeding, Ningde Fufa Fisheries Company Limited, Ningde, 352000, China
| |
Collapse
|
33
|
Wei L, Liao M, Gao X, Wang J, Lin W. mGOF-loc: A novel ensemble learning method for human protein subcellular localization prediction. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.09.137] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
34
|
Liu B, Liu Y, Jin X, Wang X, Liu B. iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance. Sci Rep 2016; 6:33483. [PMID: 27641752 PMCID: PMC5027590 DOI: 10.1038/srep33483] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2016] [Accepted: 08/25/2016] [Indexed: 01/01/2023] Open
Abstract
Meiotic recombination presents an uneven distribution across the genome. Genomic regions that exhibit at relatively high frequencies of recombination are called hotspots, whereas those with relatively low frequencies of recombination are called coldspots. Therefore, hotspots and coldspots would provide useful information for the study of the mechanism of recombination. In this study, we proposed a computational predictor called iRSpot-DACC to predict hot/cold spots across the yeast genome. It combined Support Vector Machines (SVMs) and a feature called dinucleotide-based auto-cross covariance (DACC), which is able to incorporate the global sequence-order information and fifteen local DNA properties into the predictor. Combined with Principal Component Analysis (PCA), its performance was further improved. Experimental results on a benchmark dataset showed that iRSpot-DACC can achieve an accuracy of 82.7%, outperforming some highly related methods.
Collapse
Affiliation(s)
- Bingquan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
| | - Yumeng Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Xiaopeng Jin
- School of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China.,Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China.,Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| |
Collapse
|
35
|
Li YH, Xu JY, Tao L, Li XF, Li S, Zeng X, Chen SY, Zhang P, Qin C, Zhang C, Chen Z, Zhu F, Chen YZ. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLoS One 2016; 11:e0155290. [PMID: 27525735 PMCID: PMC4985167 DOI: 10.1371/journal.pone.0155290] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2016] [Accepted: 04/27/2016] [Indexed: 12/20/2022] Open
Abstract
Knowledge of protein function is important for biological, medical and therapeutic studies, but many proteins are still unknown in function. There is a need for more improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented those similarity-based and other methods in predicting diverse classes of proteins including the distantly-related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors protein representation, (3) improved predictive performances due to the use of more enriched training datasets and more variety of protein descriptors, (4) newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.
Collapse
Affiliation(s)
- Ying Hong Li
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
| | - Jing Yu Xu
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China
| | - Lin Tao
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Xiao Feng Li
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
| | - Shuang Li
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
| | - Xian Zeng
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Shang Ying Chen
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Peng Zhang
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Chu Qin
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Cheng Zhang
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| | - Zhe Chen
- Zhejiang Key Laboratory of Gastro-intestinal Pathophysiology, Zhejiang Hospital of Traditional Chinese Medicine, Zhejiang Chinese Medical University, Hangzhou, P. R. China
| | - Feng Zhu
- Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
| | - Yu Zong Chen
- Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
| |
Collapse
|
36
|
In Silico Prediction of Gamma-Aminobutyric Acid Type-A Receptors Using Novel Machine-Learning-Based SVM and GBDT Approaches. BIOMED RESEARCH INTERNATIONAL 2016; 2016:2375268. [PMID: 27579307 PMCID: PMC4992803 DOI: 10.1155/2016/2375268] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Revised: 06/08/2016] [Accepted: 06/19/2016] [Indexed: 11/17/2022]
Abstract
Gamma-aminobutyric acid type-A receptors (GABAARs) belong to multisubunit membrane spanning ligand-gated ion channels (LGICs) which act as the principal mediators of rapid inhibitory synaptic transmission in the human brain. Therefore, the category prediction of GABAARs just from the protein amino acid sequence would be very helpful for the recognition and research of novel receptors. Based on the proteins' physicochemical properties, amino acids composition and position, a GABAAR classifier was first constructed using a 188-dimensional (188D) algorithm at 90% cd-hit identity and compared with pseudo-amino acid composition (PseAAC) and ProtrWeb web-based algorithms for human GABAAR proteins. Then, four classifiers including gradient boosting decision tree (GBDT), random forest (RF), a library for support vector machine (libSVM), and k-nearest neighbor (k-NN) were compared on the dataset at cd-hit 40% low identity. This work obtained the highest correctly classified rate at 96.8% and the highest specificity at 99.29%. But the values of sensitivity, accuracy, and Matthew's correlation coefficient were a little lower than those of PseAAC and ProtrWeb; GBDT and libSVM can make a little better performance than RF and k-NN at the second dataset. In conclusion, a GABAAR classifier was successfully constructed using only the protein sequence information.
Collapse
|
37
|
Ensemble Feature Learning of Genomic Data Using Support Vector Machine. PLoS One 2016; 11:e0157330. [PMID: 27304923 PMCID: PMC4909287 DOI: 10.1371/journal.pone.0157330] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2015] [Accepted: 05/28/2016] [Indexed: 11/29/2022] Open
Abstract
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention but mostly on classification not gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of RFE algorithm. The rationale behind this is, building ensemble SVM models using randomly drawn bootstrap samples from the training set, will produce different feature rankings which will be subsequently aggregated as one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach will address the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over random forest based approach. The selected genes by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD) which reveals significant clusters with the selected data.
Collapse
|
38
|
A Novel Peptide Binding Prediction Approach for HLA-DR Molecule Based on Sequence and Structural Information. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3832176. [PMID: 27340658 PMCID: PMC4906198 DOI: 10.1155/2016/3832176] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/10/2016] [Accepted: 05/04/2016] [Indexed: 11/18/2022]
Abstract
MHC molecule plays a key role in immunology, and the molecule binding reaction with peptide is an important prerequisite for T cell immunity induced. MHC II molecules do not have conserved residues, so they appear as open grooves. As a consequence, this will increase the difficulty in predicting MHC II molecules binding peptides. In this paper, we aim to propose a novel prediction method for MHC II molecules binding peptides. First, we calculate sequence similarity and structural similarity between different MHC II molecules. Then, we reorder pseudosequences according to descending similarity values and use a weight calculation formula to calculate new pocket profiles. Finally, we use three scoring functions to predict binding cores and evaluate the accuracy of prediction to judge performance of each scoring function. In the experiment, we set a parameter α in the weight formula. By changing α value, we can observe different performances of each scoring function. We compare our method with the best function to some popular prediction methods and ultimately find that our method outperforms them in identifying binding cores of HLA-DR molecules.
Collapse
|
39
|
Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience 2016; 15:328-334. [PMID: 28113908 DOI: 10.1109/tnb.2016.2555951] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next generation of sequencing technique, the number of protein sequences is unprecedentedly increasing. Thus it is necessary to develop computational methods to identify the DNA-binding proteins only based on the protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into Support Vector Machine (SVM) to discriminate the DNA-binding proteins from the non DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and Matthew correlation coefficient of 0.5 by a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset shows that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA binding protein identification. .
Collapse
|
40
|
Che Y, Ju Y, Xuan P, Long R, Xing F. Identification of Multi-Functional Enzyme with Multi-Label Classifier. PLoS One 2016; 11:e0153503. [PMID: 27078147 PMCID: PMC4831692 DOI: 10.1371/journal.pone.0153503] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2016] [Accepted: 03/30/2016] [Indexed: 11/23/2022] Open
Abstract
Enzymes are important and effective biological catalyst proteins participating in almost all active cell processes. Identification of multi-functional enzymes is essential in understanding the function of enzymes. Machine learning methods perform better in protein structure and function prediction than traditional biological wet experiments. Thus, in this study, we explore an efficient and effective machine learning method to categorize enzymes according to their function. Multi-functional enzymes are predicted with a special machine learning strategy, namely, multi-label classifier. Sequence features are extracted from a position-specific scoring matrix with autocross-covariance transformation. Experiment results show that the proposed method obtains an accuracy rate of 94.1% in classifying six main functional classes through five cross-validation tests and outperforms state-of-the-art methods. In addition, 91.25% accuracy is achieved in multi-functional enzyme prediction, which is often ignored in other enzyme function prediction studies. The online prediction server and datasets can be accessed from the link http://server.malab.cn/MEC/.
Collapse
Affiliation(s)
- Yuxin Che
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
| | - Ying Ju
- School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
| | - Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
| | - Ren Long
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Fei Xing
- School of Aerospace Engineering, Xiamen University, Xiamen, Fujian 361005, China
| |
Collapse
|
41
|
The Virtual Screening of the Drug Protein with a Few Crystal Structures Based on the Adaboost-SVM. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2016; 2016:4809831. [PMID: 27127534 PMCID: PMC4834164 DOI: 10.1155/2016/4809831] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/27/2015] [Revised: 03/06/2016] [Accepted: 03/07/2016] [Indexed: 11/27/2022]
Abstract
Using the theory of machine learning to assist the virtual screening (VS) has been an effective plan. However, the quality of the training set may reduce because of mixing with the wrong docking poses and it will affect the screening efficiencies. To solve this problem, we present a method using the ensemble learning to improve the support vector machine to process the generated protein-ligand interaction fingerprint (IFP). By combining multiple classifiers, ensemble learning is able to avoid the limitations of the single classifier's performance and obtain better generalization. According to the research of virtual screening experiment with SRC and Cathepsin K as the target, the results show that the ensemble learning method can effectively reduce the error because the sample quality is not high and improve the effect of the whole virtual screening process.
Collapse
|
42
|
Chen L, Zhang YH, Huang T, Cai YD. Identifying novel protein phenotype annotations by hybridizing protein-protein interactions and protein sequence similarities. Mol Genet Genomics 2016; 291:913-34. [PMID: 26728152 DOI: 10.1007/s00438-015-1157-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2015] [Accepted: 12/08/2015] [Indexed: 01/18/2023]
Abstract
Studies of protein phenotypes represent a central challenge of modern genetics in the post-genome era because effective and accurate investigation of protein phenotypes is one of the most critical procedures to identify functional biological processes in microscale, which involves the analysis of multifactorial traits and has greatly contributed to the development of modern biology in the post genome era. Therefore, we have developed a novel computational method that identifies novel proteins associated with certain phenotypes in yeast based on the protein-protein interaction network. Unlike some existing network-based computational methods that identify the phenotype of a query protein based on its direct neighbors in the local network, the proposed method identifies novel candidate proteins for a certain phenotype by considering all annotated proteins with this phenotype on the global network using a shortest path (SP) algorithm. The identified proteins are further filtered using both a permutation test and their interactions and sequence similarities to annotated proteins. We compared our method with another widely used method called random walk with restart (RWR). The biological functions of proteins for each phenotype identified by our SP method and the RWR method were analyzed and compared. The results confirmed a large proportion of our novel protein phenotype annotation, and the RWR method showed a higher false positive rate than the SP method. Our method is equally effective for the prediction of proteins involving in all the eleven clustered yeast phenotypes with a quite low false positive rate. Considering the universality and generalizability of our supporting materials and computing strategies, our method can further be applied to study other organisms and the new functions we predicted can provide pertinent instructions for the further experimental verifications.
Collapse
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China. .,College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.
| |
Collapse
|
43
|
Pai PP, Mondal S. MOWGLI: prediction of protein-MannOse interacting residues With ensemble classifiers usinG evoLutionary Information. J Biomol Struct Dyn 2015; 34:2069-83. [PMID: 26457920 DOI: 10.1080/07391102.2015.1106978] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
Proteins interact with carbohydrates to perform various cellular interactions. Of the many carbohydrate ligands that proteins bind with, mannose constitute an important class, playing important roles in host defense mechanisms. Accurate identification of mannose-interacting residues (MIR) may provide important clues to decipher the underlying mechanisms of protein-mannose interactions during infections. This study proposes an approach using an ensemble of base classifiers for prediction of MIR using their evolutionary information in the form of position-specific scoring matrix. The base classifiers are random forests trained by different subsets of training data set Dset128 using 10-fold cross-validation. The optimized ensemble of base classifiers, MOWGLI, is then used to predict MIR on protein chains of the test data set Dtestset29 which showed a promising performance with 92.0% accurate prediction. An overall improvement of 26.6% in precision was observed upon comparison with the state-of-art. It is hoped that this approach, yielding enhanced predictions, could be eventually used for applications in drug design and vaccine development.
Collapse
Affiliation(s)
- Priyadarshini P Pai
- a Department of Biological Sciences , Birla Institute of Technology and Science-Pilani , K.K. Birla Goa Campus, Near NH17 Bypass Road, Zuarinagar , Goa 403726 , India
| | - Sukanta Mondal
- a Department of Biological Sciences , Birla Institute of Technology and Science-Pilani , K.K. Birla Goa Campus, Near NH17 Bypass Road, Zuarinagar , Goa 403726 , India
| |
Collapse
|
44
|
Survey of Natural Language Processing Techniques in Bioinformatics. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:674296. [PMID: 26525745 PMCID: PMC4615216 DOI: 10.1155/2015/674296] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2015] [Revised: 06/12/2015] [Accepted: 06/21/2015] [Indexed: 01/02/2023]
Abstract
Informatics methods, such as text mining and natural language processing, are always involved in bioinformatics research. In this study, we discuss text mining and natural language processing methods in bioinformatics from two perspectives. First, we aim to search for knowledge on biology, retrieve references using text mining methods, and reconstruct databases. For example, protein-protein interactions and gene-disease relationship can be mined from PubMed. Then, we analyze the applications of text mining and natural language processing techniques in bioinformatics, including predicting protein structure and function, detecting noncoding RNA. Finally, numerous methods and applications, as well as their contributions to bioinformatics, are discussed for future use by text mining and natural language processing researchers.
Collapse
|
45
|
A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:178572. [PMID: 26508990 PMCID: PMC4609795 DOI: 10.1155/2015/178572] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/10/2015] [Accepted: 08/25/2015] [Indexed: 11/29/2022]
Abstract
Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial distribution (NB) and high dimensionality of data obtained using sequencing can lead to low-power results and low reproducibility. Several statistical learning algorithms have been proposed to address sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the mean, the dispersion parameter of the NB, and the signal to noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machine. Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) from among the deregulated miRNAs of six datasets from The Cancer Genomics Atlas. Taking these findings altogether and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.
Collapse
|
46
|
Abstract
Automatic detection systems usually require large and representative training datasets in order to obtain good detection and false positive rates. Training datasets are such that the positive set has few samples and/or the negative set should represent anything except the object of interest. In this respect, the negative set typically contains orders of magnitude more images than the positive set. However, imbalanced training databases lead to biased classifiers. In this paper, we focus our attention on a negative sample selection method to properly balance the training data for cascade detectors. The method is based on the selection of the most informative false positive samples generated in one stage to feed the next stage. The results show that the proposed cascade detector with sample selection obtains on average better partial AUC and smaller standard deviation than the other compared cascade detectors.
Collapse
|
47
|
Zou Q, Li J, Song L, Zeng X, Wang G. Similarity computation strategies in the microRNA-disease network: a survey. Brief Funct Genomics 2015; 15:55-64. [PMID: 26134276 DOI: 10.1093/bfgp/elv024] [Citation(s) in RCA: 141] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Various microRNAs have been demonstrated to play roles in a number of human diseases. Several microRNA-disease network reconstruction methods have been used to describe the association from a systems biology perspective. The key problem for the network is the similarity computation model. In this article, we reviewed the main similarity computation methods and discussed these methods and future works. This survey may prompt and guide systems biology and bioinformatics researchers to build more perfect microRNA-disease associations and may make the network relationship clear for medical researchers.
Collapse
|
48
|
Toward a Literature-Driven Definition of Big Data in Healthcare. BIOMED RESEARCH INTERNATIONAL 2015; 2015:639021. [PMID: 26137488 PMCID: PMC4468280 DOI: 10.1155/2015/639021] [Citation(s) in RCA: 85] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/13/2014] [Accepted: 02/04/2015] [Indexed: 11/17/2022]
Abstract
Objective. The aim of this study was to provide a definition of big data in healthcare. Methods. A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals (n) and the number of variables (p) for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed. Results. A total of 196 papers were included. Big data can be defined as datasets with Log(n∗p) ≥ 7. Properties of big data are its great variety and high velocity. Big data raises challenges on veracity, on all aspects of the workflow, on extracting meaningful information, and on sharing information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues. Conclusion. Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Inversely, data can be reused without being necessarily big, for example, secondary use of Electronic Medical Records (EMR) data.
Collapse
|
49
|
A linear-RBF multikernel SVM to classify big text corpora. BIOMED RESEARCH INTERNATIONAL 2015; 2015:878291. [PMID: 25879039 PMCID: PMC4386713 DOI: 10.1155/2015/878291] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2014] [Revised: 11/10/2014] [Accepted: 11/13/2014] [Indexed: 11/23/2022]
Abstract
Support vector machine (SVM) is a powerful technique for classification. However, SVM is not suitable for classification of large datasets or text corpora, because the training complexity of SVMs is highly dependent on the input size. Recent developments in the literature on the SVM and other kernel methods emphasize the need to consider multiple kernels or parameterizations of kernels because they provide greater flexibility. This paper shows a multikernel SVM to manage highly dimensional data, providing an automatic parameterization with low computational cost and improving results against SVMs parameterized under a brute-force search. The model consists in spreading the dataset into cohesive term slices (clusters) to construct a defined structure (multikernel).
The new approach is tested on different text corpora. Experimental results show that the new classifier has good accuracy compared with the classic SVM, while the training is significantly faster than several other SVM classifiers.
Collapse
|
50
|
Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC SYSTEMS BIOLOGY 2015; 9 Suppl 1:S10. [PMID: 25708928 PMCID: PMC4331676 DOI: 10.1186/1752-0509-9-s1-s10] [Citation(s) in RCA: 64] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
BACKGROUND DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. There have been several computational methods proposed in the literature to deal with the DNA-binding protein identification. However, most of them can't provide an invaluable knowledge base for our understanding of DNA-protein interactions. RESULTS We firstly presented a new protein sequence encoding method called PSSM Distance Transformation, and then constructed a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations appropriately by distance transformation scheme. Lastly, the resulting uniform numeric representations are inputted into a SVM classifier for prediction. Thus whether a sequence can bind to DNA or not can be determined. In benchmark test on 525 DNA-binding and 550 non DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, MCC of 0.622 and AUC of 86.50%. This performance is considerably better than most of the existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset PDB186, SVM-PSSM-DT also achieved the best performance with ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods. CONCLUSIONS The experiment results demonstrate that PSSM Distance Transformation is an available protein sequence encoding method and SVM-PSSM-DT is a useful tool for identifying the DNA-binding proteins. A user-friendly web-server of SVM-PSSM-DT was constructed, which is freely accessible to the public at the web-site on http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
Collapse
Affiliation(s)
- Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Yulan He
- School of Engineering & Applied Science, Aston University, Birmingham, UK
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| |
Collapse
|