1
|
Ruan X, Xia S, Li S, Su Z, Yang J. Hybrid framework for membrane protein type prediction based on the PSSM. Sci Rep 2024; 14:17156. [PMID: 39060345 PMCID: PMC11282086 DOI: 10.1038/s41598-024-68163-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2024] [Accepted: 07/22/2024] [Indexed: 07/28/2024] Open
Abstract
Membrane proteins are considered the major source of drug targets and are indispensable for drug design and disease prevention. However, traditional biomechanical experiments are costly and time-consuming; thus, many computational methods for predicting membrane protein types are gaining popularity. The position-specific scoring matrix (PSSM) method is an excellent method for describing the evolutionary information of protein sequences. In this study, we propose an improved capsule neural network (ICNN) model based on a capsule neural network to acquire sufficient relevant information from the PSSM. Furthermore, accounting for the complementarity between traditional machine learning and deep learning, we propose a hybrid framework that combines both approaches to predict protein types. This framework trains 41 baseline models based on the PSSM. The optimal subset features, selected after traversal, are fused using a two-level decision-level feature fusion approach. Subsequently, comparisons are made using three combined strategies within an ensemble learning framework. The experimental results demonstrate that solely relying on PSSM input, the proposed method not only surpasses the optimal methods by 1.52 % , 2.26 % and 2.67 % on Dataset1, Dataset2, and Datasets3, respectively, but also exhibits superior generalizability. Furthermore, the code and dataset can be free download at https://github.com/ruanxiaoli/membrane-protein-types .
Collapse
Affiliation(s)
- Xiaoli Ruan
- State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China.
| | - Sina Xia
- State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
| | - Shaobo Li
- State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
| | - Zhidong Su
- Department of Electrical and Computer Engineering, University of Oklahoma State, Stillwater, 74078, USA
| | - Jing Yang
- State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
| |
Collapse
|
2
|
Cai Y, Gao Y, Lv Y, Chen Z, Zhong L, Chen J, Fan Y. Multicomponent comprehensive confirms that erythroferrone is a molecular biomarker of pan-cancer. Heliyon 2024; 10:e26990. [PMID: 38444475 PMCID: PMC10912481 DOI: 10.1016/j.heliyon.2024.e26990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Revised: 02/01/2024] [Accepted: 02/22/2024] [Indexed: 03/07/2024] Open
Abstract
All vertebrates organisms produce erythroferrone, a secretory hormone with structure-related functions during iron homeostasis. However, limited knowledge exists regarding the effect of this hormone on the occurrence and progression of cancer. To systematically and comprehensively identify the diverse implications of Erythroferrone (ERFE) in various malignant tumors, we conducted an in-depth analysis of multiple datasets, including the expression levels of oncogenes and target proteins, biological functions, and molecular characteristics. This analysis aimed to assess the diagnostic and prognostic value of ERFE in pan-cancer. Our findings revealed a significant elevation in ERFE expression across 20 distinct cancer types, with notable increases in gastrointestinal cancers. Utilizing the Cytoscape and STRING databases, we identified 35 ERFE-targeted binding proteins. Survival prognosis studies, particularly gastrointestinal cancers indicated by Colon adenocarcinoma (COAD), demonstrated a poor prognosis in patients with high ERFE expression (p < 0.001), consistently observed across various clinical subgroups. Furthermore, the ROC curve underscored the high predictive ability of EFRE for gastrointestinal cancer (AUC >0.9). Understanding the roles and interactions of ERFE in biological processes can also be aided by examining the genes co-expressed with ERFE in the coat and ranking the top 50 positive and negative genes. In the correlation analysis between the ERFE gene and different immune cells in COAD, we discovered that the expression of ERFE was positively correlated with Th1 cells, cytotoxic cells, and activated DC (aDC) abundance, and negatively correlated with Tcm (T central memory) abundance (P < 0.001). in summary, ERFE emerges as strongly associated with various malignant cancers, positioning it as a prospective biological target for cancer treatment. It stands out as a key molecular biomarker for diagnosing and prognosticating pancreatic cancer, also serves as an independent prognostic risk factor for COAD.
Collapse
Affiliation(s)
- Ying Cai
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| | - Yaling Gao
- Department of Xia He, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
| | - Yinyin Lv
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| | - Zhiyuan Chen
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| | - Lingfeng Zhong
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| | - Junjie Chen
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| | - Yanyun Fan
- Department of Gastroenterology, Zhongshan Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, Fujian, PR China
- Xiamen Key Laboratory of intestinal microbiome and human health, Zhongshan Hospital affiliated to Xiamen University, Xiamen, Fujian, PR China
| |
Collapse
|
3
|
Yu S, Liao B, Zhu W, Peng D, Wu F. Accurate prediction and key protein sequence feature identification of cyclins. Brief Funct Genomics 2023; 22:411-419. [PMID: 37118891 DOI: 10.1093/bfgp/elad014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2023] [Revised: 03/03/2023] [Accepted: 03/17/2023] [Indexed: 04/30/2023] Open
Abstract
Cyclin proteins are a group of proteins that activate the cell cycle by forming complexes with cyclin-dependent kinases. Identifying cyclins correctly can provide key clues to understanding the function of cyclins. However, due to the low similarity between cyclin protein sequences, the advancement of a machine learning-based approach to identify cycles is urgently needed. In this study, cyclin protein sequence features were extracted using the profile-based auto-cross covariance method. Then the features were ranked and selected with maximum relevance-maximum distance (MRMD) 1.0 and MRMD2.0. Finally, the prediction model was assessed through 10-fold cross-validation. The computational experiments showed that the best protein sequence features generated by MRMD1.0 could correctly predict 98.2% of cyclins using the random forest (RF) classifier, whereas seven-dimensional key protein sequence features identified with MRMD2.0 could correctly predict 96.1% of cyclins, which was superior to previous studies on the same dataset both in terms of dimensionality and performance comparisons. Therefore, our work provided a valuable tool for identifying cyclins. The model data can be downloaded from https://github.com/YUshunL/cyclin.
Collapse
Affiliation(s)
- Shaoyou Yu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
4
|
Shen J, Xia Y, Lu Y, Lu W, Qian M, Wu H, Fu Q, Chen J. Identification of membrane protein types via deep residual hypergraph neural network. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:20188-20212. [PMID: 38052642 DOI: 10.3934/mbe.2023894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/07/2023]
Abstract
A membrane protein's functions are significantly associated with its type, so it is crucial to identify the types of membrane proteins. Conventional computational methods for identifying the species of membrane proteins tend to ignore two issues: High-order correlation among membrane proteins and the scenarios of multi-modal representations of membrane proteins, which leads to information loss. To tackle those two issues, we proposed a deep residual hypergraph neural network (DRHGNN), which enhances the hypergraph neural network (HGNN) with initial residual and identity mapping in this paper. We carried out extensive experiments on four benchmark datasets of membrane proteins. In the meantime, we compared the DRHGNN with recently developed advanced methods. Experimental results showed the better performance of DRHGNN on the membrane protein classification task on four datasets. Experiments also showed that DRHGNN can handle the over-smoothing issue with the increase of the number of model layers compared with HGNN. The code is available at https://github.com/yunfighting/Identification-of-Membrane-Protein-Types-via-deep-residual-hypergraph-neural-network.
Collapse
Affiliation(s)
- Jiyun Shen
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| | - Yiyi Xia
- Tianping College of Suzhou University of Science and Technology, Suzhou, China
| | - Yiming Lu
- Tianping College of Suzhou University of Science and Technology, Suzhou, China
| | - Weizhong Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, China
| | - Meiling Qian
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| | - Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| | - Qiming Fu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| | - Jing Chen
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| |
Collapse
|
5
|
Liu N, Zhang Z, Wu Y, Wang Y, Liang Y. CRBSP:Prediction of CircRNA-RBP Binding Sites Based on Multimodal Intermediate Fusion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2898-2906. [PMID: 37130249 DOI: 10.1109/tcbb.2023.3272400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Circular RNA (CircRNA) is widely expressed and has physiological and pathological significance, regulating post-transcriptional processes via its protein-binding activity. However, whereas much work has been done on linear RNA and RNA binding protein (RBP), little is known about the binding sites of CircRNA. The current report is on the development of a medium-term multimodal data fusion strategy, CRBSP, to predict CircRNA-RBP binding sites. CRBSP represents the CircRNA trinucleotide semantic, location, composition and frequency information as the corresponding coding methods of Word to vector (Word2vec), Position-specific trinucleotide propensity (PSTNP), Pseudo trinucleotide composition (PseTNC) and Trinucleotide nucleotide composition (TNC), respectively. CNN (Convolution Neural Networks) was used to extract global information and BiLSTM (bidirectional Long- and Short-Term Memory network) encoder and LSTM (Long- and Short-Term Memory network) decoder for local sequence information. Enhancement of the contributions of key features by the self-attention mechanism was followed by mid-term fusion of the four enhanced features. Logistic Regression (LR) classifier showed that CRBSP gives a mean AUC value of 0.9362 through 5-fold Cross Validation of all 37 datasets, a performance which is superior to five current state-of-the-art models. Similar evaluation of linear RNA-RBP binding sites gave an AUC value of 0.7615 which is also higher than other prediction methods, demonstrating the robustness of CRBSP.
Collapse
|
6
|
Qian Y, Shang T, Guo F, Wang C, Cui Z, Ding Y, Wu H. Identification of DNA-binding protein based multiple kernel model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13149-13170. [PMID: 37501482 DOI: 10.3934/mbe.2023586] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
DNA-binding proteins (DBPs) play a critical role in the development of drugs for treating genetic diseases and in DNA biology research. It is essential for predicting DNA-binding proteins more accurately and efficiently. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel function is utilized to construct pre-defined kernel metrics. Then, these metrics are combined linearly by weights calculated by LapLKA. Finally, the fused kernel is input to RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on a small dataset and two large datasets. Importantly, we built an online platform to represent our model, which is now freely accessible via http://8.130.69.121:8082/.
Collapse
Affiliation(s)
- Yuqing Qian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tingting Shang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Chunliang Wang
- The Second Affiliated Hospital of Soochow University, Suzhou, China
| | - Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| |
Collapse
|
7
|
Wang Y, Zhang Y, Wang J, Xie F, Zheng D, Zou X, Guo M, Ding Y, Wan J, Han K. Prediction of drug-target interactions via neural tangent kernel extraction feature matrix factorization model. Comput Biol Med 2023; 159:106955. [PMID: 37094465 DOI: 10.1016/j.compbiomed.2023.106955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Revised: 04/04/2023] [Accepted: 04/16/2023] [Indexed: 04/26/2023]
Abstract
Drug discovery is a complex and lengthy process that often requires years of research and development. Therefore, drug research and development require a lot of investment and resource support, as well as professional knowledge, technology, skills, and other elements. Predicting of drug-target interactions (DTIs) is an important part of drug development. If machine learning is used to predict DTIs, the cost and time of drug development can be significantly reduced. Currently, machine learning methods are widely used to predict DTIs. In this study neighborhood regularized logistic matrix factorization method based on extracted features from a neural tangent kernel (NTK) to predict DTIs. First, the potential feature matrix of drugs and targets is extracted from the NTK model, then the corresponding Laplacian matrix is constructed according to the feature matrix. Next, the Laplacian matrix of the drugs and targets is used as the condition for matrix factorization to obtain two low-dimensional matrices. Finally, the matrix of the predicted DTIs was obtained by multiplying these two low-dimensional matrices. For the four gold standard datasets, the present method is significantly better than the other methods that is compared to, indicating that the automatic feature extraction method using the deep learning model is competitive compared with the manual feature selection method.
Collapse
Affiliation(s)
- Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Yu Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Xiang Zou
- Pharmaceutical Engineering Technology Research Center, Harbin University of Commerce, Harbin, 150076, China
| | - Mian Guo
- Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, 150086, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China.
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China.
| | - Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China; Pharmaceutical Engineering Technology Research Center, Harbin University of Commerce, Harbin, 150076, China.
| |
Collapse
|
8
|
Wang C, Zou Q, Ju Y, Shi H. Enhancer-FRL: Improved and Robust Identification of Enhancers and Their Activities Using Feature Representation Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:967-975. [PMID: 36063523 DOI: 10.1109/tcbb.2022.3204365] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Enhancers are crucial for precise regulation of gene expression, while enhancer identification and strength prediction are challenging because of their free distribution and tremendous number of similar fractions in the genome. Although several bioinformatics tools have been developed, shortfalls in these models remain, and their performances need further improvement. In the present study, a two-layer predictor called Enhancer-FRL was proposed for identifying enhancers (enhancers or nonenhancers) and their activities (strong and weak). More specifically, to build an efficient model, the feature representation learning scheme was applied to generate a 50D probabilistic vector based on 10 feature encodings and five machine learning algorithms. Subsequently, the multiview probabilistic features were integrated to construct the final prediction model. Compared with the single feature-based model, Enhancer-FRL showed significant performance improvement and model robustness. Performance assessment on the independent test dataset indicated that the proposed model outperformed state-of-the-art available toolkits. The webserver Enhancer-FRL is freely accessible at http://lab.malab.cn/∼wangchao/softwares/Enhancer-FRL/, The code and datasets can be downloaded at the webserver page or at the Github https://github.com/wangchao-malab/Enhancer-FRL/.
Collapse
|
9
|
Qian Y, Ding Y, Zou Q, Guo F. Multi-View Kernel Sparse Representation for Identification of Membrane Protein Types. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1234-1245. [PMID: 35857734 DOI: 10.1109/tcbb.2022.3191325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Membrane proteins are the main undertaker of biomembrane functions and play a vital role in many biological activities of organisms. Prediction of membrane protein types has a great help in determining the function of proteins and understanding the interactions of membrane proteins. However, the biochemical experiment is expensive and not suitable for the large-scale identification of membrane protein types. Therefore, computational methods were used to improve the efficiency of biological experiments. Most existing computational methods only use a single feature of protein, or use multiple features but do not integrate these well. In our study, the protein sequence is described via three different views (features), including amino acid composition, evolutionary information and physicochemical properties of amino acids. To exploit information among all views (features), we introduce a coupling strategy for Kernel Sparse Representation based Classification (KSRC) and construct a new model called Multi-view KSRC (MvKSRC). We implement our method on 4 benchmark data sets of membrane proteins. The comparison results indicate that our method is much superior to all existing methods.
Collapse
|
10
|
Sun J, Kulandaisamy A, Liu J, Hu K, Gromiha MM, Zhang Y. Machine learning in computational modelling of membrane protein sequences and structures: From methodologies to applications. Comput Struct Biotechnol J 2023; 21:1205-1226. [PMID: 36817959 PMCID: PMC9932300 DOI: 10.1016/j.csbj.2023.01.036] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2022] [Revised: 01/16/2023] [Accepted: 01/25/2023] [Indexed: 01/29/2023] Open
Abstract
Membrane proteins mediate a wide spectrum of biological processes, such as signal transduction and cell communication. Due to the arduous and costly nature inherent to the experimental process, membrane proteins have long been devoid of well-resolved atomic-level tertiary structures and, consequently, the understanding of their functional roles underlying a multitude of life activities has been hampered. Currently, computational tools dedicated to furthering the structure-function understanding are primarily focused on utilizing intelligent algorithms to address a variety of site-wise prediction problems (e.g., topology and interaction sites), but are scattered across different computing sources. Moreover, the recent advent of deep learning techniques has immensely expedited the development of computational tools for membrane protein-related prediction problems. Given the growing number of applications optimized particularly by manifold deep neural networks, we herein provide a review on the current status of computational strategies mainly in membrane protein type classification, topology identification, interaction site detection, and pathogenic effect prediction. Meanwhile, we provide an overview of how the entire prediction process proceeds, including database collection, data pre-processing, feature extraction, and method selection. This review is expected to be useful for developing more extendable computational tools specific to membrane proteins.
Collapse
Affiliation(s)
- Jianfeng Sun
- Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology, and Musculoskeletal Sciences, University of Oxford, Headington, Oxford OX3 7LD, UK
| | - Arulsamy Kulandaisamy
- Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
| | - Jacklyn Liu
- UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK
| | - Kai Hu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan 411105, China
| | - M. Michael Gromiha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India,Corresponding authors.
| | - Yuan Zhang
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan 411105, China,Corresponding authors.
| |
Collapse
|
11
|
Gao W, Xu D, Li H, Du J, Wang G, Li D. Identification of adaptor proteins by incorporating deep learning and PSSM profiles. Methods 2023; 209:10-17. [PMID: 36427763 DOI: 10.1016/j.ymeth.2022.11.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Revised: 10/25/2022] [Accepted: 11/02/2022] [Indexed: 11/23/2022] Open
Abstract
Adaptor proteins, also known as signal transduction adaptor proteins, are important proteins in signal transduction pathways, and play a role in connecting signal proteins for signal transduction between cells. Studies have shown that adaptor proteins are closely related to some diseases, such as tumors and diabetes. Therefore, it is very meaningful to construct a relevant model to accurately identify adaptor proteins. In recent years, many studies have used a position-specific scoring matrix (PSSM) and neural network methods to identify adaptor proteins. However, ordinary neural network models cannot correlate the contextual information in PSSM profiles well, so these studies usually process 20×N (N > 20) PSSM into 20×20 dimensions, which results in the loss of a large amount of protein information; This research proposes an efficient method that combines one-dimensional convolution (1-D CNN) and a bidirectional long short-term memory network (biLSTM) to identify adaptor proteins. The complete PSSM profiles are the input of the model, and the complete information of the protein is retained during the training process. We perform cross-validation during model training and test the performance of the model on an independent test set; in the data set with 1224 adaptor proteins and 11,078 non-adaptor proteins, five indicators including specificity, sensitivity, accuracy, area under the receiver operating characteristic curve (AUC) metric and Matthews correlation coefficient (MCC), were employed to evaluate model performance. On the independent test set, the specificity, sensitivity, accuracy and MCC were 0.817, 0.865, 0.823 and 0.465, respectively. Those results show that our method is better than the state-of-the art methods. This study is committed to improve the accuracy of adaptor protein identification, and laid a foundation for further research on diseases related to adaptor protein. This research provided a new idea for the application of deep learning related models in bioinformatics and computational biology.
Collapse
Affiliation(s)
- Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150000, China
| | - Dali Xu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150000, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150000, China
| | - Junping Du
- Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150000, China.
| | - Dan Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150000, China.
| |
Collapse
|
12
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L 2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. The computational method based on machine learning has provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set. And an efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
Collapse
|
13
|
Zhou H, Wang H, Tang J, Ding Y, Guo F. Identify ncRNA Subcellular Localization via Graph Regularized k-Local Hyperplane Distance Nearest Neighbor Model on Multi-Kernel Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3517-3529. [PMID: 34432632 DOI: 10.1109/tcbb.2021.3107621] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Non-coding RNAs (ncRNAs) are a type of RNAs which are not used to encode protein sequences. Emerging evidence shows that lots of ncRNAs may participate in many biological processes and must be widely involved in many types of cancers. Therefore, understanding their functionality is of great importance. Similar to proteins, various functions of ncRNAs relies on their subcellular localizations. Traditional high-throughput methods in wet-lab to identify subcellular localization is time-consuming and costly. In this paper, we propose a novel computational method based on multi-kernel learning to identify multi-label ncRNA subcellular localizations, via graph regularized k-local hyperplane distance nearest neighbor algorithm. First, we construct six types of sequence-based feature descriptors and select important feature vectors. Then, we build a multi-kernel learning model with Hilbert-Schmidt independence criterion (HSIC) to obtain optimal weights for vairous features. Furthermore, we propose the graph regularized k-local hyperplane distance nearest neighbor algorithm (GHKNN) as a binary classification model for detecting one kind of non-coding RNA subcellular localization. Finally, we apply One-vs-Rest strategy to decompose multi-label problem of non-coding RNA subcellular localizations. Our method achieves excellent performance on three ncRNA datasets and three human ncRNA datasets, and out-performs other outstanding machine learning methods. Comparing to existing method, our model also performs well especially on small datasets. We expect that this model will be useful for the prediction of subcellular localization and the study of important functional mechanisms of ncRNAs. Furthermore, we establish user-friendly web server (http://ncrna.lbci.net/) with the implementation of our method, which can be easily used by most experimental scientists.
Collapse
|
14
|
Yu S, Peng D, Zhu W, Liao B, Wang P, Yang D, Wu F. Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks. Front Pharmacol 2022; 13:1031759. [PMID: 36299898 PMCID: PMC9589247 DOI: 10.3389/fphar.2022.1031759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Accepted: 09/27/2022] [Indexed: 11/21/2022] Open
Abstract
DNA-binding proteins (DBP) play an essential role in the genetics and evolution of organisms. A particular DNA sequence could provide underlying therapeutic benefits for hereditary diseases and cancers. Studying these proteins can timely and effectively understand their mechanistic analysis and play a particular function in disease prevention and treatment. The limitation of identifying DNA-binding protein members from the sequence database is time-consuming, costly, and ineffective. Therefore, efficient methods for improving DBP classification are crucial to disease research. In this paper, we developed a novel predictor Hybrid _DBP, which identified potential DBP by using hybrid features and convolutional neural networks. The method combines two feature selection methods, MonoDiKGap and Kmer, and then used MRMD2.0 to remove redundant features. According to the results, 94% of DBP were correctly recognized, and the accuracy of the independent test set reached 91.2%. This means Hybrid_ DBP can become a useful prediction tool for predicting DBP.
Collapse
Affiliation(s)
- Shaoyou Yu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- *Correspondence: Wen Zhu,
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Peng Wang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dongxuan Yang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
15
|
Identification of DNA-binding proteins via Multi-view LSSVM with independence criterion. Methods 2022; 207:29-37. [PMID: 36087888 DOI: 10.1016/j.ymeth.2022.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 08/06/2022] [Accepted: 08/25/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins actively participate in life activities such as DNA replication, recombination, gene expression and regulation and play a prominent role in these processes. As DNA-binding proteins continue to be discovered and increase, it is imperative to design an efficient and accurate identification tool. Considering the time-consuming and expensive traditional experimental technology and the insufficient number of samples in the biological computing method based on structural information, we proposed a machine learning algorithm based on sequence information to identify DNA binding proteins, named multi-view Least Squares Support Vector Machine via Hilbert-Schmidt Independence Criterion (multi-view LSSVM via HSIC). This method took 6 feature sets as multi-view input and trains a single view through the LSSVM algorithm. Then, we integrated HSIC into LSSVM as a regular term to reduce the dependence between views and explored the complementary information of multiple views. Subsequently, we trained and coordinated the submodels and finally combined the submodels in the form of weights to obtain the final prediction model. On training set PDB1075, the prediction results of our model were better than those of most existing methods. Independent tests are conducted on the datasets PDB186 and PDB2272. The accuracy of the prediction results was 85.5% and 79.36%, respectively. This result exceeded the current state-of-the-art methods, which showed that the multi-view LSSVM via HSIC can be used as an efficient predictor.
Collapse
|
16
|
Feng C, Wu J, Wei H, Xu L, Zou Q. CRCF: A Method of Identifying Secretory Proteins of Malaria Parasites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2149-2157. [PMID: 34061749 DOI: 10.1109/tcbb.2021.3085589] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Malaria is a mosquito-borne disease that results in millions of cases and deaths annually. The development of a fast computational method that identifies secretory proteins of the malaria parasite is important for research on antimalarial drugs and vaccines. Thus, a method was developed to identify the secretory proteins of malaria parasites. In this method, a reduced alphabet was selected to recode the original protein sequence. A feature synthesis method was used to synthesise three different types of feature information. Finally, the random forest method was used as a classifier to identify the secretory proteins. In addition, a web server was developed to share the proposed algorithm. Experiments using the benchmark dataset demonstrated that the overall accuracy achieved by the proposed method was greater than 97.8 percent using the 10-fold cross-validation method. Furthermore, the reduced schemes and characteristic performance analyses are discussed.
Collapse
|
17
|
Ai C, Yang H, Ding Y, Tang J, Guo F. A multi-layer multi-kernel neural network for determining associations between non-coding RNAs and diseases. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
18
|
Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:5847242. [PMID: 35799660 PMCID: PMC9256349 DOI: 10.1155/2022/5847242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Accepted: 06/07/2022] [Indexed: 11/17/2022]
Abstract
The interaction between DNA and protein is vital for the development of a living body. Previous numerous studies on in silico identification of DNA-binding proteins (DBPs) usually include features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), leading to limited application due to its time-consuming generation. Few researchers have paid attention to the application of pretrained language models at the scale of evolution to the identification of DBPs. To this end, we present comprehensive insights into a comparison study on alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations in the field of DBP classification. The comparison is conducted by extracting information from PSSM and ESM representations using four unified averaging operations and by performing various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features in a fair comparison perspective. The pretrained feature presentation deserves wide application to the area of in silico DBP identification as well as other function annotation issues. Finally, it is also confirmed that an ensemble scheme by aggregating various trained FS models can significantly improve the classification performance of DBPs.
Collapse
|
19
|
Zhang P, Zhou S, Liu P, Li M. Distributed Ellipsoidal Intersection Fusion Estimation for Multi-Sensor Complex Systems. SENSORS 2022; 22:s22114306. [PMID: 35684925 PMCID: PMC9185458 DOI: 10.3390/s22114306] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Revised: 05/30/2022] [Accepted: 06/04/2022] [Indexed: 01/27/2023]
Abstract
This paper investigates the problem of distributed ellipsoidal intersection (DEI) fusion estimation for linear time-varying multi-sensor complex systems with unknown input disturbances and measurement data transmission delays. For the problem with external unknown input disturbance signals, a non-informative prior distribution is used to model the problem. A set of independent random variables obeying Bernoulli distribution is also used to describe the situation of measurement data transmission delay caused by network channel congestion, and appropriate buffer areas are added at the link nodes to retrieve the delayed transmission data values. For multi-sensor systems with complex situations, a minimum mean square error (MMSE) local estimator is designed in a Bayesian framework based on the maximum a posteriori (MAP) estimation criterion. In order to deal with the unknown correlations among the local estimators and to select the fusion estimator with lower computational complexity, the fusion estimator is designed using ellipsoidal intersection (EI) fusion technique, and the consistency of the estimator is demonstrated. In this paper, the difference between DEI fusion and distributed covariance intersection (DCI) fusion and centralized fusion estimation is analyzed by a numerical example, and the superiority of the DEI fusion method is demonstrated.
Collapse
Affiliation(s)
- Peng Zhang
- School of Instrumentation and Electronic, North University of China, Taiyuan 030051, China; (P.Z.); (S.Z.); (M.L.)
- Academy for Advanced Interdisciplinary Research, North University of China, Taiyuan 030051, China
| | - Shuyu Zhou
- School of Instrumentation and Electronic, North University of China, Taiyuan 030051, China; (P.Z.); (S.Z.); (M.L.)
- Academy for Advanced Interdisciplinary Research, North University of China, Taiyuan 030051, China
| | - Peng Liu
- Academy for Advanced Interdisciplinary Research, North University of China, Taiyuan 030051, China
- North Automatic Control Technology Institute, Taiyuan 030006, China
- Correspondence:
| | - Mengwei Li
- School of Instrumentation and Electronic, North University of China, Taiyuan 030051, China; (P.Z.); (S.Z.); (M.L.)
- Academy for Advanced Interdisciplinary Research, North University of China, Taiyuan 030051, China
| |
Collapse
|
20
|
|
21
|
Identifying and Classifying Enhancers by Dinucleotide-Based Auto-Cross Covariance and Attention-Based Bi-LSTM. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7518779. [PMID: 35422876 PMCID: PMC9005296 DOI: 10.1155/2022/7518779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/20/2021] [Accepted: 03/12/2022] [Indexed: 11/17/2022]
Abstract
Enhancers are a class of noncoding DNA elements located near structural genes. In recent years, their identification and classification have been the focus of research in the field of bioinformatics. However, due to their high free scattering and position variability, although the performance of the prediction model has been continuously improved, there is still a lot of room for progress. In this paper, density-based spatial clustering of applications with noise (DBSCAN) was used to screen the physicochemical properties of dinucleotides to extract dinucleotide-based auto-cross covariance (DACC) features; then, the features are reduced by feature selection Python toolkit MRMD 2.0. The reduced features are input into the random forest to identify enhancers. The enhancer classification model was built by word2vec and attention-based Bi-LSTM. Finally, the accuracies of our enhancer identification and classification models were 77.25% and 73.50%, respectively, and the Matthews’ correlation coefficients (MCCs) were 0.5470 and 0.4881, respectively, which were better than the performance of most predictors.
Collapse
|
22
|
Liu J, Tang X, Cui S, Guan X. Predicting the function of rice proteins through Multi-instance Multi-label Learning based on multiple features fusion. Brief Bioinform 2022; 23:6553933. [PMID: 35325033 DOI: 10.1093/bib/bbac095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2021] [Revised: 02/17/2022] [Accepted: 02/23/2022] [Indexed: 11/13/2022] Open
Abstract
There are a large number of unannotated proteins with unknown functions in rice, which are difficult to be verified by biological experiments. Therefore, computational method is one of the mainstream methods for rice proteins function prediction. Two representative rice proteins, indica protein and japonica protein, are selected as the experimental dataset. In this paper, two feature extraction methods (the residue couple model method and the pseudo amino acid composition method) and the Principal Component Analysis method are combined to design protein descriptive features. Moreover, based on the state-of-the-art MIML algorithm EnMIMLNN, a novel MIML learning framework MK-EnMIMLNN is proposed. And the MK-EnMIMLNN algorithm is designed by learning multiple kernel fusion function neural network. The experimental results show that the hybrid feature extraction method is better than the single feature extraction method. More importantly, the MK-EnMIMLNN algorithm is superior to most classic MIML learning algorithms, which proves the effectiveness of the MK-EnMIMLNN algorithm in rice proteins function prediction.
Collapse
Affiliation(s)
- Jing Liu
- Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
| | - Xinghua Tang
- Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
| | - Shuanglong Cui
- Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
| | - Xiao Guan
- School of Health Science and Engineering, University of Shanghai for Science and Technology, 200093 Shanghai, China
| |
Collapse
|
23
|
Jiao S, Chen Z, Zhang L, Zhou X, Shi L. ATGPred-FL: sequence-based prediction of autophagy proteins with feature representation learning. Amino Acids 2022; 54:799-809. [PMID: 35286461 DOI: 10.1007/s00726-022-03145-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022]
Abstract
Autophagy plays an important role in biological evolution and is regulated by many autophagy proteins. Accurate identification of autophagy proteins is crucially important to reveal their biological functions. Due to the expense and labor cost of experimental methods, it is urgent to develop automated, accurate and reliable sequence-based computational tools to enable the identification of novel autophagy proteins among numerous proteins and peptides. For this purpose, a new predictor named ATGPred-FL was proposed for the efficient identification of autophagy proteins. We investigated various sequence-based feature descriptors and adopted the feature learning method to generate corresponding, more informative probability features. Then, a two-step feature selection strategy based on accuracy was utilized to remove irrelevant and redundant features, leading to the most discriminative 14-dimensional feature set. The final predictor was built using a support vector machine classifier, which performed favorably on both the training and testing sets with accuracy values of 94.40% and 90.50%, respectively. ATGPred-FL is the first ATG machine learning predictor based on protein primary sequences. We envision that ATGPred-FL will be an effective and useful tool for autophagy protein identification, and it is available for free at http://lab.malab.cn/~acy/ATGPred-FL , the source code and datasets are accessible at https://github.com/jiaoshihu/ATGPred .
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, 7098 Liuxian Street, Shenzhen, 518055, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu, 61005, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, 518172, China
| | - Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, 150001, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, No 415, Fengyang Road, Huangpu District, Shanghai, 210000, China.
| |
Collapse
|
24
|
Zou H, Yang F, Yin Z. Identification of tumor homing peptides by utilizing hybrid feature representation. J Biomol Struct Dyn 2022; 41:3405-3412. [PMID: 35262448 DOI: 10.1080/07391102.2022.2049368] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Cancer is one of the serious diseases, recent studies reported that tumor homing peptides (THPs) play a key role in treatment of cancer. Due to the experimental methods are time-consuming and expensive, it is urgent to develop automatic computational approaches to identify THPs. Hence, in this study, we proposed a novel machine learning methods to distinguish THPs from non-THPs, in which the peptide sequences firstly encoded by pseudo residue pairwise energy content matrix (PseRECM) and pseudo physicochemical property (PsePC). Moreover, the least absolute shrinkage and selection operator (LAASO) was employed to select optimal features from the extracted features. All of these selected features were fed into support vector machine (SVM) for identifying THPs. We achieved 89.02%, 88.49%, and 94.58% classification accuracy on the Main, Small, and Main90 dataset, respectively. Experimental results showed that our proposed method outperforms the existing predictors on the same benchmark datasets. It indicates that the proposed method may be a useful tool in identifying THPs. The datasets and codes used in current study are available at https://figshare.com/articles/online_resource/iTHPs/16778770.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Fan Yang
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Zhijian Yin
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| |
Collapse
|
25
|
Chen Z, Jiao S, Zhao D, Zou Q, Xu L, Zhang L, Su X. The Characterization of Structure and Prediction for Aquaporin in Tumour Progression by Machine Learning. Front Cell Dev Biol 2022; 10:845622. [PMID: 35178393 PMCID: PMC8844512 DOI: 10.3389/fcell.2022.845622] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 01/17/2022] [Indexed: 11/21/2022] Open
Abstract
Recurrence and new cases of cancer constitute a challenging human health problem. Aquaporins (AQPs) can be expressed in many types of tumours, including the brain, breast, pancreas, colon, skin, ovaries, and lungs, and the histological grade of cancer is positively correlated with AQP expression. Therefore, the identification of aquaporins is an area to explore. Computational tools play an important role in aquaporin identification. In this research, we propose reliable, accurate and automated sequence predictor iAQPs-RF to identify AQPs. In this study, the feature extraction method was 188D (global protein sequence descriptor, GPSD). Six common classifiers, including random forest (RF), NaiveBayes (NB), support vector machine (SVM), XGBoost, logistic regression (LR) and decision tree (DT), were used for AQP classification. The classification results show that the random forest (RF) algorithm is the most suitable machine learning algorithm, and the accuracy was 97.689%. Analysis of Variance (ANOVA) was used to analyse these characteristics. Feature rank based on the ANOVA method and IFS strategy was applied to search for the optimal features. The classification results suggest that the 26th feature (neutral/hydrophobic) and 21st feature (hydrophobic) are the two most powerful and informative features that distinguish AQPs from non-AQPs. Previous studies reported that plasma membrane proteins have hydrophobic characteristics. Aquaporin subcellular localization prediction showed that all aquaporins were plasma membrane proteins with highly conserved transmembrane structures. In addition, the 3D structure of aquaporins was consistent with the localization results. Therefore, these studies confirmed that aquaporins possess hydrophobic properties. Although aquaporins are highly conserved transmembrane structures, the phylogenetic tree shows the diversity of aquaporins during evolution. The PCA showed that positive and negative samples were well separated by 54D features, indicating that the 54D feature can effectively classify aquaporins. The online prediction server is accessible at http://lab.malab.cn/∼acy/iAQP.
Collapse
Affiliation(s)
- Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Da Zhao
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.,Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lijun Zhang
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
| | - Xi Su
- Foshan Maternal and Child Health Hospital, Foshan, China
| |
Collapse
|
26
|
Sniatynski MJ, Shepherd JA, Ernst T, Wilkens LR, Hsu DF, Kristal BS. Ranks underlie outcome of combining classifiers: Quantitative roles for diversity and accuracy. PATTERNS (NEW YORK, N.Y.) 2022; 3:100415. [PMID: 35199065 PMCID: PMC8848007 DOI: 10.1016/j.patter.2021.100415] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/10/2021] [Revised: 09/20/2021] [Accepted: 11/24/2021] [Indexed: 11/22/2022]
Abstract
Combining classifier systems potentially improves predictive accuracy, but outcomes have proven impossible to predict. Classification most commonly improves when the classifiers are "sufficiently good" (generalized as " accuracy ") and "sufficiently different" (generalized as " diversity "), but the individual and joint quantitative influence of these factors on the final outcome remains unknown. We resolve these issues. Beginning with simulated data, we develop the DIRAC framework (DIversity of Ranks and ACcuracy), which accurately predicts outcome of both score-based fusions originating from exponentially modified Gaussian distributions and rank-based fusions, which are inherently distribution independent. DIRAC was validated using biological dual-energy X-ray absorption and magnetic resonance imaging data. The DIRAC framework is domain independent and has expected utility in far-ranging areas such as clinical biomarker development/personalized medicine, clinical trial enrollment, insurance pricing, portfolio management, and sensor optimization.
Collapse
Affiliation(s)
- Matthew J. Sniatynski
- Division of Sleep and Circadian Disorders, Department of Medicine, Brigham and Women's Hospital, 221 Longwood Avenue, LM322B, Boston, MA 02115, USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115, USA
| | - John A. Shepherd
- School of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
| | - Thomas Ernst
- John A. Burns School of Medicine, University of Hawaii at Mānoa, Honolulu, HI 96813, USA
| | - Lynne R. Wilkens
- University of Hawaii Cancer Center, University of Hawaii at Mānoa, Honolulu, HI 96813, USA
| | - D. Frank Hsu
- Department of Computer and Information Science, Fordham University, LL813, 113 West 60th Street, New York, NY 10023, USA
| | - Bruce S. Kristal
- Division of Sleep and Circadian Disorders, Department of Medicine, Brigham and Women's Hospital, 221 Longwood Avenue, LM322B, Boston, MA 02115, USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115, USA
| |
Collapse
|
27
|
Ma D, Chen Z, He Z, Huang X. A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem. Front Genet 2022; 12:818841. [PMID: 35154261 PMCID: PMC8832978 DOI: 10.3389/fgene.2021.818841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2021] [Accepted: 12/14/2021] [Indexed: 11/13/2022] Open
Abstract
Machine learning has been widely used to solve complex problems in engineering applications and scientific fields, and many machine learning-based methods have achieved good results in different fields. SNAREs are key elements of membrane fusion and required for the fusion process of stable intermediates. They are also associated with the formation of some psychiatric disorders. This study processes the original sequence data with the synthetic minority oversampling technique (SMOTE) to solve the problem of data imbalance and produces the most suitable machine learning model with the iLearnPlus platform for the identification of SNARE proteins. Ultimately, a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the cross-validation dataset, and a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the independent dataset (the adaptive skip dipeptide composition descriptor was used for feature extraction, and LightGBM with proper parameters was used as the classifier). These results demonstrate that this combination can perform well in the classification of SNARE proteins and is superior to other methods.
Collapse
|
28
|
Zhao Z, Yang W, Zhai Y, Liang Y, Zhao Y. Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm. Front Genet 2022; 12:821996. [PMID: 35154264 PMCID: PMC8837382 DOI: 10.3389/fgene.2021.821996] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2021] [Accepted: 12/07/2021] [Indexed: 12/13/2022] Open
Abstract
The exploration of DNA-binding proteins (DBPs) is an important aspect of studying biological life activities. Research on life activities requires the support of scientific research results on DBPs. The decline in many life activities is closely related to DBPs. Generally, the detection method for identifying DBPs is achieved through biochemical experiments. This method is inefficient and requires considerable manpower, material resources and time. At present, several computational approaches have been developed to detect DBPs, among which machine learning (ML) algorithm-based computational techniques have shown excellent performance. In our experiments, our method uses fewer features and simpler recognition methods than other methods and simultaneously obtains satisfactory results. First, we use six feature extraction methods to extract sequence features from the same group of DBPs. Then, this feature information is spliced together, and the data are standardized. Finally, the extreme gradient boosting (XGBoost) model is used to construct an effective predictive model. Compared with other excellent methods, our proposed method has achieved better results. The accuracy achieved by our method is 78.26% for PDB2272 and 85.48% for PDB186. The accuracy of the experimental results achieved by our strategy is similar to that of previous detection methods.
Collapse
Affiliation(s)
- Ziye Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, Shenzhen, China
| | - Yixiao Zhai
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yingjian Liang
- Department of Obstetrics and Gynecology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- *Correspondence: Yingjian Liang, ; Yuming Zhao,
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Yingjian Liang, ; Yuming Zhao,
| |
Collapse
|
29
|
Wan H, Zhang J, Ding Y, Wang H, Tian G. Immunoglobulin Classification Based on FC* and GC* Features. Front Genet 2022; 12:827161. [PMID: 35140745 PMCID: PMC8819591 DOI: 10.3389/fgene.2021.827161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2021] [Accepted: 12/22/2021] [Indexed: 11/13/2022] Open
Abstract
Immunoglobulins have a pivotal role in disease regulation. Therefore, it is vital to accurately identify immunoglobulins to develop new drugs and research related diseases. Compared with utilizing high-dimension features to identify immunoglobulins, this research aimed to examine a method to classify immunoglobulins and non-immunoglobulins using two features, FC* and GC*. Classification of 228 samples (109 immunoglobulin samples and 119 non-immunoglobulin samples) revealed that the overall accuracy was 80.7% in 10-fold cross-validation using the J48 classifier implemented in Weka software. The FC* feature identified in this study was found in the immunoglobulin subtype domain, which demonstrated that this extracted feature could represent functional and structural properties of immunoglobulins for forecasting.
Collapse
Affiliation(s)
- Hao Wan
- Institute of Advanced Cross-field Science, College of Life Science, Qingdao University, Qingdao, China
| | - Jina Zhang
- Geneis (Beijing) Co., Ltd., Beijing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hetian Wang
- Beidahuang Industry Group General Hospital, Harbin, China
- *Correspondence: Hetian Wang, ; Geng Tian,
| | - Geng Tian
- Geneis (Beijing) Co., Ltd., Beijing, China
- *Correspondence: Hetian Wang, ; Geng Tian,
| |
Collapse
|
30
|
Gong Y, Dong B, Zhang Z, Zhai Y, Gao B, Zhang T, Zhang J. VTP-Identifier: Vesicular Transport Proteins Identification Based on PSSM Profiles and XGBoost. Front Genet 2022; 12:808856. [PMID: 35047020 PMCID: PMC8762342 DOI: 10.3389/fgene.2021.808856] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2021] [Accepted: 11/29/2021] [Indexed: 11/13/2022] Open
Abstract
Vesicular transport proteins are related to many human diseases, and they threaten human health when they undergo pathological changes. Protein function prediction has been one of the most in-depth topics in bioinformatics. In this work, we developed a useful tool to identify vesicular transport proteins. Our strategy is to extract transition probability composition, autocovariance transformation and other information from the position-specific scoring matrix as feature vectors. EditedNearesNeighbours (ENN) is used to address the imbalance of the data set, and the Max-Relevance-Max-Distance (MRMD) algorithm is adopted to reduce the dimension of the feature vector. We used 5-fold cross-validation and independent test sets to evaluate our model. On the test set, VTP-Identifier presented a higher performance compared with GRU. The accuracy, Matthew's correlation coefficient (MCC) and area under the ROC curve (AUC) were 83.6%, 0.531 and 0.873, respectively.
Collapse
Affiliation(s)
- Yue Gong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Benzhi Dong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yixiao Zhai
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Jingyu Zhang
- Department of Neurology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
31
|
Zhang Z, Gong Y, Gao B, Li H, Gao W, Zhao Y, Dong B. SNAREs-SAP: SNARE Proteins Identification With PSSM Profiles. Front Genet 2022; 12:809001. [PMID: 34987554 PMCID: PMC8721734 DOI: 10.3389/fgene.2021.809001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2021] [Accepted: 11/15/2021] [Indexed: 12/20/2022] Open
Abstract
Soluble N-ethylmaleimide sensitive factor activating protein receptor (SNARE) proteins are a large family of transmembrane proteins located in organelles and vesicles. The important roles of SNARE proteins include initiating the vesicle fusion process and activating and fusing proteins as they undergo exocytosis activity, and SNARE proteins are also vital for the transport regulation of membrane proteins and non-regulatory vesicles. Therefore, there is great significance in establishing a method to efficiently identify SNARE proteins. However, the identification accuracy of the existing methods such as SNARE CNN is not satisfied. In our study, we developed a method based on a support vector machine (SVM) that can effectively recognize SNARE proteins. We used the position-specific scoring matrix (PSSM) method to extract features of SNARE protein sequences, used the support vector machine recursive elimination correlation bias reduction (SVM-RFE-CBR) algorithm to rank the importance of features, and then screened out the optimal subset of feature data based on the sorted results. We input the feature data into the model when building the model, used 10-fold crossing validation for training, and tested model performance by using an independent dataset. In independent tests, the ability of our method to identify SNARE proteins achieved a sensitivity of 68%, specificity of 94%, accuracy of 92%, area under the curve (AUC) of 84%, and Matthew’s correlation coefficient (MCC) of 0.48. The results of the experiment show that the common evaluation indicators of our method are excellent, indicating that our method performs better than other existing classification methods in identifying SNARE proteins.
Collapse
Affiliation(s)
- Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yue Gong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Benzhi Dong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| |
Collapse
|
32
|
Li H, Shi L, Gao W, Zhang Z, Zhang L, Wang G. dPromoter-XGBoost: Detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost. Methods 2022; 204:215-222. [PMID: 34998983 DOI: 10.1016/j.ymeth.2022.01.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2021] [Revised: 12/13/2021] [Accepted: 01/02/2022] [Indexed: 12/12/2022] Open
Abstract
Promoters play an irreplaceable role in biological processes and genetics, which are responsible for stimulating the transcription and expression of specific genes. Promoter abnormalities have been found in some diseases, and the level of promoter-binding transcription factors can be used as a marker before a disease occurs. Hence, detecting promoters from DNA sequences has important biological significance, particular, distinguishing strong promoters can help to elucidate differences in gene expression and the mechanisms of specific diseases. With the introduction of third-generation sequencing, it is difficult to match the speed of sequencing to the speed of labeling promoters experimentally. Many computing models have been designed to fill this gap and identify unlabeled DNA. However, their feature representation methods are very singular, which cannot reflect the information contained in the original samples. With the aim of avoiding information loss, we propose a computational model based on multiple descriptors and feature selection to jointly express samples. It is worth mentioning that a new feature descriptor called K-mer word vector is defined. The promoter model of multiple feature descriptors dominated by K-mer word vector achieves similar performance to existing methods, the sensitivity of 85.72% can distinguish the promoter more effectively than other methods. Furthermore, the performance of the promoter strength has surpassed published methods, and accuracy of 77.00% greatly improves the ability to distinguish between strong and weak promoters.
Collapse
Affiliation(s)
- Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China; Yangtze Delta Region Institute, University of Electronic Science and Technology, Quzhou,China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
| |
Collapse
|
33
|
Niu M, Zou Q, Lin C. CRBPDL: Identification of circRNA-RBP interaction sites using an ensemble neural network approach. PLoS Comput Biol 2022; 18:e1009798. [PMID: 35051187 PMCID: PMC8806072 DOI: 10.1371/journal.pcbi.1009798] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Revised: 02/01/2022] [Accepted: 01/02/2022] [Indexed: 02/06/2023] Open
Abstract
Circular RNAs (circRNAs) are non-coding RNAs with a special circular structure produced formed by the reverse splicing mechanism. Increasing evidence shows that circular RNAs can directly bind to RNA-binding proteins (RBP) and play an important role in a variety of biological activities. The interactions between circRNAs and RBPs are key to comprehending the mechanism of posttranscriptional regulation. Accurately identifying binding sites is very useful for analyzing interactions. In past research, some predictors on the basis of machine learning (ML) have been presented, but prediction accuracy still needs to be ameliorated. Therefore, we present a novel calculation model, CRBPDL, which uses an Adaboost integrated deep hierarchical network to identify the binding sites of circular RNA-RBP. CRBPDL combines five different feature encoding schemes to encode the original RNA sequence, uses deep multiscale residual networks (MSRN) and bidirectional gating recurrent units (BiGRUs) to effectively learn high-level feature representations, it is sufficient to extract local and global context information at the same time. Additionally, a self-attention mechanism is employed to train the robustness of the CRBPDL. Ultimately, the Adaboost algorithm is applied to integrate deep learning (DL) model to improve prediction performance and reliability of the model. To verify the usefulness of CRBPDL, we compared the efficiency with state-of-the-art methods on 37 circular RNA data sets and 31 linear RNA data sets. Moreover, results display that CRBPDL is capable of performing universal, reliable, and robust. The code and data sets are obtainable at https://github.com/nmt315320/CRBPDL.git. More and more evidences show that circular RNA can directly bind to proteins and participate in countless different biological processes. The calculation method can quickly and accurately predict the binding site of circular RNA and RBP. In order to identify the interaction of circRNA with 37 different types of circRNA binding proteins, we developed an integrated deep learning network based on hierarchical network, called CRBPDL. It can effectively learn high-level feature representations. The performance of the model was verified through comparative experiments of different feature extraction algorithms, different deep learning models and classifier models. Moreover, the CRBPDL model was applied to 31 linear RNAs, and the effectiveness of our method was proved by comparison with the results of current excellent algorithms. It is expected that the CRBPDL model can effectively predict the binding site of circular RNA-RBP and provide reliable candidates for further biological experiments.
Collapse
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Chen Lin
- School of Informatics, Xiamen University, Xiamen, China
- * E-mail:
| |
Collapse
|
34
|
Qian Y, Meng H, Lu W, Liao Z, Ding Y, Wu H. Identification of DNA-Binding Proteins via Hypergraph Based Laplacian
Support Vector Machine. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210806091922] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
The identification of DNA binding proteins (DBP) is an important research
field. Experiment-based methods are time-consuming and labor-intensive for detecting DBP.
Objective:
To solve the problem of large-scale DBP identification, some machine learning methods are
proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-
based machine learning model to predict DBP.
Methods:
In our study, we extracted six types of features (including NMBAC, GE, MCD, PSSM-AB,
PSSM-DWT, and PsePSSM) from protein sequences. We used Multiple Kernel Learning based on Hilbert-
Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we constructed
a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, Laplacian
Support Vector Machines (LapSVM) is employed to train the predictive model. Our method is
tested on PDB186, PDB1075, PDB2272 and PDB14189 data sets.
Result:
Compared with other methods, our model achieved best results on benchmark data sets.
Conclusion:
The accuracy of 87.1% and 74.2% are achieved on PDB186 (Independent test of
PDB1075) and PDB2272 (Independent test of PDB14189), respectively.
Collapse
Affiliation(s)
- Yuqing Qian
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
| | - Hao Meng
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
| | - Weizhong Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
| | - Zhijun Liao
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University,
Fuzhou, P.R. China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China,
Quzhou, P.R. China
| | - Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, P.R. China
| |
Collapse
|
35
|
Zhou H, Wang H, Ding Y, Tang J. Multivariate Information Fusion for Identifying Antifungal Peptides with
Hilbert-Schmidt Independence Criterion. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210727161003] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Antifungal Peptides (AFP) have been found to be effective against many fungal
infections.
Objective:
However, it is difficult to identify AFP. Therefore, it is great practical significance to identify
AFP via machine learning methods (with sequence information).
Method:
In this study, a Multi-Kernel Support Vector Machine (MKSVM) with Hilbert-Schmidt Independence
Criterion (HSIC) is proposed. Proteins are encoded with five types of features (188-bit,
AAC, ASDC, CKSAAP, DPC), and then construct kernels using Gaussian kernel function. HSIC are
used to combine kernels and multi-kernel SVM model is built.
Results:
Our model performed well on three AFPs datasets and the performance is better than or comparable
to other state-of-art predictive models.
Conclusion:
Our method will be a useful tool for identifying antifungal peptides.
Collapse
Affiliation(s)
- Haohao Zhou
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin,
300354, China
| | - Hao Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin,
300354, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou,
215009, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of
China, Quzhou, 324000, China
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055,
China
| |
Collapse
|
36
|
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. FRONTIERS IN PLANT SCIENCE 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification, which plays some significant roles in the growth and development of plants. It is crucial to identify 6mA sites for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites of plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA sites identification. Thus, the dinucleotide one-hot strategy which can describe position characteristics of nucleotides well was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were divided into a training dataset and a test dataset randomly. Finally, i6mA-vote was constructed by combining five different base-classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genome of the Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, the performances of i6mA-vote were 0.955 on accuracy (ACC), 0.909 on Matthew correlation coefficients (MCC), 0.955 on sensitivity (SN), and 0.954 on specificity (SP). Those indicators, in the order of ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice while they were 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to the indicators, our method was effectiveness and better than other concerned methods. The results also illustrated that i6mA-vote does not only well in 6mA sites prediction of intraspecies but also interspecies plants. Moreover, it can be seen that the specificity is distinctly lower than the sensitivity in Rice while it is just the opposite in Arabidopsis. It may be resulted from sequence similarity among Rosaceae, Rice and Arabidopsis.
Collapse
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
| | - Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- *Correspondence: Qianzi Lu,
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang,
| |
Collapse
|
37
|
Chen Y, Juan L, Lv X, Shi L. Bioinformatics Research on Drug Sensitivity Prediction. Front Pharmacol 2021; 12:799712. [PMID: 34955863 PMCID: PMC8696280 DOI: 10.3389/fphar.2021.799712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2021] [Accepted: 11/18/2021] [Indexed: 11/28/2022] Open
Abstract
Modeling-based anti-cancer drug sensitivity prediction has been extensively studied in recent years. While most drug sensitivity prediction models only use gene expression data, the remarkable impacts of gene mutation, methylation, and copy number variation on drug sensitivity are neglected. Drug sensitivity prediction can both help protect patients from some adverse drug reactions and improve the efficacy of treatment. Genomics data are extremely useful for drug sensitivity prediction task. This article reviews the role of drug sensitivity prediction, describes a variety of methods for predicting drug sensitivity. Moreover, the research significance of drug sensitivity prediction, as well as existing problems are well discussed.
Collapse
Affiliation(s)
- Yaojia Chen
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Liran Juan
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Xiao Lv
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Lei Shi
- Department of Spine Surgery Changzheng Hospital, Naval Medical University, Shanghai, China
| |
Collapse
|
38
|
Gu X, Guo L, Liao B, Jiang Q. Pseudo-188D: Phage Protein Prediction Based on a Model of Pseudo-188D. Front Genet 2021; 12:796327. [PMID: 34925468 PMCID: PMC8672092 DOI: 10.3389/fgene.2021.796327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022] Open
Abstract
Phages have seriously affected the biochemical systems of the world, and not only are phages related to our health, but medical treatments for many cancers and skin infections are related to phages; therefore, this paper sought to identify phage proteins. In this paper, a Pseudo-188D model was established. The digital features of the phage were extracted by PseudoKNC, an appropriate vector was selected by the AdaBoost tool, and features were extracted by 188D. Then, the extracted digital features were combined together, and finally, the viral proteins of the phage were predicted by a stochastic gradient descent algorithm. Our model effect reached 93.4853%. To verify the stability of our model, we randomly selected 80% of the downloaded data to train the model and used the remaining 20% of the data to verify the robustness of our model.
Collapse
Affiliation(s)
- Xiaomei Gu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Institute of Yangtze River Delta, University of Electronic Science and Technology of China, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Lina Guo
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Qinghua Jiang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
39
|
Zhao D, Teng Z, Li Y, Chen D. iAIPs: Identifying Anti-Inflammatory Peptides Using Random Forest. Front Genet 2021; 12:773202. [PMID: 34917130 PMCID: PMC8669811 DOI: 10.3389/fgene.2021.773202] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2021] [Accepted: 10/08/2021] [Indexed: 12/25/2022] Open
Abstract
Recently, several anti-inflammatory peptides (AIPs) have been found in the process of the inflammatory response, and these peptides have been used to treat some inflammatory and autoimmune diseases. Therefore, identifying AIPs accurately from a given amino acid sequences is critical for the discovery of novel and efficient anti-inflammatory peptide-based therapeutics and the acceleration of their application in therapy. In this paper, a random forest-based model called iAIPs for identifying AIPs is proposed. First, the original samples were encoded with three feature extraction methods, including g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset is generated by a two-step feature selection method, in which the feature is ranked by the analysis of variance (ANOVA) method, and the optimal feature subset is generated by the incremental feature selection strategy. Finally, the optimal feature subset is inputted into the random forest classifier, and the identification model is constructed. Experiment results showed that iAIPs achieved an AUC value of 0.822 on an independent test dataset, which indicated that our proposed model has better performance than the existing methods. Furthermore, the extraction of features for peptide sequences provides the basis for evolutionary analysis. The study of peptide identification is helpful to understand the diversity of species and analyze the evolutionary history of species.
Collapse
Affiliation(s)
- Dongxu Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| |
Collapse
|
40
|
Gong Y, Liao B, Wang P, Zou Q. DrugHybrid_BS: Using Hybrid Feature Combined With Bagging-SVM to Predict Potentially Druggable Proteins. Front Pharmacol 2021; 12:771808. [PMID: 34916947 PMCID: PMC8669608 DOI: 10.3389/fphar.2021.771808] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2021] [Accepted: 11/15/2021] [Indexed: 01/09/2023] Open
Abstract
Drug targets are biological macromolecules or biomolecule structures capable of specifically binding a therapeutic effect with a particular drug or regulating physiological functions. Due to the important value and role of drug targets in recent years, the prediction of potential drug targets has become a research hotspot. The key to the research and development of modern new drugs is first to identify potential drug targets. In this paper, a new predictor, DrugHybrid_BS, is developed based on hybrid features and Bagging-SVM to identify potentially druggable proteins. This method combines the three features of monoDiKGap (k = 2), cross-covariance, and grouped amino acid composition. It removes redundant features and analyses key features through MRMD and MRMD2.0. The cross-validation results show that 96.9944% of the potentially druggable proteins can be accurately identified, and the accuracy of the independent test set has reached 96.5665%. This all means that DrugHybrid_BS has the potential to become a useful predictive tool for druggable proteins. In addition, the hybrid key features can identify 80.0343% of the potentially druggable proteins combined with Bagging-SVM, which indicates the significance of this part of the features for research.
Collapse
Affiliation(s)
- Yuxin Gong
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Smart Education, Hainan Normal University, Ministry of Education, Haikou, China
| | - Bo Liao
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Smart Education, Hainan Normal University, Ministry of Education, Haikou, China
| | - Peng Wang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Smart Education, Hainan Normal University, Ministry of Education, Haikou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
41
|
Dou L, Zhou W, Zhang L, Xu L, Han K. Accurate identification of RNA D modification using multiple features. RNA Biol 2021; 18:2236-2246. [PMID: 33729104 PMCID: PMC8632091 DOI: 10.1080/15476286.2021.1898160] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Revised: 02/13/2021] [Accepted: 02/23/2021] [Indexed: 10/21/2022] Open
Abstract
As one of the common post-transcriptional modifications in tRNAs, dihydrouridine (D) has prominent effects on regulating the flexibility of tRNA as well as cancerous diseases. Facing with the expensive and time-consuming sequencing techniques to detect D modification, precise computational tools can largely promote the progress of molecular mechanisms and medical developments. We proposed a novel predictor, called iRNAD_XGBoost, to identify potential D sites using multiple RNA sequence representations. In this method, by considering the imbalance problem using hybrid sampling method SMOTEEEN, the XGBoost-selected top 30 features are applied to construct model. The optimized model showed high Sn and Sp values of 97.13% and 97.38% over jackknife test, respectively. For the independent experiment, these two metrics separately achieved 91.67% and 94.74%. Compared with iRNAD method, this model illustrated high generalizability and consistent prediction efficiencies for positive and negative samples, which yielded satisfactory MCC scores of 0.94 and 0.86, respectively. It is inferred that the chemical property and nucleotide density features (CPND), electron-ion interaction pseudopotential (EIIP and PseEIIP) as well as dinucleotide composition (DNC) are crucial to the recognition of D modification. The proposed predictor is a promising tool to help experimental biologists investigate molecular functions.
Collapse
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, GuangdongChina
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, SichuanChina
| | - Wenyang Zhou
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, HeilongjiangChina
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, Guangdong, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, GuangdongChina
| | - Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, HeilongjiangChina
| |
Collapse
|
42
|
Learning with Hilbert–Schmidt independence criterion: A review and new perspectives. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
43
|
Zou Y, Ding Y, Peng L, Zou Q. FTWSVM-SR: DNA-Binding Proteins Identification via Fuzzy Twin Support Vector Machines on Self-Representation. Interdiscip Sci 2021; 14:372-384. [PMID: 34743286 DOI: 10.1007/s12539-021-00489-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2021] [Revised: 10/11/2021] [Accepted: 10/24/2021] [Indexed: 12/01/2022]
Abstract
Due to the high cost of DNA-binding proteins (DBPs) detection, many machine learning algorithms (ML) have been utilized to large-scale process and detect DBPs. The previous methods took no count of the processing of noise samples. In this study, a fuzzy twin support vector machine (FTWSVM) is employed to detect DBPs. First, multiple types of protein sequence features are formed into kernel matrices; Then, multiple kernel learning (MKL) algorithm is utilized to linear combine multiple kernels; next, self-representation-based membership function is utilized to estimate membership value (weight) of each training sample; finally, we feed the integrated kernel matrix and membership values into the FTWSVM-SR model for training and testing. On comparison with other predictive models, FTWSVM based on SR (FTWSVM-SR) obtains the best performance of Matthew's correlation coefficient (MCC): 0.7410 and 0.5909 on two independent testing sets (PDB186 and PDB2272 datasets), respectively. The results confirm that our method can be an effective DBPs detection tool. Before the biochemical experiment, our model can screen and analyze DBPs on a large scale.
Collapse
Affiliation(s)
- Yi Zou
- School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, People's Republic of China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China
| | - Li Peng
- School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, People's Republic of China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China.
| |
Collapse
|
44
|
Drug–disease associations prediction via Multiple Kernel-based Dual Graph Regularized Least Squares. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107811] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
|
45
|
Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med 2021; 19:449. [PMID: 34706730 PMCID: PMC8554859 DOI: 10.1186/s12967-021-03084-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2021] [Accepted: 09/16/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Cancer is one of the most serious diseases threatening human health. Cancer immunotherapy represents the most promising treatment strategy due to its high efficacy and selectivity and lower side effects compared with traditional treatment. The identification of tumor T cell antigens is one of the most important tasks for antitumor vaccines development and molecular function investigation. Although several machine learning predictors have been developed to identify tumor T cell antigen, more accurate tumor T cell antigen identification by existing methodology is still challenging. METHODS In this study, we used a non-redundant dataset of 592 tumor T cell antigens (positive samples) and 393 tumor T cell antigens (negative samples). Four types feature encoding methods have been studied to build an efficient predictor, including amino acid composition, global protein sequence descriptors and grouped amino acid and peptide composition. To improve the feature representation ability of the hybrid features, we further employed a two-step feature selection technique to search for the optimal feature subset. The final prediction model was constructed using random forest algorithm. RESULTS Finally, the top 263 informative features were selected to train the random forest classifier for detecting tumor T cell antigen peptides. iTTCA-RF provides satisfactory performance, with balanced accuracy, specificity and sensitivity values of 83.71%, 78.73% and 88.69% over tenfold cross-validation as well as 73.14%, 62.67% and 83.61% over independent tests, respectively. The online prediction server was freely accessible at http://lab.malab.cn/~acy/iTTCA . CONCLUSIONS We have proven that the proposed predictor iTTCA-RF is superior to the other latest models, and will hopefully become an effective and useful tool for identifying tumor T cell antigens presented in the context of major histocompatibility complex class I.
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Huannan Guo
- Department of Oncology, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
Collapse
|
46
|
Ding Y, Yang C, Tang J, Guo F. Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02737-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
47
|
Lu W, Cao Y, Wu H, Ding Y, Song Z, Zhang Y, Fu Q, Li H. Research on RNA secondary structure predicting via bidirectional recurrent neural network. BMC Bioinformatics 2021; 22:431. [PMID: 34496763 PMCID: PMC8427827 DOI: 10.1186/s12859-021-04332-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Accepted: 08/23/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA secondary structure prediction is an important research content in the field of biological information. Predicting RNA secondary structure with pseudoknots has been proved to be an NP-hard problem. Traditional machine learning methods can not effectively apply protein sequence information with different sequence lengths to the prediction process due to the constraint of the self model when predicting the RNA secondary structure. In addition, there is a large difference between the number of paired bases and the number of unpaired bases in the RNA sequences, which means the problem of positive and negative sample imbalance is easy to make the model fall into a local optimum. To solve the above problems, this paper proposes a variable-length dynamic bidirectional Gated Recurrent Unit(VLDB GRU) model. The model can accept sequences with different lengths through the introduction of flag vector. The model can also make full use of the base information before and after the predicted base and can avoid losing part of the information due to truncation. Introducing a weight vector to predict the RNA training set by dynamically adjusting each base loss function solves the problem of balanced sample imbalance. RESULTS The algorithm proposed in this paper is compared with the existing algorithms on five representative subsets of the data set RNA STRAND. The experimental results show that the accuracy and Matthews correlation coefficient of the method are improved by 4.7% and 11.4%, respectively. CONCLUSIONS The flag vector introduced allows the model to effectively use the information before and after the protein sequence; the introduced weight vector solves the problem of unbalanced sample balance. Compared with other algorithms, the LVDB GRU algorithm proposed in this paper has the best detection results.
Collapse
Affiliation(s)
- Weizhong Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Yan Cao
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China. .,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China.
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Zhengwei Song
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Yu Zhang
- Suzhou Industrial Park Institute of Services Outsourcing, Suzhou, 215123, China
| | - Qiming Fu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Haiou Li
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| |
Collapse
|
48
|
Chen X, Lin Y, Qu Q, Ning B, Chen H, Li X. An epistasis and heterogeneity analysis method based on maximum correlation and maximum consistence criteria. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:7711-7726. [PMID: 34814271 DOI: 10.3934/mbe.2021382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Tumor heterogeneity significantly increases the difficulty of tumor treatment. The same drugs and treatment methods have different effects on different tumor subtypes. Therefore, tumor heterogeneity is one of the main sources of poor prognosis, recurrence and metastasis. At present, there have been some computational methods to study tumor heterogeneity from the level of genome, transcriptome, and histology, but these methods still have certain limitations. In this study, we proposed an epistasis and heterogeneity analysis method based on genomic single nucleotide polymorphism (SNP) data. First of all, a maximum correlation and maximum consistence criteria was designed based on Bayesian network score K2 and information entropy for evaluating genomic epistasis. As the number of SNPs increases, the epistasis combination space increases sharply, resulting in a combination explosion phenomenon. Therefore, we next use an improved genetic algorithm to search the SNP epistatic combination space for identifying potential feasible epistasis solutions. Multiple epistasis solutions represent different pathogenic gene combinations, which may lead to different tumor subtypes, that is, heterogeneity. Finally, the XGBoost classifier is trained with feature SNPs selected that constitute multiple sets of epistatic solutions to verify that considering tumor heterogeneity is beneficial to improve the accuracy of tumor subtype prediction. In order to demonstrate the effectiveness of our method, the power of multiple epistatic recognition and the accuracy of tumor subtype classification measures are evaluated. Extensive simulation results show that our method has better power and prediction accuracy than previous methods.
Collapse
Affiliation(s)
- Xia Chen
- School of Basic Education, Changsha Aeronautical Vocational and Technical College, Changsha, Hunan 410124, China
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Yexiong Lin
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Qiang Qu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Bin Ning
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Haowen Chen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Xiong Li
- School of Software, East China Jiaotong University, Nanchang 330013, China
| |
Collapse
|
49
|
Niu M, Wu J, Zou Q, Liu Z, Xu L. rBPDL:Predicting RNA-Binding Proteins Using Deep Learning. IEEE J Biomed Health Inform 2021; 25:3668-3676. [PMID: 33780344 DOI: 10.1109/jbhi.2021.3069259] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
RNA-binding protein (RBP) is a powerful and wide-ranging regulator that plays an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists. Although experimental methods have made great progress in predicting RBP, they are time-consuming and not flexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, through AUC statistical analysis of the RBP domain, we analyzed the performance of rBPDL and found that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset.
Collapse
|
50
|
Zhang C, Guo C, Li Y, Liu K, Zhao Q, Ouyang L. Identification of Claudin-6 as a Molecular Biomarker in Pan-Cancer Through Multiple Omics Integrative Analysis. Front Cell Dev Biol 2021; 9:726656. [PMID: 34409042 PMCID: PMC8365468 DOI: 10.3389/fcell.2021.726656] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2021] [Accepted: 07/08/2021] [Indexed: 12/13/2022] Open
Abstract
Claudin-6 (CLDN6) is one of the 27 family members of claudins and majorly involved in the tight junction and cell-to-cell adhesion of epithelial cell sheets, playing a significant role in cancer initiation and progression. To provide a more systematic and comprehensive dimension of identifying the diverse significance of CLDN6 in a variety of malignant tumors, we explored CLDN6 through multiple omics data integrative analysis, including gene expression level in pan-cancer and comparison of CLDN6 expression in different molecular subtypes and immune subtypes of pan-cancer, targeted protein, biological functions, molecular signatures, diagnostic value, and prognostic value in pan-cancer. Furthermore, we focused on uterine corpus endometrial carcinoma (UCEC) and further investigated CLDN6 from the perspective of the correlations with clinical characteristics, prognosis in different clinical subgroups, co-expression genes, and differentially expressed genes (DEGs), basing on discussing the validation of its established monoclonal antibody by immunohistochemical staining and semi-quantification reported in the previous study. As a result, CLDN6 expression differs significantly not only in most cancers but also in different molecular and immune subtypes of cancers. Besides, high accuracy in predicting cancers and notable correlations with prognosis of certain cancers suggest that CLDN6 might be a potential diagnostic and prognostic biomarker of cancers. Additionally, CLDN6 is identified to be significantly correlated with age, stage, weight, histological type, histologic grade, and menopause status in UCEC. Moreover, CLDN6 high expression can lead to a worse overall survival (OS), disease-specific survival (DSS), and progression-free interval (PFI) in UCEC, especially in different clinical subgroups of UCEC. Taken together, CLDN6 may be a remarkable molecular biomarker for diagnosis and prognosis in pan-cancer and an independent prognostic risk factor of UCEC, presenting to be a promising molecular target for cancer therapy.
Collapse
Affiliation(s)
- Chiyuan Zhang
- Department of Obstetrics and Gynecology, Shengjing Hospital of China Medical University, Shenyang, China
| | - Cuishan Guo
- Department of Obstetrics and Gynecology, Shengjing Hospital of China Medical University, Shenyang, China
| | - Yan Li
- Department of Obstetrics and Gynecology, Shengjing Hospital of China Medical University, Shenyang, China
| | - Kuiran Liu
- Department of Obstetrics and Gynecology, Shengjing Hospital of China Medical University, Shenyang, China
| | - Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, China
| | - Ling Ouyang
- Department of Obstetrics and Gynecology, Shengjing Hospital of China Medical University, Shenyang, China
| |
Collapse
|