1
Bao W, Gu Y, Chen B, Yu H. Golgi_DF: Golgi proteins classification with deep forest. Front Neurosci 2023; 17:1197824. [PMID: 37250391 PMCID: PMC10213405 DOI: 10.3389/fnins.2023.1197824] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 03/31/2023] [Accepted: 04/19/2023] [Indexed: 05/31/2023] Open
Abstract
Introduction: The Golgi apparatus is a component of the endomembrane system in eukaryotic cells. Its main function is to send proteins synthesized in the endoplasmic reticulum to specific parts of the cell or to secrete them outside the cell. The Golgi apparatus is therefore an important organelle for protein synthesis in eukaryotic cells. Golgi disorders can cause various neurodegenerative and genetic diseases, and the accurate classification of Golgi proteins helps in developing corresponding therapeutic drugs. Methods: This paper proposes a novel Golgi protein classification method, Golgi_DF, based on the deep forest algorithm. First, the proteins to be classified are converted into feature vectors containing various information. Second, the synthetic minority oversampling technique (SMOTE) is applied to the samples. Next, LightGBM is used for feature reduction. Meanwhile, the features can be used in the penultimate dense layer. The reconstructed features can then be classified with the deep forest algorithm. Results: Golgi_DF can be used to select the important features and identify Golgi proteins. Experiments show that it performs better than other state-of-the-art methods. Golgi_DF is available as a standalone tool, and all of its source code is publicly available at https://github.com/baowz12345/golgiDF. Discussion: Golgi_DF employs reconstructed features to classify Golgi proteins. Such a method may obtain more usable features from the UniRep features.
Affiliation(s)
- Wenzheng Bao: School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
- Yujian Gu: School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
- Baitong Chen: Department of Stomatology, Xuzhou First People’s Hospital, Xuzhou, China; The Affiliated Hospital of China University of Mining and Technology, Xuzhou, China
- Huiping Yu: Department of Neurosurgery, The Hospital of Joint Logistic, Quanzhou, China
2
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
SNARE proteins are of great importance, and their loss of function can lead to a variety of diseases. The SNARE protein is known as a membrane fusion protein, and it is crucial for mediating vesicle fusion. An accurate method is therefore needed to identify SNARE proteins. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor model (GHKNN). The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample protein sequence features. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the GHKNN-based binary classification model with other classifiers and measure them using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
Affiliation(s)
- Xingyue Gu: State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Yijie Ding: Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pengfeng Xiao: State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Tao He: Beidahuang Industry Group General Hospital, Harbin, China
- Correspondence: Pengfeng Xiao; Tao He; Yijie Ding
3
Le NQK, Huynh TT. Identifying SNAREs by Incorporating Deep Learning Architecture and Amino Acid Embedding Representation. Front Physiol 2019; 10:1501. [PMID: 31920706 PMCID: PMC6914855 DOI: 10.3389/fphys.2019.01501] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/28/2019] [Accepted: 11/26/2019] [Indexed: 12/12/2022] Open
Abstract
SNAREs (soluble N-ethylmaleimide-sensitive factor activating protein receptors) are a group of proteins that are crucial for membrane fusion and exocytosis of neurotransmitters from the cell. They play an important role in a broad range of cell processes, including cell growth, cytokinesis, and synaptic transmission, to promote cell membrane integration in eukaryotes. Many studies determined that SNARE proteins have been associated with a lot of human diseases, especially in cancer. Therefore, identifying their functions is a challenging problem for scientists to better understand the cancer disease as well as design the drug targets for treatment. We described each protein sequence based on the amino acid embeddings using fastText, which is a natural language processing model performing well in its field. Because each protein sequence is similar to a sentence with different words, applying language model into protein sequence is challenging and promising. After generating, the amino acid embedding features were fed into a deep learning algorithm for prediction. Our model which combines fastText model and deep convolutional neural networks could identify SNARE proteins with an independent test accuracy of 92.8%, sensitivity of 88.5%, specificity of 97%, and Matthews correlation coefficient (MCC) of 0.86. Our performance results were superior to the state-of-the-art predictor (SNARE-CNN). We suggest this study as a reliable method for biologists for SNARE identification and it serves a basis for applying fastText word embedding model into bioinformatics, especially in protein sequencing prediction.
Affiliation(s)
- Nguyen Quoc Khanh Le: Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Tuan-Tu Huynh: Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Bien Hoa, Vietnam; Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
4
Lv Z, Jin S, Ding H, Zou Q. A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front Bioeng Biotechnol 2019; 7:215. [PMID: 31552241 PMCID: PMC6737778 DOI: 10.3389/fbioe.2019.00215] [Citation(s) in RCA: 80] [Impact Index Per Article: 16.0] [Received: 07/23/2019] [Accepted: 08/22/2019] [Indexed: 02/01/2023] Open
Abstract
To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forest sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences), with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that for the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor utilizing feature vectors without any position-specific scoring matrix and its derivative features. Therefore, the rfGPT is a more practical tool, because no sequence alignment is required with tens of millions of protein sequences. To date, the rfGPT is the Golgi classifier with the best independent testing scores, optimized by training on smaller benchmark data sets. Feature importance analysis proves that the non-polar and aliphatic residues composition, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide and aromatic residues composition between the NH2-terminal and COOH-terminal of protein sequences are the three top biological features for distinguishing the sub-Golgi proteins.
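The 2-gap dipeptide feature mentioned in this abstract can be illustrated with a short sketch. It follows the general definition of g-gap dipeptide composition (frequencies of residue pairs separated by g intervening positions) and is not the rfGPT implementation. The function name, the alphabet ordering, and the choice to skip pairs containing non-standard residues are assumptions made here for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, gap=2):
    """Return a 400-dimensional vector of g-gap dipeptide frequencies.

    A g-gap dipeptide pairs the residue at position i with the residue at
    position i + gap + 1. Pairs containing a residue outside AMINO_ACIDS
    are skipped (an assumption of this sketch).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    # Normalise by the number of counted pairs so the vector sums to 1.
    return [counts[p] / total if total else 0.0 for p in pairs]


# Hypothetical toy sequence, for illustration only.
vector = g_gap_dipeptide_composition("MKVLAAGIVGLLLAQ", gap=2)
print(len(vector))
```

Because the vector length is fixed at 400 regardless of sequence length, such features can be fed directly to a feature selection step and a classifier.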
Affiliation(s)
- Zhibin Lv: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shunshan Jin: Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Hui Ding: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
5
Zhao W, Li GP, Wang J, Zhou YK, Gao Y, Du PF. Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. J Theor Biol 2019; 473:38-43. [DOI: 10.1016/j.jtbi.2019.04.025] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/01/2019] [Revised: 04/22/2019] [Accepted: 04/29/2019] [Indexed: 12/11/2022]
6
Le NQK, Nguyen VN. SNARE-CNN: a 2D convolutional neural network architecture to identify SNARE proteins from high-throughput sequencing data. PeerJ Comput Sci 2019; 5:e177. [PMID: 33816830 PMCID: PMC7924420 DOI: 10.7717/peerj-cs.177] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 11/20/2018] [Accepted: 02/06/2019] [Indexed: 05/04/2023]
Abstract
Deep learning has been increasingly and widely used to solve numerous problems in various fields with state-of-the-art performance. It can also be applied in bioinformatics to reduce the requirement for feature extraction and reach high performance. This study attempts to use deep learning to predict SNARE proteins, which perform one of the most vital molecular functions in life science. A functional loss of SNARE proteins has been implicated in a variety of human diseases (e.g., neurodegenerative diseases, mental illness, cancer, and so on). Therefore, creating a precise model to identify their functions is a crucial problem for understanding these diseases and designing drug targets. Our SNARE-CNN model, which uses two-dimensional convolutional neural networks and position-specific scoring matrix profiles, could identify SNARE proteins with a sensitivity of 76.6%, specificity of 93.5%, accuracy of 89.7%, and MCC of 0.7 on the cross-validation dataset. We also evaluate the performance of our model on an independent dataset, and the result shows that we are able to solve the overfitting problem. Compared with other state-of-the-art methods, this approach achieved significant improvement in all of the metrics. Through the proposed study, we provide an effective model for identifying SNARE proteins and a basis for further research that can apply deep learning in bioinformatics, especially in protein function prediction. SNARE-CNN is freely available at https://github.com/khanhlee/snare-cnn.
Affiliation(s)
- Van-Nui Nguyen: University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Vietnam
7
Rahman MS, Rahman MK, Kaykobad M, Rahman MS. isGPT: An optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection. Artif Intell Med 2017; 84:90-100. [PMID: 29183738 DOI: 10.1016/j.artmed.2017.11.003] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 10/05/2017] [Revised: 11/13/2017] [Accepted: 11/17/2017] [Indexed: 10/18/2022]
Abstract
The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. The main task of the GA is to modify and sort proteins for transport throughout the cell. Proteins permeate through the GA on the ER (Endoplasmic Reticulum) facing side (cis side) and depart on the other side (trans side). Based on this phenomenon, there are two types of GA proteins, namely, cis-Golgi proteins and trans-Golgi proteins. Any dysfunction of GA proteins can result in congenital glycosylation disorders and some other forms of difficulties that may lead to neurodegenerative and inherited diseases like diabetes, cancer and cystic fibrosis. So, the exact classification of GA proteins may contribute to drug development, which will further help in medication. In this paper, we focus on building a new computational model that not only introduces easy ways to extract features from protein sequences but also optimizes classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we have employed a Random Forest (RF) model to rank the features based on the importance score obtained from it. After selecting the top ranked features, we have applied a Support Vector Machine (SVM) to classify the sub-Golgi proteins. We have trained a regression model as well as a classification model and found the former to be superior. The model shows improved performance over all previous methods. As the benchmark dataset is significantly imbalanced, we have applied the Synthetic Minority Over-sampling Technique (SMOTE) to the dataset to make it balanced and have conducted experiments on both versions. Our method, namely, identification of sub-Golgi Protein Types (isGPT), achieves accuracy values of 95.4%, 95.9% and 95.3% for the 10-fold cross-validation test, jackknife test and independent test respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques. The source code of isGPT, along with the relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT.
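Several abstracts in this list rely on SMOTE to balance the training data. The core idea of SMOTE is to create a synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal standard-library sketch of that idea. It is not the code used by isGPT or any other listed paper, and the function name, parameters, and toy data are hypothetical.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Assumes at least two minority samples. Uses squared Euclidean distance
    to find the k nearest minority neighbours of each chosen base sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base sample.
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and its neighbour.
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical toy data: three minority samples with two features each.
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(smote(minority, n_new=4, k=2))
```

In practice the synthetic samples are added only to the training split, so that evaluation data remains untouched.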
Affiliation(s)
- M Saifur Rahman: Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
- M Kaykobad: Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
- M Sohel Rahman: Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
8
Ahmad J, Javed F, Hayat M. Intelligent computational model for classification of sub-Golgi protein using oversampling and fisher feature selection methods. Artif Intell Med 2017; 78:14-22. [DOI: 10.1016/j.artmed.2017.05.001] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/26/2017] [Revised: 04/19/2017] [Accepted: 05/02/2017] [Indexed: 10/19/2022]
9
Prediction of Golgi-resident protein types using general form of Chou's pseudo-amino acid compositions: Approaches with minimal redundancy maximal relevance feature selection. J Theor Biol 2016; 402:38-44. [PMID: 27155042 DOI: 10.1016/j.jtbi.2016.04.032] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 03/19/2016] [Revised: 04/19/2016] [Accepted: 04/26/2016] [Indexed: 11/20/2022]
Abstract
Recently, several efforts have been made in predicting Golgi-resident proteins. However, it is still a challenging task to identify the type of a Golgi-resident protein. Precise prediction of the type of a Golgi-resident protein plays a key role in understanding its molecular functions in various biological processes. In this paper, we proposed to use a mutual information based feature selection scheme with the general form Chou's pseudo-amino acid compositions to predict the Golgi-resident protein types. The positional specific physicochemical properties were applied in the Chou's pseudo-amino acid compositions. We achieved 91.24% prediction accuracy in a jackknife test with 49 selected features. It has the best performance among all the present predictors. This result indicates that our computational model can be useful in identifying Golgi-resident protein types.
10
Jiao YS, Du PF. Predicting Golgi-resident protein types using pseudo amino acid compositions: Approaches with positional specific physicochemical properties. J Theor Biol 2015; 391:35-42. [PMID: 26702543 DOI: 10.1016/j.jtbi.2015.11.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 08/11/2015] [Revised: 11/17/2015] [Accepted: 11/19/2015] [Indexed: 11/24/2022]
Abstract
Knowing the type of a Golgi-resident protein is an important step in understanding its molecular functions as well as its role in biological processes. In this paper, we developed a novel computational method to predict Golgi-resident protein types using positional specific physicochemical properties and analysis of variance based feature selection methods. Our method achieved 86.9% prediction accuracy in leave-one-out cross-validations with only 59 features. Our method has the potential to be applied in predicting a wide range of protein attributes.
Affiliation(s)
- Ya-Sen Jiao: School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
- Pu-Feng Du: School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
11
Schoberer J, Liebminger E, Vavra U, Veit C, Castilho A, Dicker M, Maresch D, Altmann F, Hawes C, Botchway SW, Strasser R. The transmembrane domain of N-acetylglucosaminyltransferase I is the key determinant for its Golgi subcompartmentation. The Plant Journal: for Cell and Molecular Biology 2014; 80:809-22. [PMID: 25230686 PMCID: PMC4282539 DOI: 10.1111/tpj.12671] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/07/2014] [Revised: 08/28/2014] [Accepted: 09/11/2014] [Indexed: 05/18/2023]
Abstract
Golgi-resident type-II membrane proteins are asymmetrically distributed across the Golgi stack. The intrinsic features of the protein that determine its subcompartment-specific concentration are still largely unknown. Here, we used a series of chimeric proteins to investigate the contribution of the cytoplasmic, transmembrane and stem region of Nicotiana benthamiana N-acetylglucosaminyltransferase I (GnTI) for its cis/medial-Golgi localization and for protein-protein interaction in the Golgi. The individual GnTI protein domains were replaced with those from the well-known trans-Golgi enzyme α2,6-sialyltransferase (ST) and transiently expressed in Nicotiana benthamiana. Using co-localization analysis and N-glycan profiling, we show that the transmembrane domain of GnTI is the major determinant for its cis/medial-Golgi localization. By contrast, the stem region of GnTI contributes predominately to homomeric and heteromeric protein complex formation. Importantly, in transgenic Arabidopsis thaliana, a chimeric GnTI variant with altered sub-Golgi localization was not able to complement the GnTI-dependent glycosylation defect. Our results suggest that sequence-specific features in the transmembrane domain of GnTI account for its steady-state distribution in the cis/medial-Golgi in plants, which is a prerequisite for efficient N-glycan processing in vivo.
Affiliation(s)
- Jennifer Schoberer: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Eva Liebminger: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Ulrike Vavra: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Christiane Veit: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Alexandra Castilho: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Martina Dicker: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Daniel Maresch: Department of Chemistry, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Friedrich Altmann: Department of Chemistry, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
- Chris Hawes: Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Headington, Oxford, OX3 0BP, UK
- Stanley W Botchway: Research Complex at Harwell, Central Laser Facility, Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell-Oxford, Didcot, OX11 0QX, UK
- Richard Strasser: Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, Vienna, 1190, Austria
12
Li X, Wu X, Wu G. Robust feature generation for protein subchloroplast location prediction with a weighted GO transfer model. J Theor Biol 2014; 347:84-94. [PMID: 24423409 DOI: 10.1016/j.jtbi.2014.01.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/12/2013] [Revised: 10/17/2013] [Accepted: 01/03/2014] [Indexed: 10/25/2022]
Abstract
Chloroplasts are crucial organelles of green plants and eukaryotic algae since they conduct photosynthesis. Predicting the subchloroplast location of a protein can provide important insights for understanding its biological functions. The performance of subchloroplast location prediction algorithms often depends on deriving predictive and succinct features from genomic and proteomic data. In this work, a novel weighted Gene Ontology (GO) transfer model is proposed to generate discriminating features from sequence data and GO Categories. This model contains two components. First, we transfer the GO terms of the homologous protein, and then assign the bit-score as weights to GO features. Second, we employ term-selection methods to determine weights for GO terms. This model is capable of improving prediction accuracy due to the tolerance of the noise derived from homolog knowledge transfer. The proposed weighted GO transfer method based on bit-score and a logarithmic transformation of CHI-square (WS-LCHI) performs better than the baseline models, and also outperforms the four off-the-shelf subchloroplast prediction methods.
Affiliation(s)
- Xiaomei Li: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
- Xindong Wu: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China; Department of Computer Science, University of Vermont, Burlington, VT 50405, USA.
- Gongqing Wu: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
13
Mei S. SVM ensemble based transfer learning for large-scale membrane proteins discrimination. J Theor Biol 2013; 340:105-10. [PMID: 24050851 DOI: 10.1016/j.jtbi.2013.09.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 03/28/2013] [Revised: 09/04/2013] [Accepted: 09/06/2013] [Indexed: 11/16/2022]
Abstract
Membrane proteins play important roles in molecular trans-membrane transport, ligand-receptor recognition, cell-cell interaction, enzyme catalysis, host immune defense response and infectious disease pathways. To date, discriminating membrane proteins remains a challenging problem from the viewpoints of both biological experimental determination and computational modeling. This work presents an SVM ensemble based transfer learning model for membrane protein discrimination (SVM-TLM). To reduce the data constraints on computational modeling, this method investigates the effectiveness of transferring homolog knowledge to the target membrane proteins under the framework of probability weighted ensemble learning. Compared with the multiple kernel learning based transfer learning model, the method takes advantage of sparseness based SVM optimization on large data and is thus more computationally efficient for large protein data analysis. The experiments on a large membrane protein benchmark dataset show that SVM-TLM achieves significantly better cross-validation performance than the baseline model.
Affiliation(s)
- Suyu Mei: Software College, Shenyang Normal University, Shenyang, China.
14
Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor 2013; 61:259-68. [PMID: 23475502 DOI: 10.1007/s10441-013-9181-9] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.6] [Received: 07/03/2012] [Accepted: 02/23/2013] [Indexed: 01/25/2023]
Abstract
The mitochondrion is a key organelle of the eukaryotic cell that provides the energy for cellular activities. Correctly identifying the submitochondria locations of proteins can provide plentiful information for understanding their functions. However, using wet-experimental methods to recognize the submitochondria locations of proteins is time-consuming and costly. Thus, it is highly desirable to develop a bioinformatics method to predict the submitochondria locations of mitochondrion proteins. In this work, a novel method based on the support vector machine was developed to predict the submitochondria locations of mitochondrion proteins by using over-represented tetrapeptides selected with the binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrion proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validated results showed that 91.1% of the 495 mitochondrion proteins can be correctly predicted. Subsequently, our model was evaluated on three existing benchmark datasets. The overall accuracies are 94.0, 94.7 and 93.4%, respectively, suggesting that the proposed model is potentially useful in the realm of mitochondrion proteome research. Based on this model, we built a predictor called TetraMito, which is freely available at http://lin.uestc.edu.cn/server/TetraMito.
15
Mei S. Multi-label multi-kernel transfer learning for human protein subcellular localization. PLoS One 2012; 7:e37716. [PMID: 22719847 PMCID: PMC3374840 DOI: 10.1371/journal.pone.0037716] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Received: 09/29/2011] [Accepted: 04/28/2012] [Indexed: 11/19/2022] Open
Abstract
Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, the existing sequence-based predictive models demonstrate moderate or unsatisfactory performance, and the gene ontology (GO) based models run the risk of performance overestimation for novel proteins. Furthermore, many human proteins have multiple subcellular locations, which renders the computational modelling more complicated. To date, far too few studies have specialized in predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM proposes a multi-label confusion matrix, formally formulates three multi-labelling performance measures and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario. On that basis, it further extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) for multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, a comprehensive survey of model performance for novel proteins and multi-labelling capability, MLMK-TLM will gain more practical applicability. The experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar.
Affiliation(s)
- Suyu Mei: Software College, Shenyang Normal University, Shenyang, China.
16
Driouich A, Follet-Gueye ML, Bernard S, Kousar S, Chevalier L, Vicré-Gibouin M, Lerouxel O. Golgi-mediated synthesis and secretion of matrix polysaccharides of the primary cell wall of higher plants. Frontiers in Plant Science 2012; 3:79. [PMID: 22639665 PMCID: PMC3355623 DOI: 10.3389/fpls.2012.00079] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 02/10/2012] [Accepted: 04/09/2012] [Indexed: 05/17/2023]
Abstract
The Golgi apparatus of eukaryotic cells is known for its central role in the processing, sorting, and transport of proteins to intra- and extra-cellular compartments. In plants, it has the additional task of assembling and exporting the non-cellulosic polysaccharides of the cell wall matrix including pectin and hemicelluloses, which are important for plant development and protection. In this review, we focus on the biosynthesis of complex polysaccharides of the primary cell wall of eudicotyledonous plants. We present and discuss the compartmental organization of the Golgi stacks with regards to complex polysaccharide assembly and secretion using immuno-electron microscopy and specific antibodies recognizing various sugar epitopes. We also discuss the significance of the recently identified Golgi-localized glycosyltransferases responsible for the biosynthesis of xyloglucan (XyG) and pectin.
Affiliation(s)
- Azeddine Driouich
- Laboratoire “Glycobiologie et Matrice Extracellulaire Végétale”, UPRES EA 4358, Institut Federatif de Recherche Multidisciplinaire sur les Peptides, Plate-forme de Recherche en Imagerie Cellulaire de Haute Normandie, Université de Rouen, Mont Saint Aignan, France
- *Correspondence: Azeddine Driouich, Laboratoire “Glycobiologie et Matrice Extracellulaire Végétale”, UPRES EA 4358, Institut Federatif de Recherche Multidisciplinaire sur les Peptides, Plate-forme de Recherche en Imagerie Cellulaire de Haute Normandie, Université de Rouen, Rue Tesnière, Bâtiment Henri Gadeau de Kerville, 76821 Mont Saint Aignan Cedex, France. e-mail:
- Marie-Laure Follet-Gueye
- Laboratoire “Glycobiologie et Matrice Extracellulaire Végétale”, UPRES EA 4358, Institut Federatif de Recherche Multidisciplinaire sur les Peptides, Plate-forme de Recherche en Imagerie Cellulaire de Haute Normandie, Université de Rouen, Mont Saint Aignan, France
- Sophie Bernard
- Laboratoire “Glycobiologie et Matrice Extracellulaire Végétale”, UPRES EA 4358, Institut Federatif de Recherche Multidisciplinaire sur les Peptides, Plate-forme de Recherche en Imagerie Cellulaire de Haute Normandie, Université de Rouen, Mont Saint Aignan, France
- Sumaira Kousar
- Centre de Recherches sur les Macromolécules végétales–CNRS, Université Joseph Fourier, Grenoble, France
- Laurence Chevalier
- Institut des Matériaux/UMR6634/CNRS, Faculté des Sciences et Techniques, Université de Rouen, St. Etienne du Rouvray Cedex, France
- Maïté Vicré-Gibouin
- Laboratoire “Glycobiologie et Matrice Extracellulaire Végétale”, UPRES EA 4358, Institut Federatif de Recherche Multidisciplinaire sur les Peptides, Plate-forme de Recherche en Imagerie Cellulaire de Haute Normandie, Université de Rouen, Mont Saint Aignan, France
- Olivier Lerouxel
- Centre de Recherches sur les Macromolécules végétales–CNRS, Université Joseph Fourier, Grenoble, France
|
17
|
Du P, Li T, Wang X. Recent progress in predicting protein sub-subcellular locations. Expert Rev Proteomics 2011; 8:391-404. [PMID: 21679119 DOI: 10.1586/epr.11.20] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
In the last two decades, the number of the known protein sequences increased very rapidly. However, a knowledge of protein function only exists for a small portion of these sequences. Since the experimental approaches for determining protein functions are costly and time consuming, in silico methods have been introduced to bridge the gap between knowledge of protein sequences and their functions. Knowing the subcellular location of a protein is considered to be a critical step in understanding its biological functions. Many efforts have been undertaken to predict the protein subcellular locations in silico. With the accumulation of available data, the substructures of some subcellular organelles, such as the cell nucleus, mitochondria and chloroplasts, have been taken into consideration by several studies in recent years. These studies create a new research topic, namely 'protein sub-subcellular location prediction', which goes one level deeper than classic protein subcellular location prediction.
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
|
18
|
Mei S, Fei W, Zhou S. Gene ontology based transfer learning for protein subcellular localization. BMC Bioinformatics 2011; 12:44. [PMID: 21284890 PMCID: PMC3039576 DOI: 10.1186/1471-2105-12-44] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 12/30/2010] [Accepted: 02/02/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. RESULTS In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of the potential false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization.
We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvement against the baseline models: 80.38% against 67.40% for Euk-mPLoc, a substantial increase of 12.98%; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, an accuracy increase of 7.05% and 6.67% on dataset MultiLoc plant and dataset MultiLoc animal, respectively; 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, an accuracy increase of 13.44%, 5.80% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc fungi holdout and dataset BaCelLoc animal holdout, respectively, as compared against 76.00%, 60.00% and 73.00% for the baseline model MultiLoc-GO, an accuracy increase of 5.25%, 20.45% and 6.46%, respectively. CONCLUSIONS Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
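The linear merging of several kernels described in this abstract can be illustrated with a short sketch. The abstract states only that cross validation is used to weigh the kernels; making each weight proportional to that kernel's cross-validation accuracy is an assumption adopted here for illustration, and `combine_kernels` is a hypothetical helper, not code from GO-TLM.

```python
def combine_kernels(kernels, cv_scores):
    """Linearly merge Gram matrices into a single kernel matrix.

    `kernels` is a list of square Gram matrices (lists of lists) computed
    over the same proteins; `cv_scores` holds one cross-validation accuracy
    per kernel. Weighting by normalised accuracy is an assumed scheme.
    """
    total = sum(cv_scores)
    weights = [score / total for score in cv_scores]
    n = len(kernels[0])
    # Weighted sum of the matrices, entry by entry.
    merged = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, merged
```

Because a non-negative weighted sum of positive semi-definite matrices is itself positive semi-definite, the merged matrix can be passed to any classifier that accepts a precomputed kernel.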
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, PR China.
|
19
|
Oikawa A, Joshi HJ, Rennie EA, Ebert B, Manisseri C, Heazlewood JL, Scheller HV. An integrative approach to the identification of Arabidopsis and rice genes involved in xylan and secondary wall development. PLoS One 2010; 5:e15481. [PMID: 21124849 PMCID: PMC2990762 DOI: 10.1371/journal.pone.0015481] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Received: 08/20/2010] [Accepted: 09/24/2010] [Indexed: 11/19/2022] Open
Abstract
Xylans constitute the major non-cellulosic component of plant biomass. Xylan biosynthesis is particularly pronounced in cells with secondary walls, implying that the synthesis network consists of a set of highly expressed genes in such cells. To improve the understanding of xylan biosynthesis, we performed a comparative analysis of co-expression networks between Arabidopsis and rice as reference species with different wall types. Many co-expressed genes were represented by orthologs in both species, which implies common biological features, while some gene families were only found in one of the species, and therefore likely to be related to differences in their cell walls. To predict the subcellular location of the identified proteins, we developed a new method, PFANTOM (plant protein family information-based predictor for endomembrane), which was shown to perform better for proteins in the endomembrane system than other available prediction methods. Based on the combined approach of co-expression and predicted cellular localization, we propose a model for Arabidopsis and rice xylan synthesis in the Golgi apparatus and signaling from plasma membrane to nucleus for secondary cell wall differentiation. As an experimental validation of the model, we show that an Arabidopsis mutant in the PGSIP1 gene encoding one of the Golgi localized candidate proteins has a highly decreased content of glucuronic acid in secondary cell walls and substantially reduced xylan glucuronosyltransferase activity.
Affiliation(s)
- Ai Oikawa
- Feedstocks Division, Joint BioEnergy Institute, Emeryville, California, USA
|
20
|
Sun W, Jin J, Xu R, Hu W, Szulc ZM, Bielawski J, Obeid LM, Mao C. Substrate specificity, membrane topology, and activity regulation of human alkaline ceramidase 2 (ACER2). J Biol Chem 2010; 285:8995-9007. [PMID: 20089856 PMCID: PMC2838321 DOI: 10.1074/jbc.m109.069203] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Received: 09/22/2009] [Revised: 01/14/2010] [Indexed: 11/06/2022] Open
Abstract
Human alkaline ceramidase 2 (ACER2) plays an important role in cellular responses by regulating the hydrolysis of ceramides in cells. Here we report its biochemical characterization, membrane topology, and activity regulation. Recombinant ACER2 was expressed in yeast mutant cells (Δypc1Δydc1) that lack endogenous ceramidase activity, and microsomes from ACER2-expressing yeast cells were used to biochemically characterize ACER2. ACER2 catalyzed the hydrolysis of various ceramides and followed Michaelis-Menten kinetics. ACER2 required Ca²⁺ for both its in vitro and cellular activities. ACER2 has 7 putative transmembrane domains, and its amino (N) and carboxyl (C) termini were found to be oriented in the lumen of the Golgi complex and cytosol, respectively. An ACER2 mutant (ACER2ΔN36) lacking the N-terminal tail (the first 36 amino acid residues) exhibited undetectable activity and was mislocalized to the endoplasmic reticulum, suggesting that the N-terminal tail is necessary for both ACER2 activity and Golgi localization. An ACER2 mutant (ACER2ΔN13) lacking the first 13 residues was also mislocalized to the endoplasmic reticulum although it retained ceramidase activity. Overexpression of ACER2 or ACER2ΔN13, but not ACER2ΔN36, increased the release of sphingosine 1-phosphate from cells, suggesting that its mislocalization does not affect the ability of ACER2 to regulate sphingosine 1-phosphate secretion. However, overexpression of ACER2, but not ACER2ΔN13 or ACER2ΔN36, inhibited the glycosylation of integrin β1 subunit and Lamp1, suggesting that its mistargeting abolishes the ability of ACER2 to regulate protein glycosylation. These data suggest that ACER2 has broad substrate specificity and requires Ca²⁺ for its activity and that ACER2 has a cytosolic C terminus and a luminal N terminus, which are essential for its activity, correct cellular localization, and regulation of protein glycosylation.
Affiliation(s)
- Wei Sun
- From the Departments of Medicine and
- Wei Hu
- From the Departments of Medicine and
- Lina M. Obeid
- From the Departments of Medicine and
- Biochemistry and Molecular Biology and
- the Ralph H. Johnson Veterans Affairs Hospital, Medical University of South Carolina, Charleston, South Carolina 29425
- Cungui Mao
- From the Departments of Medicine and
- Biochemistry and Molecular Biology and
|
21
|
Mei S, Fei W. Amino acid classification based spectrum kernel fusion for protein subnuclear localization. BMC Bioinformatics 2010; 11 Suppl 1:S17. [PMID: 20122188 PMCID: PMC3009488 DOI: 10.1186/1471-2105-11-s1-s17] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Prediction of protein localization in subnuclear organelles is more challenging than general protein subcellular localization. There are only three computational models for protein subnuclear localization thus far, to the best of our knowledge. Two models were based on protein primary sequence only. The first model assumed a homogeneous amino acid substitution pattern across all protein sequence residue sites and used BLOSUM62 to encode the k-mers of a protein sequence. An ensemble of SVMs based on different k-mers drew the final conclusion, achieving 50% overall accuracy. The simplified assumption did not exploit the protein sequence profile and ignored the fact of heterogeneous amino acid substitution patterns across sites. The second model derived the PsePSSM feature representation from the protein sequence by simply averaging the profile PSSM and combined the PseAA feature representation to construct a kNN ensemble classifier, Nuc-PLoc, achieving 67.4% overall accuracy. The two models based on protein primary sequence only both achieved relatively poor predictive performance. The third model required that GO annotations be available, thus restricting the model's applicability. METHODS In this paper, we use only the amino acid information of the protein sequence, without any other information, to design a widely applicable model for protein subnuclear localization. We use the K-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding the window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physicochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window sizes. Thus, (I) window expansion can capture more contextual information and cover size-varying motifs; (II) various amino acid classifications can exploit multi-aspect biological information from the protein sequence.
Finally, we combine all the spectrum kernels by simple addition into one single kernel called SpectrumKernel+ for protein subnuclear localization. RESULTS We conduct the performance evaluation experiments on two benchmark datasets: Lei and Nuc-PLoc. Experimental results show that SpectrumKernel+ achieves substantial performance improvement against the previous model Nuc-PLoc, with overall accuracy 83.47% against 67.4%; and 71.23% against 50% for the Lei SVM Ensemble and 66.50% for the Lei GO SVM Ensemble. CONCLUSION The method SpectrumKernel+ can exploit rich amino acid information of the protein sequence by embedding into implicit size-varying motifs the multi-aspect amino acid physicochemical properties captured by amino acid classification approaches. The kernels derived from diverse amino acid classification approaches and different sizes of k-mer are summed together for data integration. Experiments show that the method SpectrumKernel+ significantly outperforms the existing models for protein subnuclear localization.
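The construction described in this abstract (a k-spectrum kernel per amino acid classification and window size, then simple addition) can be sketched as follows. This is a minimal illustration under stated assumptions: the two groupings in `CLASSIFICATIONS` and the window sizes `(1, 2, 3)` are hypothetical choices made for the example and are not the classifications or sizes used in the paper.

```python
from collections import Counter

# Hypothetical amino acid classifications (illustrative groupings only).
CLASSIFICATIONS = {
    # Identity mapping: the plain 20-letter alphabet.
    "identity": {aa: aa for aa in "ACDEFGHIKLMNPQRSTVWY"},
    # A coarse three-class grouping assumed for this sketch.
    "hydropathy": {
        **{aa: "h" for aa in "AVLIMFWC"},  # hydrophobic
        **{aa: "p" for aa in "STNQYGPH"},  # polar or neutral
        **{aa: "c" for aa in "DEKR"},      # charged
    },
}

def kmer_counts(seq, k, mapping):
    """Count k-mers of `seq` after replacing each residue by its class label."""
    reduced = "".join(mapping.get(aa, "x") for aa in seq)  # "x" = unknown residue
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))

def spectrum_kernel(a, b, k, mapping):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k, mapping), kmer_counts(b, k, mapping)
    return sum(n * cb[kmer] for kmer, n in ca.items())

def summed_kernel(a, b, ks=(1, 2, 3)):
    """Add the spectrum kernels over every classification and window size."""
    return sum(
        spectrum_kernel(a, b, k, mapping)
        for mapping in CLASSIFICATIONS.values()
        for k in ks
    )
```

Under the coarse grouping, sequences such as `"AV"` and `"LI"` share the 2-mer `hh` even though they share no 2-mer in the plain alphabet, which is the kind of extra signal a reduced alphabet is meant to expose.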
Affiliation(s)
- Suyu Mei
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China
- Wang Fei
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China
|