1
Anteghini M, Santos VAMD, Saccenti E. PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates. J Cell Biochem 2023; 124:1803-1824. [PMID: 37877557 DOI: 10.1002/jcb.30490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 10/26/2023]
Abstract
The physiology of every living cell is regulated at some level by transporter proteins, which constitute a relevant portion of membrane-bound proteins and are involved in the movement of ions, small molecules, and macromolecules across bio-membranes. The importance of transporter proteins is unquestionable. The prediction and study of previously unknown transporters can lead to the discovery of new biological pathways, drugs, and treatments. Here we present PortPred, a tool to accurately identify transporter proteins and their substrates starting from the protein amino acid sequence. PortPred combines pre-trained deep learning-based protein embeddings with machine learning classification approaches and outperforms other state-of-the-art methods. In addition, we present a comparison of the most promising protein sequence embeddings (UniRep, SeqVec, ProteinBERT, ESM-1b) and their performance on this specific task.
Affiliation(s)
- Marco Anteghini
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
  - Department of Visual and Data-Centric Computing, Zuse Institute Berlin, Berlin, Germany
- Vitor Ap Martins Dos Santos
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Bioprocess Engineering, Wageningen University & Research, Wageningen WE, The Netherlands
- Edoardo Saccenti
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
2
Ghazikhani H, Butler G. Enhanced identification of membrane transport proteins: a hybrid approach combining ProtBERT-BFD and convolutional neural networks. J Integr Bioinform 2023; 0:jib-2022-0055. [PMID: 37497772 PMCID: PMC10389051 DOI: 10.1515/jib-2022-0055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Accepted: 06/21/2023] [Indexed: 07/28/2023] Open
Abstract
Transmembrane transport proteins (transporters) play a crucial role in the fundamental cellular processes of all organisms by facilitating the transport of hydrophilic substrates across hydrophobic membranes. Despite the availability of numerous membrane protein sequences, their structures and functions remain largely elusive. Recently, natural language processing (NLP) techniques have shown promise in the analysis of protein sequences. Bidirectional Encoder Representations from Transformers (BERT) is an NLP technique adapted for proteins to learn contextual embeddings of individual amino acids within a protein sequence. Our previous strategy, TooT-BERT-T, differentiated transporters from non-transporters by employing a logistic regression classifier with fine-tuned representations from ProtBERT-BFD. In this study, we expand upon this approach by utilizing representations from ProtBERT, ProtBERT-BFD, and MembraneBERT in combination with classical classifiers. Additionally, we introduce TooT-BERT-CNN-T, a novel method that fine-tunes ProtBERT-BFD and discriminates transporters using a Convolutional Neural Network (CNN). Our experimental results reveal that the CNN surpasses traditional classifiers in discriminating transporters from non-transporters, achieving a Matthews correlation coefficient (MCC) of 0.89 and an accuracy of 95.1% on the independent test set. Compared to TooT-BERT-T, this represents an improvement of 0.03 in MCC and 1.11 percentage points in accuracy.
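The two metrics reported in this abstract, accuracy and MCC, can both be computed from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function names and the example counts are illustrative assumptions.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention
    when one row or column of the confusion matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to illustrate the calculation.
print(accuracy(tp=45, tn=45, fp=5, fn=5))
print(mcc(tp=45, tn=45, fp=5, fn=5))
```

Unlike accuracy, MCC uses all four cells symmetrically, which is why it is often reported alongside accuracy when the transporter and non-transporter classes are imbalanced.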
Affiliation(s)
- Hamed Ghazikhani
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
- Gregory Butler
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
3
Wang Q, Xu T, Xu K, Lu Z, Ying J. Prediction of transport proteins from sequence information with the deep learning approach. Comput Biol Med 2023; 160:106974. [PMID: 37167658 DOI: 10.1016/j.compbiomed.2023.106974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Revised: 04/17/2023] [Accepted: 04/22/2023] [Indexed: 05/13/2023]
Abstract
Transport proteins (TPs) are vital to the growth and life of all living things and are particularly relevant in the fields of microbial pathogenesis and drug resistance of tumor cells. Accurately identifying potential TPs remains an important challenge for the advancement of functional genomics. This study aimed to develop a tool for predicting TPs using a deep learning approach. Here, we propose DeepTP, a convolutional neural network model that uses parallel subnetworks to extract features from protein sequences and fully connected layers for TP classification. To train the model and evaluate its performance, datasets were collected from the UniProtKB/Swiss-Prot database. The test results revealed that the proposed model could successfully identify TPs, with an AUCROC, accuracy, F-value, and Matthews correlation coefficient of 0.9719, 0.9513, 0.8982, and 0.8679, respectively. In further comparisons, DeepTP achieved better performance than other commonly used methods. Analysis of the gradients of the prediction score with respect to the input suggested that DeepTP makes predictions by recognizing the functional domains of TPs. We anticipate that DeepTP will serve as a useful tool for predicting TPs in large-scale genome projects, which will facilitate the discovery of novel TPs.
Affiliation(s)
- Qian Wang
  - Department of Clinical Laboratory, Wenzhou People's Hospital, The Third Affiliated Hospital of Shanghai University, The Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China
- Teng Xu
  - Institute of Translational Medicine, Baotou Central Hospital, Baotou, China
- Kai Xu
  - Department of Clinical Laboratory, Wenzhou People's Hospital, The Third Affiliated Hospital of Shanghai University, The Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China
- Zhongqiu Lu
  - Wenzhou Key Laboratory of Emergency, Critical Care, and Disaster Medicine, Department of Emergency, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Jianchao Ying
  - Central Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
  - Wenzhou Key Laboratory of Emergency, Critical Care, and Disaster Medicine, Department of Emergency, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
4
Alkhadrawi AM, Wang Y, Li C. In-silico screening of potential target transporters for glycyrrhetinic acid (GA) via deep learning prediction of drug-target interactions. Biochem Eng J 2022. [DOI: 10.1016/j.bej.2022.108375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
5
Nguyen TTD, Le NQK, Tran TA, Pham DM, Ou YY. Incorporating a transfer learning technique with amino acid embeddings to efficiently predict N-linked glycosylation sites in ion channels. Comput Biol Med 2021; 130:104212. [PMID: 33454535 DOI: 10.1016/j.compbiomed.2021.104212] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/30/2020] [Revised: 12/21/2020] [Accepted: 01/04/2021] [Indexed: 11/27/2022]
Abstract
Glycosylation is a dynamic enzymatic process that attaches glycan to proteins or other organic molecules such as lipoproteins. Research has shown that such a process in ion channel proteins plays a fundamental role in modulating ion channel functions. This study used a computational method to predict N-linked glycosylation sites, the most common type, in ion channel proteins. From segments of ion channel proteins centered around N-linked glycosylation sites, the amino acid embedding vectors of each residue were concatenated to create features for prediction. We experimented with two different models for converting amino acids to their corresponding embeddings: one was fed with ion channel sequences and the other with a large dataset composed of more than one million protein sequences. The latter model stemmed from the idea of transfer learning technique and emerged as a more efficient feature extractor. Our best model was obtained from this transfer learning approach and a hyperparameter tuning process with a random search on 5-fold cross-validation data. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. Corresponding scores on an independent test were 92.9%, 92.2%, 99%, and 0.717. These results outperform the position-specific scoring matrix features that are predominantly employed in post-translational modification site predictions. Furthermore, compared to N-GlyDE, GlycoEP, SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on the above 4 metrics, thus further demonstrating the efficiency of our approach.
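The first step described in this abstract, cutting a segment of the sequence centered on a candidate site, might be sketched as follows. This is an illustrative sketch only: the flank size, the padding character `X`, the function name, and the toy sequence are assumptions, since the abstract does not state the window length or how sequence ends were handled.

```python
def centered_segment(seq: str, site: int, flank: int, pad: str = "X") -> str:
    """Return the residue at `site` (0-based) with `flank` residues on each side.

    Positions that fall outside the sequence are filled with `pad`,
    so the result always has length 2 * flank + 1.
    """
    if not 0 <= site < len(seq):
        raise IndexError("site is outside the sequence")
    left = seq[max(0, site - flank):site]
    right = seq[site + 1:site + 1 + flank]
    left_padding = pad * (flank - len(left))
    right_padding = pad * (flank - len(right))
    return left_padding + left + seq[site] + right + right_padding


# Toy sequence (hypothetical); the asparagine (N) at index 2 is the candidate site.
print(centered_segment("MKNASTL", site=2, flank=2))
```

Each residue of such a segment would then be mapped to its embedding vector and the vectors concatenated, as the abstract describes; the embedding model itself is outside the scope of this sketch.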
Affiliation(s)
- Nguyen-Quoc-Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 106, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 106, Taiwan
- Dinh-Minh Pham
  - Institute of Biotechnology, Vietnam Academy of Science and Technology, Hanoi, Viet Nam
- Yu-Yen Ou
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
6
Alballa M, Butler G. TooT-T: discrimination of transport proteins from non-transport proteins. BMC Bioinformatics 2020; 21:25. [PMID: 32321420 PMCID: PMC7178945 DOI: 10.1186/s12859-019-3311-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/25/2019] [Accepted: 12/09/2019] [Indexed: 11/10/2022] Open
Abstract
Background: Membrane transport proteins (transporters) play an essential role in every living cell by transporting hydrophilic molecules across the hydrophobic membranes. While the sequences of many membrane proteins are known, their structure and function are still not well characterized and understood, owing to the immense effort needed to characterize them. Therefore, there is a need for advanced computational techniques that use sequence information alone to distinguish membrane transporter proteins; these can then be used to direct new experiments and give a hint about the function of a protein. Results: This work proposes an ensemble classifier, TooT-T, that is trained to optimally combine the predictions from homology annotation transfer and machine-learning methods to determine the final prediction. Experimental results obtained by cross-validation and independent testing show that combining the two approaches is more beneficial than employing only one. Conclusion: The proposed model outperforms all of the state-of-the-art methods that rely on the protein sequence alone, with respect to accuracy and MCC. TooT-T achieved an overall accuracy of 90.07% and 92.22% and an MCC of 0.80 and 0.82 with the training and independent datasets, respectively.
Affiliation(s)
- Munira Alballa
  - Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
- Gregory Butler
  - Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
  - Centre for Structural and Functional Genomics, Concordia University, Montréal, Québec, 24105, Canada
7
Alballa M, Aplop F, Butler G. TranCEP: Predicting the substrate class of transmembrane transport proteins using compositional, evolutionary, and positional information. PLoS One 2020; 15:e0227683. [PMID: 31935244 PMCID: PMC6959595 DOI: 10.1371/journal.pone.0227683] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/12/2019] [Accepted: 12/26/2019] [Indexed: 11/24/2022] Open
Abstract
Transporters mediate the movement of compounds across the membranes that separate the cell from its environment and across the inner membranes surrounding cellular compartments. It is estimated that one third of a proteome consists of membrane proteins, and many of these are transport proteins. Given the increase in the number of genomes being sequenced, there is a need for computational tools that predict the substrates that are transported by the transmembrane transport proteins. In this paper, we present TranCEP, a predictor of the type of substrate transported by a transmembrane transport protein. TranCEP combines the traditional use of the amino acid composition of the protein, with evolutionary information captured in a multiple sequence alignment (MSA), and restriction to important positions of the alignment that play a role in determining the specificity of the protein. Our experimental results show that TranCEP significantly outperforms the state-of-the-art predictors. The results quantify the contribution made by each type of information used.
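The compositional component mentioned in this abstract, amino acid composition, is commonly computed as the relative frequency of each of the 20 standard residues. The sketch below shows only that plain composition under stated assumptions (the alphabet ordering, the function name, and the choice to skip non-standard characters are illustrative); it does not reproduce the MSA-based evolutionary information or the positional filtering that TranCEP adds.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq: str) -> list[float]:
    """Return a 20-dimensional vector of relative residue frequencies.

    Characters outside the standard alphabet (e.g. X, B, Z, gaps) are ignored.
    An empty or fully non-standard sequence yields a zero vector.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence (hypothetical): two alanines and one cysteine.
composition = amino_acid_composition("AAC")
print(dict(zip(AMINO_ACIDS, composition)))
```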
Affiliation(s)
- Munira Alballa
  - Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Faizah Aplop
  - School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, Malaysia
- Gregory Butler
  - Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
  - Centre for Structural and Functional Genomics, Concordia University, Montréal, Québec, Canada
8
Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters. Anal Biochem 2019; 577:73-81. [PMID: 31022378 DOI: 10.1016/j.ab.2019.04.011] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 01/24/2019] [Revised: 04/02/2019] [Accepted: 04/12/2019] [Indexed: 02/08/2023]
Abstract
Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, making it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, the main driver of the breakthrough in natural language processing in recent years, to protein sequences of transporters. We represented each protein sequence by the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that make up a protein sequence to find the optimal length, which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research that enriches the field of applying natural language processing techniques in bioinformatics.
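The idea of treating a protein sequence as a sentence of "biological words" can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes overlapping k-mers as words and a frequency-weighted sum of word vectors as the sequence feature, and the `embeddings` dictionary is a hypothetical stand-in for vectors learned by a word embedding model.

```python
from collections import Counter


def biological_words(seq: str, k: int) -> list[str]:
    """Split a sequence into overlapping words of length k (an assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each biological word in the sequence."""
    words = biological_words(seq, k)
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def sequence_vector(seq: str, k: int,
                    embeddings: dict[str, list[float]], dim: int) -> list[float]:
    """Frequency-weighted sum of word embeddings; unknown words are skipped."""
    vector = [0.0] * dim
    for word, freq in word_frequencies(seq, k).items():
        embedding = embeddings.get(word)
        if embedding is None:
            continue
        for j in range(dim):
            vector[j] += freq * embedding[j]
    return vector


# Toy two-dimensional embeddings (hypothetical values, for illustration only).
toy_embeddings = {"MK": [3.0, 0.0], "KM": [0.0, 3.0]}
print(biological_words("MKTAY", 3))
print(sequence_vector("MKMK", k=2, embeddings=toy_embeddings, dim=2))
```

Varying `k` changes the vocabulary size and the sparsity of the word counts, which is the trade-off the abstract explores when searching for the most discriminative word length.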
9
Fuertes MA, Rodrigo JR, Alonso C. A Method for the Annotation of Functional Similarities of Coding DNA Sequences: the Case of a Populated Cluster of Transmembrane Proteins. J Mol Evol 2016; 84:29-38. [PMID: 27812751 DOI: 10.1007/s00239-016-9763-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/27/2015] [Accepted: 10/25/2016] [Indexed: 11/30/2022]
Abstract
Analysis of a large number of human and mouse genes coding for a populated cluster of transmembrane proteins revealed that some of the genes vary significantly in their primary nucleotide sequence both between and within species. Despite that divergence, and given that all these genes share a common parental function, we asked whether at the DNA level they have some kind of common compositional structure that is not evident from analysis of their primary nucleotide sequence. To reveal the existence of gene clusters not based on primary sequence relationships, we analyzed 13574 human and 14047 mouse genes using the composon-clustering methodology. The data show that most of the genes from each sample are distributed in 18 clusters, with particular human and mouse clusters sharing common compositional features. In addition, between particular human and mouse clusters with similar composon profiles, large variations in gene population were detected, indicating that a significant number of orthologs between the two species differ in compositional features. A gene cluster containing exclusively genes coding for transmembrane proteins, an important fraction of which belong to the Rhodopsin G-protein coupled receptor superfamily, was also detected. This indicates that even though some of them display low sequence similarity, all of them, in both species, share similar compositional features in terms of composons. We conclude that in this family of transmembrane proteins in general, and in the Rhodopsin G-protein coupled receptor superfamily in particular, composon clustering reveals a type of common compositional structure underlying the primary nucleotide sequence that is closely correlated to function.
Affiliation(s)
- Miguel Angel Fuertes
  - Centro de Biología Molecular "Severo Ochoa" (CSIC-UAM), Universidad Autónoma de Madrid, c/Nicolás Cabrera 1, 28049, Madrid, Spain
- Carlos Alonso
  - Centro de Biología Molecular "Severo Ochoa" (CSIC-UAM), Universidad Autónoma de Madrid, c/Nicolás Cabrera 1, 28049, Madrid, Spain