1
Ghazikhani H, Butler G. Exploiting protein language models for the precise classification of ion channels and ion transporters. Proteins 2024; 92:998-1055. [PMID: 38656743 DOI: 10.1002/prot.26694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 03/26/2024] [Accepted: 04/08/2024] [Indexed: 04/26/2024]
Abstract
This study introduces TooT-PLM-ionCT, a comprehensive framework that consolidates three distinct systems, each tailored for one of the following tasks: distinguishing ion channels (ICs) from membrane proteins (MPs), separating ion transporters (ITs) from MPs, and differentiating ICs from ITs. Drawing upon the strengths of six Protein Language Models (PLMs), including ProtBERT, ProtBERT-BFD, ESM-1b, ESM-2 (650M parameters), and ESM-2 (15B parameters), TooT-PLM-ionCT employs a combination of traditional classifiers and deep learning models for protein classification. Originally validated on an existing dataset by previous researchers, our systems demonstrated superior performance in identifying ITs from MPs and distinguishing ICs from ITs, with the IC-MP discrimination achieving state-of-the-art results. In light of recommendations for additional validation, we introduced a new dataset, significantly enhancing the robustness and generalization of our models across bioinformatics challenges. This new evaluation underscored the effectiveness of TooT-PLM-ionCT in adapting to novel data while maintaining high classification accuracy. Furthermore, this study explores critical factors affecting classification accuracy, such as dataset balancing, the impact of using frozen versus fine-tuned PLM representations, and the difference between half and full precision in floating-point computations. To facilitate broader application and accessibility, a web server (https://tootsuite.encs.concordia.ca/service/TooT-PLM-ionCT) has been developed, allowing users to evaluate unknown protein sequences through our specialized systems for the IC-MP, IT-MP, and IC-IT classification tasks.
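The abstract lists half versus full precision as one factor affecting accuracy. As a rough illustration only, the sketch below rounds every intermediate value of a dot product to IEEE 754 half precision (float16) or single precision (float32) using the standard library. The vectors, the helper names (`to_half`, `to_single`, `dot`), and the reading of "full precision" as float32 are assumptions made for this example, not details from the paper.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (float32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(u, v, cast):
    """Dot product in which inputs and every intermediate result are rounded by `cast`."""
    total = cast(0.0)
    for a, b in zip(u, v):
        total = cast(total + cast(cast(a) * cast(b)))
    return total


# Hypothetical embedding and weight vectors (illustrative values, not from the paper).
embedding = [0.1 * (i % 7) - 0.3 for i in range(256)]
weights = [0.01 * ((i * 3) % 11) - 0.05 for i in range(256)]

score_half = dot(embedding, weights, to_half)
score_single = dot(embedding, weights, to_single)
print(score_half, score_single, abs(score_half - score_single))
```

Any gap between the two scores comes purely from rounding, which is the kind of effect a half versus full precision comparison would need to separate from real modelling differences.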
Affiliation(s)
- Hamed Ghazikhani
- Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
- Gregory Butler
- Centre for Structural and Functional Genomics, Concordia University, Montréal, Québec, Canada
2
Le VT, Malik MS, Tseng YH, Lee YC, Huang CI, Ou YY. DeepPLM_mCNN: An approach for enhancing ion channel and ion transporter recognition by multi-window CNN based on features from pre-trained language models. Comput Biol Chem 2024; 110:108055. [PMID: 38555810 DOI: 10.1016/j.compbiolchem.2024.108055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2023] [Revised: 02/28/2024] [Accepted: 03/19/2024] [Indexed: 04/02/2024]
Abstract
Accurate classification of membrane proteins like ion channels and transporters is critical for elucidating cellular processes and drug development. We present DeepPLM_mCNN, a novel framework combining Pretrained Language Models (PLMs) and multi-window convolutional neural networks (mCNNs) for effective classification of membrane proteins into ion channels and ion transporters. Our approach extracts informative features from protein sequences by utilizing various PLMs, including TAPE, ProtT5_XL_U50, ESM-1b, ESM-2_480, and ESM-2_1280. These PLM-derived features are then input into an mCNN architecture to learn conserved motifs important for classification. When evaluated on ion transporters, our best performing model, which uses ProtT5, achieved 90% sensitivity, 95.8% specificity, and 95.4% overall accuracy. For ion channels, we obtained 88.3% sensitivity, 95.7% specificity, and 95.2% overall accuracy using ESM-1b features. Our proposed DeepPLM_mCNN framework demonstrates significant improvements over previous methods on unseen test data. This study illustrates the potential of combining PLMs and deep learning for accurate computational identification of membrane proteins from sequence data alone. Our findings have important implications for membrane protein research and drug development targeting ion channels and transporters. The data and source code in this study are publicly available at the following link: https://github.com/s1129108/DeepPLM_mCNN.
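The sensitivity, specificity, and accuracy reported above are all derived from a binary confusion matrix. The sketch below shows the standard definitions, plus the observation that accuracy is a class-size-weighted average of sensitivity and specificity. The function names and the counts are hypothetical and chosen only for illustration.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def implied_negative_to_positive_ratio(sensitivity: float,
                                       specificity: float,
                                       accuracy: float) -> float:
    """Solve accuracy = (sens * P + spec * N) / (P + N) for N / P."""
    if specificity == accuracy:
        raise ValueError("ratio is undefined when specificity equals accuracy")
    return (accuracy - sensitivity) / (specificity - accuracy)


# Hypothetical counts: 10 positives and 100 negatives.
m = binary_metrics(tp=9, fn=1, tn=95, fp=5)
ratio = implied_negative_to_positive_ratio(
    m["sensitivity"], m["specificity"], m["accuracy"]
)
print(m, ratio)
```

Applied to the rounded ion transporter figures quoted above (0.90, 0.958, 0.954), the second function gives roughly 13.5 negatives per positive. The denominator there is only 0.004, so rounding in the published values makes this a very rough estimate at best.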
Affiliation(s)
- Van-The Le
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
- Muhammad-Shahid Malik
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; Department of Computer Science and Engineering, Karakoram International University, Pakistan
- Yi-Hsuan Tseng
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
- Yu-Cheng Lee
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
- Cheng-I Huang
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li, 32003, Taiwan.
3
Nguyen TTD, Chen S, Ho QT, Ou YY. Using multiple convolutional window scanning of convolutional neural network for an efficient prediction of ATP-binding sites in transport proteins. Proteins 2022; 90:1486-1492. [PMID: 35246878 DOI: 10.1002/prot.26329] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/04/2021] [Revised: 02/23/2022] [Accepted: 02/25/2022] [Indexed: 12/31/2022]
Abstract
Information from protein multiple sequence alignments has long been an important feature for inferring protein function from related sequences with known functions. It is also one of the underlying ideas of AlphaFold 2, a breakthrough model for predicting the three-dimensional structures of proteins from their primary sequences. Our study used multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of a convolutional neural network, a well-known deep-learning architecture with impressive results on image and image-like data. Specifically, we revisited the prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied convolutional filters with multiple window sizes to position-specific scoring matrices to capture as much useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one-max pooling layer before going to the next layer. We assumed that this would help retain the most conserved motifs, which carry discriminative information for prediction. Our experimental results show that a convolutional neural network with relatively few convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining different hyper-parameters. Our experimental results showed that our models were superior to the traditional use of convolutional neural networks on the same datasets, as well as to other machine-learning classification algorithms.
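The combination of multiple convolution window sizes with one-max pooling described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' architecture: the toy profile, the kernels, the ReLU activation, and the function names are all hypothetical, and a real model would learn its kernels during training.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = sequence positions, columns = features (e.g. PSSM scores)


def conv1d_valid(profile: Matrix, kernel: Matrix) -> List[float]:
    """Slide one kernel (window x features) along the sequence axis and apply ReLU."""
    window = len(kernel)
    feature_map = []
    for start in range(len(profile) - window + 1):
        total = 0.0
        for offset in range(window):
            row, weights = profile[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(row, weights))
        feature_map.append(max(total, 0.0))
    return feature_map


def multi_window_features(profile: Matrix,
                          kernels_by_window: Dict[int, List[Matrix]]) -> List[float]:
    """Apply kernels of several window sizes and keep only the maximum of each feature map."""
    features = []
    for _window, kernels in sorted(kernels_by_window.items()):
        for kernel in kernels:
            # One-max pooling: a single strongest response per kernel.
            # Assumes the sequence is at least as long as the largest window.
            features.append(max(conv1d_valid(profile, kernel)))
    return features


# Toy 5-position, 2-feature profile and two hand-written kernels (illustrative only).
profile = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
kernels = {
    2: [[[1, 0], [0, 1]]],
    3: [[[1, 1], [1, 1], [1, 1]]],
}
print(multi_window_features(profile, kernels))
```

Because each kernel contributes exactly one value, the output length depends only on the number of kernels and not on the sequence length, which is what lets a fixed-size classifier sit on top of variable-length proteins.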
Affiliation(s)
- Syun Chen
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
4
Nguyen TTD, Ho QT, Le NQK, Phan VD, Ou YY. Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1235-1244. [PMID: 32750894 DOI: 10.1109/tcbb.2020.3010975] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Deciphering electron transport proteins is therefore necessary. Identifying these proteins with high performance depends directly on the choice of feature extraction methods and machine learning algorithm. In this study, protein sequences were treated as natural language sentences composed of words. The proposed word embedding-based feature sets, built on word embedding modulation and protein motif frequencies, were useful for feature selection. Five word embedding types and a variety of conjoint features were examined for this feature selection. The support vector machine algorithm was then employed to perform classification. In 5-fold cross-validation, the average accuracy, specificity, sensitivity, and MCC all surpass 0.95. In the independent test, these metrics are 96.82%, 97.16%, 95.76%, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method performs better on all metrics, indicating its effectiveness in identifying electron transport proteins. Furthermore, this study offers insights into the applicability of various word embeddings for understanding the surveyed sequences.
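Treating a protein sequence as a sentence of words requires some way of cutting it into tokens. The abstract does not say how words were defined, so the sketch below assumes overlapping k-mers, a common choice, and computes their relative frequencies as a simple stand-in for motif frequency features. The function names and the example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def kmer_words(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_frequencies(sequence: str, k: int = 3) -> Dict[str, float]:
    """Relative frequency of each k-mer; empty if the sequence is shorter than k."""
    words = kmer_words(sequence, k)
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


# Hypothetical short sequence, used only to show the tokenisation.
example = "MKVLAAMKV"
print(kmer_words(example))
print(kmer_frequencies(example))
```

The resulting word list could be fed to any word embedding method, and the frequency dictionary could serve as a sparse feature vector for a classifier such as an SVM.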
5
Ashrafuzzaman M. Artificial Intelligence, Machine Learning and Deep Learning in Ion Channel Bioinformatics. Membranes 2021; 11:672. [PMID: 34564489 PMCID: PMC8467682 DOI: 10.3390/membranes11090672] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/03/2021] [Revised: 08/20/2021] [Accepted: 08/30/2021] [Indexed: 11/28/2022]
Abstract
Ion channels are linked to important cellular processes. For more than half a century, we have been learning various structural and functional aspects of ion channels using biological, physiological, biochemical, and biophysical principles and techniques. Recently, bioinformaticians and biophysicists with the necessary expertise and interest in computer science techniques, including versatile algorithms, have started covering a multitude of physiological aspects, especially the evolution, mutations, and genomics of functional channels and channel subunits. In these focused research areas, artificial intelligence (AI), machine learning (ML), and deep learning (DL) algorithms and associated models have become very popular. With the help of available articles and information, this review provides an introduction to this novel research trend. Ion channels are usually understood from structural and functional perspectives, gating mechanisms, transport properties, channel protein mutations, etc. Focused research on ion channels over many decades has accumulated a large amount of data, which may be used in a specialized scientific manner to quickly resolve specific aspects of channels. AI, ML, and DL techniques and models may serve as helpful tools. This review aims to explain how we may use bioinformatics techniques, and thus draw a few lines across the avenue to let the ion channel features appear clearer.
Affiliation(s)
- Md Ashrafuzzaman
- Department of Biochemistry, College of Science, King Saud University, Riyadh 11451, Saudi Arabia
6
ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput Biol Chem 2021; 93:107537. [PMID: 34217007 DOI: 10.1016/j.compbiolchem.2021.107537] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/14/2020] [Revised: 05/09/2021] [Accepted: 06/26/2021] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Primary and secondary active transport are two types of active transport that use energy to move substances. Active transport mechanisms use proteins to assist in transport and play essential roles in regulating the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT, a deep learning language representation model developed by Google, is a powerful model for transfer learning and one of the highest performing pre-trained models for Natural Language Processing (NLP) tasks. The idea of transfer learning with a pre-trained BERT model is applied to extract fixed feature vectors from the hidden layers and learn contextual relations between amino acids in the protein sequence. Therefore, the contextualized word representations of proteins are introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in the context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence. RESULTS: The performance of the proposed method is evaluated using five-fold cross-validation and an independent test. The proposed method achieves an accuracy of 85.44%, 88.74%, and 92.84% for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach, using context information, can outperform other feature extraction methods, effectively classify the two types of active transport, and improve the overall performance.
7
Shah SMA, Taju SW, Dlamini BB, Ou YY. DeepSIRT: A deep neural network for identification of sirtuin targets and their subcellular localizations. Comput Biol Chem 2021; 93:107514. [PMID: 34058657 DOI: 10.1016/j.compbiolchem.2021.107514] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2021] [Accepted: 05/12/2021] [Indexed: 09/30/2022]
Abstract
Sirtuins are a family of proteins that play a key role in regulating a wide range of cellular processes including DNA regulation, metabolism, aging/longevity, cell survival, apoptosis, and stress resistance. Sirtuins are protein deacetylases and are included in the class III family of histone deacetylase enzymes (HDACs). The class III HDACs contain seven members of the sirtuin family, from SIRT1 to SIRT7. The seven members of the sirtuin family have various substrates and are present in nearly all subcellular localizations including the nucleus, cytoplasm, and mitochondria. In this study, a deep neural network approach using one-dimensional Convolutional Neural Networks (CNN) was proposed to build a prediction model that can accurately identify the outcome of the sirtuin protein by targeting their subcellular localizations. Therefore, the function and localization of sirtuin targets were analyzed and annotated to compartmentalize them into distinct subcellular localizations. We further reduced the sequence similarity between protein sequences, and three feature extraction methods were applied to the datasets. Finally, the proposed method was tested and compared with various machine-learning algorithms. The proposed method was validated on two independent datasets and showed an average of up to 85.77% sensitivity, 97.32% specificity, and 0.82 MCC for seven members of the sirtuin family of proteins.
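The MCC reported above is the Matthews correlation coefficient, which summarises all four confusion-matrix cells in one value between -1 and 1. A minimal sketch of the standard formula follows. Returning 0.0 when the denominator is zero is a common convention adopted here as an assumption, and the counts in the usage lines are hypothetical.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumption): treat an undefined MCC as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts covering perfect, inverted, and chance-level predictions.
print(mcc(tp=50, fp=0, tn=50, fn=0))
print(mcc(tp=0, fp=50, tn=0, fn=50))
print(mcc(tp=25, fp=25, tn=25, fn=25))
```

Unlike accuracy, MCC stays low when a classifier does well on the majority class only, which is why it is often reported alongside sensitivity and specificity on imbalanced protein datasets.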
Affiliation(s)
- Syed Muazzam Ali Shah
- Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Semmy Wellem Taju
- Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Bongani Brian Dlamini
- Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Yu-Yen Ou
- Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.
8
Ho QT, Nguyen TTD, Khanh Le NQ, Ou YY. FAD-BERT: Improved prediction of FAD binding sites using pre-training of deep bidirectional transformers. Comput Biol Med 2021; 131:104258. [PMID: 33601085 DOI: 10.1016/j.compbiomed.2021.104258] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/15/2020] [Revised: 01/16/2021] [Accepted: 02/03/2021] [Indexed: 02/07/2023]
Abstract
The electron transport chain is a series of protein complexes embedded in the process of cellular respiration, which is an important process to transfer electrons and other macromolecules throughout the cell. Identifying Flavin Adenine Dinucleotide (FAD) binding sites in the electron transport chain is vital since it helps biological researchers precisely understand how electrons are produced and transported in cells. This study distills and analyzes the contextualized word embedding from pre-trained BERT models to explore similarities in natural language and protein sequences. We therefore propose a new approach based on pre-training of Bidirectional Encoder Representations from Transformers (BERT), Position-specific Scoring Matrix (PSSM) profiles, and the Amino Acid Index database (AAIndex) to predict FAD-binding sites in transport proteins recently found in nature. Our proposed approach achieves 85.14% accuracy and improves accuracy by 11%, with a Matthews correlation coefficient of 0.39, compared to the previous method on the same independent set. We also deploy a web server that identifies FAD-binding sites in electron transporters, available for academics at http://140.138.155.216/fadbert/.
Affiliation(s)
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; College of Information & Communication Technology, Can Tho University, Viet Nam
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.
9
Nguyen T, Le N, Ho Q, Phan D, Ou Y. Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain. Mol Inform 2020; 39:e2000033. [DOI: 10.1002/minf.202000033] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/03/2020] [Accepted: 06/26/2020] [Indexed: 11/10/2022]
Affiliation(s)
- Nguyen-Quoc-Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
- Dinh-Van Phan
- University of Economics, University of Danang, 41 Leduan St, Danang City, 550000, Vietnam
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan