1
Ou YY, Ho QT, Chang HT. Recent advances in features generation for membrane protein sequences: From multiple sequence alignment to pre-trained language models. Proteomics 2023; 23:e2200494. [PMID: 37863817 DOI: 10.1002/pmic.202200494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 09/19/2023] [Accepted: 09/20/2023] [Indexed: 10/22/2023]
Abstract
Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher-order sequence patterns. Recently, multiple sequence alignment (MSA) and pre-trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA-based features generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with features generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.
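The traditional composition features mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard 20-letter amino acid alphabet and simply ignores any other symbol; the function names are hypothetical and are not taken from the review.

```python
from collections import Counter
from itertools import product

# Standard 20 amino acids, in alphabetical one-letter order (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return 20 fractions, one per amino acid type."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def pair_composition(seq):
    """Return 400 fractions, one per ordered pair of adjacent residues."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    if not valid:
        return [0.0] * len(keys)
    return [counts[k] / len(valid) for k in keys]
```

Both vectors have a fixed length regardless of sequence length, which is convenient for classical classifiers but discards residue order beyond adjacent pairs. This is the limitation in capturing higher-order patterns that the abstract points to.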
Affiliation(s)
- Yu-Yen Ou
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
  - Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li, Taiwan
- Quang-Thai Ho
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Heng-Ta Chang
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
2
Xia L, Xu L, Pan S, Niu D, Zhang B, Li Z. Drug-target binding affinity prediction using message passing neural network and self supervised learning. BMC Genomics 2023; 24:557. [PMID: 37730555 PMCID: PMC10510145 DOI: 10.1186/s12864-023-09664-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/20/2023] [Accepted: 09/09/2023] [Indexed: 09/22/2023] Open
Abstract
BACKGROUND Drug-target binding affinity (DTA) prediction is important for the rapid development of drug discovery. Compared to traditional methods, deep learning methods provide a new way for DTA prediction to achieve good performance without much knowledge of the biochemical background. However, there is still room for improvement in DTA prediction: (1) focusing only on atom-level information leads to an incomplete representation of the molecular graph; (2) self-supervised learning could be introduced for protein representation. RESULTS In this paper, a DTA prediction model using the deep learning method is proposed, which uses an undirected-CMPNN for molecular embedding and combines CPCProt and MLM models for protein embedding. An attention mechanism is introduced to discover the important part of the protein sequence. The proposed method is evaluated on the datasets Ki and Davis, and the model outperformed other deep learning methods. CONCLUSIONS The proposed model improves the performance of the DTA prediction, which provides a novel strategy for deep learning-based virtual screening methods.
Affiliation(s)
- Leiming Xia
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Lei Xu
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Shourun Pan
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Dongjiang Niu
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Beiyi Zhang
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Zhen Li
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
3
Wang W, Wu Q, Li C. iEnhancer-DCSA: identifying enhancers via dual-scale convolution and spatial attention. BMC Genomics 2023; 24:393. [PMID: 37442977 DOI: 10.1186/s12864-023-09468-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 06/20/2023] [Indexed: 07/15/2023] Open
Abstract
BACKGROUND Due to the dynamic nature of enhancers, identifying enhancers and their strength is a major bioinformatics challenge. With the development of deep learning, several models have facilitated enhancer detection in recent years. However, existing studies either neglect information from motifs of different lengths or treat the features at all spatial locations equally. How to use multi-scale motif information effectively while ignoring irrelevant information is a question worthy of serious consideration. In this paper, we propose an accurate and stable predictor, iEnhancer-DCSA, mainly composed of dual-scale fusion and spatial attention, which automatically extracts features of motifs of different lengths and selectively focuses on the important features. RESULTS Our experimental results demonstrate that iEnhancer-DCSA is remarkably superior to existing state-of-the-art methods on the test dataset. In particular, the accuracy and MCC of enhancer identification are improved by 3.45% and 9.41%, respectively. Meanwhile, the accuracy and MCC of enhancer classification are improved by 7.65% and 18.1%, respectively. Furthermore, we conduct ablation studies to demonstrate the effectiveness of dual-scale fusion and spatial attention. CONCLUSIONS iEnhancer-DCSA will be a valuable computational tool in identifying and classifying enhancers, especially for those not included in the training dataset.
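The two metrics reported in the results, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are illustrative and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative counts only: 40 true positives, 30 true negatives,
# 20 false positives, 10 false negatives.
print(accuracy(40, 30, 20, 10))
print(mcc(40, 30, 20, 10))
```

MCC uses all four cells of the confusion matrix, so it can move much more than accuracy when one class is predicted poorly.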
Affiliation(s)
- Wenjun Wang
  - School of Software Engineering, South China University of Technology, Guangzhou, China
  - School of Data Science and Information Engineering, Guizhou Minzu University, Guiyang, China
  - Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Guangzhou, China
- Qingyao Wu
  - School of Software Engineering, South China University of Technology, Guangzhou, China
  - Pazhou Lab, Guangzhou, China
  - Peng Cheng Laboratory, Shenzhen, China
- Chunshan Li
  - Department of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
4
Wei Y, Khalaf AT, Rui C, Abdul Kadir SY, Zainol J, Oglah Z. The Emergence of TRP Channels Interactome as a Potential Therapeutic Target in Pancreatic Ductal Adenocarcinoma. Biomedicines 2023; 11:1164. [PMID: 37189782 DOI: 10.3390/biomedicines11041164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2023] [Revised: 04/04/2023] [Accepted: 04/06/2023] [Indexed: 05/17/2023] Open
Abstract
Integral membrane proteins, known as Transient Receptor Potential (TRP) channels, are cellular sensors for various physical and chemical stimuli in the nervous system, respiratory airways, colon, pancreas, bladder, skin, cardiovascular system, and eyes. TRP channels with nine subfamilies are classified by sequence similarity, resulting in this superfamily's tremendous physiological functional diversity. Pancreatic Ductal Adenocarcinoma (PDAC) is the most common and aggressive form of pancreatic cancer. Moreover, the development of effective treatment methods for pancreatic cancer has been hindered by the lack of understanding of the pathogenesis, partly due to the difficulty in studying human tissue samples. However, scientific research on this topic has witnessed steady development in the past few years in understanding the molecular mechanisms that underlie TRP channel disturbance. This brief review summarizes current knowledge of the molecular role of TRP channels in the development and progression of pancreatic ductal carcinoma to identify potential therapeutic interventions.
Affiliation(s)
- Yuanyuan Wei
  - Basic Medical College, Chengdu University, Chengdu 610106, China
- Cao Rui
  - Basic Medical College, Chengdu University, Chengdu 610106, China
- Samiah Yasmin Abdul Kadir
  - Faculty of Medicine, Widad University College, BIM Point, Bandar Indera Mahkota, Kuantan 25200, Malaysia
- Jamaludin Zainol
  - Faculty of Medicine, Widad University College, BIM Point, Bandar Indera Mahkota, Kuantan 25200, Malaysia
- Zahraa Oglah
  - School of Science, Auckland University of Technology (AUT), 55 Wellesley Street, Auckland 1010, New Zealand
5
Thafar MA, Albaradei S, Uludag M, Alshahrani M, Gojobori T, Essack M, Gao X. OncoRTT: Predicting novel oncology-related therapeutic targets using BERT embeddings and omics features. Front Genet 2023; 14:1139626. [PMID: 37091791 PMCID: PMC10117673 DOI: 10.3389/fgene.2023.1139626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2023] [Accepted: 03/24/2023] [Indexed: 04/08/2023] Open
Abstract
Late-stage drug development failures are usually a consequence of ineffective targets. Thus, proper target identification is needed, which may be possible using computational approaches. This is because effective targets have disease-relevant biological functions, and omics data reveal the proteins involved in these functions. Also, properties that favor binding between drug and target are deducible from the protein’s amino acid sequence. In this work, we developed OncoRTT, a deep learning (DL)-based method for predicting novel therapeutic targets. OncoRTT is designed to reduce suboptimal target selection by identifying novel targets based on features of known effective targets using DL approaches. First, we created the “OncologyTT” datasets, which include genes/proteins associated with ten prevalent cancer types. Then, we generated three sets of features for all genes: omics features, the proteins’ amino-acid sequence BERT embeddings, and the integrated features to train and test the DL classifiers separately. The models achieved high prediction performances in terms of area under the curve (AUC), i.e., AUC greater than 0.88 for all cancer types, with a maximum of 0.95 for leukemia. Also, OncoRTT outperformed the state-of-the-art method using their data in five out of seven cancer types commonly assessed by both methods. Furthermore, OncoRTT predicts novel therapeutic targets using new test data related to the seven cancer types. We further corroborated these results with other validation evidence using the Open Targets Platform and a case study focused on the top-10 predicted therapeutic targets for lung cancer.
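The AUC figures quoted above can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, assuming binary labels coded as 1 and 0; the function name and the toy scores are illustrative and do not come from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (score 0.4) is ranked below one negative (0.8).
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based implementations give the same value more efficiently.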
Affiliation(s)
- Maha A. Thafar
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - College of Computers and Information Technology, Computer Science Department, Taif University, Taif, Saudi Arabia
- Somayah Albaradei
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Mahmut Uludag
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Mona Alshahrani
  - National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
- Takashi Gojobori
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Magbubah Essack
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - *Correspondence: Xin Gao; Magbubah Essack
- Xin Gao
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - *Correspondence: Xin Gao; Magbubah Essack
6
EMSI-BERT: Asymmetrical Entity-Mask Strategy and Symbol-Insert Structure for Drug–Drug Interaction Extraction Based on BERT. Symmetry (Basel) 2023. [DOI: 10.3390/sym15020398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023] Open
Abstract
Drug–drug interaction (DDI) extraction has seen growing usage of deep models, but their effectiveness has been restrained by limited domain-labeled data, a weak representation of co-occurring entities, and poor adaptation to downstream tasks. This paper proposes a novel EMSI-BERT method for drug–drug interaction extraction based on an asymmetrical Entity-Mask strategy and a Symbol-Insert structure. Firstly, the EMSI-BERT method utilizes the asymmetrical Entity-Mask strategy to address the weak representation of co-occurring entity information using the drug entity dictionary in the pre-training BERT task. Secondly, the EMSI-BERT method incorporates four symbols to distinguish different entity combinations of the same input sequence and utilizes the Symbol-Insert structure to address the weak adaptation to downstream tasks in the fine-tuning stage of DDI classification. The experimental results showed that EMSI-BERT for DDI extraction achieved a 0.82 F1-score on DDI-Extraction 2013, and it improved the performances of the multi-classification task of DDI extraction and the two-classification task of DDI detection. Compared with baseline Basic-BERT, the proposed pre-training BERT with the asymmetrical Entity-Mask strategy could obtain better effects in downstream tasks and effectively limit “Other” samples’ effects. The model visualization results illustrated that EMSI-BERT could extract semantic information at different levels and granularities in a continuous space.
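The idea of distinguishing different entity combinations within one sentence can be illustrated generically: a sentence with several drug mentions yields one classification instance per drug pair, with marker tokens inserted around the target pair. The sketch below is not the paper's Symbol-Insert structure; the marker strings, the function name, and the assumption that each entity is a single token are all hypothetical.

```python
from itertools import combinations


def mark_entity_pairs(tokens, entity_positions,
                      open_a="[E1]", close_a="[/E1]",
                      open_b="[E2]", close_b="[/E2]"):
    """Build one marked sentence per pair of entity token positions.

    Marker strings are placeholders chosen for illustration only.
    """
    instances = []
    for i, j in combinations(sorted(entity_positions), 2):
        marked = []
        for k, tok in enumerate(tokens):
            if k == i:
                marked += [open_a, tok, close_a]
            elif k == j:
                marked += [open_b, tok, close_b]
            else:
                marked.append(tok)
        instances.append(" ".join(marked))
    return instances


sentence = "aspirin may interact with warfarin and ibuprofen".split()
for instance in mark_entity_pairs(sentence, [0, 4, 6]):
    print(instance)
```

Three drug mentions give three instances from the same input sequence, each of which would receive its own interaction label.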
7
Charoenkwan P, Schaduangrat N, Hasan MM, Moni MA, Lió P, Shoombuatong W. Empirical comparison and analysis of machine learning-based predictors for predicting and analyzing of thermophilic proteins. EXCLI JOURNAL 2022; 21:554-570. [PMID: 35651661 PMCID: PMC9150013 DOI: 10.17179/excli2022-4723] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/28/2022] [Accepted: 02/21/2022] [Indexed: 12/15/2022]
Abstract
Thermophilic proteins (TPPs) are critical for basic research and in the food industry due to their ability to maintain a thermodynamically stable fold at extremely high temperatures. Thus, the expeditious identification of novel TPPs through computational models from protein sequences is very desirable. Over the last few decades, a number of computational methods, especially machine learning (ML)-based methods, for in silico prediction of TPPs have been developed. Therefore, it is desirable to revisit these methods and summarize their advantages and disadvantages in order to further develop new computational approaches to achieve more accurate and improved prediction of TPPs. With this goal in mind, we comprehensively investigate a large collection of fourteen state-of-the-art TPP predictors in terms of their dataset size, feature encoding schemes, feature selection strategies, ML algorithms, evaluation strategies and web server/software usability. To the best of our knowledge, this article represents the first comprehensive review on the development of ML-based methods for in silico prediction of TPPs. These TPP predictors can be classified into two groups according to the interpretability of the ML algorithms employed (i.e., computational black-box methods and computational white-box methods). We also conducted a comparative study of several currently available TPP predictors on two benchmark datasets. Finally, we provide future perspectives for the design and development of new computational models for TPP prediction. We hope that this comprehensive review will help researchers select the TPP predictor best suited to their purposes and provide useful perspectives for the development of more effective and accurate TPP predictors.
Affiliation(s)
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
- Nalini Schaduangrat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
- Md Mehedi Hasan
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Mohammad Ali Moni
  - School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, the University of Queensland, St Lucia, QLD 4072, Australia
- Pietro Lió
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
8
van Gils JHM, Gogishvili D, van Eck J, Bouwmeester R, van Dijk E, Abeln S. How sticky are our proteins? Quantifying hydrophobicity of the human proteome. BIOINFORMATICS ADVANCES 2022; 2:vbac002. [PMID: 36699344 PMCID: PMC9710682 DOI: 10.1093/bioadv/vbac002] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/30/2021] [Revised: 12/19/2021] [Accepted: 01/24/2022] [Indexed: 01/28/2023]
Abstract
Summary: Proteins tend to bury hydrophobic residues inside their core during the folding process to provide stability to the protein structure and to prevent aggregation. Nevertheless, proteins do expose some 'sticky' hydrophobic residues to the solvent. These residues can play an important functional role, e.g. in protein-protein and membrane interactions. Here, we first investigate how hydrophobic protein surfaces are by providing three measures for surface hydrophobicity: the total hydrophobic surface area (THSA), the relative hydrophobic surface area (RHSA) and, using our MolPatch method, the largest hydrophobic patch (LHP). Secondly, we analyze how difficult it is to predict these measures from sequence: by adapting solvent accessibility predictions from NetSurfP2.0, we obtain well-performing prediction methods for the THSA and RHSA, while predicting LHP is more challenging. Finally, we analyze implications of exposed hydrophobic surfaces: we show that hydrophobic proteins typically have low expression, suggesting cells avoid an overabundance of sticky proteins. Availability and implementation: The data underlying this article are available in GitHub at https://github.com/ibivu/hydrophobic_patches. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
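The first two surface measures can be sketched from per-residue solvent accessibility values. This is an illustration under stated assumptions, not the authors' implementation: the set of residues treated as hydrophobic is a placeholder choice, the relative measure is assumed to be the hydrophobic area divided by the total accessible area, and the function name is hypothetical.

```python
# Placeholder set of hydrophobic residues (assumption, not from the paper).
HYDROPHOBIC = set("ACFILMVW")


def hydrophobic_surface(seq, asa):
    """Return (total, relative) hydrophobic surface area.

    seq: one-letter amino acid sequence.
    asa: accessible surface area per residue, same length as seq.
    """
    if len(seq) != len(asa):
        raise ValueError("sequence and accessibility lists differ in length")
    total_area = sum(asa)
    thsa = sum(a for res, a in zip(seq.upper(), asa) if res in HYDROPHOBIC)
    rhsa = thsa / total_area if total_area else 0.0
    return thsa, rhsa


# Toy values: L and A are counted as hydrophobic, K and D are not.
print(hydrophobic_surface("LKAD", [10.0, 30.0, 20.0, 40.0]))
```

The largest hydrophobic patch is not covered here because it depends on three-dimensional neighborhood information rather than a per-residue sum.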
Affiliation(s)
- Juami Hermine Mariama van Gils
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands. To whom correspondence should be addressed.
- Dea Gogishvili
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands
- Jan van Eck
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands
- Robbin Bouwmeester
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands
- Erik van Dijk
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands
- Sanne Abeln
  - Computer Science Department, Center for Integrative Bioinformatics (IBIVU), Vrije Universiteit Amsterdam, 1081 HV Noord-Holland, The Netherlands. To whom correspondence should be addressed.
9
Taju SW, Shah SMA, Ou YY. Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders. Anal Biochem 2021; 633:114416. [PMID: 34656612 DOI: 10.1016/j.ab.2021.114416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2021] [Revised: 10/07/2021] [Accepted: 10/11/2021] [Indexed: 10/20/2022]
Abstract
Efflux proteins are transport proteins expressed in the plasma membrane, which are involved in the movement of unwanted toxic substances through specific efflux pumps. Several studies based on computational approaches have been proposed to predict transport proteins and thereby to understand the mechanism of the movement of ions across cell membranes. However, few methods have been developed to identify efflux proteins. This paper presents an approach based on the contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) with the Support Vector Machine (SVM) classifier. BERT is the most effective pre-trained language model that performs exceptionally well on several Natural Language Processing (NLP) tasks. Therefore, the contextualized representations from BERT were implemented to incorporate multiple interpretations of identical amino acids in the sequence. A dataset of efflux proteins with annotations was first established. The feature vectors were extracted by transferring protein data through the hidden layers of the pre-trained model. Our proposed method was trained on complete training datasets to identify efflux proteins and achieved accuracies of 94.15% and 87.13% in the independent tests on membrane and transport datasets, respectively. This study opens a research avenue for the implementation of contextualized word embeddings in Bioinformatics and Computational Biology.
Affiliation(s)
- Semmy Wellem Taju
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Syed Muazzam Ali Shah
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Yu-Yen Ou
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan