1
John C, Sahoo J, Sajan IK, Madhavan M, Mathew OK. CNN-BLSTM based deep learning framework for eukaryotic kinome classification: An explainability based approach. Comput Biol Chem 2024; 112:108169. [PMID: 39137619 DOI: 10.1016/j.compbiolchem.2024.108169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Revised: 05/08/2024] [Accepted: 08/03/2024] [Indexed: 08/15/2024]
Abstract
Classification of protein families from their sequences is an enduring task in proteomics and related studies. Numerous deep learning models have been developed to tackle this challenge, but because of their black-box character they still fall short in reliability. Here, we present a novel explainability pipeline that explains the pivotal decisions of a deep learning model in classifying the eukaryotic kinome. Based on a comparative experimental analysis of state-of-the-art deep learning algorithms, the best-performing model, CNN-BLSTM, was chosen to classify the eight eukaryotic kinase sequences into their corresponding families. As a substitute for the conventional class activation map-based interpretation of CNN-based models in the domain, we cascaded the Grad-CAM and Integrated Gradients (IG) explainability methods for improved and responsible results. To ensure the trustworthiness of the classifier, we masked the kinase domain traces identified by the explainability pipeline and observed a class-specific drop in F1-score from 0.96 to 0.76. In compliance with the Explainable AI paradigm, our results are promising and contribute to enhancing the trustworthiness of deep learning models for biological sequence-associated studies.
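The abstract mentions Integrated Gradients (IG) without describing the computation. The following is a minimal sketch of IG on a toy logistic model, not the authors' CNN-BLSTM pipeline; the model, the weights, and the function names (`model`, `model_grad`, `integrated_gradients`) are assumptions made for illustration, and the Grad-CAM stage is not shown. IG attributes a prediction to each input feature by averaging the gradient along a straight path from a baseline to the input and scaling by the input difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in classifier: logistic regression score in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the averaged gradient by how far each feature moved from baseline.
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]


# Hypothetical weights and input, chosen only for illustration.
weights, bias = [1.5, -2.0, 0.5], 0.1
sample, zero_baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(sample, zero_baseline, weights, bias)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline.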
Affiliation(s)
- Chinju John: Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Jayakrushna Sahoo: Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Irish K Sajan: Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Manu Madhavan: Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Oommen K Mathew: Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
2
Abrar M, Hussain D, Khan IA, Ullah F, Haq MA, Aleisa MA, Alenizi A, Bhushan S, Martha S. DeepSplice: a deep learning approach for accurate prediction of alternative splicing events in the human genome. Front Genet 2024; 15:1349546. [PMID: 38974384 PMCID: PMC11224287 DOI: 10.3389/fgene.2024.1349546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 05/21/2024] [Indexed: 07/09/2024] [Open access]
Abstract
Alternative splicing (AS) is a crucial process in genetic information processing that generates multiple mRNA molecules from a single gene, producing diverse proteins. Accurate prediction of AS events is essential for understanding various physiological aspects, including disease progression and prognosis. Machine learning (ML) techniques have been widely employed in bioinformatics to address this challenge. However, existing models have limitations in capturing AS events in the presence of mutations and achieving high prediction performance. To overcome these limitations, this research presents deep splicing code (DSC), a deep learning (DL)-based model for AS prediction. The proposed model aims to improve predictive ability by investigating state-of-the-art techniques in AS and developing a DL model specifically designed to predict AS events accurately. The performance of the DSC model is evaluated against existing techniques, revealing its potential to enhance the understanding and predictive power of DL algorithms in AS. It outperforms other models by achieving an average AUC score of 92%. The significance of this research lies in its contribution to identifying functional implications and potential therapeutic targets associated with AS, with applications in genomics, bioinformatics, and biomedical research. The findings of this study have the potential to advance the field and pave the way for more precise and reliable predictions of AS events, ultimately leading to a deeper understanding of genetic information processing and its impact on human physiology and disease.
Affiliation(s)
- Mohammad Abrar: Faculty of Computer Studies, Arab Open University, Muscat, Oman
- Didar Hussain: Department of Computer Science, Bacha Khan University Charsadda, Charsadda, Pakistan
- Izaz Ahmad Khan: Department of Computer Science, Bacha Khan University Charsadda, Charsadda, Pakistan
- Fasee Ullah: Computer and Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
- Mohd Anul Haq: Department of Computer Science, College of Computer and Information Sciences, Majmaah University, Al-Majmaah, Saudi Arabia
- Mohammed A. Aleisa: Department of Computer Science, College of Computer and Information Sciences, Majmaah University, Al-Majmaah, Saudi Arabia
- Abdullah Alenizi: Department of Information Technology, College of Computer and Information Sciences, Majmaah University, Al-Majmaah, Saudi Arabia
- Shashi Bhushan: Computer and Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
- Sheshikala Martha: School of Computer Science and Artificial Intelligence, SR University, Warangal, India
3
Alquran H, Al Fahoum A, Zyout A, Abu Qasmieh I. A comprehensive framework for advanced protein classification and function prediction using synergistic approaches: Integrating bispectral analysis, machine learning, and deep learning. PLoS One 2023; 18:e0295805. [PMID: 38096313 PMCID: PMC10721063 DOI: 10.1371/journal.pone.0295805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] [Open access]
Abstract
Proteins are fundamental components of diverse cellular systems and play crucial roles in a variety of disease processes. Consequently, it is crucial to comprehend their structure, function, and intricate interconnections. Classifying proteins into families or groups with comparable structural and functional characteristics is a key aspect of this comprehension. This classification is important for evolutionary research, predicting protein function, and identifying potential therapeutic targets. Sequence alignment and structure-based alignment are frequently ineffective techniques for identifying protein families. This study addresses the need for a more efficient and accurate technique for feature extraction and protein classification. The research proposes a novel method that integrates bispectrum characteristics, deep learning techniques, and machine learning algorithms to overcome the limitations of conventional methods. The proposed method represents protein sequences numerically, applies bispectrum analysis, uses several convolutional neural network topologies to extract features, and selects robust features to classify protein families. The goal is to outperform existing methods for identifying protein families, thereby enhancing classification metrics. The materials consist of numerous protein datasets, whereas the methods incorporate bispectrum characteristics and deep learning strategies. The results of this study demonstrate that the proposed method for identifying protein families is superior to conventional approaches. Significantly enhanced quality metrics demonstrated the efficacy of the combined bispectrum and deep learning approaches. These findings have the potential to advance the field of protein biology and facilitate pharmaceutical innovation.
In conclusion, this study presents a novel method that employs bispectrum characteristics and deep learning techniques to improve the precision and efficiency of protein family identification. The demonstrated advancements in classification metrics demonstrate this method's applicability to numerous scientific disciplines. This furthers our understanding of protein function and its implications for disease and treatment.
Affiliation(s)
- Hiam Alquran: Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Amjed Al Fahoum: Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Ala’a Zyout: Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Isam Abu Qasmieh: Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
4
Li M, Shi W, Zhang F, Zeng M, Li Y. A Deep Learning Framework for Predicting Protein Functions With Co-Occurrence of GO Terms. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:833-842. [PMID: 35476573 DOI: 10.1109/tcbb.2022.3170719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.
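The abstract names Fmax as an evaluation metric without defining it. Below is a minimal sketch of the protein-centric Fmax as it is commonly defined in CAFA-style function prediction benchmarks: for each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and Fmax is the best resulting F-measure. This is not DeepPFP-CO's evaluation code; the function name `fmax`, the input layout, and the GO term identifiers are assumptions for illustration, and propagation of annotations to ancestor GO terms is omitted.

```python
def fmax(true_terms, pred_scores, steps=100):
    """Protein-centric Fmax.

    true_terms:  list of sets of true term IDs, one set per protein.
    pred_scores: list of dicts mapping term ID -> score in [0, 1].
    """
    best = 0.0
    for k in range(1, steps + 1):
        threshold = k / steps
        precisions, recalls = [], []
        for truth, scores in zip(true_terms, pred_scores):
            predicted = {t for t, s in scores.items() if s >= threshold}
            hits = len(predicted & truth)
            # Precision only counts proteins that received a prediction.
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth) if truth else 0.0)
        if not precisions:
            continue  # no protein has any prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical annotations and scores for two proteins.
truth = [{"GO:1", "GO:2"}, {"GO:3"}]
scores = [
    {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.2},
    {"GO:3": 0.8, "GO:1": 0.3},
]
print(fmax(truth, scores))
```

Because the threshold is chosen after the fact, Fmax is an optimistic summary; it is usually reported alongside a threshold-free metric such as AUPR, as the abstract does.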
5
Asudani DS, Nagwani NK, Singh P. Impact of word embedding models on text analytics in deep learning environment: a review. Artif Intell Rev 2023; 56:1-81. [PMID: 36844886 PMCID: PMC9944441 DOI: 10.1007/s10462-023-10419-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/01/2023] [Indexed: 02/25/2023]
Abstract
The selection of word embedding and deep learning models for better outcomes is vital. Word embeddings are an n-dimensional distributed representation of a text that attempts to capture the meanings of the words. Deep learning models utilize multiple computing layers to learn hierarchical representations of data. The word embedding technique represented by deep learning has received much attention. It is used in various natural language processing (NLP) applications, such as text classification, sentiment analysis, named entity recognition, topic modeling, etc. This paper reviews the representative methods of the most prominent word embedding and deep learning models. It presents an overview of recent research trends in NLP and a detailed understanding of how to use these models to achieve efficient results on text analytics tasks. The review summarizes, contrasts, and compares numerous word embedding and deep learning models and includes a list of prominent datasets, tools, APIs, and popular publications. A reference for selecting a suitable word embedding and deep learning approach is presented based on a comparative analysis of different techniques to perform text analytics tasks. This paper can serve as a quick reference for learning the basics, benefits, and challenges of various word representation approaches and deep learning models, with their application to text analytics and a future outlook on research. It can be concluded from the findings of this study that domain-specific word embedding and the long short term memory model can be employed to improve overall text analytics task performance.
Affiliation(s)
- Deepak Suresh Asudani: Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
- Naresh Kumar Nagwani: Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
- Pradeep Singh: Department of Computer Science and Engineering, National Institute of Technology, Raipur, Chhattisgarh, India
6
Predicting Conserved Water Molecules in Binding Sites of Proteins Using Machine Learning Methods and Combining Features. Computational and Mathematical Methods in Medicine 2022; 2022:5104464. [PMID: 36226242 PMCID: PMC9550495 DOI: 10.1155/2022/5104464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Accepted: 09/15/2022] [Indexed: 11/17/2022]
Abstract
Water molecules play an important role in many biological processes in terms of stabilizing protein structures, assisting protein folding, and improving binding affinity. It is well known that, due to the impacts of various environmental factors, it is difficult to identify the conserved water molecules (CWMs) from free water molecules (FWMs) directly as CWMs are normally deeply embedded in proteins and form strong hydrogen bonds with surrounding polar groups. To circumvent this difficulty, in this work, the abundance of spatial structure information and physicochemical properties of water molecules in proteins inspires us to adopt machine learning methods for identifying the CWMs. Therefore, in this study, a machine learning framework to identify the CWMs in the binding sites of the proteins was presented. First, by analyzing water molecules' physicochemical properties and spatial structure information, six features (i.e., atom density, hydrophilicity, hydrophobicity, solvent-accessible surface area, temperature B-factors, and mobility) were extracted. Those features were further analyzed and combined to reach a higher CWM identification rate. As a result, an optimal feature combination was determined. Based on this optimal combination, seven different machine learning models (including support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), logistic regression (LR), discriminant analysis (DA), naïve Bayes (NB), and ensemble learning (EL)) were evaluated for their abilities in identifying two categories of water molecules, i.e., CWMs and FWMs. It showed that the EL model was the desired prediction model due to its comprehensive advantages. Furthermore, the presented methodology was validated through a case study of crystal 3skh and extensively compared with Dowser++. 
The prediction performance showed that the optimal feature combination and the desired EL model in our method could achieve satisfactory prediction accuracy in identifying CWMs from FWMs in the proteins' binding sites.
7
BERT-PPII: The Polyproline Type II Helix Structure Prediction Model Based on BERT and Multichannel CNN. BioMed Research International 2022; 2022:9015123. [PMID: 36060139 PMCID: PMC9433275 DOI: 10.1155/2022/9015123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Revised: 08/01/2022] [Accepted: 08/03/2022] [Indexed: 11/26/2022]
Abstract
Predicting the polyproline type II (PPII) helix structure is crucially important in many research areas, such as protein folding mechanisms, drug targets, and protein functions. However, many existing PPII helix prediction algorithms encode protein sequence information in a single way, which leads to insufficient learning of protein sequence features. To improve protein sequence encoding, this paper proposes a BERT-based PPII helix structure prediction algorithm (BERT-PPII), which learns protein sequence information with the BERT model. The BERT model's CLS vector can fuse the information of every amino acid residue in a sample. Thus, we use the CLS vector as the global feature that represents the sample's global contextual information. Because interactions among the local amino acid residues of a protein chain have an important influence on the formation of the PPII helix, we use a CNN to extract local residue features, which further enhance the information expressed by protein sequence samples. In this paper, we fuse the CLS vectors with the CNN local features to improve the performance of PPII structure prediction. Compared with the state-of-the-art PPIIPRED method, the experimental results on the unbalanced dataset show that the proposed method improves accuracy by 1% on the strict dataset and 2% on the less strict dataset. Correspondingly, the results on the balanced dataset show that the AUCs of the proposed method are 0.826 on the strict dataset and 0.785 on the less strict dataset. For the independent test set, the proposed method has an AUC of 0.827 on the strict dataset and 0.783 on the less strict dataset. These experimental results show that the proposed BERT-PPII method achieves superior performance in predicting the PPII helix.
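The fusion of a global CLS vector with local CNN features can be sketched as follows. The abstract does not say how the two are fused, so plain concatenation is assumed here; the convolution, the ReLU plus global max pooling, the kernel values, and the names `conv1d_max` and `fuse` are likewise illustrative assumptions rather than the BERT-PPII implementation. Each kernel of a different width plays the role of one channel of a multichannel CNN.

```python
def conv1d_max(residues, kernel):
    """Slide one kernel over per-residue vectors, apply ReLU, take the max.

    residues: list of per-residue feature vectors (lists of floats).
    kernel:   list of weight vectors, one per position in the window.
    """
    width = len(kernel)
    best = 0.0  # ReLU floor: negative activations never win
    for start in range(len(residues) - width + 1):
        activation = sum(
            kernel[j][d] * residues[start + j][d]
            for j in range(width)
            for d in range(len(kernel[j]))
        )
        best = max(best, activation)
    return best


def fuse(cls_vector, residues, kernels):
    """Concatenate the global CLS vector with one pooled value per kernel."""
    local_features = [conv1d_max(residues, k) for k in kernels]
    return list(cls_vector) + local_features


# Hypothetical 2-dimensional residue embeddings and two kernels
# (window widths 1 and 2).
residue_vectors = [[1, 0], [0, 1], [1, 1]]
kernels = [
    [[1, 1]],
    [[1, 0], [0, 1]],
]
fused = fuse([0.5, -0.5], residue_vectors, kernels)
print(fused)
```

The fused vector would then feed a classifier head; its length is the CLS dimension plus the number of kernels.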
8
BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information. Computational and Mathematical Methods in Medicine 2021; 2021:7764764. [PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/16/2021] [Accepted: 08/13/2021] [Indexed: 01/19/2023]
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify m7G sites. In this study, we propose a novel method that incorporates a BERT-based multilingual model into bioinformatics to represent the information of RNA sequences. First, we treat RNA sequences as natural sentences and employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Second, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important ones. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. Under 10-fold cross-validation, BERT-m7G achieves an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
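The two reported metrics, accuracy (ACC) and the Matthews correlation coefficient (MCC), are both derived from a binary confusion matrix. The sketch below shows the standard formulas; the confusion-matrix counts are made up for illustration and are not taken from the BERT-m7G experiments, and returning 0.0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 6 true positives, 3 true negatives,
# 1 false positive, 2 false negatives.
print(accuracy(6, 3, 1, 2), mcc(6, 3, 1, 2))
```

MCC ranges from -1 to 1 and stays informative when the classes are imbalanced, which is why it is often reported next to accuracy for site prediction tasks.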
9
Lee T, Lee S, Kang M, Kim S. Deep hierarchical embedding for simultaneous modeling of GPCR proteins in a unified metric space. Sci Rep 2021; 11:9543. [PMID: 33953216 PMCID: PMC8100104 DOI: 10.1038/s41598-021-88623-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/24/2021] [Accepted: 04/13/2021] [Indexed: 11/23/2022] [Open access]
Abstract
GPCR proteins belong to diverse families of proteins that are defined at multiple hierarchical levels. Inspecting relationships between GPCR proteins on the hierarchical structure is important, since characteristics of a protein can be inferred from proteins with similar hierarchical information. However, modeling of GPCR families has been performed separately for each of the family, subfamily, and sub-subfamily levels. Relationships between GPCR proteins are ignored in these approaches, as they process the information in the proteins with several disconnected models. In this study, we propose DeepHier, a deep learning model that simultaneously learns representations of the GPCR family hierarchy from protein sequences with a single unified model. A novel loss term based on metric learning is introduced to incorporate hierarchical relations between proteins. We tested our approach using a public GPCR sequence dataset. Metric distances in the deep feature space corresponded to the hierarchical family relations between GPCR proteins. Furthermore, we demonstrated that downstream tasks, such as phylogenetic reconstruction and motif discovery, are feasible in the constructed embedding space. These results show that hierarchical relations between sequences were successfully captured in both technical and biological aspects.
Affiliation(s)
- Taeheon Lee: Looxid Labs, Seoul, 06628, Republic of Korea
- Sangseon Lee: BK21 FOUR Intelligence Computing, Seoul National University, Seoul, 08826, Republic of Korea
- Minji Kang: Department of Computer Science, Stanford University, Stanford, CA, 94305, USA
- Sun Kim: Bioinformatics Institute, Seoul National University, Seoul, 08826, Republic of Korea; Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Republic of Korea; Institute of Engineering Research, Seoul National University, Seoul, 08826, Republic of Korea