1
Ma W, Bi X, Jiang H, Wei Z, Zhang S. Annotating protein functions via fusing multiple biological modalities. Commun Biol 2024; 7:1705. [PMID: 39730886 DOI: 10.1038/s42003-024-07411-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 12/17/2024] [Indexed: 12/29/2024] [Open access]
Abstract
Understanding the function of proteins is of great significance for revealing disease pathogenesis and discovering new targets. Benefiting from the explosive growth of the protein universe, deep learning has been applied to accelerate the protein annotation cycle from different biological modalities. However, most existing deep learning-based methods not only fail to effectively fuse different biological modalities, resulting in low-quality protein representations, but also suffer from convergence to suboptimal solutions caused by sparse label representations. To address these issues, we propose a multiprocedural approach for fusing heterogeneous biological modalities and annotating protein functions, i.e., MIF2GO (Multimodal Information Fusion to infer Gene Ontology terms), which sequentially fuses up to six biological modalities spanning different biological levels in three steps, thus leading to powerful protein representations. Evaluation results on seven benchmark datasets show that the proposed method not only considerably outperforms state-of-the-art methods, but also demonstrates great robustness and generalizability across species. We also present biological insights into the associations between those modalities and protein functions. This research provides a robust framework for integrating multimodal biological data, offering a scalable solution for protein function annotation, ultimately facilitating advancements in precision medicine and the discovery of novel therapeutic strategies.
Affiliation(s)
- Wenjian Ma
  - College of Computer Science and Technology, Ocean University of China, Qingdao, China
- Xiangpeng Bi
  - College of Computer Science and Technology, Ocean University of China, Qingdao, China
- Huasen Jiang
  - College of Computer Science and Technology, Ocean University of China, Qingdao, China
- Zhiqiang Wei
  - College of Computer Science and Technology, Ocean University of China, Qingdao, China
- Shugang Zhang
  - College of Computer Science and Technology, Ocean University of China, Qingdao, China
2
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] [Open access]
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we introduce fundamental concepts in graph embedding algorithms and describe graph representation learning methods for protein function prediction based on four principal data categories, namely PPI network, protein structure, Gene Ontology graph, and integrated graph. The commonly used approaches for each category are summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions are discussed, and directions for future research within the protein research community are suggested.
Affiliation(s)
- Thi Thuy Duong Vu
  - Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Jeongho Kim
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
3
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2024:e2300471. [PMID: 38996351 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods have been developed over the last two decades to gradually advance protein function prediction. In recent years in particular, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments of deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to help the machine learning, AI, and bioinformatics communities develop more cutting-edge methods to advance protein function prediction.
Affiliation(s)
- Frimpong Boadu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Ahhyun Lee
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
4
Zhao Y, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Predicting Protein Functions Based on Heterogeneous Graph Attention Technique. IEEE J Biomed Health Inform 2024; 28:2408-2415. [PMID: 38319781 DOI: 10.1109/jbhi.2024.3357834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/08/2024]
Abstract
In bioinformatics, protein function prediction stands as a fundamental area of research and plays a crucial role in addressing various biological challenges, such as the identification of potential targets for drug discovery and the elucidation of disease mechanisms. However, known functional annotation databases usually provide positive experimental annotations that proteins carry out a given function, and rarely record negative experimental annotations that proteins do not carry out a given function. Therefore, existing computational methods based on deep learning models focus on these positive annotations for prediction and ignore these scarce but informative negative annotations, leading to an underestimation of precision. To address this issue, we introduce a deep learning method that utilizes a heterogeneous graph attention technique. The method first constructs a heterogeneous graph that covers the protein-protein interaction network, ontology structure, and positive and negative annotation information. Then, it learns embedding representations of proteins and ontology terms by using the heterogeneous graph attention technique. Finally, it leverages these learned representations to reconstruct the positive protein-term associations and score unobserved functional annotations. It can enhance the predictive performance by incorporating these known limited negative annotations into the constructed heterogeneous graph. Experimental results on three species (i.e., Human, Mouse, and Arabidopsis) demonstrate that our method can achieve better performance in predicting new protein annotations than state-of-the-art methods.
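The graph construction step in this abstract can be pictured with a small sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `HeteroGraph` class, the edge-type labels, and the placeholder node IDs are hypothetical, the attention mechanism is omitted, and a plain dot product stands in for the learned protein-term association score.

```python
from collections import defaultdict

# Edge-type labels are illustrative; they are not taken from the paper.
EDGE_TYPES = ("ppi", "is_a", "annotated_with", "not_annotated_with")


class HeteroGraph:
    """Toy heterogeneous graph with typed edges between proteins and GO terms."""

    def __init__(self):
        # edge type -> set of (source, target) pairs
        self.edges = defaultdict(set)

    def add_edge(self, edge_type, src, dst):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges[edge_type].add((src, dst))

    def neighbors(self, node, edge_type):
        """Targets reachable from `node` over edges of a single type."""
        return sorted(dst for src, dst in self.edges[edge_type] if src == node)


def score(protein_vec, term_vec):
    """Dot product, used here only as a stand-in association score."""
    return sum(p * t for p, t in zip(protein_vec, term_vec))


g = HeteroGraph()
g.add_edge("ppi", "P1", "P2")                        # protein-protein interaction
g.add_edge("is_a", "GO:child", "GO:parent")          # ontology structure
g.add_edge("annotated_with", "P1", "GO:child")       # positive annotation
g.add_edge("not_annotated_with", "P2", "GO:child")   # negative annotation
```

Keeping negative annotations as their own edge type is one simple way to make them available to a model without mixing them with the positive associations that are being reconstructed.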
5
Giri SJ, Ibtehaz N, Kihara D. GO2Sum: generating human-readable functional summary of proteins from GO terms. NPJ Syst Biol Appl 2024; 10:29. [PMID: 38491038 PMCID: PMC10943200 DOI: 10.1038/s41540-024-00358-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 03/05/2024] [Indexed: 03/18/2024] [Open access]
Abstract
Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
6
Zheng L, Shi S, Lu M, Fang P, Pan Z, Zhang H, Zhou Z, Zhang H, Mou M, Huang S, Tao L, Xia W, Li H, Zeng Z, Zhang S, Chen Y, Li Z, Zhu F. AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding. Genome Biol 2024; 25:41. [PMID: 38303023 PMCID: PMC10832132 DOI: 10.1186/s13059-024-03166-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2023] [Accepted: 01/05/2024] [Indexed: 02/03/2024] [Open access]
Abstract
Protein function annotation has been one of the longstanding issues in biological sciences, and various computational methods have been developed. However, the existing methods suffer from a serious long-tail problem, with a large number of GO families containing few annotated proteins. Herein, an innovative strategy named AnnoPRO was therefore constructed by enabling sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. A variety of case studies based on different benchmarks were conducted, which confirmed the superior performance of AnnoPRO among available methods. Source code and models have been made freely available at: https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272.
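The long-tail problem mentioned in this abstract can be made concrete by counting how many proteins carry each GO term. The sketch below is only an illustration with made-up data: the function names, the placeholder term labels, and the cutoff are assumptions, not part of AnnoPRO.

```python
from collections import Counter


def term_frequencies(annotations):
    """Count how many proteins carry each term.

    annotations: dict mapping a protein ID to an iterable of term IDs.
    """
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))  # count each term at most once per protein
    return counts


def tail_terms(counts, max_proteins):
    """Terms annotated to at most `max_proteins` proteins (the 'tail')."""
    return sorted(term for term, n in counts.items() if n <= max_proteins)


# Placeholder data: one frequent term and two rare ones.
toy = {
    "P1": ["term_a", "term_b"],
    "P2": ["term_a"],
    "P3": ["term_a", "term_c"],
    "P4": ["term_a", "term_b"],
}
counts = term_frequencies(toy)
rare = tail_terms(counts, max_proteins=1)
```

On a real annotation set, the share of terms falling under a small cutoff gives a quick measure of how skewed the label distribution is.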
Affiliation(s)
- Lingyan Zheng
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
- Shuiyang Shi
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Mingkun Lu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Pan Fang
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Ziqi Pan
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Hongning Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Zhimeng Zhou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Hanyu Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Shijie Huang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Lin Tao
  - Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Weiqi Xia
  - Pharmaceutical Department, Zhejiang Provincial People's Hospital, Hangzhou, 310014, China
- Honglin Li
  - School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Zhenyu Zeng
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Shun Zhang
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Yuzong Chen
  - State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, China
- Zhaorong Li
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Feng Zhu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
7
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A CNN-CBAM-BIGRU model for protein function prediction. Stat Appl Genet Mol Biol 2024; 23:sagmb-2024-0004. [PMID: 38943434 DOI: 10.1515/sagmb-2024-0004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 06/07/2024] [Indexed: 07/01/2024]
Abstract
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics. Traditionally, this challenge has proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. Their strength lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model: CNN-CBAM+BiGRU. It incorporates a Convolutional Block Attention Module (CBAM) alongside BiGRUs. CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data, leading to more accurate feature extraction. BiGRUs, a type of Recurrent Neural Network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model integrates the strengths of both CNN-CBAM and BiGRU. This study's findings, validated through experimentation, showcase the effectiveness of this combined approach. For the human dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +1.0 % for cellular components, +1.1 % for molecular functions, and +0.5 % for biological processes. For the yeast dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +2.4 % for the cellular component, +1.2 % for molecular functions, and +0.6 % for biological processes.
Affiliation(s)
- Lavkush Sharma
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Akshay Deepak
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
  - Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar, Odisha, India
8
Giri SJ, Ibtehaz N, Kihara D. GO2Sum: Generating Human Readable Functional Summary of Proteins from GO Terms. bioRxiv 2023:2023.11.10.566665. [PMID: 38014080 PMCID: PMC10680659 DOI: 10.1101/2023.11.10.566665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent the function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
9
Jha K, Saha S, Karmakar S. Prediction of Protein-Protein Interactions Using Vision Transformer and Language Model. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:3215-3225. [PMID: 37027644 DOI: 10.1109/tcbb.2023.3248797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.
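The fuse-then-classify step described in this abstract can be sketched in a few lines. The abstract does not say how the two feature vectors are fused, so concatenation is assumed here; the function names are hypothetical, and a single logistic unit stands in for the neural network classifier.

```python
import math


def fuse(seq_vec, struct_vec):
    """Fuse two modality vectors by concatenation (an assumed fusion operator)."""
    return list(seq_vec) + list(struct_vec)


def pair_features(protein_a, protein_b):
    """Build one feature vector for a protein pair.

    Each protein is a dict with "seq" and "struct" feature lists, standing in
    for language-model and vision-transformer embeddings respectively.
    """
    return fuse(protein_a["seq"], protein_a["struct"]) + fuse(
        protein_b["seq"], protein_b["struct"]
    )


def predict_interaction(features, weights, bias=0.0):
    """Logistic unit used as a minimal stand-in for the classifier."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


a = {"seq": [0.2, 0.1], "struct": [0.5]}
b = {"seq": [0.4, 0.3], "struct": [0.9]}
x = pair_features(a, b)
p = predict_interaction(x, weights=[0.0] * len(x))
```

A uni-modal baseline of the kind the abstract mentions corresponds to dropping one of the two inputs to `fuse` before classification.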
10
Zhang F, Zhang Y, Zhu X, Chen X, Lu F, Zhang X. DeepSG2PPI: A Protein-Protein Interaction Prediction Method Based on Deep Learning. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2907-2919. [PMID: 37079417 DOI: 10.1109/tcbb.2023.3268661] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/03/2023]
Abstract
Protein-protein interaction (PPI) plays an important role in almost all life activities. Many protein interaction sites have been confirmed by biological experiments, but these PPI site identification methods are time-consuming and expensive. In this study, a deep learning-based PPI prediction method, named DeepSG2PPI, is developed. First, the protein sequence information is retrieved and the local context information of each amino acid residue is calculated. A two-dimensional convolutional neural network (2D-CNN) model is employed to extract features from a two-channel coding structure, in which an attention mechanism is embedded to assign higher weights to key features. Second, the global statistical information of each amino acid residue and the relationship graph between the protein and GO (Gene Ontology) function annotation are built, and the graph embedding vector is constructed to represent the biological features of the protein. Finally, a 2D-CNN model and two 1D-CNN models are combined for PPI prediction. The comparison analysis with existing algorithms shows that the DeepSG2PPI method has better performance. It provides more accurate and effective PPI site prediction, which will be helpful in reducing the cost and failure rate of biological experiments.
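The abstract refers to the local context of each amino acid residue without saying how it is computed. A common choice, assumed here purely for illustration, is a fixed-size sliding window centered on each residue, padded at the sequence ends; the function name, the window size, and the padding symbol are all hypothetical.

```python
def residue_windows(sequence, half_width, pad="X"):
    """Return one window of 2 * half_width + 1 residues per sequence position.

    Positions near either end are padded with `pad` so every window has
    the same length.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


windows = residue_windows("MKV", half_width=1)
```

Each window could then be encoded numerically and passed to a convolutional model as the per-residue input.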
11
Arya N, Mathur A, Saha S, Saha S. Proposal of SVM Utility Kernel for Breast Cancer Survival Estimation. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1372-1383. [PMID: 35994556 DOI: 10.1109/tcbb.2022.3198879] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
The advancement of medical research in the field of cancer prognosis and diagnosis using various modalities has put oncologists under tremendous stress. The complexity and heterogeneity involved in multiple modalities and their significantly varied clinical outcomes make it difficult to analyze the disease and provide the correct treatment. Breast cancer is the major concern among all cancers worldwide, specifically for females. To help oncologists and cancer patients, research for breast cancer survival estimation has been proposed. It ranges from complex deep neural networks to simple and interpretable architectures. We propose a utility kernel for a support vector machine (SVM) in this article. It is a simple yet powerful function, which performs better than other popular machine learning algorithms and deep neural networks in the task of breast cancer survival prediction using the TCGA-BRCA dataset. This study validates the proposed utility kernel using four different modalities (gene expression, copy number variation, clinical, and histopathological tissue images) and their multi-modal combinations. The SVM based on our utility kernel empirically proves its efficacy by achieving the highest value on various performance measures, whereas advanced deep neural networks fail to train on small and highly imbalanced breast cancer data.
12
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. On the human dataset, the proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO by 1.9 % for the cellular component prediction task, 3.8 % for the molecular function prediction task, and 0.6 % for the biological process prediction task. On the yeast dataset, the improvements are 2.4 % for cellular component, 5.2 % for molecular function, and 1.2 % for biological process.
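The Fmax value reported here is the maximum protein-centric F-measure over all score thresholds, as commonly used in CAFA-style evaluation. The sketch below is a minimal version of that definition, not the evaluation code used in the paper; the function name, the threshold grid, and the toy data are assumptions.

```python
def fmax(predictions, truths, thresholds=None):
    """Maximum protein-centric F-measure over score thresholds.

    predictions: dict protein -> dict term -> score in [0, 1]
    truths:      dict protein -> non-empty set of true terms
    Precision is averaged over proteins with at least one prediction at the
    threshold; recall is averaged over all proteins in `truths`.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truths.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truths)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


truths = {"P1": {"a", "b"}, "P2": {"c"}}
preds = {"P1": {"a": 0.9, "b": 0.4, "x": 0.6}, "P2": {"c": 0.8}}
value = fmax(preds, truths)
```

In practice the metric is computed separately for each GO aspect (cellular component, molecular function, biological process), which is why the abstract reports three improvements per dataset.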
Affiliation(s)
- Lavkush Sharma
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Akshay Deepak
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
  - Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
13
Jha K, Saha S. Analyzing Effect of Multi-Modality in Predicting Protein-Protein Interactions. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:162-173. [PMID: 35259112 DOI: 10.1109/tcbb.2022.3157531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Nowadays, multiple sources of information about proteins are available, such as protein sequences, 3D structures, Gene Ontology (GO), etc. Most works on protein-protein interaction (PPI) identification have used this information, mainly sequence-based, but each source individually. New advances in deep learning techniques allow us to leverage multiple sources/modalities of proteins, which complement each other. Some recent works have shown that multi-modal PPI models perform better than uni-modal approaches. This paper aims to investigate whether the performance of multi-modal PPI models is always consistent or depends on other factors such as dataset distribution, algorithms used to learn features, etc. We have used three modalities for this study: protein sequence, 3D structure, and GO. Various techniques, including deep learning algorithms, are employed to extract features from multiple sources of proteins. These feature vectors from different modalities are then integrated in several combinations (bi-modal and tri-modal) to predict PPI. To conduct this study, we have used Human and S. cerevisiae PPI datasets. The obtained results demonstrate the potential of a multi-modal approach and deep learning techniques in predicting protein interactions. However, the predictive capability of a model for PPI depends on feature extraction methods as well. Also, increasing the number of modalities does not always ensure performance improvement. In this study, the PPI model integrating two modalities outperforms the designed uni-modal and tri-modal PPI models.
14
Dutta P, Patra AP, Saha S. DeePROG: Deep Attention-Based Model for Diseased Gene Prognosis by Fusing Multi-Omics Data. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2770-2781. [PMID: 34166198 DOI: 10.1109/tcbb.2021.3090302] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/13/2023]
Abstract
An in-depth exploration of gene prognosis using different methodologies aids in understanding various biological regulations of genes in disease pathobiology and molecular functions. Interpreting gene functions at biological and molecular levels remains a daunting yet crucial task in domains such as drug design, personalized medicine, and next-generation diagnostics. Recent advancements in omics technologies have produced diverse heterogeneous genomic datasets, such as micro-array gene expression, miRNA expression, DNA sequence, and 3D structures, which are significant resources for understanding gene functions. In this paper, we propose a novel self-attention based deep multi-modal model, named DeePROG, for the prognosis of disease-affected genes based on heterogeneous omics data. We use three NCBI datasets covering three modalities, namely gene expression profile, the underlying DNA sequence, and the 3D protein structures. To extract useful features from each modality, we develop several context-specific deep learning models. Besides, we develop three attention-based deep bi-modal architectures along with DeePROG to leverage the prognosis of the underlying biomedical data. We assess the performance of the models in terms of Critical Assessment of Functional Annotation (CAFA2) metrics. Moreover, we analyze the results in terms of the receiver operating characteristic (ROC) curve in a high class-imbalance data setting and perform statistical significance tests using Welch's t-test. Experimental results show that DeePROG significantly outperforms baseline models across performance metrics. The source code and all preprocessed datasets used in this study are available at https://github.com/duttaprat/DeePROG.
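Welch's t-test, which the abstract uses for significance testing, compares two sample means without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom; obtaining a p-value would additionally require the Student t distribution, which is left out. The function name and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    Each sample needs at least two values, and the two samples must not
    both have zero variance.
    """
    n1, n2 = len(a), len(b)
    v1 = variance(a) / n1  # squared standard error of the first mean
    v2 = variance(b) / n2  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Placeholder scores for two models over five runs each.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The non-integer degrees of freedom are expected: they shrink toward the smaller sample's value when the variances differ strongly.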
|
15
|
Qiu XY, Wu H, Shao J. TALE-cmap: Protein function prediction based on a TALE-based architecture and the structure information from contact map. Comput Biol Med 2022; 149:105938. [DOI: 10.1016/j.compbiomed.2022.105938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2022] [Revised: 07/26/2022] [Accepted: 08/06/2022] [Indexed: 11/03/2022]
|
16
|
Sheng J, Xue J, Li P, Yi N. [A protein complex recognition method based on spatial-temporal graph convolution neural network]. Nan Fang Yi Ke Da Xue Xue Bao = Journal of Southern Medical University 2022; 42:1075-1081. [PMID: 35869773 DOI: 10.12122/j.issn.1673-4254.2022.07.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
OBJECTIVE: To propose a new method for mining complexes in dynamic protein networks using a spatiotemporal convolutional neural network. METHODS: The edge strength, node strength and edge existence probability were defined to model the dynamic protein network. Based on the time series information and structure information on the graph, two convolution operators were designed using the Hilbert-Huang transform, an attention mechanism and residual connections to represent and learn the characteristics of the proteins in the network, and the characteristic map of the dynamic protein network was constructed. Finally, spectral clustering was used to identify the protein complexes. RESULTS: The simulation results on several public biological datasets showed that the F value of the proposed algorithm exceeded 90% on the DIP and MIPS datasets. Compared with 4 other recognition algorithms (DPCMNE, GE-CFI, VGAE and NOCD), the proposed algorithm improved the recognition efficiency by 34.5%, 28.7%, 25.4% and 17.6%, respectively. CONCLUSION: The application of deep learning technology can improve the efficiency of dynamic protein network analysis.
Affiliation(s)
- J Sheng
  - Clinical nursing teaching and Research Office, The Second Xiangya Hospital of Central South University, Changsha 410011, China; Department of ultrasound diagnosis, The Second Xiangya Hospital of Central South University, Changsha 410011, China
- J Xue
  - Operation center, The Third Xiangya Hospital of Central South University, Changsha 410013, China
- P Li
  - School of Informatics, Hunan University of Chinese Medicine, Changsha 410208, China
- N Yi
  - School of Informatics, Hunan University of Chinese Medicine, Changsha 410208, China
|
17
|
Xia W, Zheng L, Fang J, Li F, Zhou Y, Zeng Z, Zhang B, Li Z, Li H, Zhu F. PFmulDL: a novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods. Comput Biol Med 2022; 145:105465. [PMID: 35366467 DOI: 10.1016/j.compbiomed.2022.105465] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 03/04/2022] [Revised: 03/22/2022] [Accepted: 03/25/2022] [Indexed: 02/06/2023]
Abstract
Bioinformatic annotation of protein function is essential but extremely complex, and it demands extensive effort to develop effective prediction methods. However, existing methods tend to amplify the representativeness of families with large numbers of proteins by misclassifying proteins in families with small numbers of proteins. That is to say, the ability of existing methods to annotate proteins in the 'rare classes' remains limited. Herein, a new protein function annotation strategy, PFmulDL, which integrates multiple deep learning methods, was constructed. First, a recurrent neural network was integrated, for the first time, with a convolutional neural network to facilitate function annotation. Second, a transfer learning method was introduced into model construction to further improve prediction performance. Third, based on the latest Gene Ontology data, the newly constructed model could annotate the largest number of protein families compared with existing methods. Finally, the newly constructed model was found capable of significantly elevating the prediction performance for the 'rare classes' without sacrificing that for the 'major classes'. All in all, given the emerging need to improve prediction performance for proteins in 'rare classes', this new strategy would become an essential complement to existing methods for protein function prediction. All the models and source code are freely available at: https://github.com/idrblab/PFmulDL.
Affiliation(s)
- Weiqi Xia
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Lingyan Zheng
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Jiebin Fang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Fengcheng Li
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Ying Zhou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Zhenyu Zeng
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Bing Zhang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Zhaorong Li
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Honglin Li
  - School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Feng Zhu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
|
18
|
Zhang Z, Zhao Y, Wang J, Guo M. DeepRCI: predicting ATP-binding proteins using the residue-residue contact information. IEEE J Biomed Health Inform 2021; 26:2822-2829. [PMID: 34941538 DOI: 10.1109/jbhi.2021.3137840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
Adenosine-5'-triphosphate (ATP) is a direct energy source for various activities of tissues and cells in the body. The release of energy from ATP requires the assistance of ATP-binding proteins. Therefore, the identification of ATP-binding proteins is of great significance for research on organisms. Several methods for predicting ATP-binding proteins already exist. However, the accuracy of these methods is so low that their predictions are unreliable. Here, we designed a novel method, called DeepRCI (based on Deep convolutional neural network and Residue-residue Contact Information), for predicting ATP-binding proteins. DeepRCI achieved an accuracy of 93.61% on the test set, which was a significant improvement over the state-of-the-art methods.
|
19
|
Nallasamy V, S M. Bingham deep neural and oppositional fish swarm optimized protein structure prediction. J Biomol Struct Dyn 2021; 40:8706-8724. [PMID: 33955323 DOI: 10.1080/07391102.2021.1915181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Abstract
It is well known that essential proteins take part in managing cellular activities in living organisms. Moreover, predicting protein structure from the amino acid sequence helps in understanding cellular functions. Several essential protein prediction methods have been proposed previously. However, those existing prediction methods were not satisfactory because of low sensitivity to imbalance characteristics. To address this issue, this paper presents a novel secondary protein structure prediction method, called Bingham Deep Convolutional-based Oppositional Artificial Fish Optimized (BDC-OAFO). First, a protein structure identification framework, called Bingham Distributed Deep Convolutional (BDDC), is designed to identify the essential proteins by eliminating the imbalanced learning issue. Next, a secondary structure prediction framework, called Oppositional Artificial Fish Swarm Optimization, is proposed to obtain precise prediction results. Secondary protein structure is then predicted by emulating three biological behaviors of artificial fishes (foraging, following, and swarming), in which a proximal count, an oppositional function, and a Gaussian function are utilized. To evaluate the performance of the BDC-OAFO method, we conduct experiments on the Protein Data Bank dataset. The experimental results show that BDC-OAFO achieves better performance in identifying essential proteins and more precise prediction than several other well-known prediction methods, which confirms the significance of BDC-OAFO. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Malarvizhi S
  - Department of Computer Science, Thiruvalluvar Government Arts College, Namakkal, Tamil Nadu, India
|