1
Qin X, Liu M, Liu G. ResCNNT-fold: Combining residual convolutional neural network and Transformer for protein fold recognition from language model embeddings. Comput Biol Med 2023; 166:107571. [PMID: 37864911 DOI: 10.1016/j.compbiomed.2023.107571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 09/30/2023] [Accepted: 10/11/2023] [Indexed: 10/23/2023]
Abstract
A comprehensive understanding of protein functions holds significant promise for disease research and drug development, and proteins with analogous tertiary structures tend to exhibit similar functions. Protein fold recognition stands as a classical approach in the realm of protein structure investigation. Despite significant advancements made by researchers in this field, the continuous updating of protein databases presents an ongoing challenge in accurately identifying protein fold types. In this study, we introduce a predictor, ResCNNT-fold, for protein fold recognition and employ the LE dataset for testing purpose. ResCNNT-fold leverages a pre-trained language model to obtain embedding representations for protein sequences, which are then processed by the ResCNNT feature extractor, a combination of residual convolutional neural network and Transformer, to derive fold-specific features. Subsequently, the query protein is paired with each protein whose structure is known in the template dataset. For each pair, the similarity score of their fold-specific features is calculated. Ultimately, the query protein is identified as the fold type of the template protein in the pair with the highest similarity score. To further validate the utility and efficacy of the proposed ResCNNT-fold predictor, we conduct a 2-fold cross-validation experiment on the fold level of the LE dataset. Remarkably, this rigorous evaluation yields an exceptional accuracy of 91.57%, which surpasses the best result among other state-of-the-art protein fold recognition methods by an approximate margin of 10%. The excellent performance unequivocally underscores the compelling advantages inherent to our proposed ResCNNT-fold predictor in the realm of protein fold recognition. The source code and data of ResCNNT-fold can be downloaded from https://github.com/Bioinformatics-Laboratory/ResCNNT-fold.
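The template-matching step described in this abstract (pair the query with every template, score each pair, take the fold of the best-scoring template) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names, fold labels and toy feature vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def assign_fold(query_features, templates):
    """Return (fold_label, score) of the template most similar to the query.

    `templates` is a list of (fold_label, feature_vector) pairs whose
    fold types are already known.
    """
    best_fold, best_score = None, float("-inf")
    for fold_label, template_features in templates:
        score = cosine_similarity(query_features, template_features)
        if score > best_score:
            best_fold, best_score = fold_label, score
    return best_fold, best_score


# Toy data: hypothetical fold-specific features for three templates.
templates = [
    ("fold_A", [1.0, 0.0, 0.2]),
    ("fold_B", [0.0, 1.0, 0.1]),
    ("fold_A", [0.9, 0.1, 0.0]),
]
query = [0.8, 0.2, 0.1]
fold, score = assign_fold(query, templates)
print(fold, round(score, 3))
```

In the pipeline the abstract describes, the feature vectors would come from the ResCNNT feature extractor applied to language model embeddings; here they are hard-coded so the sketch stays self-contained.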
Affiliation(s)
- Xinyi Qin
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Min Liu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Guangzhong Liu
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
2
Liu Y, Wei G, Li C, Shen LC, Gasser RB, Song J, Chen D, Yu DJ. TripletCell: a deep metric learning framework for accurate annotation of cell types at the single-cell level. Brief Bioinform 2023; 24:bbad132. [PMID: 37080771 PMCID: PMC10199768 DOI: 10.1093/bib/bbad132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Revised: 02/02/2023] [Accepted: 03/14/2023] [Indexed: 04/22/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.
Affiliation(s)
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Guo Wei
  - School of Life Sciences, Nanjing University, Nanjing 210023, China
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Long-Chen Shen
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Dijun Chen
  - School of Life Sciences, Nanjing University, Nanjing 210023, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
3
Ma Z, Lu YY, Wang Y, Lin R, Yang Z, Zhang F, Wang Y. Metric learning for comparing genomic data with triplet network. Brief Bioinform 2022; 23:6679451. [DOI: 10.1093/bib/bbac345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 07/20/2022] [Accepted: 07/26/2022] [Indexed: 11/13/2022] Open
Abstract
Many biological applications are essentially pairwise comparison problems, such as inferring evolutionary relationships among genomic sequences, binning contigs in metagenomic data, and identifying cell types from single-cell gene expression profiles. Pairwise comparison requires a suitable dissimilarity metric. However, no single metric suits every biological application, so the metric should be learned from data in a way that adapts to the application of interest. Therefore, in this study, we proposed MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from the original space to an embedding space that keeps similar data close together and dissimilar data far apart. MELT is a weakly supervised, data-driven comparison framework that learns a more adaptive and accurate dissimilarity without label information, for cases where supervised methods are not applicable. We applied MELT to three typical genomic data comparison tasks that have no distinctive grouping information: hierarchical genomic sequences, longitudinal microbiome samples and longitudinal single-cell gene expression profiles. In the experiments, MELT demonstrated its empirical utility in comparison to many widely used dissimilarity metrics. MELT is expected to accommodate a more extensive set of applications in large-scale genomic comparisons. MELT is available at https://github.com/Ying-Lab/MELT.
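The idea of keeping similar data close and dissimilar data far apart in the embedding space is commonly expressed as a triplet margin loss. The sketch below shows that standard formulation on already-embedded points. It is an assumption that MELT uses this exact form: the abstract does not state the loss, the distance, or the margin, and the function names and the default margin of 1.0 are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedded points.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it is positive and
    training would pull the positive closer or push the negative away.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: the similar point is already much closer.
print(triplet_loss(anchor, positive=near, negative=far))

# Violating triplet: the "similar" point is the far one.
print(triplet_loss(anchor, positive=far, negative=near))
```

In a triplet network the three inputs pass through the same learned mapping before the loss is computed; that mapping is omitted here so the sketch runs without a deep learning framework.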
Affiliation(s)
- Zhi Ma
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
- Yang Young Lu
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Yiwen Wang
  - Department of Automation, Xiamen University, China
- Renhao Lin
  - Department of Automation, Xiamen University, China
- Zizi Yang
  - Department of Automation, Xiamen University, China
- Fang Zhang
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Ying Wang
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
  - Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, Fujian 361005, China
  - Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, 361100, China
4
Zhu GY, Liu Y, Wang PH, Yang X, Yu DJ. Learning Protein Embedding to Improve Protein Fold Recognition Using Deep Metric Learning. J Chem Inf Model 2022; 62:4283-4291. [PMID: 36017565 DOI: 10.1021/acs.jcim.2c00959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Protein fold recognition refers to predicting the most likely fold type of the query protein and is a critical step of protein structure and function prediction. With the popularity of deep learning in bioinformatics, protein fold recognition has obtained impressive progress. In this study, to extract the fold-specific feature to improve protein fold recognition, we proposed a unified deep metric learning framework based on a joint loss function, termed NPCFold. In addition, we also proposed an integrated machine learning model based on the similarity of proteins in various properties, termed NPCFoldpro. Benchmark experiments show both NPCFold and NPCFoldpro outperform existing protein fold recognition methods at the fold level, indicating that our proposed strategies of fusing loss functions and fusing features could improve the fold recognition level.
Affiliation(s)
- Guan-Yu Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, P. R. China
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, P. R. China
- Peng-Hao Wang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, P. R. China
- Xibei Yang
  - School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, P. R. China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, P. R. China
5
Newton MAH, Rahman J, Zaman R, Sattar A. Enhancing Protein Contact Map Prediction Accuracy via Ensembles of Inter-Residue Distance Predictors. Comput Biol Chem 2022; 99:107700. [DOI: 10.1016/j.compbiolchem.2022.107700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 05/19/2022] [Accepted: 05/19/2022] [Indexed: 11/03/2022]
6
Han K, Liu Y, Xu J, Song J, Yu DJ. Performing protein fold recognition by exploiting a stack convolutional neural network with the attention mechanism. Anal Biochem 2022; 651:114695. [PMID: 35487269 DOI: 10.1016/j.ab.2022.114695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2022] [Revised: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/01/2022]
Abstract
Protein fold recognition is a critical step in protein structure and function prediction, and aims to ascertain the most likely fold type of the query protein. As a typical pattern recognition problem, designing a powerful feature extractor and metric function to extract relevant and representative fold-specific features from protein sequences is the key to improving protein fold recognition. In this study, we propose an effective sequence-based approach, called RattnetFold, to identify protein fold types. The basic concept of RattnetFold is to employ a stack convolutional neural network with the attention mechanism that acts as a feature extractor to extract fold-specific features from protein residue-residue contact maps. Moreover, based on the fold-specific features, we leverage metric learning to project fold-specific features into a subspace where similar proteins are closer together and name this approach RattnetFoldPro. Benchmarking experiments illustrate that RattnetFold and RattnetFoldPro enable the convolutional neural networks to efficiently learn the underlying subtle patterns in residue-residue contact maps, thereby improving the performance of protein fold recognition. An online web server of RattnetFold and the benchmark datasets are freely available at http://csbio.njust.edu.cn/bioinf/rattnetfold/.
Affiliation(s)
- Ke Han
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jian Xu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, Victoria, 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
7
Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform 2022; 23:6571527. [PMID: 35443054 DOI: 10.1093/bib/bbac142] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/01/2022] [Revised: 03/21/2022] [Accepted: 03/28/2022] [Indexed: 11/13/2022] Open
Abstract
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are supervisedly trained with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
Affiliation(s)
- Amelia Villegas-Morcillo
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Angel M Gomez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Victoria Sanchez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
8
Li H, Pang Y, Liu B, Yu L. MoRF-FUNCpred: Molecular Recognition Feature Function Prediction Based on Multi-Label Learning and Ensemble Learning. Front Pharmacol 2022; 13:856417. [PMID: 35350759 PMCID: PMC8957949 DOI: 10.3389/fphar.2022.856417] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/17/2022] [Accepted: 02/14/2022] [Indexed: 01/13/2023] Open
Abstract
Intrinsically disordered regions (IDRs) without stable structure are important for protein structures and functions. Some IDRs, called molecular recognition features (MoRFs), complete a transition from disordered to ordered when they bind to molecular fragments. There are five main functions of MoRFs: molecular recognition assembler (MoR_assembler), molecular recognition chaperone (MoR_chaperone), molecular recognition display sites (MoR_display_sites), molecular recognition effector (MoR_effector), and molecular recognition scavenger (MoR_scavenger). Research on the functions of MoRFs is important for pharmaceutical research and for the study of disease pathogenesis. However, the existing computational methods can only predict the MoRFs in proteins, failing to distinguish their different functions. In this paper, we treat MoRF function prediction as a multi-label learning task and solve it with the Binary Relevance (BR) strategy. Finally, we use Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF) as basic models to construct MoRF-FUNCpred through ensemble learning. Experimental results show that MoRF-FUNCpred performs well for MoRF function prediction. To the best of our knowledge, MoRF-FUNCpred is the first predictor for predicting the functions of MoRFs. Availability and Implementation: The stand-alone package of MoRF-FUNCpred can be accessed from https://github.com/LiangYu-Xidian/MoRF-FUNCpred.
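Binary Relevance turns one multi-label problem into one independent yes/no problem per label. The sketch below illustrates only that decomposition. It is not the MoRF-FUNCpred implementation: a toy nearest-centroid classifier (hypothetical, chosen to keep the example dependency-free) stands in for the SVM, LR, DT and RF base models, the ensemble step is omitted because the abstract does not say how the models are combined, and the feature vectors are invented.

```python
class CentroidClassifier:
    """Toy binary classifier: predict the class with the nearer centroid.

    A stand-in for the real base models; it assumes both classes occur
    in the training data.
    """

    def fit(self, X, y):
        self.pos = self._mean([x for x, t in zip(X, y) if t == 1])
        self.neg = self._mean([x for x, t in zip(X, y) if t == 0])
        return self

    @staticmethod
    def _mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    @staticmethod
    def _sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def predict(self, x):
        return 1 if self._sq_dist(x, self.pos) < self._sq_dist(x, self.neg) else 0


def fit_binary_relevance(X, label_sets, all_labels):
    """Train one independent binary classifier per label."""
    models = {}
    for label in all_labels:
        # Binary target for this label: 1 if the sample carries it, else 0.
        y = [1 if label in labels else 0 for labels in label_sets]
        models[label] = CentroidClassifier().fit(X, y)
    return models


def predict_labels(models, x):
    """Collect every label whose binary classifier says yes."""
    return {label for label, model in models.items() if model.predict(x) == 1}


# Invented two-dimensional features and label sets for four training samples.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
label_sets = [{"MoR_effector"}, {"MoR_assembler"},
              {"MoR_effector", "MoR_assembler"}, set()]
models = fit_binary_relevance(X, label_sets, ["MoR_effector", "MoR_assembler"])

print(predict_labels(models, [0.9, 0.8]))
print(predict_labels(models, [0.9, 0.1]))
```

Because each label gets its own classifier, a sample can receive several labels, one label, or none, which is what a multi-label function predictor needs.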
Affiliation(s)
- Haozheng Li
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yihe Pang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China