1
Han S, Liu L. GP-HTNLoc: A graph prototype head-tail network-based model for multi-label subcellular localization prediction of ncRNAs. Comput Struct Biotechnol J 2024; 23:2034-2048. [PMID: 38765609 PMCID: PMC11101938 DOI: 10.1016/j.csbj.2024.04.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/22/2024] Open
Abstract
Numerous studies have demonstrated that understanding the subcellular localization of non-coding RNAs (ncRNAs) is pivotal in elucidating their roles and regulatory mechanisms in cells. Although more than ten computational models have been developed to predict the subcellular localization of ncRNAs, most of them are designed solely for single-label prediction. In reality, ncRNAs often localize to multiple subcellular compartments. Furthermore, the existing multi-label localization prediction models do not adequately address the scarcity of training samples and the class imbalance in ncRNA datasets. To address these limitations, this study proposes a novel multi-label localization prediction model for ncRNAs, named GP-HTNLoc. To mitigate class imbalance, GP-HTNLoc adopts separate training approaches for head and tail location labels. Additionally, GP-HTNLoc introduces a pioneering graph prototype module to enhance its performance in small-sample, multi-label scenarios. The experimental results based on 10-fold cross-validation on benchmark datasets demonstrate that GP-HTNLoc achieves competitive predictive performance. The average results from 10 rounds of testing on an independent dataset show that GP-HTNLoc outperforms the best existing models on the human lncRNA, human snoRNA, and human miRNA subsets, with average precision improvements of 31.5%, 14.2%, and 5.6%, respectively, reaching 0.685, 0.632, and 0.704. A user-friendly online GP-HTNLoc server is accessible at https://56s8y85390.goho.co.
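The head/tail separation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how GP-HTNLoc decides which location labels are "head" and which are "tail", so the frequency cutoff `min_fraction`, the function name `split_head_tail`, and the toy samples below are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def split_head_tail(label_sets, min_fraction=0.2):
    """Partition labels into frequent ("head") and rare ("tail") groups.

    label_sets: one set of location labels per sample (multi-label data).
    min_fraction: hypothetical cutoff; a label counts as "head" when it
    occurs in at least this fraction of the samples.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    n_samples = len(label_sets)
    head = sorted(l for l, c in counts.items() if c / n_samples >= min_fraction)
    tail = sorted(l for l, c in counts.items() if c / n_samples < min_fraction)
    return head, tail


# Toy multi-label samples (made up, not taken from the benchmark datasets).
samples = [
    {"nucleus", "cytoplasm"},
    {"nucleus"},
    {"cytoplasm"},
    {"nucleus", "exosome"},
    {"cytoplasm", "nucleus"},
    {"ribosome", "cytoplasm"},
]
head_labels, tail_labels = split_head_tail(samples, min_fraction=0.5)
print("head:", head_labels)
print("tail:", tail_labels)
```

Once the labels are partitioned this way, the frequent and rare groups could each be given their own training procedure, which is the general idea the abstract describes.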
Affiliation(s)
- Shuangkai Han
  - School of Information, Yunnan Normal University, Kunming, China
  - Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province, China
- Lin Liu
  - School of Information, Yunnan Normal University, Kunming, China
  - Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province, China
2
Yavari P, Roointan A, Naghdibadi M, Masoudi-Sobhanzadeh Y. In-silico identification of therapeutic targets in pancreatic ductal adenocarcinoma using WGCNA and Trader. Sci Rep 2024; 14:23292. [PMID: 39375436 DOI: 10.1038/s41598-024-74252-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 09/24/2024] [Indexed: 10/09/2024] Open
Abstract
Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive malignancy, accounting for over 90% of pancreatic cancers, and is characterized by limited treatment options and poor survival rates. Systems biology provides in-depth insights into the molecular mechanisms of PDAC. In this context, novel algorithms and comprehensive strategies are essential for advancing the identification of critical network nodes and therapeutic targets within disease-related protein-protein interaction networks. This study employed a comprehensive computational strategy using the metaheuristic algorithm Trader to enhance the identification of potential therapeutic targets. Analysis of the expression data from the PDAC dataset (GSE132956) involved co-expression analysis and clustering of differentially expressed genes (DEGs) to identify key disease-associated modules. The STRING database was used to construct a network of DEGs, and the Trader algorithm pinpointed the top 30 DEGs whose removal caused the most significant network disconnections. Enriched gene ontology terms included "Signaling by Rho GTPases," "Signaling by receptor tyrosine kinases," and "immune system." Additionally, nine hub genes (FYN, MAPK3, CDK2, SNRPG, GNAQ, PAK1, LPCAT4, MAP1LC3B, and FBN1) were identified as central to PDAC pathogenesis. This integrated approach, combining co-expression analysis with protein-protein interaction network analysis using a metaheuristic algorithm, provides valuable insights into PDAC mechanisms and highlights several hub genes as potential therapeutic targets.
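The idea of ranking nodes by how much their removal disconnects a network can be sketched as follows. This is not the Trader algorithm, which is a metaheuristic whose details are not given in the abstract; it is a simple greedy stand-in that repeatedly removes the node leaving the smallest largest connected component. The graph, the node names, and the function names are hypothetical.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Depth-first search over the nodes that are still present.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def greedy_disruptors(adj, k):
    """Greedily pick k nodes whose removal most fragments the network."""
    removed = []
    for _ in range(k):
        candidates = [n for n in adj if n not in removed]
        # Ties are broken by node name so the result is deterministic.
        pick = min(
            candidates,
            key=lambda n: (largest_component_size(adj, removed + [n]), n),
        )
        removed.append(pick)
    return removed


# Toy undirected network (made-up node names, not genes from the study).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

top_nodes = greedy_disruptors(adj, 2)
print(top_nodes)
```

A greedy search like this examines one node at a time, whereas a metaheuristic searches over whole candidate sets; the sketch only conveys the objective being optimized.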
Affiliation(s)
- Parvin Yavari
  - Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Hezar Jerib Avenue, Isfahan, Iran
- Amir Roointan
  - Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Hezar Jerib Avenue, Isfahan, Iran
- Mohammadjavad Naghdibadi
  - Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Hezar Jerib Avenue, Isfahan, Iran
- Yosef Masoudi-Sobhanzadeh
  - Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
  - Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran
3
Lu P, Tian J. ACDMBI: A deep learning model based on community division and multi-source biological information fusion predicts essential proteins. Comput Biol Chem 2024; 112:108115. [PMID: 38865861 DOI: 10.1016/j.compbiolchem.2024.108115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/28/2024] [Indexed: 06/14/2024]
Abstract
Accurately identifying essential proteins is vital for drug research and disease diagnosis. Traditional centrality methods and machine learning approaches often struggle to discern essential proteins accurately because they rely primarily on information derived from protein-protein interaction (PPI) networks. Although some researchers have attempted to integrate biological data with PPI networks to predict essential proteins, designing effective integration methods remains a challenge. In response, this paper presents the ACDMBI model. ACDMBI comprises two key modules: feature extraction and classification. For feature extraction, we draw on three distinct data sources. First, structural features of proteins are extracted from the PPI network through community division and are then further optimized using Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT). Second, protein features are extracted from gene expression data using Bidirectional Long Short-Term Memory networks (BiLSTM) and a multi-head self-attention mechanism. Third, protein features are derived by mapping subcellular localization data to a one-dimensional vector and processing it through fully connected layers. In the classification phase, we integrate the features extracted from the three data sources and use a multi-layer deep neural network (DNN) for protein classification. Experimental results on brewer's yeast data demonstrate the ACDMBI model's superior performance, with AUC reaching 0.9533 and AUPR reaching 0.9153. Ablation experiments further reveal that the effective integration of features from diverse biological information significantly boosts the model's performance.
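The step of "mapping subcellular localization data to a one-dimensional vector" can be pictured as a multi-hot encoding. The abstract does not specify the encoding or the compartment vocabulary, so the `COMPARTMENTS` list and the function `encode_localization` below are assumptions for illustration only.

```python
# Hypothetical compartment vocabulary; the paper's actual list is not
# given in the abstract.
COMPARTMENTS = ["nucleus", "cytoplasm", "mitochondrion", "membrane"]


def encode_localization(locations, compartments=COMPARTMENTS):
    """Return a multi-hot vector with 1.0 for each compartment present."""
    present = set(locations)
    return [1.0 if name in present else 0.0 for name in compartments]


# A protein annotated with two compartments (toy annotation).
vector = encode_localization(["cytoplasm", "nucleus"])
print(vector)
```

A fixed-length vector of this kind is a natural input for fully connected layers, since every protein yields the same number of features regardless of how many compartments it is annotated with.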
Affiliation(s)
- Pengli Lu
  - School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
- Jialong Tian
  - School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
4
Ruan X, Xia S, Li S, Su Z, Yang J. Hybrid framework for membrane protein type prediction based on the PSSM. Sci Rep 2024; 14:17156. [PMID: 39060345 PMCID: PMC11282086 DOI: 10.1038/s41598-024-68163-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Accepted: 07/22/2024] [Indexed: 07/28/2024] Open
Abstract
Membrane proteins are considered the major source of drug targets and are indispensable for drug design and disease prevention. However, traditional biomechanical experiments are costly and time-consuming; thus, computational methods for predicting membrane protein types are gaining popularity. The position-specific scoring matrix (PSSM) is an excellent means of describing the evolutionary information of protein sequences. In this study, we propose an improved capsule neural network (ICNN) model based on a capsule neural network to acquire sufficient relevant information from the PSSM. Furthermore, accounting for the complementarity between traditional machine learning and deep learning, we propose a hybrid framework that combines both approaches to predict protein types. This framework trains 41 baseline models based on the PSSM. The optimal subset features, selected after traversal, are fused using a two-level decision-level feature fusion approach. Subsequently, comparisons are made using three combined strategies within an ensemble learning framework. The experimental results demonstrate that, relying solely on PSSM input, the proposed method not only surpasses the optimal methods by 1.52%, 2.26%, and 2.67% on Dataset1, Dataset2, and Dataset3, respectively, but also exhibits superior generalizability. Furthermore, the code and dataset can be freely downloaded from https://github.com/ruanxiaoli/membrane-protein-types.
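Decision-level fusion, in general terms, combines the outputs of several trained models rather than their input features. The abstract does not describe the three combination strategies it compares, so the sketch below shows only one common option, soft voting (averaging per-class probabilities). The function name `soft_vote` and the probabilities are made up for illustration.

```python
def soft_vote(prob_rows):
    """Average per-class probabilities across models.

    prob_rows: one list of class probabilities per model, all the same
    length. Returns (predicted class index, averaged probabilities).
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    averaged = [
        sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Toy outputs from three hypothetical baseline models for one protein.
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.5, 0.2],
]
predicted_class, fused = soft_vote(model_outputs)
print(predicted_class, fused)
```

Here the first model alone would choose class 0, but the averaged probabilities favour class 1, which illustrates how fusion can override an individual model's decision.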
Affiliation(s)
- Xiaoli Ruan
  - State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
- Sina Xia
  - State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
- Shaobo Li
  - State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
- Zhidong Su
  - Department of Electrical and Computer Engineering, University of Oklahoma State, Stillwater, 74078, USA
- Jing Yang
  - State Key Laboratory of Public Big Data, Guizhou University, Guizhou, 550000, Guizhou, China
5
Zeng M, Wu Y, Li Y, Yin R, Lu C, Duan J, Li M. LncLocFormer: a Transformer-based deep learning model for multi-label lncRNA subcellular localization prediction by using localization-specific attention mechanism. Bioinformatics 2023; 39:btad752. [PMID: 38109668 PMCID: PMC10749772 DOI: 10.1093/bioinformatics/btad752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 11/13/2023] [Accepted: 12/17/2023] [Indexed: 12/20/2023] Open
Abstract
MOTIVATION: There is mounting evidence that the subcellular localization of lncRNAs can provide valuable insights into their biological functions. In real transcriptomes, lncRNAs are usually found in multiple subcellular localizations. Furthermore, lncRNAs have specific localization patterns for different subcellular localizations. Although several computational methods have been developed to predict the subcellular localization of lncRNAs, few of them are designed for lncRNAs that have multiple subcellular localizations, and none of them takes motif specificity into consideration. RESULTS: In this study, we propose a novel deep learning model, called LncLocFormer, which uses only lncRNA sequences to predict multi-label lncRNA subcellular localization. LncLocFormer utilizes eight Transformer blocks to model long-range dependencies within the lncRNA sequence and to share information across the sequence. To exploit the relationship between different subcellular localizations and find distinct localization patterns for each of them, LncLocFormer employs a localization-specific attention mechanism. The results demonstrate that LncLocFormer outperforms existing state-of-the-art predictors on the hold-out test set. Furthermore, we conducted a motif analysis and found that LncLocFormer can capture known motifs. Ablation studies confirmed the contribution of the localization-specific attention mechanism to the prediction performance. AVAILABILITY AND IMPLEMENTATION: The LncLocFormer web server is available at http://csuligroup.com:9000/LncLocFormer. The source code can be obtained from https://github.com/CSUBioGroup/LncLocFormer.
Affiliation(s)
- Min Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yifan Wu
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yiming Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Rui Yin
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL 32603, United States
- Chengqian Lu
  - School of Computer Science, Key Laboratory of Intelligent Computing and Information Processing, Xiangtan University, Xiangtan, Hunan 411105, China
- Junwen Duan
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
6
Ma J, Song J, Young ND, Chang BCH, Korhonen PK, Campos TL, Liu H, Gasser RB. 'Bingo'-a large language model- and graph neural network-based workflow for the prediction of essential genes from protein data. Brief Bioinform 2023; 25:bbad472. [PMID: 38152979 PMCID: PMC10753293 DOI: 10.1093/bib/bbad472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 10/22/2023] [Accepted: 11/28/2023] [Indexed: 12/29/2023] Open
Abstract
The identification and characterization of essential genes are central to our understanding of the core biological functions in eukaryotic organisms, and have important implications for the treatment of diseases caused by, for example, cancers and pathogens. Given the major constraints in testing the functions of genes of many organisms in the laboratory, due to the absence of in vitro cultures and/or gene perturbation assays for most metazoan species, there has been a need to develop in silico tools for the accurate prediction or inference of essential genes to underpin systems biological investigations. Major advances in machine learning approaches provide unprecedented opportunities to overcome these limitations and accelerate the discovery of essential genes on a genome-wide scale. Here, we developed and evaluated a large language model- and graph neural network (LLM-GNN)-based approach, called 'Bingo', to predict essential protein-coding genes in the metazoan model organisms Caenorhabditis elegans and Drosophila melanogaster as well as in Mus musculus and Homo sapiens (a HepG2 cell line) by integrating an LLM and GNNs with adversarial training. Bingo predicts essential genes under two 'zero-shot' scenarios with transfer learning, showing promise to compensate for a lack of high-quality genomic and proteomic data for non-model organisms. In addition, attention mechanisms and GNNExplainer were employed to reveal the functional sites and structural domains that contribute most to essentiality. In conclusion, Bingo offers the prospect of accurately inferring the essential genes of little- or under-studied organisms of interest, and provides a biological explanation for gene essentiality.
Affiliation(s)
- Jiani Ma
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jiangning Song
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Neil D Young
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Bill C H Chang
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Pasi K Korhonen
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Tulio L Campos
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
  - Bioinformatics Core Facility, Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
- Hui Liu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia