1
|
Jiang R, Yue Z, Shang L, Wang D, Wei N. PEZy-miner: An artificial intelligence driven approach for the discovery of plastic-degrading enzyme candidates. Metab Eng Commun 2024; 19:e00248. [PMID: 39310048 PMCID: PMC11414552 DOI: 10.1016/j.mec.2024.e00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2024] [Revised: 07/14/2024] [Accepted: 09/03/2024] [Indexed: 09/25/2024] Open
Abstract
Plastic waste has caused a global environmental crisis. Biocatalytic depolymerization mediated by enzymes has emerged as an efficient and sustainable alternative for plastic treatment and recycling. However, it is challenging and time-consuming to discover novel plastic-degrading enzymes using conventional cultivation-based or omics methods. There is a growing interest in developing effective computational methods to identify new enzymes with desirable plastic degradation functionalities by exploring the ever-increasing databases of protein sequences. In this study, we designed an innovative machine learning-based framework, named PEZy-Miner, to mine for enzymes with high potential in degrading plastics of interest. Two datasets integrating information from experimentally verified enzymes and homologs with unknown plastic-degrading activity were created respectively, covering eleven types of plastic substrates. Protein language models and binary classification models were developed to predict enzymatic degradation of plastics along with confidence and uncertainty estimation. PEZy-Miner exhibited high prediction accuracy and stability when validated on experimentally verified enzymes. Furthermore, by masking the experimentally verified enzymes and blending them into homolog dataset, PEZy-Miner effectively concentrated the experimentally verified entries by 14∼30 times while shortlisting promising plastic-degrading enzyme candidates. We applied PEZy-Miner to 0.1 million putative sequences, out of which 27 new sequences were identified with high confidence. This study provided a new computational tool for mining and recommending promising new plastic-degrading enzymes.
Collapse
Affiliation(s)
- Renjing Jiang
- Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
| | - Zhenrui Yue
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
| | - Lanyu Shang
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
| | - Dong Wang
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
| | - Na Wei
- Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
| |
Collapse
|
2
|
Chen F, He X, Xu W, Zhou L, Liu Q, Chen W, Zhu W, Zhang J. Chromatin lysine acylation: On the path to chromatin homeostasis and genome integrity. Cancer Sci 2024; 115:3506-3519. [PMID: 39155589 PMCID: PMC11531963 DOI: 10.1111/cas.16321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2024] [Revised: 07/25/2024] [Accepted: 08/06/2024] [Indexed: 08/20/2024] Open
Abstract
The fundamental role of cells in safeguarding the genome's integrity against DNA double-strand breaks (DSBs) is crucial for maintaining chromatin homeostasis and the overall genomic stability. Aberrant responses to DNA damage, known as DNA damage responses (DDRs), can result in genomic instability and contribute significantly to tumorigenesis. Unraveling the intricate mechanisms underlying DDRs following severe damage holds the key to identify therapeutic targets for cancer. Chromatin lysine acylation, encompassing diverse modifications such as acetylation, lactylation, crotonylation, succinylation, malonylation, glutarylation, propionylation, and butyrylation, has been extensively studied in the context of DDRs and chromatin homeostasis. Here, we delve into the modifying enzymes and the pivotal roles of lysine acylation and their crosstalk in maintaining chromatin homeostasis and genome integrity in response to DDRs. Moreover, we offer a comprehensive perspective and overview of the latest insights, driven primarily by chromatin acylation modification and associated regulators.
Collapse
Affiliation(s)
- Feng Chen
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Xingkai He
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Wenchao Xu
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Linmin Zhou
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Qi Liu
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
- Cancer Research Institute, School of Basic Medical SciencesSouthern Medical UniversityGuangzhouChina
| | - Weicheng Chen
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Wei‐Guo Zhu
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| | - Jun Zhang
- International Cancer Center, Guangdong Key Laboratory of Genome Instability and Human Disease Prevention, Marshall Laboratory of Biomedical Engineering, Department of Biochemistry and Molecular BiologyShenzhen University Medical SchoolShenzhenChina
| |
Collapse
|
3
|
Qin Z, Ren H, Zhao P, Wang K, Liu H, Miao C, Du Y, Li J, Wu L, Chen Z. Current computational tools for protein lysine acylation site prediction. Brief Bioinform 2024; 25:bbae469. [PMID: 39316944 PMCID: PMC11421846 DOI: 10.1093/bib/bbae469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2024] [Revised: 08/20/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] Open
Abstract
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTM is becoming a data-rich field. A large amount of experimentally verified data is urgently required to be translated into valuable biological insights. With computational approaches, PLA can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, including a single type of PLA site and multiple types of PLA sites. This recapitulation covers important aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could potentially facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Collapse
Affiliation(s)
- Zhaohui Qin
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Haoran Ren
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Kaiyuan Wang
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Huixia Liu
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Chunbo Miao
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Yanxiu Du
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Liuji Wu
- National Key Laboratory of Wheat and Maize Crop Science, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| |
Collapse
|
4
|
Dickson A, Mofrad MRK. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 2024; 40:btae445. [PMID: 38985218 PMCID: PMC11299545 DOI: 10.1093/bioinformatics/btae445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/11/2024] Open
Abstract
MOTIVATION Proteins with unknown function are frequently compared to better characterized relatives, either using sequence similarity, or recently through similarity in a learned embedding space. Through comparison, protein sequence embeddings allow for interpretable and accurate annotation of proteins, as well as for downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks. RESULTS We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, direct fine-tuning of language models on a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings show stronger performance as representations for K-nearest neighbor classifiers, reaching stronger performance for GO annotation than even directly comparable fine-tuned classifiers, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering. AVAILABILITY AND IMPLEMENTATION github.com/mofradlab/go_metric.
Collapse
Affiliation(s)
- Andrew Dickson
- Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
| | - Mohammad R K Mofrad
- Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
| |
Collapse
|
5
|
Sha M, Parveen Rahamathulla M. Splice site recognition - deciphering Exon-Intron transitions for genetic insights using Enhanced integrated Block-Level gated LSTM model. Gene 2024; 915:148429. [PMID: 38575098 DOI: 10.1016/j.gene.2024.148429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 03/26/2024] [Accepted: 04/01/2024] [Indexed: 04/06/2024]
Abstract
Bioinformatics is a contemporary interdisciplinary area focused on analyzing the growing number of genome sequences. Gene variants are differences in DNA sequences among individuals within a population. Splice site recognition is a crucial step in the process of gene expression, where the coding sequences of genes are joined together to form mature messenger RNA (mRNA). These genetic variants that disrupt genes are believed to be the primary reason for neuro-developmental disorders like ASD (Autism Spectrum Disorder) is a neuro-developmental disorder that is diagnosed in individuals, families, and society and occurs as the developmental delay in one among the hundred genes that are associated with these disorders. Missense variants, premature stop codons, or deletions alter both the quality and quantity of encoded proteins. Predicting genes within exons and introns presents main challenges, such as dealing with sequencing errors, short reads, incomplete genes, overlapping, and more. Although many traditional techniques have been utilized in creating an exon prediction system, the primary challenge lies in accurately identifying the length and spliced strand location classification of exons in conjunction with introns. From now on, the suggested approach utilizes a Deep Learning algorithm to analyze intricate and extensive genomic datasets. M-LSTM is utilized to categorize three binary combinations (EI as 1, IE as 2, and none as 3) using spliced DNA strands. The M-LSTM system is able to sequence extensive datasets, ensuring that long information can be stored without any impact on the current input or output. This enables it to recognize and address long-term connections and problems with rapidly increasing gradients. The proposed model is compared internally with Naïve Bayes and Random Forest to assess its efficacy. Additionally, the proposed model's performance is forecasted by utilizing probabilistic parameters like recall, F1-score, precision, and accuracy to assess the effectiveness of the proposed system.
Collapse
Affiliation(s)
- Mohemmed Sha
- Department of Software Engineering, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al Kharj 11942, Kingdom of Saudi Arabia.
| | - Mohamudha Parveen Rahamathulla
- Department of Basic Medical Sciences, College of Medicine, Prince Sattam bin Abdulaziz University, Al Kharj 11942, Kingdom of Saudi Arabia.
| |
Collapse
|
6
|
Liu T, Song C, Wang C. NCSP-PLM: An ensemble learning framework for predicting non-classical secreted proteins based on protein language models and deep learning. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:1472-1488. [PMID: 38303473 DOI: 10.3934/mbe.2024063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
Non-classical secreted proteins (NCSPs) refer to a group of proteins that are located in the extracellular environment despite the absence of signal peptides and motifs. They usually play different roles in intercellular communication. Therefore, the accurate prediction of NCSPs is a critical step to understanding in depth their associated secretion mechanisms. Since the experimental recognition of NCSPs is often costly and time-consuming, computational methods are desired. In this study, we proposed an ensemble learning framework, termed NCSP-PLM, for the identification of NCSPs by extracting feature embeddings from pre-trained protein language models (PLMs) as input to several fine-tuned deep learning models. First, we compared the performance of nine PLM embeddings by training three neural networks: Multi-layer perceptron (MLP), attention mechanism and bidirectional long short-term memory network (BiLSTM) and selected the best network model for each PLM embedding. Then, four models were excluded due to their below-average accuracies, and the remaining five models were integrated to perform the prediction of NCSPs based on the weighted voting. Finally, the 5-fold cross validation and the independent test were conducted to evaluate the performance of NCSP-PLM on the benchmark datasets. Based on the same independent dataset, the sensitivity and specificity of NCSP-PLM were 91.18% and 97.06%, respectively. Particularly, the overall accuracy of our model achieved 94.12%, which was 7~16% higher than that of the existing state-of-the-art predictors. It indicated that NCSP-PLM could serve as a useful tool for the annotation of NCSPs.
Collapse
Affiliation(s)
- Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Chen Song
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Chunhua Wang
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| |
Collapse
|
7
|
Djeddi WE, Hermi K, Ben Yahia S, Diallo G. Advancing drug-target interaction prediction: a comprehensive graph-based approach integrating knowledge graph embedding and ProtBert pretraining. BMC Bioinformatics 2023; 24:488. [PMID: 38114937 PMCID: PMC10731821 DOI: 10.1186/s12859-023-05593-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023] Open
Abstract
BACKGROUND The pharmaceutical field faces a significant challenge in validating drug target interactions (DTIs) due to the time and cost involved, leading to only a fraction being experimentally verified. To expedite drug discovery, accurate computational methods are essential for predicting potential interactions. Recently, machine learning techniques, particularly graph-based methods, have gained prominence. These methods utilize networks of drugs and targets, employing knowledge graph embedding (KGE) to represent structured information from knowledge graphs in a continuous vector space. This phenomenon highlights the growing inclination to utilize graph topologies as a means to improve the precision of predicting DTIs, hence addressing the pressing requirement for effective computational methodologies in the field of drug discovery. RESULTS The present study presents a novel approach called DTIOG for the prediction of DTIs. The methodology employed in this study involves the utilization of a KGE strategy, together with the incorporation of contextual information obtained from protein sequences. More specifically, the study makes use of Protein Bidirectional Encoder Representations from Transformers (ProtBERT) for this purpose. DTIOG utilizes a two-step process to compute embedding vectors using KGE techniques. Additionally, it employs ProtBERT to determine target-target similarity. Different similarity measures, such as Cosine similarity or Euclidean distance, are utilized in the prediction procedure. In addition to the contextual embedding, the proposed unique approach incorporates local representations obtained from the Simplified Molecular Input Line Entry Specification (SMILES) of drugs and the amino acid sequences of protein targets. CONCLUSIONS The effectiveness of the proposed approach was assessed through extensive experimentation on datasets pertaining to Enzymes, Ion Channels, and G-protein-coupled Receptors. The remarkable efficacy of DTIOG was showcased through the utilization of diverse similarity measures in order to calculate the similarities between drugs and targets. The combination of these factors, along with the incorporation of various classifiers, enabled the model to outperform existing algorithms in its ability to predict DTIs. The consistent observation of this advantage across all datasets underlines the robustness and accuracy of DTIOG in the domain of DTIs. Additionally, our case study suggests that the DTIOG can serve as a valuable tool for discovering new DTIs.
Collapse
Affiliation(s)
- Warith Eddine Djeddi
- LR11ES14, Faculty of Sciences of Tunis, University of Tunis El Manar, Campus Universitaire, 2092, Tunis, Tunisia.
- High Institute of Informatics in Kef, University of Jendouba, Saleh Ayech, 8189, Jendouba, Tunisia.
| | - Khalil Hermi
- High Institute of Informatics in Kef, University of Jendouba, Saleh Ayech, 8189, Jendouba, Tunisia
| | - Sadok Ben Yahia
- Department of Software Science, Tallinn University of Technology, Ehitajate tee-5, 12618, Tallinn, Estonia
- The Maersk Mc-Kinney Moller Institute, Southern Syddansk Universitet, Alsion 2, 6400, Sønderborg, Denmark
| | - Gayo Diallo
- Bordeaux Population Health Inserm 1219, University of Bordeaux, rue Léo Saignat, 33000, Bordeaux, France
| |
Collapse
|
8
|
Liu X, Zhu B, Dai XW, Xu ZA, Li R, Qian Y, Lu YP, Zhang W, Liu Y, Zheng J. GBDT_KgluSite: An improved computational prediction model for lysine glutarylation sites based on feature fusion and GBDT classifier. BMC Genomics 2023; 24:765. [PMID: 38082413 PMCID: PMC10712101 DOI: 10.1186/s12864-023-09834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Lysine glutarylation (Kglu) is one of the most important Post-translational modifications (PTMs), which plays significant roles in various cellular functions, including metabolism, mitochondrial processes, and translation. Therefore, accurate identification of the Kglu site is important for elucidating protein molecular function. Due to the time-consuming and expensive limitations of traditional biological experiments, computational-based Kglu site prediction research is gaining more and more attention. RESULTS In this paper, we proposed GBDT_KgluSite, a novel Kglu site prediction model based on GBDT and appropriate feature combinations, which achieved satisfactory performance. Specifically, seven features including sequence-based features, physicochemical property-based features, structural-based features, and evolutionary-derived features were used to characterize proteins. NearMiss-3 and Elastic Net were applied to address data imbalance and feature redundancy issues, respectively. The experimental results show that GBDT_KgluSite has good robustness and generalization ability, with accuracy and AUC values of 93.73%, and 98.14% on five-fold cross-validation as well as 90.11%, and 96.75% on the independent test dataset, respectively. CONCLUSION GBDT_KgluSite is an effective computational method for identifying Kglu sites in protein sequences. It has good stability and generalization ability and could be useful for the identification of new Kglu sites in the future. The relevant code and dataset are available at https://github.com/flyinsky6/GBDT_KgluSite .
Collapse
Affiliation(s)
- Xin Liu
- School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China.
| | - Bao Zhu
- Cancer Institute, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Xia-Wei Dai
- School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Zhi-Ao Xu
- School of Life Sciences, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Rui Li
- School of Life Sciences, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Yuting Qian
- Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Ya-Ping Lu
- School of Humanities and Arts, China University of Mining and Technology, Xuzhou, Jiangsu, 221116, China
| | - Wenqing Zhang
- School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
| | - Yong Liu
- Cancer Institute, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China.
- Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China.
| | - Junnian Zheng
- Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China.
- Center of Clinical Oncology, The Affiliated Hospital of Xuzhou Medical University, Xuzhou, Jiangsu, 221002, China.
| |
Collapse
|
9
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
| |
Collapse
|
10
|
Jia J, Cao X, Wei Z. DLC-ac4C: A Prediction Model for N4-acetylcytidine Sites in Human mRNA Based on DenseNet and Bidirectional LSTM Methods. Curr Genomics 2023; 24:171-186. [PMID: 38178985 PMCID: PMC10761336 DOI: 10.2174/0113892029270191231013111911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Revised: 09/13/2023] [Accepted: 09/21/2023] [Indexed: 01/06/2024] Open
Abstract
Introduction N4 acetylcytidine (ac4C) is a highly conserved nucleoside modification that is essential for the regulation of immune functions in organisms. Currently, the identification of ac4C is primarily achieved using biological methods, which can be time-consuming and labor-intensive. In contrast, accurate identification of ac4C by computational methods has become a more effective method for classification and prediction. Aim To the best of our knowledge, although there are several computational methods for ac4C locus prediction, the performance of the models they constructed is poor, and the network structure they used is relatively simple and suffers from the disadvantage of network degradation. This study aims to improve these limitations by proposing a predictive model based on integrated deep learning to better help identify ac4C sites. Methods In this study, we propose a new integrated deep learning prediction framework, DLC-ac4C. First, we encode RNA sequences based on three feature encoding schemes, namely C2 encoding, nucleotide chemical property (NCP) encoding, and nucleotide density (ND) encoding. Second, one-dimensional convolutional layers and densely connected convolutional networks (DenseNet) are used to learn local features, and bi-directional long short-term memory networks (Bi-LSTM) are used to learn global features. Third, a channel attention mechanism is introduced to determine the importance of sequence characteristics. Finally, a homomorphic integration strategy is used to limit the generalization error of the model, which further improves the performance of the model. Results The DLC-ac4C model performed well in terms of sensitivity (Sn), specificity (Sp), accuracy (Acc), Mathews correlation coefficient (MCC), and area under the curve (AUC) for the independent test data with 86.23%, 79.71%, 82.97%, 66.08%, and 90.42%, respectively, which was significantly better than the prediction accuracy of the existing methods. Conclusion Our model not only combines DenseNet and Bi-LSTM, but also uses the channel attention mechanism to better capture hidden information features from a sequence perspective, and can identify ac4C sites more effectively.
Collapse
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Xiaojing Cao
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Zhangying Wei
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| |
Collapse
|
11
|
Wang GA, Yan X, Li X, Liu Y, Xia J, Zhu X. MSTL-Kace: Prediction of Prokaryotic Lysine Acetylation Sites Based on Multistage Transfer Learning Strategy. ACS OMEGA 2023; 8:41930-41942. [PMID: 37969991 PMCID: PMC10634282 DOI: 10.1021/acsomega.3c07086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/16/2023] [Revised: 10/11/2023] [Accepted: 10/13/2023] [Indexed: 11/17/2023]
Abstract
As one of the most important post-translational modifications (PTM), lysine acetylation (Kace) plays an important role in various biological activities. Traditional experimental methods for identifying Kace sites are inefficient and expensive. Instead, several machine learning methods have been developed for Kace site prediction, and hand-crafted features have been used to encode the protein sequences. However, there are still two challenges: the complex biological information may be under-represented by these manmade features and the small sample issue of some species needs to be addressed. We propose a novel model, MSTL-Kace, which was developed based on transfer learning strategy with pretrained bidirectional encoder representations from transformers (BERT) model. In this model, the high-level embeddings were extracted from species-specific BERT models, and a two-stage fine-tuning strategy was used to deal with small sample issue. Specifically, a domain-specific BERT model was pretrained using all of the sequences in our data sets, which was then fine-tuned, or two-stage fine-tuned based on the training data set of each species to obtain the species-specific BERT models. Afterward, the embeddings of residues were extracted from the fine-tuned model and fed to the different downstream learning algorithms. After comparison, the best model for the six prokaryotic species was built by using a random forest. The results for the independent test sets show that our model outperforms the state-of-the-art methods on all six species. The source codes and data for MSTL-Kace are available at https://github.com/leo97king/MSTL-Kace.
Collapse
Affiliation(s)
- Gang-Ao Wang
- School
of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
| | - Xiaodi Yan
- School
of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
| | - Xiang Li
- School
of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
| | - Yinbo Liu
- School
of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
| | - Junfeng Xia
- Key
Laboratory of Intelligent Computing and Signal Processing of Ministry
of Education, Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, Anhui, China
| | - Xiaolei Zhu
- School
of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
| |
Collapse
|