1
Shrestha P, Kandel J, Tayara H, Chong KT. Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model. Nat Commun 2024; 15:6699. [PMID: 39107330 PMCID: PMC11303401 DOI: 10.1038/s41467-024-51071-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 07/29/2024] [Indexed: 08/10/2024] Open
Abstract
Post-translational modifications (PTMs) are pivotal in modulating protein functions and influencing cellular processes like signaling, localization, and degradation. The complexity of these biological interactions necessitates efficient predictive methodologies. In this work, we introduce PTMGPT2, an interpretable protein language model that utilizes prompt-based fine-tuning to improve its accuracy in precisely predicting PTMs. Drawing inspiration from recent advancements in GPT-based architectures, PTMGPT2 adopts unsupervised learning to identify PTMs. It utilizes a custom prompt to guide the model through the subtle linguistic patterns encoded in amino acid sequences, generating tokens indicative of PTM sites. To provide interpretability, we visualize attention profiles from the model's final decoder layer to elucidate sequence motifs essential for molecular recognition and analyze the effects of mutations at or near PTM sites to offer deeper insights into protein functionality. Comparative assessments reveal that PTMGPT2 outperforms existing methods across 19 PTM types, underscoring its potential in identifying disease associations and drug targets.
Affiliation(s)
- Palistha Shrestha
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, Jeollabuk-do, Republic of Korea
- Jeevan Kandel
  - Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju, Jeollabuk-do, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju, Jeollabuk-do, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, Jeollabuk-do, Republic of Korea
  - Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju, Jeollabuk-do, Republic of Korea
2
An Z, Zhai L, Ying W, Qian X, Gong F, Tan M, Fu Y. PTMiner: Localization and Quality Control of Protein Modifications Detected in an Open Search and Its Application to Comprehensive Post-translational Modification Characterization in Human Proteome. Mol Cell Proteomics 2019; 18:391-405. [PMID: 30420486 PMCID: PMC6356076 DOI: 10.1074/mcp.ra118.000812] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/21/2018] [Revised: 11/02/2018] [Indexed: 12/27/2022] Open
Abstract
The open (mass-tolerant) search of tandem mass spectra of peptides shows great potential for the comprehensive detection of post-translational modifications (PTMs) in shotgun proteomics. However, this search strategy has not been widely used by the community, and one bottleneck is the lack of appropriate algorithms for automated and reliable post-processing of the coarse and error-prone search results. Here we present PTMiner, a software tool for confident filtering and localization of modifications (mass shifts) detected in an open search. After mass-shift-grouped false discovery rate (FDR) control of peptide-spectrum matches (PSMs), PTMiner uses an empirical Bayesian method to localize modifications through iterative learning of the prior probabilities of each type of modification occurring on different amino acids. The performance of PTMiner was evaluated on three data sets: simulated data, chemically synthesized peptide library data, and modified-peptide spiked-in proteome data. The results showed that PTMiner can effectively control the PSM FDR and accurately localize the modification sites. At 1% real false localization rate (FLR), PTMiner localized 93%, 84%, and 83% of the modification sites in the three data sets, respectively, far higher than the two open search engines we used and an extended version of the Ascore localization algorithm. We then used PTMiner to analyze a draft map of the human proteome containing 25 million spectra from 30 tissues, and confidently identified over 1.7 million modified PSMs at 1% FDR and 1% FLR, which provided a system-wide view of both known and unknown PTMs in the human proteome.
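The idea of localizing a mass shift with iteratively learned per-residue priors can be illustrated with a small expectation-maximization style loop. This is only a sketch under stated assumptions, not PTMiner's actual algorithm: the function names `learn_site_priors` and `localize`, the input layout (each PSM as a list of candidate residues with a spectrum-match likelihood), and the example numbers are all hypothetical.

```python
from collections import defaultdict


def localize(candidates, prior):
    """Posterior over candidate sites: proportional to prior[residue] * likelihood."""
    weights = [prior.get(aa, 0.0) * lik for aa, lik in candidates]
    norm = sum(weights)
    return [w / norm for w in weights] if norm else [0.0] * len(candidates)


def learn_site_priors(psms, n_iter=50):
    """Iteratively re-estimate P(residue) for one mass shift (hypothetical helper).

    psms: list of PSMs; each PSM is a list of (residue, likelihood) candidates.
    """
    residues = sorted({aa for cands in psms for aa, _ in cands})
    prior = {aa: 1.0 / len(residues) for aa in residues}  # start from a uniform prior
    for _ in range(n_iter):
        counts = defaultdict(float)
        for cands in psms:
            # Soft-assign the modification to each candidate site
            for (aa, _), post in zip(cands, localize(cands, prior)):
                counts[aa] += post
        total = sum(counts.values())
        if total == 0:
            break
        # Updated prior = expected fraction of modifications on each residue
        prior = {aa: counts[aa] / total for aa in residues}
    return prior


# Made-up example: the third PSM is ambiguous from its spectrum alone
psms = [
    [("S", 0.9), ("T", 0.1)],
    [("S", 0.8), ("K", 0.2)],
    [("S", 0.5), ("T", 0.5)],
]
prior = learn_site_priors(psms)
print(prior)
print(localize(psms[2], prior))
```

In this toy input the learned prior favors serine, so the ambiguous third PSM is resolved toward the serine site. A production implementation would presumably also need regularization and an explicit false localization rate estimate, neither of which is sketched here.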
Affiliation(s)
- Zhiwu An
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Linhui Zhai
  - State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Wantao Ying
  - State Key Laboratory of Proteomics, National Center for Protein Sciences Beijing, Beijing Proteome Research Center, National Engineering Research Center for Protein Drugs, Beijing 102206, China, Beijing Institute of Lifeomics, Beijing 100850, China
- Xiaohong Qian
  - State Key Laboratory of Proteomics, National Center for Protein Sciences Beijing, Beijing Proteome Research Center, National Engineering Research Center for Protein Drugs, Beijing 102206, China, Beijing Institute of Lifeomics, Beijing 100850, China
- Fuzhou Gong
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Minjia Tan
  - State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Yan Fu
  - National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3
Liu J, Han J, Lv H. ADPRtool: A novel predicting model for identification of ASP-ADP-Ribosylation sites of human proteins. J Bioinform Comput Biol 2015; 13:1550015. [PMID: 26017462 DOI: 10.1142/s0219720015500158] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023]
Abstract
Post-translational modifications (PTMs) occur in the vast majority of proteins, and they are essential for many protein functions. Computational prediction of the residue location of PTMs enhances the functional characterization of proteins. ADP-Ribosylation is an important type of PTM, because it is implicated in apoptosis, DNA repair, regulation of cell proliferation, and protein synthesis. However, mass spectrometric approaches have difficulties in identifying a vast number of protein ADP-Ribosylation sites. Therefore, a computational method for predicting ADP-Ribosylation sites of human proteins seems useful and necessary. Four types of sequence features and an incremental feature selection technique are utilized to predict protein ADP-Ribosylation sites. The final feature set for ADPR prediction modeling is optimized, based on a minimum redundancy maximum relevance criterion, so as to make more accurate predictions on aspartic acid ADPR modified residues. Our prediction model, ADPRtool, can predict Asp-ADP-Ribosylation sites with a total accuracy of 85.45%, which is as good as most computational PTM site predictors. By using a sequence-based computational method, a new ADP-Ribosylation site prediction model, ADPRtool, is developed, and it performs well in terms of total accuracy, Matthews correlation coefficient and area under the receiver operating characteristic curve.
Affiliation(s)
- Jun Liu
  - School of Electrical Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, P. R. China
- Jiuqiang Han
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, P. R. China
- Hongqiang Lv
  - School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, P. R. China
4
Lv H, Han J, Liu J, Zheng J, Liu R, Zhong D. CarSPred: a computational tool for predicting carbonylation sites of human proteins. PLoS One 2014; 9:e111478. [PMID: 25347395 PMCID: PMC4210226 DOI: 10.1371/journal.pone.0111478] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 06/03/2014] [Accepted: 09/26/2014] [Indexed: 12/15/2022] Open
Abstract
Protein carbonylation is one of the most pervasive oxidative stress-induced post-translational modifications (PTMs), and it plays a significant role in the etiology and progression of several human diseases. It has been regarded as a biomarker of oxidative stress due to its relatively early formation and stability compared with other oxidative PTMs. Only a subset of proteins is prone to carbonylation and most carbonyl groups are formed from lysine (K), arginine (R), threonine (T) and proline (P) residues. Recent advancements in analysis of the PTM by mass spectrometry provided new insights into the mechanisms of protein carbonylation, such as protein susceptibility and exact modification sites. However, the experimental approaches to identifying carbonylation sites are costly, time-consuming and capable of processing only a limited number of proteins, and there is no bioinformatics method or tool devoted to predicting carbonylation sites of human proteins so far. In the paper, a computational method is proposed to identify carbonylation sites of human proteins. The method extracted four kinds of features and combined the minimum Redundancy Maximum Relevance (mRMR) feature selection criterion with a weighted support vector machine (WSVM) to achieve total accuracies of 85.72%, 85.95%, 83.92% and 85.72% for K, R, T and P carbonylation site predictions respectively using 10-fold cross-validation. The final optimal feature sets were analysed, and the position-specific composition and hydrophobicity environment of flanking residues of modification sites were discussed. In addition, a software tool named CarSPred has been developed to facilitate the application of the method. Datasets and the software involved in the paper are available at https://sourceforge.net/projects/hqlstudio/files/CarSPred-1.0/.
Affiliation(s)
- Hongqiang Lv
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
- Jiuqiang Han
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
  - * E-mail: (JQH); (JL)
- Jun Liu
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
  - * E-mail: (JQH); (JL)
- Jiguang Zheng
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
- Ruiling Liu
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
- Dexing Zhong
  - School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
5
Moreira GMSG, Conceição FR, McBride AJA, Pinto LDS. Structure predictions of two Bauhinia variegata lectins reveal patterns of C-terminal properties in single chain legume lectins. PLoS One 2013; 8:e81338. [PMID: 24260572 PMCID: PMC3834338 DOI: 10.1371/journal.pone.0081338] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/01/2013] [Accepted: 10/15/2013] [Indexed: 11/18/2022] Open
Abstract
Bauhinia variegata lectins (BVL-I and BVL-II) are single chain lectins isolated from the plant Bauhinia variegata. Single chain lectins undergo post-translational processing on their N-terminal and C-terminal regions, which determines their physiological targeting, carbohydrate binding activity and pattern of quaternary association. These two lectins are isoforms, BVL-I being highly glycosylated, and thus far, it has not been possible to determine their structures. The present study used prediction and validation algorithms to elucidate the likely structures of BVL-I and -II. The program Bhageerath-H was chosen from among three different structure prediction programs due to its better overall reliability. In order to predict the C-terminal region cleavage sites, other lectins known to have this modification were analysed and three rules were created: (1) the first amino acid of the excised peptide is small or hydrophobic; (2) the cleavage occurs after an acidic, polar, or hydrophobic residue, but not after a basic one; and (3) the cleavage spot is located 5-8 residues after a conserved Leu amino acid. These rules predicted that BVL-I and -II would have fifteen C-terminal residues cleaved, and this was confirmed experimentally by Edman degradation sequencing of BVL-I. Furthermore, the C-terminal analyses predicted that only BVL-II underwent α-helical folding in this region, similar to that seen in SBA and DBL. Conversely, BVL-I and -II contained four conserved regions of a GS-I association, providing evidence of a previously undescribed X4+unusual oligomerisation between the truncated BVL-I and the intact BVL-II. This is the first report on the structural analysis of lectins from Bauhinia spp. and therefore is important for the characterisation of C-terminal cleavage and patterns of quaternary association of single chain lectins.
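The three cleavage rules in the abstract lend themselves to a simple scan over a sequence. The sketch below is an illustration only: the residue class sets, the function name `candidate_cleavage_sites`, the interpretation of "5-8 residues after" as the distance from the Leu to the last retained residue, and the test sequence are all assumptions, and the position of the conserved Leu is taken as an input because finding it would require an alignment.

```python
# Residue classes are assumptions made for this sketch, not taken from the paper.
SMALL = set("GAS")
HYDROPHOBIC = set("AVLIMFWP")
ACIDIC = set("DE")
POLAR = set("STNQYC")
BASIC = set("KRH")


def candidate_cleavage_sites(seq, leu_index, min_gap=5, max_gap=8):
    """Return 0-based indices of the first excised residue for each allowed cut.

    seq: protein sequence in one-letter code.
    leu_index: 0-based position of the conserved Leu (supplied by the caller).
    """
    if seq[leu_index] != "L":
        raise ValueError("leu_index must point at a Leu residue")
    sites = []
    for gap in range(min_gap, max_gap + 1):  # rule 3: 5-8 residues after the Leu
        last_kept = leu_index + gap
        first_excised = last_kept + 1
        if first_excised >= len(seq):
            break
        before, after = seq[last_kept], seq[first_excised]
        # Rule 1: first residue of the excised peptide is small or hydrophobic
        rule1 = after in SMALL or after in HYDROPHOBIC
        # Rule 2: cut follows an acidic, polar, or hydrophobic residue, never a basic one
        rule2 = before in (ACIDIC | POLAR | HYDROPHOBIC) and before not in BASIC
        if rule1 and rule2:
            sites.append(first_excised)
    return sites


# Made-up sequence with the conserved Leu at index 0
print(candidate_cleavage_sites("LNNNNKDAKRNN", 0))  # [7]: cut between D (index 6) and A (index 7)
```

Only the cut between D and A passes all three rules here: the other positions in the 5-8 window either follow a basic residue or would start the excised peptide with a residue that is neither small nor hydrophobic.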
Affiliation(s)
- Gustavo M. S. G. Moreira
  - Centro de Desenvolvimento Tecnológico, Núcleo de Biotecnologia, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul, Brazil
- Fabricio R. Conceição
  - Centro de Desenvolvimento Tecnológico, Núcleo de Biotecnologia, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul, Brazil
- Alan J. A. McBride
  - Centro de Desenvolvimento Tecnológico, Núcleo de Biotecnologia, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul, Brazil
- Luciano da S. Pinto
  - Centro de Desenvolvimento Tecnológico, Núcleo de Biotecnologia, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul, Brazil
6
Kertész-Farkas A, Reiz B, Vera R, Myers MP, Pongor S. PTMTreeSearch: a novel two-stage tree-search algorithm with pruning rules for the identification of post-translational modification of proteins in MS/MS spectra. Bioinformatics 2013; 30:234-41. [PMID: 24215026 DOI: 10.1093/bioinformatics/btt642] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/01/2023]
Abstract
MOTIVATION: Tandem mass spectrometry has become a standard tool for identifying post-translational modifications (PTMs) of proteins. Algorithmic searches for PTMs from tandem mass spectrum data (MS/MS) tend to be hampered by noisy data as well as by a combinatorial explosion of search space. This leads to high uncertainty and long search-execution times.
RESULTS: To address this issue, we present PTMTreeSearch, a new algorithm that uses a large database of known PTMs to identify PTMs from MS/MS data. For a given peptide sequence, PTMTreeSearch builds a computational tree wherein each path from the root to the leaves is labeled with the amino acids of a peptide sequence. Branches then represent PTMs. Various empirical tree pruning rules have been designed to decrease the search-execution time by eliminating biologically unlikely solutions. PTMTreeSearch first identifies a relatively small set of high confidence PTM types, and in a second stage, performs a more exhaustive search on this restricted set using relaxed search parameter settings. An analysis of experimental data shows that using the same criteria for false discovery, PTMTreeSearch annotates more peptides than the current state-of-the-art methods and PTM identification algorithms, and achieves this at roughly the same execution time. PTMTreeSearch is implemented as a pluggable scoring function in the X!Tandem search engine.
AVAILABILITY: The source code of PTMTreeSearch and a demo server application can be found at http://net.icgeb.org/ptmtreesearch
Affiliation(s)
- Attila Kertész-Farkas
  - Protein Structure and Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, AREA Research Park, 99 Padriciano, Trieste, Italy, 34149, Institute of Biophysics, Biological Research Centre, Temesvari krt. 62, H-6727 Szeged, Hungary, Protein Networks Group, International Centre for Genetic Engineering and Biotechnology, AREA Research Park, Padriciano 99, 34149 Trieste, Italy and Faculty of Information Technology, Pázmány Péter Catholic University, Práter u. 50/a, H-1083 Budapest, Hungary
7
Chung C, Emili A, Frey BJ. Non-parametric Bayesian approach to post-translational modification refinement of predictions from tandem mass spectrometry. Bioinformatics 2013; 29:821-9. [DOI: 10.1093/bioinformatics/btt056] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
8
Suo SB, Qiu JD, Shi SP, Sun XY, Huang SY, Chen X, Liang RP. Position-specific analysis and prediction for protein lysine acetylation based on multiple features. PLoS One 2012; 7:e49108. [PMID: 23173045 PMCID: PMC3500252 DOI: 10.1371/journal.pone.0049108] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Received: 07/14/2012] [Accepted: 10/04/2012] [Indexed: 11/17/2022] Open
Abstract
Protein lysine acetylation is a type of reversible post-translational modification that plays a vital role in many cellular processes, such as transcriptional regulation, apoptosis and cytokine signaling. To fully decipher the molecular mechanisms of acetylation-related biological processes, an initial but crucial step is the recognition of acetylated substrates and the corresponding acetylation sites. In this study, we developed a position-specific method named PSKAcePred for lysine acetylation prediction based on support vector machines. The residues around the acetylation sites were selected or excluded based on their entropy values. We incorporated features of amino acid composition information, evolutionary similarity and physicochemical properties to predict lysine acetylation sites. The prediction model achieved an accuracy of 79.84% and a Matthews correlation coefficient of 59.72% using the 10-fold cross-validation on balanced positive and negative samples. A feature analysis showed that all features applied in this method contributed to the acetylation process. A position-specific analysis showed that the features derived from the critical neighboring residues contributed profoundly to the acetylation site determination. The detailed analysis in this paper can help us to understand more of the acetylation mechanism and can provide guidance for the related experimental validation.
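For readers comparing the two reported metrics, accuracy and the Matthews correlation coefficient (MCC) can both be computed from a confusion matrix. The sketch below uses hypothetical counts chosen only for illustration; they are not the study's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical balanced test set: 100 positive and 100 negative sites
tp, tn, fp, fn = 80, 80, 20, 20
print(accuracy(tp, tn, fp, fn))           # 0.8
print(matthews_corrcoef(tp, tn, fp, fn))  # 0.6
```

On a balanced set with equal numbers of false positives and false negatives, MCC reduces to 2 × accuracy − 1, which is why an accuracy near 80% goes with an MCC near 60% in this toy case.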
Affiliation(s)
- Sheng-Bao Suo
  - Department of Chemistry, Nanchang University, Nanchang, China
9
Sharma N, Martin A, McCabe CJ. Mining the proteome: the application of tandem mass spectrometry to endocrine cancer research. Endocr Relat Cancer 2012; 19:R149-61. [PMID: 22555494 DOI: 10.1530/erc-12-0036] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/15/2022]
Abstract
Tandem mass spectrometry (MS/MS) permits the detection of femtomolar quantities of protein from a wide variety of tissue sources. As endocrine cancers are frequently aetiologically complex, they are particularly amenable to mass spectrometry. The most widely studied aspect is the search for novel reliable biomarkers that would allow cancers to be diagnosed earlier and distinguished from benign tumours. MS/MS allows for the rapid analysis of blood and urine in addition to tumour tissue, and in this regard it has been applied to research involving thyroid, pancreatic, adrenal and ovarian cancers with varying degrees of success, as well as additional organ sites including breast and lung. The description of an individual cancer proteome potentially allows for personalized management of each patient, avoiding unnecessary therapies and targeting treatments to those which will have the most effect. The application of MS/MS to interaction proteomics is a field that has generated recent novel targets for chemotherapy. However, the technology involved in MS/MS has a number of drawbacks that at present prevent its widespread use in translational cancer research, including poor reproducibility of results, in part due to the large amount of data generated and the inability to accurately differentiate true from false-positive results. Further, the current cost of running MS/MS restricts the number of times the experiments can be repeated, contributing to the lack of significance and concordance between studies. Despite these problems, however, MS/MS is emerging as a front-line tool in endocrine cancer research and it is likely that this will continue over the next decade.
Affiliation(s)
- Neil Sharma
  - School of Clinical and Experimental Medicine, Institute for Biomedical Research and School of Cancer Sciences, University of Birmingham, Birmingham B15 2TT, UK
10
Gao PP, Wang WH, Wang J, Li J, Dong XH. Proteomic profiling of Helicobacter pylori treated with celecoxib. Shijie Huaren Xiaohua Zazhi 2011; 19:1785-1790. [DOI: 10.11569/wcjd.v19.i17.1785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/06/2023] Open
Abstract
AIM: To perform a proteomic investigation of the effect of celecoxib on Helicobacter pylori (H. pylori).
METHODS: Total proteins of untreated and celecoxib-treated H. pylori 26695 were extracted and separated by two-dimensional polyacrylamide gel electrophoresis (2-DE). Differential protein expression was detected using computer-assisted image analysis. Differential proteins were identified by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) and matrix-assisted laser desorption/ionization time-of-flight tandem mass spectrometry (MALDI-TOF-MS/MS). The levels of mRNA expression were measured by real-time polymerase chain reaction.
RESULTS: Seventeen differentially expressed spots were detected between untreated and celecoxib-treated H. pylori 26695. Seven spots were positively identified as three proteins: heat shock protein 60 (HSP60), elongation factor TU (EF-TU) and gamma-glutamyltranspeptidase (GGT). The protein expression of HSP60, GGT, and EF-TU, and mRNA expression of GGT and EF-TU were down-regulated (0.07 ± 0.06 vs 1.01 ± 0.16; 0.31 ± 0.13 vs 0.98 ± 0.01, both P < 0.05), while the mRNA expression of HSP60 was up-regulated in the presence of celecoxib (1.85 ± 0.26 vs 1.07 ± 0.27, P < 0.05).
CONCLUSION: Celecoxib could down-regulate the protein expression of HSP60, GGT and EF-TU and mRNA expression of GGT and EF-TU in H. pylori; however, the mRNA expression of HSP60 was up-regulated. These results suggest that celecoxib might interfere with the pathogenicity of H. pylori.