1
Savage SR, Zhang Y, Jaehnig EJ, Liao Y, Shi Z, Pham HA, Xu H, Zhang B. IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining. Mol Cell Proteomics 2024; 23:100682. [PMID: 37993103 PMCID: PMC10716774 DOI: 10.1016/j.mcpro.2023.100682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 10/25/2023] [Accepted: 11/14/2023] [Indexed: 11/24/2023]
Abstract
Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge of the functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomic Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites.
Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.
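The precision and recall figures quoted in the abstract are computed over extracted (substrate, site) pairs. A minimal sketch of that calculation follows; it is not the authors' evaluation code, and the function name and the example pairs are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall for extracted (substrate, site) pairs.

    A predicted pair counts as correct only if both the substrate
    and the site position match a curated (gold) pair exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical curated pairs and model output (illustrative only).
gold_pairs = {("AKT1", "S473"), ("AKT1", "T308"), ("TP53", "S15"), ("MAPK1", "Y187")}
predicted_pairs = {("AKT1", "S473"), ("TP53", "S15"), ("TP53", "S20")}

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Matching on the full pair rather than on the site alone means a correct position attached to the wrong protein counts as an error, which is the stricter reading of "pairs of correct substrates and phosphosite positions".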
Affiliation(s)
- Sara R Savage
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Eric J Jaehnig
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Yuxing Liao
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Zhiao Shi
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, Connecticut, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.
2
Arumugam K, Sellappan M, Anand D, Anand S, Radhakrishnan SV. A Text Mining and Machine Learning Protocol for Extracting Posttranslational Modifications of Proteins from PubMed: A Special Focus on Glycosylation, Acetylation, Methylation, Hydroxylation, and Ubiquitination. Methods Mol Biol 2022; 2496:179-202. [PMID: 35713865 DOI: 10.1007/978-1-0716-2305-3_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Posttranslational modifications (PTMs) of proteins play a significant role in human cellular functions ranging from localization to signal transduction. Hundreds of PTMs act in a human cell, but only a select few are well established and documented. PubMed includes thousands of papers on these PTMs, and it is a challenge for biomedical researchers to assimilate useful information manually. Alternatively, text mining approaches and machine learning algorithms can automatically extract the relevant information from PubMed. Protein phosphorylation is a well-established PTM under active study, and many systems exist for extracting protein phosphorylation information. A recent hybrid approach combines text mining and machine learning to extract protein phosphorylation information from PubMed. Some other common PTMs involve similar entities in the PTM process, that is, the substrate, the enzymes, and the amino acid residues; these are glycosylation, acetylation, methylation, hydroxylation, and ubiquitination. This has motivated us to repurpose and extend the text mining protocol and machine learning information extraction methodology developed for protein phosphorylation to these PTMs. In this chapter, the chemistry behind each of the PTMs is briefly outlined, and the adaptation of the text mining protocol and machine learning algorithm is explained for each.
Affiliation(s)
- Krishnamurthy Arumugam
- Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India.
- Malathi Sellappan
- Department of Pharmaceutical Analysis, PSG College of Pharmacy, Coimbatore, Tamilnadu, India
- Dheepa Anand
- Department of Pharmacology, Cheran College of Pharmacy, Coimbatore, Tamilnadu, India
- Sadhanha Anand
- Department of Biomedical Engineering, PSG College of Technology, Coimbatore, Tamilnadu, India
3
Anand S, Iyyappan OR, Manoharan S, Anand D, Jose MA, Shanker RR. Text Mining Protocol to Retrieve Significant Drug-Gene Interactions from PubMed Abstracts. Methods Mol Biol 2022; 2496:17-39. [PMID: 35713857 DOI: 10.1007/978-1-0716-2305-3_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Genes and proteins form the basis of all cellular processes and ensure the smooth functioning of the human system. Human diseases can be genetic in nature or caused by external factors. Genetic diseases mainly result from an anomaly in gene/protein structure or function, which interferes with the normal expression of cellular components. Against external factors, even though each individual's immunity protects them to a certain extent from infections, they remain susceptible to other disease-causing agents. Understanding the biological pathways/entities that could be targeted by specific drugs is an essential component of drug discovery. The traditional drug target discovery process is time-consuming and often impractical. A computational approach could provide speed and efficiency. Given the vast biomedical literature, text mining is an obvious choice that could work alongside other computational methods to identify drug-gene targets. It could aid in the initial stages of reviewing disease components, or work in parallel to extract drug-disease-gene/protein relationships from the literature. The present chapter aims at finding drug-gene interactions and showing how the information could be explored for drug interaction.
Affiliation(s)
- Sadhanha Anand
- Department of Biomedical Engineering, PSG College of Technology, Coimbatore, Tamilnadu, India
- Oviya Ramalakshmi Iyyappan
- Department of Sciences, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Chennai, Tamilnadu, India
- Sharanya Manoharan
- Department of Bioinformatics, Stella Maris College (Autonomous), Chennai, Tamilnadu, India
- Dheepa Anand
- Department of Pharmacology, Cheran College of Pharmacy, Coimbatore, Tamilnadu, India
- Raja Ravi Shanker
- International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India.
4
Automated Extraction and Visualization of Protein-Protein Interaction Networks and Beyond: A Text-Mining Protocol. Methods Mol Biol 2020; 2074:13-34. [PMID: 31583627 DOI: 10.1007/978-1-4939-9873-9_2] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/14/2022]
Abstract
Proteins perform their functions by interacting with other proteins. Protein-protein interaction (PPI) is critical for understanding the functions of individual proteins, the mechanisms of biological processes, and the disease mechanisms. High-throughput experiments accumulated a huge number of PPIs in PubMed articles, and their extraction is possible only through automated approaches. The standard text-mining protocol includes four major tasks, namely, recognizing protein mentions, normalizing protein names and aliases to unique identifiers such as gene symbol, extracting PPIs, and visualizing the PPI network using Cytoscape or other visualization tools. Each task is challenging and has been revised over several years to improve the performance. We present a protocol based on our hybrid approaches and show the possibility of presenting each task as an independent web-based tool, NAGGNER for protein name recognition, ProNormz for protein name normalization, PPInterFinder for PPI extraction, and HPIminer for PPI network visualization. The protocol is specific to human but can be generalized to other organisms. We include KinderMiner, our most recent text-mining tool that predicts PPIs by retrieving significant co-occurring protein pairs. The algorithm is simple, easy to implement, and generalizable to other biological challenges.
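The idea of retrieving "significant co-occurring protein pairs" can be illustrated with a small significance test on article counts. The abstract does not say which test KinderMiner uses, so the sketch below assumes a one-sided Fisher's exact test on a 2x2 co-occurrence table; the function name and all counts are hypothetical.

```python
from math import comb


def cooccurrence_p_value(n_both, n_a, n_b, n_total):
    """One-sided Fisher's exact test for co-occurrence (assumed test).

    n_both  : articles mentioning both protein A and protein B
    n_a     : articles mentioning protein A
    n_b     : articles mentioning protein B
    n_total : articles in the corpus

    Returns the hypergeometric probability of seeing at least
    n_both co-mentions if A and B occurred independently.
    """
    max_overlap = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical counts: 10 articles, A in 3, B in 3, both in all 3.
p = cooccurrence_p_value(n_both=3, n_a=3, n_b=3, n_total=10)
print(f"p-value = {p:.4f}")
```

A small p-value marks a pair as co-occurring more often than chance would suggest. Exact integer arithmetic keeps the sketch dependency-free, but for a corpus the size of PubMed a library routine such as SciPy's `fisher_exact` would be the more practical choice.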