1. Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In modern health care research, protein phosphorylation has gained enormous attention from researchers across the globe, and automated approaches are required to process the huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level are unique as well as arbitrary, and the accumulation of a massive volume of information is inevitable. Biological research has revealed that a huge array of cellular communication, aided by protein phosphorylation and other similar mechanisms, implies different and diverse meanings. This has led to the collection of a huge volume of data to understand the biological functions of human evolution, especially for combating diseases more effectively. Text mining, an automated approach to mining information from unstructured data, finds application in extracting protein phosphorylation information from biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm, for classifying the processed text related to human protein phosphorylation. We discuss the evaluation of the text mining system that results from the protocol on three corpora, namely, the human Protein Phosphorylation (hPP) corpus, the Integrated Protein Literature Information and Knowledge corpus (iProLink), and the Phosphorylation Literature corpus (PLC). We also present a basic understanding of the chemistry and biology that drive the protein phosphorylation process in the human body. We believe that this basic understanding will be useful for advancing existing text mining systems for extracting protein phosphorylation information from PubMed.
Affiliation(s)
- Krishnamurthy Arumugam
  - Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India.
- Raja Ravi Shanker
  - International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India.
2. Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021; 118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop & test new gene text mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved both on precision and recall from existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC with their results freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).
Affiliation(s)
- Rezarta Islamaj
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Chih-Hsuan Wei
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- David Cissel
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Nicholas Miliaras
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Olga Printseva
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Oleg Rodionov
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Keiko Sekiya
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Janice Ward
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Zhiyong Lu
  - National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
3. Poverennaya EV, Kiseleva OI, Ivanov AS, Ponomarenko EA. Methods of Computational Interactomics for Investigating Interactions of Human Proteoforms. Biochemistry (Moscow) 2020; 85:68-79. [PMID: 32079518 DOI: 10.1134/s000629792001006x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/20/2023]
Abstract
The human genome contains ca. 20,000 protein-coding genes that could be translated into millions of unique protein species (proteoforms). Proteoforms coded by a single gene often have different functions, which implies different protein partners. By interacting with each other, proteoforms create a network reflecting the dynamics of cellular processes in an organism. Perturbations of protein-protein interactions change the network topology, which often triggers pathological processes. Studying proteoforms is a relatively new research area in proteomics, which is why there are comparatively few experimental studies on the interaction of proteoforms. Bioinformatics tools can facilitate such studies by providing valuable complementary information to the experimental data and, in particular, by expanding the possibilities for studying proteoform interactions.
Affiliation(s)
- O I Kiseleva
  - Institute of Biomedical Chemistry, Moscow, 119121, Russia.
- A S Ivanov
  - Institute of Biomedical Chemistry, Moscow, 119121, Russia.
4. Zhao S, Feng J, Li C, Gao H, Lv P, Li J, Liu Q, He Y, Wang H, Gong L, Li D, Zhang Y. Phosphoproteome profiling revealed abnormally phosphorylated AMPK and ATF2 involved in glucose metabolism and tumorigenesis of GH-PAs. J Endocrinol Invest 2019; 42:137-148. [PMID: 29691806 DOI: 10.1007/s40618-018-0890-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 11/29/2017] [Accepted: 04/11/2018] [Indexed: 01/04/2023]
Abstract
PURPOSE: Protein phosphorylation plays a key role in tumorigenesis and progression. However, little is known about the phosphoproteome profiles of growth hormone-secreting pituitary adenomas (GH-PAs). The aim of this study was to identify critical biomarkers and signaling pathways that might play important roles in GH-PAs and may, therefore, represent potential therapeutic targets.
METHODS: The differential phosphoprotein expression patterns involved in GH-PAs were investigated by nano-LC-MS/MS in a group of samples. The phosphoprotein expression data were analyzed by bioinformatics. The expression levels of the candidate phosphorylated AMPK (ser496) and ATF2 (ser112) were validated by Western blot analysis in another group of samples.
RESULTS: A total of 1213 phosphorylated protein sites corresponding to 667 proteins were significantly different between GH-PAs and healthy pituitary glands. Among these phosphorylated sites, 871 exhibited lower levels of phosphorylation in GH-PAs. Moreover, 140 novel phosphosites corresponding to 93 proteins were differentially phosphorylated between GH-PAs and healthy pituitary glands, 101 of which showed decreased phosphorylation in GH-PAs. The majority of differentially expressed phosphorylated proteins were significantly enriched in glycolysis and the AMPK signaling pathway in GH-PAs. The AMPK signaling pathway was demonstrated to be inhibited in GH-PAs by pathway activity analysis (z score = -2.324). Notably, the phosphorylated levels of AMPK (ser496) and ATF2 (ser112) were significantly lower in GH-PAs than in healthy pituitary glands.
CONCLUSION: These findings suggest that decreased phosphorylation of the AMPK/ATF2 pathway may be critical for glucose metabolism and tumorigenesis in GH-PAs.
Affiliation(s)
- S Zhao
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- J Feng
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- C Li
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- H Gao
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- P Lv
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
  - Chinese Medical Association, Beijing, 100710, China.
- J Li
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- Q Liu
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- Y He
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- H Wang
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- L Gong
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- D Li
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
- Y Zhang
  - Beijing Neurosurgical Institute, Capital Medical University, TianTanXiLi6, Beijing, 100050, China.
  - Beijing Tiantan Hospital, Capital Medical University, Beijing, 100050, China.
  - Beijing Institute for Brain Disorders Brain Tumor Center, Capital Medical University, Beijing, 100050, China.
  - China National Clinical Research Center for Neurological Diseases, Beijing, 100050, China.