Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wei CH, Harris BR, Kao HY, Lu Z. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29:1433-9. [PMID: 23564842 DOI: 10.1093/bioinformatics/btt156] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.9] [Indexed: 11/12/2022] Open

Cited by Other Article(s)

Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024;31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open

Affiliation(s)

Ling Luo School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jinzhong Ning School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yingwen Zhao School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhijun Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zeyuan Ding School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Peng Chen School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Weiru Fu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Qinyu Han School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Guangtao Xu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yunzhi Qiu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Dinghao Pan School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jiru Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hao Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Wenduo Feng School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Senbo Tu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuqi Liu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhihao Yang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jian Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuanyuan Sun School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hongfei Lin School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China

Zong H, Wu R, Cha J, Feng W, Wu E, Li J, Shao A, Tao L, Li Z, Tang B, Shen B. Advancing Chinese biomedical text mining with community challenges. J Biomed Inform 2024;157:104716. [PMID: 39197732 DOI: 10.1016/j.jbi.2024.104716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 08/22/2024] [Accepted: 08/25/2024] [Indexed: 09/01/2024]

Affiliation(s)

Hui Zong Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Rongrong Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiaxue Cha Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Weizhe Feng Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Erman Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiakun Li Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China; Department of Urology, West China Hospital, Sichuan University, Chengdu 610041, China
Aibin Shao Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Liang Tao Faculty of Business Information, Shanghai Business School, Shanghai 201400, China
Zuofeng Li Takeda Co. Ltd., Shanghai 200040, China
Buzhou Tang Department of Computer Science, Harbin Institute of Technology, Shenzhen 518055, China
Bairong Shen Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China.

Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024;2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. 
All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Hsieh AR, Tsai CY. Biomedical literature mining: graph kernel-based learning for gene-gene interaction extraction. Eur J Med Res 2024;29:404. [PMID: 39095899 PMCID: PMC11297645 DOI: 10.1186/s40001-024-01983-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 07/17/2024] [Indexed: 08/04/2024] Open

Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024;23:1915-1925. [PMID: 38733346 PMCID: PMC11165580 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 01/30/2024] [Accepted: 04/29/2024] [Indexed: 05/13/2024]

Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024;100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open

Sun Z, Tao C. Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria. IEEE International Conference on Healthcare Informatics 2023;2023:558-564. [PMID: 38283164 PMCID: PMC10815931 DOI: 10.1109/ichi57859.2023.00100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]

Abstract

Alzheimer's Disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Finding effective treatments for this disease is crucial. Clinical trials play an essential role in developing and testing new treatments for AD. However, identifying eligible participants can be challenging, time-consuming, and costly. In recent years, the development of natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), have helped to automate the identification and extraction of relevant information from the eligibility criteria (EC) more efficiently, in order to facilitate semi-automatic patient recruitment and enable data FAIRness for clinical trial data. Nevertheless, most current biomedical NER models only provide annotations for a restricted set of entity types that may not be applicable to the clinical trial data. Additionally, accurately performing NEN on entities that are negated using a negative prefix currently lacks established techniques. In this paper, we introduce a pipeline designed for information extraction from AD clinical trial EC, which involves preprocessing of the EC data, clinical NER, and biomedical NEN to Unified Medical Language System (UMLS). Our NER model can identify named entities in seven pre-defined categories, while our NEN model employs a combination of exact match and partial match search strategies, as well as customized rules to accurately normalize entities with negative prefixes. To evaluate the performance of our pipeline, we measured the precision, recall, and F1 score for the NER component, and we manually reviewed the top five mapping results produced by the NEN component. Our evaluation of the pipeline's performance revealed that it can successfully normalize named entities in clinical trial ECs with optimal accuracies. 
The NER component achieved an overall F1 of 0.816, demonstrating its ability to accurately identify seven types of named entities in clinical text. The NEN component of the pipeline also demonstrated impressive performance, with customized rules and a combination of exact and partial match strategies leading to an accuracy of 0.940 for normalized entities.

Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022;23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open

Abstract

BACKGROUND

Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, the above method often underutilizes syntactic features such as dependencies and topology of sentences. Therefore, it is an urgent problem to be solved to integrate semantic and syntactic features into the BioNER model.

RESULTS

In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulate the BioNER task as a node classification problem. This formulation can introduce more topological features of language and no longer be only concerned about the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as part of speeches, dependencies and topology are preprocessed by SpaCy respectively. A graph attention network is then used to generate a fusing representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively.

CONCLUSION

The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task into a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.

Li J, Gao J, Feng B, Jing Y. PlagueKD: a knowledge graph-based plague knowledge database. Database (Oxford) 2022;2022:6837306. [PMID: 36412326 PMCID: PMC10161524 DOI: 10.1093/database/baac100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Revised: 10/17/2022] [Accepted: 10/28/2022] [Indexed: 11/23/2022]

Abstract

Plague has been confirmed as an extremely horrific international quarantine infectious disease attributed to Yersinia pestis. It has an extraordinarily high lethal rate that poses a serious hazard to human and animal lives. With the deepening of research, there has been a considerable amount of literature related to the plague that has never been systematically integrated. Consequently, researchers find it time-consuming and laborious to conduct investigations. Accordingly, integrating and excavating plague-related knowledge from considerable literature takes on a critical significance. Moreover, a comprehensive plague knowledge base should be urgently built. To solve the above issues, the plague knowledge base is built for the first time. A database is built through literature mining based on a knowledge graph, which is capable of storing, retrieving, managing and accessing data. First, 5388 plague-related abstracts that were obtained automatically from PubMed are integrated, and a plague entity dictionary and an ontology knowledge base are constructed by using text mining technology. Second, the scattered plague-related knowledge is correlated through knowledge graph technology. A multifactor correlation knowledge graph centered on plague is formed, which contains 9633 nodes of 33 types (e.g. disease, gene, protein, species, symptom, treatment and geographic location), as well as 9466 association relations (e.g. disease-gene, gene-protein and disease-species). The Neo4j graph database is adopted to store and manage the relational data in the form of triples. Lastly, a plague knowledge base is built, which can successfully manage and visualize a large amount of structured plague-related data. This knowledge base provides largely integrated and comprehensive plague-related knowledge. 
It should not only help researchers to better understand the complex pathogenesis and potential therapeutic approaches of plague but also take on a key significance to reference for exploring potential action mechanisms of corresponding drug candidates and the development of vaccine in the future. Furthermore, it is of great significance to promote the field of plague research. Researchers are enabled to acquire data more easily for more effective research. Database URL: http://39.104.28.169:18095/.

Tong Y, Tan F, Huang H, Zhang Z, Zong H, Xie Y, Huang D, Cheng S, Wei Z, Fang M, Crabbe MJC, Wang Y, Zhang X. ViMRT: a text-mining tool and search engine for automated virus mutation recognition. Bioinformatics 2022;39:6808671. [PMID: 36342236 PMCID: PMC9805560 DOI: 10.1093/bioinformatics/btac721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 10/24/2022] [Accepted: 11/04/2022] [Indexed: 11/09/2022] Open

Affiliation(s)

Yuantao Tong Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Fanglin Tan Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Honglian Huang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Zeyu Zhang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Hui Zong Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Yujia Xie Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Danqi Huang Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Shiyang Cheng Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Ziyi Wei Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Meng Fang Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai 200438, China
M James C Crabbe Wolfson College, Oxford University, Oxford OX2 6UD, UK; Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton LU1 3JU, UK; School of Life Sciences, Shanxi University, Taiyuan 030006, China
Ying Wang To whom correspondence should be addressed.
Xiaoyan Zhang To whom correspondence should be addressed.

Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022;23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open

Wei CH, Allot A, Riehle K, Milosavljevic A, Lu Z. tmVar 3.0: an improved variant concept recognition and normalization tool. Bioinformatics 2022;38:4449-4451. [PMID: 35904569 PMCID: PMC9477515 DOI: 10.1093/bioinformatics/btac537] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/07/2022] [Revised: 07/07/2022] [Accepted: 07/27/2022] [Indexed: 12/24/2022] Open

Bajic VP, Salhi A, Lakota K, Radovanovic A, Razali R, Zivkovic L, Spremo-Potparevic B, Uludag M, Tifratene F, Motwalli O, Marchand B, Bajic VB, Gojobori T, Isenovic ER, Essack M. DES-Amyloidoses “Amyloidoses through the looking-glass”: A knowledgebase developed for exploring and linking information related to human amyloid-related diseases. PLoS One 2022;17:e0271737. [PMID: 35877764 PMCID: PMC9312389 DOI: 10.1371/journal.pone.0271737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2021] [Accepted: 07/06/2022] [Indexed: 11/23/2022] Open

Affiliation(s)

Vladan P. Bajic Institute of Nuclear Sciences “VINCA”, Laboratory for Radiobiology and Molecular Genetics, University of Belgrade, Belgrade, Republic of Serbia
Adil Salhi Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Katja Lakota Department of Physiology, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia
Aleksandar Radovanovic Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Rozaimi Razali Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Lada Zivkovic Department of Physiology, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia
Biljana Spremo-Potparevic Department of Pathobiology, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia
Mahmut Uludag Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Faroug Tifratene Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Olaa Motwalli Saudi Electronic University (SEU), College of Computing and Informatics, Madinah, Kingdom of Saudi Arabia
Benoit Marchand New York University, Abu Dhabi, UAE
Vladimir B. Bajic Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Takashi Gojobori Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia; Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
Esma R. Isenovic Institute of Nuclear Sciences “VINCA”, Laboratory for Radiobiology and Molecular Genetics, University of Belgrade, Belgrade, Republic of Serbia
Magbubah Essack Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia

Mallick R, Arnaboldi V, Davis P, Diamantakis S, Zarowiecki M, Howe K. Accelerated variant curation from scientific literature using biomedical text mining. microPublication Biology 2022;2022:10.17912/micropub.biology.000578. [PMID: 35663412 PMCID: PMC9160977 DOI: 10.17912/micropub.biology.000578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 05/19/2022] [Accepted: 06/01/2022] [Indexed: 11/20/2022]

Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022;2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]

Abstract

The major outcomes and insights of scientific research and clinical study end up in the form of publication or clinical record in an unstructured text format. Due to advancements in biomedical research, the published literature has grown tremendously in recent years. Scientists and clinical researchers face a big challenge in staying current with the knowledge and extracting hidden information from the sheer quantity of millions of published biomedical articles. The potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover the disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, the empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach using text mining to identify the disease-causing genes can contribute towards biomarker discovery. This chapter presents a protocol on combining literature mining and machine learning for predicting biomedical discoveries with a special emphasis on gene-disease relation based discovery. The protocol is presented as a literature based discovery (LBD) pipeline for gene-disease based discovery. The protocol includes our web based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning based method for association discovery. Our proposed deep learning based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drug/chemical, or miRNA.

Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021;4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/09/2022]

AlSaieedi A, Salhi A, Tifratene F, Raies AB, Hungler A, Uludag M, Van Neste C, Bajic VB, Gojobori T, Essack M. DES-Tcell is a knowledgebase for exploring immunology-related literature. Sci Rep 2021;11:14344. [PMID: 34253812 PMCID: PMC8275784 DOI: 10.1038/s41598-021-93809-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Accepted: 06/24/2021] [Indexed: 12/02/2022] Open

Abstract

T-cells are a subtype of white blood cells circulating throughout the body, searching for infected and abnormal cells. They have multifaceted functions that include scanning for and directly killing cells infected with intracellular pathogens, eradicating abnormal cells, orchestrating immune response by activating and helping other immune cells, memorizing encountered pathogens, and providing long-lasting protection upon recurrent infections. However, T-cells are also involved in immune responses that result in organ transplant rejection, autoimmune diseases, and some allergic diseases. To support T-cell research, we developed the DES-Tcell knowledgebase (KB). This KB incorporates text- and data-mined information that can expedite retrieval and exploration of T-cell relevant information from the large volume of published T-cell-related research. This KB enables exploration of data through concepts from 15 topic-specific dictionaries, including immunology-related genes, mutations, pathogens, and pathways. We developed three case studies using DES-Tcell, one of which validates effective retrieval of known associations by DES-Tcell. The second and third case studies focus on concepts that are common to Graves’ disease (GD) and Hashimoto’s thyroiditis (HT). Several reports have shown that up to 20% of GD patients treated with antithyroid medication develop HT, thus suggesting a possible conversion or shift from GD to HT disease. DES-Tcell found miR-4442 links to both GD and HT, and that miR-4442 possibly targets the autoimmune disease risk factor CD6, which provides potential new knowledge derived through the use of DES-Tcell. According to our understanding, DES-Tcell is the first KB dedicated to exploring T-cell-relevant information via literature-mining, data-mining, and topic-specific dictionaries.

Affiliation(s)

Ahdab AlSaieedi Department of Medical Laboratory Technology (MLT), Faculty of Applied Medical Sciences (FAMS), King Abdulaziz University (KAU), Jeddah, 21589-80324, Saudi Arabia
Adil Salhi Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Faroug Tifratene Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Arwa Bin Raies Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Arnaud Hungler Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Mahmut Uludag Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Christophe Van Neste Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Vladimir B Bajic Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Takashi Gojobori Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
Magbubah Essack Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.

Garda S, Schwarz JM, Schuelke M, Leser U, Seelow D. Public data sources for regulatory genomic features. MED GENET-BERLIN 2021;33:167-177. [PMID: 38836022 PMCID: PMC11113004 DOI: 10.1515/medgen-2021-2075] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/15/2021] [Accepted: 06/24/2021] [Indexed: 06/06/2024]

Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021;118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]

Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021;22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022]

Stourac J, Dubrava J, Musil M, Horackova J, Damborsky J, Mazurenko S, Bednar D. FireProtDB: database of manually curated protein stability data. Nucleic Acids Res 2021;49:D319-D324. [PMID: 33166383 PMCID: PMC7778887 DOI: 10.1093/nar/gkaa981] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 08/14/2020] [Revised: 09/18/2020] [Accepted: 10/12/2020] [Indexed: 01/13/2023]

Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung PY, Zhao T, He Z, Zhang J. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 2020;21:773. [PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/05/2020] [Accepted: 10/26/2020] [Indexed: 11/17/2022]

Rahman P, Nandi A, Hebert C. Amplifying Domain Expertise in Clinical Data Pipelines. JMIR Med Inform 2020;8:e19612. [PMID: 33151150 PMCID: PMC7677017 DOI: 10.2196/19612] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/24/2020] [Revised: 07/07/2020] [Accepted: 07/22/2020] [Indexed: 11/28/2022]

Islamaj R, Kwon D, Kim S, Lu Z. TeamTat: a collaborative text annotation tool. Nucleic Acids Res 2020;48:W5-W11. [PMID: 32383756 DOI: 10.1093/nar/gkaa333] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/03/2020] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 12/20/2022]

Alag S. Unique insights from ClinicalTrials.gov by mining protein mutations and RSids in addition to applying the Human Phenotype Ontology. PLoS One 2020;15:e0233438. [PMID: 32459809 PMCID: PMC7252633 DOI: 10.1371/journal.pone.0233438] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/20/2019] [Accepted: 05/05/2020] [Indexed: 01/31/2023]

Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2019;47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022]

DES-ROD: Exploring Literature to Develop New Links between RNA Oxidation and Human Diseases. OXIDATIVE MEDICINE AND CELLULAR LONGEVITY 2020;2020:5904315. [PMID: 32308806 PMCID: PMC7142358 DOI: 10.1155/2020/5904315] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/09/2020] [Accepted: 02/21/2020] [Indexed: 12/27/2022]

Abstract

Normal cellular physiology and biochemical processes require undamaged RNA molecules. However, RNAs are frequently subjected to oxidative damage. Overproduction of reactive oxygen species (ROS) leads to RNA oxidation and disturbs redox (oxidation-reduction reaction) homeostasis. When oxidation damage affects RNA carrying protein-coding information, this may result in the synthesis of aberrant proteins as well as a lower efficiency of translation. Both of these, as well as imbalanced redox homeostasis, may lead to numerous human diseases. The number of studies on the effects of RNA oxidative damage in mammals is increasing every year due to the understanding that this oxidation fundamentally leads to numerous human diseases. To enable researchers in this field to explore information relevant to RNA oxidation and effects on human diseases, we developed DES-ROD, an online knowledgebase that contains processed information from 298,603 relevant documents that consist of PubMed abstracts and PubMed Central full-text articles. The system utilizes concepts/terms from 38 curated thematic dictionaries mapped to the analyzed documents. Researchers can explore enriched concepts, as well as enriched pairs of putatively associated concepts. In this way, one can explore mutual relationships between any combination of two concepts from the dictionaries used. Dictionaries cover a wide range of biomedical topics, such as human genes and proteins, pathways, Gene Ontology categories, mutations, noncoding RNAs, enzymes, toxins, metabolites, and diseases. This makes insights into different facets of the effects of RNA oxidation and the control of this process possible. The usefulness of the DES-ROD system is demonstrated by case studies on some known information, as well as potentially novel information involving RNA oxidation and diseases. DES-ROD is the first knowledgebase based on text and data mining that focuses on the exploration of RNA oxidation and human diseases.

Bugnon LA, Yones C, Raad J, Gerard M, Rubiolo M, Merino G, Pividori M, Di Persia L, Milone DH, Stegmayer G. DL4papers: a deep learning approach for the automatic interpretation of scientific articles. Bioinformatics 2020;36:3499-3506. [DOI: 10.1093/bioinformatics/btaa111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/09/2019] [Revised: 12/27/2019] [Accepted: 02/14/2020] [Indexed: 01/26/2023]

Abstract

Motivation: In precision medicine, next-generation sequencing and novel preclinical reports have led to an increasingly large amount of results, published in the scientific literature. However, identifying novel treatments or predicting a drug response in, for example, cancer patients, from the huge number of papers available remains laborious and challenging work. This task can be considered a text mining problem that requires reading a lot of academic documents for identifying a small set of papers describing specific relations between key terms. Due to the infeasibility of the manual curation of these relations, computational methods that can automatically identify them from the available literature are urgently needed.

Results: We present DL4papers, a new method based on deep learning that is capable of analyzing and interpreting papers in order to automatically extract relevant relations between specific keywords. DL4papers receives as input a query with the desired keywords, and it returns a ranked list of papers that contain meaningful associations between the keywords. The comparison against related methods showed that our proposal outperformed them in a cancer corpus. The reliability of the DL4papers output list was also measured, revealing that 100% of the first two documents retrieved for a particular search have relevant relations, on average. This shows that our model can guarantee that in the top-2 papers of the ranked list, the relation can be effectively found. Furthermore, the model is capable of highlighting, within each document, the specific fragments that have the associations of the input keywords. This can be very useful in order to pay attention only to the highlighted text, instead of reading the full paper. We believe that our proposal could be used as an accurate tool for rapidly identifying relationships between genes and their mutations, drug responses and treatments in the context of a certain disease. This new approach can certainly be a very useful and valuable resource for the advancement of the precision medicine field.

Availability and implementation: A web-demo is available at: http://sinc.unl.edu.ar/web-demo/dl4papers/. Full source code and data are available at: https://sourceforge.net/projects/sourcesinc/files/dl4papers/.

Contact: lbugnon@sinc.unl.edu.ar

Supplementary information: Supplementary data are available at Bioinformatics online.

Affiliation(s)

L A Bugnon Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
C Yones Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
J Raad Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
M Gerard Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
M Rubiolo Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
G Merino Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina Bioengineering and Bioinformatics Research and Development Institute, IBB, FIUNER-CONICET, Ruta Prov 11, Km 10.5, Oro Verde 3100, Argentina
M Pividori Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
L Di Persia Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
D H Milone Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
G Stegmayer Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina

Legrand J, Gogdemir R, Bousquet C, Dalleau K, Devignes MD, Digan W, Lee CJ, Ndiaye NC, Petitpain N, Ringot P, Smaïl-Tabbone M, Toussaint Y, Coulet A. PGxCorpus, a manually annotated corpus for pharmacogenomics. Sci Data 2020;7:3. [PMID: 31896797 PMCID: PMC6940385 DOI: 10.1038/s41597-019-0342-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/29/2019] [Accepted: 12/02/2019] [Indexed: 11/09/2022]

Liu S, Lee I. Extracting features with medical sentiment lexicon and position encoding for drug reviews. Health Inf Sci Syst 2019;7:11. [PMID: 31168364 PMCID: PMC6542915 DOI: 10.1007/s13755-019-0072-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/31/2018] [Accepted: 05/15/2019] [Indexed: 11/26/2022]

Wang Y, Fan X, Chen L, Chang EIC, Ananiadou S, Tsujii J, Xu Y. Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics 2019;20:430. [PMID: 31419946 PMCID: PMC6697955 DOI: 10.1186/s12859-019-3005-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2019] [Accepted: 07/23/2019] [Indexed: 11/16/2022]

Ševa J, Wiegandt DL, Götze J, Lamping M, Rieke D, Schäfer R, Jähnichen P, Kittner M, Pallarz S, Starlinger J, Keilholz U, Leser U. VIST - a Variant-Information Search Tool for precision oncology. BMC Bioinformatics 2019;20:429. [PMID: 31419935 PMCID: PMC6697931 DOI: 10.1186/s12859-019-2958-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/28/2019] [Accepted: 06/18/2019] [Indexed: 02/08/2023]

Affiliation(s)

Jurica Ševa Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
David Luis Wiegandt Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
Julian Götze University Hospital Tübingen, Hoppe-Seyler-Straße 3, Tübingen, 72076, Germany
Mario Lamping Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
Damian Rieke Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany Department of Hematology and Medical Oncology, Campus Benjamin Franklin, Charité Universitätsmedizin Berlin, Hindenburgdamm 30, Berlin, 12203, Germany Berlin Institute of Health, Kapelle-Ufer 2, Berlin, 10117, Germany
Reinhold Schäfer Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany German Cancer Consortium (DKTK), DKFZ Heidelberg, Im Neuenheimer Feld 280, Heidelberg, 69120, Germany
Patrick Jähnichen Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
Madeleine Kittner Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
Steffen Pallarz Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
Johannes Starlinger Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
Ulrich Keilholz Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
Ulf Leser Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany.

Guin D, Rani J, Singh P, Grover S, Bora S, Talwar P, Karthikeyan M, Satyamoorthy K, Adithan C, Ramachandran S, Saso L, Hasija Y, Kukreti R. Global Text Mining and Development of Pharmacogenomic Knowledge Resource for Precision Medicine. Front Pharmacol 2019;10:839. [PMID: 31447668 PMCID: PMC6692532 DOI: 10.3389/fphar.2019.00839] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/07/2019] [Accepted: 07/01/2019] [Indexed: 11/20/2022]

Affiliation(s)

Debleena Guin Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India Department of Biotechnology, Delhi Technological University, Delhi, India
Jyoti Rani Department of Biomedical Sciences, Acharya Narayan Dev College, University of Delhi, New Delhi, India G N Ramachandran Knowledge Centre, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India
Priyanka Singh Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India Academy of Scientific & Innovative Research (AcSIR), New Delhi, India
Sandeep Grover Institute of Medical Biometry and Statistics, University of Lübeck University Medical Center Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
Shivangi Bora Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India Department of Biotechnology, Delhi Technological University, Delhi, India
Puneet Talwar Institute of Human Behaviour and Allied Sciences, Delhi, India
Muthusamy Karthikeyan Department of Bioinformatics, Alagappa University, Karaikudi, India
K Satyamoorthy School of Life Sciences, Manipal University, Manipal, India
C Adithan Central Inter-Disciplinary Research Facility (CIDRF), Pondicherry, India
S Ramachandran G N Ramachandran Knowledge Centre, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India Academy of Scientific & Innovative Research (AcSIR), New Delhi, India
Luciano Saso Department of Physiology and Pharmacology “Vittorio Erspamer,” Sapienza University of Rome, Rome, Italy
Yasha Hasija Department of Biotechnology, Delhi Technological University, Delhi, India
Ritushree Kukreti Genomics and Molecular Medicine Unit, Council of Scientific and Industrial Research (CSIR)—Institute of Genomics and Integrative Biology (IGIB), New Delhi, India Academy of Scientific & Innovative Research (AcSIR), New Delhi, India

Allot A, Peng Y, Wei CH, Lee K, Phan L, Lu Z. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 2018;46:W530-W536. [PMID: 29762787 PMCID: PMC6030971 DOI: 10.1093/nar/gky355] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Received: 01/31/2018] [Accepted: 05/08/2018] [Indexed: 01/10/2023]

Gachloo M, Wang Y, Xia J. A review of drug knowledge discovery using BioNLP and tensor or matrix decomposition. Genomics Inform 2019;17:e18. [PMID: 31307133 PMCID: PMC6808632 DOI: 10.5808/gi.2019.17.2.e18] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/23/2019] [Revised: 05/30/2019] [Accepted: 05/30/2019] [Indexed: 12/12/2022]

Ohno-Machado L, Kim J, Gabriel RA, Kuo GM, Hogarth MA. Genomics and electronic health record systems. Hum Mol Genet 2018;27:R48-R55. [PMID: 29741693 DOI: 10.1093/hmg/ddy104] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 03/08/2018] [Accepted: 03/19/2018] [Indexed: 01/27/2023]

Essack M, Salhi A, Stanimirovic J, Tifratene F, Bin Raies A, Hungler A, Uludag M, Van Neste C, Trpkovic A, Bajic VP, Bajic VB, Isenovic ER. Literature-Based Enrichment Insights into Redox Control of Vascular Biology. OXIDATIVE MEDICINE AND CELLULAR LONGEVITY 2019;2019:1769437. [PMID: 31223421 PMCID: PMC6542245 DOI: 10.1155/2019/1769437] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/07/2019] [Revised: 04/11/2019] [Accepted: 05/02/2019] [Indexed: 02/07/2023]

Saqi M, Lysenko A, Guo YK, Tsunoda T, Auffray C. Navigating the disease landscape: knowledge representations for contextualizing molecular signatures. Brief Bioinform 2019;20:609-623. [PMID: 29684165 PMCID: PMC6556902 DOI: 10.1093/bib/bby025] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/09/2017] [Revised: 02/05/2018] [Indexed: 12/14/2022]

Labbé C, Grima N, Gautier T, Favier B, Byrne JA. Semi-automated fact-checking of nucleotide sequence reagents in biomedical research publications: The Seek & Blastn tool. PLoS One 2019;14:e0213266. [PMID: 30822319 PMCID: PMC6396917 DOI: 10.1371/journal.pone.0213266] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/12/2018] [Accepted: 02/18/2019] [Indexed: 12/14/2022]

Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, Beaty KA, Dehan E, Parikh B. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019;138:109-124. [PMID: 30671672 PMCID: PMC6373233 DOI: 10.1007/s00439-019-01970-5] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 09/27/2018] [Accepted: 01/02/2019] [Indexed: 02/07/2023]

Zheng S, Dharssi S, Wu M, Li J, Lu Z. Text Mining for Drug Discovery. Methods Mol Biol 2019;1939:231-252. [PMID: 30848465 DOI: 10.1007/978-1-4939-9089-4_13] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 06/09/2023]

Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019;2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]

Abstract

The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.

Affiliation(s)

Rezarta Islamaj Dogan National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Sun Kim National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Andrew Chatr-Aryamontri Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Canada
Chih-Hsuan Wei National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Donald C Comeau National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Rui Antunes Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
Sérgio Matos Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
Qingyu Chen School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
Aparna Elangovan School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
Nagesh C Panyam School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
Karin Verspoor School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
Hongfang Liu Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
Yanshan Wang Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
Zhuang Liu School of Computer Science and Technology, Dalian University of Technology, Dalian, China
Berna Altinel Department of Computer Engineering, Marmara University, Istanbul, Turkey
Zehra Melce Hüsünbeyi Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
Arzucan Özgür
Aris Fergadis School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
Chen-Kai Wang Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
Hong-Jie Dai Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Tung Tran Department of Computer Science, University of Kentucky, Lexington, KY, USA
Ramakanth Kavuluru Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
Ling Luo College of Computer Science and Technology, Dalian University of Technology, Dalian, China
Albert Steppi Department of Statistics, Florida State University, Florida, USA
Jinfeng Zhang Department of Statistics, Florida State University, Florida, USA
Jinchan Qu Department of Statistics, Florida State University, Florida, USA
Zhiyong Lu National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

Yang X, Song Z, Wu C, Wang W, Li G, Zhang W, Wu L, Lu K. Constructing a database for the relations between CNV and human genetic diseases via systematic text mining. BMC Bioinformatics 2018;19:528. [PMID: 30598077 PMCID: PMC6311945 DOI: 10.1186/s12859-018-2526-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/13/2022]

Matos S. Configurable web-services for biomedical document annotation. J Cheminform 2018;10:68. [PMID: 30578450 PMCID: PMC6755557 DOI: 10.1186/s13321-018-0317-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/12/2018] [Accepted: 12/04/2018] [Indexed: 01/13/2023]

Yepes AJ, MacKinlay A, Gunn N, Schieber C, Faux N, Downton M, Goudey B, Martin RL. A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2018;2018:616-623. [PMID: 30815103 PMCID: PMC6371299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]

Tawfik NS, Spruit MR. The SNPcurator: literature mining of enriched SNP-disease associations. Database: The Journal of Biological Databases and Curation 2018;2018:4925332. [PMID: 29688369 PMCID: PMC5844215 DOI: 10.1093/database/bay020] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 02/18/2017] [Accepted: 02/05/2018] [Indexed: 01/08/2023]

Michelini S, Balakrishnan B, Parolo S, Matone A, Mullaney JA, Young W, Gasser O, Wall C, Priami C, Lombardo R, Kussmann M. A reverse metabolic approach to weaning: in silico identification of immune-beneficial infant gut bacteria, mining their metabolism for prebiotic feeds and sourcing these feeds in the natural product space. Microbiome 2018;6:171. [PMID: 30241567 PMCID: PMC6151060 DOI: 10.1186/s40168-018-0545-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 04/11/2018] [Accepted: 08/30/2018] [Indexed: 05/13/2023]

Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018;34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022]

Abstract

Motivation

Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results

We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research.

Availability and implementation

The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

Contact

zhiyong.lu@nih.gov.

Kordopati V, Salhi A, Razali R, Radovanovic A, Tifratene F, Uludag M, Li Y, Bokhari A, AlSaieedi A, Bin Raies A, Van Neste C, Essack M, Bajic VB. DES-Mutation: System for Exploring Links of Mutations and Diseases. Sci Rep 2018;8:13359. [PMID: 30190574 PMCID: PMC6127254 DOI: 10.1038/s41598-018-31439-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 12/18/2017] [Accepted: 08/17/2018] [Indexed: 12/17/2022]

Affiliation(s)

Vasiliki Kordopati King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Adil Salhi King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Rozaimi Razali King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Aleksandar Radovanovic King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Faroug Tifratene King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Mahmut Uludag King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Yu Li King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Ameerah Bokhari King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Ahdab AlSaieedi King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia; King Abdulaziz University (KAU), Faculty of Applied Medical Sciences (FAMS), Department of Medical Laboratory Technology (MLT), Jeddah, 21589-80324, Saudi Arabia
Arwa Bin Raies King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Christophe Van Neste King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000, Ghent, Belgium
Magbubah Essack King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
Vladimir B Bajic King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia

Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, Breuza L, Bridge A, Cunningham F, Xenarios I, Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 2018;14:e1006390. [PMID: 30102703 PMCID: PMC6107285 DOI: 10.1371/journal.pcbi.1006390] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/05/2018] [Revised: 08/23/2018] [Accepted: 07/24/2018] [Indexed: 11/18/2022]

Abstract

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.

As the volume of literature on genomic variants continues to grow at an increasing rate, it is becoming more difficult for a curator of a variant knowledge base to keep up with and curate all the published papers. Here, we suggest a deep learning-based literature triage method for genomic variation resources. Our method achieves state-of-the-art performance on the triage task. Moreover, our model does not require any laborious preprocessing or feature engineering steps, which are required for traditional machine learning triage methods. We applied our method to the literature triage process of UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog for genomic variation by collaborating with the database curators. Both the manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. Both results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks such as actual curation as well as the discovery of novel papers/experimental techniques to consider for inclusion.
