1
Yu S, Dong P, Li J, Tang X, Li X. A study on large-scale disease causality discovery from biomedical literature. BMC Med Inform Decis Mak 2025; 25:136. [PMID: 40102814 PMCID: PMC11916938 DOI: 10.1186/s12911-025-02893-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Accepted: 01/23/2025] [Indexed: 03/20/2025] Open
Abstract
BACKGROUND Biomedical semantic relationship extraction could reveal important biomedical entities and the semantic relationships between them, providing a crucial foundation for biomedical knowledge discovery, clinical decision making and other artificial intelligence applications. Identifying the causal relationships between diseases is a significant research field, since it expedites the identification of underlying disease pathogenesis mechanisms and promotes better disease prevention and treatment. SemRep is an effective tool for semantic relationship extraction in the biomedical field, but it is not accurate enough for disease causality extraction, bringing challenges for downstream tasks. In this study, we proposed an optimization strategy for SemRep to enhance its accuracy in disease causality extraction. METHODS This study aims to optimize disease causality extraction of the SemRep tool by constructing a semantic predicate vocabulary that precisely expresses disease causality to support the automatic extraction of disease causality knowledge from biomedical literature. The proposed method involves the following four steps: Firstly, we obtained a collection of semantic feature words expressing disease causality based on current causality predicate studies and the disease causality pairs extracted from SemMedDB. Then, we constructed a disease causality semantic predicate vocabulary by filtering and evaluating the clue words using quantitative comparisons. Following that, we extracted disease causality pairs from the biomedical literature using 36 semantic predicates with an accuracy greater than 80% for more meaningful knowledge discovery. Finally, we conducted knowledge discovery based on the extracted disease causality triples, which primarily includes unidirectional disease causality, bidirectional disease causality, as well as two specific types of disease causality: primary disease causality and rare disease causality.
RESULTS We obtained a disease causality semantic predicate vocabulary containing 50 textual predicates with an accuracy above 40%. 36 semantic predicates from the 60% accuracy group were used for disease causality extraction, yielding 259,434 disease causality pairs for subsequent knowledge discovery. Among them, 92,557 types with 176,010 unidirectional disease causality triples and 6,084 types with 83,424 bidirectional disease causality triples were eventually found. Two other types of disease causality, primary disease causality and rare disease causality, were also discovered. CONCLUSIONS The novelty of this research is that the proposed method enhanced the disease causality extraction of the SemRep tool, resulting in more accurate and comprehensive disease causality extraction. It also facilitates automatic disease causality extraction from large-scale biomedical literature. Additionally, customized extraction of disease causality, tuned for accuracy or comprehensiveness, is made possible by the quantified causality predicate vocabulary, allowing flexible extraction of disease causality according to the actual circumstances.
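The distinction between unidirectional and bidirectional disease causality in this abstract can be illustrated with a minimal sketch. It assumes the extracted pairs are available as (cause, effect) tuples and treats a pair as bidirectional when the reversed pair also occurs; the function name `split_by_direction` and the sample disease pairs are hypothetical and are not taken from the paper, whose actual procedure may differ.

```python
from collections import Counter

def split_by_direction(pairs):
    """Split (cause, effect) pairs into unidirectional and bidirectional groups.

    Assumption: a pair is bidirectional when the reversed pair also occurs.
    Returns two Counters mapping each distinct pair (a "type") to the
    number of times it was extracted (its "triples").
    """
    counts = Counter(pairs)
    uni, bi = Counter(), Counter()
    for (cause, effect), n in counts.items():
        if cause != effect and (effect, cause) in counts:
            bi[(cause, effect)] = n
        else:
            uni[(cause, effect)] = n
    return uni, bi

# Hypothetical example data, for illustration only.
pairs = [
    ("hypertension", "stroke"),
    ("hypertension", "stroke"),
    ("diabetes", "kidney disease"),
    ("kidney disease", "diabetes"),
]
uni, bi = split_by_direction(pairs)
print(len(uni), sum(uni.values()))  # unidirectional types and triples
print(len(bi), sum(bi.values()))    # bidirectional types and triples
```

Counting distinct pairs separately from total occurrences mirrors the "types with triples" wording used in the results.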
Affiliation(s)
- Shirui Yu
  - National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu, 610041, China
  - Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, China
- Peng Dong
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Junlian Li
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Xiaoli Tang
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Xiaoying Li
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
2
Yea S, Jang H, Kim S, Lee S, Kim JU. Annotated corpus for traditional formula-disease relationships in biomedical articles. Sci Data 2025; 12:26. [PMID: 39774689 PMCID: PMC11707285 DOI: 10.1038/s41597-025-04377-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2024] [Accepted: 01/01/2025] [Indexed: 01/11/2025] Open
Abstract
The Traditional Formula (TF), a combination of herbs prepared in accordance with traditional medicine principles, is increasingly garnering global attention as an alternative to modern medicine. Specifically, there is growing interest in exploring TF's therapeutic effects across various diseases. A significant portion of the state-of-the-art knowledge regarding the relationship between TF and disease is found in scientific publications, where manual knowledge extraction is impractical. Thus, Natural Language Processing (NLP) is being employed to efficiently and accurately search and extract crucial knowledge from unstructured literature. However, the absence of a high-quality manually annotated corpus focusing on TF-disease relationships hampers the use of NLP in the fields of traditional medicine and modern biomedical science. This article introduces the Traditional Formula-Disease Relationship (TFDR) corpus, a manually annotated corpus designed to facilitate the automatic extraction of TF-disease relationships from biomedical literature. The TFDR corpus includes information gleaned from 740 PubMed abstracts, encompassing a total of 6,211 TF mentions, 7,166 disease mentions, and 1,109 relationships between them encapsulated within 744 key sentences.
Affiliation(s)
- Sangjun Yea
  - Korean medicine data division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea
- Ho Jang
  - Korean medicine data division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea
  - Korean convergence medical science, University of Science and Technology, Daejeon, 34113, Republic of Korea
- Soyoung Kim
  - Korean medicine data division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea
  - Korean convergence medical science, University of Science and Technology, Daejeon, 34113, Republic of Korea
- Sanghun Lee
  - Korean medicine data division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea
  - Korean convergence medical science, University of Science and Technology, Daejeon, 34113, Republic of Korea
- Jaeuk U Kim
  - Korean convergence medical science, University of Science and Technology, Daejeon, 34113, Republic of Korea
  - Digital health research division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea
3
Chen P, Wang J, Luo L, Lin H, Yang Z. Learning to explain is a good biomedical few-shot learner. Bioinformatics 2024; 40:btae589. [PMID: 39360976 PMCID: PMC11483110 DOI: 10.1093/bioinformatics/btae589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Revised: 09/24/2024] [Accepted: 10/01/2024] [Indexed: 10/05/2024] Open
Abstract
MOTIVATION Significant progress has been achieved in biomedical text mining using deep learning methods, which rely heavily on large amounts of high-quality data annotated by human experts. However, obtaining high-quality annotated data is extremely challenging due to data scarcity (e.g. rare or new diseases), data privacy and security concerns, and the high cost of data annotation. Additionally, nearly all research focuses on predicting labels without providing corresponding explanations. Therefore, in this paper, we investigate a more realistic scenario, biomedical few-shot learning, and explore the impact of interpretability on biomedical few-shot learning. RESULTS We present LetEx (Learning to explain), a novel multi-task generative approach that leverages reasoning explanations from large language models (LLMs) to enhance the inductive reasoning ability of few-shot learning. Our approach includes (1) collecting high-quality explanations by devising a complete workflow based on LLMs through CoT prompting and self-training strategies, and (2) converting various biomedical NLP tasks into a text-to-text generation task in a unified manner, where collected explanations serve as additional supervision between text-label pairs by multi-task training. Experiments are conducted on three few-shot settings across six biomedical benchmark datasets. The results show that learning to explain improves the performance of diverse biomedical NLP tasks in low-resource scenarios, outperforming strong baseline models significantly by up to 6.41%. Notably, the proposed method enables the 220M LetEx to show superior reasoning explanation ability compared with LLMs. AVAILABILITY AND IMPLEMENTATION Our source code and data are available at https://github.com/cpmss521/LetEx.
Affiliation(s)
- Peng Chen
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jian Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
4
Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024; 2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]
Abstract
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
Affiliation(s)
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
- Farrokh Mehryary
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Tomoko Ohta
  - Textimi, 1-37-13 Kitazawa, Tokyo, Setagaya-ku 155-0031, Japan
- Jouni Luoma
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
5
Mehryary F, Nastou K, Ohta T, Jensen LJ, Pyysalo S. STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae552. [PMID: 39276156 PMCID: PMC11441320 DOI: 10.1093/bioinformatics/btae552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 07/01/2024] [Accepted: 09/12/2024] [Indexed: 09/16/2024]
Abstract
MOTIVATION Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature. RESULTS To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for the extraction of complex formation relationships among biomedical entities targeting the downstream semantics of the physical interaction subnetwork of the STRING database. This corpus comprises 1287 documents with ∼3500 relationships. We train a novel relation extraction model on this corpus and find that it can highly reliably identify physical protein interactions (F1-score = 82.8%). We additionally enhance the model's capabilities through unsupervised trigger word detection and apply it to extract relations and trigger words for these relations from all open publications in the domain literature. This information has been fully integrated into the latest version of the STRING database. AVAILABILITY AND IMPLEMENTATION We provide the corpus, code, and all results produced by the large-scale runs of our systems on biomedical literature via Zenodo https://doi.org/10.5281/zenodo.8139716, GitHub https://github.com/farmeh/ComplexTome_extraction, and the latest version of the STRING database https://string-db.org/.
Affiliation(s)
- Farrokh Mehryary
  - TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Tomoko Ohta
  - Textimi, 1-37-13 Kitazawa, Tokyo, Setagaya-ku 155-0031, Japan
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland
6
Enayati S, Vucetic S. Leveraging shortest dependency paths in low-resource biomedical relation extraction. BMC Med Inform Decis Mak 2024; 24:205. [PMID: 39049015 PMCID: PMC11267752 DOI: 10.1186/s12911-024-02592-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 06/27/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Biomedical Relation Extraction (RE) is essential for uncovering complex relationships between biomedical entities within text. However, training RE classifiers is challenging in low-resource biomedical applications with few labeled examples. METHODS We explore the potential of Shortest Dependency Paths (SDPs) to aid biomedical RE, especially in situations with limited labeled examples. In this study, we suggest various approaches to employ SDPs when creating word and sentence representations under supervised, semi-supervised, and in-context-learning settings. RESULTS Through experiments on three benchmark biomedical text datasets, we find that incorporating SDP-based representations enhances the performance of RE classifiers. The improvement is especially notable when working with small amounts of labeled data. CONCLUSION SDPs offer valuable insights into the complex sentence structure found in many biomedical text passages. Our study introduces several straightforward techniques that, as demonstrated experimentally, effectively enhance the accuracy of RE classifiers.
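As a rough illustration of what a shortest dependency path is, the sketch below runs a breadth-first search between two entity tokens over a dependency tree given as (head, dependent) edges. The edge list is a hand-written, hypothetical parse, the function name `shortest_dependency_path` is an assumption, and the tree is treated as undirected here; none of this is taken from the paper's implementation.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Return the shortest token path between two tokens in a dependency tree.

    `edges` is a list of (head, dependent) token pairs; the tree is
    treated as undirected for the purpose of this sketch.
    Returns a list of tokens, or None if no path exists.
    """
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical parse of "Aspirin strongly inhibits the enzyme COX-1".
edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "enzyme"),
    ("enzyme", "the"),
    ("enzyme", "COX-1"),
]
path = shortest_dependency_path(edges, "Aspirin", "COX-1")
print(path)
```

The path drops modifiers such as "strongly" and "the", which is why SDP-based representations are often described as focusing on the words that connect the two entities.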
Affiliation(s)
- Saman Enayati
  - Department of Computer and Information Sciences, Temple University, 1925 N. 12th Street, Suite 304, Philadelphia, PA, 19122, USA
- Slobodan Vucetic
  - Department of Computer and Information Sciences, Temple University, 1925 N. 12th Street, Suite 304, Philadelphia, PA, 19122, USA
7
Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024; 2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Anthony Cnudde
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Pharmacologie, Pharmacothérapie et Suivi Pharmaceutique, Université Libre de Bruxelles, Boulevard du Triomphe, CP 205, Brussels 1050, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
8
Das Baksi K, Pokhrel V, Pudavar AE, Mande SS, Kuntal BK. BactInt: A domain driven transfer learning approach for extracting inter-bacterial associations from biomedical text. Comput Biol Chem 2024; 109:108012. [PMID: 38198963 DOI: 10.1016/j.compbiolchem.2023.108012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 12/15/2023] [Accepted: 12/30/2023] [Indexed: 01/12/2024]
Abstract
BACKGROUND The healthy as well as the dysbiotic state of an ecosystem like the human body is known to be influenced not only by the presence of bacterial groups in it, but also by the associations among them. Evidence reported in biomedical text serves as a reliable source for identifying and ascertaining such inter-bacterial associations. However, the complexity of the reported text as well as the ever-increasing volume of information necessitates the development of methods for automated and accurate extraction of such knowledge. METHODS A BioBERT (biomedical domain specific language model) based information extraction model for bacterial associations is presented that utilizes learning patterns from other publicly available datasets. Additionally, a specialized sentence corpus has been developed to significantly improve the prediction accuracy of the 'transfer learned' model using a fine-tuning approach. RESULTS The final model was seen to outperform all other variations (non-transfer learned and non-fine-tuned models) as well as models trained on BioGPT (a domain trained Generative Pre-trained Transformer). To further demonstrate the utility, a case study was performed using bacterial association network data obtained from experimental studies. CONCLUSION This study attempts to demonstrate the applicability of transfer learning in a niche field of life sciences where understanding of inter-bacterial relationships is crucial to obtain meaningful insights in comprehending microbial community structures across different ecosystems. The study further discusses how such a model can be improved by fine-tuning using limited training data. The results presented and the datasets made available are expected to be a valuable addition in the field of medical informatics and bioinformatics.
Affiliation(s)
- Vatsala Pokhrel
  - TCS Research, Tata Consultancy Services Ltd, Pune 411057, India
- Bhusan K Kuntal
  - TCS Research, Tata Consultancy Services Ltd, Pune 411057, India
9
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
- Jen-Chieh Han
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Pei-Yen Lin
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Yu-Ting You
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Richard Tzong-Han Tsai
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
  - Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
10
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance on par with training on all data) from 6% to 38%, depending on the data set, with Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
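The two uncertainty-based strategies named in this abstract, Least-Confident and Margin Sampling, can be sketched in their textbook form as follows. The function names and the probability values are hypothetical, and the sketch is not the authors' implementation, which is available from the repository linked above.

```python
def least_confident_score(probs):
    """Uncertainty as 1 minus the highest predicted class probability."""
    return 1.0 - max(probs)

def margin_score(probs):
    """Gap between the two highest class probabilities (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def select_batch(pool_probs, k, strategy="least_confident"):
    """Return indices of the k most uncertain unlabelled examples."""
    indices = range(len(pool_probs))
    if strategy == "least_confident":
        # Highest uncertainty score first.
        ranked = sorted(indices,
                        key=lambda i: least_confident_score(pool_probs[i]),
                        reverse=True)
    elif strategy == "margin":
        # Smallest margin first.
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]

# Hypothetical class probabilities predicted for four unlabelled examples.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.48, 0.02],
    [0.60, 0.30, 0.10],
]
print(select_batch(pool_probs, 2, "least_confident"))
print(select_batch(pool_probs, 2, "margin"))
```

The two strategies can rank the same pool differently: the second example has the lowest top probability, while the third has the smallest gap between its two leading classes.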
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
11
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
|
12
|
Altuntas V. Diffusion Alignment Coefficient (DAC): A Novel Similarity Metric for Protein-Protein Interaction Network. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:894-903. [PMID: 35737632 DOI: 10.1109/tcbb.2022.3185406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Interaction networks can be used to predict the functions of unknown proteins using known interactions and proteins with known functions. Many graph theory or diffusion-based methods have been proposed, using the assumption that the topological properties of a protein in a network are related to its biological function. Here we seek to improve function prediction by finding more similar neighbors with a new diffusion-based alignment technique to overcome the topological information loss of the node. In this study, we introduce the Diffusion Alignment Coefficient (DAC) algorithm, which combines diffusion, longest common subsequence, and longest common substring techniques to measure the similarity of two nodes in protein interaction networks. As a proof of concept, our experiments, conducted on real PPI networks of S. cerevisiae and Homo sapiens, demonstrated that our method obtained better results than competitors for MIPS and MSigDB Collections hallmark gene set functional categories. This is the first study to develop a measure of node function similarity using alignment to consider the positions of nodes in protein-protein interaction networks. According to the experimental results, the use of spatial information belonging to the nodes in the network has a positive effect on the detection of more functionally similar neighboring nodes.
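As a rough illustration of the two sequence measures named in this abstract, the sketch below computes the longest-common-subsequence and longest-common-substring lengths for two ordered lists of node identifiers. The abstract does not say how DAC builds or combines these sequences, so the inputs (hypothetical diffusion-ranked neighbor lists) and all function names are assumptions; this is not the author's implementation.

```python
def lcs_subsequence_len(a, b):
    """Length of the longest common subsequence (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_substring_len(a, b):
    """Length of the longest common contiguous run (no gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical neighbor rankings for two proteins (assumed inputs).
ranked_a = ["P1", "P2", "P3", "P4"]
ranked_b = ["P2", "P1", "P3", "P4"]

subsequence_len = lcs_subsequence_len(ranked_a, ranked_b)
substring_len = lcs_substring_len(ranked_a, ranked_b)
```

The subsequence measure tolerates reordering and gaps, while the substring measure rewards only contiguous agreement, which is one plausible reason for using both.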
|
13
|
Nicholson DN, Himmelstein DS, Greene CS. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 2022; 15:26. [PMID: 36258252 PMCID: PMC9578183 DOI: 10.1186/s13040-022-00311-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2022] [Accepted: 09/17/2022] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. 
Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
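The label functions described in this abstract can be pictured as small programs that each vote on a candidate sentence or abstain. The sketch below is a minimal illustration under stated assumptions: the three label functions, the `KNOWN_PAIRS` lookup, and the majority-vote combiner are all hypothetical, and data programming normally combines votes with a learned label model rather than a plain majority vote.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical stand-in for edges already present in a knowledge graph.
KNOWN_PAIRS = {("aspirin", "PTGS1")}


def lf_database(sentence, compound, gene):
    """Database-style label function: vote positive for known pairs."""
    return POSITIVE if (compound, gene) in KNOWN_PAIRS else ABSTAIN


def lf_binds_keyword(sentence, compound, gene):
    """Text heuristic: the word 'binds' suggests a binding relation."""
    return POSITIVE if " binds " in f" {sentence.lower()} " else ABSTAIN


def lf_negation(sentence, compound, gene):
    """Text heuristic: explicit negation suggests no relation."""
    return NEGATIVE if "does not bind" in sentence.lower() else ABSTAIN


LABEL_FUNCTIONS = [lf_database, lf_binds_keyword, lf_negation]


def majority_vote(sentence, compound, gene):
    """Combine non-abstaining votes; ties and all-abstain give ABSTAIN."""
    votes = [lf(sentence, compound, gene) for lf in LABEL_FUNCTIONS]
    pos = votes.count(POSITIVE)
    neg = votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE
```

Reusing a function such as `lf_binds_keyword` for a different edge type is the kind of transfer the study evaluates.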
Affiliation(s)
- David N. Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Daniel S. Himmelstein
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Casey S. Greene
  - Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA
|
14
|
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Affiliation(s)
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
|
15
|
Lee Y, Son J, Song M. BertSRC: transformer-based semantic relation classification. BMC Med Inform Decis Mak 2022; 22:234. [PMID: 36068535 PMCID: PMC9446816 DOI: 10.1186/s12911-022-01977-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 08/11/2022] [Indexed: 11/13/2022] Open
Abstract
The relationship between biomedical entities is complex, and many of them have not yet been identified. For many biomedical research areas including drug discovery, it is of paramount importance to identify the relationships that have already been established through a comprehensive literature survey. However, manually searching through literature is difficult as the amount of biomedical publications continues to increase. Therefore, the relation classification task, which automatically mines meaningful relations from the literature, is spotlighted in the field of biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, existing semantic relations between biomedical entities that can help to infer previously unknown relationships are efficiently grasped. To develop semantic relation classification models, which is a type of supervised machine learning, it is essential to construct a training dataset that is manually annotated by biomedical experts with semantic relations among biomedical entities. Any advanced model must be trained on a dataset with reliable quality and meaningful scale to be deployed in the real world and can assist biologists in their research. In addition, as the number of such public datasets increases, the performance of machine learning algorithms can be accurately revealed and compared by using those datasets as a benchmark for model development and improvement. In this paper, we aim to build such a dataset. Along with that, to validate the usability of the dataset as training data for relation classification models and to improve the performance of the relation extraction task, we built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT) trained on our dataset, applying our newly proposed fine-tuning methodology. 
In experiments comparing performance among several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology showed the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models. Furthermore, relation extraction performance can be improved by integrating our proposed fine-tuning methodology. Therefore, this can lead to the promotion of future text mining research in the biomedical field.
Affiliation(s)
- Yeawon Lee
  - Department of Library and Information Science, Yonsei University, Seoul, South Korea
- Jinseok Son
  - Department of Digital Analytics, Yonsei University, Seoul, South Korea
- Min Song
  - Department of Library and Information Science, Yonsei University, Seoul, South Korea
|
16
|
Cho H, Kim B, Choi W, Lee D, Lee H. Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci Data 2022; 9:235. [PMID: 35618736 PMCID: PMC9135735 DOI: 10.1038/s41597-022-01350-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022] Open
Abstract
Medicinal plants have demonstrated therapeutic potential for applicability for a wide range of observable characteristics in the human body, known as "phenotype," and have been considered favorably in clinical treatment. With an ever increasing interest in plants, many researchers have attempted to extract meaningful information by identifying relationships between plants and phenotypes from the existing literature. Although natural language processing (NLP) aims to extract useful information from unstructured textual data, there is no appropriate corpus available to train and evaluate the NLP model for plants and phenotypes. Therefore, in the present study, we have presented the plant-phenotype relationship (PPR) corpus, a high-quality resource that supports the development of various NLP fields; it includes information derived from 600 PubMed abstracts corresponding to 5,668 plant and 11,282 phenotype entities, and demonstrates a total of 9,709 relationships. We have also described benchmark results through named entity recognition and relation extraction systems to verify the quality of our data and to show the significant performance of NLP tasks in the PPR test set.
Affiliation(s)
- Hyejin Cho
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Baeksoo Kim
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Wonjun Choi
  - Digital Curation Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Doheon Lee
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea
- Hyunju Lee
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
|
17
|
Su P, Vijay-Shanker K. Investigation of improving the pre-training and fine-tuning of BERT model for biomedical relation extraction. BMC Bioinformatics 2022; 23:120. [PMID: 35379166 PMCID: PMC8978438 DOI: 10.1186/s12859-022-04642-w] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/19/2021] [Accepted: 03/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recently, automatically extracting biomedical relations has been a significant subject in biomedical research due to the rapid growth of biomedical literature. Since the adaptation to the biomedical domain, the transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we will explore the approaches to improve the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages of its applications. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. Also, we propose methods to incorporate the ignored knowledge in the last layer of BERT to improve its fine-tuning. RESULTS The experiment results demonstrate that our approaches for pre-training and fine-tuning can improve the BERT model performance. After combining the two proposed techniques, our approach outperforms the original BERT models with averaged F1 score improvement of 2.1% on relation extraction tasks. Moreover, our approach achieves state-of-the-art performance on three relation extraction benchmark datasets. CONCLUSIONS The extra pre-training step on sub-domain data can help the BERT model generalization on specific tasks, and our proposed fine-tuning mechanism could utilize the knowledge in the last layer of BERT to boost the model performance. Furthermore, the combination of these two approaches further improves the performance of BERT model on the relation extraction tasks.
Affiliation(s)
- Peng Su
  - Department of Computer and Information Science, Biomedical Text Mining Lab, University of Delaware, Newark, USA
- K. Vijay-Shanker
  - Department of Computer and Information Science, Biomedical Text Mining Lab, University of Delaware, Newark, USA
|
18
|
Yadav S, Ramesh S, Saha S, Ekbal A. Relation Extraction From Biomedical and Clinical Text: Unified Multitask Learning Framework. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1105-1116. [PMID: 32853152 DOI: 10.1109/tcbb.2020.3020016] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
MOTIVATION To minimize the accelerating amount of time invested on the biomedical literature search, numerous approaches for automated knowledge extraction have been proposed. Relation extraction is one such task where semantic relations between the entities are identified from the free text. In the biomedical domain, extraction of regulatory pathways, metabolic processes, adverse drug reaction or disease models necessitates knowledge from the individual relations, for example, physical or regulatory interactions between genes, proteins, drugs, chemical, disease or phenotype. RESULTS In this paper, we study the relation extraction task from three major biomedical and clinical tasks, namely drug-drug interaction, protein-protein interaction, and medical concept relation extraction. Towards this, we model the relation extraction problem in a multi-task learning (MTL) framework, and introduce for the first time the concept of structured self-attentive network complemented with the adversarial learning approach for the prediction of relationships from the biomedical and clinical text. The fundamental notion of MTL is to simultaneously learn multiple problems together by utilizing the concepts of the shared representation. Additionally, we also generate the highly efficient single task model which exploits the shortest dependency path embedding learned over the attentive gated recurrent unit to compare our proposed MTL models. The framework we propose significantly improves over all the baselines (deep learning techniques) and single-task models for predicting the relationships, without compromising on the performance of all the tasks.
|
19
|
Elangovan A, Li Y, Pires DEV, Davis MJ, Verspoor K. Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT. BMC Bioinformatics 2022; 23:4. [PMID: 34983371 PMCID: PMC8729035 DOI: 10.1186/s12859-021-04504-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2021] [Accepted: 11/30/2021] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature by using distantly supervised training data using deep learning to aid human curation. METHOD We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the use of ensemble average confidence approach with confidence variation to counteract the effects of class imbalance to extract high confidence predictions. RESULTS AND CONCLUSION The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filter ≈5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration and highlights the challenges of generalisability beyond the test set even with confidence calibration. 
We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
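The filtering idea in this abstract (keep predictions where the ensemble is both confident and consistent, then require support from multiple papers) can be sketched as follows. This is a minimal illustration, not the authors' code: the thresholds `min_mean`, `max_std`, and `min_papers`, the function names, and the toy inputs are all assumptions.

```python
from statistics import mean, pstdev


def select_high_quality(predictions, min_mean=0.9, max_std=0.05):
    """Keep items whose ensemble confidence is high on average and varies little.

    `predictions` maps an item key to the list of confidences assigned by
    each ensemble member (ten members in the paper; any number works here).
    """
    selected = []
    for item, confidences in predictions.items():
        avg = mean(confidences)
        spread = pstdev(confidences)
        if avg >= min_mean and spread <= max_std:
            selected.append((item, avg, spread))
    return selected


def supported_by_multiple_papers(triplet_to_pmids, min_papers=2):
    """Keep triplets extracted from at least `min_papers` distinct papers."""
    return {
        triplet
        for triplet, pmids in triplet_to_pmids.items()
        if len(set(pmids)) >= min_papers
    }


# Toy ensemble outputs (hypothetical values).
toy_predictions = {
    "A": [0.95, 0.96, 0.94],  # confident and consistent
    "B": [0.99, 0.60, 0.98],  # members disagree
    "C": [0.50, 0.52, 0.48],  # consistent but not confident
}
kept = [item for item, _, _ in select_high_quality(toy_predictions)]

toy_support = {"t1": ["111", "222"], "t2": ["333", "333"]}
multi_paper = supported_by_multiple_papers(toy_support)
```

Requiring both conditions trades recall for precision, which matches the abstract's report of retaining only a fraction of test predictions.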
Affiliation(s)
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Yuan Li
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Douglas E. V. Pires
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Melissa J. Davis
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
  - Department of Clinical Pathology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Australia
|
20
|
Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In modern health care research, protein phosphorylation has gained enormous attention from researchers across the globe and requires automated approaches to process a huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level is unique as well as arbitrary, and the accumulation of a massive volume of information is inevitable. Biological research has revealed that a huge array of cellular communication aided by protein phosphorylation and other similar mechanisms implies different and diverse meanings. This led to the collection of a huge volume of data to understand the biological functions of human evolution, especially for combating diseases in a better way. Text mining, an automated approach to mine information from unstructured data, finds its application in extracting protein phosphorylation information from biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm, for classifying the processed text related to human protein phosphorylation. We discuss the evaluation of the text mining system that results from the protocol on three corpora, namely, human Protein Phosphorylation (hPP) corpus, Integrated Protein Literature Information and Knowledge corpus (iProLink), and Phosphorylation Literature corpus (PLC). We also present a basic understanding of the chemistry and biology that drive the protein phosphorylation process in the human body. We believe that this basic understanding will be useful to advance the existing text mining systems for extracting protein phosphorylation information from PubMed.
Affiliation(s)
- Krishnamurthy Arumugam
  - Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India
- Raja Ravi Shanker
  - International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India
|
21
|
Kalyan KS, Rajasekharan A, Sangeetha S. AMMU: A survey of transformer-based biomedical pretrained language models. J Biomed Inform 2021; 126:103982. [PMID: 34974190 DOI: 10.1016/j.jbi.2021.103982] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 05/14/2021] [Revised: 12/12/2021] [Accepted: 12/20/2021] [Indexed: 01/04/2023]
Abstract
Transformer-based pretrained language models (PLMs) have started a new era in modern natural language processing (NLP). These models combine the power of transformers, transfer learning, and self-supervised learning (SSL). Following the success of these models in the general domain, the biomedical research community has developed various in-domain PLMs starting from BioBERT to the latest BioELECTRA and BioALBERT models. We strongly believe there is a need for a survey paper that can provide a comprehensive survey of various transformer-based biomedical pretrained language models (BPLMs). In this survey, we start with a brief overview of foundational concepts like self-supervised learning, embedding layer and transformer encoder layers. We discuss core concepts of transformer-based PLMs like pretraining methods, pretraining tasks, fine-tuning methods, and various embedding types specific to biomedical domain. We introduce a taxonomy for transformer-based BPLMs and then discuss all the models. We discuss various challenges and present possible solutions. We conclude by highlighting some of the open issues which will drive the research community to further improve transformer-based BPLMs. The list of all the publicly available transformer-based BPLMs along with their links is provided at https://mr-nlp.github.io/posts/2021/05/transformer-based-biomedical-pretrained-language-models-list/.
|
22
|
Protein-protein interaction relation extraction based on multigranularity semantic fusion. J Biomed Inform 2021; 123:103931. [PMID: 34628063 DOI: 10.1016/j.jbi.2021.103931] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/09/2021] [Revised: 09/12/2021] [Accepted: 10/04/2021] [Indexed: 01/02/2023]
Abstract
Extracting semantic relationships about biomedical entities in a sentence is a typical task in biomedical information extraction. Because a sentence usually contains several named entities, it is important to learn global semantics of a sentence to support relation extraction. In related works, many strategies have been proposed to encode a sentence representation relevant to considered named entities. Despite the current success, according to the characteristic of languages, semantics of words are expressed on multigranular levels which also heavily depends on local semantic of a sentence. In this paper, we propose a multigranularity semantic fusion method to support biomedical relation extraction. In this method, Transformer is adopted for embedding words of a sentence into distributed representations, which is effective to encode global semantic of a sentence. Meanwhile, a multichannel strategy is applied to encode local semantics of words, which enables the same word to have different representations in a sentence. Both global and local semantic representations are fused to enhance the discriminability of the neural network. To evaluate our method, experiments are conducted on five standard PPI corpora (AImed, BioInfer, IEPA, HPRD50, and LLL), which achieve F1-scores of 83.4%, 89.9%, 81.2%, 84.5%, and 92.5%, respectively. The results show that multigranular semantic fusion is helpful to support the protein-protein interaction relationship extraction.
|
23
|
Zhu T, Qin Y, Xiang Y, Hu B, Chen Q, Peng W. Distantly supervised biomedical relation extraction using piecewise attentive convolutional neural network and reinforcement learning. J Am Med Inform Assoc 2021; 28:2571-2581. [PMID: 34524450 DOI: 10.1093/jamia/ocab176] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/11/2021] [Revised: 07/08/2021] [Accepted: 08/06/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE There have been various methods to deal with the erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to deal with the insufficient modeling problem on instance-label correlations for predicting biomedical relations using deep learning and reinforcement learning. MATERIALS AND METHODS In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from Unified Medical Language System with MEDLINE abstracts and benchmark datasets. In PACNN+RL, PACNN was introduced to encode semantic information of biomedical text, and the RL method with memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks. RESULTS The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods with the F1-score of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the DDI corpus, 2011. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets. CONCLUSIONS The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks to facilitate biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.
Affiliation(s)
- Tiantian Zhu
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Yang Qin
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Yang Xiang
  - Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Baotian Hu
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Qingcai Chen
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Weihua Peng
  - Department of Knowledge Graph, Baidu International Technology (Shenzhen), Shenzhen, China
|
24
|
Hong G, Kim Y, Choi Y, Song M. BioPREP: Deep learning-based predicate classification with SemMedDB. J Biomed Inform 2021; 122:103888. [PMID: 34411707 DOI: 10.1016/j.jbi.2021.103888] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/04/2021] [Revised: 06/03/2021] [Accepted: 08/13/2021] [Indexed: 11/16/2022]
Abstract
When it comes to inferring relations between entities in biomedical texts, Relation Extraction (RE) has become key to biomedical information extraction. Although previous studies focused on rule-based and machine learning-based approaches, these methods were inefficient because of the amount of feature processing they demanded, and they yielded relatively low accuracy. Some existing biomedical relation extraction tools are based on neural networks. Nonetheless, they rarely analyze possible causes of the difference in accuracy among predicates. In addition, there have not been enough biomedical datasets structured for predicate classification. With this in mind, we set our research goals as follows: constructing a large-scale training dataset, namely Biomedical Predicate Relation-extraction with Entity-filtering by PKDE4J (BioPREP), based on SemMedDB and using PKDE4J as an entity-filtering tool; and evaluating the performance of each neural network-based algorithm on the structured dataset. We then analyzed our model's performance in depth by grouping predicates into semantic clusters. Comprehensive experiments showed that the BioBERT-based model outperformed other models for predicate classification. The suggested model achieved an F1-score of 0.846 when BioBERT was loaded as the pre-trained model and 0.840 when SciBERT weights were loaded. Moreover, the semantic cluster analysis showed that sentences containing key phrases, such as comparison verb + 'than', were classified better.
Affiliation(s)
- Gibong Hong
  - Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
- Yuheun Kim
  - Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
- YeonJung Choi
  - Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
- Min Song
  - Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea.
|
25
|
Fei H, Zhang Y, Ren Y, Ji D. A span-graph neural model for overlapping entity relation extraction in biomedical texts. Bioinformatics 2021; 37:1581-1589. [PMID: 33245108 DOI: 10.1093/bioinformatics/btaa993] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/13/2020] [Revised: 10/25/2020] [Accepted: 11/17/2020] [Indexed: 01/14/2023] Open
Abstract
MOTIVATION Entity relation extraction is one of the fundamental tasks in biomedical text mining and is usually solved with models from natural language processing. Compared with traditional pipeline methods, joint methods can avoid error propagation from entity to relation, giving better performance. However, the existing joint models are built upon a sequential scheme and fail to detect overlapping entities and relations, which are ubiquitous in biomedical texts. The main reason is that sequential models are relatively weak at capturing long-range dependencies, which results in lower performance when encoding longer sentences. In this article, we propose a novel span-graph neural model for jointly extracting overlapping entities and relations in biomedical texts. Our model treats the task as relation triplet prediction and builds the entity graph by enumerating possible candidate entity spans. The proposed model captures the relationship between the correlated entities via a span scorer and a relation scorer, and finally outputs all valid relational triplets. RESULTS Experimental results on two biomedical entity relation extraction tasks, drug-drug interaction detection and protein-protein interaction detection, show that the proposed method outperforms previous models by a substantial margin, demonstrating the effectiveness of the span-graph-based method for overlapping relation extraction in biomedical texts. Further in-depth analysis shows that our model is more effective than sequential models in capturing long-range dependencies for relation extraction. AVAILABILITY AND IMPLEMENTATION Related codes are made publicly available at http://github.com/Baxelyne/SpanBioER.
Affiliation(s)
- Hao Fei
  - School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
- Yue Zhang
  - School of Engineering, Westlake University, Hangzhou 310024, China
- Yafeng Ren
  - Laboratory of Language and Artificial Intelligence, Guangdong University of Foreign Studies, Guangzhou 510420, China
- Donghong Ji
  - School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
|
26
|
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which puts biomedical literature mining (BLM) techniques in high demand. Numerous efforts have been made on this topic by both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability and thus develops more sophisticated and universal models. The goal of this paper is to provide a review of the recent advances in BLM from both communities and inspire new research directions.
Affiliation(s)
- Sendong Zhao
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
- Chang Su
  - Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institute of Health, Bethesda, MD, USA
- Fei Wang
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
|
27
|
Mitra S, Saha S, Hasanuzzaman M. A Multi-View Deep Neural Network Model for Chemical-Disease Relation Extraction From Imbalanced Datasets. IEEE J Biomed Health Inform 2020; 24:3315-3325. [DOI: 10.1109/jbhi.2020.2983365] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022]
|
28
|
Wang CCN, Jin J, Chang JG, Hayakawa M, Kitazawa A, Tsai JJP, Sheu PCY. Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization. BMC Med Inform Decis Mak 2020; 20:208. [PMID: 32883271 PMCID: PMC7469322 DOI: 10.1186/s12911-020-01227-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/19/2020] [Accepted: 08/20/2020] [Indexed: 12/02/2022] Open
Abstract
Background Gastrointestinal (GI) cancers, including colorectal cancer, gastric cancer, pancreatic cancer, etc., are among the most frequent malignancies diagnosed annually and represent a major public health problem worldwide. Methods This paper reports an aided curation pipeline to identify potentially influential genes for gastrointestinal cancer. The curation pipeline processes biomedical literature to identify named entities using Bi-LSTM-CNN-CRF methods. The entities and their associations can be used to construct a graph, from which we can compute the sets of co-occurring genes that are the most influential based on an influence maximization algorithm. Results The most influential sets of co-occurring genes that we discovered include RARA - CRBP1, CASP3 - BCL2, BCL2 - CASP3 - CRBP1, RARA - CASP3 - CRBP1, FOXJ1 - RASSF3 - ESR1, FOXJ1 - RASSF1A - ESR1, and FOXJ1 - RASSF1A - TNFAIP8 - ESR1. With TCGA and functional and pathway enrichment analysis, we show that the proposed approach works well in the context of gastrointestinal cancer. Conclusions Our pipeline, which uses text mining to identify objects and relationships to construct a graph and uses graph-based influence maximization to discover the most influential co-occurring genes, presents a viable direction to assist knowledge discovery for clinical applications.
Affiliation(s)
- Charles C N Wang
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan; Center for Artificial Intelligence in Precision Medicine, Asia University, Taichung, Taiwan
- Jennifer Jin
  - Department of EECS and BME, University of California, Irvine, USA
- Jan-Gowth Chang
  - Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan; Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan; Graduate Institute of Clinical Medical Science, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan
- Jeffrey J P Tsai
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Phillip C-Y Sheu
  - Department of EECS and BME, University of California, Irvine, USA.
|
29
|
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), which make it possible, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable to easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
|
30
|
|
31
|
Huang MS, Lai PT, Lin PY, You YT, Tsai RTH, Hsu WL. Biomedical named entity recognition and linking datasets: survey and our recent development. Brief Bioinform 2020; 21:2219-2238. [DOI: 10.1093/bib/bbaa054] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/21/2020] [Revised: 02/29/2020] [Accepted: 03/31/2020] [Indexed: 11/14/2022] Open
Abstract
Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein–protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein–protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP_EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681; Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
- Ming-Siang Huang
  - Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Po-Ting Lai
  - Institute of Biomedical Informatics, National Yang Ming University, Taipei, Taiwan
- Pei-Yen Lin
  - Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
- Yu-Ting You
  - Intelligent Agent Systems Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Richard Tzong-Han Tsai
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Wen-Lian Hsu
  - Intelligent Agent Systems Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
32
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 92] [Impact Index Per Article: 18.4] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
|
33
|
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
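The F1 scores quoted in this abstract follow from the stated precision and recall through the harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes them as a consistency check; the function name `f1_score` and the setting labels are illustrative assumptions and are not part of SemRep or the cited paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract; the keys are
# hypothetical labels chosen for this illustration.
reported = {
    "strict": (0.55, 0.34),
    "relaxed": (0.69, 0.42),
    "cdr": (0.90, 0.24),
}

# Round to two decimals, the precision used in the abstract.
f1_by_setting = {
    name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()
}
print(f1_by_setting)
```

Rounded to two decimals, the recomputed values agree with the 0.42, 0.52, and 0.38 reported for the strict, relaxed, and CDR evaluations. The sentence-bound CDR figures are left out because the abstract does not restate the precision for that setting.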
Affiliation(s)
- Halil Kilicoglu
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
  - University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
- Graciela Rosemblat
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- Dongwook Shin
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
|
34
|
Lai PT, Lu WL, Kuo TR, Chung CR, Han JC, Tsai RTH, Horng JT. Using a Large Margin Context-Aware Convolutional Neural Network to Automatically Extract Disease-Disease Association from Literature: Comparative Analytic Study. JMIR Med Inform 2019; 7:e14502. [PMID: 31769759 PMCID: PMC6913619 DOI: 10.2196/14502] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/26/2019] [Revised: 07/26/2019] [Accepted: 08/11/2019] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Research on disease-disease association (DDA), like comorbidity and complication, provides important insights into disease treatment and drug discovery, and a large body of the literature has been published in the field. However, using current search tools, it is not easy for researchers to retrieve information on the latest DDA findings. First, comorbidity and complication keywords pull up large numbers of PubMed studies. Second, disease is not highlighted in search results. Finally, DDA is not identified, as currently no disease-disease association extraction (DDAE) dataset or tools are available. OBJECTIVE As there are no available DDAE datasets or tools, this study aimed to develop (1) a DDAE dataset and (2) a neural network model for extracting DDA from the literature. METHODS In this study, we formulated DDAE as a supervised machine learning classification problem. To develop the system, we first built a DDAE dataset. We then employed two machine learning models, support vector machine and convolutional neural network, to extract DDA. Furthermore, we evaluated the effect of using the output layer as features of the support vector machine-based model. Finally, we implemented a large margin context-aware convolutional neural network architecture to integrate context features and convolutional neural networks through the large margin function. RESULTS Our DDAE dataset consisted of 521 PubMed abstracts. Experiment results showed that the support vector machine-based approach achieved an F1 measure of 80.32%, which is higher than the convolutional neural network-based approach (73.32%). Using the output layer of the convolutional neural network as a feature for the support vector machine did not further improve the performance of the support vector machine. However, our large margin context-aware convolutional neural network achieved the highest F1 measure of 84.18% and demonstrated that combining the hinge loss function of the support vector machine with a convolutional neural network into a single neural network architecture outperforms other approaches. CONCLUSIONS To facilitate the development of text-mining research for DDAE, we developed the first publicly available DDAE dataset consisting of disease mentions, Medical Subject Heading IDs, and relation annotations. We developed different conventional machine learning models and neural network architectures and evaluated their effects on our DDAE dataset. To further improve DDAE performance, we propose a large margin context-aware convolutional neural network model for DDAE that outperforms other approaches.
Affiliation(s)
- Po-Ting Lai
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Province of China Taiwan
- Wei-Liang Lu
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Ting-Rung Kuo
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Chia-Ru Chung
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Jen-Chieh Han
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Richard Tzong-Han Tsai
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Jorng-Tzong Horng
  - Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Province of China Taiwan
|
35
|
Zhou H, Li X, Yao W, Liu Z, Ning S, Lang C, Du L. Improving neural protein-protein interaction extraction with knowledge selection. Comput Biol Chem 2019; 83:107146. [PMID: 31707129 DOI: 10.1016/j.compbiolchem.2019.107146] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/07/2019] [Revised: 10/08/2019] [Accepted: 10/09/2019] [Indexed: 11/29/2022]
Abstract
Protein-protein interaction (PPI) extraction from published scientific literature provides additional support for precision medicine efforts. Meanwhile, knowledge bases (KBs) contain huge amounts of structured information about protein entities and their relations, which can be encoded in entity and relation embeddings to help PPI extraction. However, the prior knowledge of protein-protein pairs must be used selectively so that it suits different contexts. This paper proposes a Knowledge Selection Model (KSM) to fuse the selected prior knowledge and context information for PPI extraction. Firstly, two Transformers encode the context sequence of a protein pair, one according to each protein embedding. Then, the two outputs are fed to a mutual attention mechanism to capture the important context features for the protein pair. Next, the context features are used to distill the relation embedding by a knowledge selector. Finally, the selected relation embedding and the context features are concatenated for PPI extraction. Experiments on the BioCreative VI PPI dataset show that KSM achieves a new state-of-the-art performance (38.08% F1-score) by adding knowledge selection.
Affiliation(s)
- Huiwei Zhou
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Xuefei Li
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Weihong Yao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Zhuang Liu
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Shixian Ning
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Chengkun Lang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
- Lei Du
  - School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, Liaoning, China.
|
36
|
Suárez-Paniagua V, Rivera Zavala RM, Segura-Bedmar I, Martínez P. A two-stage deep learning approach for extracting entities and relationships from medical texts. J Biomed Inform 2019; 99:103285. [PMID: 31546016 DOI: 10.1016/j.jbi.2019.103285] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/30/2019] [Revised: 09/01/2019] [Accepted: 09/03/2019] [Indexed: 10/26/2022]
Abstract
This work presents a two-stage deep learning system for Named Entity Recognition (NER) and Relation Extraction (RE) from medical texts. These tasks are a crucial step in many natural language understanding applications in the biomedical domain. Automatic medical coding of electronic medical records, automated summarization of patient records, automatic cohort identification for clinical studies, text simplification of health documents for patients, early detection of adverse drug reactions and automatic identification of risk factors are only a few examples of the many opportunities that text analysis can offer in the clinical domain. In this work, our efforts are primarily directed towards improving the pharmacovigilance process through the automatic detection of drug-drug interactions (DDI) from texts. Moreover, we deal with the semantic analysis of texts containing health information for patients. Our two-stage approach is based on Deep Learning architectures. Concretely, NER is performed by combining a bidirectional Long Short-Term Memory (Bi-LSTM) and a Conditional Random Field (CRF), while RE applies a Convolutional Neural Network (CNN). Since our approach uses very few language resources, only the pre-trained word embeddings, and does not exploit any domain resources (such as dictionaries or ontologies), it can be easily extended to support other languages and clinical applications that require the exploitation of semantic information (concepts and relationships) from texts. In recent years, the task of DDI extraction has received great attention from the BioNLP community. However, the problem has traditionally been evaluated as two separate subtasks: drug name recognition and extraction of DDIs. To the best of our knowledge, this is the first work that provides an evaluation of the whole pipeline. Moreover, our system obtains state-of-the-art results on the eHealth-KD challenge, which was part of the Workshop on Semantic Analysis at SEPLN (TASS-2018).
Affiliation(s)
- Víctor Suárez-Paniagua
- Computer Science Department, Carlos III University of Madrid, Leganés 28911, Madrid, Spain.
| | - Renzo M Rivera Zavala
- Computer Science Department, Carlos III University of Madrid, Leganés 28911, Madrid, Spain.
| | - Isabel Segura-Bedmar
- Computer Science Department, Carlos III University of Madrid, Leganés 28911, Madrid, Spain.
| | - Paloma Martínez
- Computer Science Department, Carlos III University of Madrid, Leganés 28911, Madrid, Spain.
| |
|
37
|
Fannjiang C, Mooney TA, Cones S, Mann D, Shorter KA, Katija K. Augmenting biologging with supervised machine learning to study in situ behavior of the medusa Chrysaora fuscescens. J Exp Biol 2019; 222:jeb.207654. [PMID: 31371399 PMCID: PMC6739807 DOI: 10.1242/jeb.207654] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/01/2019] [Accepted: 07/29/2019] [Indexed: 11/20/2022]
Abstract
Zooplankton play critical roles in marine ecosystems, yet their fine-scale behavior remains poorly understood because of the difficulty in studying individuals in situ. Here, we combine biologging with supervised machine learning (ML) to propose a pipeline for studying in situ behavior of larger zooplankton such as jellyfish. We deployed the ITAG, a biologging package with high-resolution motion sensors designed for soft-bodied invertebrates, on eight Chrysaora fuscescens in Monterey Bay, using the tether method for retrieval. By analyzing simultaneous video footage of the tagged jellyfish, we developed ML methods to: (1) identify periods of tag data corrupted by the tether method, which may have compromised prior research findings, and (2) classify jellyfish behaviors. Our tools yield characterizations of fine-scale jellyfish activity and orientation over long durations, and we conclude that it is essential to develop behavioral classifiers on in situ rather than laboratory data. Summary: High-resolution motion sensors paired with supervised machine learning can be used to infer fine-scale in situ behavior of zooplankton over long durations.
Affiliation(s)
- Clara Fannjiang
- Research and Development, Monterey Bay Aquarium Research Institute, Moss Landing, CA 95039, USA
- Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
| | - T Aran Mooney
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA 02543, USA
| | - Seth Cones
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA 02543, USA
| | | | - K Alex Shorter
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA
| | - Kakani Katija
- Research and Development, Monterey Bay Aquarium Research Institute, Moss Landing, CA 95039, USA
| |
|
38
|
Caufield JH, Ping P. New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data. Emerg Top Life Sci 2019; 3:357-369. [PMID: 33523203 DOI: 10.1042/etls20190003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/05/2019] [Revised: 07/11/2019] [Accepted: 07/16/2019] [Indexed: 12/14/2022]
Abstract
Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.
Affiliation(s)
- J Harry Caufield
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| | - Peipei Ping
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Medicine/Cardiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Bioinformatics, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Scalable Analytics Institute (ScAi), University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| |
|
39
|
Su P, Li G, Wu C, Vijay-Shanker K. Using distant supervision to augment manually annotated data for relation extraction. PLoS One 2019; 14:e0216913. [PMID: 31361753 PMCID: PMC6667146 DOI: 10.1371/journal.pone.0216913] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/26/2019] [Accepted: 07/18/2019] [Indexed: 11/18/2022] Open
Abstract
Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply some heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
Affiliation(s)
- Peng Su
- Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
| | - Gang Li
- Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
| | - Cathy Wu
- Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
| | - K. Vijay-Shanker
- Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America
| |
|
40
|
Varidi MJ, Varidi M, Vajdi M, Sharifpour A, Akbarzadeh-T MR. Best Precision–Recall Confidence Threshold and F-Measure to Determine Quality of Camel Meat by Support Vector Regression Based Electronic Nose. INTERNATIONAL JOURNAL OF FOOD ENGINEERING 2019. [DOI: 10.1515/ijfe-2018-0235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Maintaining fresh quality of camel meat and devising an effective validation instrument were the goals of this project. The minced samples were packed in bags with five different atmospheres and stored for 20 days at 4 °C. Headspace gas analysis and total viable count of bacteria were performed as references. Meat samples were tested with an electronic nose machine through dynamic sampling. Principal component analysis resulted in two distinct clusters consistent with the reference methods. Carbon dioxide was the best modified atmosphere to retain meat freshness. Support vector regression was trained with different confidence thresholds. The best precision–recall and F-measure values were obtained at a threshold of 0.5, which is promising for avoiding false-positive and false-negative diagnoses, a crucial concern for regulatory decision-making organizations.
Affiliation(s)
- Mohammad J. Varidi
- Department of Food Science and Technology, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mehdi Varidi
- Department of Food Science and Technology, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Meisam Vajdi
- Department of Food Science and Technology, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Azam Sharifpour
- Department of Food Science and Technology, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mohammad-R. Akbarzadeh-T
- Departments of Electrical and Computer Engineering, Center of Excellence on Soft Computing and Intelligent Information Processing (SCIIP), Ferdowsi University of Mashhad, Mashhad, Iran
| |
|
41
|
Castillo-Lara S, Abril JF. PPaxe: easy extraction of protein occurrence and interactions from the scientific literature. Bioinformatics 2019; 35:2523-2524. [PMID: 30500875 DOI: 10.1093/bioinformatics/bty988] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/26/2018] [Revised: 10/26/2018] [Accepted: 11/29/2018] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are very important to build models for understanding many biological processes. Although several databases hold many of these interactions, exploring them, selecting those relevant for a given subject and contextualizing them can be a difficult task for researchers. Extracting PPIs directly from the scientific literature can be very helpful for providing such context, as the sentences describing these interactions may give insights to researchers in helpful ways. RESULTS We have developed PPaxe, a python module and a web application that allows users to extract PPIs and protein occurrence from a given set of PubMed and PubMedCentral articles. It presents the results of the analysis in different ways to help researchers export, filter and analyze the results easily. AVAILABILITY AND IMPLEMENTATION PPaxe web demo is freely available at https://compgen.bio.ub.edu/PPaxe. All the software can be downloaded from https://compgen.bio.ub.edu/PPaxe/download, including a command-line version and docker containers for an easy installation. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
42
|
Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 2019; 6:52. [PMID: 31076572 PMCID: PMC6510737 DOI: 10.1038/s41597-019-0055-0] [Citation(s) in RCA: 162] [Impact Index Per Article: 27.0] [Received: 09/24/2018] [Accepted: 03/27/2019] [Indexed: 11/10/2022] Open
Abstract
Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring the information present in the internal structure of words or any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
Affiliation(s)
- Yijia Zhang
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA.
| |
|
43
|
Niu Y, Wu H, Wang Y. Protein-Protein Interaction Identification Using a Similarity-Constrained Graph Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:607-616. [PMID: 29989990 DOI: 10.1109/tcbb.2017.2777448] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
Protein-protein interaction (PPI) identification is an important task in text mining. Most PPI detection systems make predictions solely based on evidence within a single sentence and often suffer from the heavy burden of manual annotation. This paper approaches the PPI detection task from a different paradigm by investigating the context of protein pairs collected from a large corpus and their relations. First, crucial cues in the context are exploited to make initial predictions. Then, relational similarity between protein pairs is calculated. Finally, evidence from the two views is integrated in the framework of a minimum cuts algorithm. Experimental results show that the graph model achieves better performance than standard supervised approaches. Using 20 percent of the data as the training set, our algorithm achieves higher accuracy than a support vector machine (SVM) using 80 percent of the data as training data. Moreover, the semi-supervised settings reveal promising directions for PPI identification exploiting unlabeled data.
|
44
|
Yadav S, Ekbal A, Saha S, Kumar A, Bhattacharyya P. Feature assisted stacked attentive shortest dependency path based Bi-LSTM model for protein–protein interaction. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.11.020] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 10/27/2022]
|
45
|
Li M, He Q, Ma J, He F, Zhu Y, Chang C, Chen T. PPICurator: A Tool for Extracting Comprehensive Protein-Protein Interaction Information. Proteomics 2019; 19:e1800291. [PMID: 30521143 DOI: 10.1002/pmic.201800291] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/17/2018] [Revised: 11/12/2018] [Indexed: 11/07/2022]
Abstract
Protein-protein interaction extraction through biological literature curation is widely employed for proteome analysis. There is a strong need for a tool that can assist researchers in extracting comprehensive PPI information through literature curation, which is critical in protein research, for example, in constructing protein interaction networks, identifying protein signaling pathways, and discovering meaningful protein interactions. However, most current tools can only extract PPI relations. None of them is capable of extracting other important PPI information, such as interaction directions, effects, and functional annotations. To address these issues, this paper proposes PPICurator, a novel tool for extracting comprehensive PPI information with a variety of logic and syntax features based on a new support vector machine classifier. PPICurator provides a friendly web-based user interface. It is a platform that automates the extraction of comprehensive PPI information from the literature, including PPI relations, as well as their confidence scores, interaction directions, effects, and functional annotations. Thus, PPICurator is more comprehensive than state-of-the-art tools. Moreover, it outperforms state-of-the-art tools in the accuracy of PPI relation extraction measured by F-score and recall on the widely used open datasets. PPICurator is available at https://ppicurator.hupo.org.cn.
Affiliation(s)
- Mansheng Li
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Qiang He
- School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Victoria, 3122, Australia
| | - Jie Ma
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Fuchu He
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Yunping Zhu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Cheng Chang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Tao Chen
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| |
|
46
|
Lim S, Kang J. Chemical-gene relation extraction using recursive neural network. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5042822. [PMID: 29961818 PMCID: PMC6014134 DOI: 10.1093/database/bay060] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 02/28/2018] [Accepted: 05/25/2018] [Indexed: 11/14/2022]
Abstract
In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research on the named entity recognition of genes and drugs has been conducted, there is limited research on extracting relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element in pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text mining systems that can be used to automatically extract relationships between chemical compounds and genes. We tested three recursive neural network approaches to improve the performance of relation extraction. In the BioCreative VI challenge, we developed a tree-Long Short-Term Memory networks (tree-LSTM) model with several additional features including a position feature and a subtree containment feature, and we also applied an ensemble method. After the challenge, we applied additional pre-processing steps to the tree-LSTM model, and we tested the performance of another recursive neural network model called Stack-augmented Parser Interpreter Neural Network (SPINN). Our tree-LSTM model achieved an F-score of 58.53% in the BioCreative VI challenge. Our tree-LSTM model with additional pre-processing and the SPINN model obtained F-scores of 63.7 and 64.1%, respectively. Database URL: https://github.com/arwhirang/recursive_chemprot
Affiliation(s)
- Sangrak Lim
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
| |
|
47
|
Rios A, Kavuluru R, Lu Z. Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 2018; 34:2973-2981. [PMID: 29590309 PMCID: PMC6129312 DOI: 10.1093/bioinformatics/bty190] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 12/27/2017] [Revised: 03/15/2018] [Accepted: 03/25/2018] [Indexed: 11/13/2022] Open
Abstract
Motivation Creating large datasets for biomedical relation classification can be prohibitively expensive. While some datasets have been curated to extract protein-protein and drug-drug interactions (PPIs and DDIs) from text, we are also interested in other interactions including gene-disease and chemical-protein connections. Also, many biomedical researchers have begun to explore ternary relationships. Even when annotated data are available, many datasets used for relation classification are inherently biased. For example, issues such as sample selection bias typically prevent models from generalizing in the wild. To address the problem of cross-corpora generalization, we present a novel adversarial learning algorithm for unsupervised domain adaptation tasks where no labeled data are available in the target domain. Instead, our method takes advantage of unlabeled data to improve biased classifiers through learning domain-invariant features via an adversarial process. Finally, our method is built upon recent advances in neural network (NN) methods. Results We experiment by extracting PPIs and DDIs from text. In our experiments, we show domain invariant features can be learned in NNs such that classifiers trained for one interaction type (protein-protein) can be re-purposed to others (drug-drug). We also show that our method can adapt to different source and target pairs of PPI datasets. Compared to prior convolutional and recurrent NN-based relation classification methods without domain adaptation, we achieve improvements as high as 30% in F1-score. Likewise, we show improvements over state-of-the-art adversarial methods. Availability and implementation Experimental code is available at https://github.com/bionlproc/adversarial-relation-classification. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Anthony Rios
- National Library of Medicine (NLM), National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, MD, USA
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | - Ramakanth Kavuluru
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Division of Biomedical Informatics, Department of Internal Medicine, Lexington, KY, USA
| | - Zhiyong Lu
- National Library of Medicine (NLM), National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, MD, USA
| |
|
48
|
Zhang Y, Lin H, Yang Z, Wang J, Zhang S, Sun Y, Yang L. A hybrid model based on neural networks for biomedical relation extraction. J Biomed Inform 2018; 81:83-92. [DOI: 10.1016/j.jbi.2018.03.011] [Citation(s) in RCA: 68] [Impact Index Per Article: 9.7] [Received: 11/01/2017] [Revised: 03/15/2018] [Accepted: 03/16/2018] [Indexed: 11/27/2022]
|
49
|
Song M, Kim M, Kang K, Kim YH, Jeon S. Application of Public Knowledge Discovery Tool (PKDE4J) to Represent Biomedical Scientific Knowledge. Front Res Metr Anal 2018. [DOI: 10.3389/frma.2018.00007] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/04/2023] Open
|
50
|
Murugesan G, Abdulkadhar S, Natarajan J. Distributed smoothed tree kernel for protein-protein interaction extraction from the biomedical literature. PLoS One 2017; 12:e0187379. [PMID: 29099838 PMCID: PMC5669485 DOI: 10.1371/journal.pone.0187379] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 08/04/2017] [Accepted: 10/18/2017] [Indexed: 11/24/2022] Open
Abstract
Automatic extraction of protein-protein interaction (PPI) pairs from biomedical literature is a widely examined task in biological information extraction. Currently, many kernel-based approaches, such as linear kernels, tree kernels, graph kernels and combinations of multiple kernels, have achieved promising results in the PPI task. However, most of these kernel methods fail to capture the semantic relation information between two entities. In this paper, we present a special type of tree kernel for PPI extraction, known as the Distributed Smoothed Tree Kernel (DSTK), which exploits both syntactic (structural) and semantic vector information. DSTK comprises distributed trees with syntactic information along with distributional semantic vectors representing semantic information of the sentences or phrases. To generate a robust machine learning model, a feature-based kernel and DSTK were combined using an ensemble support vector machine (SVM). Five different corpora (AIMed, BioInfer, HPRD50, IEPA, and LLL) were used for evaluating the performance of our system. Experimental results show that our system achieves a better F-score on the five corpora than other state-of-the-art systems.
Affiliation(s)
- Gurusamy Murugesan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
| | - Sabenabanu Abdulkadhar
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
| | - Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
| |
|