1
Matsumoto Y, Gotoh H. Compound Classification and Consideration of Correlation with Chemical Descriptors from Articles on Antioxidant Capacity Using Natural Language Processing. J Chem Inf Model 2024; 64:119-127. [PMID: 38118462 DOI: 10.1021/acs.jcim.3c01826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
In recent times, there has been a substantial increase in the number of articles focusing on antioxidants. However, the development of a comprehensive estimator for antioxidant capacity remains elusive due to the challenge of integrating information from these articles. Furthermore, the complexity of the antioxidant mechanism, which involves a multitude of factors, makes it difficult to establish a simple equation or correlation. Hence, there is a pressing need for a model that can effectively interpret the collective knowledge from these articles, especially from a chemistry perspective. In this research, we employed natural language processing techniques, specifically Word2Vec, to analyze articles related to antioxidant capacity. We extracted representation vectors of compound names from these documents and organized them into 10 distinct clusters. In our investigation of two of these clusters, we unveiled that the majority of the compounds in question were flavonoids and flavonoid glycosides. To establish a link between the descriptors and clusters, we utilized kernel density estimation and generated scatter plots to visualize their similarity. These visualizations clearly indicated a strong relationship between the descriptors and clusters, affirming that a tangible connection exists between word vectors and compound descriptors through a document analysis conducted with natural language processing techniques. This study represents a pioneering approach that utilizes document analysis to shed light on the field of antioxidant capacity research, marking a significant advancement in this domain.
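The abstract gives no implementation details for the kernel density estimation step. As a minimal sketch, assuming a single numeric descriptor has been collected for the compounds in each cluster, a Gaussian kernel density estimate can be written as below. The descriptor values, the bandwidth, and the function name are illustrative assumptions, not data or code from the paper.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical descriptor values for compounds in two clusters (not from the paper).
cluster_a = [286.2, 302.2, 270.2, 318.2, 290.3]
cluster_b = [448.4, 464.4, 610.5, 432.4, 594.5]

kde_a = gaussian_kde(cluster_a, bandwidth=15.0)
kde_b = gaussian_kde(cluster_b, bandwidth=15.0)

# Compare where each cluster's density is concentrated on a common grid.
grid = [200.0 + 5.0 * i for i in range(101)]  # 200 .. 700 in steps of 5
peak_a = max(grid, key=kde_a)
peak_b = max(grid, key=kde_b)
print(f"cluster A density peaks near {peak_a}, cluster B near {peak_b}")
```

Clusters whose density curves occupy clearly separate descriptor ranges would support the kind of descriptor-to-cluster relationship the abstract describes. Heavily overlapping curves would not.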
Affiliation(s)
- Yuto Matsumoto
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
- Hiroaki Gotoh
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
2
Natural Language Processing to Extract Information from Portuguese-Language Medical Records. DATA 2022. [DOI: 10.3390/data8010011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022] Open
Abstract
Studies that use medical records are often impeded due to the information presented in narrative fields. However, recent studies have used artificial intelligence to extract and process secondary health data from electronic medical records. The aim of this study was to develop a neural network that uses data from unstructured medical records to capture information regarding symptoms, diagnoses, medications, conditions, exams, and treatment. Data from 30,000 medical records of patients hospitalized in the Clinical Hospital of the Botucatu Medical School (HCFMB), São Paulo, Brazil, were obtained, creating a corpus with 1200 clinical texts. A natural language algorithm for text extraction and convolutional neural networks for pattern recognition were used to evaluate the model with goodness-of-fit indices. The results showed good accuracy, considering the complexity of the model, with an F-score of 63.9% and a precision of 72.7%. The patient condition class reached a precision of 90.3% and the medication class reached 87.5%. The proposed neural network will facilitate the detection of relationships between diseases and symptoms and prevalence and incidence, in addition to detecting the identification of clinical conditions, disease evolution, and the effects of prescribed medications.
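The abstract reports an F-score and a precision but no recall. As a rough worked example, the implied recall can be recovered under two assumptions: that the F-score is the balanced F1 (the harmonic mean of precision and recall), and that both figures were computed over the same set of predictions. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    return f1 * precision / (2.0 * precision - f1)

# Figures quoted in the abstract; treating the F-score as F1 is an assumption.
precision = 0.727
f1 = 0.639

recall = recall_from_f1(f1, precision)
print(f"implied recall: {recall:.3f}")
```

Under these assumptions the implied recall is about 57%, which would mean the model misses a larger share of true entities than it mislabels.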
3
Li Y, Hui L, Zou L, Li H, Xu L, Wang X, Chua S. Relation Extraction in Biomedical Texts: Development of a Multi-Head Attention Model with Syntactic Dependency Feature (Preprint). JMIR Med Inform 2022; 10:e41136. [PMID: 36264604 PMCID: PMC9634522 DOI: 10.2196/41136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Revised: 08/27/2022] [Accepted: 09/07/2022] [Indexed: 11/19/2022] Open
Abstract
Background: With the rapid expansion of biomedical literature, biomedical information extraction has attracted increasing attention from researchers. In particular, relation extraction between 2 entities is a long-term research topic.
Objective: This study aimed to perform 2 multiclass relation extraction tasks of Biomedical Natural Language Processing Workshop 2019 Open Shared Tasks: relation extraction of Bacteria-Biotope (BB-rel) task and binary relation extraction of plant seed development (SeeDev-binary) task. In essence, these 2 tasks are aimed at extracting the relation between annotated entity pairs from biomedical texts, which is a challenging problem.
Methods: Traditional research methods adopted feature- or kernel-based methods and achieved good performance. For these tasks, we propose a deep learning model based on a combination of several distributed features, such as domain-specific word embedding, part-of-speech embedding, entity-type embedding, distance embedding, and position embedding. The multi-head attention mechanism is used to extract the global semantic features of an entire sentence. Meanwhile, we introduced a dependency-type feature and the shortest dependency path connecting 2 candidate entities in the syntactic dependency graph to enrich the feature representation.
Results: Experiments show that our proposed model has excellent performance in biomedical relation extraction, achieving F1 scores of 65.56% and 38.04% on the test sets of the BB-rel and SeeDev-binary tasks. Especially in the SeeDev-binary task, the F1 score of our model is superior to that of other existing models and achieves state-of-the-art performance.
Conclusions: We demonstrated that the multi-head attention mechanism can learn relevant syntactic and semantic features in different representation subspaces and different positions to extract comprehensive feature representation. Moreover, syntactic dependency features can improve the performance of the model by learning dependency relation between the entities in biomedical texts.
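To illustrate the shortest dependency path feature mentioned in the Methods, the sketch below finds the shortest path between two entity tokens with a breadth-first search over an undirected view of a dependency graph. The example sentence, its parse edges, and the function name are hypothetical. The parser and feature encoding actually used in the paper are not described in this abstract.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected view of dependency edges.

    `edges` is a list of (head, dependent, relation) triples.
    Returns the list of tokens on a shortest path, or None if unconnected.
    """
    neighbours = {}
    for head, dependent, _relation in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    previous = {source: None}
    queue = deque([source])
    while queue:
        token = queue.popleft()
        if token == target:
            # Walk back through the predecessor links to rebuild the path.
            path = []
            while token is not None:
                path.append(token)
                token = previous[token]
            return path[::-1]
        for nxt in neighbours.get(token, []):
            if nxt not in previous:
                previous[nxt] = token
                queue.append(nxt)
    return None

# Hypothetical parse of "Vibrio cholerae was isolated from river water samples".
edges = [
    ("isolated", "cholerae", "nsubjpass"),
    ("cholerae", "Vibrio", "compound"),
    ("isolated", "was", "auxpass"),
    ("isolated", "samples", "nmod"),
    ("samples", "from", "case"),
    ("samples", "water", "compound"),
    ("water", "river", "compound"),
]

path = shortest_dependency_path(edges, "cholerae", "water")
print(path)
```

The tokens on such a path, together with the relation labels along it, are the kind of compact syntactic context that can be embedded and fed to a relation classifier alongside the sentence-level attention features.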
Affiliation(s)
- Yongbin Li
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Linhu Hui
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Liping Zou
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Huyang Li
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Luo Xu
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Xiaohua Wang
  - School of Medical Information Engineering, Zunyi Medical University, Zunyi, China
- Stephanie Chua
  - Faculty of Computer Science and Information Technology, University Malaysia Sarawak, Sarawak, Malaysia
4
Ong SQ, Pauzi MBM, Gan KH. Text mining in mosquito-borne disease: A systematic review. Acta Trop 2022; 231:106447. [PMID: 35430265 PMCID: PMC9663275 DOI: 10.1016/j.actatropica.2022.106447] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/18/2022] [Revised: 03/31/2022] [Accepted: 04/01/2022] [Indexed: 01/09/2023]
Abstract
Mosquito-borne diseases are emerging and re-emerging across the globe, especially after the COVID-19 pandemic. The recent advances in text mining in infectious diseases hold the potential of providing timely access to explicit and implicit associations among information in the text. In the past few years, the availability of online text data in the form of unstructured or semi-structured text with rich content of information from this domain has enabled many studies to provide solutions in this area, e.g., disease-related knowledge discovery, disease surveillance, and early detection systems. However, to the best of our knowledge, no recent review of text mining in the domain of mosquito-borne disease was available. In this review, we survey recent work on the text mining techniques used in combating mosquito-borne diseases. We highlight the corpus sources, technologies, applications, and the challenges faced by the studies, followed by possible future directions for this domain. We present a bibliometric analysis of the 294 scientific articles that were published in Scopus and PubMed in the domain of text mining in mosquito-borne diseases from 2016 to 2021. The papers were further filtered and reviewed based on the techniques used to analyze the text related to mosquito-borne diseases. Based on the corpus of 158 selected articles, we found that 27 of the articles were relevant and used text mining in mosquito-borne diseases. These articles mostly covered Zika (38.70%), dengue (32.26%), and malaria (29.03%), with extremely few or none covering other crucial mosquito-borne diseases such as chikungunya, yellow fever, and West Nile fever. Twitter was the dominant corpus resource for text mining in mosquito-borne diseases, followed by the PubMed and LexisNexis databases. Sentiment analysis was the most popular text mining technique for understanding the discourse around the disease, followed by information extraction, which used dependency-relation and co-occurrence-based approaches to extract relations and events. Surveillance was the main use in most of the reviewed studies, followed by treatment, which focused on drug-disease or symptom-disease associations. Advances in text mining could improve the management of mosquito-borne diseases. However, the techniques and applications posed many limitations and challenges, including biases such as user authentication and language, and real-world implementation. We discuss future directions that can be useful to expand this area and domain. This review paper contributes mainly as a library for text mining in mosquito-borne diseases and could be extended to explore systems for other neglected diseases.
Affiliation(s)
- Song-Quan Ong
  - Institute for Tropical Biology and Conservation, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, Sabah 88400, Malaysia (corresponding author)
- Keng Hoon Gan
  - School of Computer Sciences, Universiti Sains Malaysia, Penang 11800, Malaysia
5
Transducer Cascades for Biological Literature-Based Discovery. INFORMATION 2022. [DOI: 10.3390/info13050262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
G protein-coupled receptors (GPCRs) control the response of cells to many signals, and as such, are involved in most cellular processes. As membrane receptors, they are accessible at the surface of the cell. GPCRs are also the largest family of membrane receptors, with more than 800 representatives in mammal genomes. For this reason, they are ideal targets for drugs. Although about one third of approved drugs target GPCRs, only about 16% of GPCRs are targeted by drugs. One of the difficulties comes from the lack of knowledge on the intra-cellular events triggered by these molecules. In the last two decades, scientists have started mapping the signaling networks triggered by GPCRs. However, it soon appeared that the system is very complex, which led to the publication of more than 320,000 scientific papers. Clearly, a human cannot take into account such massive sources of information. These papers represent a mine of information about both ontological knowledge and experimental results related to GPCRs, which have to be exploited in order to build signaling networks. The ABLISS project aims at the automatic building of GPCR networks using automated deductive reasoning, allowing all available data to be integrated. To that end, we carried out the automatic extraction of network information from the literature using Natural Language Processing (NLP). We mainly focused on the experimental results about GPCRs reported in the scientific papers, as so far there is no source gathering all these experimental results. We designed a relational database in order to make them available to the scientific community later. After introducing the more general objectives of the ABLISS project, we describe the formalism in detail. We then explain the NLP program that we implemented using finite-state methods (Unitex graph cascades) and discuss the extracted facts. Finally, we present the design of the relational database that stores the facts extracted from the selected papers.
6
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which puts biomedical literature mining (BLM) techniques in high demand. Numerous efforts have been made on this topic from both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability and thus develops more sophisticated and universal models. The goal of this paper is to provide a review of the recent advances in BLM from both communities and inspire new research directions.
Affiliation(s)
- Sendong Zhao
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
- Chang Su
  - Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institute of Health, Bethesda, MD, USA
- Fei Wang
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
7
Wang LL, Lo K. Text mining approaches for dealing with the rapidly expanding literature on COVID-19. Brief Bioinform 2021; 22:781-799. [PMID: 33279995 PMCID: PMC7799291 DOI: 10.1093/bib/bbaa296] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 08/05/2020] [Revised: 10/02/2020] [Accepted: 10/07/2020] [Indexed: 12/13/2022] Open
Abstract
More than 50 000 papers have been published about COVID-19 since the beginning of 2020 and several hundred new papers continue to be published every day. This incredible rate of scientific productivity leads to information overload, making it difficult for researchers, clinicians and public health officials to keep up with the latest findings. Automated text mining techniques for searching, reading and summarizing papers are helpful for addressing information overload. In this review, we describe the many resources that have been introduced to support text mining applications over the COVID-19 literature; specifically, we discuss the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19. We compile a list of 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature. For each system, we provide a qualitative description and assessment of the system's performance, unique data or user interface features and modeling decisions. Many systems focus on search and discovery, though several systems provide novel features, such as the ability to summarize findings over multiple documents or linking between scientific articles and clinical trials. We also describe the public corpora, models and shared tasks that have been introduced to help reduce repeated effort among community members; some of these resources (especially shared tasks) can provide a basis for comparing the performance of different systems. Finally, we summarize promising results and open challenges for text mining the COVID-19 literature.
Affiliation(s)
- Lucy Lu Wang
  - The Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
- Kyle Lo
  - The Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
8
Espinosa C, Becker M, Marić I, Wong RJ, Shaw GM, Gaudilliere B, Aghaeepour N, Stevenson DK. Data-Driven Modeling of Pregnancy-Related Complications. Trends Mol Med 2021; 27:762-776. [PMID: 33573911 DOI: 10.1016/j.molmed.2021.01.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/09/2020] [Revised: 12/01/2020] [Accepted: 01/20/2021] [Indexed: 12/11/2022]
Abstract
A healthy pregnancy depends on complex interrelated biological adaptations involving placentation, maternal immune responses, and hormonal homeostasis. Recent advances in high-throughput technologies have provided access to multiomics biological data that, combined with clinical and social data, can provide a deeper understanding of normal and abnormal pregnancies. Integration of these heterogeneous datasets using state-of-the-art machine-learning methods can enable the prediction of short- and long-term health trajectories for a mother and offspring and the development of treatments to prevent or minimize complications. We review advanced machine-learning methods that could: provide deeper biological insights into a pregnancy not yet unveiled by current methodologies; clarify the etiologies and heterogeneity of pathologies that affect a pregnancy; and suggest the best approaches to address disparities in outcomes affecting vulnerable populations.
Affiliation(s)
- Camilo Espinosa
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Martin Becker
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
- Ivana Marić
  - Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Ronald J Wong
  - Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Gary M Shaw
  - Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Brice Gaudilliere
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Nima Aghaeepour
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA; Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
- David K Stevenson
  - Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
9
Chen Y. A transfer learning model with multi-source domains for biomedical event trigger extraction. BMC Genomics 2021; 22:31. [PMID: 33413073 PMCID: PMC7788773 DOI: 10.1186/s12864-020-07315-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/19/2020] [Accepted: 12/07/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Automatic extraction of biomedical events from the literature, which allows the latest discoveries to be updated faster and automatically, is currently a hot research topic. Trigger word recognition is a critical step in the process of event extraction. Its performance directly influences the results of the event extraction. In general, machine learning-based trigger recognition approaches such as neural networks must be trained on a dataset with plentiful annotations to achieve high performance. However, the problem with the datasets in wide coverage event domains is that their annotations are insufficient and imbalanced. One of the methods widely used to deal with this problem is transfer learning. In this work, we aim to extend transfer learning to utilize multiple source domains. Multiple source domain datasets can be jointly trained to help achieve a higher recognition performance on a target domain with wide coverage events.
RESULTS: Based on the study of previous work, we propose an improved multi-source domain neural network transfer learning architecture and a training approach for the biomedical trigger detection task, which can share knowledge between the multi-source and target domains more comprehensively. We extend the ability of traditional adversarial networks to extract common features between source and target domains when there is more than one dataset in the source domains. Multiple feature extraction channels are designed to simultaneously capture global and local common features. Moreover, under the constraint of an extra classifier, the multiple local common feature sub-channels can extract and transfer more diverse common features from the related multi-source domains effectively. In the experiments, the MLEE corpus is used as the target dataset to train and test the proposed model to recognize the wide coverage triggers. Four other corpora from different domains, with varying degrees of relevance to MLEE, are used as source datasets. Our proposed approach achieves recognition improvement compared with traditional adversarial networks. Moreover, its performance is competitive compared with the results of other leading systems on the same MLEE corpus.
CONCLUSIONS: The proposed Multi-Source Transfer Learning-based Trigger Recognizer (MSTLTR) can further improve the performance compared with the traditional method when there is more than one source domain. The most essential improvement is that our approach represents common features in two aspects: the global common features and the local common features. Hence, these more sharable features improve the performance and generalization of the model on the target domain effectively.
Affiliation(s)
- Yifei Chen
  - School of Information Engineering, Nanjing Audit University, 86 West Yushan Road, Nanjing, China
10
Sousa D, Lamurias A, Couto FM. Using Neural Networks for Relation Extraction from Biomedical Literature. Methods Mol Biol 2021; 2190:289-305. [PMID: 32804372 DOI: 10.1007/978-1-0716-0826-5_14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Using different sources of information to support automated extracting of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural networks algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.
Affiliation(s)
- Diana Sousa
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Andre Lamurias
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Francisco M Couto
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
11
Cheerkoot-Jalim S, Khedo KK. A systematic review of text mining approaches applied to various application areas in the biomedical domain. JOURNAL OF KNOWLEDGE MANAGEMENT 2020. [DOI: 10.1108/jkm-09-2019-0524] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/11/2022]
Abstract
Purpose
This work shows the results of a systematic literature review on biomedical text mining. The purpose of this study is to identify the different text mining approaches used in different application areas of the biomedical domain, the common tools used and the challenges of biomedical text mining as compared to generic text mining algorithms. This study will be of value to biomedical researchers by allowing them to correlate text mining approaches to specific biomedical application areas. Implications for future research are also discussed.
Design/methodology/approach
The review was conducted following the principles of the Kitchenham method. A number of research questions were first formulated, followed by the definition of the search strategy. The papers were then selected based on a list of assessment criteria. Each of the papers was analyzed, and information relevant to the research questions was extracted.
Findings
It was found that researchers have mostly harnessed data sources such as electronic health records, biomedical literature, social media and health-related forums. The most common text mining technique was natural language processing using tools such as MetaMap and Unstructured Information Management Architecture, alongside the use of medical terminologies such as Unified Medical Language System. The main application area was the detection of adverse drug events. Challenges identified included the need to deal with huge amounts of text, the heterogeneity of the different data sources, the duality of meaning of words in biomedical text and the amount of noise introduced mainly from social media and health-related forums.
Originality/value
To the best of the authors’ knowledge, other reviews in this area have focused on either specific techniques, specific application areas or specific data sources. The results of this review will help researchers to correlate most relevant and recent advances in text mining approaches to specific biomedical application areas by providing an up-to-date and holistic view of work done in this research area. The use of emerging text mining techniques has great potential to spur the development of innovative applications, thus considerably impacting on the advancement of biomedical research.
12
Menadue CB. Pandemics, epidemics, viruses, plagues, and disease: Comparative frequency analysis of a cultural pathology reflected in science fiction magazines from 1926 to 2015. Social Sciences & Humanities Open 2020; 2:100048. [PMID: 34173491 PMCID: PMC7480741 DOI: 10.1016/j.ssaho.2020.100048] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/30/2020] [Revised: 07/13/2020] [Accepted: 07/13/2020] [Indexed: 12/03/2022]
Abstract
Science fiction includes many dystopian narratives, often featuring epidemics, pandemics, plagues, viruses, and disease. As science fiction has grown in popularity and prevalence it appeals to an increasingly broad demographic, is employed in research communication and education, and as a genre it is frequently argued that it reflects contemporary cultural interests and concerns. To identify the relevance of science fiction as an indicator of popular trends relating to the pathologies of disease, a word frequency comparison of selected key words found in the Google Books 2012 English Corpus has been made to a representative corpus of science fiction magazines dating between 1926 and 2015. Selected issues were reviewed to identify concepts, situations, and outcomes that could readily be measured against real-world examples from current and recent pandemics. The findings indicate that science fiction does appear to mirror and magnify contemporary literary trends, and provides potentially revealing correlations to real-world historical events. In this regard, science fiction might be regarded as a form of ‘cultural pathology’ of popular interests related to the spread and impact of disease that may be valuable in gauging the degree to which society is engaged with these topics at any specific time. Science fiction topics tend to reflect real-world historical events. Comparison of English corpus Google Books word frequencies to science fiction. Science fiction investigates social, cultural and psychological concerns. Science fiction content indicates a ‘cultural pathology’ of popular interests.
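A word frequency comparison of this kind can be sketched as follows, assuming each corpus is available as plain text and that counts are normalized per million tokens so that corpora of different sizes can be compared. The tokenizer, the toy texts, and the function name are illustrative assumptions, not the study's actual pipeline.

```python
import re
from collections import Counter

def relative_frequencies(text, keywords, per=1_000_000):
    """Occurrences of each keyword per `per` tokens of `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in keywords}
    counts = Counter(tokens)
    return {word: counts[word] * per / total for word in keywords}

# Toy stand-ins for the two corpora (hypothetical text).
science_fiction = "The plague spread fast. A virus, then another virus, and plague again."
general_books = "The garden was quiet. A virus was mentioned once in the long report."

keywords = ["plague", "virus", "epidemic"]
sf = relative_frequencies(science_fiction, keywords)
general = relative_frequencies(general_books, keywords)

for word in keywords:
    print(f"{word}: SF {sf[word]:.0f} vs general {general[word]:.0f} per million tokens")
```

Repeating the calculation per publication year would give the time series needed to compare trends in the two corpora against historical events.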
13
Current trends in cancer immunotherapy: a literature-mining analysis. Cancer Immunol Immunother 2020; 69:2425-2439. [PMID: 32556496 DOI: 10.1007/s00262-020-02630-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/08/2020] [Accepted: 05/28/2020] [Indexed: 11/27/2022]
Abstract
Cancer immunotherapy is a rapidly growing field that is completely transforming oncology care. Mining this knowledge base for biomedically important information is becoming increasingly challenging, due to the expanding number of scientific publications, and the dynamic evolution of this subject with time. In this study, we have employed a literature-mining approach that was used to analyze the cancer immunotherapy-related publications listed in PubMed and quantify emerging trends. A total of 93,033 publications published in 5055 journals have been retrieved, and 141 meaningful topics have been identified, which were further classified into eight distinct categories. Statistical analysis indicates a mean annual increase in the number of published papers of approximately 8% in the last 20 years. The research topics that exhibited the highest trends included "immune checkpoint inhibitors," "tumor microenvironment," "HPV vaccination," "CAR T-cells," and "gene mutations/tumor profiling." The top identified cancer types included "lung," "colorectal," and "breast cancer," and a shift in popularity from hematological to solid tumors was observed. As regards clinical research, a transition from early phase clinical trials to randomized control trials was recorded, indicating that the field is entering a more advanced phase of development. Overall, this mining approach provided an unbiased analysis of the cancer immunotherapy literature in a time-conserving and scale-efficient manner.
|
14
|
Zhu H, Zeng Y, Wang D, Huangfu C. Species Classification for Neuroscience Literature Based on Span of Interest Using Sequence-to-Sequence Learning Model. Front Hum Neurosci 2020; 14:128. [PMID: 32372933 PMCID: PMC7187631 DOI: 10.3389/fnhum.2020.00128] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/09/2020] [Accepted: 03/19/2020] [Indexed: 11/13/2022] Open
Abstract
Large-scale neuroscience literature calls for effective methods to mine knowledge from a species perspective to link the brain and neuroscience communities, neurorobotics, computing devices, and AI research communities. Structured knowledge can motivate researchers to better understand the functionality and structure of the brain and link the related resources and components. However, the abstracts of many scientific works do not explicitly mention the species. Therefore, in addition to dictionary-based methods, we need to mine species using cognitive computing models that are more like the human reading process, and these methods can take advantage of the rich information in the literature. We also enable the model to automatically distinguish whether the mentioned species is the main research subject. Distinguishing the two situations can generate value at different levels of knowledge management. We propose the SpecExplorer project, which is used to explore the knowledge associations of different species for brain and neuroscience. This project frees humans from the tedious task of classifying neuroscience literature by species. The species classification task is a multi-label classification problem, which is more complex than single-label classification due to the correlation between labels. To resolve this problem, we present a sequence-to-sequence classification framework to adaptively assign multiple species to the literature. To model the structure information of documents, we propose hierarchical attentive decoding (HAD) to extract the span of interest (SOI) for predicting each species. We create three datasets from PubMed and PMC corpora. We present two versions of annotation criteria (mention-based annotation and semantic-based annotation) for species research. Experiments demonstrate that our approach achieves improvements in the final results. Finally, we perform species-based analysis of brain diseases, brain cognitive functions, and proteins related to the hippocampus and provide potential research directions for certain species.
Affiliation(s)
- Hongyin Zhu
  - Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  - School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
- Yi Zeng
  - Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  - School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  - Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science, Beijing, China
- Dongsheng Wang
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Cunqing Huangfu
  - Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
|
15
|
Krsnik I, Glavaš G, Krsnik M, Miletić D, Štajduhar I. Automatic Annotation of Narrative Radiology Reports. Diagnostics (Basel) 2020; 10:E196. [PMID: 32244833 PMCID: PMC7235892 DOI: 10.3390/diagnostics10040196] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/26/2020] [Revised: 03/27/2020] [Accepted: 03/27/2020] [Indexed: 12/04/2022] Open
Abstract
Narrative texts in electronic health records can be efficiently utilized for building decision support systems in the clinic, only if they are correctly interpreted automatically in accordance with a specified standard. This paper tackles the problem of developing an automated method of labeling free-form radiology reports, as a precursor for building query-capable report databases in hospitals. The analyzed dataset consists of 1295 radiology reports concerning the condition of a knee, retrospectively gathered at the Clinical Hospital Centre Rijeka, Croatia. Reports were manually labeled with one or more labels from a set of 10 most commonly occurring clinical conditions. After primary preprocessing of the texts, two sets of text classification methods were compared: (1) traditional classification models, namely Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Random Forests (RF), coupled with Bag-of-Words (BoW) features (i.e., symbolic text representation) and (2) Convolutional Neural Network (CNN) coupled with dense word vectors (i.e., word embeddings as a semantic text representation) as input features. We resorted to nested 10-fold cross-validation to evaluate the performance of competing methods using accuracy, precision, recall, and F1 score. The CNN with semantic word representations as input yielded the overall best performance, having a micro-averaged F1 score of 86.7%. The CNN classifier yielded particularly encouraging results for the most represented conditions: degenerative disease (95.9%), arthrosis (93.3%), and injury (89.2%). As a data-hungry deep learning model, the CNN, however, performed notably worse than the competing models on underrepresented classes with fewer training instances such as multicausal disease or metabolic disease. LR, RF, and SVM performed comparably well, with the obtained micro-averaged F1 scores of 84.6%, 82.2%, and 82.1%, respectively.
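The micro-averaged F1 score reported in this abstract pools true positives, false positives, and false negatives over all labels before computing precision and recall. The following is a minimal sketch of that standard definition for multi-label data; the function name `micro_f1` and the toy label sets are hypothetical and are not taken from the paper.

```python
def micro_f1(true_labels: list[set[str]], pred_labels: list[set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions (one label set per report)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but not in the gold set
        fn += len(truth - pred)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: three reports, each with one or more condition labels.
truth = [{"injury"}, {"arthrosis", "degenerative disease"}, {"injury", "arthrosis"}]
pred = [{"injury"}, {"arthrosis"}, {"injury", "degenerative disease"}]

print(round(micro_f1(truth, pred), 4))
```

Because counts are pooled across labels, frequent conditions dominate the micro-averaged score, which is consistent with the abstract's remark that performance on underrepresented classes can be much lower than the headline figure suggests.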
Affiliation(s)
- Ivan Krsnik
  - Department of Computer Engineering, Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia
- Goran Glavaš
  - School of Business Informatics and Mathematics, University of Mannheim, 68159 Mannheim, Germany
- Marina Krsnik
  - Faculty of Veterinary Medicine, University of Zagreb, Heinzelova 55, 10000 Zagreb, Croatia
- Damir Miletić
  - Clinical Hospital Centre Rijeka, University of Rijeka, Krešimirova 42, 51000 Rijeka, Croatia
- Ivan Štajduhar
  - Department of Computer Engineering, Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia
  - Center for Artificial Intelligence and Cybersecurity, University of Rijeka, Radmile Matejčić 2, 51000 Rijeka, Croatia
|
16
|
Jiang K, Yang T, Wu C, Chen L, Mao L, Wu Y, Deng L, Jiang T. LATTE: A knowledge-based method to normalize various expressions of laboratory test results in free text of Chinese electronic health records. J Biomed Inform 2020; 102:103372. [PMID: 31901507 DOI: 10.1016/j.jbi.2019.103372] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/07/2019] [Revised: 12/29/2019] [Accepted: 12/30/2019] [Indexed: 12/23/2022]
Abstract
BACKGROUND A wealth of clinical information is buried in free text of electronic health records (EHR), and converting clinical information to machine-understandable form is crucial for the secondary use of EHRs. Laboratory test results, as one of the most important types of clinical information, are written in various styles in free text of EHRs. This has brought great difficulties for data integration and utilization of EHRs. Therefore, developing technology to normalize different expressions of laboratory test results in free text is indispensable for the secondary use of EHRs. METHODS In this study, we developed a knowledge-based method named LATTE (transforming lab test results), which could transform various expressions of laboratory test results into a normalized and machine-understandable format. We first identified the analyte of a laboratory test result with a dictionary-based method and then designed a series of rules to detect information associated with the analyte, including its specimen, measured value, unit of measure, conclusive phrase and sampling factor. We determined whether a test result is normal or abnormal by understanding the meaning of conclusive phrases or by comparing its measured value with an appropriate normal range. Finally, we converted various expressions of laboratory test results, either in numeric or textual form, into a normalized form as "specimen-analyte-abnormality". With this method, a laboratory test with the same type of abnormality would have the same representation, regardless of the way that it is mentioned in free text. RESULTS LATTE was developed and optimized on a training set including 8894 laboratory test results from 756 EHRs, and evaluated on a test set including 3740 laboratory test results from 210 EHRs. 
Compared to experts' annotations, LATTE achieved a precision of 0.936, a recall of 0.897 and an F1 score of 0.916 on the training set, and a precision of 0.892, a recall of 0.843 and an F1 score of 0.867 on the test set. For 223 laboratory tests with at least two different expression forms in the test set, LATTE transformed 85.7% (2870/3350) of laboratory test results into a normalized form. Besides, LATTE achieved F1 scores above 0.8 for EHRs from 18 of 21 different hospital departments, indicating its generalization capabilities in normalizing laboratory test results. CONCLUSION In conclusion, LATTE is an effective method for normalizing various expressions of laboratory test results in free text of EHRs. LATTE will facilitate EHR-based applications such as cohort querying, patient clustering and machine learning. AVAILABILITY LATTE is freely available for download on GitHub (https://github.com/denglizong/LATTE).
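The pipeline outlined in the METHODS part of this abstract (dictionary lookup of the analyte, rules for the associated value or conclusive phrase, then a "specimen-analyte-abnormality" output) can be sketched roughly as follows. This is not the LATTE implementation: the dictionaries, the reference ranges, the English toy inputs, and the function name `normalize` are all hypothetical, and the real system targets Chinese free text with a far richer rule set.

```python
import re
from typing import Optional

# Hypothetical mini knowledge base: analyte -> (specimen, lower bound, upper bound).
# The ranges are illustrative placeholders, not clinical reference values.
ANALYTES = {
    "glucose": ("serum", 3.9, 6.1),
    "hemoglobin": ("blood", 120.0, 160.0),
}

# Hypothetical conclusive phrases mapped to an abnormality label.
PHRASES = {
    "elevated": "high",
    "increased": "high",
    "decreased": "low",
    "reduced": "low",
    "normal": "normal",
}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def normalize(mention: str) -> Optional[str]:
    """Map a free-text lab result to 'specimen-analyte-abnormality', or None."""
    text = mention.lower()
    # Step 1: dictionary-based analyte identification.
    analyte = next((name for name in ANALYTES if name in text), None)
    if analyte is None:
        return None
    specimen, low, high = ANALYTES[analyte]
    # Step 2: a conclusive phrase, if present, decides the abnormality directly.
    for phrase, label in PHRASES.items():
        if re.search(rf"\b{phrase}\b", text):
            return f"{specimen}-{analyte}-{label}"
    # Step 3: otherwise compare the first measured value with the normal range.
    match = NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group())
    if value < low:
        label = "low"
    elif value > high:
        label = "high"
    else:
        label = "normal"
    return f"{specimen}-{analyte}-{label}"


print(normalize("Glucose 7.8 mmol/L"))
print(normalize("glucose is elevated"))
```

Both toy inputs map to the same normalized string, which is the property the abstract emphasizes: the same type of abnormality gets the same representation regardless of how it is written.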
Affiliation(s)
- Kun Jiang
  - Institute of Biophysics, Chinese Academy of Sciences, 15 Datun Road, Chaoyang District, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100101, China; Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China; Changsha Hancloud Information Technology Co., Ltd., Hunan 410000, China
- Tao Yang
  - The Second Affiliated Hospital of Soochow University, Jiangsu 215008, China
- Chunyan Wu
  - Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Department of Pharmacology, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
- Luming Chen
  - Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China
- Longfei Mao
  - Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China
- Yongyou Wu
  - The Second Affiliated Hospital of Soochow University, Jiangsu 215008, China
- Lizong Deng
  - Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China
- Taijiao Jiang
  - Suzhou Institute of Systems Medicine, Suzhou, Jiangsu 215123, China; Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China
|
17
|
Boland MR, Kashyap A, Xiong J, Holmes J, Lorch S. Development and validation of the PEPPER framework (Prenatal Exposure PubMed ParsER) with applications to food additives. J Am Med Inform Assoc 2019; 25:1432-1443. [PMID: 30371821 PMCID: PMC6213088 DOI: 10.1093/jamia/ocy119] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/18/2018] [Accepted: 08/13/2018] [Indexed: 11/14/2022] Open
Abstract
Background Globally, 36% of deaths among children can be attributed to environmental factors. However, no comprehensive list of environmental exposures exists. We seek to address this gap by developing a literature-mining algorithm to catalog prenatal environmental exposures. Methods We designed a framework called PEPPER (Prenatal Exposure PubMed ParsER) to a) catalog prenatal exposures studied in the literature and b) identify study type. Using PubMed Central, PEPPER classifies article type (methodology, systematic review) and catalogs prenatal exposures. We coupled PEPPER with the FDA's food additive database to form a master set of exposures. Results We found that of 31 764 prenatal exposure studies only 53.0% were methodology studies. PEPPER consists of 219 prenatal exposures, including a common set of 43 exposures. PEPPER captured prenatal exposures from 56.4% of methodology studies (9492/16 832 studies). Two raters independently reviewed 50 randomly selected articles and annotated presence of exposures and study methodology type. Error rates for PEPPER's exposure assignment ranged from 0.56% to 1.30% depending on the rater. Evaluation of the study type assignment showed agreement ranging from 96% to 100% (kappa = 0.909, p < .001). Using a gold-standard set of relevant prenatal exposure studies, PEPPER achieved a recall of 94.4%. Conclusions Using curated exposures and food additives, PEPPER provides the first comprehensive list of 219 prenatal exposures studied in methodology papers. On average, 1.45 exposures were investigated per study. PEPPER successfully distinguished article type for all prenatal studies, allowing literature gaps to be easily identified.
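The agreement statistic quoted in this abstract (kappa) corrects raw agreement for the agreement expected by chance. The abstract does not say which variant was used; the sketch below assumes Cohen's kappa for two label sequences, and the function name `cohen_kappa` and the toy study-type labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each side's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example: study-type labels for six articles from two sources.
source_a = ["methodology", "methodology", "review", "review", "methodology", "review"]
source_b = ["methodology", "methodology", "review", "methodology", "methodology", "review"]

print(round(cohen_kappa(source_a, source_b), 4))
```

In this toy case the raw agreement is 5/6 but kappa is lower, because half of that agreement would be expected by chance given the label frequencies.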
Affiliation(s)
- Mary Regina Boland
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA; Center for Excellence in Environmental Toxicology, University of Pennsylvania, Philadelphia, PA, USA; Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Aditya Kashyap
  - Data Science Masters Program, University of Pennsylvania, Philadelphia, PA, USA
- Jiadi Xiong
  - Data Science Masters Program, University of Pennsylvania, Philadelphia, PA, USA
- John Holmes
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Scott Lorch
  - Division of Neonatology, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
|
18
|
Farahmand S, Riley T, Zarringhalam K. ModEx: A text mining system for extracting mode of regulation of transcription factor-gene regulatory interaction. J Biomed Inform 2019; 102:103353. [PMID: 31857203 DOI: 10.1016/j.jbi.2019.103353] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/15/2019] [Revised: 11/22/2019] [Accepted: 12/10/2019] [Indexed: 10/25/2022]
Abstract
BACKGROUND Transcription factors (TFs) are proteins that are fundamental to transcription and regulation of gene expression. Each TF may regulate multiple genes and each gene may be regulated by multiple TFs. TFs can act as either activators or repressors of gene expression. This complex network of interactions between TFs and genes underlies many developmental and biological processes and is implicated in several human diseases such as cancer. Hence, deciphering the network of TF-gene interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits. There are many experimental, computational, and manually curated databases of TF-gene interactions. In particular, high-throughput ChIP-Seq datasets provide a large-scale map of transcriptional regulatory interactions. However, these interactions are not annotated with information on context and mode of regulation. Such information is crucial to gain a global picture of gene regulatory mechanisms and can aid in developing machine learning models for applications such as biomarker discovery, prediction of response to therapy, and precision medicine. METHODS In this work, we introduce a text-mining system to annotate ChIP-Seq derived interactions with such metadata through mining PubMed articles. We evaluate the performance of our system using gold standard small scale manually curated databases. RESULTS Our results show that the method is able to accurately extract mode of regulation with F-score 0.77 on TRRUST curated interactions and F-score 0.96 on the intersection of TRRUST and the ChIP-network. We provide an HTTP REST API for our code to facilitate usage. AVAILABILITY Source code and datasets are available for download on GitHub: https://github.com/samanfrm/modex.
Affiliation(s)
- Saman Farahmand
  - Computational Sciences PhD program, University of Massachusetts Boston, Boston, USA; Department of Biology, University of Massachusetts Boston, Boston, USA
- Todd Riley
  - Department of Biology, University of Massachusetts Boston, Boston, USA
|
19
|
Zhang T, Lin H, Ren Y, Yang L, Xu B, Yang Z, Wang J, Zhang Y. Adverse drug reaction detection via a multihop self-attention mechanism. BMC Bioinformatics 2019; 20:479. [PMID: 31533622 PMCID: PMC6751590 DOI: 10.1186/s12859-019-3053-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 02/21/2019] [Accepted: 08/26/2019] [Indexed: 12/17/2022] Open
Abstract
Background The adverse reactions that are caused by drugs are potentially life-threatening problems. Comprehensive knowledge of adverse drug reactions (ADRs) can reduce their detrimental impacts on patients. Detecting ADRs through clinical trials takes a large number of experiments and a long period of time. With the growing amount of unstructured textual data, such as biomedical literature and electronic records, detecting ADRs in the available unstructured data has important implications for ADR research. Most neural network-based methods focus on the simple semantic information of sentence sequences; however, the relationship between the two entities depends on more complex semantic information. Methods In this paper, we propose a multihop self-attention mechanism (MSAM) model that aims to learn multi-aspect semantic information for the ADR detection task. First, the contextual information of the sentence is captured by using the bidirectional long short-term memory (Bi-LSTM) model. Then, by applying multiple steps of an attention mechanism, multiple semantic representations of a sentence are generated. Each attention step obtains a different attention distribution focusing on different segments of the sentence. Meanwhile, our model locates and enhances various keywords from the multiple representations of a sentence. Results Our model was evaluated by using two ADR corpora. It is shown that the method has a stable generalization ability. In extensive experiments, our model achieved F-measures of 0.853, 0.799 and 0.851 for ADR detection on TwiMed-PubMed, TwiMed-Twitter, and ADE, respectively. The experimental results showed that our model significantly outperforms other compared models for ADR detection. Conclusions In this paper, we propose a modification of the multihop self-attention mechanism (MSAM) model for an ADR detection task. The proposed method significantly improved the learning of the complex semantic information of sentences.
Affiliation(s)
- Tongxuan Zhang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Hongfei Lin
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Yuqi Ren
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Liang Yang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Bo Xu
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Zhihao Yang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Jian Wang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Yijia Zhang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
|
20
|
Chen Y. Multiple-level biomedical event trigger recognition with transfer learning. BMC Bioinformatics 2019; 20:459. [PMID: 31492112 PMCID: PMC6731566 DOI: 10.1186/s12859-019-3030-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/11/2019] [Accepted: 08/16/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Automatic extraction of biomedical events from literature is an important task in understanding biological systems, allowing for faster, automatic updating with the latest discoveries. Detecting trigger words which indicate events is a critical step in the process of event extraction, because the following steps depend on the recognized triggers. The task in this study is to identify event triggers from the literature across multiple levels of biological organization. In order to achieve high performance, machine learning based approaches, such as neural networks, must be trained on a dataset with plentiful annotations. However, annotations might be difficult to obtain on the multiple levels, and annotated resources have so far mainly focused on the relations and processes at the molecular level. In this work, we aim to apply transfer learning for multiple-level trigger recognition, in which a source dataset with sufficient annotations on the molecular level is utilized to improve performance on a target domain with insufficient annotations and more trigger types. RESULTS We propose a generalized cross-domain neural network transfer learning architecture and approach, which can share as much knowledge as possible between the source and target domains, especially when their label sets overlap. In the experiments, the MLEE corpus is used as the target dataset to train and test the proposed model for recognizing multiple-level triggers. Two different corpora from the BioNLP'09 and BioNLP'11 Shared Tasks, with varying degrees of label overlap with MLEE, are used as source datasets. Regardless of the degree of overlap, our proposed approach achieves recognition improvement. Moreover, its performance exceeds previously reported results of other leading systems on the same MLEE corpus. 
CONCLUSIONS The proposed transfer learning method can further improve the performance compared with the traditional method, when the labels of the source and target datasets overlap. The most essential reason is that our approach has changed the way parameters are shared. The vertical sharing replaces the horizontal sharing, which brings more sharable parameters. Hence, these more shared parameters between networks improve the performance and generalization of the model on the target domain effectively.
Affiliation(s)
- Yifei Chen
  - School of Information Engineering, Nanjing Audit University, 86 West Yushan Road, Nanjing, China
|
21
|
Percha B, Altman RB. A global network of biomedical relationships derived from text. Bioinformatics 2019; 34:2614-2624. [PMID: 29490008 PMCID: PMC6061699 DOI: 10.1093/bioinformatics/bty114] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 05/17/2017] [Accepted: 02/26/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation The biomedical community’s collective understanding of how chemicals, genes and phenotypes interact is distributed across the text of over 24 million research articles. These interactions offer insights into the mechanisms behind higher order biochemical phenomena, such as drug-drug interactions and variations in drug response across individuals. To assist their curation at scale, we must understand what relationship types are possible and map unstructured natural language descriptions onto these structured classes. We used NCBI’s PubTator annotations to identify instances of chemical, gene and disease names in Medline abstracts and applied the Stanford dependency parser to find connecting dependency paths between pairs of entities in single sentences. We combined a published ensemble biclustering algorithm (EBC) with hierarchical clustering to group the dependency paths into semantically-related categories, which we annotated with labels, or ‘themes’ (‘inhibition’ and ‘activation’, for example). We evaluated our theme assignments against six human-curated databases: DrugBank, Reactome, SIDER, the Therapeutic Target Database, OMIM and PharmGKB. Results Clustering revealed 10 broad themes for chemical-gene relationships, 7 for chemical-disease, 10 for gene-disease and 9 for gene–gene. In most cases, enriched themes corresponded directly to known database relationships. Our final dataset, represented as a network, contained 37 491 thematically-labeled chemical-gene edges, 2 021 192 chemical-disease edges, 136 206 gene-disease edges and 41 418 gene–gene edges, each representing a single-sentence description of an interaction from somewhere in the literature. Availability and implementation The complete network is available on Zenodo (https://zenodo.org/record/1035500). We have also provided the full set of dependency paths connecting biomedical entities in Medline abstracts, with associated sentences, for future use by the biomedical research community. 
Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bethany Percha
  - Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
- Russ B Altman
  - Department of Bioengineering, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA
|
22
|
Azam MF, Musa A, Dehmer M, Yli-Harja OP, Emmert-Streib F. Global Genetics Research in Prostate Cancer: A Text Mining and Computational Network Theory Approach. Front Genet 2019; 10:70. [PMID: 30838019 PMCID: PMC6383410 DOI: 10.3389/fgene.2019.00070] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/21/2018] [Accepted: 01/28/2019] [Indexed: 11/13/2022] Open
Abstract
Prostate cancer is the most common cancer type in men in Finland and the second most common worldwide. In this paper, we analyze almost 150,000 published papers about prostate cancer, authored by tens of thousands of scientists worldwide, with an integrated text mining and computational network theory approach. We demonstrate how to integrate text mining with network analysis to investigate the research contributions of countries and collaborations within and between countries. Furthermore, we study the time evolution of individually and collectively studied genes. Finally, we investigate a collaboration network of Finland and compare studied genes with globally studied genes in prostate cancer genetics. Overall, our results provide a global overview of prostate cancer research in genetics. In addition, we present a specific discussion for Finland. Our results shed light on trends within the last 30 years and are useful for translational researchers within the full range from genetics to public health management and health policy.
Affiliation(s)
- Md Facihul Azam
  - Predictive Society and Data Analysis Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland; Institute of Biosciences and Medical Technology, Tampere, Finland
- Aliyu Musa
  - Predictive Society and Data Analysis Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland; Institute of Biosciences and Medical Technology, Tampere, Finland
- Matthias Dehmer
  - Faculty for Management, Institute for Intelligent Production, University of Applied Sciences Upper Austria, Steyr, Austria; Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria; College of Computer and Control Engineering, Nankai University, Tianjin, China
- Olli P Yli-Harja
  - Institute of Biosciences and Medical Technology, Tampere, Finland; Computational Systems Biology, Faculty of Biomedical Engineering, Tampere University, Tampere, Finland; Institute for Systems Biology, Seattle, WA, United States
- Frank Emmert-Streib
  - Predictive Society and Data Analysis Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland; Institute of Biosciences and Medical Technology, Tampere, Finland
|
23
|
Rusanov A, Miotto R, Weng C. Trends in anesthesiology research: a machine learning approach to theme discovery and summarization. JAMIA Open 2018; 1:283-293. [PMID: 30474079 PMCID: PMC6241511 DOI: 10.1093/jamiaopen/ooy009] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 12/28/2017] [Revised: 03/18/2018] [Accepted: 08/23/2018] [Indexed: 11/13/2022] Open
Abstract
Objectives Traditionally, summarization of research themes and trends within a given discipline was accomplished by manual review of scientific works in the field. However, with the ushering in of the age of “big data,” new methods for discovery of such information become necessary as traditional techniques become increasingly difficult to apply due to the exponential growth of document repositories. Our objectives are to develop a pipeline for unsupervised theme extraction and summarization of thematic trends in document repositories, and to test it by applying it to a specific domain. Methods To that end, we detail a pipeline that utilizes machine learning and natural language processing for unsupervised theme extraction, a novel method for summarization of thematic trends, and network mapping for visualization of thematic relations. We then apply this pipeline to a collection of anesthesiology abstracts. Results We demonstrate how this pipeline enables discovery of major themes and temporal trends in anesthesiology research and facilitates document classification and corpus exploration. Discussion The relation of prevalent topics and extracted trends to recent events in both anesthesiology and healthcare in general demonstrates the pipeline’s utility. Furthermore, the agreement between the unsupervised thematic grouping and human-assigned classification validates the pipeline’s accuracy and demonstrates another potential use. Conclusion The described pipeline enables summarization and exploration of large document repositories, facilitates classification, and aids in trend identification. A more robust and user-friendly interface will facilitate the expansion of this methodology to other domains. This will be the focus of future work for our group.
Affiliation(s)
- Alexander Rusanov
- Department of Anesthesiology, Columbia University, New York, New York, USA
- Riccardo Miotto
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
|
24
|
Oh SY, Kim JH, Kim SJ, Nam HJ, Park HS. GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction. Genomics Inform 2018; 16:75-77. [PMID: 30309207 PMCID: PMC6187819 DOI: 10.5808/gi.2018.16.3.75] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/31/2018] [Accepted: 08/23/2018] [Indexed: 11/21/2022] Open
Abstract
Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. A text corpus for this journal, annotated with various levels of linguistic information, would be a valuable resource, as the process of information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish our new corpus, called GNI Corpus version 1.0, extracted and annotated from full texts of Genomics & Informatics with an NLTK (Natural Language Toolkit)-based text mining script. The preliminary version of the corpus could be used as a training and testing set for a system that serves a variety of functions for future biomedical text mining.
Affiliation(s)
- So-Yeon Oh
- Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Ji-Hyeon Kim
- Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Seo-Jin Kim
- Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hee-Jo Nam
- Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hyun-Seok Park
- Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea; Center for Convergence Research of Advanced Technologies, Ewha Womans University, Seoul 03760, Korea
|
25
|
Mower J, Subramanian D, Cohen T. Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications. J Am Med Inform Assoc 2018; 25:1339-1350. [PMID: 30010902 PMCID: PMC6454491 DOI: 10.1093/jamia/ocy077] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/24/2018] [Revised: 04/23/2018] [Accepted: 06/05/2018] [Indexed: 02/01/2023] Open
Abstract
Objective: The aim of this work is to leverage relational information extracted from biomedical literature using a novel synthesis of unsupervised pretraining, representational composition, and supervised machine learning for drug safety monitoring. Methods: Using ≈80 million concept-relationship-concept triples extracted from the literature using the SemRep Natural Language Processing system, distributed vector representations (embeddings) were generated for concepts as functions of their relationships utilizing two unsupervised representational approaches. Embeddings for drugs and side effects of interest from two widely used reference standards were then composed to generate embeddings of drug/side-effect pairs, which were used as input for supervised machine learning. This methodology was developed and evaluated using cross-validation strategies and compared to contemporary approaches. To qualitatively assess generalization, models trained on the Observational Medical Outcomes Partnership (OMOP) drug/side-effect reference set were evaluated against a list of ≈1100 drugs from an online database. Results: The employed method improved performance over previous approaches. Cross-validation results advance the state of the art (AUC 0.96; F1 0.90 and AUC 0.95; F1 0.84 across the two sets), outperforming methods utilizing literature and/or spontaneous reporting system data. Examination of predictions for unseen drug/side-effect pairs indicates the ability of these methods to generalize, with over tenfold label support enrichment in the top 100 predictions versus the bottom 100 predictions. Discussion and Conclusion: Our methods can assist the pharmacovigilance process using information from the biomedical literature. Unsupervised pretraining generates a rich relationship-based representational foundation for machine learning techniques to classify drugs in the context of a putative side effect, given known examples.
Affiliation(s)
- Justin Mower
- Baylor College of Medicine, Quantitative and Computational Biosciences, Houston, Texas, USA
- Trevor Cohen
- School of Biomedical Informatics, University of Texas Health Science Center Houston, Texas, USA
|
26
|
Vilar S, Friedman C, Hripcsak G. Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Brief Bioinform 2018; 19:863-877. [PMID: 28334070 PMCID: PMC6454455 DOI: 10.1093/bib/bbx010] [Citation(s) in RCA: 82] [Impact Index Per Article: 13.7] [Received: 10/14/2016] [Revised: 12/28/2016] [Indexed: 11/13/2022] Open
Abstract
Drug-drug interactions (DDIs) constitute an important concern in drug development and postmarketing pharmacovigilance. They are considered the cause of many adverse drug effects, exposing patients to higher risks and increasing public health system costs. Methods to follow up and discover possible DDIs causing harm to the population are a primary aim of drug safety researchers. Here, we review different methodologies and recent advances using data mining to detect DDIs with impact on patients. We focus on data mining of different pharmacovigilance sources, such as the US Food and Drug Administration Adverse Event Reporting System and electronic health records from medical institutions, as well as on the diverse data mining studies that use narrative text available in the scientific biomedical literature and social media. We highlight the strengths of these methods and explain the challenges they face. Data mining has important applications in the analysis of DDIs: showing the impact of the interactions as a cause of adverse effects, extracting interactions to create knowledge data sets and gold standards, and discovering novel and dangerous DDIs.
Affiliation(s)
- Santiago Vilar
- Department of Biomedical Informatics, Columbia University, New York, USA
- Department of Organic Chemistry, University of Santiago de Compostela, Spain
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, New York, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, USA
|
27
|
Zhu Y, Elemento O, Pathak J, Wang F. Drug knowledge bases and their applications in biomedical informatics research. Brief Bioinform 2018; 20:1308-1321. [DOI: 10.1093/bib/bbx169] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 08/08/2017] [Revised: 11/15/2017] [Indexed: 11/14/2022] Open
Abstract
Recent advances in biomedical research have generated a large volume of drug-related data. To effectively handle this flood of data, many initiatives have been taken to help researchers make good use of them. As the results of these initiatives, many drug knowledge bases have been constructed. They range from simple ones with specific focuses to comprehensive ones that contain information on almost every aspect of a drug. These curated drug knowledge bases have made significant contributions to the development of efficient and effective health information technologies for better health-care service delivery. Understanding and comparing existing drug knowledge bases and how they are applied in various biomedical studies will help us recognize the state of the art and design better knowledge bases in the future. In addition, researchers can get insights on novel applications of the drug knowledge bases through a review of successful use cases. In this study, we provide a review of existing popular drug knowledge bases and their applications in drug-related studies. We discuss challenges in constructing and using drug knowledge bases as well as future research directions toward a better ecosystem of drug knowledge bases.
|
28
|
Smalheiser NR. Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery. JOURNAL OF DATA AND INFORMATION SCIENCE 2017; 2:43-64. [PMID: 29355246 PMCID: PMC5771422 DOI: 10.1515/jdis-2017-0019] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Indexed: 01/25/2023] Open
Abstract
The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.
Affiliation(s)
- Neil R Smalheiser
- Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612 USA, +1 312-413-4581
|
29
|
Singh G, Marshall IJ, Thomas J, Shawe-Taylor J, Wallace BC. A Neural Candidate-Selector Architecture for Automatic Structured Clinical Text Annotation. PROCEEDINGS OF THE ... ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT. ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT 2017; 2017:1519-1528. [PMID: 29308293 PMCID: PMC5752318 DOI: 10.1145/3132847.3132989] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/18/2022]
Abstract
We consider the task of automatically annotating free texts describing clinical trials with concepts from a controlled, structured medical vocabulary. Specifically, we aim to build a model to infer distinct sets of (ontological) concepts describing complementary clinically salient aspects of the underlying trials: the populations enrolled, the interventions administered and the outcomes measured, i.e., the PICO elements. This important practical problem poses a few key challenges. One issue is that the output space is vast, because the vocabulary comprises many unique concepts. Compounding this problem, annotated data in this domain is expensive to collect and hence sparse. Furthermore, the outputs (sets of concepts for each PICO element) are correlated: specific populations (e.g., diabetics) will render certain intervention concepts likely (insulin therapy) while effectively precluding others (radiation therapy). Such correlations should be exploited. We propose a novel neural model that addresses these challenges. We introduce a Candidate-Selector architecture in which the model considers sets of candidate concepts for PICO elements, and assesses their plausibility conditioned on the input text to be annotated. This relies on a 'candidate set' generator, which may be learned or may rely on heuristics. A conditional discriminative neural model then jointly selects candidate concepts, given the input text. We compare the predictive performance of our approach to strong baselines, and show that it outperforms them. Finally, we perform a qualitative evaluation of the generated annotations by asking domain experts to assess their quality.
|
30
|
Kang T, Zhang S, Tang Y, Hruby GW, Rusanov A, Elhadad N, Weng C. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc 2017; 24:1062-1071. [PMID: 28379377 PMCID: PMC6259668 DOI: 10.1093/jamia/ocx019] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Received: 07/27/2016] [Revised: 01/31/2017] [Accepted: 03/02/2017] [Indexed: 12/22/2022] Open
Abstract
OBJECTIVE: To develop an open-source information extraction system called Eligibility Criteria Information Extraction (EliIE) for parsing and formalizing free-text clinical research eligibility criteria (EC) following Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.0. MATERIALS AND METHODS: EliIE parses EC in 4 steps: (1) clinical entity and attribute recognition, (2) negation detection, (3) relation extraction, and (4) concept normalization and output structuring. Informaticians and domain experts were recruited to design an annotation guideline and generate a training corpus of annotated EC for 230 Alzheimer's clinical trials, which were represented as queries against the OMOP CDM and included 8008 entities, 3550 attributes, and 3529 relations. A sequence labeling-based method was developed for automatic entity and attribute recognition. Negation detection was supported by NegEx and a set of predefined rules. Relation extraction was achieved by a support vector machine classifier. We further performed terminology-based concept normalization and output structuring. RESULTS: In task-specific evaluations, the best F1 score for entity recognition was 0.79, and for relation extraction was 0.89. The accuracy of negation detection was 0.94. The overall accuracy for query formalization was 0.71 in an end-to-end evaluation. CONCLUSIONS: This study presents EliIE, an OMOP CDM-based information extraction system for automatic structuring and formalization of free-text EC. According to our evaluation, machine learning-based EliIE outperforms existing systems and shows promise to improve.
Affiliation(s)
- Tian Kang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Shaodian Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Youlan Tang
- Institute of Human Nutrition, Columbia University, New York, NY, USA
- Gregory W Hruby
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Alexander Rusanov
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
|
31
|
Correlating Lab Test Results in Clinical Notes with Structured Lab Data: A Case Study in HbA1c and Glucose. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2017; 2017:221-228. [PMID: 28815133 PMCID: PMC5543347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
It is widely acknowledged that information extraction from unstructured clinical notes using natural language processing (NLP) and text mining is essential for secondary use of clinical data for clinical research and practice. Lab test results are currently structured in most electronic health record (EHR) systems. However, for referral patients or lab tests that can be done in a non-clinical setting, the results may be captured only in unstructured clinical notes. In this study, we proposed a rule-based information extraction system to extract lab test results with temporal information from clinical notes. The glucose and HbA1c lab test results of 104 randomly sampled diabetes patients, selected from 1996 to 2015, are extracted and further correlated with structured lab test information in the Mayo Clinic EHRs. The system achieves high F1-scores of 0.964, 0.967 and 0.966 in glucose, HbA1c and overall extraction, respectively.
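To make the idea of rule-based extraction in the abstract above concrete, the following is a minimal sketch of how a lab value and a nearby date might be pulled from note text with regular expressions. The patterns, the sentence-splitting heuristic, and the function name `extract_lab_results` are illustrative assumptions; the abstract does not describe the authors' actual rules.

```python
import re

# Hypothetical patterns for illustration only; not the rules used in the study.
LAB_PATTERN = re.compile(
    r"(?P<test>hba1c|hemoglobin a1c|a1c|glucose)"  # lab test name
    r"\D{0,20}?"                                   # short non-numeric gap ("was", ":")
    r"(?P<value>\d+(?:\.\d+)?)"                    # numeric result
    r"\s*(?P<unit>%|mg/dl)?",                      # optional unit
    re.IGNORECASE,
)
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")


def extract_lab_results(note: str) -> list:
    """Return one record per lab mention, paired with the first date in its sentence."""
    results = []
    # Split after '.' or ';' followed by whitespace, so "7.2%" stays intact.
    for sentence in re.split(r"(?<=[.;])\s+", note):
        date_match = DATE_PATTERN.search(sentence)
        for m in LAB_PATTERN.finditer(sentence):
            results.append({
                "test": m.group("test").lower(),
                "value": float(m.group("value")),
                "unit": m.group("unit"),
                "date": date_match.group(1) if date_match else None,
            })
    return results


note = "HbA1c was 7.2% on 03/14/2012. Fasting glucose 145 mg/dL on 03/15/2012."
records = extract_lab_results(note)
for record in records:
    print(record)
```

A sketch like this would misfire on many real notes (for example, when a date appears between the test name and the value), which is why production systems of this kind typically rely on a much larger, curated rule set.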
|
32
|
Kuusisto F, Steill J, Kuang Z, Thomson J, Page D, Stewart R. A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2017; 2017:166-174. [PMID: 28815126 PMCID: PMC5543342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
We present a simple text mining method that is easy to implement, requires minimal data collection and preparation, and is easy to use for proposing ranked associations between a list of target terms and a key phrase. We call this method KinderMiner, and apply it to two biomedical applications. The first application is to identify relevant transcription factors for cell reprogramming, and the second is to identify potential drugs for investigation in drug repositioning. We compare the results from our algorithm to existing data and state-of-the-art algorithms, demonstrating compelling results for both application areas. While we apply the algorithm here for biomedical applications, we argue that the method is generalizable to any available corpus of sufficient size.
Affiliation(s)
- John Steill
- Morgridge Institute for Research, Madison, USA
- James Thomson
- Morgridge Institute for Research, Madison, USA; University of Wisconsin, Madison, USA
- Ron Stewart
- Morgridge Institute for Research, Madison, USA
|
33
|
Yan S, Wong KC. Elucidating high-dimensional cancer hallmark annotation via enriched ontology. J Biomed Inform 2017; 73:84-94. [PMID: 28723579 DOI: 10.1016/j.jbi.2017.07.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/13/2016] [Revised: 05/23/2017] [Accepted: 07/14/2017] [Indexed: 10/19/2022]
Abstract
MOTIVATION: Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. RESULTS: To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features, is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. AVAILABILITY: https://github.com/cskyan/chmannot.
Affiliation(s)
- Shankai Yan
- Department of Computer Science, City University of Hong Kong, Hong Kong Special Administrative Region
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong Special Administrative Region
|
34
|
Wang P, Hao T, Yan J, Jin L. Large-scale extraction of drug-disease pairs from the medical literature. J Assoc Inf Sci Technol 2017. [DOI: 10.1002/asi.23876] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 01/20/2023]
Affiliation(s)
- Pengwei Wang
- School of Electronic and Information Engineering; South China University of Technology; Guangzhou China
- Tianyong Hao
- Cisco School of Informatics; Guangdong University of Foreign Studies; Guangzhou China
- Jun Yan
- Microsoft Research Asia; Beijing China
- Lianwen Jin
- School of Electronic and Information Engineering; South China University of Technology; Guangzhou China
|
35
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
|
36
|
Karystianis G, Thayer K, Wolfe M, Tsafnat G. Evaluation of a rule-based method for epidemiological document classification towards the automation of systematic reviews. J Biomed Inform 2017; 70:27-34. [PMID: 28455150 DOI: 10.1016/j.jbi.2017.04.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 12/12/2016] [Revised: 03/14/2017] [Accepted: 04/02/2017] [Indexed: 02/02/2023]
Abstract
INTRODUCTION: Most data extraction efforts in epidemiology are focused on obtaining targeted information from clinical trials. In contrast, limited research has been conducted on the identification of information from observational studies, a major source for human evidence in many fields, including environmental health. The recognition of key epidemiological information (e.g., exposures) through text mining techniques can assist in the automation of systematic reviews and other evidence summaries. METHOD: We designed and applied a knowledge-driven, rule-based approach to identify targeted information (study design, participant population, exposure, outcome, confounding factors, and the country where the study was conducted) from abstracts of epidemiological studies included in several systematic reviews of environmental health exposures. The rules were based on common syntactical patterns observed in text and are thus not specific to any systematic review. To validate the general applicability of our approach, we compared the data extracted using our approach versus hand curation for 35 epidemiological study abstracts manually selected for inclusion in two systematic reviews. RESULTS: The returned F-score, precision, and recall ranged from 70% to 98%, 81% to 100%, and 54% to 97%, respectively. The highest precision was observed for exposure, outcome and population (100%) while recall was best for exposure and study design with 97% and 89%, respectively. The lowest recall was observed for the population (54%), which also had the lowest F-score (70%). CONCLUSION: The performance of our text-mining approach is encouraging for the identification of targeted information from observational epidemiological study abstracts related to environmental exposures. We have demonstrated that rules based on generic syntactic patterns in one corpus can be applied to other observational study designs by simply interchanging the dictionaries used to identify certain characteristics (i.e., outcomes, exposures). At the document level, the recognised information can assist in the selection and categorization of studies included in a systematic review.
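The population figures reported in this abstract (100% precision, 54% recall, 70% F-score) can be checked for internal consistency. The sketch below assumes the F-score is the balanced F1, i.e., the harmonic mean of precision and recall, which the abstract does not state explicitly; the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Population category: precision 100%, recall 54% (values from the abstract).
population_f = f_score(1.00, 0.54)
print(f"{population_f:.2f}")  # prints 0.70
```

Under that assumption the recomputed value agrees with the reported 70%, and it shows how a low recall pulls the F-score down even when precision is perfect.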
Affiliation(s)
- George Karystianis
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, Australia
- Kristina Thayer
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Mary Wolfe
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Guy Tsafnat
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, Australia
|
37
|
Wu H, Oellrich A, Girges C, de Bono B, Hubbard TJ, Dobson RJ. Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease. Database (Oxford) 2017; 2017:3091736. [PMID: 28365743 PMCID: PMC5467557 DOI: 10.1093/database/bax027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/31/2016] [Revised: 01/23/2017] [Accepted: 03/08/2017] [Indexed: 12/20/2022]
Abstract
Neurodegenerative disorders such as Parkinson's and Alzheimer's disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user-based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which reveals that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases, the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process. Database URL: https://github.com/KHP-Informatics/NapEasy.
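The abstract gives the macro F1-measure (0.51) and the relative gain over the best bag-of-words baseline (132%), but not the baseline score itself. As a rough worked example, the baseline can be back-calculated, assuming "an increase of 132%" means the new score equals the baseline multiplied by 2.32. Both that reading and the helper name `implied_baseline` are assumptions, not figures from the paper.

```python
def implied_baseline(score: float, percent_increase: float) -> float:
    """Solve score = baseline * (1 + percent_increase / 100) for baseline."""
    return score / (1 + percent_increase / 100)


# Reported macro F1 of 0.51 and a 132% relative increase over the baseline.
baseline_f1 = implied_baseline(0.51, 132)
print(f"{baseline_f1:.2f}")  # prints 0.22
```

Under this reading, the best bag-of-words baseline would have scored a macro F1 of roughly 0.22.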
Affiliation(s)
- Honghan Wu
- Department of Biostatistics and Health Informatics, King's College London, De Crespigny Park, Denmark Hill London SE5 8AF, UK
- School of Computer and Software, Nanjing University of Information Science and Technology, 219 Ningliu Road, Nanjing, China, 210044
- Anika Oellrich
- Department of Biostatistics and Health Informatics, King's College London, De Crespigny Park, Denmark Hill London SE5 8AF, UK
- Christine Girges
- Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London Gower Street, WC1E 6BT, UK
- Bernard de Bono
- Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London Gower Street, WC1E 6BT, UK
- Tim J.P. Hubbard
- Department of Medical and Molecular Genetics, King's College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK
- Richard J.B. Dobson
- Department of Biostatistics and Health Informatics, King's College London, De Crespigny Park, Denmark Hill London SE5 8AF, UK
- Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London Gower Street, WC1E 6BT, UK
|
38
|
Luo Y, Uzuner Ö, Szolovits P. Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical relations. Brief Bioinform 2017; 18:160-178. [PMID: 26851224 PMCID: PMC5221425 DOI: 10.1093/bib/bbw001] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 09/24/2015] [Revised: 11/29/2015] [Indexed: 01/18/2023] Open
Abstract
Research on extracting biomedical relations has received growing attention recently, with numerous biological and clinical applications including those in pharmacogenomics, clinical trial screening and adverse drug reaction detection. The ability to accurately capture both semantic and syntactic structures in text expressing these relations becomes increasingly critical to enable deep understanding of scientific papers and clinical narratives. Shared task challenges have been organized by both bioinformatics and clinical informatics communities to assess and advance the state-of-the-art research. Significant progress has been made in algorithm development and resource construction. In particular, graph-based approaches bridge semantics and syntax, often achieving the best performance in shared tasks. However, a number of problems at the frontiers of biomedical relation extraction continue to pose interesting challenges and present opportunities for great improvement and fruitful research. In this article, we place biomedical relation extraction against the backdrop of its versatile applications, present a gentle introduction to its general pipeline and shared resources, review the current state-of-the-art in methodology advancement, discuss limitations and point out several promising future directions.
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, 11th Floor, Arthur Rubloff Building, 750 N. Lake Shore Drive, Chicago, IL, USA
| | - Özlem Uzuner
- Department of Information Studies, State University of New York at Albany, New York, USA
| | - Peter Szolovits
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Massachusetts, USA
| |
|
39
|
Feng Q, Gui Y, Yang Z, Wang L, Li Y. Semisupervised Learning Based Disease-Symptom and Symptom-Therapeutic Substance Relation Extraction from Biomedical Literature. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3594937. [PMID: 27822473 PMCID: PMC5086401 DOI: 10.1155/2016/3594937] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Revised: 07/13/2016] [Accepted: 08/18/2016] [Indexed: 11/18/2022]
Abstract
With the rapid growth of biomedical literature, a large amount of knowledge about diseases, symptoms, and therapeutic substances hidden in the literature can be used for drug discovery and disease therapy. In this paper, we present a method of constructing two models for extracting the relations between the disease and symptom and symptom and therapeutic substance from biomedical texts, respectively. The former judges whether a disease causes a certain physiological phenomenon while the latter determines whether a substance relieves or eliminates a certain physiological phenomenon. These two kinds of relations can be further utilized to extract the relations between disease and therapeutic substance. In our method, first two training sets for extracting the relations between the disease-symptom and symptom-therapeutic substance are manually annotated and then two semisupervised learning algorithms, that is, Co-Training and Tri-Training, are applied to utilize the unlabeled data to boost the relation extraction performance. Experimental results show that exploiting the unlabeled data with both Co-Training and Tri-Training algorithms can enhance the performance effectively.
Affiliation(s)
- Qinlin Feng
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yingyi Gui
- School of Optoelectronics, Beijing Institute of Technology, Beijing 100081, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Lei Wang
- Beijing Institute of Health Administration and Medical Information, Beijing 100850, China
| | - Yuxia Li
- Beijing Institute of Health Administration and Medical Information, Beijing 100850, China
| |
|
40
|
Swain MC, Cole JM. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J Chem Inf Model 2016; 56:1894-1904. [PMID: 27669338 DOI: 10.1021/acs.jcim.6b00207] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. The performance of the toolkit to correctly extract various types of data was evaluated, affording an F-score of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set against the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.
Affiliation(s)
- Matthew C Swain
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
| | - Jacqueline M Cole
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
| |
|
41
|
Papamokos G, Silins I. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action. Front Pharmacol 2016; 7:284. [PMID: 27625608 PMCID: PMC5003827 DOI: 10.3389/fphar.2016.00284] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2016] [Accepted: 08/18/2016] [Indexed: 12/28/2022] Open
Abstract
There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.
Affiliation(s)
- George Papamokos
- Department of Physics and School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA; Department of Physics, University of Ioannina, Ioannina, Greece; Biomedical Research Division, Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology, Heraklion, Greece
| | - Ilona Silins
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
| |
|
42
|
Sharma V, Law W, Balick MJ, Sarkar IN. Identifying Plant-Human Disease Associations in Biomedical Literature: A Case Study. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2016; 2016:84-93. [PMID: 27595045 PMCID: PMC5009952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
The impact of ethnobotanical data from surveys of traditional medicinal uses of plants can be enhanced through the validation of biomedical knowledge that may be embedded in literature. This study aimed to explore the use of informatics approaches, including natural language processing and terminology resources, for extracting and comparing ethnobotanical leads from biomedical literature indexed in MEDLINE. Using ethnobotanical data for plant species described in Primary Health Care Manuals of the Micronesian islands of Palau and Pohnpei, the analysis was carried out relative to disease concepts from the "Mental, Behavioral And Neurodevelopmental Disorders" ICD-9-CM category. The results from this feasibility study suggest that informatics methods can be used to extract and prioritize relevant ethnobotanical information from biomedical knowledge literature.
Affiliation(s)
- Vivekanand Sharma
- Center for Biomedical Informatics, Brown University, Providence, RI USA
| | - Wayne Law
- Institute of Economic Botany, The New York Botanical Garden, Bronx, NY USA
| | - Michael J. Balick
- Institute of Economic Botany, The New York Botanical Garden, Bronx, NY USA
| | - Indra Neil Sarkar
- Center for Biomedical Informatics, Brown University, Providence, RI USA
| |
|
43
|
Abbe A, Grouin C, Zweigenbaum P, Falissard B. Text mining applications in psychiatry: a systematic literature review. Int J Methods Psychiatr Res 2016; 25:86-100. [PMID: 26184780 PMCID: PMC6877250 DOI: 10.1002/mpr.1481] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/05/2014] [Revised: 01/21/2015] [Accepted: 04/09/2015] [Indexed: 11/08/2022] Open
Abstract
The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses) (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd.
Affiliation(s)
- Adeline Abbe
- Inserm, U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
| | | | | | - Bruno Falissard
- Inserm, U669, Paris, France; University Paris-Sud and University Paris Descartes, UMR-S0669, Paris, France
| |
|
44
|
Zhu Y, Song M, Yan E. Identifying Liver Cancer and Its Relations with Diseases, Drugs, and Genes: A Literature-Based Approach. PLoS One 2016; 11:e0156091. [PMID: 27195695 PMCID: PMC4873143 DOI: 10.1371/journal.pone.0156091] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2015] [Accepted: 05/09/2016] [Indexed: 12/04/2022] Open
Abstract
In biomedicine, scientific literature is a valuable source for knowledge discovery. Mining knowledge from textual data has become an ever important task as the volume of scientific literature is growing unprecedentedly. In this paper, we propose a framework for examining a certain disease based on existing information provided by scientific literature. Disease-related entities that include diseases, drugs, and genes are systematically extracted and analyzed using a three-level network-based approach. A paper-entity network and an entity co-occurrence network (macro-level) are explored and used to construct six entity specific networks (meso-level). Important diseases, drugs, and genes as well as salient entity relations (micro-level) are identified from these networks. Results obtained from the literature-based literature mining can serve to assist clinical applications.
Affiliation(s)
- Yongjun Zhu
- College of Computing and Informatics, Drexel University, Philadelphia, PA, United States of America
| | - Min Song
- Department of Library and Information Science, Yonsei University, Seoul, Republic of Korea
| | - Erjia Yan
- College of Computing and Informatics, Drexel University, Philadelphia, PA, United States of America
| |
|
45
|
Jain S, Tumkur KR, Kuo TT, Bhargava S, Lin G, Hsu CN. Weakly supervised learning of biomedical information extraction from curated data. BMC Bioinformatics 2016; 17 Suppl 1:1. [PMID: 26817711 PMCID: PMC4847485 DOI: 10.1186/s12859-015-0844-1] [Citation(s) in RCA: 70] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023] Open
Abstract
Background: Numerous publicly available biomedical databases derive data by curating from the literature. The curated data can be useful as training examples for information extraction, but curated data usually lack the exact mentions and their locations in the text required for supervised machine learning. This paper describes a general approach to information extraction using curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both curated data and the text. Results: We test the idea on two information extraction tasks of Genome-Wide Association Studies (GWAS). The first task is to extract target phenotypes (diseases or traits) of a study and the second is to extract ethnicity backgrounds of study subjects for different stages (initial or replication). Experimental results show that our approach can achieve 87% Precision-at-2 (P@2) for disease/trait extraction, and an F1-score of 0.83 for stage-ethnicity extraction, both outperforming their cost-insensitive baseline counterparts. Conclusions: The results show that curated biomedical databases can potentially be reused as training examples to train information extractors without expert annotation or refinement, opening an unprecedented opportunity of using “big data” in biomedical text mining.
Affiliation(s)
- Suvir Jain
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| | - Kashyap R Tumkur
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| | - Tsung-Ting Kuo
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| | - Shitij Bhargava
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| | - Gordon Lin
- Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| | - Chun-Nan Hsu
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, USA.
| |
|
46
|
Ahmed Z, Dandekar T. MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Res 2015; 4:1453. [PMID: 29721305 PMCID: PMC5897790 DOI: 10.12688/f1000research.7329.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 03/26/2018] [Indexed: 01/12/2023] Open
Abstract
Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medicinal imaging like electroencephalography (EEG), magnetoencephalography (MEG), echocardiography (ECG), positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in scientific and medical communities, as they play a vital role in providing major original data, experimental and computational results in concise form. One major challenge for implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product line architecture based bioinformatics tool ‘Mining Scientific Literature (MSL)’, which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures and extraction of embedded text from all kinds of biological and biomedical figures using applied Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system’s output in different formats including text, PDF, XML and image files. Hence, MSL is an easy to install and use analysis tool to interpret published scientific literature in PDF format.
Affiliation(s)
- Zeeshan Ahmed
- Genetics and Genome Sciences, School of Medicine, University of Connecticut Health Center, Farmington, CT, 06032, USA; Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06032, USA
| | - Thomas Dandekar
- Department of Bioinformatics, Biocenter, University of Wuerzburg, Wuerzburg, 97074, Germany
| |
|
47
|
Vilares M, Fernández M, Blanco A. Supporting knowledge discovery for biodiversity. DATA KNOWL ENG 2015. [DOI: 10.1016/j.datak.2015.08.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
48
|
Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Brief Bioinform 2015; 17:33-42. [PMID: 26420781 PMCID: PMC4719073 DOI: 10.1093/bib/bbv087] [Citation(s) in RCA: 103] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.
|
49
|
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BIOMED RESEARCH INTERNATIONAL 2015; 2015:918710. [PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710] [Citation(s) in RCA: 111] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 02/01/2023]
Abstract
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
|
50
|
From Literature to Knowledge: Exploiting PubMed to Answer Biomedical Questions in Natural Language. ACTA ACUST UNITED AC 2015. [DOI: 10.1007/978-3-319-22741-2_1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/19/2023]
|