1
Landolsi MY, Hlaoua L, Ben Romdhane L. Information extraction from electronic medical documents: state of the art and future research directions. Knowl Inf Syst 2023; 65:463-516. [PMID: 36405956 PMCID: PMC9640816 DOI: 10.1007/s10115-022-01779-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 05/04/2022] [Accepted: 10/17/2022] [Indexed: 11/10/2022]
Abstract
In the medical field, a doctor must acquire comprehensive knowledge by reading and writing narrative documents, and is responsible for every decision taken for patients. Unfortunately, it is very tiring to read all the necessary information about drugs, diseases, and patients because the volume of documents grows every day. Consequently, many medical errors can happen, some of them fatal. Information extraction is an important field that can address this problem. It comprises several tasks for extracting the desired information from unstructured text written in natural language. The principal tasks are named entity recognition and relation extraction, since they structure the text by extracting the relevant information. To process narrative text, however, natural language processing techniques are needed to extract useful information and features. In this paper, we introduce and discuss the techniques and solutions used in these tasks. Furthermore, we outline the challenges in information extraction from medical documents. To our knowledge, this is the most comprehensive survey in the literature, with an experimental analysis and suggestions for some uncovered directions.
Affiliation(s)
- Mohamed Yassine Landolsi
  - MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
- Lobna Hlaoua
  - MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
- Lotfi Ben Romdhane
  - MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
2
Jin S, Niu Z, Jiang C, Huang W, Xia F, Jin X, Liu X, Zeng X. HeTDR: Drug repositioning based on heterogeneous networks and text mining. PATTERNS 2021; 2:100307. [PMID: 34430926 PMCID: PMC8369234 DOI: 10.1016/j.patter.2021.100307] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/07/2021] [Revised: 05/11/2021] [Accepted: 06/14/2021] [Indexed: 12/14/2022]
Abstract
Using existing knowledge to predict drug-disease associations is a vital method for drug repositioning. However, effectively fusing biomedical text and biological network information is one of the great challenges for most current drug repositioning methods. In this study, we propose a drug repositioning method based on heterogeneous networks and text mining (HeTDR). The model combines drug features from multiple drug-related networks and disease features from biomedical corpora with the known drug-disease association network to predict correlation scores between drugs and diseases. Experiments demonstrate that HeTDR has excellent performance, superior to that of state-of-the-art models. We present the top 10 novel HeTDR-predicted approved drugs for five diseases and show that our model is capable of discovering potential candidate drugs for disease indications.
- We developed a novel DL-based method for drug repositioning (HeTDR)
- HeTDR succeeds in fusing network topology information and text mining information
- HeTDR obtains high accuracy, exceeding most state-of-the-art models
- HeTDR could represent an algorithm integrating multiple sources of information
Traditional drug discovery and development are often time consuming and high risk. Drug repositioning aims to expand existing indications or discover new targets by studying approved drug compounds, thereby reducing the time, cost, and risk of drug development. We propose a novel drug repositioning method based on heterogeneous networks and text mining (HeTDR), which combines drug features from multiple networks and disease features from biomedical corpora to predict the correlation scores between drugs and diseases. This prediction model provides a potential solution for fusing multiple sources of information and exhibits accurate performance, leading to the discovery of new drugs for indications. The algorithm could contribute a new idea for accelerating future drug repositioning with computational methods and provide computer-aided guidance for biologists in clinical settings.
Affiliation(s)
- Shuting Jin
  - Department of Computer Science, Xiamen University, Xiamen 361005, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
  - Shenzhen Research Institute of Xiamen University, Shenzhen 518000, China
- Changzhi Jiang
  - Department of Computer Science, Xiamen University, Xiamen 361005, China
- Wei Huang
  - Department of Computer Science, Xiamen University, Xiamen 361005, China
- Feng Xia
  - Department of Computer Science, Xiamen University, Xiamen 361005, China
- Xurui Jin
  - MindRank AI Ltd., Hangzhou, Zhejiang 311113, China
- Xiangrong Liu
  - Department of Computer Science, Xiamen University, Xiamen 361005, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Changsha 410082, China
3
Jha K, Xun G, Zhang A. Continual Representation Learning For Evolving Biomedical Bipartite Networks. Bioinformatics 2021; 37:2190-2197. [PMID: 33532833 DOI: 10.1093/bioinformatics/btab067] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/16/2020] [Revised: 12/14/2020] [Accepted: 01/27/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION Many real-world biomedical interactions such as 'gene-disease', 'disease-symptom', and 'drug-target' are modeled as a bipartite network structure. Learning meaningful representations for such networks is a fundamental problem in the research area of Network Representation Learning (NRL). NRL approaches aim to translate the network structure into low-dimensional vector representations that are useful to a variety of biomedical applications. Despite significant advances, the existing approaches still have certain limitations. First, a majority of these approaches do not model the unique topological properties of bipartite networks. Consequently, their straightforward application to the bipartite graphs yields unsatisfactory results. Second, the existing approaches typically learn representations from static networks. This is limiting for the biomedical bipartite networks that evolve at a rapid pace, and thus necessitate the development of approaches that can update the representations in an online fashion. RESULTS In this research, we propose a novel representation learning approach that accurately preserves the intricate bipartite structure, and efficiently updates the node representations. Specifically, we design a customized autoencoder that captures the proximity relationship between nodes participating in the bipartite bicliques (2 × 2 sub-graph), while preserving both the global and local structures. Moreover, the proposed structure-preserving technique is carefully interleaved with the central tenets of continual machine learning to design an incremental learning strategy that updates the node representations in an online manner. Taken together, the proposed approach produces meaningful representations with high fidelity and computational efficiency. Extensive experiments conducted on several biomedical bipartite networks validate the effectiveness and rationality of the proposed approach. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
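The 2 × 2 bicliques mentioned in the abstract are the smallest complete bipartite subgraphs: two nodes on each side with all four cross edges present. The sketch below only enumerates them in a toy drug-target network. It is not the authors' autoencoder, and the function name `two_by_two_bicliques` and the edge list are illustrative assumptions.

```python
from itertools import combinations


def two_by_two_bicliques(edges):
    """Return every 2 x 2 biclique as ((left1, left2), (right1, right2)).

    `edges` is an iterable of (left_node, right_node) pairs of a bipartite
    graph. The representation is an assumption made for this sketch.
    """
    # Map each left-side node to the set of right-side nodes it is linked to.
    neighbors = {}
    for left, right in edges:
        neighbors.setdefault(left, set()).add(right)

    bicliques = []
    # Two left nodes form a biclique with every pair of right nodes they share.
    for left1, left2 in combinations(sorted(neighbors), 2):
        shared = sorted(neighbors[left1] & neighbors[left2])
        for right1, right2 in combinations(shared, 2):
            bicliques.append(((left1, left2), (right1, right2)))
    return bicliques


# Hypothetical drug-target edges, not data from the paper.
toy_edges = [
    ("drugA", "t1"), ("drugA", "t2"),
    ("drugB", "t1"), ("drugB", "t2"),
    ("drugC", "t2"),
]
print(two_by_two_bicliques(toy_edges))
```

In this toy graph only drugA and drugB share two targets, so they form the single 2 × 2 biclique; drugC shares just one target with each of them.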
Affiliation(s)
- Kishlay Jha
  - Department of Computer Science, University of Virginia, Charlottesville, VA, 22904, USA
- Guangxu Xun
  - Department of Computer Science, University of Virginia, Charlottesville, VA, 22904, USA
- Aidong Zhang
  - Department of Computer Science, University of Virginia, Charlottesville, VA, 22904, USA
4
Khan JY, Khondaker MTI, Hoque IT, Al-Absi HRH, Rahman MS, Guler R, Alam T, Rahman MS. Toward Preparing a Knowledge Base to Explore Potential Drugs and Biomedical Entities Related to COVID-19: Automated Computational Approach. JMIR Med Inform 2020; 8:e21648. [PMID: 33055059 PMCID: PMC7674141 DOI: 10.2196/21648] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/26/2020] [Revised: 08/23/2020] [Accepted: 09/06/2020] [Indexed: 12/15/2022]
Abstract
BACKGROUND Novel coronavirus disease 2019 (COVID-19) is taking a huge toll on public health. Along with non-therapeutic preventive measures, scientific efforts are currently focused mainly on the development of vaccines and pharmacological treatment with existing drugs. Summarizing evidence from the scientific literature on the discovery of treatment plans for COVID-19 on a single platform would help the scientific community explore the opportunities in a systematic fashion. OBJECTIVE The aim of this study is to explore the potential drugs and biomedical entities related to coronavirus-related diseases, including COVID-19, that are mentioned in the scientific literature through an automated computational approach. METHODS We mined information from publicly available scientific literature and related public resources. Six topic-specific dictionaries, including human genes, human miRNAs, diseases, Protein Databank, drugs, and drug side effects, were integrated to mine all scientific evidence related to COVID-19. We employed an automated literature mining and labeling system through a novel approach to measure the effectiveness of drugs against diseases based on natural language processing, sentiment analysis, and deep learning. We also applied the concept of cosine similarity to confidently infer the associations between diseases and genes. RESULTS Based on the literature mining, we identified 1805 diseases, 2454 drugs, and 1910 genes that are related to coronavirus-related diseases, including COVID-19. Integrating the extracted information, we developed the first knowledgebase platform dedicated to COVID-19, which highlights a list of potential drugs and related biomedical entities. For COVID-19, we highlighted multiple case studies on existing drugs along with a confidence score for their applicability in the treatment plan. Based on our computational method, we found that Remdesivir, Statins, Dexamethasone, and Ivermectin could be considered potentially effective drugs to improve clinical status and lower mortality in patients hospitalized with COVID-19. We also found that Hydroxychloroquine could not be considered an effective drug for COVID-19. The resulting knowledgebase is made available as an open source tool, named COVID-19Base. CONCLUSIONS Proper investigation of the mined biomedical entities, along with the identified interactions among them, would help the research community discover possible ways for the therapeutic treatment of COVID-19.
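The abstract mentions cosine similarity for inferring disease-gene associations but does not say how the vectors are built. As a minimal sketch, assume each disease and gene is represented by a hypothetical count vector over the same set of documents; the function name and the numbers below are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-document mention counts, not data from the paper.
disease_vec = [1, 0, 1]
gene_same_docs = [2, 0, 2]   # mentioned in the same documents as the disease
gene_other_docs = [0, 3, 0]  # mentioned only in a different document

print(cosine_similarity(disease_vec, gene_same_docs))
print(cosine_similarity(disease_vec, gene_other_docs))
```

A gene mentioned in the same documents as the disease scores close to 1, while one mentioned only elsewhere scores 0; any threshold for calling a pair "associated" would be a further assumption.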
Affiliation(s)
- Junaed Younus Khan
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Md Tawkat Islam Khondaker
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Iram Tazim Hoque
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Hamada R H Al-Absi
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Mohammad Saifur Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Reto Guler
  - International Centre for Genetic Engineering and Biotechnology, Cape Town Component, Cape Town, South Africa
  - Division of Immunology and South African Medical Research Council Immunology of Infectious Diseases, Department of Pathology, Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
  - Wellcome Centre for Infectious Diseases Research in Africa, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- M Sohel Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
5
Abstract
Purpose: Theory is a kind of condensed human knowledge. This paper examines the mechanism of interdisciplinary diffusion of theoretical knowledge by tracing the diffusion of a representative theory, the Technology Acceptance Model (TAM). Design/methodology/approach: Based on the full-scale dataset of Web of Science (WoS), the citations of Davis's original work on TAM were analysed and the interdisciplinary diffusion paths of TAM were delineated; a supervised machine learning method was used to extract theory incidents, and content analysis was used to categorize the patterns of theory evolution. Findings: The diffusion of a theory is intertwined with its evolution. In the process, the role that a participating discipline plays is related to its knowledge distance from the original disciplines of TAM. As the distance increases, the capacity to support theory development and innovation weakens, while the capacity to take on analytical tools for practical problems increases. During the diffusion, a theory evolves into new extensions in four theoretical construction patterns: elaboration, proliferation, competition and integration. Research limitations/implications: The study not only deepens the understanding of the trajectory of a theory but also enriches the research on knowledge diffusion and innovation. Originality/value: The study elaborates the relationship between theory diffusion and theory development, reveals the roles that the participating disciplines play in theory diffusion and vice versa, interprets four patterns of theory evolution, and uses a text mining technique to extract theory incidents, which makes up for the shortcomings of the citation analysis and content analysis used in previous studies.
6
Lee W, Choi J. Precursor-induced conditional random fields: connecting separate entities by induction for improved clinical named entity recognition. BMC Med Inform Decis Mak 2019; 19:132. [PMID: 31307440 PMCID: PMC6632205 DOI: 10.1186/s12911-019-0865-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/24/2018] [Accepted: 07/03/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND This paper presents a conditional random fields (CRF) method that enables the capture of specific high-order label transition factors to improve clinical named entity recognition performance. Consecutive clinical entities in a sentence are usually separated from each other, and the textual descriptions in clinical narrative documents frequently indicate causal or posterior relationships that can be used to facilitate clinical named entity recognition. However, the CRF that is generally used for named entity recognition is a first-order model that constrains label transition dependency of adjoining labels under the Markov assumption. METHODS Based on the first-order structure, our proposed model utilizes non-entity tokens between separated entities as an information transmission medium by applying a label induction method. The model is referred to as precursor-induced CRF because its non-entity state memorizes precursor entity information, and the model's structure allows the precursor entity information to propagate forward through the label sequence. RESULTS We compared the proposed model with both first- and second-order CRFs in terms of their F1-scores, using two clinical named entity recognition corpora (the i2b2 2012 challenge and the Seoul National University Hospital electronic health record). The proposed model demonstrated better entity recognition performance than both the first- and second-order CRFs and was also more efficient than the higher-order model. CONCLUSION The proposed precursor-induced CRF, which uses non-entity labels as label transition information, improves the entity recognition F1-score by exploiting long-distance transition factors without exponentially increasing the computational time. In contrast, a conventional second-order CRF model that uses longer distance transition factors showed even worse results than the first-order model and required the longest computation time. Thus, the proposed model could offer a considerable performance improvement over current clinical named entity recognition methods based on the CRF models.
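One way to picture the label induction described in METHODS is as a relabeling step applied before training an ordinary first-order CRF: each non-entity label is rewritten to record the type of the entity that precedes it. The sketch below assumes BIO-style tags and a hypothetical `O-<TYPE>` naming scheme; the abstract does not specify either, so this illustrates the idea rather than the authors' implementation.

```python
def induce_precursor_labels(labels):
    """Rewrite each 'O' label to carry the type of the most recent entity.

    Assumes BIO-style labels such as 'B-PROBLEM' or 'I-PROBLEM'. The
    'O-<TYPE>' output format is a naming convention invented for this sketch.
    """
    induced = []
    precursor = None  # type of the last entity seen so far in the sentence
    for label in labels:
        if label == "O":
            induced.append(f"O-{precursor}" if precursor else "O")
        else:
            precursor = label.split("-", 1)[1]  # 'B-PROBLEM' -> 'PROBLEM'
            induced.append(label)
    return induced


# Hypothetical label sequence for one sentence.
original = ["O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-TREATMENT", "O"]
print(induce_precursor_labels(original))
```

With the induced labels, a first-order transition such as `O-PROBLEM` to `B-TREATMENT` carries information about the earlier entity even though the two entities are not adjacent. Mapping every `O-<TYPE>` label back to `O` recovers the original tagging.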
Affiliation(s)
- Wangjin Lee
  - Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea
- Jinwook Choi
  - Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea
  - Department of Biomedical Engineering, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea
  - Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, 101 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea
7
Su C, Tong J, Zhu Y, Cui P, Wang F. Network embedding in biomedical data science. Brief Bioinform 2018; 21:182-197. [PMID: 30535359 DOI: 10.1093/bib/bby117] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 07/30/2018] [Revised: 10/08/2018] [Accepted: 11/03/2018] [Indexed: 12/15/2022]
Abstract
Owing to the rapid development of computer technologies, an increasing amount of relational data has been emerging in modern biomedical research. Many network-based learning methods have been proposed to analyze such data; they give people a deep understanding of the topology and knowledge behind biomedical networks and benefit many applications in human healthcare. However, most network-based methods suffer from high computational and space costs. Challenges remain in handling the high dimensionality and sparsity of biomedical networks. The latest advances in network embedding technologies provide new, effective paradigms for solving the network analysis problem. Network embedding converts a network into a low-dimensional space while maximally preserving its structural properties. In this way, downstream tasks such as link prediction and node classification can be done by traditional machine learning methods. In this survey, we conduct a comprehensive review of the literature on applying network embedding to advance the biomedical domain. We first briefly introduce the widely used network embedding models. After that, we carefully discuss how network embedding approaches have been applied to biomedical networks and how they have accelerated downstream tasks in biomedical science. Finally, we discuss the challenges that existing network embedding applications in biomedical domains face and suggest several promising future directions for further improvement in human healthcare.
Affiliation(s)
- Chang Su
  - Department of Healthcare Policy and Research, Weill Cornell Medicine at Cornell University, New York, NY, USA
- Jie Tong
  - Department of Mechanical and Aerospace Engineering at New York University, New York, NY, USA
- Yongjun Zhu
  - Department of Library and Information Science, Sungkyunkwan University, Seoul, South Korea
- Peng Cui
  - Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Fei Wang
  - Department of Healthcare Policy and Research, Weill Cornell Medicine at Cornell University, New York, NY, USA
8
Chen X, Xie H, Wang FL, Liu Z, Xu J, Hao T. A bibliometric analysis of natural language processing in medical research. BMC Med Inform Decis Mak 2018; 18:14. [PMID: 29589569 PMCID: PMC5872501 DOI: 10.1186/s12911-018-0594-x] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Indexed: 12/29/2022]
Abstract
Background: Natural language processing (NLP) has played an increasingly significant role in advancing medicine. Rich research achievements in NLP methods and applications for medical information processing are available. It is of great significance to conduct a deep analysis to understand the recent development of the NLP-empowered medical research field. However, few studies have examined the research status of this field. Therefore, this study aims to quantitatively assess the academic output of NLP in the medical research field. Methods: We conducted a bibliometric analysis of NLP-empowered medical research publications retrieved from PubMed for the period 2007–2016. The analysis focused on three aspects. Firstly, the literature distribution characteristics were obtained with a statistical analysis method. Secondly, a network analysis method was used to reveal scientific collaboration relations. Finally, thematic discovery and evolution were reflected using an affinity propagation clustering method. Results: There were 1405 NLP-empowered medical research publications published during the 10 years, with an average annual growth rate of 18.39%. The 10 most productive publication sources together contributed more than 50% of the total publications. The USA had the highest number of publications. A moderately significant correlation between a country’s publications and GDP per capita was revealed. Denny, Joshua C was the most productive author. Mayo Clinic was the most productive affiliation. The annual co-affiliation and co-country rates reached 64.04% and 15.79% in 2016, respectively. Ten main thematic areas were identified, including Computational biology, Terminology mining, Information extraction, Text classification, Social medium as data source, Information retrieval, etc. Conclusions: A bibliometric analysis of NLP-empowered medical research publications for uncovering the recent research status is presented. The results can assist relevant researchers, especially newcomers, in understanding the research development systematically, seeking scientific cooperation partners, optimizing research topic choices, and monitoring new scientific or technological activities.
Affiliation(s)
- Xieling Chen
  - College of Economics, Jinan University, Guangzhou, China
- Haoran Xie
  - Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong, Hong Kong, Special Administrative Region of China
- Fu Lee Wang
  - School of Science and Technology, The Open University of Hong Kong, Hong Kong, Hong Kong, Special Administrative Region of China
- Ziqing Liu
  - The Second Clinical Medical College, Guangzhou University of Chinese Medicine, Guangzhou, China
- Juan Xu
  - The Research Institute of National Supervision and Audit Law, Nanjing Audit University, Nanjing, China
- Tianyong Hao
  - School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
  - School of Computer, South China Normal University, Guangzhou, China