1. Bhattarai K, Oh IY, Sierra JM, Tang J, Payne PRO, Abrams Z, Lai AM. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spaCy's rule-based and machine learning-based methods. JAMIA Open 2024; 7:ooae060. [PMID: 38962662 PMCID: PMC11221943 DOI: 10.1093/jamiaopen/ooae060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 06/12/2024] [Accepted: 06/18/2024] [Indexed: 07/05/2024]
Abstract
Objective: Accurately identifying clinical phenotypes from Electronic Health Records (EHRs) provides additional insights into patients' health, especially when such information is unavailable in structured data. This study evaluates the application of OpenAI's Generative Pre-trained Transformer (GPT)-4 model to identify clinical phenotypes from EHR text in non-small cell lung cancer (NSCLC) patients. The goal was to identify disease stages, treatments, and progression using GPT-4, and to compare its performance against GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, and 2 rule-based and machine learning-based methods, namely scispaCy and medspaCy.
Materials and Methods: Phenotypes such as initial cancer stage, initial treatment, evidence of cancer recurrence, and affected organs during recurrence were identified from 13 646 clinical notes for 63 NSCLC patients from Washington University in St. Louis, Missouri. The performance of the GPT-4 model was evaluated against GPT-3.5-turbo, Flan-T5-xxl, Flan-T5-xl, Llama-3-8B, medspaCy, and scispaCy by comparing precision, recall, and micro-F1 scores.
Results: GPT-4 achieved higher F1 score, precision, and recall than the Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, medspaCy, and scispaCy models. GPT-3.5-turbo performed similarly to GPT-4. The GPT, Flan-T5, and Llama models were not constrained by explicit rule requirements for contextual pattern recognition. The spaCy models relied on predefined patterns, leading to their suboptimal performance.
Discussion and Conclusion: GPT-4 improves clinical phenotype identification due to its robust pre-training and remarkable pattern recognition capability on the embedded tokens. It demonstrates data-driven effectiveness even with limited context in the input. While rule-based models remain useful for some tasks, GPT models offer improved contextual understanding of the text and robust clinical phenotype extraction.
Affiliation(s)
- Kriti Bhattarai
  - Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
  - Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
- Inez Y Oh
  - Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
- Jonathan Moran Sierra
  - Medical Scientist Training Program, Washington University School of Medicine, St. Louis, MO 63110, United States
- Jonathan Tang
  - Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO 63110, United States
- Philip R O Payne
  - Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
  - Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
- Zach Abrams
  - Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
- Albert M Lai
  - Institute for Informatics, Data Science & Biostatistics, Washington University School of Medicine, St. Louis, MO 63110, United States
  - Department of Computer Science, Washington University in St Louis, St. Louis, MO 63110, United States
2. Ogrinc M, Koroušić Seljak B, Eftimov T. Zero-shot evaluation of ChatGPT for food named-entity recognition and linking. Front Nutr 2024; 11:1429259. [PMID: 39290564 PMCID: PMC11406469 DOI: 10.3389/fnut.2024.1429259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 07/26/2024] [Indexed: 09/19/2024]
Abstract
Introduction: Recognizing and extracting key information from textual data plays an important role in intelligent systems by maintaining up-to-date knowledge, reinforcing informed decision-making, question-answering, and more. It is especially apparent in the food domain, where critical information guides the decisions of nutritionists and clinicians. The information extraction process involves two natural language processing tasks: named entity recognition (NER) and named entity linking (NEL). With the emergence of large language models (LLMs), especially ChatGPT, many areas began incorporating its knowledge to reduce workloads or simplify tasks. In the food domain, we noticed a similar opportunity to involve ChatGPT in NER and NEL.
Methods: To assess ChatGPT's capabilities, we evaluated two of its versions, ChatGPT-3.5 and ChatGPT-4, on both NER and NEL tasks, with an emphasis on food-related data. To benchmark our results in the food domain, we also investigated its capabilities in the more broadly studied biomedical domain. By evaluating its zero-shot capabilities, we were able to ascertain the strengths and weaknesses of the two versions of ChatGPT.
Results: ChatGPT shows promising results in NER compared to other models, but its effectiveness falls drastically when it is tasked with linking entities to their identifiers from semantic models.
Discussion: While the integration of ChatGPT holds potential across various fields, it is crucial to approach its use with caution, particularly when relying on its responses for critical decisions in food and biomedicine.
Affiliation(s)
- Matevž Ogrinc
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  - Department of Computer Systems, Jožef Stefan Institute, Ljubljana, Slovenia
- Tome Eftimov
  - Department of Computer Systems, Jožef Stefan Institute, Ljubljana, Slovenia
3. Bagler G, Goel M. Computational gastronomy: capturing culinary creativity by making food computable. NPJ Syst Biol Appl 2024; 10:72. [PMID: 38977713 PMCID: PMC11231233 DOI: 10.1038/s41540-024-00399-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Accepted: 06/25/2024] [Indexed: 07/10/2024]
Abstract
Cooking, a quintessential creative pursuit, holds profound significance for individuals, communities, and civilizations. Food and cooking transcend mere sensory pleasure to influence nutrition and public health outcomes. Inextricably linked to culinary and cultural heritage, food systems play a pivotal role in sustainability and the survival of life on our planet. Computational Gastronomy is a novel approach for investigating food through a data-driven paradigm. It offers a systematic, rule-based understanding of culinary arts by scrutinizing recipes for taste, nutritional value, health implications, and environmental sustainability. Probing the art of cooking through the lens of computation will open up a new realm of possibilities for culinary creativity. Amidst the ongoing quest for imitating creativity through artificial intelligence, an interesting question would be, 'Can a machine think like a Chef?' Capturing the experience and creativity of a chef in an AI algorithm presents an exciting opportunity for generating a galaxy of hitherto unseen recipes with desirable culinary, flavor, nutrition, health, and carbon footprint profiles.
Affiliation(s)
- Ganesh Bagler
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
  - Infosys Center for Artificial Intelligence, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
  - Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
- Mansi Goel
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
  - Infosys Center for Artificial Intelligence, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
  - Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Okhla Phase III, New Delhi, 110020, India
4. Bhattarai K, Oh IY, Sierra JM, Tang J, Payne PRO, Abrams ZB, Lai AM. Leveraging GPT-4 for Identifying Cancer Phenotypes in Electronic Health Records: A Performance Comparison between GPT-4, GPT-3.5-turbo, Flan-T5 and spaCy's Rule-based & Machine Learning-based methods. bioRxiv 2024:2023.09.27.559788. [PMID: 37808763 PMCID: PMC10557629 DOI: 10.1101/2023.09.27.559788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Objective: Accurately identifying clinical phenotypes from Electronic Health Records (EHRs) provides additional insights into patients' health, especially when such information is unavailable in structured data. This study evaluates the application of OpenAI's Generative Pre-trained Transformer (GPT)-4 model to identify clinical phenotypes from EHR text in non-small cell lung cancer (NSCLC) patients. The goal was to identify disease stages, treatments, and progression using GPT-4, and to compare its performance against GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, and two rule-based and machine learning-based methods, namely scispaCy and medspaCy.
Materials and Methods: Phenotypes such as initial cancer stage, initial treatment, evidence of cancer recurrence, and affected organs during recurrence were identified from 13,646 records for 63 NSCLC patients from Washington University in St. Louis, Missouri. The performance of the GPT-4 model was evaluated against GPT-3.5-turbo, Flan-T5-xxl, Flan-T5-xl, medspaCy, and scispaCy by comparing precision, recall, and micro-F1 scores.
Results: GPT-4 achieved higher F1 score, precision, and recall than the Flan-T5-xl, Flan-T5-xxl, medspaCy, and scispaCy models. GPT-3.5-turbo performed similarly to GPT-4. The GPT and Flan-T5 models were not constrained by explicit rule requirements for contextual pattern recognition. The spaCy models relied on predefined patterns, leading to their suboptimal performance.
Discussion and Conclusion: GPT-4 improves clinical phenotype identification due to its robust pre-training and remarkable pattern recognition capability on the embedded tokens. It demonstrates data-driven effectiveness even with limited context in the input. While rule-based models remain useful for some tasks, GPT models offer improved contextual understanding of the text and robust clinical phenotype extraction.
5. Fudholi DH, Zahra A, Rani S, Huda SN, Paputungan IV, Zukhri Z. BERT-based tourism named entity recognition: making use of social media for travel recommendations. PeerJ Comput Sci 2023; 9:e1731. [PMID: 38192479 PMCID: PMC10773720 DOI: 10.7717/peerj-cs.1731] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/10/2022] [Accepted: 11/09/2023] [Indexed: 01/10/2024]
Abstract
Background: Social media has become a massive encyclopedia of almost anything due to its content richness. People tell stories, write comments and feedback, and share knowledge through social media. The information available on social media enables 'clueless' travelers to get quick travel recommendations in the tourism sector. Through a simple query, such as typing 'places to visit in Bali', travelers can get many blog articles to help them decide which places of interest to visit. However, doing this reading task without a helper can be overwhelming.
Methods: To overcome this problem, we developed a Bidirectional Encoder Representations from Transformers (BERT)-based tourism named entity recognition system, which is used to highlight tourist destination places in the query result. BERT is a state-of-the-art machine learning framework for natural language processing that can give a decent performance in various settings and cases. Our tourism named entity recognition (NER) model distinguishes three types of tourist destination: heritage, natural, and purposefully built (man-made or artificial). The dataset is taken from various tourism-related community articles and posts.
Results: The model achieved an average F1-score of 0.80 and has been implemented in a travel destination recommendation system. By using this system, travelers can get quick recommendations based on the popularity of places visited in the query frame.
Discussion: In a survey of target respondents who had never visited, and had no or limited knowledge about, tourist attractions in some example cities, the average interest level in the recommendation results was higher than 4 on a scale of 1 to 5. Thus, the results can be considered good recommendations. Furthermore, the NER model performance is comparable to that of other related research.
Affiliation(s)
- Annisa Zahra
  - Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
- Septia Rani
  - Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
- Sheila Nurul Huda
  - Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
- Zainudin Zukhri
  - Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
6. Cenikj G, Valenčič E, Ispirova G, Ogrinc M, Stojanov R, Korošec P, Cavalli E, Seljak BK, Eftimov T. CafeteriaSA corpus: scientific abstracts annotated across different food semantic resources. Database (Oxford) 2022; 2022:6918707. [PMID: 36526439 PMCID: PMC9757992 DOI: 10.1093/database/baac107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 10/30/2022] [Accepted: 11/23/2022] [Indexed: 12/23/2022]
Abstract
In the last decades, a great amount of work has been done in predictive modeling of issues related to human and environmental health. Resolution of issues related to healthcare is made possible by the existence of several biomedical vocabularies and standards, which play a crucial role in understanding the health information, together with a large amount of health-related data. However, despite a large number of available resources and work done in the health and environmental domains, there is a lack of semantic resources that can be utilized in the food and nutrition domain, as well as their interconnections. For this purpose, in a European Food Safety Authority-funded project CAFETERIA, we have developed the first annotated corpus of 500 scientific abstracts that consists of 6407 annotated food entities with regard to Hansard taxonomy, 4299 for FoodOn and 3623 for SNOMED-CT. The CafeteriaSA corpus will enable the further development of natural language processing methods for food information extraction from textual data that will allow extracting food information from scientific textual data. Database URL: https://zenodo.org/record/6683798#.Y49wIezMJJF.
Affiliation(s)
- Eva Valenčič
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
  - Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
  - School of Health Sciences, College of Health, Medicine and Wellbeing, University of Newcastle, University Drive, Callaghan Campus, Newcastle, NSW 2308, Australia
  - Food and Nutrition Program, Hunter Medical Research Institute, Lot 1 Kookaburra Circuit, New Lambton Heights, Newcastle, NSW 2305, Australia
- Gordana Ispirova
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
  - Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Matevž Ogrinc
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
  - Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Riste Stojanov
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Ruger Boshkovikj 16, Skopje 1000, North Macedonia
- Peter Korošec
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Ermanno Cavalli
  - European Food Safety Authority, Via Carlo Magno 1A, Parma 43126, Italy
- Barbara Koroušić Seljak
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
  - Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Tome Eftimov
  - Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
7. Zhang Y, Li X, Yang Y, Wang T. Disease- and Drug-Related Knowledge Extraction for Health Management from Online Health Communities Based on BERT-BiGRU-ATT. International Journal of Environmental Research and Public Health 2022; 19:16590. [PMID: 36554472 PMCID: PMC9779596 DOI: 10.3390/ijerph192416590] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/09/2022] [Revised: 12/01/2022] [Accepted: 12/06/2022] [Indexed: 06/17/2023]
Abstract
Knowledge extraction from rich text in online health communities can supplement and improve the existing knowledge base, supporting evidence-based medicine and clinical decision making. The extracted time series health management data of users can help users with similar conditions when managing their health. By annotating four relationships, this study constructed a deep learning model, BERT-BiGRU-ATT, to extract disease-medication relationships. A Chinese-pretrained BERT model was used to generate word embeddings for the question-and-answer data from online health communities in China. In addition, a bidirectional gated recurrent unit, combined with an attention mechanism, was employed to capture sequence context features, to classify text related to diseases and drugs using a softmax classifier, and to obtain the time series data provided by users. By using various word embedding training experiments and comparisons with classical models, the superiority of our model in relation to extraction was verified. Based on the extracted knowledge, the evolution of a user's disease progression was analyzed according to the time series data provided by users. BERT word embedding, GRU, and attention mechanisms in our research play major roles in knowledge extraction. The knowledge extraction results obtained are expected to supplement and improve the existing knowledge base, assist doctors' diagnosis, and help users with dynamic lifecycle health management, such as user disease treatment management. In future studies, co-reference resolution can be introduced to further improve the extraction of relationships among diseases, drugs, and drug effects.
Affiliation(s)
- Yanli Zhang
  - College of Business Administration, Henan Finance University, Zhengzhou 451464, China
  - Business School, Henan University, Kaifeng 475004, China
- Xinmiao Li
  - School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
- Yu Yang
  - School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
  - China Banking and Insurance Regulatory Commission Neimengu Office, Hohhot 010019, China
- Tao Wang
  - College of Business Administration, Henan Finance University, Zhengzhou 451464, China
8. Review on knowledge extraction from text and scope in agriculture domain. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10239-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
9. Ispirova G, Cenikj G, Ogrinc M, Valenčič E, Stojanov R, Korošec P, Cavalli E, Koroušić Seljak B, Eftimov T. CafeteriaFCD Corpus: Food Consumption Data Annotated with Regard to Different Food Semantic Resources. Foods 2022; 11:foods11172684. [PMID: 36076868 PMCID: PMC9455825 DOI: 10.3390/foods11172684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 08/23/2022] [Accepted: 08/30/2022] [Indexed: 11/16/2022]
Abstract
Despite the numerous studies in the last decade involving food and nutrition data, this domain remains low resourced. Annotated corpora are very useful tools for researchers and experts of the domain in question, as well as for data scientists for analysis. In this paper, we present the annotation process of food consumption data (recipes) with semantic tags from different semantic resources: the Hansard taxonomy, the FoodOn ontology, the SNOMED CT terminology, and the FoodEx2 classification system. FoodBase is an annotated corpus of food entities (recipes), which includes a curated version of 1000 instances, considered a gold standard. In this study, we use the curated version of FoodBase and two different annotation approaches: the NCBO annotator (for the FoodOn and SNOMED CT annotations) and the semi-automatic StandFood method (for the FoodEx2 annotations). The end result is a new version of the gold standard of the FoodBase corpus, called the CafeteriaFCD (Cafeteria Food Consumption Data) corpus. This corpus contains food consumption data (recipes) annotated with semantic tags from the aforementioned four different external semantic resources. With these annotations, data interoperability is achieved between five semantic resources from different domains. This resource can be further utilized for developing and training different information extraction pipelines using state-of-the-art NLP approaches for tracing knowledge about food safety applications.
Affiliation(s)
- Gordana Ispirova
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
- Gjorgjina Cenikj
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
- Matevž Ogrinc
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
- Eva Valenčič
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
  - School of Health Sciences, College of Health, Medicine and Wellbeing, University of Newcastle, Callaghan, NSW 2308, Australia
  - Food and Nutrition Program, Hunter Medical Research Institute, Newcastle, NSW 2305, Australia
- Riste Stojanov
  - Faculty of Computer Science and Engineering, “Ss. Cyril and Methodius” University in Skopje, 1000 Skopje, North Macedonia
- Peter Korošec
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
- Ermanno Cavalli
  - Resources and Support Department, European Food Safety Authority, 43126 Parma, Italy
- Barbara Koroušić Seljak
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
- Tome Eftimov
  - Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia
  - Faculty of Computer and Information Science, University of Ljubljana, 1000 Ljubljana, Slovenia
10. Perron BE, Victor BG, Ryan JP, Piellusch EK, Sokol RL. A text-based approach to measuring opioid-related risk among families involved in the child welfare system. Child Abuse & Neglect 2022; 131:105688. [PMID: 35687937 DOI: 10.1016/j.chiabu.2022.105688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2022] [Revised: 05/17/2022] [Accepted: 05/24/2022] [Indexed: 06/15/2023]
Abstract
Background: The public health significance of the opioid epidemic is well-established. However, few states collect data on opioid problems among families involved in child welfare services. The absence of data creates significant barriers to understanding the impact of opioids on the service system and the needs of families being served.
Objective: This study sought to validate binary and count-based indicators of opioid-related maltreatment risk based on mentions of opioid use in written child welfare summaries.
Data and Procedures: We developed a comprehensive list of terms referring to opioid street drugs and pharmaceuticals. This terminology list was used to scan and flag investigator summaries from an extensive collection of investigations (N = 362,754) obtained from a state-based child welfare system in the United States. Associations between mentions of opioid use and investigators' decisions to substantiate maltreatment and remove a child from home were tested within a framework of a priori hypotheses.
Results: Approximately 6.3% of all investigations contained one or more opioid use mentions. Opioid mentions exhibited practically significant associations with investigator decisions. One in ten summaries that were substantiated had an opioid mention. One in five investigations that led to the out-of-home placement of a child contained an opioid mention.
Conclusion: This study demonstrates the feasibility of using simple text mining procedures to extract information from unstructured text documents. These methods provide novel opportunities to build insights into opioid-related problems among families involved in a child welfare system when structured data are not available.
Affiliation(s)
- Brian E Perron
  - University of Michigan, School of Social Work, 1080 S. University Avenue, Ann Arbor, MI 48109, United States of America
- Bryan G Victor
  - Wayne State University, School of Social Work, 5447 Woodward Avenue, Detroit, MI 48202, United States of America
- Joseph P Ryan
  - University of Michigan, School of Social Work, 1080 S. University Avenue, Ann Arbor, MI 48109, United States of America
- Emily K Piellusch
  - University of Michigan, School of Social Work, 1080 S. University Avenue, Ann Arbor, MI 48109, United States of America
- Rebeccah L Sokol
  - Wayne State University, School of Social Work, 5447 Woodward Avenue, Detroit, MI 48202, United States of America
11. Butt S, Bakhtyar M, Noor W, Baber J, Ullah I, Ahmed A, Basit A, Kakar MSH. Semantic similarity based food entities recognition using WordNet. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-219306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Unstructured text processing is the first step for several applications such as question answering systems, information retrieval, and recipe classification. In the field of recipe classification, a number of frameworks have been proposed. However, it is still very tedious and time consuming to extract the food items from unstructured text and then process them for classification. In this research, automatic food item detection from unstructured text is proposed based on semantic sense modeling. Candidate nouns that may be food items are detected, and the similarity of those nouns to possible food categories is then computed. A candidate noun is treated as a food item if the similarity is high. The similarity between a possible food item and a food category is computed using the WordNet ontology. The proposed framework is evaluated on benchmark datasets, and competitive performance has been achieved. On a large dataset that contains around 20 K recipes, the F-score is 0.89, improved from 0.56.
Affiliation(s)
- Sahrish Butt
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- Maheen Bakhtyar
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- Waheed Noor
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- Junaid Baber
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
  - Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, Grenoble, France
- Ihsan Ullah
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- Atiq Ahmed
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- Abdul Basit
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
- M. Saeed H. Kakar
  - Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan
12. Workflow for building interoperable food and nutrition security (FNS) data platforms. Trends Food Sci Technol 2022. [DOI: 10.1016/j.tifs.2022.03.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
13. Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French. J Biomed Inform 2022; 130:104073. [DOI: 10.1016/j.jbi.2022.104073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2021] [Revised: 02/09/2022] [Accepted: 04/07/2022] [Indexed: 11/18/2022]
14. Application of named entity recognition on tweets during earthquake disaster: a deep learning-based approach. Soft Comput 2022. [DOI: 10.1007/s00500-021-06370-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
15. Hendawi R, Alian S, Li J. A Smart Mobile Application to Simplify Medical Documents and Improve Health Literacy: System Design and Feasibility Validation (Preprint). JMIR Form Res 2021; 6:e35069. [PMID: 35363142 PMCID: PMC9015750 DOI: 10.2196/35069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/19/2021] [Revised: 01/25/2022] [Accepted: 02/15/2022] [Indexed: 11/23/2022]
Abstract
Background: People with low health literacy experience more challenges in understanding instructions given by their health providers, following prescriptions, and understanding their health care system sufficiently to obtain the maximum benefits. People with insufficient health literacy have high risk of making medical mistakes, more chances of experiencing adverse drug effects, and inferior control of chronic diseases.
Objective: This study aims to design, develop, and evaluate a mobile health app, MediReader, to help individuals better understand complex medical materials and improve their health literacy.
Methods: MediReader is designed and implemented through several steps, which are as follows: measure and understand an individual’s health literacy level; identify medical terminologies that the individual may not understand based on their health literacy; annotate and interpret the identified medical terminologies tailored to the individual’s reading skill levels, with meanings defined in the appropriate external knowledge sources; evaluate MediReader using task-based user study and satisfaction surveys.
Results: On the basis of the comparison with a control group, user study results demonstrate that MediReader can improve users’ understanding of medical documents. This improvement is particularly significant for users with low health literacy levels. The satisfaction survey showed that users are satisfied with the tool in general.
Conclusions: MediReader provides an easy-to-use interface for users to read and understand medical documents. It can effectively identify medical terms that a user may not understand, and then, annotate and interpret them with appropriate meanings using languages that the user can understand. Experimental results demonstrate the feasibility of using this tool to improve an individual’s understanding of medical materials.
Affiliation(s)
- Rasha Hendawi
  - North Dakota State University, Fargo, ND, United States
- Shadi Alian
  - North Dakota State University, Fargo, ND, United States
- Juan Li
  - North Dakota State University, Fargo, ND, United States
16
Timotijevic L, Astley S, Bogaardt M, Bucher T, Carr I, Copani G, de la Cueva J, Eftimov T, Finglas P, Hieke S, Hodgkins C, Koroušić Seljak B, Klepacz N, Pasch K, Maringer M, Mikkelsen B, Normann A, Ofei K, Poppe K, Pourabdollahian G, Raats M, Roe M, Sadler C, Selnes T, van der Veen H, van’t Veer P, Zimmermann K. Designing a research infrastructure (RI) on food behaviour and health: Balancing user needs, business model, governance mechanisms and technology. Trends Food Sci Technol 2021. [DOI: 10.1016/j.tifs.2021.07.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
17
Yang LWY, Ng WY, Foo LL, Liu Y, Yan M, Lei X, Zhang X, Ting DSW. Deep learning-based natural language processing in ophthalmology: applications, challenges and future directions. Curr Opin Ophthalmol 2021; 32:397-405. [PMID: 34324453 DOI: 10.1097/icu.0000000000000789] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023]
Abstract
PURPOSE OF REVIEW: Artificial intelligence (AI) is the fourth industrial revolution in mankind's history. Natural language processing (NLP) is a type of AI that transforms human language, to one that computers can interpret and process. NLP is still in the formative stages of development in healthcare, with promising applications and potential challenges in its applications. This review provides an overview of AI-based NLP, its applications in healthcare and ophthalmology, next-generation use case, as well as potential challenges in deployment.
RECENT FINDINGS: The integration of AI-based NLP systems into existing clinical care shows considerable promise in disease screening, risk stratification, and treatment monitoring, amongst others. Stakeholder collaboration, greater public acceptance, and advancing technologies will continue to shape the NLP landscape in healthcare and ophthalmology.
SUMMARY: Healthcare has always endeavored to be patient centric and personalized. For AI-based NLP systems to become an eventual reality in larger-scale applications, it is pertinent for key stakeholders to collaborate and address potential challenges in application. Ultimately, these would enable more equitable and generalizable use of NLP systems for the betterment of healthcare and society.
Affiliation(s)
- Wei Yan Ng
  - Singapore National Eye Centre, Singapore Eye Research Institute
  - Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
- Li Lian Foo
  - Singapore National Eye Centre, Singapore Eye Research Institute
  - Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
- Yong Liu
  - Institute of High Performance Computing, A*STAR
- Ming Yan
  - Institute of High Performance Computing, A*STAR
- Daniel Shu Wei Ting
  - Singapore National Eye Centre, Singapore Eye Research Institute
  - Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
18
Stojanov R, Popovski G, Cenikj G, Koroušić Seljak B, Eftimov T. A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation. J Med Internet Res 2021; 23:e28229. [PMID: 34383671 PMCID: PMC8415558 DOI: 10.2196/28229] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/25/2021] [Revised: 03/13/2021] [Accepted: 05/06/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND: Recently, food science has been garnering a lot of attention. There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only few food semantic resources and few rule-based methods for food information extraction, which often depend on some external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources.
OBJECTIVE: In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction.
METHODS: We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags.
RESULTS: All BERT models provided very promising results with 93.30% to 94.31% macro F1 scores in the task of distinguishing food versus nonfood entity, which represents the new state-of-the-art technology in food information extraction. Considering the tasks where semantic tags are predicted, all BERT models obtained very promising results once again, with their macro F1 scores ranging from 73.39% to 78.96%.
CONCLUSIONS: FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.
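The results above are reported as macro F1, which averages per-class F1 without weighting by class frequency. As a rough illustration of that arithmetic, the sketch below computes macro F1 over token-level labels; the function name `macro_f1` and the toy labels are assumptions for illustration, and the abstract does not say whether the FoodNER evaluation was token-level or entity-level.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but wrong
            fn[g] += 1          # g was missed
    scores = []
    for c in sorted(set(gold) | set(pred)):
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because every class in the union occurs at least once.
        scores.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))
    return sum(scores) / len(scores)


# Hypothetical token labels: one FOOD token is missed by the tagger.
gold = ["FOOD", "FOOD", "FOOD", "O"]
pred = ["FOOD", "FOOD", "O", "O"]
print(round(macro_f1(gold, pred), 4))
```

Here the FOOD class scores 0.8 and the O class scores 2/3, so the macro average sits between them regardless of how many tokens each class has.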
Affiliation(s)
- Riste Stojanov
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, the Former Yugoslav Republic of Macedonia
- Gorjan Popovski
  - Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Gjorgjina Cenikj
  - Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
  - Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Tome Eftimov
  - Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
19
MenuNER: Domain-Adapted BERT Based NER Approach for a Domain with Limited Dataset and Its Application to Food Menu Domain. Applied Sciences-Basel 2021. [DOI: 10.3390/app11136007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer-learning utilizing contextualized word embedding from pre-trained language models has shown remarkable results for many NLP tasks, including Named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) is gaining prominent attention among various contextualized word embedding models as a state-of-the-art pre-trained language model. It is quite expensive to train a BERT model from scratch for a new application domain since it needs a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online user reviews for the restaurant and propose a simple but effective approach for NER task on a new domain where a large dataset is rarely available or difficult to prepare, such as food menu domain, based on domain adaptation technique for word embedding and fine-tuning the popular NER task network model ‘Bi-LSTM+CRF’ with extended feature vectors. The proposed NER approach (named as ‘MenuNER’) consists of two step-processes: (1) Domain adaptation for target domain; further pre-training of the off-the-shelf BERT language model (BERT-base) in semi-supervised fashion on a domain-specific dataset, and (2) Supervised fine-tuning the popular Bi-LSTM+CRF network for downstream task with extended feature vectors obtained by concatenating word embedding from the domain-adapted pre-trained BERT model from the first step, character embedding and POS tag feature information. Experimental results on handcrafted food menu corpus from customers’ review dataset show that our proposed approach for domain-specific NER task, that is: food menu named-entity recognition, performs significantly better than the one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves 92.5% F1 score on the YELP dataset for the MenuNER task.
20
Zeb A, Soininen JP, Sozer N. Data harmonisation as a key to enable digitalisation of the food sector: A review. Food and Bioproducts Processing 2021. [DOI: 10.1016/j.fbp.2021.02.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/29/2022]
21
Gao S, Kotevska O, Sorokine A, Christian JB. A pre-training and self-training approach for biomedical named entity recognition. PLoS One 2021; 16:e0246310. [PMID: 33561139 PMCID: PMC7872256 DOI: 10.1371/journal.pone.0246310] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/16/2020] [Accepted: 01/18/2021] [Indexed: 11/18/2022] Open
Abstract
Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain similar performance with the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entities types are more rare and not specifically covered in UMLS.
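The final self-training step described in this abstract can be illustrated with a toy loop. The sketch below is not the authors' implementation: a one-dimensional nearest-centroid classifier stands in for the BiLSTM-CRF or BERT tagger, the pre-training and fine-tuning stages are omitted, and the names `fit_centroids`, `predict_with_margin`, `self_train`, the margin-based confidence measure, and the `threshold` and `rounds` parameters are all assumptions made for illustration.

```python
from statistics import mean


def fit_centroids(samples):
    """Train the toy model: one centroid per label from (x, label) pairs."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict_with_margin(centroids, x):
    """Return the nearest label and the distance gap to the runner-up label."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(labeled, unlabeled, threshold=1.0, rounds=3):
    """Repeatedly pseudo-label confident unlabeled points and retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = fit_centroids(labeled)
        confident, remaining = [], []
        for x in pool:
            y, margin = predict_with_margin(model, x)
            if margin >= threshold:
                confident.append((x, y))   # trust this pseudo-label
            else:
                remaining.append(x)        # too ambiguous, keep unlabeled
        if not confident:
            break                          # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    return fit_centroids(labeled), labeled


# Hypothetical data: two labeled seeds and three unlabeled points.
seeds = [(0.0, "O"), (10.0, "ENT")]
model, final_labeled = self_train(seeds, [1.0, 9.0, 5.2])
print(model)               # {'O': 0.5, 'ENT': 9.5}
print(len(final_labeled))  # 4: the ambiguous point 5.2 is never pseudo-labeled
```

The confidence threshold is the main design choice in such a loop: set too low, wrong pseudo-labels feed back into training; set too high, the unlabeled pool is never used.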
Affiliation(s)
- Shang Gao
  - Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Olivera Kotevska
  - Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Alexandre Sorokine
  - Geospatial Science and Human Security Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- J. Blair Christian
  - Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
22
Petković M, Popovski G, Seljak BK, Kocev D, Eftimov T. DietHub: Dietary habits analysis through understanding the content of recipes. Trends Food Sci Technol 2021. [DOI: 10.1016/j.tifs.2020.10.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
23
Batra D, Diwan N, Upadhyay U, Kalra JS, Sharma T, Sharma AK, Khanna D, Marwah JS, Kalathil S, Singh N, Tuwani R, Bagler G. RecipeDB: a resource for exploring recipes. Database (Oxford) 2020; 2020:baaa077. [PMID: 33238002 PMCID: PMC7687679 DOI: 10.1093/database/baaa077] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/10/2020] [Revised: 08/10/2020] [Accepted: 11/19/2020] [Indexed: 01/10/2023]
Abstract
Cooking is the act of turning nature into the culture, which has enabled the advent of the omnivorous human diet. The cultural wisdom of processing raw ingredients into delicious dishes is embodied in their cuisines. Recipes thus are the cultural capsules that encode elaborate cooking protocols for evoking sensory satiation as well as providing nourishment. As we stand on the verge of an epidemic of diet-linked disorders, it is eminently important to investigate the culinary correlates of recipes to probe their association with sensory responses as well as consequences for nutrition and health. RecipeDB (https://cosylab.iiitd.edu.in/recipedb) is a structured compilation of recipes, ingredients and nutrition profiles interlinked with flavor profiles and health associations. The repertoire comprises of meticulous integration of 118 171 recipes from cuisines across the globe (6 continents, 26 geocultural regions and 74 countries), cooked using 268 processes (heat, cook, boil, simmer, bake, etc.), by blending over 20 262 diverse ingredients, which are further linked to their flavor molecules (FlavorDB), nutritional profiles (US Department of Agriculture) and empirical records of disease associations obtained from MEDLINE (DietRx). This resource is aimed at facilitating scientific explorations of the culinary space (recipe, ingredient, cooking processes/techniques, dietary styles, etc.) linked to taste (flavor profile) and health (nutrition and disease associations) attributes seeking for divergent applications. Database URL: https://cosylab.iiitd.edu.in/recipedb.
Affiliation(s)
- Devansh Batra
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Nirav Diwan
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Utkarsh Upadhyay
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Jushaan Singh Kalra
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Tript Sharma
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Aman Kumar Sharma
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Dheeraj Khanna
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Jaspreet Singh Marwah
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Srilakshmi Kalathil
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Navjot Singh
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Rudraksh Tuwani
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
- Ganesh Bagler
  - Complex Systems Laboratory, Center for Computational Biology, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India 110020
24
Delayed Combination of Feature Embedding in Bidirectional LSTM CRF for NER. Applied Sciences-Basel 2020. [DOI: 10.3390/app10217557] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Named Entity Recognition (NER) plays a vital role in natural language processing (NLP). Currently, deep neural network models have achieved significant success in NER. Recent advances in NER systems have introduced various feature selections to identify appropriate representations and handle Out-Of-the-Vocabulary (OOV) words. After selecting the features, they are all concatenated at the embedding layer before being fed into a model to label the input sequences. However, when concatenating the features, information collisions may occur and this would cause the limitation or degradation of the performance. To overcome the information collisions, some works tried to directly connect some features to latter layers, which we call the delayed combination and show its effectiveness by comparing it to the early combination. As feature encodings for input, we selected the character-level Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM) word encoding, the pre-trained word embedding, and the contextual word embedding and additionally designed CNN-based sentence encoding using a dictionary. These feature encodings are combined at early or delayed position of the bidirectional LSTM Conditional Random Field (CRF) model according to each feature’s characteristics. We evaluated the performance of this model on the CoNLL 2003 and OntoNotes 5.0 datasets using the F1 score and compared the delayed combination model with our own implementation of the early combination as well as the previous works. This comparison convinces us that our delayed combination is more effective than the early one and also highly competitive.
25
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), allowing, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable for easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
26
Stojanov R, Popovski G, Jofce N, Trajanov D, Seljak BK, Eftimov T. FoodViz: Visualization of Food Entities Linked Across Different Standards. Machine Learning, Optimization, and Data Science 2020. [DOI: 10.1007/978-3-030-64580-9_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/27/2023]
27
Kapsokefalou M, Roe M, Turrini A, Costa HS, Martinez-Victoria E, Marletta L, Berry R, Finglas P. Food Composition at Present: New Challenges. Nutrients 2019; 11:E1714. [PMID: 31349634 PMCID: PMC6723776 DOI: 10.3390/nu11081714] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Received: 05/23/2019] [Revised: 07/04/2019] [Accepted: 07/19/2019] [Indexed: 11/16/2022] Open
Abstract
Food composition data is important for stakeholders and users active in the areas of food, nutrition and health. New challenges related to the quality of food composition data reflect the dynamic changes in these areas while the emerging technologies create new opportunities. These challenges and the impact on food composition data for the Mediterranean region were reviewed during the NUTRIMAD 2018 congress of the Spanish Society for Community Nutrition. Data harmonization and standardization, data compilation and use, thesauri, food classification and description, and data exchange are some of the areas that require new approaches. Consistency in documentation, linking of information between datasets, food matching and capturing portion size information suggest the need for new automated tools. Research Infrastructures bring together key data and services. The delivery of sustainable networks and Research Infrastructures in food, nutrition and health will help to increase access to and effective use of food composition data. EuroFIR AISBL coordinates experts and national compilers and contributes to worldwide efforts aiming to produce and maintain high quality data and tools. A Mediterranean Network that shares high quality food composition data is vital for the development of ambitious common research and policy initiatives in support of the Mediterranean Diet.
Affiliation(s)
- Maria Kapsokefalou
  - Department of Food Science and Human Nutrition, Agricultural University of Athens, 11855 Athens, Greece
  - EuroFIR AISBL Executive Board, 1050 Brussels, Belgium
- Mark Roe
  - EuroFIR AISBL Executive Board, 1050 Brussels, Belgium
- Aida Turrini
  - Research Centre for Food and Nutrition (CREA-Food and Nutrition), CREA-Council for Agricultural Research and Economics, 00178 Rome, Italy
- Helena S Costa
  - EuroFIR AISBL Executive Board, 1050 Brussels, Belgium
  - Department of Food and Nutrition, National Institute of Health Dr. Ricardo Jorge, I.P., 1649-016 Lisbon, Portugal
  - REQUIMTE, LAQV/Faculty of Pharmacy, University of Porto, 4050-313 Porto, Portugal
- Emilio Martinez-Victoria
  - Institute of Nutrition and Food Technology "José Mataix", University of Granada, 18016 Armilla (Granada), Spain
- Luisa Marletta
  - Research Centre for Food and Nutrition (CREA-Food and Nutrition), CREA-Council for Agricultural Research and Economics, 00178 Rome, Italy
- Rachel Berry
  - Quadram Institute Bioscience, Norwich, Norfolk NR4 7UA, UK
- Paul Finglas
  - EuroFIR AISBL Executive Board, 1050 Brussels, Belgium
  - Quadram Institute Bioscience, Norwich, Norfolk NR4 7UA, UK
28
Abstract
Named Entity Recognition (NER) is the process of identifying the elementary units in a text document and classifying them into predefined categories such as person, location, organization and so forth. NER plays an important role in many Natural Language Processing applications like information retrieval, question answering, machine translation and so forth. Resolving the ambiguities of lexical items involved in a text document is a challenging task. NER in Indian languages is always a complex task due to their morphological richness and agglutinative nature. Even though different solutions were proposed for NER, it is still an unsolved problem. Traditional approaches to Named Entity Recognition were based on the application of hand-crafted features to classical machine learning techniques such as Hidden Markov Model (HMM), Support Vector Machine (SVM), Conditional Random Field (CRF) and so forth. But the introduction of deep learning techniques to the NER problem changed the scenario, where the state of art results have been achieved using deep learning architectures. In this paper, we address the problem of effective word representation for NER in Indian languages by capturing the syntactic, semantic and morphological information. We propose a deep learning based entity extraction system for Indian languages using a novel combined word representation, including character-level, word-level and affix-level embeddings. We have used ‘ARNEKT-IECSIL 2018’ shared data for training and testing. Our results highlight the improvement that we obtained over the existing pre-trained word representations.
29
Thomas A, Sangeetha S. An innovative hybrid approach for extracting named entities from unstructured text data. Comput Intell 2019. [DOI: 10.1111/coin.12214] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/28/2022]
Affiliation(s)
- Anu Thomas
  - Text Analytics & NLP Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India
- S. Sangeetha
  - Text Analytics & NLP Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India
30
Popovski G, Seljak BK, Eftimov T. FoodBase corpus: a new resource of annotated food entities. Database (Oxford) 2019; 2019:baz121. [PMID: 31682732 PMCID: PMC6827550 DOI: 10.1093/database/baz121] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 12/24/2022]
Abstract
The existence of annotated text corpora is essential for the development of public health services and tools based on natural language processing (NLP) and text mining. Recently organized biomedical NLP shared tasks have provided annotated corpora related to different biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. These are needed to develop named-entity recognition (NER) models that are used for extracting entities from text and finding their relations. However, to the best of our knowledge, there are limited annotated corpora that provide information about food entities despite food and dietary management being an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed using recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: 'Appetizers and Snacks', 'Breakfast and Lunch', 'Dessert', 'Dinner' and 'Drinks'. Semantic tags used for annotating food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Since FoodIE provides a weakly annotated corpus, by manually evaluating the obtained results on 1000 recipes, we created a gold standard of FoodBase. It consists of 12 844 food entity annotations describing 2105 unique food entities. Additionally, we provided a weakly annotated corpus on an additional 21 790 recipes. It consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is necessary for developing corpus-based NER models for food science, as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification. 
FoodBase can be used for detecting semantic differences/similarities between food concepts, and after all we believe that it will open a new path for learning food embedding space that can be used in predictive studies.
Collapse
Affiliation(s)
- Gorjan Popovski
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, ul. Rudzer Boshkovikj 16, 1000 Skopje, Macedonia
  - Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
  - Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
- Barbara Koroušić Seljak
  - Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
- Tome Eftimov
  - Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
  - Department of Biomedical Data Science, Stanford University, 450 Serra Mall, Stanford 94305 CA, USA
  - Center for Population Health Sciences, Stanford University, 450 Serra Mall, Stanford 94305 CA, USA
31
|
Popovski G, Seljak BK, Eftimov T. FoodBase corpus: a new resource of annotated food entities. Database (Oxford) 2019; 2019:5611291. [PMID: 31682732 DOI: 10.1093/database/baz121] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 05/28/2023]