1
Huang TY, Chong CF, Lin HY, Chen TY, Chang YC, Lin MC. A pre-trained language model for emergency department intervention prediction using routine physiological data and clinical narratives. Int J Med Inform 2024; 191:105564. [PMID: 39121529 DOI: 10.1016/j.ijmedinf.2024.105564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 07/15/2024] [Accepted: 07/20/2024] [Indexed: 08/12/2024]
Abstract
INTRODUCTION The urgency and complexity of emergency room (ER) settings require precise and swift decision-making processes for patient care. Ensuring the timely execution of critical examinations and interventions is vital for reducing diagnostic errors, but the literature highlights a need for innovative approaches to optimize diagnostic accuracy and patient outcomes. In response, our study endeavors to create predictive models for timely examinations and interventions by leveraging the patient's symptoms and vital signs recorded during triage, and in so doing, augment traditional diagnostic methodologies. METHODS Focusing on four key areas (medication dispensing, vital interventions, laboratory testing, and emergency radiology exams), the study employed Natural Language Processing (NLP) and seven advanced machine learning techniques. The research was centered around the innovative use of BioClinicalBERT, a state-of-the-art NLP framework. RESULTS BioClinicalBERT emerged as the superior model, outperforming others in predictive accuracy. The integration of physiological data with patient narrative symptoms demonstrated greater effectiveness compared to models based solely on textual data. The robustness of our approach was confirmed by an Area Under the Receiver Operating Characteristic curve (AUROC) score of 0.9. CONCLUSION The findings of our study underscore the feasibility of establishing a decision support system for emergency patients, targeting timely interventions and examinations based on a nuanced analysis of symptoms. By using an advanced natural language processing technique, our approach shows promise for enhancing diagnostic accuracy. However, the current model is not yet fully mature for direct implementation into daily clinical practice. Recognizing the imperative nature of precision in the ER environment, future research endeavors must focus on refining and expanding predictive models to include detailed timely examinations and interventions. Although the progress achieved in this study represents an encouraging step towards a more innovative and technology-driven paradigm in emergency care, full clinical integration warrants further exploration and validation.
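For readers unfamiliar with the AUROC figure quoted in the abstract, the following is a minimal sketch of how such a score can be computed from labels and model scores. It uses the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive case receives the higher score). The function name `auroc` and the toy data are assumptions for illustration only; they are not taken from the study and do not reproduce its reported value.

```python
def auroc(labels, scores):
    """Pairwise-ranking AUROC: probability that a randomly chosen
    positive case is scored above a randomly chosen negative case.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = intervention was needed, 0 = not needed.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
example = auroc(toy_labels, toy_scores)  # 8 of the 9 pairs are ranked correctly
```

The quadratic loop is chosen for readability; for large evaluation sets a sort-based implementation would be preferable.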
Affiliation(s)
- Ting-Yun Huang
- Emergency Department, Shuang-Ho Hospital, Taipei Medical University, Taipei, Taiwan.
- Chee-Fah Chong
- Emergency Department, Shin-Kong Wu Ho-Su Memorial Hospital, Taipei, Taiwan.
- Heng-Yu Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan.
- Tzu-Ying Chen
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan; Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan.
- Ming-Chin Lin
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan; Department of Neurosurgery, Shuang-Ho Hospital, Taipei Medical University, Taipei, Taiwan; Department of Neurosurgery, Taipei Municipal Wanfang Hospital, Taipei Medical University, Taipei, Taiwan.
2
Xu S, Zhang Y, Chen L, An X. Is metadata of articles about COVID-19 enough for multilabel topic classification task? Database (Oxford) 2024; 2024:baae106. [PMID: 39432499 PMCID: PMC11492800 DOI: 10.1093/database/baae106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 06/03/2024] [Accepted: 09/12/2024] [Indexed: 10/23/2024]
Abstract
The ever-increasing volume of COVID-19-related articles presents a significant challenge for the manual curation and multilabel topic classification of LitCovid. For this purpose, a novel multilabel topic classification framework is developed in this study, which considers both the correlation and imbalance of topic labels, while empowering the pretrained model. With the help of this framework, this study aims to answer the following question: Do full texts, MeSH (Medical Subject Heading), and biological entities of articles about COVID-19 encode more discriminative information than metadata (title, abstract, keyword, and journal name)? From extensive experiments on our enriched version of the BC7-LitCovid corpus and Hallmarks of Cancer corpus, the following conclusions can be drawn. Our framework demonstrates superior performance and robustness. The metadata of scientific publications about COVID-19 carries valuable information for multilabel topic classification. Compared to biological entities, full texts and MeSH can further enhance the performance of our framework for multilabel topic classification, but the improvement is very limited. Database URL: https://github.com/pzczxs/Enriched-BC7-LitCovid.
Affiliation(s)
- Shuo Xu
- College of Economics and Management, Beijing University of Technology, No. 100 PingLeYuan, Chaoyang District, Beijing 100124, P.R. China
- Yuefu Zhang
- College of Economics and Management, Beijing University of Technology, No. 100 PingLeYuan, Chaoyang District, Beijing 100124, P.R. China
- Liang Chen
- Institute of Scientific and Technical Information of China, No. 15 Fuxing Road, Haidian District, Beijing 100038, P.R. China
- Xin An
- School of Economics and Management, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, P.R. China
3
Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024; 31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open
Abstract
OBJECTIVE Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of the fine-tuned LLMs on diverse biomedical natural language processing (NLP) tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical NLP tasks. MATERIALS AND METHODS We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, these corpora were converted to the instruction data used to fine-tune the general LLM. During the supervised fine-tuning phase, a 2-stage strategy is proposed to optimize the model performance across various tasks. RESULTS Experimental results on 13 test sets, which include named entity recognition, relation extraction, text classification, and question answering tasks, demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multitasking. CONCLUSION Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi demonstrates bilingual multitasking capability through supervised fine-tuning. However, tasks such as information extraction, which are not generative in nature, remain challenging for LLM-based generative approaches, which still underperform conventional discriminative approaches using smaller language models.
Affiliation(s)
- Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jinzhong Ning
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yingwen Zhao
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zhijun Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zeyuan Ding
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Peng Chen
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Weiru Fu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Qinyu Han
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Guangtao Xu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yunzhi Qiu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Dinghao Pan
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jiru Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Hao Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Wenduo Feng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Senbo Tu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yuqi Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yuanyuan Sun
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
4
Zong H, Wu R, Cha J, Feng W, Wu E, Li J, Shao A, Tao L, Li Z, Tang B, Shen B. Advancing Chinese biomedical text mining with community challenges. J Biomed Inform 2024; 157:104716. [PMID: 39197732 DOI: 10.1016/j.jbi.2024.104716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 08/22/2024] [Accepted: 08/25/2024] [Indexed: 09/01/2024]
Abstract
OBJECTIVE This study aims to review the recent advances in community challenges for biomedical text mining in China. METHODS We collected information on evaluation tasks released in community challenges of biomedical text mining, including task description, dataset description, data source, task type and related links. A systematic summary and comparative analysis were conducted on various biomedical natural language processing tasks, such as named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. RESULTS We identified 39 evaluation tasks from 6 community challenges that spanned from 2017 to 2023. Our analysis revealed the diverse range of evaluation task types and data sources in biomedical text mining. We explored the potential clinical applications of these community challenge tasks from a translational biomedical informatics perspective. We compared these tasks with their English counterparts, and discussed the contributions, limitations, lessons and guidelines of these community challenges, while highlighting future directions in the era of large language models. CONCLUSION Community challenge evaluation competitions have played a crucial role in promoting technology innovation and fostering interdisciplinary collaboration in the field of biomedical text mining. These challenges provide valuable platforms for researchers to develop state-of-the-art solutions.
Affiliation(s)
- Hui Zong
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
- Rongrong Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
- Jiaxue Cha
- Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Weizhe Feng
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
- Erman Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
- Jiakun Li
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China; Department of Urology, West China Hospital, Sichuan University, Chengdu 610041, China
- Aibin Shao
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
- Liang Tao
- Faculty of Business Information, Shanghai Business School, Shanghai 201400, China
- Buzhou Tang
- Department of Computer Science, Harbin Institute of Technology, Shenzhen 518055, China
- Bairong Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China.
5
Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024; 2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]
Abstract
Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
Affiliation(s)
- M Janina Sarol
- Informatics Programs, University of Illinois Urbana-Champaign, 614 E Daniel Street, Champaign, IL 61820, United States
- Gibong Hong
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
- Evan Guerra
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
- Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
6
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Institute of Computer Science, University of Bonn, Bonn, 53115, Germany.
- Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Johannes Brandt
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- Daniel Rueckert
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- School of Computation, Information and Technology, Technical University Munich, Munich, Germany
- Department of Computing, Imperial College London, London, UK
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany.
7
Du J, Soysal E, Wang D, He L, Lin B, Wang J, Manion FJ, Li Y, Wu E, Yao L. Machine learning models for abstract screening task - A systematic literature review application for health economics and outcome research. BMC Med Res Methodol 2024; 24:108. [PMID: 38724903 PMCID: PMC11080200 DOI: 10.1186/s12874-024-02224-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Accepted: 04/18/2024] [Indexed: 05/13/2024] Open
Abstract
OBJECTIVE Systematic literature reviews (SLRs) are critical for life-science research. However, the manual selection and retrieval of relevant publications can be a time-consuming process. This study aims to (1) develop two disease-specific annotated corpora, one for human papillomavirus (HPV) associated diseases and the other for pneumococcal-associated pediatric diseases (PAPD), and (2) optimize machine- and deep-learning models to facilitate automation of the SLR abstract screening. METHODS This study constructed two disease-specific SLR screening corpora for HPV and PAPD, which contained citation metadata and corresponding abstracts. Performance was evaluated using precision, recall, accuracy, and F1-score of multiple combinations of machine- and deep-learning algorithms and features such as keywords and MeSH terms. RESULTS AND CONCLUSIONS The HPV corpus contained 1697 entries, with 538 relevant and 1159 irrelevant articles. The PAPD corpus included 2865 entries, with 711 relevant and 2154 irrelevant articles. Adding features beyond title and abstract improved the performance (measured in accuracy) of machine learning models by 3% for the HPV corpus and 2% for the PAPD corpus. Transformer-based deep learning models consistently outperformed conventional machine learning algorithms, highlighting the strength of domain-specific pre-trained language models for SLR abstract screening. This study provides a foundation for the development of more intelligent SLR systems.
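The four evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The sketch below is illustrative only: the HPV corpus sizes (538 relevant, 1159 irrelevant) come from the abstract, but the split into correct and incorrect predictions is hypothetical, as are the names `Confusion` and `screening_metrics`. It does not reproduce the study's results.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # relevant abstracts correctly included
    fn: int  # relevant abstracts wrongly excluded
    fp: int  # irrelevant abstracts wrongly included
    tn: int  # irrelevant abstracts correctly excluded


def screening_metrics(c: Confusion) -> dict:
    """Return precision, recall, accuracy, and F1 for a binary screener."""
    predicted_pos = c.tp + c.fp
    actual_pos = c.tp + c.fn
    total = c.tp + c.fn + c.fp + c.tn
    precision = c.tp / predicted_pos if predicted_pos else 0.0
    recall = c.tp / actual_pos if actual_pos else 0.0
    accuracy = (c.tp + c.tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Corpus sizes from the abstract: 538 relevant, 1159 irrelevant (HPV).
# The prediction counts below are hypothetical.
hpv = Confusion(tp=430, fn=108, fp=116, tn=1043)
metrics = screening_metrics(hpv)
```

Because roughly two thirds of each corpus is irrelevant, accuracy alone can look favorable for a screener that excludes too much, which is one reason to report recall and F1 alongside it.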
Affiliation(s)
- Ekin Soysal
- Intelligent Medical Objects, Houston, TX, USA
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Long He
- Intelligent Medical Objects, Houston, TX, USA
- Bin Lin
- Intelligent Medical Objects, Houston, TX, USA
- Jingqi Wang
- Intelligent Medical Objects, Houston, TX, USA
- Yeran Li
- Merck & Co., Inc, Rahway, NJ, USA
- Elise Wu
- Merck & Co., Inc, Rahway, NJ, USA
8
Badenes-Olmedo C, Corcho O. Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature. J Biomed Inform 2023; 142:104382. [PMID: 37156393 PMCID: PMC10163941 DOI: 10.1016/j.jbi.2023.104382] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/15/2022] [Revised: 04/14/2023] [Accepted: 05/03/2023] [Indexed: 05/10/2023]
Abstract
The article presents a workflow to create a question-answering system whose knowledge base combines knowledge graphs and scientific publications on coronaviruses. It is based on the experience gained in modeling evidence from research articles to provide answers to questions in natural language. The work contains best practices for acquiring scientific publications, tuning language models to identify and normalize relevant entities, creating representational models based on probabilistic topics, and formalizing an ontology that describes the associations between domain concepts supported by the scientific literature. All the resources generated in the domain of coronavirus are available openly as part of the Drugs4COVID initiative, and can be (re)-used independently or as a whole. They can be exploited by scientific communities conducting research related to SARS-CoV-2/COVID-19 and also by therapeutic communities, laboratories, etc., wishing to find and understand relationships between symptoms, drugs, active ingredients and their documentary evidence.
Affiliation(s)
- Oscar Corcho
- Artificial Intelligence Department, Campus de Montegancedo, s/n., Boadilla del Monte, 28660, Madrid, Spain
9
Systematic Guidelines for Effective Utilization of COVID-19 Databases in Genomic, Epidemiologic, and Clinical Research. Viruses 2023; 15:v15030692. [PMID: 36992400 PMCID: PMC10059256 DOI: 10.3390/v15030692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 02/27/2023] [Accepted: 03/04/2023] [Indexed: 03/09/2023] Open
Abstract
The pandemic has led to the production and accumulation of various types of data related to coronavirus disease 2019 (COVID-19). To understand the features and characteristics of COVID-19 data, we summarized representative databases and determined the data types, purpose, and utilization details of each database. In addition, we categorized COVID-19 associated databases into epidemiological data, genome and protein data, and drug and target data. We found that the data present in each of these databases have nine separate purposes (clade/variant/lineage, genome browser, protein structure, epidemiological data, visualization, data analysis tool, treatment, literature, and immunity) according to the types of data. Utilizing the databases we investigated, we created four queries as integrative analysis methods that aimed to answer important scientific questions related to COVID-19. Our queries can make effective use of multiple databases to produce valuable results that can reveal novel findings through comprehensive analysis. This allows clinical researchers, epidemiologists, and clinicians to have easy access to COVID-19 data without requiring expert knowledge in computing or data science. We expect that users will be able to reference our examples to construct their own integrative analysis methods, which will act as a basis for further scientific inquiry and data searching.
10
Jimeno Yepes AJ, Verspoor K. Classifying literature mentions of biological pathogens as experimentally studied using natural language processing. J Biomed Semantics 2023; 14:1. [PMID: 36721225 PMCID: PMC9889128 DOI: 10.1186/s13326-023-00282-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 01/17/2023] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health. OBJECTIVE In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications. METHODS We developed a pathogen mention characterisation literature data set, READBiomed-Pathogens, automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen. RESULTS We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents. CONCLUSIONS We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set, which shows that the data set we have generated allows characterising pathogens of interest. TRIAL REGISTRATION N/A.
Affiliation(s)
- Antonio Jose Jimeno Yepes
- School of Computing Technologies, RMIT University, Melbourne, Australia.
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
Collapse
|
11
|
Comprehensively identifying Long Covid articles with human-in-the-loop machine learning. PATTERNS (NEW YORK, N.Y.) 2022; 4:100659. [PMID: 36471749 PMCID: PMC9712067 DOI: 10.1016/j.patter.2022.100659] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/29/2022] [Revised: 09/19/2022] [Accepted: 11/17/2022] [Indexed: 12/05/2022]
Abstract
A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute-sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging since there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework combining data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. Analysis of the Long Covid Collection shows that (1) most Long Covid articles do not refer to Long Covid by any name, (2) when the condition is named, the name used most frequently in the literature is Long Covid, and (3) Long Covid is associated with disorders in a wide variety of body systems. The Long Covid Collection is updated weekly and is searchable online at the LitCovid portal: https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovid.
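The combination of data programming and active learning mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' framework: the labeling functions, their keywords, and the majority-vote aggregation are all hypothetical stand-ins (data programming systems typically fit a label model rather than counting votes), and the selection step is a simple least-confidence heuristic.

```python
ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical labeling functions: each votes POS, NEG, or abstains.
def lf_names(text):
    t = text.lower()
    return POS if "long covid" in t or "post-acute sequelae" in t else ABSTAIN

def lf_persistent_symptoms(text):
    t = text.lower()
    return POS if "covid" in t and ("persistent" in t or "ongoing symptoms" in t) else ABSTAIN

def lf_no_covid(text):
    t = text.lower()
    return NEG if "covid" not in t and "sars-cov-2" not in t else ABSTAIN

LABELING_FUNCTIONS = [lf_names, lf_persistent_symptoms, lf_no_covid]

def weak_label(text):
    """Aggregate non-abstaining votes by majority; return (label, confidence)."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return None, 0.0  # no labeling function fired
    pos_share = votes.count(POS) / len(votes)
    label = POS if pos_share >= 0.5 else NEG
    confidence = abs(pos_share - 0.5) * 2  # 0 = split vote, 1 = unanimous
    return label, confidence

def select_for_review(texts, budget):
    """Active-learning step: send the least confident items to a human."""
    return sorted(texts, key=lambda t: weak_label(t)[1])[:budget]

# Invented example titles, for illustration only.
texts = [
    "Persistent fatigue months after COVID-19 infection",
    "Seasonal influenza vaccination uptake",
    "COVID-19 vaccine distribution logistics",
]
to_review = select_for_review(texts, budget=1)
```

In this toy run, the third title triggers no labeling function, so it is the one routed to human review; the reviewed label would then feed back into the next training round.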
|
12
|
Rabby G, Berka P. Multi-class classification of COVID-19 documents using machine learning algorithms. J Intell Inf Syst 2022; 60:571-591. [PMID: 36465147 PMCID: PMC9707112 DOI: 10.1007/s10844-022-00768-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 11/16/2022] [Accepted: 11/17/2022] [Indexed: 11/30/2022]
Abstract
Document classification is a crucial task for most biomedical research paper corpora. In the wake of the global epidemic, it has become even more important for researchers across a variety of fields to identify relevant scientific papers accurately and quickly from a flood of biomedical publications. Classification can also help learners or researchers assign a research paper to an appropriate category and find relevant papers within a very short time. A biomedical document classifier needs to be designed differently from a "general" text classifier because it does not depend only on the text itself (i.e. on titles and abstracts) but can also utilize other information, such as entities extracted using medical taxonomies or bibliometric data. The main objective of this research was to find out which types of information or features, and which representation methods, influence the biomedical document classification task. For this reason, we ran several experiments on conventional text classification methods with different kinds of features extracted from the titles, abstracts, and bibliometric data. These procedures include data cleaning, feature engineering, and multi-class classification. Eleven different variants of input data tables were created and analyzed using ten machine learning algorithms. We also evaluate the data efficiency and interpretability of these models as essential features of any biomedical research paper classification system, specifically for handling the COVID-19 related health crisis. Our major findings are that TF-IDF representations outperform the entity extraction methods and that the abstract itself provides sufficient information for correct classification. Among the machine learning algorithms used, the best performance over various forms of document representation was achieved by Random Forest and Neural Network (BERT). 
Our results lead to a concrete guideline for practitioners on biomedical document classification.
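To make the TF-IDF representation mentioned in this abstract concrete, the following is a minimal sketch using only the Python standard library. The toy documents and labels are invented, the smoothed IDF formula is one common variant, and the nearest-neighbour classifier is a stand-in assumption; the paper itself reports Random Forest and BERT as the best performers.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are already L2-normalised, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def predict(text, train_vecs, train_labels, idf):
    """1-nearest-neighbour by cosine similarity (illustrative classifier only)."""
    q = tfidf(text, idf)
    best = max(range(len(train_vecs)), key=lambda i: cosine(q, train_vecs[i]))
    return train_labels[best]

# Invented toy abstracts and categories, for illustration only.
train_docs = [
    "vaccine trial efficacy antibody response",
    "vaccine dose immune response booster",
    "lockdown policy transmission model forecast",
    "transmission rate epidemic model simulation",
]
train_labels = ["treatment", "treatment", "epidemiology", "epidemiology"]

idf = fit_idf(train_docs)
train_vecs = [tfidf(d, idf) for d in train_docs]
prediction = predict("booster vaccine antibody levels", train_vecs, train_labels, idf)
```

Terms that occur in fewer documents receive a higher IDF weight, so rare, topic-specific words dominate the similarity score.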
Affiliation(s)
- Gollam Rabby
- Department of Information and Knowledge Engineering, Prague University of Economics and Business, Prague, Czech Republic
- Petr Berka
- Department of Information and Knowledge Engineering, Prague University of Economics and Business, Prague, Czech Republic
|
13
|
Gu J, Chersoni E, Wang X, Huang CR, Qian L, Zhou G. LitCovid ensemble learning for COVID-19 multi-label classification. Database (Oxford) 2022; 2022:6846687. [PMID: 36426767 PMCID: PMC9693804 DOI: 10.1093/database/baac103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/04/2022] [Indexed: 11/27/2022]
Abstract
The Coronavirus Disease 2019 (COVID-19) pandemic has shifted the focus of research worldwide, and more than 10 000 new articles per month have concentrated on COVID-19-related topics. Considering this rapidly growing literature, the efficient and precise extraction of the main topics of COVID-19-relevant articles is of great importance. The manual curation of this information for biomedical literature is labor-intensive and time-consuming, and as such the procedure is insufficient and difficult to maintain. In response to these complications, the BioCreative VII community has proposed a challenging task, LitCovid Track, calling for a global effort to automatically extract semantic topics for COVID-19 literature. This article describes our work on the BioCreative VII LitCovid Track. We proposed the LitCovid Ensemble Learning (LCEL) method for the tasks and integrated multiple biomedical pretrained models to address the COVID-19 multi-label classification problem. Specifically, seven different transformer-based pretrained models were ensembled for the initialization and fine-tuning processes independently. To enhance the representation abilities of the deep neural models, diverse additional biomedical knowledge was utilized to facilitate the fruitfulness of the semantic expressions. Simple yet effective data augmentation was also leveraged to address the learning deficiency during the training phase. In addition, given the imbalanced label distribution of the challenging task, a novel asymmetric loss function was applied to the LCEL model, which explicitly adjusted the negative-positive importance by assigning different exponential decay factors and helped the model focus on the positive samples. After the training phase, an ensemble bagging strategy was adopted to merge the outputs from each model for final predictions. The experimental results show the effectiveness of our proposed approach, as LCEL obtains the state-of-the-art performance on the LitCovid dataset. 
Database URL: https://github.com/JHnlp/LCEL.
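The asymmetric loss described in this abstract weights positive and negative labels with different decay exponents. The sketch below follows a commonly used asymmetric focal-style formulation; the exact form, the probability margin, and the default values of `gamma_pos`, `gamma_neg` and `margin` are assumptions, since the abstract does not specify them.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one multi-label example.

    probs   -- predicted probabilities, one per label
    targets -- 0/1 ground-truth indicators, one per label
    All default hyperparameter values are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style weighting with exponent gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by a margin, then
            # decay easy negatives more strongly with exponent gamma_neg.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total

# An easy negative (p = 0.1) contributes far less with a large gamma_neg.
easy_neg_decayed = asymmetric_loss([0.1], [0], gamma_neg=4.0)
easy_neg_plain = asymmetric_loss([0.1], [0], gamma_neg=0.0)
```

Setting both exponents and the margin to zero recovers ordinary binary cross-entropy, which makes the role of the decay factors easy to check.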
Affiliation(s)
- Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
- Xing Wang
- Tencent AI Lab, Shenzhen 518071, China
- Chu-Ren Huang
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
- Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
|
14
|
Xu S, Li L, Wang C, An X, Yang G. An improved author-topic (AT) model with authorship credit allocation schemes. J Inf Sci 2022. [DOI: 10.1177/01655515221133530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Authorship credit allocation schemes have attracted considerable research attention. However, no consensus about which one is the best has been attained until now, and limited evidence from practical tasks has been reported. Therefore, this study uses the author interest discovery task as a real-world task case to provide valuable insights into authorship credit allocation schemes and guidelines for further practical applications. For this purpose, a novel model, ATcredit, is proposed to strengthen the Author-Topic (AT) model with an authorship credit allocation scheme, and collapsed Gibbs sampling is used to approximate the posterior and estimate model parameters. Extensive experiments using the SynBio dataset reveal several interesting findings as follows. (a) Any scheme for allocating unequal authorship credits performs better than its equal-credit counterpart with our ATcredit model in terms of perplexity. (b) The fixed versions of four out of the six schemes work better than their flexible counterparts with our ATcredit model, regardless of the hyper-authorship strategy. (c) The variation coefficient of credit awards can serve as a criterion to decide whether the hyper-authorship strategy should be used. (d) When the number of authors in a scholarly article is less than three, the six authorship credit allocation schemes are similar to each other with our ATcredit model in terms of perplexity. (e) The harmonic counting scheme performs the best, followed by the arithmetic counting scheme, and the network-based counting scheme performs the worst with our ATcredit model in terms of perplexity. (f) The arithmetic counting scheme is similar to the harmonic counting scheme in terms of the normalised mutual information (NMI) of discovered interests, but the geometric counting scheme is different from the axiomatic and network-based counting schemes.
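Several of the counting schemes named in this abstract have widely used textbook definitions, sketched below for an article with `n` authors in byline order. These are the standard formulas, not necessarily the exact variants (fixed or flexible) evaluated in the paper, and the axiomatic and network-based schemes are omitted because the abstract does not define them.

```python
def equal_credit(n):
    """Every author receives the same share."""
    return [1.0 / n] * n

def harmonic_credit(n):
    """Author at rank i receives a share proportional to 1/i."""
    weights = [1.0 / i for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def arithmetic_credit(n):
    """Shares decrease linearly with rank: proportional to n + 1 - i."""
    total = n * (n + 1) / 2
    return [(n + 1 - i) / total for i in range(1, n + 1)]

def geometric_credit(n):
    """Each author receives half the share of the preceding author."""
    total = 2 ** n - 1
    return [2 ** (n - i) / total for i in range(1, n + 1)]

# Shares for a three-author article under each scheme.
shares = {
    "equal": equal_credit(3),
    "harmonic": harmonic_credit(3),
    "arithmetic": arithmetic_credit(3),
    "geometric": geometric_credit(3),
}
```

Under these definitions, the harmonic, arithmetic and geometric schemes all assign shares of 2/3 and 1/3 to a two-author article, which is consistent with finding (d) that the schemes behave similarly when an article has fewer than three authors.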
Affiliation(s)
- Shuo Xu
- College of Economics and Management, Beijing University of Technology, P.R. China
- Ling Li
- College of Economics and Management, Beijing University of Technology, P.R. China
- Congcong Wang
- College of Economics and Management, Beijing University of Technology, P.R. China
- Xin An
- School of Economics and Management, Beijing Forestry University, P.R. China
- Guancan Yang
- School of Information Resource Management, Renmin University of China, P.R. China
|
15
|
Chen Q, Allot A, Leaman R, Wei CH, Aghaarabi E, Guerrerio J, Xu L, Lu Z. LitCovid in 2022: an information resource for the COVID-19 literature. Nucleic Acids Res 2022; 51:D1512-D1518. [PMID: 36350613 PMCID: PMC9825538 DOI: 10.1093/nar/gkac1005] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/15/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022]
Abstract
LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), first launched in February 2020, is a first-of-its-kind literature hub for tracking up-to-date published research on COVID-19. The number of articles in LitCovid has increased from 55 000 to ∼300 000 over the past 2.5 years, with a consistent growth rate of ∼10 000 articles per month. In addition to the rapid literature growth, the COVID-19 pandemic has evolved dramatically. For instance, the Omicron variant has now accounted for over 98% of new infections in the United States. In response to the continuing evolution of the COVID-19 pandemic, this article describes significant updates to LitCovid over the last 2 years. First, we introduced the long Covid collection consisting of the articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations on the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19. LitCovid has been widely used with millions of accesses by users worldwide on various information needs and continues to play a critical role in collecting, curating and standardizing the latest knowledge on the COVID-19 literature.
Affiliation(s)
- Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, USA
- Zhiyong Lu
- To whom correspondence should be addressed. Tel: +1 301 594 7089; Fax: +1 301 480 2290;
|
16
|
Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]
Abstract
The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and in downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
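The two metrics reported in this abstract, micro-F1 and instance-based F1, aggregate multi-label predictions differently. The sketch below shows the usual definitions on an invented two-article example; the topic names other than Treatment and Diagnosis are illustrative, and scoring an article with no true and no predicted labels as 1.0 is an assumed convention.

```python
def micro_f1(y_true, y_pred):
    """Pool true positives, false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def instance_f1(y_true, y_pred):
    """Compute F1 per article, then average over articles."""
    scores = []
    for truth, pred in zip(y_true, y_pred):
        truth, pred = set(truth), set(pred)
        denom = len(truth) + len(pred)
        # Assumed convention: an article with no labels on either side scores 1.0.
        scores.append(2 * len(truth & pred) / denom if denom else 1.0)
    return sum(scores) / len(scores)

# Invented example: the first article is fully correct, the second fully wrong.
y_true = [{"Treatment", "Diagnosis", "Mechanism"}, {"Prevention"}]
y_pred = [{"Treatment", "Diagnosis", "Mechanism"}, {"Diagnosis"}]

micro = micro_f1(y_true, y_pred)
per_instance = instance_f1(y_true, y_pred)
```

Here micro-F1 is 0.75 because three of the four true labels are recovered overall, while instance-based F1 is 0.5 because one of the two articles is scored zero. The gap shows why papers often report both.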
|