1
|
Zhang Y, Li X, Liu Y, Li A, Yang X, Tang X. A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification. JMIR Med Inform 2023; 11:e44892. [PMID: 37796584 PMCID: PMC10587805 DOI: 10.2196/44892] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 03/07/2023] [Accepted: 09/06/2023] [Indexed: 10/06/2023] Open
Abstract
BACKGROUND Given the threat posed by cancer to human health, the volume of data in the cancer field is growing rapidly, and interdisciplinary and collaborative research is becoming increasingly important for fine-grained classification. The low-resolution classifier of reported studies at the journal level fails to satisfy advanced searching demands, and a single label does not adequately characterize literature originating from interdisciplinary research. There is thus a need to establish a multilabel classifier with higher resolution to support literature retrieval for cancer research and reduce the burden of screening papers for clinical relevance.
OBJECTIVE The primary objective of this research was to address the low-resolution issue of cancer literature classification caused by the ambiguity of the existing journal-level classifier, in order to support gaining high-relevance evidence for clinical consideration and comprehensive results for literature retrieval.
METHODS We trained a scalable multilabel classifier for classifying the literature on cancer research directly at the publication level, assigning proper content-derived labels based on the "Bidirectional Encoder Representation from Transformers (BERT) + X" model, and sought the best option for X. First, a corpus of 70,599 cancer publications retrieved from the Dimensions database was divided into a training set and a testing set in a ratio of 7:3. Second, using the classification terminology of International Cancer Research Partnership cancer types, we compared the performance of classifiers developed using BERT and 5 classical deep learning models, such as the text recurrent neural network (TextRNN) and FastText, followed by metrics analysis.
RESULTS After comparing various combined deep learning models, we obtained a classifier based on the optimal combination "BERT + TextRNN," with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. Moreover, we quantified the distinctive characteristics in the text structure and multilabel distribution in order to generalize the model to other fields with similar characteristics.
CONCLUSIONS The "BERT + TextRNN" model was trained for high-resolution classification of cancer literature at the publication level to support accurate retrieval and academic statistics. The model automatically assigns 1 or more labels to each cancer paper, as required. Quantitative comparison verified that the "BERT + TextRNN" model is the best fit for multilabel classification of cancer literature compared to other models. More data from diverse fields will be collected to test the scalability and extensibility of the proposed model in the future.
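The F1-score quoted in the abstract above can be checked against the quoted precision and recall, assuming all three figures are aggregated in the same way (the abstract does not say whether they are micro- or macro-averaged). A minimal Python sketch; the function name `f1_from_precision_recall` is illustrative and does not come from the paper:

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract above, in percent.
f1 = f1_from_precision_recall(93.09, 87.75)
print(f"{f1:.2f}")  # 90.34
```

The harmonic mean reproduces the reported 90.34%, so the three figures are mutually consistent.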
Affiliation(s)
- Ying Zhang
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| | - Xiaoying Li
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| | - Yi Liu
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| | - Aihua Li
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| | - Xuemei Yang
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| | - Xiaoli Tang
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
| |
|
2
|
Jin Y, Lu H, Zhu W, Huo W. Deep learning based classification of multi-label chest X-ray images via dual-weighted metric loss. Comput Biol Med 2023; 157:106683. [PMID: 36905869 DOI: 10.1016/j.compbiomed.2023.106683] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/02/2022] [Revised: 10/17/2022] [Accepted: 11/06/2022] [Indexed: 02/17/2023]
Abstract
Thoracic disease, like many other diseases, can lead to complications. Existing multi-label medical image learning problems typically include rich pathological information, such as images, attributes, and labels, which are crucial for supplementary clinical diagnosis. However, the majority of contemporary efforts exclusively focus on regression from input to binary labels, ignoring the relationship between visual features and semantic vectors of labels. In addition, there is an imbalance in data amount between diseases, which frequently causes intelligent diagnostic systems to make erroneous disease predictions. Therefore, we aim to improve the accuracy of the multi-label classification of chest X-ray images. Chest X-ray14 images were used as the multi-label dataset for the experiments in this study. By fine-tuning the ConvNeXt network, we obtained visual vectors, which we combined with semantic vectors encoded by BioBERT to map the two different forms of features into a common metric space, and we made the semantic vectors the prototype of each class in that metric space. The metric relationship between images and labels is then considered at the image level and the disease category level, respectively, and a new dual-weighted metric loss function is proposed. Finally, the average AUC score achieved in the experiment reached 0.826, and our model outperformed the comparison models.
Affiliation(s)
- Yufei Jin
- College of Information Engineering, China Jiliang University, Hangzhou, China.
| | - Huijuan Lu
- College of Information Engineering, China Jiliang University, Hangzhou, China.
| | - Wenjie Zhu
- College of Information Engineering, China Jiliang University, Hangzhou, China.
| | - Wanli Huo
- College of Information Engineering, China Jiliang University, Hangzhou, China.
| |
|
3
|
Mao C, Zhu Q, Chen R, Su W. Automatic medical specialty classification based on patients' description of their symptoms. BMC Med Inform Decis Mak 2023; 23:15. [PMID: 36670382 PMCID: PMC9862953 DOI: 10.1186/s12911-023-02105-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 01/09/2023] [Indexed: 01/22/2023] Open
Abstract
In China, patients usually choose a medical specialty before they register with the corresponding specialists in hospitals. This process usually requires a lot of medical knowledge from the patients. As a result, many patients do not register with the correct specialty the first time if they do not receive help from the hospital. In this study, we try to automatically direct patients to the appropriate specialty based on the symptoms they describe. As far as we know, this is the first study to solve the problem. We propose a neural network-based hybrid model integrated with an attention mechanism. To demonstrate the actual effect of this hybrid model, we used a data set of more than 40,000 items covering eight departments, such as Otorhinolaryngology, Pediatrics, and other common departments. The experimental results show that the hybrid model achieves more than 93.5% accuracy and has a high generalization capacity, which is superior to traditional classification models.
Affiliation(s)
- Chao Mao
- Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science, BNU-HKBU United International College, Zhuhai, 519087 China (grid.469245.8, 0000 0004 1756 4881)
| | - Quanjing Zhu
- Specialty of Laboratory Medicine, West China Hospital, Sichuan University, Guoxue Lane, Wuhou District, Chengdu, 610041 China (grid.13291.38, 0000 0001 0807 1581)
| | - Rong Chen
- Specialty of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080 China (grid.412615.5, 0000 0004 1803 6239)
| | - Weifeng Su
- Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science, BNU-HKBU United International College, Zhuhai, 519087 China (grid.469245.8, 0000 0004 1756 4881)
| |
|
4
|
Jin Y, Xiong Y, Shi D, Lin Y, He L, Zhang Y, Plasek JM, Zhou L, Bates DW, Tang C. Learning from undercoded clinical records for automated International Classification of Diseases (ICD) coding. J Am Med Inform Assoc 2022; 30:438-446. [PMID: 36478240 PMCID: PMC9933053 DOI: 10.1093/jamia/ocac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 11/08/2022] [Accepted: 11/16/2022] [Indexed: 12/12/2022] Open
Abstract
OBJECTIVES To develop an unbiased objective for learning automatic coding algorithms from clinical records annotated with only some of the relevant International Classification of Diseases codes, as annotation noise in undercoded clinical records used as training data can mislead the learning process of deep neural networks.
MATERIALS AND METHODS We use Medical Information Mart for Intensive Care III as our dataset. We employ positive-unlabeled learning to achieve unbiased loss estimation, which is free of misleading training signals. We then use a reweighting mechanism to compensate for the imbalance between positive and negative samples. To further close the performance gap caused by poor-quality annotation, we integrate the supervision provided by the automatic annotation tool Medical Concept Annotation Toolkit, which can ease the heavy burden of manual validation.
RESULTS Our benchmarking results show that positive-unlabeled learning with reweighting outperforms competitive baseline methods over a range of missing label ratios. Integrating supervision provided by the annotation tool further boosted the performance.
DISCUSSION Considering the annotation noise and severe imbalance, unbiased loss estimation and the reweighting mechanism are both important for learning from undercoded clinical records. Unbiased loss requires the estimation of false negative ratios, and estimation through trained models is practical and competitive.
CONCLUSIONS The combination of positive-unlabeled learning with reweighting and supervision provided by the annotation tool is a promising solution for learning from undercoded clinical records.
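To make the idea of unbiased loss estimation in positive-unlabeled learning concrete, the sketch below implements the standard unbiased risk estimator from the general PU learning literature for a single label. It is an illustration under stated assumptions, not the authors' objective: the abstract does not give their loss, their reweighting scheme, or how the false negative ratio is estimated. The names `logistic_loss`, `unbiased_pu_risk`, and the parameter `prior` (the assumed fraction of true positives hidden among unlabeled examples) are hypothetical.

```python
import math
from statistics import mean


def logistic_loss(score: float, label: int) -> float:
    """Logistic loss for a real-valued score and a label in {+1, -1}."""
    return math.log1p(math.exp(-label * score))


def unbiased_pu_risk(pos_scores, unl_scores, prior: float) -> float:
    """Standard unbiased PU risk estimate (illustrative, single label).

    pos_scores: model scores for examples annotated as positive.
    unl_scores: model scores for unlabeled examples (positives and negatives mixed).
    prior:      assumed share of true positives among the unlabeled examples.
    """
    risk_pos_as_pos = mean(logistic_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(logistic_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(logistic_loss(s, -1) for s in unl_scores)
    # Treating all unlabeled data as negative over-counts the hidden positives;
    # the last term subtracts their expected contribution.
    return prior * risk_pos_as_pos + risk_unl_as_neg - prior * risk_pos_as_neg


# Toy scores, chosen only for illustration.
risk = unbiased_pu_risk([2.0, 1.5], [-1.0, 0.5, -2.0], prior=0.2)
```

With `prior=0` the estimate reduces to the naive loss that treats every unlabeled example as a negative, which is the biased baseline this family of methods corrects.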
Affiliation(s)
| | | | | | | | - Lifang He
- Department of Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
| | - Yao Zhang
- Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
| | - Joseph M Plasek
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
| | - David W Bates
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
| | - Chunlei Tang
- Corresponding Author: Chunlei Tang, PhD, Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital, 1620 Tremont Street, BS-3, Boston, MA 02120, USA
| |
|
5
|
Aman A, Reji DJ. Environmental due diligence data: A novel corpus for training environmental domain NLP models. Data Brief 2022; 45:108579. [PMID: 36148216 PMCID: PMC9486029 DOI: 10.1016/j.dib.2022.108579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 08/20/2022] [Accepted: 09/02/2022] [Indexed: 11/24/2022] Open
Abstract
This article takes a step in the direction of adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). Our approach was to enrich the vocabulary of deep learning models with more data from the environmental domain, collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to resolve the imbalanced classes and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. This model was packaged to predict EDD classes and determine relevancy and ranking, and it allows users to fine-tune the model on more EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) repository [2]. We anticipate that the rich EDD dataset that we used to train the model and create the package will help users contribute to a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4.
|
6
|
Gu J, Chersoni E, Wang X, Huang CR, Qian L, Zhou G. LitCovid ensemble learning for COVID-19 multi-label classification. Database (Oxford) 2022; 2022:6846687. [PMID: 36426767 PMCID: PMC9693804 DOI: 10.1093/database/baac103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/04/2022] [Indexed: 11/27/2022]
Abstract
The Coronavirus Disease 2019 (COVID-19) pandemic has shifted the focus of research worldwide, and more than 10 000 new articles per month have concentrated on COVID-19-related topics. Considering this rapidly growing literature, the efficient and precise extraction of the main topics of COVID-19-relevant articles is of great importance. The manual curation of this information for biomedical literature is labor-intensive and time-consuming, and as such the procedure is insufficient and difficult to maintain. In response to these complications, the BioCreative VII community has proposed a challenging task, LitCovid Track, calling for a global effort to automatically extract semantic topics for COVID-19 literature. This article describes our work on the BioCreative VII LitCovid Track. We proposed the LitCovid Ensemble Learning (LCEL) method for the tasks and integrated multiple biomedical pretrained models to address the COVID-19 multi-label classification problem. Specifically, seven different transformer-based pretrained models were ensembled for the initialization and fine-tuning processes independently. To enhance the representation abilities of the deep neural models, diverse additional biomedical knowledge was utilized to facilitate the fruitfulness of the semantic expressions. Simple yet effective data augmentation was also leveraged to address the learning deficiency during the training phase. In addition, given the imbalanced label distribution of the challenging task, a novel asymmetric loss function was applied to the LCEL model, which explicitly adjusted the negative-positive importance by assigning different exponential decay factors and helped the model focus on the positive samples. After the training phase, an ensemble bagging strategy was adopted to merge the outputs from each model for final predictions. The experimental results show the effectiveness of our proposed approach, as LCEL obtains the state-of-the-art performance on the LitCovid dataset. 
Database URL: https://github.com/JHnlp/LCEL.
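The asymmetric loss mentioned in the abstract above can be illustrated with a focal-style binary loss that uses separate exponents for positive and negative labels. This is a generic sketch of the idea, not the LCEL implementation: the function name `asymmetric_loss` and the default values `gamma_pos=0.0` and `gamma_neg=4.0` are assumptions chosen for illustration, and the abstract does not state the authors' settings.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Mean asymmetric focal-style loss over the labels of one sample.

    probs:   predicted probabilities, one per label.
    targets: 0/1 ground-truth indicators, one per label.
    A larger gamma_neg decays the loss of easy negatives faster, so the
    (rarer) positive labels contribute relatively more to the total.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total / len(probs)


# One sample with three labels; only the first label is present.
loss = asymmetric_loss([0.7, 0.1, 0.2], [1, 0, 0])
```

Setting both exponents to 0 recovers ordinary binary cross-entropy, which makes the effect of the asymmetry easy to compare on the same inputs.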
Affiliation(s)
| | - Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
| | - Xing Wang
- Tencent AI Lab, Shenzhen 518071, China
| | - Chu-Ren Huang
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| |
|
7
|
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022; 17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open
Abstract
This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. 
Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| |
|
8
|
Multi-label sequence generating model via label semantic attention mechanism. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01722-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
9
|
Han M, Wu H, Chen Z, Li M, Zhang X. A survey of multi-label classification based on supervised and semi-supervised learning. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01658-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
10
|
Abbasi A, Javed AR, Iqbal F, Kryvinska N, Jalil Z. Deep learning for religious and continent-based toxic content detection and classification. Sci Rep 2022; 12:17478. [PMID: 36261675 PMCID: PMC9581992 DOI: 10.1038/s41598-022-22523-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2022] [Accepted: 10/17/2022] [Indexed: 01/12/2023] Open
Abstract
Over time, numerous online communication platforms have emerged that allow people to express themselves, increasing the dissemination of toxic language, such as racism, sexual harassment, and other negative behaviors that are not accepted in polite society. As a result, toxic language identification in online communication has emerged as a critical application of natural language processing. Numerous academic and industrial researchers have recently studied toxic language identification using machine learning algorithms. However, nontoxic comments that include particular identity descriptors, such as Muslim, Jewish, White, and Black, were assigned unrealistically high toxicity ratings by several machine learning models. This research analyzes and compares modern deep learning algorithms for multilabel toxic comment classification. We explore two scenarios: the first is a multilabel classification of religious toxic comments, and the second is a multilabel classification of toxic race or ethnicity comments, both with various word embeddings (GloVe, Word2vec, and FastText) and without pretrained word embeddings, using an ordinary embedding layer. Experiments show that the CNN model produced the best results for classifying multilabel toxic comments in both scenarios. We compared the performance of these modern deep learning models in terms of multilabel evaluation metrics.
Affiliation(s)
- Ahmed Abbasi
- Department of Creative Technologies, PAF Complex, E-9, Air University, Islamabad, Pakistan (grid.444783.8, 0000 0004 0607 2515)
| | - Abdul Rehman Javed
- Department of Cyber Security, PAF Complex, E-9, Air University, Islamabad, Pakistan (grid.444783.8, 0000 0004 0607 2515)
- Department of Electrical and Computer Engineering, Lebanese American University, Byblos, Lebanon (grid.411323.6, 0000 0001 2324 5973)
| | - Farkhund Iqbal
- College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates (grid.444464.2, 0000 0001 0650 0848)
| | - Natalia Kryvinska
- Information Systems Department, Faculty of Management, Comenius University in Bratislava, Odbojárov 10, 82005 Bratislava, 25, Slovakia (grid.7634.6, 0000 0001 0940 9708)
| | - Zunera Jalil
- Department of Creative Technologies, PAF Complex, E-9, Air University, Islamabad, Pakistan (grid.444783.8, 0000 0004 0607 2515)
| |
|
11
|
Su R, Yang H, Wei L, Chen S, Zou Q. A multi-label learning model for predicting drug-induced pathology in multi-organ based on toxicogenomics data. PLoS Comput Biol 2022; 18:e1010402. [PMID: 36070305 PMCID: PMC9451100 DOI: 10.1371/journal.pcbi.1010402] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/14/2022] [Accepted: 07/18/2022] [Indexed: 11/18/2022] Open
Abstract
Drug-induced toxicity damages health and is one of the key factors causing drug withdrawal from the market. It is of great significance to identify drug-induced target-organ toxicity, especially the detailed pathological findings, which are crucial for toxicity assessment, in the early stage of the drug development process. A large variety of studies have been devoted to identifying drug toxicity. However, most of them are limited to a single organ or to binary toxicity only. Here we propose a novel multi-label learning model named Att-RethinkNet for predicting drug-induced pathological findings targeting the liver and kidney based on toxicogenomics data. Att-RethinkNet is equipped with a memory structure and can effectively use the label association information. In addition, an attention mechanism is embedded to focus on the important features and obtain better feature presentation. Att-RethinkNet is applicable to multiple organs and takes into account the compound type, dose, and administration time, so it is more comprehensive and generalized. More importantly, it predicts multiple pathological findings at the same time, instead of predicting each pathology separately as the previous model did. To demonstrate the effectiveness of the proposed model, we compared it with a series of state-of-the-art methods. Our model shows competitive performance and can predict potential hepatotoxicity and nephrotoxicity in a more accurate and reliable way. The implementation of the proposed method is available at https://github.com/RanSuLab/Drug-Toxicity-Prediction-MultiLabel.
Affiliation(s)
- Ran Su
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Haitang Yang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, Shandong, China
- * E-mail: (LW); (SC); (QZ)
| | - Siqi Chen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- * E-mail: (LW); (SC); (QZ)
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- * E-mail: (LW); (SC); (QZ)
| |
|
12
|
Hu Y, Donald C, Giacaman N. Can Multi-Label Classifiers Help Identify Subjectivity? A Deep Learning Approach to Classifying Cognitive Presence in MOOCs. Int J Artif Intell Educ 2022; 33:1-36. [PMID: 36090962 PMCID: PMC9439267 DOI: 10.1007/s40593-022-00310-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/18/2022] [Indexed: 11/21/2022]
Abstract
This paper investigates using a multi-label deep learning approach to extend the understanding of cognitive presence in MOOC discussions. Previous studies demonstrate the challenges of subjectivity in manual categorisation methods. Training automatic single-label classifiers may preserve this subjectivity. Using a triangulation approach, we developed a multi-label, fine-tuned BERT classifier to analyse cognitive presence and enrich the results of state-of-the-art, single-label classifiers. We trained the multi-label classifiers on the MOOC discussion messages that were categorised into the same phase of cognitive presence by the expert coders, and tested the best-performing classifiers on the messages that the coders categorised into different phases. The results suggest that multi-label classifiers slightly outperformed the single-label classifiers, and the multi-label classifiers predicted the discussion messages as either one category or two adjacent categories of cognitive presence. No messages were tagged as non-adjacent categories by the multi-label classifier. This is an improvement compared to manual categorisation by our expert coders, who obtained non-adjacent categories and even three categories of cognitive presence in one message. In addition to the fully correct predictions, some messages were partially correctly predicted by the multi-label classifier. We report an in-depth quantitative and qualitative analysis of these messages in the paper. The automatic categorisation results suggest that the multi-label classifiers have the potential to help educators and researchers identify research subjectivity and tolerate the multiplicity in cognitive presence categorisation. This study contributes to extending the literature on understanding cognitive presence in MOOC discussions.
Affiliation(s)
- Yuanyuan Hu
- Faculty of Engineering, The University of Auckland, Auckland, New Zealand
| | - Claire Donald
- Faculty of Engineering, The University of Auckland, Auckland, New Zealand
| | - Nasser Giacaman
- Faculty of Engineering, The University of Auckland, Auckland, New Zealand
| |
|
13
|
Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]
Abstract
The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and in downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LitMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LitMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
|
14
|
Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, VG S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022; 2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. Related findings, such as vaccine and drug development, have been reported in the biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro-F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development.
The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
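The three scores named in this abstract (macro, micro and instance-based F1) average the same true-positive, false-positive and false-negative counts in different ways. The following is a minimal sketch of the standard definitions, not the track's official evaluation script; the function names, the toy gold/predicted topic sets and the convention of scoring an empty-versus-empty comparison as 0.0 are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    # Per-label [tp, fp, fn] counts, accumulated over all documents.
    counts = {lab: [0, 0, 0] for lab in labels}
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in g and lab in p:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1

    # Macro: average the per-label F1 scores, so rare labels weigh as much as common ones.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)

    # Micro: pool the counts over all labels first, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Instance-based: compute F1 per document, then average over documents.
    instance = sum(f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)) / len(gold)

    return {"macro": macro, "micro": micro, "instance": instance}


# Toy data (hypothetical documents; topic names borrowed from the abstract's examples).
labels = ["Treatment", "Diagnosis", "Prevention"]
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Diagnosis"}]
pred = [{"Treatment"}, {"Prevention"}, {"Diagnosis", "Treatment"}]
scores = multilabel_f1(gold, pred, labels)
print(scores)
```

On this toy input the micro score pools 3 true positives, 1 false positive and 1 false negative, while the macro and instance-based scores average three per-label and three per-document F1 values, respectively.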
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Rezarta Islamaj
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Jingcheng Du
- School of Biomedical Informatics, UT Health, Houston, TX 77030, USA
- Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Shuo Xu
- College of Economics and Management, Beijing University of Technology, Beijing, QC, China
- Yuefu Zhang
- College of Economics and Management, Beijing University of Technology, Beijing, QC, China
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Sheng-Jie Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Wentai Tang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Hongtong Zhang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Ilija Tavchioski
- Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Jožef Stefan Institute, Ljubljana, Slovenia
- Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Yulia Otmakhova
- School of Computing and Information Systems, University of Melbourne, Melbourne, AU-VIC, Australia
- Hang Dong
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK
- Niladri Chatterjee
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Kushagri Tandon
- Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Emmanuele Chersoni
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Jinghang Gu
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Subhash Chandra Pujari
- Institute of Computer Science, Heidelberg University, Heidelberg, Germany
- Bosch Center for Artificial Intelligence, Renningen, Germany
- Mariia Chizhikova
- SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
15
Lin Y, Chi Y, Han H, Han M, Guo Y. Multimodal Orthodontic Corpus Construction Based on Semantic Tag Classification Method. Neural Process Lett 2022. [DOI: 10.1007/s11063-021-10558-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
16
Lin SJ, Yeh WC, Chiu YW, Chang YC, Hsu MH, Chen YS, Hsu WL. A BERT-based ensemble learning approach for the BioCreative VII challenges: full-text chemical identification and multi-label classification in PubMed articles. Database (Oxford) 2022; 2022:6645124. [PMID: 35849027 PMCID: PMC9290865 DOI: 10.1093/database/baac056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 06/20/2022] [Accepted: 07/02/2022] [Indexed: 11/25/2022]
Abstract
In this research, we explored various state-of-the-art biomedical-specific pre-trained Bidirectional Encoder Representations from Transformers (BERT) models for the National Library of Medicine - Chemistry (NLM-CHEM) and LitCovid tracks in the BioCreative VII Challenge, and proposed a BERT-based ensemble learning approach that integrates the advantages of various models to improve the system’s performance. The experimental results of the NLM-CHEM track demonstrate that our method can achieve remarkable performance, with F1-scores of 85% and 91.8% in strict and approximate evaluations, respectively. Moreover, the proposed Medical Subject Headings identifier (MeSH ID) normalization algorithm is effective in entity normalization, achieving an F1-score of about 80% in both strict and approximate evaluations. For the LitCovid track, the proposed method is also effective in detecting topics in the coronavirus disease 2019 (COVID-19) literature; it outperformed the compared methods and achieved state-of-the-art performance on the LitCovid corpus. Database URL: https://www.ncbi.nlm.nih.gov/research/coronavirus/.
Affiliation(s)
- Sheng-Jie Lin
- Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Wen-Chao Yeh
- Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Yu-Wen Chiu
- Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Clinical Big Data Research Center, Taipei Medical University Hospital, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
- Min-Huei Hsu
- Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yi-Shin Chen
- Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
- Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
- Department of Computer Science and Information Engineering, Asia University, No. 500, Liufeng Rd, Wufeng District, Taichung City 413, Taiwan
17
Sovrano F, Palmirani M, Vitali F. Combining shallow and deep learning approaches against data scarcity in legal domains. GOVERNMENT INFORMATION QUARTERLY 2022. [DOI: 10.1016/j.giq.2022.101715] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/04/2022]
18
Fang L, Wang K. Multi-label topic classification for COVID-19 literature with Bioformer. ARXIV 2022:arXiv:2204.06758v1. [PMID: 35441084 PMCID: PMC9016643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
We describe the Bioformer team's participation in the multi-label topic classification task for COVID-19 literature (track 5 of BioCreative VII). Topic classification is performed using different BERT models (BioBERT, PubMedBERT, and Bioformer). We formulate the topic classification task as a sentence pair classification problem, where the title is the first sentence and the abstract is the second sentence. Our results show that Bioformer outperforms BioBERT and PubMedBERT in this task. Compared to the baseline results, our best model increased the micro, macro, and instance-based F1 scores by 8.8%, 15.5%, and 7.4%, respectively. Bioformer achieved the highest micro F1 and macro F1 scores in this challenge. In post-challenge experiments, we found that pretraining Bioformer on COVID-19 articles further improves the performance.
Affiliation(s)
- Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
19
Janjua ZH, Kerins D, O'Flynn B, Tedesco S. Knowledge-driven feature engineering to detect multiple symptoms using ambulatory blood pressure monitoring data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 217:106638. [PMID: 35220199 DOI: 10.1016/j.cmpb.2022.106638] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/04/2021] [Revised: 11/14/2021] [Accepted: 01/14/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND Hypertension is a major health concern across the globe and needs to be properly diagnosed so that it can be treated and its effects mitigated. In this context, ambulatory blood pressure monitoring is essential for a proper diagnosis of hypertension, which may not be possible otherwise due to the white coat effect or masked hypertension. In this paper, the objective is to develop a model that incorporates experts' knowledge in the feature engineering process so as to accurately predict multiple medical conditions. As a case study, we have considered multiple symptoms related to hypertension and used an ambulatory blood pressure monitoring method to continuously acquire hypertension-relevant data from a patient. The goal is to train a model with a minimum set of the most effective knowledge-driven features, which are useful to detect multiple symptoms simultaneously using multi-class classification techniques. METHOD Artificial intelligence-based blood pressure monitoring techniques introduce a new dimension in the diagnosis of hypertension by enabling a continuous (24 hours) analysis of systolic and diastolic blood pressure levels. In this work, we present a model that entails a knowledge-driven feature engineering method and implement an ambulatory blood pressure monitoring system to diagnose multiple cardiac parameters and associated conditions simultaneously; these include morning surge, circadian rhythm, and pulse pressure. The knowledge-driven features are extracted to improve the interpretability of the classification model, and machine learning techniques (Random Forest, Naive Bayes, and KNN) were applied in a multi-label classification setup using RAkEL to classify multiple conditions simultaneously. RESULTS The results obtained (F1 = 0.918) show that the Random Forest technique performed well for multilabel classification using knowledge-driven features. Our technique has also reduced the complexity of the model by reducing the number of features required to train a machine learning model. CONCLUSION Considering these results, we conclude that knowledge-driven feature engineering enhances the learning process by reducing the number of features given as input to the machine learning algorithm. The proposed feature engineering method considers experts' knowledge to develop better diagnosis models that are free from the misleading, noisy data-driven features that arise in some situations. It is a white-box approach in which clinicians can understand the importance of a feature while looking at its value.
20
Lentzas A, Dalagdi E, Vrakas D. Multilabel Classification Methods for Human Activity Recognition: A Comparison of Algorithms. SENSORS 2022; 22:s22062353. [PMID: 35336522 PMCID: PMC8955852 DOI: 10.3390/s22062353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/14/2022] [Revised: 03/14/2022] [Accepted: 03/16/2022] [Indexed: 12/10/2022]
Abstract
As the world’s population is aging, and since access to ambient sensors has become easier over the past years, activity recognition in smart home installations has gained increased scientific interest. The majority of published papers in the literature focus on single-resident activity recognition. While this is an important area, especially when focusing on elderly people living alone, multi-resident activity recognition has potentially more applications in smart homes. Activity recognition for multiple residents acting concurrently can be treated as a multilabel classification problem (MLC). In this study, an experimental comparison between different MLC algorithms is attempted. Three different techniques were implemented: RAkELd, classifier chains, and binary relevance. These methods are evaluated using the ARAS and CASAS public datasets. Results obtained from experiments have shown that using MLC can recognize activities performed by multiple people with high accuracy. While RAkELd had the best performance, the rest of the methods had on-par results.
21
Aduragba OT, Yu J, Cristea AI, Shi L. Detecting Fine-Grained Emotions on Social Media during Major Disease Outbreaks: Health and Well-being before and during the COVID-19 Pandemic. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:187-196. [PMID: 35308991 PMCID: PMC8861702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The COVID-19 pandemic has affected the whole world in various ways. One type of impact is that communication, work, interaction, a great part of our lives has moved online on various platforms, with some of the most popular being the social media ones. Another, arguably less visible impact, is the emotional impact. Detecting and understanding emotions is important, to better discern the emotional health and well-being of the global population. Thus, in this work, we use a social media platform (Twitter) to analyse emotions in detail. Our contribution is twofold: (1) we propose EmoBERT, a new emotion-based variant of the BERT transformer model, able to learn emotion representations and outperform the state-of-the-art; (2) we provide a fine-grained analysis of the pandemic's effect in a major location, London, comparing specific emotions (annoyed, anxious, empathetic, sad) before and during the epidemic.
Affiliation(s)
- Jialin Yu
- Department of Computer Science, Durham University, Durham, United Kingdom
- Lei Shi
- Department of Computer Science, Durham University, Durham, United Kingdom
22
Ensemble of classifier chains and decision templates for multi-label classification. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-021-01647-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
23
Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:6663884. [PMID: 34306597 PMCID: PMC8285182 DOI: 10.1155/2021/6663884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/14/2020] [Revised: 05/29/2021] [Accepted: 06/29/2021] [Indexed: 11/17/2022]
Abstract
Methods We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for the prediction of recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter out the unwanted texts from X-ray radiology reports. Thereafter, text representation methods were used to numerically represent unstructured radiology reports with vectors. Subsequently, these text representation methods were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM). Results We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, misclassification rate of 14.3%, and F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively. 
Conclusions Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors.
24
Stemerman R, Arguello J, Brice J, Krishnamurthy A, Houston M, Kitzmiller R. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. JAMIA Open 2021; 4:ooaa069. [PMID: 34514351 PMCID: PMC8423426 DOI: 10.1093/jamiaopen/ooaa069] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 09/16/2020] [Revised: 11/16/2020] [Accepted: 11/20/2020] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVES Social determinants of health (SDH), key contributors to health, are rarely systematically measured and collected in the electronic health record (EHR). We investigate how to leverage clinical notes using novel applications of multi-label learning (MLL) to classify SDH in mental health and substance use disorder patients who frequent the emergency department. METHODS AND MATERIALS We labeled a gold-standard corpus of EHR clinical note sentences (N = 4063) with 6 identified SDH-related domains recommended by the Institute of Medicine for inclusion in the EHR. We then trained 5 classification models: linear Support Vector Machine, K-Nearest Neighbors, Random Forest, XGBoost, and bidirectional Long Short-Term Memory (BI-LSTM). We adopted 5 common evaluation measures (accuracy, average precision-recall (AP), area under the receiver operating characteristic curve (AUC-ROC), Hamming loss, and log loss) to compare the performance of different methods for MLL classification, using the F1 score as the primary evaluation metric. RESULTS Our results suggested that, overall, BI-LSTM outperformed the other classification models in terms of AUC-ROC (93.9), AP (0.76), and Hamming loss (0.12). The AUC-ROC values of MLL models of SDH-related domains varied between 0.59 and 1.0. We found that 44.6% of our study population (N = 1119) had at least one positive documentation of SDH. DISCUSSION AND CONCLUSION The proposed approach of training an MLL model on an SDH-rich data source can produce a high-performing classifier using only unstructured clinical notes. We also provide evidence that model performance is associated with lexical diversity by health professionals and the auto-generation of clinical note sentences to document SDH.
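Hamming loss, one of the measures listed in this abstract, is the fraction of label slots (sentence and domain pairs) on which the prediction disagrees with the gold annotation; lower is better. The sketch below illustrates the standard definition only and is not the study's code; the domain names `d1` to `d6` and the two example sentences are hypothetical placeholders.

```python
from typing import List, Set


def hamming_loss(gold: List[Set[str]], pred: List[Set[str]], n_labels: int) -> float:
    """Fraction of (instance, label) slots where prediction and gold disagree."""
    # The symmetric difference g ^ p holds exactly the labels that are wrong for
    # an instance: predicted but not gold, or gold but not predicted.
    wrong = sum(len(g ^ p) for g, p in zip(gold, pred))
    return wrong / (len(gold) * n_labels)


# Hypothetical example with 6 label domains, mirroring the 6 SDH domains in the abstract.
domains = ["d1", "d2", "d3", "d4", "d5", "d6"]
gold = [{"d1", "d2"}, {"d3"}]
pred = [{"d1"}, {"d3", "d4"}]
loss = hamming_loss(gold, pred, len(domains))
print(loss)
```

Here 2 of the 12 label slots are wrong (one missed label, one spurious label), so the loss is 2/12.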
Affiliation(s)
- Rachel Stemerman
- Carolina Health Informatics Program, The University of North Carolina, Chapel Hill, North Carolina, USA
- Jaime Arguello
- School of Information and Library Sciences, The University of North Carolina, Chapel Hill, North Carolina, USA
- Jane Brice
- Department of Emergency Medicine, The University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Ashok Krishnamurthy
- Department of Computer Science, The University of North Carolina, Chapel Hill, North Carolina, USA
- Mary Houston
- Department of Emergency Medicine, The University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Rebecca Kitzmiller
- School of Nursing, The University of North Carolina, Chapel Hill, North Carolina, USA
25
Dai X, Xu F, Wang S, Mundra PA, Zheng J. PIKE-R2P: Protein-protein interaction network-based knowledge embedding with graph neural network for single-cell RNA to protein prediction. BMC Bioinformatics 2021; 22:139. [PMID: 34078261 PMCID: PMC8170782 DOI: 10.1186/s12859-021-04022-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Accepted: 02/11/2021] [Indexed: 12/05/2022] Open
Abstract
Background Recent advances in the simultaneous measurement of RNA and protein abundances at the single-cell level provide a unique opportunity to predict protein abundance from scRNA-seq data using machine learning models. However, existing machine learning methods have not sufficiently considered the relationships among the proteins. Results We formulate this task in a multi-label prediction framework where multiple proteins are linked to each other at the single-cell level. Then, we propose a novel method for single-cell RNA to protein prediction named PIKE-R2P, which incorporates protein–protein interactions (PPI) and prior knowledge embedding into a graph neural network. Compared with existing methods, PIKE-R2P could significantly improve prediction performance in terms of smaller errors and higher correlations with the gold standard measurements. Conclusion The superior performance of PIKE-R2P indicates that adding the prior knowledge of PPI to graph neural networks can be a powerful strategy for cross-modality prediction of protein abundances at the single-cell level.
Affiliation(s)
- Xinnan Dai
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
- Fan Xu
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
- Shike Wang
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
- Piyushkumar A Mundra
- Molecular Oncology Group, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park, Manchester, UK
- Jie Zheng
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
26
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which puts biomedical literature mining (BLM) techniques in high demand. Numerous efforts have been made on this topic by both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability and thus develops more sophisticated and universal models. The goal of this paper is to provide a review of the recent advances in BLM from both communities and to inspire new research directions.
Affiliation(s)
- Sendong Zhao
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
- Chang Su
- Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institute of Health, Bethesda, MD, USA
- Fei Wang
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
27
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021; 16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022] Open
Abstract
Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for multiple reasons: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. In addition, there are other significant gaps in the literature on biomedical sentence similarity: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date, and, for the first time, reproducible experimental survey on biomedical sentence similarity.
Our aforementioned experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
- Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
- Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
28
Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks. NPJ Digit Med 2021; 4:37. [PMID: 33637859 PMCID: PMC7910461 DOI: 10.1038/s41746-021-00404-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/16/2020] [Accepted: 01/26/2021] [Indexed: 12/02/2022] Open
Abstract
Standard reference terminology of diagnoses and risk factors is crucial for billing, epidemiological studies, and inter/intranational comparisons of diseases. The International Classification of Disease (ICD) is a standardized and widely used method, but the manual classification is an enormously time-consuming endeavor. Natural language processing together with machine learning allows automated structuring of diagnoses using ICD-10 codes, but the limited performance of machine learning models, the necessity of gigantic datasets, and poor reliability of terminal parts of these codes restricted clinical usability. We aimed to create a high performing pipeline for automated classification of reliable ICD-10 codes in the free medical text in cardiology. We focussed on frequently used and well-defined three- and four-digit ICD-10 codes that still have enough granularity to be clinically relevant such as atrial fibrillation (I48), acute myocardial infarction (I21), or dilated cardiomyopathy (I42.0). Our pipeline uses a deep neural network known as a Bidirectional Gated Recurrent Unit Neural Network and was trained and tested with 5548 discharge letters and validated in 5089 discharge and procedural letters. As in clinical practice discharge letters may be labeled with more than one code, we assessed the single- and multilabel performance of main diagnoses and cardiovascular risk factors. We investigated using both the entire body of text and only the summary paragraph, supplemented by age and sex. Given the privacy-sensitive information included in discharge letters, we added a de-identification step. The performance was high, with F1 scores of 0.76–0.99 for three-character and 0.87–0.98 for four-character ICD-10 codes, and was best when using complete discharge letters. Adding variables age/sex did not affect results. For model interpretability, word coefficients were provided and qualitative assessment of classification was manually performed. 
Because of its high performance, this pipeline can be useful to decrease the administrative burden of classifying discharge diagnoses and may serve as a scaffold for reimbursement and research applications.
|
29
|
Ibrahim MA, Ghani Khan MU, Mehmood F, Asim MN, Mahmood W. GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification. J Biomed Inform 2021; 116:103699. [PMID: 33601013 DOI: 10.1016/j.jbi.2021.103699] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 04/25/2020] [Revised: 11/30/2020] [Accepted: 02/02/2021] [Indexed: 01/16/2023]
Abstract
Exponential growth of biomedical literature and clinical data demands more robust yet precise computational methodologies to extract useful insights from biomedical literature and to perform accurate assignment of disease-specific codes. Such approaches can largely enhance the effectiveness of diverse biomedicine and bioinformatics applications. State-of-the-art computational biomedical text classification methodologies leverage either discriminative features extracted through convolution operations performed by a deep convolutional neural network or contextual information extracted by a recurrent neural network. However, none of these methodologies takes advantage of both convolutional and recurrent neural networks. Further, existing methodologies fail to perform well when classifying biomedical text of different genres, such as biomedical literature or clinical notes. We, for the very first time, present a generic deep learning based hybrid multi-label classification methodology, namely GHS-NET, which can be utilized to accurately classify biomedical text of diverse genres. GHS-NET makes use of a convolutional neural network to extract the most discriminative features and a bi-directional Long Short-Term Memory to acquire contextual information. The effectiveness of GHS-NET is evaluated for extreme multi-label biomedical literature classification and assignment of ICD-9 codes to clinical notes. For the task of extreme multi-label biomedical literature classification, a performance comparison of GHS-NET and the state-of-the-art deep learning based methodology reveals that GHS-NET improves precision, recall, and F1-score by 1%, 6%, and 1% on the hallmarks of cancer dataset and by 10%, 16%, and 11% on the chemical exposure dataset. For the task of clinical notes classification, GHS-NET outperforms the previous best deep learning based methodology on the Medical Information Mart for Intensive Care dataset (MIMIC-III) by a significant margin of 6% in recall and 8% in F1-score. GHS-NET is available as a web service at1 and potentially can be used to accurately classify multi-variate disease and chemical exposure specific text.
Affiliation(s)
- Muhammad Ali Ibrahim
  - Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- Muhammad Usman Ghani Khan
  - Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; Department of Computer Science, University of Engineering and Technology (UET), Lahore, Pakistan
- Faiza Mehmood
  - Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan
- Muhammad Nabeel Asim
  - Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
- Waqar Mahmood
  - Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan
|
30
|
Pandey B, Kumar Pandey D, Pratap Mishra B, Rhmann W. A comprehensive survey of deep learning in the field of medical imaging and medical natural language processing: Challenges and research directions. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2021.01.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
|
31
|
Wang J, Li M, Diao Q, Lin H, Yang Z, Zhang Y. Biomedical document triage using a hierarchical attention-based capsule network. BMC Bioinformatics 2020; 21:380. [PMID: 32938366 PMCID: PMC7495737 DOI: 10.1186/s12859-020-03673-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical document triage is the foundation of biomedical information extraction, which is important to precision medicine. Recently, some neural networks-based methods have been proposed to classify biomedical documents automatically. In the biomedical domain, documents are often very long and often contain very complicated sentences. However, the current methods still find it difficult to capture important features across sentences. RESULTS In this paper, we propose a hierarchical attention-based capsule model for biomedical document triage. The proposed model effectively employs hierarchical attention mechanism and capsule networks to capture valuable features across sentences and construct a final latent feature representation for a document. We evaluated our model on three public corpora. CONCLUSIONS Experimental results showed that both hierarchical attention mechanism and capsule networks are helpful in biomedical document triage task. Our method proved itself highly competitive or superior compared with other state-of-the-art methods.
Affiliation(s)
- Jian Wang
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
- Mengying Li
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
- Qishuai Diao
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
- Hongfei Lin
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
- Zhihao Yang
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
- YiJia Zhang
  - Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
|
32
|
Si Y, Roberts K. Patient Representation Transfer Learning from Clinical Notes based on Hierarchical Attention Network. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:597-606. [PMID: 32477682 PMCID: PMC7233035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
To explicitly learn patient representations from longitudinal clinical notes, we propose a hierarchical attention-based recurrent neural network (RNN) with greedy segmentation to distinguish between shorter and longer, more meaningful gaps between notes. The proposed model is evaluated for both a direct clinical prediction task (mortality) and as a transfer learning pre-training model to downstream evaluation (phenotype prediction of obesity and its comorbidities). Experimental results first show the proposed model with appropriate segmentation achieved the best performance on mortality prediction, indicating the effectiveness of hierarchical RNNs in dealing with longitudinal clinical text. Attention weights from the models highlight those parts of notes with the largest impact on mortality prediction and hopefully provide a degree of interpretability. Following the transfer learning approach, we also demonstrate the effectiveness and generalizability of pre-trained patient representations on target tasks of phenotyping.
Affiliation(s)
- Yuqi Si
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
|
33
|
Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
Affiliation(s)
- Douglas Teodoro
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nona Naderi
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Emilie Pasche
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Gobeill
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Cecilia N Arighi
  - Center of Bioinformatics and Computational Biology, 15 Innovation Way, 19711, Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Patrick Ruch
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
|
34
|
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/15/2022] Open
Abstract
Background Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. Methods We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. Results The official results demonstrated our best submission was the ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both the Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. Conclusions Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
Affiliation(s)
- Qingyu Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Jingcheng Du
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA; School of Biomedical Informatics, UTHealth, Houston, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- W John Wilbur
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
|
35
|
Multi-Task Topic Analysis Framework for Hallmarks of Cancer with Weak Supervision. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10030834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Because topic analysis frameworks optimized specifically for cancer data are lacking, topic modeling in cancer research remains challenging. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those approaches are only applicable to labeled data, and comparatively few documents are labeled by experts, whereas a massive number of unlabeled documents is available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL), used to learn cancer hallmarks on existing labeled documents; (2) weak label propagation (WLP), used to classify a large number of unlabeled documents with the pre-trained model from the CHL task; and (3) topic modeling (ToM), used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embeddings that represent semantic meanings obtained from a large unlabeled corpus. In the ToM task, we employed a latent topic model, such as latent Dirichlet allocation (LDA) or probabilistic latent semantic analysis (PLSA), to capture the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation of the MTTA framework, comparing it with several approaches.
|