Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Du J, Chen Q, Peng Y, Xiang Y, Tao C, Lu Z. ML-Net: multi-label classification of biomedical texts with deep neural networks. J Am Med Inform Assoc 2019;26:1279-1285. [PMID: 31233120 PMCID: PMC7647240 DOI: 10.1093/jamia/ocz085] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 12/10/2018] [Revised: 04/21/2019] [Accepted: 05/08/2019] [Indexed: 11/14/2022] Open

Cited by Other Article(s)

Zhang Y, Li X, Liu Y, Li A, Yang X, Tang X. A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification. JMIR Med Inform 2023;11:e44892. [PMID: 37796584 PMCID: PMC10587805 DOI: 10.2196/44892] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 03/07/2023] [Accepted: 09/06/2023] [Indexed: 10/06/2023] Open

Abstract

BACKGROUND

Given the threat that cancer poses to human health, the volume of data in the cancer field is growing rapidly, and interdisciplinary and collaborative research is becoming increasingly important for fine-grained classification. The low-resolution, journal-level classification of reported studies fails to satisfy advanced search demands, and a single label does not adequately characterize literature originating from interdisciplinary research. There is thus a need to establish a higher-resolution multilabel classifier to support literature retrieval for cancer research and reduce the burden of screening papers for clinical relevance.

OBJECTIVE

The primary objective of this research was to address the low resolution of cancer literature classification caused by the ambiguity of the existing journal-level classifier, in order to support the retrieval of highly relevant evidence for clinical consideration and comprehensive results for literature searches.

METHODS

We trained a multilabel classifier with scalability for classifying the literature on cancer research directly at the publication level to assign proper content-derived labels based on the "Bidirectional Encoder Representation from Transformers (BERT) + X" model and obtain the best option for X. First, a corpus of 70,599 cancer publications retrieved from the Dimensions database was divided into a training and a testing set in a ratio of 7:3. Second, using the classification terminology of International Cancer Research Partnership cancer types, we compared the performance of classifiers developed using BERT and 5 classical deep learning models, such as the text recurrent neural network (TextRNN) and FastText, followed by metrics analysis.
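The 7:3 split described above can be sketched as follows. This is only a minimal illustration: the abstract does not say how the split was drawn, so the shuffling, the seed, and the helper name `split_corpus` are assumptions, and integer IDs stand in for the 70,599 publications.

```python
import random

def split_corpus(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of `items` and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the 70,599 publications in the corpus.
corpus_ids = range(70599)
train_ids, test_ids = split_corpus(corpus_ids)
print(len(train_ids), len(test_ids))
```

Under these assumptions a 7:3 cut gives 49,419 training and 21,180 test records; the exact counts used by the authors are not stated in the abstract.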

RESULTS

After comparing various combined deep learning models, we obtained a classifier based on the optimal combination "BERT + TextRNN," with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. Moreover, we quantified the distinctive characteristics in the text structure and multilabel distribution in order to generalize the model to other fields with similar characteristics.
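As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall using the standard definition of F1 as their harmonic mean. The function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for "BERT + TextRNN".
reported_f1 = f1_score(0.9309, 0.8775)
print(f"{reported_f1:.2%}")  # prints 90.34%
```

The recomputed value agrees with the 90.34% stated in the abstract.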

CONCLUSIONS

The "BERT + TextRNN" model was trained for high-resolution classification of cancer literature at the publication level to support accurate retrieval and academic statistics. The model automatically assigns 1 or more labels to each cancer paper, as required. Quantitative comparison verified that the "BERT + TextRNN" model is the best fit for multilabel classification of cancer literature compared to other models. More data from diverse fields will be collected to test the scalability and extensibility of the proposed model in the future.

Jin Y, Lu H, Zhu W, Huo W. Deep learning based classification of multi-label chest X-ray images via dual-weighted metric loss. Comput Biol Med 2023;157:106683. [PMID: 36905869 DOI: 10.1016/j.compbiomed.2023.106683] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/02/2022] [Revised: 10/17/2022] [Accepted: 11/06/2022] [Indexed: 02/17/2023]

Mao C, Zhu Q, Chen R, Su W. Automatic medical specialty classification based on patients' description of their symptoms. BMC Med Inform Decis Mak 2023;23:15. [PMID: 36670382 PMCID: PMC9862953 DOI: 10.1186/s12911-023-02105-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 01/09/2023] [Indexed: 01/22/2023] Open

Jin Y, Xiong Y, Shi D, Lin Y, He L, Zhang Y, Plasek JM, Zhou L, Bates DW, Tang C. Learning from undercoded clinical records for automated International Classification of Diseases (ICD) coding. J Am Med Inform Assoc 2022;30:438-446. [PMID: 36478240 PMCID: PMC9933053 DOI: 10.1093/jamia/ocac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 11/08/2022] [Accepted: 11/16/2022] [Indexed: 12/12/2022] Open

Aman A, Reji DJ. Environmental due diligence data: A novel corpus for training environmental domain NLP models. Data Brief 2022;45:108579. [PMID: 36148216 PMCID: PMC9486029 DOI: 10.1016/j.dib.2022.108579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 08/20/2022] [Accepted: 09/02/2022] [Indexed: 11/24/2022] Open

Gu J, Chersoni E, Wang X, Huang CR, Qian L, Zhou G. LitCovid ensemble learning for COVID-19 multi-label classification. Database (Oxford) 2022;2022:6846687. [PMID: 36426767 PMCID: PMC9693804 DOI: 10.1093/database/baac103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/04/2022] [Indexed: 11/27/2022]

Abstract

The Coronavirus Disease 2019 (COVID-19) pandemic has shifted the focus of research worldwide, and more than 10 000 new articles per month have concentrated on COVID-19-related topics. Considering this rapidly growing literature, the efficient and precise extraction of the main topics of COVID-19-relevant articles is of great importance. The manual curation of this information for biomedical literature is labor-intensive and time-consuming, and as such the procedure is insufficient and difficult to maintain. In response to these complications, the BioCreative VII community has proposed a challenging task, LitCovid Track, calling for a global effort to automatically extract semantic topics for COVID-19 literature. This article describes our work on the BioCreative VII LitCovid Track. We proposed the LitCovid Ensemble Learning (LCEL) method for the tasks and integrated multiple biomedical pretrained models to address the COVID-19 multi-label classification problem. Specifically, seven different transformer-based pretrained models were ensembled for the initialization and fine-tuning processes independently. To enhance the representation abilities of the deep neural models, diverse additional biomedical knowledge was utilized to facilitate the fruitfulness of the semantic expressions. Simple yet effective data augmentation was also leveraged to address the learning deficiency during the training phase. In addition, given the imbalanced label distribution of the challenging task, a novel asymmetric loss function was applied to the LCEL model, which explicitly adjusted the negative-positive importance by assigning different exponential decay factors and helped the model focus on the positive samples. After the training phase, an ensemble bagging strategy was adopted to merge the outputs from each model for final predictions. The experimental results show the effectiveness of our proposed approach, as LCEL obtains the state-of-the-art performance on the LitCovid dataset. 
Database URL: https://github.com/JHnlp/LCEL.
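The two ideas named in the abstract, an asymmetric loss that weights positive and negative labels differently and the merging of per-model outputs, can be illustrated with a small sketch. It is not the LCEL implementation: the loss follows the general asymmetric focal-style form with separate exponents for positives and negatives, plain probability averaging stands in for the bagging strategy, and the function names, default exponents, threshold, and example probabilities are all assumptions.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Mean asymmetric focal-style loss over one document's label probabilities.

    Positive labels are weighted by (1 - p) ** gamma_pos and negative labels
    by p ** gamma_neg, so a larger gamma_neg suppresses easy negatives.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total / len(probs)

def average_ensemble(per_model_probs, threshold=0.5):
    """Average label probabilities across models, then threshold each label."""
    n_models = len(per_model_probs)
    mean_probs = [sum(col) / n_models for col in zip(*per_model_probs)]
    return [1 if p >= threshold else 0 for p in mean_probs], mean_probs

# Hypothetical probabilities for three labels from two models.
labels, mean_probs = average_ensemble([[0.9, 0.2, 0.6], [0.7, 0.4, 0.3]])
print(labels, mean_probs)

# An easy negative (p = 0.1) contributes far less once gamma_neg is raised.
print(asymmetric_loss([0.1], [0], gamma_neg=0.0),
      asymmetric_loss([0.1], [0], gamma_neg=4.0))
```

Setting `gamma_neg` higher than `gamma_pos` is what shifts the emphasis toward positive samples on an imbalanced label distribution.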

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022;17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open

Abstract

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. 
Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.

Multi-label sequence generating model via label semantic attention mechanism. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01722-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]

Han M, Wu H, Chen Z, Li M, Zhang X. A survey of multi-label classification based on supervised and semi-supervised learning. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01658-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

Abbasi A, Javed AR, Iqbal F, Kryvinska N, Jalil Z. Deep learning for religious and continent-based toxic content detection and classification. Sci Rep 2022;12:17478. [PMID: 36261675 PMCID: PMC9581992 DOI: 10.1038/s41598-022-22523-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2022] [Accepted: 10/17/2022] [Indexed: 01/12/2023] Open

Su R, Yang H, Wei L, Chen S, Zou Q. A multi-label learning model for predicting drug-induced pathology in multi-organ based on toxicogenomics data. PLoS Comput Biol 2022;18:e1010402. [PMID: 36070305 PMCID: PMC9451100 DOI: 10.1371/journal.pcbi.1010402] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/14/2022] [Accepted: 07/18/2022] [Indexed: 11/18/2022] Open

Hu Y, Donald C, Giacaman N. Can Multi-Label Classifiers Help Identify Subjectivity? A Deep Learning Approach to Classifying Cognitive Presence in MOOCs. INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION 2022;33:1-36. [PMID: 36090962 PMCID: PMC9439267 DOI: 10.1007/s40593-022-00310-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/18/2022] [Indexed: 11/21/2022]

Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022;19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]

Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, VG S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022;2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]

Abstract

The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. Related findings, such as vaccine and drug development, have been reported in the biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development.
The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
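The three scores quoted above average F1 in different ways: macro-F1 averages per-label F1, micro-F1 pools the counts over all labels, and instance-based F1 averages per-article F1. The sketch below uses their standard definitions on a made-up 3-article, 3-topic example; the helper names, the toy data, and the choice to score an empty denominator as 0.0 are assumptions, and this is not the official track evaluation script.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(true_bits, pred_bits):
    """Count (tp, fp, fn) over two aligned 0/1 sequences."""
    pairs = list(zip(true_bits, pred_bits))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn

def multilabel_f1(y_true, y_pred):
    """Return macro, micro and instance-based F1 for 0/1 label matrices."""
    per_label = [counts(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
    per_instance = [counts(t, p) for t, p in zip(y_true, y_pred)]
    macro = sum(f1_from_counts(*c) for c in per_label) / len(per_label)
    micro = f1_from_counts(*(sum(col) for col in zip(*per_label)))
    instance = sum(f1_from_counts(*c) for c in per_instance) / len(per_instance)
    return {"macro": macro, "micro": micro, "instance": instance}

# Toy example: rows are articles, columns are topics (hypothetical data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
scores = multilabel_f1(y_true, y_pred)
print(scores)
```

On this toy data the three averages differ (5/9, 2/3 and 13/18 respectively), which is why the track reports all of them: macro-F1 is pulled down most by a rare topic that is never predicted correctly.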

Affiliation(s)

Qingyu Chen: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
Alexis Allot: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
Robert Leaman: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
Rezarta Islamaj: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
Jingcheng Du: School of Biomedical Informatics, UT Health, Houston, TX 77030, USA
Li Fang: Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Kai Wang: Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
Shuo Xu: College of Economics and Management, Beijing University of Technology, Beijing, China
Yuefu Zhang: College of Economics and Management, Beijing University of Technology, Beijing, China
Parsa Bagherzadeh: CLaC Labs, Concordia University, Montreal, Canada
Sabine Bergler: CLaC Labs, Concordia University, Montreal, Canada
Aakash Bhatnagar: Navrachana University, Vadodara, India
Nidhir Bhavsar: Navrachana University, Vadodara, India
Yung-Chun Chang: Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
Sheng-Jie Lin: Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
Wentai Tang: College of Computer Science and Technology, Dalian University of Technology, Dalian, China
Hongtong Zhang: College of Computer Science and Technology, Dalian University of Technology, Dalian, China
Ilija Tavchioski: Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
Senja Pollak: Jožef Stefan Institute, Ljubljana, Slovenia
Shubo Tian: Department of Statistics, Florida State University, Tallahassee, FL, USA
Jinfeng Zhang: Department of Statistics, Florida State University, Tallahassee, FL, USA
Yulia Otmakhova: School of Computing and Information Systems, University of Melbourne, Melbourne, AU-VIC, Australia
Antonio Jimeno Yepes: School of Computing Technologies, RMIT University, Melbourne, AU-VIC, Australia
Hang Dong: Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
Honghan Wu: Institute of Health Informatics, University College London, London, UK
Richard Dufour: LS2N, Nantes University, Nantes, France
Yanis Labrak: LIA, Avignon University, Avignon, France
Niladri Chatterjee: Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
Kushagri Tandon: Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
Fréjus A A Laleye: Opscidia, Paris, France
Loïc Rakotoson: Opscidia, Paris, France
Emmanuele Chersoni: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
Jinghang Gu: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
Annemarie Friedrich: Bosch Center for Artificial Intelligence, Renningen, Germany
Subhash Chandra Pujari: Institute of Computer Science, Heidelberg University, Heidelberg, Germany; Bosch Center for Artificial Intelligence, Renningen, Germany
Mariia Chizhikova: SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
Naveen Sivadasan: TCS Research, Life Sciences, Hyderabad, India
Saipradeep VG: TCS Research, Life Sciences, Hyderabad, India
Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA

Lin Y, Chi Y, Han H, Han M, Guo Y. Multimodal Orthodontic Corpus Construction Based on Semantic Tag Classification Method. Neural Process Lett 2022. [DOI: 10.1007/s11063-021-10558-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]

Lin SJ, Yeh WC, Chiu YW, Chang YC, Hsu MH, Chen YS, Hsu WL. A BERT-based ensemble learning approach for the BioCreative VII challenges: full-text chemical identification and multi-label classification in PubMed articles. Database (Oxford) 2022;2022:6645124. [PMID: 35849027 PMCID: PMC9290865 DOI: 10.1093/database/baac056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 06/20/2022] [Accepted: 07/02/2022] [Indexed: 11/25/2022]

Sovrano F, Palmirani M, Vitali F. Combining shallow and deep learning approaches against data scarcity in legal domains. GOVERNMENT INFORMATION QUARTERLY 2022. [DOI: 10.1016/j.giq.2022.101715] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/04/2022]

Fang L, Wang K. Multi-label topic classification for COVID-19 literature with Bioformer. ARXIV 2022:arXiv:2204.06758v1. [PMID: 35441084 PMCID: PMC9016643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

Janjua ZH, Kerins D, O'Flynn B, Tedesco S. Knowledge-driven feature engineering to detect multiple symptoms using ambulatory blood pressure monitoring data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022;217:106638. [PMID: 35220199 DOI: 10.1016/j.cmpb.2022.106638] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/04/2021] [Revised: 11/14/2021] [Accepted: 01/14/2022] [Indexed: 06/14/2023]

Abstract

BACKGROUND

Hypertension is a major health concern across the globe and needs to be properly diagnosed so that it can be treated and mitigated. In this context, ambulatory blood pressure monitoring is essential for a proper diagnosis of hypertension, which may not otherwise be possible due to the white coat effect or masked hypertension. In this paper, the objective is to develop a model that incorporates expert knowledge in the feature engineering process so as to accurately predict multiple medical conditions. As a case study, we considered multiple symptoms related to hypertension and used an ambulatory blood pressure monitoring method to continuously acquire hypertension-relevant data from a patient. The goal is to train a model with a minimum set of the most effective knowledge-driven features, which are useful for detecting multiple symptoms simultaneously using multi-class classification techniques.

METHOD

Artificial intelligence-based blood pressure monitoring techniques introduce a new dimension in the diagnosis of hypertension by enabling a continuous (24-hour) analysis of systolic and diastolic blood pressure levels. In this work, we present a model that entails a knowledge-driven feature engineering method and implement an ambulatory blood pressure monitoring system to diagnose multiple cardiac parameters and associated conditions simultaneously; these include morning surge, circadian rhythm, and pulse pressure. The knowledge-driven features are extracted to improve the interpretability of the classification model, and machine learning techniques (Random Forest, Naive Bayes, and KNN) were applied in a multi-label classification setup using RAkEL to classify multiple conditions simultaneously.
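RAkEL (random k-labelsets) breaks a multi-label problem into several smaller ones: each sub-model sees a random subset of k labels, treats every combination of those labels as a single class (a label powerset), and the sub-model outputs are merged by per-label voting. The sketch below shows only the label transformation and the voting step; the base classifier (for example the Random Forest mentioned above) is left out, and the function names, k = 2, the 0.5 vote threshold, and the example predictions are assumptions rather than the authors' configuration.

```python
import random
from itertools import combinations

# The three conditions named in the abstract, used here as label names.
LABELS = ["morning_surge", "circadian_rhythm", "pulse_pressure"]

def sample_labelsets(n_labels, k, m, seed=0):
    """Pick up to m distinct random k-labelsets (tuples of label indices)."""
    all_sets = list(combinations(range(n_labels), k))
    return random.Random(seed).sample(all_sets, min(m, len(all_sets)))

def to_powerset_class(y, labelset):
    """Project a 0/1 label vector onto a labelset; the tuple is one class."""
    return tuple(y[i] for i in labelset)

def vote(predictions, labelsets, n_labels, threshold=0.5):
    """Merge per-labelset predictions into one 0/1 vector by majority vote."""
    votes = [0] * n_labels
    seen = [0] * n_labels
    for pred, labelset in zip(predictions, labelsets):
        for bit, idx in zip(pred, labelset):
            votes[idx] += bit
            seen[idx] += 1
    return [1 if seen[i] and votes[i] / seen[i] > threshold else 0
            for i in range(n_labels)]

# Hypothetical outputs of three sub-models, one per labelset.
labelsets = [(0, 1), (0, 2), (1, 2)]
predictions = [(1, 0), (1, 1), (0, 1)]
print(vote(predictions, labelsets, len(LABELS)))  # prints [1, 0, 1]
```

In a full pipeline each labelset would get its own multi-class classifier trained on the `to_powerset_class` targets, and `vote` would combine their predictions for a new patient.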

RESULTS

The results obtained (F1 = 0.918) show that the Random Forest technique performed well for multilabel classification using knowledge-driven features. Our technique also reduced the complexity of the model by reducing the number of features required to train a machine learning model.

CONCLUSION

Considering these results, we conclude that knowledge-driven feature engineering enhances the learning process by reducing the number of features given as input to the machine learning algorithm. The proposed feature engineering method draws on expert knowledge to develop better diagnosis models that are free from the noisy, data-driven features that can be misleading in some situations. It is a white-box approach in which clinicians can understand the importance of a feature while looking at its value.

Lentzas A, Dalagdi E, Vrakas D. Multilabel Classification Methods for Human Activity Recognition: A Comparison of Algorithms. SENSORS 2022;22:s22062353. [PMID: 35336522 PMCID: PMC8955852 DOI: 10.3390/s22062353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/14/2022] [Revised: 03/14/2022] [Accepted: 03/16/2022] [Indexed: 12/10/2022]

Aduragba OT, Yu J, Cristea AI, Shi L. Detecting Fine-Grained Emotions on Social Media during Major Disease Outbreaks: Health and Well-being before and during the COVID-19 Pandemic. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022;2021:187-196. [PMID: 35308991 PMCID: PMC8861702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]

Ensemble of classifier chains and decision templates for multi-label classification. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-021-01647-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]

Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports. JOURNAL OF HEALTHCARE ENGINEERING 2021;2021:6663884. [PMID: 34306597 PMCID: PMC8285182 DOI: 10.1155/2021/6663884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/14/2020] [Revised: 05/29/2021] [Accepted: 06/29/2021] [Indexed: 11/17/2022]

Abstract

Methods

We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for the prediction of recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter out the unwanted texts from X-ray radiology reports. Thereafter, text representation methods were used to numerically represent unstructured radiology reports with vectors. Subsequently, these text representation methods were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM).

Results

We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, misclassification rate of 14.3%, and F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively.
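The 5-fold cross-validation over 5603 patients can be pictured as follows. This is a minimal sketch under assumptions: the abstract does not describe how folds were formed, so the random shuffling, the seed, and the generator name `kfold_indices` are illustrative, and integer indices stand in for patient records.

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

fold_sizes = [len(test) for _, test in kfold_indices(5603)]
print(fold_sizes)  # prints [1121, 1121, 1121, 1120, 1120]
```

Each model is trained on roughly four fifths of the patients and evaluated on the remaining fifth, and the reported metrics would then be aggregated over the five held-out folds.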

Conclusions

Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors.

Stemerman R, Arguello J, Brice J, Krishnamurthy A, Houston M, Kitzmiller R. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. JAMIA Open 2021;4:ooaa069. [PMID: 34514351 PMCID: PMC8423426 DOI: 10.1093/jamiaopen/ooaa069] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 09/16/2020] [Revised: 11/16/2020] [Accepted: 11/20/2020] [Indexed: 11/13/2022] Open

Dai X, Xu F, Wang S, Mundra PA, Zheng J. PIKE-R2P: Protein-protein interaction network-based knowledge embedding with graph neural network for single-cell RNA to protein prediction. BMC Bioinformatics 2021;22:139. [PMID: 34078261 PMCID: PMC8170782 DOI: 10.1186/s12859-021-04022-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Accepted: 02/11/2021] [Indexed: 12/05/2022] Open

Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021;22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021;16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022] Open

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced for multiple reasons, including the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. In addition, there are other significant gaps in the literature on biomedical sentence similarity: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, updated, and for the first time, reproducible experimental survey on biomedical sentence similarity.
Our aforementioned experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.

Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks. NPJ Digit Med 2021;4:37. [PMID: 33637859 PMCID: PMC7910461 DOI: 10.1038/s41746-021-00404-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2020] [Accepted: 01/26/2021] [Indexed: 12/02/2022] Open

Abstract

Standard reference terminology of diagnoses and risk factors is crucial for billing, epidemiological studies, and inter/intranational comparisons of diseases. The International Classification of Disease (ICD) is a standardized and widely used method, but the manual classification is an enormously time-consuming endeavor. Natural language processing together with machine learning allows automated structuring of diagnoses using ICD-10 codes, but the limited performance of machine learning models, the necessity of gigantic datasets, and poor reliability of terminal parts of these codes restricted clinical usability. We aimed to create a high performing pipeline for automated classification of reliable ICD-10 codes in the free medical text in cardiology. We focussed on frequently used and well-defined three- and four-digit ICD-10 codes that still have enough granularity to be clinically relevant such as atrial fibrillation (I48), acute myocardial infarction (I21), or dilated cardiomyopathy (I42.0). Our pipeline uses a deep neural network known as a Bidirectional Gated Recurrent Unit Neural Network and was trained and tested with 5548 discharge letters and validated in 5089 discharge and procedural letters. As in clinical practice discharge letters may be labeled with more than one code, we assessed the single- and multilabel performance of main diagnoses and cardiovascular risk factors. We investigated using both the entire body of text and only the summary paragraph, supplemented by age and sex. Given the privacy-sensitive information included in discharge letters, we added a de-identification step. The performance was high, with F1 scores of 0.76–0.99 for three-character and 0.87–0.98 for four-character ICD-10 codes, and was best when using complete discharge letters. Adding variables age/sex did not affect results. For model interpretability, word coefficients were provided and qualitative assessment of classification was manually performed. 
Because of its high performance, this pipeline can be useful to decrease the administrative burden of classifying discharge diagnoses and may serve as a scaffold for reimbursement and research applications.

Ibrahim MA, Ghani Khan MU, Mehmood F, Asim MN, Mahmood W. GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification. J Biomed Inform 2021;116:103699. [PMID: 33601013 DOI: 10.1016/j.jbi.2021.103699] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2020] [Revised: 11/30/2020] [Accepted: 02/02/2021] [Indexed: 01/16/2023]

Abstract

Exponential growth of biomedical literature and clinical data demands more robust yet precise computational methodologies to extract useful insights from biomedical literature and to perform accurate assignment of disease-specific codes. Such approaches can largely enhance the effectiveness of diverse biomedicine and bioinformatics applications. State-of-the-art computational biomedical text classification methodologies leverage either discriminative features extracted through convolution operations performed by deep convolutional neural networks or contextual information extracted by recurrent neural networks. However, none of these methodologies takes advantage of both convolutional and recurrent neural networks. Further, existing methodologies fail to produce decent performance for the classification of biomedical text of different genres, such as biomedical literature or clinical notes. We, for the very first time, present a generic deep learning based hybrid multi-label classification methodology, namely GHS-NET, which can be utilized to accurately classify biomedical text of diverse genres. GHS-NET makes use of a convolutional neural network to extract the most discriminative features and a bi-directional Long Short-Term Memory to acquire contextual information. The effectiveness of GHS-NET is evaluated for extreme multi-label biomedical literature classification and assignment of ICD-9 codes to clinical notes. For the task of extreme multi-label biomedical literature classification, a performance comparison of GHS-NET and a state-of-the-art deep learning based methodology reveals that GHS-NET achieves an improvement of 1%, 6%, and 1% for the hallmarks of cancer dataset and 10%, 16%, and 11% for the chemical exposure dataset in terms of precision, recall, and F1-score.
For the task of clinical notes classification, GHS-NET outperforms the previous best deep learning based methodology on the Medical Information Mart for Intensive Care dataset (MIMIC-III) by a significant margin of 6% and 8% in terms of recall and F1-score. GHS-NET is available as a web service at¹ and potentially can be used to accurately classify multi-variate disease and chemical exposure specific text.

Pandey B, Kumar Pandey D, Pratap Mishra B, Rhmann W. A comprehensive survey of deep learning in the field of medical imaging and medical natural language processing: Challenges and research directions. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2021.01.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]

Wang J, Li M, Diao Q, Lin H, Yang Z, Zhang Y. Biomedical document triage using a hierarchical attention-based capsule network. BMC Bioinformatics 2020;21:380. [PMID: 32938366 PMCID: PMC7495737 DOI: 10.1186/s12859-020-03673-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Si Y, Roberts K. Patient Representation Transfer Learning from Clinical Notes based on Hierarchical Attention Network. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020;2020:597-606. [PMID: 32477682 PMCID: PMC7233035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020;2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]

Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020;20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open

Abstract

Background

Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.

Methods

We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.

Results

The official results demonstrated that our best submission was the ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528.

Conclusions

Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.

Multi-Task Topic Analysis Framework for Hallmarks of Cancer with Weak Supervision. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10030834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]

Abstract

The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Due to the lack of topic analysis frameworks optimized specifically for cancer data, studies on topic modeling in cancer research still face a strong challenge. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those are only applicable to labeled data. There is a comparatively small number of documents that are labeled by experts. In the real world, there is a massive number of unlabeled documents that are available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL), used to learn cancer hallmarks on existing labeled documents; (2) weak label propagation (WLP), used to classify a large number of unlabeled documents with the pre-trained model in the CHL task; and (3) topic modeling (ToM), used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embedding that represents semantic meanings obtained from an unlabeled large corpus. In the ToM task, we employed latent topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) to catch the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation for the MTTA framework, comparing it with several approaches.