1
Gu J, Chersoni E, Wang X, Huang CR, Qian L, Zhou G. LitCovid ensemble learning for COVID-19 multi-label classification. Database (Oxford) 2022; 2022:6846687. [PMID: 36426767 PMCID: PMC9693804 DOI: 10.1093/database/baac103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/04/2022] [Indexed: 11/27/2022]
Abstract
The Coronavirus Disease 2019 (COVID-19) pandemic has shifted the focus of research worldwide, and more than 10 000 new articles per month have concentrated on COVID-19-related topics. Considering this rapidly growing literature, the efficient and precise extraction of the main topics of COVID-19-relevant articles is of great importance. The manual curation of this information for biomedical literature is labor-intensive and time-consuming, and as such the procedure is insufficient and difficult to maintain. In response to these complications, the BioCreative VII community has proposed a challenging task, LitCovid Track, calling for a global effort to automatically extract semantic topics for COVID-19 literature. This article describes our work on the BioCreative VII LitCovid Track. We proposed the LitCovid Ensemble Learning (LCEL) method for the tasks and integrated multiple biomedical pretrained models to address the COVID-19 multi-label classification problem. Specifically, seven different transformer-based pretrained models were ensembled for the initialization and fine-tuning processes independently. To enhance the representation abilities of the deep neural models, diverse additional biomedical knowledge was utilized to facilitate the fruitfulness of the semantic expressions. Simple yet effective data augmentation was also leveraged to address the learning deficiency during the training phase. In addition, given the imbalanced label distribution of the challenging task, a novel asymmetric loss function was applied to the LCEL model, which explicitly adjusted the negative-positive importance by assigning different exponential decay factors and helped the model focus on the positive samples. After the training phase, an ensemble bagging strategy was adopted to merge the outputs from each model for final predictions. The experimental results show the effectiveness of our proposed approach, as LCEL obtains the state-of-the-art performance on the LitCovid dataset. 
Database URL: https://github.com/JHnlp/LCEL.
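The asymmetric loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common focal-style formulation, in which positive and negative labels receive separate exponents; the exact form used in LCEL is not given in the abstract, and the function name and the default values of `gamma_pos` and `gamma_neg` are hypothetical.

```python
import math


def asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Mean asymmetric focal-style loss over the labels of one example.

    probs  : predicted probabilities, one per label (values in [0, 1])
    labels : 0/1 ground-truth indicators, same length as probs
    gamma_pos, gamma_neg : hypothetical decay exponents (assumed values)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive label: scaled by (1 - p) ** gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: easy negatives (small p) are scaled down
            # strongly by p ** gamma_neg.
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total / len(probs)
```

With `gamma_neg` larger than `gamma_pos`, the many easy negative labels contribute little to the loss, so the gradient is dominated by the rarer positive labels.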
Affiliation(s)
- Emmanuele Chersoni
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
- Xing Wang
  - Tencent AI Lab, Shenzhen 518071, China
- Chu-Ren Huang
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China
- Longhua Qian
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Guodong Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2
Wishart DS, Girod S, Peters H, Oler E, Jovel J, Budinski Z, Milford R, Lui VW, Sayeeda Z, Mah R, Wei W, Badran H, Lo E, Yamamoto M, Djoumbou-Feunang Y, Karu N, Gautam V. ChemFOnt: the chemical functional ontology resource. Nucleic Acids Res 2022; 51:D1220-D1229. [PMID: 36305829 PMCID: PMC9825615 DOI: 10.1093/nar/gkac919] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/16/2022] [Revised: 10/03/2022] [Accepted: 10/18/2022] [Indexed: 01/30/2023]
Abstract
The Chemical Functional Ontology (ChemFOnt), located at https://www.chemfont.ca, is a hierarchical, OWL-compatible ontology describing the functions and actions of >341 000 biologically important chemicals. These include primary metabolites, secondary metabolites, natural products, food chemicals, synthetic food additives, drugs, herbicides, pesticides and environmental chemicals. ChemFOnt is a FAIR-compliant resource intended to bring the same rigor, standardization and formal structure to the terms and terminology used in biochemistry, food chemistry and environmental chemistry as the Gene Ontology (GO) has brought to molecular biology. ChemFOnt is available as both a freely accessible, web-enabled database and a downloadable Web Ontology Language (OWL) file. Users may download and deploy ChemFOnt within their own chemical databases or integrate ChemFOnt into their own analytical software to generate machine-readable relationships that can be used to make new inferences, enrich their omics data sets or make new, non-obvious connections between chemicals and their direct or indirect effects. The web version of the ChemFOnt database has been designed to be easy to search, browse and navigate. Currently, ChemFOnt contains data on 341 627 chemicals, including 515 332 terms or definitions. The functional hierarchy for ChemFOnt consists of four functional 'aspects', 12 functional super-categories and a total of 173 705 functional terms. In addition, each of the chemicals is classified into 4825 structure-based chemical classes. ChemFOnt currently contains 3.9 million protein-chemical relationships and ∼10.3 million chemical-functional relationships. The long-term goal for ChemFOnt is for it to be adopted by databases and software tools used by the general chemistry community as well as the metabolomics, exposomics, metagenomics, genomics and proteomics communities.
Affiliation(s)
- David S Wishart
  - To whom correspondence should be addressed. Tel: +1 780 492 8574;
- Sagan Girod
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Harrison Peters
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Eponine Oler
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Juan Jovel
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Zachary Budinski
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Ralph Milford
  - Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada
- Vicki W Lui
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Zinat Sayeeda
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Robert Mah
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- William Wei
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Hasan Badran
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Elvis Lo
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
- Mai Yamamoto
  - Molecular You Corporation, 788 Beatty St., Suite 307, Vancouver, BC V6B 2M1, Canada
- Naama Karu
  - Leiden Academic Centre for Drug Research, Leiden University, Leiden, 2333 CC, The Netherlands
- Vasuk Gautam
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
3
Zhao W, Zhang J, Yang J, Jiang X, He T. Document-Level Chemical-Induced Disease Relation Extraction via Hierarchical Representation Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2782-2793. [PMID: 34077368 DOI: 10.1109/tcbb.2021.3086090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Over the past decades, Chemical-induced Disease (CID) relations have attracted extensive attention in the biomedical community, reflecting their wide applications in biomedical research and healthcare. However, prior efforts fail to make full use of the interaction between local and global contexts in biomedical documents, and their performance leaves room for improvement. In this paper, we propose a novel framework for document-level CID relation extraction. More specifically, stacked Hypergraph Aggregation Neural Network (HANN) layers are introduced to model the complicated interaction between local and global contexts, based on which better contextualized representations are obtained for CID relation extraction. In addition, the CID Relation Heterogeneous Graph is constructed to capture information at different granularities and further improve the performance of CID relation classification. Experiments on a real-world dataset demonstrate the effectiveness of the proposed framework.
4
Li Z, Wang M, Peng D, Liu J, Xie Y, Dai Z, Zou X. Identification of Chemical-Disease Associations Through Integration of Molecular Fingerprint, Gene Ontology and Pathway Information. Interdiscip Sci 2022; 14:683-696. [PMID: 35391615 DOI: 10.1007/s12539-022-00511-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Revised: 03/16/2022] [Accepted: 03/17/2022] [Indexed: 06/14/2023]
Abstract
The identification of chemical-disease association types is helpful not only for discovering lead compounds and studying drug repositioning, but also for treating disease and deciphering pathomechanisms. There is an urgent need for computational methods that identify potential chemical-disease association types, since wet-lab methods are usually expensive, laborious and time-consuming. In this study, molecular fingerprints, gene ontology and pathway information are utilized to characterize chemicals and diseases. A novel predictor is proposed to recognize potential chemical-disease associations at the first layer, and to further distinguish whether their relationships are biomarker or therapeutic relations at the second layer. The prediction performance of the method is assessed on the benchmark dataset using ten-fold cross-validation. The prediction accuracies of the first layer and the second layer are 78.47% and 72.07%, respectively. The recognition ability for lead compounds, new drug indications, and potential and true chemical-disease association pairs has also been investigated and confirmed by constructing a variety of datasets and performing a series of experiments. It is anticipated that the method can serve as a powerful high-throughput virtual screening tool for drug research and development.
Affiliation(s)
- Zhanchao Li
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China.
  - NMPA Key Laboratory for Technology Research and Evaluation of Pharmacovigilance, Guangzhou, 510006, People's Republic of China.
  - Key Laboratory of Digital Quality Evaluation of Chinese Materia Medica of State Administration of Traditional Chinese Medicine, Guangzhou, 510006, People's Republic of China.
- Mengru Wang
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Dongdong Peng
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Jie Liu
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Yun Xie
  - HuiZhou University, Huizhou, 516007, People's Republic of China
- Zong Dai
  - School of Biomedical Engineering, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Xiaoyong Zou
  - School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China.
5
Lin SJ, Yeh WC, Chiu YW, Chang YC, Hsu MH, Chen YS, Hsu WL. A BERT-based ensemble learning approach for the BioCreative VII challenges: full-text chemical identification and multi-label classification in PubMed articles. Database (Oxford) 2022; 2022:baac056. [PMID: 35849027 PMCID: PMC9290865 DOI: 10.1093/database/baac056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 06/20/2022] [Accepted: 07/02/2022] [Indexed: 11/25/2022]
Abstract
In this research, we explored various state-of-the-art biomedical-specific pre-trained Bidirectional Encoder Representations from Transformers (BERT) models for the National Library of Medicine - Chemistry (NLM-CHEM) and LitCovid tracks in the BioCreative VII Challenge, and proposed a BERT-based ensemble learning approach that integrates the advantages of various models to improve the system's performance. The experimental results on the NLM-CHEM track demonstrate that our method achieves remarkable performance, with F1-scores of 85% and 91.8% in strict and approximate evaluations, respectively. Moreover, the proposed Medical Subject Headings identifier (MeSH ID) normalization algorithm is effective in entity normalization, achieving an F1-score of about 80% in both strict and approximate evaluations. For the LitCovid track, the proposed method is also effective in detecting topics in the Coronavirus disease 2019 (COVID-19) literature; it outperformed the compared methods and achieved state-of-the-art performance on the LitCovid corpus.
Database URL: https://www.ncbi.nlm.nih.gov/research/coronavirus/.
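One simple way to combine several fine-tuned models for multi-label topic classification is to average their per-label probabilities and apply a threshold. The sketch below assumes this averaging scheme for illustration only; the abstract does not state which combination rule the authors used, and the function name and the 0.5 threshold are hypothetical.

```python
def ensemble_predict(model_probs, threshold=0.5):
    """Average per-label probabilities across models, then threshold.

    model_probs : list with one entry per model; each entry is a list of
                  probabilities, one per label, for the same document
    threshold   : hypothetical decision cut-off (assumed value)
    """
    n_models = len(model_probs)
    n_labels = len(model_probs[0])
    # Mean probability for each label over all models.
    averaged = [
        sum(probs[j] for probs in model_probs) / n_models
        for j in range(n_labels)
    ]
    # A label is predicted when its averaged probability reaches the threshold.
    return [1 if p >= threshold else 0 for p in averaged]
```

For example, three models scoring two labels as `[0.9, 0.2]`, `[0.7, 0.4]` and `[0.8, 0.3]` would yield the first label only.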
Affiliation(s)
- Sheng-Jie Lin
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Wen-Chao Yeh
  - Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Yu-Wen Chiu
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yung-Chun Chang
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
  - Clinical Big Data Research Center, Taipei Medical University Hospital, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
  - Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
- Min-Huei Hsu
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yi-Shin Chen
  - Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
  - Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
  - Department of Computer Science and Information Engineering, Asia University, No. 500, Liufeng Rd, Wufeng District, Taichung City 413, Taiwan
6
Li Z, Chen H, Qi R, Lin H, Chen H. DocR-BERT: Document-level R-BERT for Chemical-induced Disease Relation Extraction via Gaussian Probability Distribution. IEEE J Biomed Health Inform 2021; 26:1341-1352. [PMID: 34591774 DOI: 10.1109/jbhi.2021.3116769] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Chemical-induced disease (CID) relation extraction from biomedical articles plays an important role in disease treatment and drug development. Existing methods are insufficient for capturing complete document-level semantic information because they ignore the semantic information of entities in different sentences. In this work, we proposed an effective document-level relation extraction model to automatically extract intra-/inter-sentential CID relations from articles. Firstly, our model employed BERT to generate contextual semantic representations of the title, abstract and shortest dependency paths (SDPs). Secondly, to enhance the semantic representation of the whole document, cross attention with self-attention (named cross2self-attention) between abstract, title and SDPs was proposed to learn the mutual semantic information. Thirdly, to distinguish the importance of the target entity in different sentences, the Gaussian probability distribution was utilized to compute the weights of the co-occurrence sentence and its adjacent entity sentences. More complete semantic information of the target entity is collected from all entities occurring in the document via our presented document-level R-BERT (DocR-BERT). Finally, the related representations were concatenated and fed into the softmax function to extract CIDs. We evaluated the model on the CDR corpus provided by BioCreative V. The proposed model, without external resources, outperforms other state-of-the-art models (it achieves F1-scores of 53.5%, 70% and 63.7% on the inter-sentential, intra-sentential and overall CDR dataset, respectively). The experimental results indicate that cross2self-attention, the Gaussian probability distribution and DocR-BERT can effectively improve the CID extraction performance. Furthermore, the mutual semantic information learned by cross2self-attention from the abstract towards the title can significantly influence the extraction performance of document-level biomedical relation extraction tasks.
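The Gaussian weighting of sentences described above can be sketched as follows. This is an illustrative assumption rather than the authors' formulation: each sentence is weighted by a Gaussian of its distance from the co-occurrence sentence, and the weights are normalized to sum to one. The function name, the use of sentence indices as positions and the `sigma` value are all hypothetical.

```python
import math


def gaussian_sentence_weights(sentence_indices, center, sigma=1.0):
    """Weight sentences by a Gaussian centered on the co-occurrence sentence.

    sentence_indices : positions of the sentences that mention the entity
    center           : position of the co-occurrence sentence
    sigma            : hypothetical spread of the Gaussian (assumed value)
    """
    # Unnormalized Gaussian density for each sentence position.
    raw = [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
        for i in sentence_indices
    ]
    total = sum(raw)
    # Normalize so the weights form a distribution over the sentences.
    return [w / total for w in raw]
```

Under this scheme the co-occurrence sentence receives the largest weight, and sentences at equal distance on either side receive equal weights.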