1
|
Gao Y, Zhou Q, Luo J, Xia C, Zhang Y, Yue Z. Crop-GPA: an integrated platform of crop gene-phenotype associations. NPJ Syst Biol Appl 2024; 10:15. [PMID: 38346982 PMCID: PMC10861494 DOI: 10.1038/s41540-024-00343-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2023] [Accepted: 01/22/2024] [Indexed: 02/15/2024] Open
Abstract
With the increasing availability of large-scale biology data in crop plants, there is an urgent demand for a versatile platform that fully mines and utilizes the data for modern molecular breeding. We present Crop-GPA ( https://crop-gpa.aielab.net ), a comprehensive and functional open-source platform for crop gene-phenotype association data. The current Crop-GPA provides well-curated information on genes, phenotypes, and their associations (GPAs) to researchers through an intuitive interface, dynamic graphical visualizations, and efficient online tools. Two computational tools, GPA-BERT and GPA-GCN, are specifically developed and integrated into Crop-GPA, facilitating the automatic extraction of gene-phenotype associations from bio-crop literature and predicting unknown relations based on known associations. Through usage examples, we demonstrate how our platform enables the exploration of complex correlations between genes and phenotypes in crop plants. In summary, Crop-GPA serves as a valuable multi-functional resource, empowering the crop research community to gain deeper insights into the biological mechanisms of interest.
Collapse
Affiliation(s)
- Yujia Gao
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China
| | - Qian Zhou
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China
| | - Jiaxin Luo
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China
| | - Chuan Xia
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China
| | - Youhua Zhang
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China.
| | - Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, Anhui, 230036, China.
| |
Collapse
|
2
|
Collins C, Baker S, Brown J, Zheng H, Chan A, Stenius U, Narita M, Korhonen A. Text mining for contexts and relationships in cancer genomics literature. Bioinformatics 2024; 40:btae021. [PMID: 38258418 PMCID: PMC10822582 DOI: 10.1093/bioinformatics/btae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2023] [Revised: 09/27/2023] [Accepted: 01/15/2024] [Indexed: 01/24/2024] Open
Abstract
MOTIVATION Scientific advances build on the findings of existing research. The 2001 publication of the human genome has led to the production of huge volumes of literature exploring the context-specific functions and interactions of genes. Technology is needed to perform large-scale text mining of research papers to extract the reported actions of genes in specific experimental contexts and cell states, such as cancer, thereby facilitating the design of new therapeutic strategies. RESULTS We present a new corpus and Text Mining methodology that can accurately identify and extract the most important details of cancer genomics experiments from biomedical texts. We build a Named Entity Recognition model that accurately extracts relevant experiment details from PubMed abstract text, and a second model that identifies the relationships between them. This system outperforms earlier models and enables the analysis of gene function in diverse and dynamically evolving experimental contexts. AVAILABILITY AND IMPLEMENTATION Code and data are available here: https://github.com/cambridgeltl/functional-genomics-ie.
Collapse
Affiliation(s)
- Charlotte Collins
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Simon Baker
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Jason Brown
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Huiyuan Zheng
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
| | - Adelyne Chan
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
| | - Ulla Stenius
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
| | - Masashi Narita
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
| | - Anna Korhonen
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| |
Collapse
|
3
|
Zeibich R, Kwan P, J. O’Brien T, Perucca P, Ge Z, Anderson A. Applications for Deep Learning in Epilepsy Genetic Research. Int J Mol Sci 2023; 24:14645. [PMID: 37834093 PMCID: PMC10572791 DOI: 10.3390/ijms241914645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 09/11/2023] [Accepted: 09/21/2023] [Indexed: 10/15/2023] Open
Abstract
Epilepsy is a group of brain disorders characterised by an enduring predisposition to generate unprovoked seizures. Fuelled by advances in sequencing technologies and computational approaches, more than 900 genes have now been implicated in epilepsy. The development and optimisation of tools and methods for analysing the vast quantity of genomic data is a rapidly evolving area of research. Deep learning (DL) is a subset of machine learning (ML) that brings opportunity for novel investigative strategies that can be harnessed to gain new insights into the genomic risk of people with epilepsy. DL is being harnessed to address limitations in accuracy of long-read sequencing technologies, which improve on short-read methods. Tools that predict the functional consequence of genetic variation can represent breaking ground in addressing critical knowledge gaps, while methods that integrate independent but complimentary data enhance the predictive power of genetic data. We provide an overview of these DL tools and discuss how they may be applied to the analysis of genetic data for epilepsy research.
Collapse
Affiliation(s)
- Robert Zeibich
- Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia; (R.Z.); (P.K.); (T.J.O.); (P.P.)
| | - Patrick Kwan
- Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia; (R.Z.); (P.K.); (T.J.O.); (P.P.)
- Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
- Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Terence J. O’Brien
- Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia; (R.Z.); (P.K.); (T.J.O.); (P.P.)
- Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
- Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
| | - Piero Perucca
- Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia; (R.Z.); (P.K.); (T.J.O.); (P.P.)
- Department of Neurology, Alfred Health, Melbourne, VIC 3004, Australia
- Department of Neurology, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
- Epilepsy Research Centre, Department of Medicine, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
- Bladin-Berkovic Comprehensive Epilepsy Program, Department of Neurology, Austin Health, The University of Melbourne, Melbourne, VIC 3084, Australia
| | - Zongyuan Ge
- Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia;
- Monash-Airdoc Research, Monash University, Melbourne, VIC 3800, Australia
| | - Alison Anderson
- Department of Neuroscience, Central Clinical School, Monash University, Melbourne, VIC 3800, Australia; (R.Z.); (P.K.); (T.J.O.); (P.P.)
- Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, VIC 3052, Australia
| |
Collapse
|
4
|
Li X, Yuan H, Wu X, Wang C, Wu M, Shi H, Lv Y. MultiDS-MDA: Integrating multiple data sources into heterogeneous network for predicting novel metabolite-drug associations. Comput Biol Med 2023; 162:107067. [PMID: 37276756 DOI: 10.1016/j.compbiomed.2023.107067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Revised: 05/15/2023] [Accepted: 05/27/2023] [Indexed: 06/07/2023]
Abstract
Metabolic processes in the human body play an important role in maintaining normal life activities, and the abnormal concentration of metabolites is closely related to the occurrence and development of diseases. The use of drugs is considered to have a major impact on metabolism, and drug metabolites can contribute to efficacy, drug toxicity and drug-drug interaction. However, our understanding of metabolite-drug associations is far from complete, and individual data source tends to be incomplete and noisy. Therefore, the integration of various types of data sources for inferring reliable metabolite-drug associations is urgently needed. In this study, we proposed a computational framework, MultiDS-MDA, for identifying metabolite-drug associations by integrating multiple data sources, including chemical structure information of metabolites and drugs, the relationships of metabolite-gene, metabolite-disease, drug-gene and drug-disease, the data of gene ontology (GO) and disease ontology (DO) and known metabolite-drug connections. The performance of MultiDS-MDA was evaluated by 5-fold cross-validation, which achieved an area under the ROC curve (AUROC) of 0.911 and an area under the precision-recall curve (AUPRC) of 0.907. Additionally, MultiDS-MDA showed outstanding performance compared with similar approaches. Case studies for three metabolites (cholesterol, thromboxane B2 and coenzyme Q10) and three drugs (simvastatin, pravastatin and morphine) also demonstrated the reliability and efficiency of MultiDS-MDA, and it is anticipated that MultiDS-MDA will serve as a powerful tool for future exploration of metabolite-drug interactions and contribute to drug development and drug combination.
Collapse
Affiliation(s)
- Xiuhong Li
- College of Bioinformatics Science and Technology, Harbin Medical University, China
| | - Hao Yuan
- College of Bioinformatics Science and Technology, Harbin Medical University, China
| | - Xiaoliang Wu
- College of Bioinformatics Science and Technology, Harbin Medical University, China
| | - Chengyi Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, China
| | - Meitao Wu
- College of Bioinformatics Science and Technology, Harbin Medical University, China
| | - Hongbo Shi
- College of Bioinformatics Science and Technology, Harbin Medical University, China.
| | - Yingli Lv
- College of Bioinformatics Science and Technology, Harbin Medical University, China.
| |
Collapse
|
5
|
Jadhav A, Kumar T, Raghavendra M, Loganathan T, Narayanan M. Predicting cross-tissue hormone-gene relations using balanced word embeddings. Bioinformatics 2022; 38:4771-4781. [PMID: 36000859 PMCID: PMC9563690 DOI: 10.1093/bioinformatics/btac578] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2022] [Revised: 07/29/2022] [Accepted: 08/23/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Inter-organ/inter-tissue communication is central to multi-cellular organisms including humans, and mapping inter-tissue interactions can advance system-level whole-body modeling efforts. Large volumes of biomedical literature have fostered studies that map within-tissue or tissue-agnostic interactions, but literature-mining studies that infer inter-tissue relations, such as between hormones and genes are solely missing. RESULTS We present a first study to predict from biomedical literature the hormone-gene associations mediating inter-tissue signaling in the human body. Our BioEmbedS* models use neural network-based Biomedical word Embeddings with a Support Vector Machine classifier to predict if a hormone-gene pair is associated or not, and whether an associated gene is involved in the hormone's production or response. Model training relies on our unified dataset Hormone-Gene version 1 of ground-truth associations between genes and endocrine hormones, which we compiled and carefully balanced in the embedded space to handle data disparities, such as between poorly- versus well-studied hormones. Our BioEmbedS model recapitulates known gene mediators of tissue-tissue signaling with 70.4% accuracy; predicts novel inter-tissue communication genes in humans, which are enriched for hormone-related disorders; and generalizes well to mouse, thereby holding promise for its extension to other multi-cellular organisms as well. AVAILABILITY AND IMPLEMENTATION Freely available at https://cross-tissue-signaling.herokuapp.com are our model predictions & datasets; https://github.com/BIRDSgroup/BioEmbedS has all relevant code. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Aditya Jadhav
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
| | - Tarun Kumar
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| | - Mohit Raghavendra
- Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
| | - Tamizhini Loganathan
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| | - Manikandan Narayanan
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| |
Collapse
|
6
|
Feng B, Gao J. AnthraxKP: a knowledge graph-based, Anthrax Knowledge Portal mined from biomedical literature. Database (Oxford) 2022; 2022:6598946. [PMID: 35653350 PMCID: PMC9216567 DOI: 10.1093/database/baac037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Revised: 04/13/2022] [Accepted: 05/13/2022] [Indexed: 11/15/2022]
Abstract
Abstract
Anthrax is a zoonotic infectious disease caused by Bacillus anthracis (anthrax bacterium) that affects not only domestic and wild animals worldwide but also human health. As the study develops in-depth, a large quantity of related biomedical publications emerge. Acquiring knowledge from the literature is essential for gaining insight into anthrax etiology, diagnosis, treatment and research. In this study, we used a set of text mining tools to identify nearly 14 000 entities of 29 categories, such as genes, diseases, chemicals, species, vaccines and proteins, from nearly 8000 anthrax biomedical literature and extracted 281 categories of association relationships among the entities. We curated Anthrax-related Entities Dictionary and Anthrax Ontology. We formed Anthrax Knowledge Graph (AnthraxKG) containing more than 6000 nodes, 6000 edges and 32 000 properties. An interactive visualized Anthrax Knowledge Portal(AnthraxKP) was also developed based on AnthraxKG by using Web technology. AnthraxKP in this study provides rich and authentic relevant knowledge in many forms, which can help researchers carry out research more efficiently.
Database URL: AnthraxKP is permitted users to query and download data at http://139.224.212.120:18095/.
Collapse
Affiliation(s)
- Baiyang Feng
- College of Computer and Information Engineering, Inner Mongolia Agricultural University , Erdos East Street No. 29, Hohhot 010011, China
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry , Zhaowuda Road No. 306, Hohhot 010018, China
| | - Jing Gao
- College of Computer and Information Engineering, Inner Mongolia Agricultural University , Erdos East Street No. 29, Hohhot 010011, China
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry , Zhaowuda Road No. 306, Hohhot 010018, China
- Inner Mongolia Autonomous Region Big Data Center , Chilechuan Street No. 1, Hohhot 010091, China
| |
Collapse
|
7
|
Krämer A, Green J, Billaud JN, Pasare NA, Jones M, Tugendreich S. Mining hidden knowledge: embedding models of cause-effect relationships curated from the biomedical literature. BIOINFORMATICS ADVANCES 2022; 2:vbac022. [PMID: 36699407 PMCID: PMC9710590 DOI: 10.1093/bioadv/vbac022] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/05/2022] [Revised: 03/04/2022] [Accepted: 04/06/2022] [Indexed: 01/28/2023]
Abstract
Motivation We explore the use of literature-curated signed causal gene expression and gene-function relationships to construct unsupervised embeddings of genes, biological functions and diseases. Our goal is to prioritize and predict activating and inhibiting functional associations of genes and to discover hidden relationships between functions. As an application, we are particularly interested in the automatic construction of networks that capture relevant biology in a given disease context. Results We evaluated several unsupervised gene embedding models leveraging literature-curated signed causal gene expression findings. Using linear regression, we show that, based on these gene embeddings, gene-function relationships can be predicted with about 95% precision for the highest scoring genes. Function embedding vectors, derived from parameters of the linear regression model, allow inference of relationships between different functions or diseases. We show for several diseases that gene and function embeddings can be used to recover key drivers of pathogenesis, as well as underlying cellular and physiological processes. These results are presented as disease-centric networks of genes and functions. To illustrate the applicability of our approach to other machine learning tasks, we also computed embeddings for drug molecules, which were then tested using a simple neural network to predict drug-disease associations. Availability and implementation Python implementations of the gene and function embedding algorithms operating on a subset of our literature-curated content as well as other code used for this paper are made available as part of the Supplementary data. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
| | - Jeff Green
- QIAGEN Digital Insights, Redwood City, CA 94063, USA
| | | | | | - Martin Jones
- QIAGEN Digital Insights, Redwood City, CA 94063, USA
| | | |
Collapse
|
8
|
Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:life12020279. [PMID: 35207566 PMCID: PMC8875522 DOI: 10.3390/life12020279] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Revised: 01/26/2022] [Accepted: 02/09/2022] [Indexed: 12/13/2022] Open
Abstract
Polygenic diseases, which are genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, greater personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics; provides a brief overview of AI; and identifies the current applications, limitations, and future directions of AI in genomics.
Collapse
|
9
|
Birgmeier J, Haeussler M, Deisseroth CA, Steinberg EH, Jagadeesh KA, Ratner AJ, Guturu H, Wenger AM, Diekhans ME, Stenson PD, Cooper DN, Ré C, Beggs AH, Bernstein JA, Bejerano G. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci Transl Med 2021; 12:12/544/eaau9113. [PMID: 32434849 DOI: 10.1126/scitranslmed.aau9113] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2018] [Revised: 08/14/2019] [Accepted: 04/22/2020] [Indexed: 12/21/2022]
Abstract
The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.
Collapse
Affiliation(s)
- Johannes Birgmeier
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Maximilian Haeussler
- Santa Cruz Genomics Institute, MS CBSE, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Cole A Deisseroth
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Ethan H Steinberg
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Karthik A Jagadeesh
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Alexander J Ratner
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Harendra Guturu
- Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA
| | - Aaron M Wenger
- Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA
| | - Mark E Diekhans
- Santa Cruz Genomics Institute, MS CBSE, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Peter D Stenson
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - Christopher Ré
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Alan H Beggs
- Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children's Hospital, Harvard Medical School, Boston, MA 02115, USA
| | | | - Gill Bejerano
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA. .,Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA.,Department of Developmental Biology, Stanford University, Stanford, CA 94305, USA.,Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
10
|
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
The recent years have witnessed a rapid increase in the number of scientific articles in biomedical domain. These literature are mostly available and readily accessible in electronic format. The domain knowledge hidden in them is critical for biomedical research and applications, which makes biomedical literature mining (BLM) techniques highly demanding. Numerous efforts have been made on this topic from both biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on the concrete application problems and thus prefer more interpretable and descriptive methods, while the CS community chases more on superior performance and generalization ability, thus more sophisticated and universal models are developed. The goal of this paper is to provide a review of the recent advances in BLM from both communities and inspire new research directions.
Collapse
Affiliation(s)
- Sendong Zhao
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
| | - Chang Su
- Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institute of Health, Bethesda, MD, USA
| | - Fei Wang
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
| |
Collapse
|
11
|
Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network. Sci Rep 2021; 11:1696. [PMID: 33462256 PMCID: PMC7813825 DOI: 10.1038/s41598-020-80441-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Accepted: 12/17/2020] [Indexed: 11/17/2022] Open
Abstract
The increased diversity and scale of published biological data has to led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely: the continuous bag of words (CBOW), and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with an average precision, recall rate, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4% respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database as an independent testing dataset that stores protein subcellular localisation in crop species, demonstrating wide applicability of prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
Collapse
|
12
|
Fu Y, Xu J, Tang Z, Wang L, Yin D, Fan Y, Zhang D, Deng F, Zhang Y, Zhang H, Wang H, Xing W, Yin L, Zhu S, Zhu M, Yu M, Li X, Liu X, Yuan X, Zhao S. A gene prioritization method based on a swine multi-omics knowledgebase and a deep learning model. Commun Biol 2020; 3:502. [PMID: 32913254 PMCID: PMC7483748 DOI: 10.1038/s42003-020-01233-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2020] [Accepted: 08/07/2020] [Indexed: 12/27/2022] Open
Abstract
The analyses of multi-omics data have revealed candidate genes for objective traits. However, they are integrated poorly, especially in non-model organisms, and they pose a great challenge for prioritizing candidate genes for follow-up experimental verification. Here, we present a general convolutional neural network model that integrates multi-omics information to prioritize the candidate genes of objective traits. By applying this model to Sus scrofa, which is a non-model organism, but one of the most important livestock animals, the model precision was 72.9%, recall 73.5%, and F1-Measure 73.4%, demonstrating a good prediction performance compared with previous studies in Arabidopsis thaliana and Oryza sativa. Additionally, to facilitate the use of the model, we present ISwine (http://iswine.iomics.pro/), which is an online comprehensive knowledgebase in which we incorporated almost all the published swine multi-omics data. Overall, the results suggest that the deep learning strategy will greatly facilitate analyses of multi-omics integration in the future. Yuhua Fu et al. develop a CNN model that integrates multi-omics information to prioritize candidate genes of objective traits. Their model performs well when applied to important livestock non-model animals like Sus scrofa. Finally, the authors present ISwine, an online comprehensive knowledgebase which includes all published swine omics data to facilitate the integration of heterogeneous data.
Collapse
Affiliation(s)
- Yuhua Fu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China.,School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Jingya Xu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Zhenshuang Tang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Lu Wang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Dong Yin
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Yu Fan
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Dongdong Zhang
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Fei Deng
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Yanping Zhang
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Haohao Zhang
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Haiyan Wang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Wenhui Xing
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China
| | - Lilin Yin
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Shilin Zhu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Mengjin Zhu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Mei Yu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Xinyun Li
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China
| | - Xiaolei Liu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China.
| | - Xiaohui Yuan
- School of Computer Science and Technology, Wuhan University of Technology, 430070, Wuhan, Hubei, P.R. China.
| | - Shuhong Zhao
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education, Key Laboratory of Swine Genetics and Breeding, Ministry of Agriculture, College of Animal Science and Technology, Huazhong Agricultural University, 430070, Wuhan, Hubei, P.R. China.
| |
Collapse
|
13
|
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), allowing, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable for easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Collapse
Affiliation(s)
- Nadeesha Perera
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
| | - Matthias Dehmer
- Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
| |
Collapse
|
14
|
Ju M, Short AD, Thompson P, Bakerly ND, Gkoutos GV, Tsaprouni L, Ananiadou S. Annotating and detecting phenotypic information for chronic obstructive pulmonary disease. JAMIA Open 2020; 2:261-271. [PMID: 31984360 PMCID: PMC6951876 DOI: 10.1093/jamiaopen/ooz009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2018] [Revised: 02/21/2019] [Accepted: 03/19/2019] [Indexed: 12/29/2022] Open
Abstract
Objectives Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information. Materials and methods Since COPD phenotype descriptions often mention other concepts within them (proteins, treatments, etc.), our corpus annotations include both outermost phenotype descriptions and concepts nested within them. Our neural layered bidirectional long short-term memory conditional random field (BiLSTM-CRF) network firstly recognizes nested mentions, which are fed into subsequent BiLSTM-CRF layers, to help to recognize enclosing phenotype mentions. Results Our corpus of 30 full papers (available at: http://www.nactem.ac.uk/COPD) is annotated by experts with 27 030 phenotype-related concept mentions, most of which are automatically linked to UMLS Metathesaurus concepts. When trained using the corpus, our BiLSTM-CRF network outperforms other popular approaches in recognizing detailed phenotypic information. Discussion Information extracted by our method can facilitate efficient location and exploration of detailed information about phenotypes, for example, those specifically concerning reactions to treatments. Conclusion The importance of our corpus for developing methods to extract fine-grained information about COPD phenotypes is demonstrated through its successful use to train a layered BiLSTM-CRF network to extract phenotypic information at various levels of granularity. The minimal human intervention needed for training should permit ready adaption to extracting phenotypic information about other diseases.
Collapse
Affiliation(s)
- Meizhi Ju
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Andrea D Short
- Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK
| | - Paul Thompson
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Nawar Diar Bakerly
- Salford Royal NHS Foundation Trust; and School of Health Sciences, The University of Manchester, Manchester, UK
| | - Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,MRC Health Data Research UK (HDR UK).,NIHR Experimental Cancer Medicine Centre, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, UK.,NIHR Biomedical Research Centre, Birmingham, UK
| | - Loukia Tsaprouni
- School of Health Sciences, Centre for Life and Sport Sciences, Birmingham City University, Birmingham, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| |
Collapse
|
15
|
Braun IR, Lawrence-Dill CJ. Automated Methods Enable Direct Computation on Phenotypic Descriptions for Novel Candidate Gene Prediction. FRONTIERS IN PLANT SCIENCE 2020; 10:1629. [PMID: 31998331 PMCID: PMC6965352 DOI: 10.3389/fpls.2019.01629] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 11/19/2019] [Indexed: 06/01/2023]
Abstract
Natural language descriptions of plant phenotypes are a rich source of information for genetics and genomics research. We computationally translated descriptions of plant phenotypes into structured representations that can be analyzed to identify biologically meaningful associations. These representations include the entity-quality (EQ) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically rich format, as well as numerical vector representations generated using natural language processing (NLP) methods (such as the bag-of-words approach and document embedding). We compared resulting phenotype similarity measures to those derived from manually curated data to determine the performance of each method. Computationally derived EQ and vector representations were comparably successful in recapitulating biological truth to representations created through manual EQ statement curation. Moreover, NLP methods for generating vector representations of phenotypes are scalable to large quantities of text because they require no human input. These results indicate that it is now possible to computationally and automatically produce and populate large-scale information resources that enable researchers to query phenotypic descriptions directly.
Collapse
Affiliation(s)
- Ian R. Braun
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, United States
- Interdepartmental Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
| | - Carolyn J. Lawrence-Dill
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, United States
- Interdepartmental Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
- Department of Agronomy, Iowa State University, Ames, IA, United States
| |
Collapse
|
16
|
Xu B, Liu Y, Yu S, Wang L, Dong J, Lin H, Yang Z, Wang J, Xia F. A network embedding model for pathogenic genes prediction by multi-path random walking on heterogeneous network. BMC Med Genomics 2019; 12:188. [PMID: 31865919 PMCID: PMC6927107 DOI: 10.1186/s12920-019-0627-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Prediction of pathogenic genes is crucial for disease prevention, diagnosis, and treatment. But traditional genetic localization methods are often technique-difficulty and time-consuming. With the development of computer science, computational biology has gradually become one of the main methods for finding candidate pathogenic genes. METHODS We propose a pathogenic genes prediction method based on network embedding which is called Multipath2vec. Firstly, we construct an heterogeneous network which is called GP-network. It is constructed based on three kinds of relationships between genes and phenotypes, including correlations between phenotypes, interactions between genes and known gene-phenotype pairs. Then in order to embedding the network better, we design the multi-path to guide random walk in GP-network. The multi-path includes multiple paths between genes and phenotypes which can capture complex structural information of heterogeneous network. Finally, we use the learned vector representation of each phenotype and protein to calculate the similarities and rank according to the similarities between candidate genes and the target phenotype. RESULTS We implemented Multipath2vec and four baseline approaches (i.e., CATAPULT, PRINCE, Deepwalk and Metapath2vec) on many-genes gene-phenotype data, single-gene gene-phenotype data and whole gene-phenotype data. Experimental results show that Multipath2vec outperformed the state-of-the-art baselines in pathogenic genes prediction task. CONCLUSIONS We propose Multipath2vec that can be utilized to predict pathogenic genes and experimental results show the higher accuracy of pathogenic genes prediction.
Collapse
Affiliation(s)
- Bo Xu
- School of Software, Dalian University of Technology, Dalian, 116000 China
- Key Laboratory for Ubiquitous Network and Service Software of Liaoning, Dalian, 116000 China
| | - Yu Liu
- School of Software, Dalian University of Technology, Dalian, 116000 China
| | - Shuo Yu
- School of Software, Dalian University of Technology, Dalian, 116000 China
| | - Lei Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Jie Dong
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Feng Xia
- School of Software, Dalian University of Technology, Dalian, 116000 China
- Key Laboratory for Ubiquitous Network and Service Software of Liaoning, Dalian, 116000 China
| |
Collapse
|
17
|
Heo GE, Xie Q, Song M, Lee JH. Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer's disease. BMC Med Inform Decis Mak 2019; 19:240. [PMID: 31801521 PMCID: PMC6894106 DOI: 10.1186/s12911-019-0934-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022] Open
Abstract
Background Extracting useful information from biomedical literature plays an important role in the development of modern medicine. In natural language processing, there have been rigorous attempts to find meaningful relationships between entities automatically by co-occurrence-based methods. It has been increasingly important to understand whether relationships exist, and if so how strong, between any two entities extracted from a large number of texts. One of the defining methods is to measure semantic similarity and relatedness between two entities. Methods We propose a hybrid ranking method that combines a co-occurrence approach considering both direct and indirect entity pair relationship with specialized word embeddings for measuring the relatedness of two entities. Results We evaluate the proposed ranking method comparatively with other well-known methods such as co-occurrence, Word2Vec, COALS (Correlated Occurrence Analog to Lexical Semantics), and random indexing by calculating top-ranked entities related to Alzheimer’s disease. In addition, we analyze gene, pathway, and gene–phenotype relationships. Overall, the proposed method tends to find more hidden relationships than the other methods. Conclusion Our proposed method is able to select more useful related entities that not only highly co-occur but also have more indirect relations for the target entity. In pathway analysis, our proposed method shows superior performance at identifying (functional) cross clustering and higher-level pathways. Our proposed method, resulting from phenotype analysis, has an advantage in identifying the common genotype relating to phenotypes from biological literature.
Collapse
Affiliation(s)
- Go Eun Heo
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
| | - Qing Xie
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
| | - Min Song
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea.
| | - Jeong-Hoon Lee
- Department of Creative IT Engineering, POSTECH, 77 Cheongam-ro Nam-gu, Pohang, Gyeongbuk, 37673, Republic of Korea
| |
Collapse
|
18
|
Tsueng G, Nanis M, Fouquier JT, Mayers M, Good BM, Su AI. Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts. Bioinformatics 2019; 36:1226-1233. [PMID: 31504205 PMCID: PMC8104067 DOI: 10.1093/bioinformatics/btz678] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2019] [Revised: 08/05/2019] [Accepted: 08/29/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. To mine valuable inferences from the large volume of literature, many researchers use information extraction algorithms to harvest information in biomedical texts. Information extraction is usually accomplished via a combination of manual expert curation and computational methods. Advances in computational methods usually depend on the time-consuming generation of gold standards by a limited number of expert curators. Citizen science is public participation in scientific research. We previously found that citizen scientists are willing and capable of performing named entity recognition of disease mentions in biomedical abstracts, but did not know if this was true with relationship extraction (RE). RESULTS In this article, we introduce the Relationship Extraction Module of the web-based application Mark2Cure (M2C) and demonstrate that citizen scientists can perform RE. We confirm the importance of accurate named entity recognition on user performance of RE and identify design issues that impacted data quality. We find that the data generated by citizen scientists can be used to identify relationship types not currently available in the M2C Relationship Extraction Module. We compare the citizen science-generated data with algorithm-mined data and identify ways in which the two approaches may complement one another. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration and natural language processing. AVAILABILITY AND IMPLEMENTATION Mark2Cure platform: https://mark2cure.org; Mark2Cure source code: https://github.com/sulab/mark2cure; and data and analysis code for this article: https://github.com/gtsueng/M2C_rel_nb. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Max Nanis
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| | - Jennifer T Fouquier
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| | - Michael Mayers
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| | - Benjamin M Good
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| | - Andrew I Su
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| |
Collapse
|
19
|
Kafkas Ş, Hoehndorf R. Ontology based text mining of gene-phenotype associations: application to candidate gene prediction. Database (Oxford) 2019; 2019:baz019. [PMID: 30809638 PMCID: PMC6391585 DOI: 10.1093/database/baz019] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2018] [Revised: 01/09/2019] [Accepted: 01/26/2019] [Indexed: 01/07/2023]
Abstract
Gene-phenotype associations play an important role in understanding the disease mechanisms which is a requirement for treatment development. A portion of gene-phenotype associations are observed mainly experimentally and made publicly available through several standard resources such as MGI. However, there is still a vast amount of gene-phenotype associations buried in the biomedical literature. Given the large amount of literature data, we need automated text mining tools to alleviate the burden in manual curation of gene-phenotype associations and to develop comprehensive resources. In this study, we present an ontology-based approach in combination with statistical methods to text mine gene-phenotype associations from the literature. Our method achieved AUC values of 0.90 and 0.75 in recovering known gene-phenotype associations from HPO and MGI respectively. We posit that candidate genes and their relevant diseases should be expressed with similar phenotypes in publications. Thus, we demonstrate the utility of our approach by predicting disease candidate genes based on the semantic similarities of phenotypes associated with genes and diseases. To the best of our knowledge, this is the first study using an ontology based approach to extract gene-phenotype associations from the literature. We evaluated our disease candidate prediction model on the gene-disease associations from MGI. Our model achieved AUC values of 0.90 and 0.87 on OMIM (human) and MGI (mouse) datasets of gene-disease associations respectively. Our manual analysis on the text mined data revealed that our method can accurately extract gene-phenotype associations which are not currently covered by the existing public gene-phenotype resources. Overall, results indicate that our method can precisely extract known as well as new gene-phenotype associations from literature. All the data and methods are available at https://github.com/bio-ontology-research-group/genepheno.
Collapse
Affiliation(s)
- Şenay Kafkas
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
| |
Collapse
|
20
|
Acharya A, Li S, Liu X, Pelekos G, Ziebolz D, Mattheos N. Biological links in periodontitis and rheumatoid arthritis: Discovery via text‐mining PubMed abstracts. J Periodontal Res 2018; 54:318-328. [DOI: 10.1111/jre.12632] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2018] [Revised: 11/01/2018] [Accepted: 11/18/2018] [Indexed: 12/18/2022]
Affiliation(s)
- Aneesha Acharya
- Faculty of DentistryThe University of Hong Kong Sai Yin Pun Hong Kong
- Department of PeriodontologyDr. D.Y. Patil Vidyapeeth Pune India
| | - Simin Li
- Department of Cariology, Endodontology, and PeriodontologyUniversity Leipzig Liebigstr Germany
| | - Xiangqiong Liu
- Shanghai Genomap Technologies Shanghai China
- College of Bioinformatics Science and TechnologyHarbin Medical University Harbin China
| | - George Pelekos
- Faculty of DentistryThe University of Hong Kong Sai Yin Pun Hong Kong
| | - Dirk Ziebolz
- Department of Cariology, Endodontology, and PeriodontologyUniversity Leipzig Liebigstr Germany
| | - Nikos Mattheos
- Faculty of DentistryThe University of Hong Kong Sai Yin Pun Hong Kong
| |
Collapse
|