1
Mastropietro A, De Carlo G, Anagnostopoulos A. XGDAG: explainable gene-disease associations via graph neural networks. Bioinformatics 2023; 39:btad482. [PMID: 37531293 PMCID: PMC10421968 DOI: 10.1093/bioinformatics/btad482] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/11/2023] [Revised: 06/27/2023] [Accepted: 08/01/2023] [Indexed: 08/04/2023] Open
Abstract
MOTIVATION Disease gene prioritization consists of identifying genes that are likely to be involved in the mechanisms of a given disease and providing a ranking of such genes. Recently, the research community has used computational methods to uncover unknown gene-disease associations; these methods range from combinatorial to machine learning-based approaches. In particular, in recent years, approaches based on deep learning have provided superior results compared to more traditional ones. Their main drawback, however, is their inherent black-box structure, which prevents interpretability. RESULTS We propose a new methodology for disease gene discovery, which leverages graph-structured data using graph neural networks (GNNs) along with an explainability phase for determining the ranking of candidate genes and understanding the model's output. Our approach is based on a positive-unlabeled learning strategy, which outperforms existing gene discovery methods by exploiting GNNs in a non-black-box fashion. Our methodology is effective even in scenarios where a large number of associated genes need to be retrieved, in which gene prioritization methods often tend to lose their reliability. AVAILABILITY AND IMPLEMENTATION The source code of XGDAG is available on GitHub at: https://github.com/GiDeCarlo/XGDAG. The data underlying this article are available at: https://www.disgenet.org/, https://thebiogrid.org/, https://doi.org/10.1371/journal.pcbi.1004120.s003, and https://doi.org/10.1371/journal.pcbi.1004120.s004.
Affiliation(s)
- Andrea Mastropietro
  - Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
- Gianluca De Carlo
  - Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
- Aris Anagnostopoulos
  - Department of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome, Rome 00185, Italy
2
Stolfi P, Mastropietro A, Pasculli G, Tieri P, Vergni D. NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. Bioinformatics 2023; 39:7023926. [PMID: 36727493 PMCID: PMC9933847 DOI: 10.1093/bioinformatics/btac848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Revised: 12/23/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning (ML) setting in which only a subset of instances are labeled as positive while the rest of the dataset is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. RESULTS The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on 10 different disease datasets using three ML algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms. AVAILABILITY AND IMPLEMENTATION The source code of NIAPU can be accessed at https://github.com/AndMastro/NIAPU. The source data used in this study are available online on the respective websites. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
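The idea of diffusion-based labeling in a positive-unlabeled setting can be illustrated with a small sketch. This is not the NIAPU algorithm: it assumes a generic random walk with restart from the known positive genes, and the function names, the toy graph, the `restart` value, and the class names `U1`..`U3` are all hypothetical placeholders chosen for illustration.

```python
def diffuse_scores(adjacency, seeds, restart=0.3, iterations=50):
    """Random walk with restart from the seed (known positive) nodes."""
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the known positive genes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                continue
            # Each node passes its remaining mass equally to its neighbours.
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        p = nxt
    return p


def pseudo_label(scores, seeds, n_bins=3):
    """Split unlabeled nodes into n_bins ordered classes by diffusion score."""
    unlabeled = sorted((n for n in scores if n not in seeds),
                       key=lambda n: -scores[n])
    labels = {n: "P" for n in seeds}
    for rank, n in enumerate(unlabeled):
        # U1 holds the unlabeled nodes closest to the positives.
        labels[n] = f"U{rank * n_bins // len(unlabeled) + 1}"
    return labels


# Toy gene network (hypothetical): A is the only known disease gene.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
seeds = {"A"}
scores = diffuse_scores(adjacency, seeds)
labels = pseudo_label(scores, seeds)
```

The resulting multi-class labels could then serve as training targets for any standard classifier, which is the general role a labeling step plays in a positive-unlabeled pipeline.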
Affiliation(s)
- Paola Stolfi
  - Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
- Andrea Mastropietro
  - Department of Computer, Control and Management Engineering (DIAG) 'Antonio Ruberti', Sapienza University of Rome, Rome 00185, Italy
- Giuseppe Pasculli
  - Department of Computer, Control and Management Engineering (DIAG) 'Antonio Ruberti', Sapienza University of Rome, Rome 00185, Italy
- Paolo Tieri
  - Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
- Davide Vergni
  - Institute for Applied Computing (IAC) 'Mauro Picone', National Research Council of Italy (CNR), Rome 00185, Italy
3
Chaiben CL, Macedo NF, Batista TBD, Penteado CAS, Ventura TMO, Dionizio A, Souza PHC, Buzalaf MAR, Azevedo-Alanis LR. Salivary protein candidates for biomarkers of oral disorders in people with a crack cocaine use disorder. J Appl Oral Sci 2023; 31:e20220480. [PMID: 37194792 DOI: 10.1590/1678-7757-2022-0480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Accepted: 04/06/2023] [Indexed: 05/18/2023] Open
Abstract
The use of cocaine and its main derivative, crack, can cause systemic effects that may lead to the development of oral disorders. This study aimed to assess the oral health of people with a crack cocaine use disorder and to identify salivary protein candidates for biomarkers of oral disorders. A total of 40 volunteers hospitalized for rehabilitation for crack cocaine addiction were enrolled; nine were randomly selected for proteomic analysis. Intraoral examination, report of DMFT, gingival and plaque index, xerostomia, and non-stimulated saliva collection were performed. A list of identified proteins was generated from the UniProt database and manually revised. The mean age (n=40) was 32 (±8.88; 18-51) years; the mean DMFT index was 16±7.70; the mean plaque and gingival indices were 2.07±0.65 and 2.12±0.64, respectively; and 20 (50%) volunteers reported xerostomia. We identified 305 salivary proteins (n=9), of which 23 were classified as candidate biomarkers associated with 14 oral disorders. The highest number of candidate biomarkers was associated with carcinoma of head and neck (n=7) and nasopharyngeal carcinoma (n=7), followed by periodontitis (n=6). People with a crack cocaine use disorder had an increased risk of dental caries and gingival inflammation; less than half had oral mucosal alterations, and half experienced xerostomia. As possible biomarkers for 14 oral disorders, 23 salivary proteins were identified. Oral cancer and periodontal disease were the disorders most often associated with candidate biomarkers.
Affiliation(s)
- Cassiano Lima Chaiben
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
- Nayara Flores Macedo
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
- Thiago Beltrami Dias Batista
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
- Carlos Antonio Schaffer Penteado
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
- Talita M O Ventura
  - Universidade de São Paulo, Bauru School of Dentistry, Department of Basic Sciences, Bauru, SP, Brazil
- Aline Dionizio
  - Universidade de São Paulo, Bauru School of Dentistry, Department of Basic Sciences, Bauru, SP, Brazil
- Paulo Henrique Couto Souza
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
- Luciana Reis Azevedo-Alanis
  - Pontifícia Universidade Católica do Paraná, School of Life Sciences, Graduate Program in Dentistry, Curitiba, PR, Brazil
4
López-Úbeda P, Martín-Noguerol T, Aneiros-Fernández J, Luna A. Natural Language Processing in Pathology: Current Trends and Future Insights. The American Journal of Pathology 2022; 192:1486-1495. [PMID: 35985480 DOI: 10.1016/j.ajpath.2022.07.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/07/2022] [Revised: 07/21/2022] [Accepted: 07/29/2022] [Indexed: 06/15/2023]
Abstract
Natural language processing (NLP) plays a key role in advancing health care because it enables the extraction of structured information from electronic health reports. In the last decade, several advances in the field of pathology have been derived from the application of NLP to pathology reports. Herein, a comprehensive review of the most used NLP methods for extracting, coding, and organizing information from pathology reports is presented, including how the resulting tools are used to improve workflow. In addition, this article discusses, from a practical point of view, the steps necessary to extract data and encode natural language information for analytical processing, ranging from text preprocessing to inclusion in complex algorithms. Finally, the potential of NLP-based automatic solutions for improving workflow in pathology, and their further applications in the near future, are highlighted.
Affiliation(s)
- Antonio Luna
  - MRI Unit, Radiology Department, HT Medica, Jaén, Spain
5
Wager K, Chari D, Ho S, Rees T, Penner O, Schijvenaars BJA. Identifying and Validating Networks of Oncology Biomarkers Mined From the Scientific Literature. Cancer Inform 2022; 21:11769351221086441. [PMID: 35342286 PMCID: PMC8943609 DOI: 10.1177/11769351221086441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Accepted: 02/18/2022] [Indexed: 11/17/2022] Open
Abstract
Biomarkers, as measurements of defined biological characteristics, can play a pivotal role in estimations of disease risk, early detection, differential diagnosis, assessment of disease progression and outcomes prediction. Studies of cancer biomarkers are published daily; some are well characterized, while others are of growing interest. Managing this flow of information is challenging for scientists and clinicians. We sought to develop a novel text-mining method employing biomarker co-occurrence processing applied to a deeply indexed full-text database to generate time-interval–delimited biomarker co-occurrence networks. Biomarkers across 6 cancer sites and a cancer-agnostic network were successfully characterized in terms of their emergence in the published literature and the context in which they are described. Our approach, which enables us to find publications based on biomarker relationships, identified biomarker relationships not known to existing interaction networks. This search method finds relevant literature that could be missed with keyword searches, even if full text is available. It enables users to extract relevant biological information and may provide new biological insights that could not be achieved by individual review of papers.
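A time-interval-delimited co-occurrence network of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' text-mining pipeline: it assumes each document has already been reduced to a publication year plus a list of biomarker mentions, and the function name, the five-year interval, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_by_interval(documents, interval=5):
    """Count biomarker pairs mentioned together, grouped by time interval.

    documents: iterable of (year, [biomarker, ...]) records.
    Returns {interval_start_year: Counter({(marker_a, marker_b): count})}.
    """
    networks = defaultdict(Counter)
    for year, biomarkers in documents:
        start = year - year % interval  # e.g. 2013 falls in the 2010 interval
        # Deduplicate and sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(biomarkers)), 2):
            networks[start][pair] += 1
    return networks


# Hypothetical records standing in for indexed full-text articles.
docs = [
    (2011, ["EGFR", "KRAS"]),
    (2013, ["KRAS", "EGFR", "TP53"]),
    (2017, ["TP53", "BRCA1"]),
]
networks = cooccurrence_by_interval(docs)
```

Each counter is an edge list with weights, so edges that appear only in a later interval point to relationships that emerged more recently in the literature.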
6
Satyam R, Yousef M, Qazi S, Bhat AM, Raza K. COVIDium: a COVID-19 resource compendium. Database: The Journal of Biological Databases and Curation 2021; 2021:6377761. [PMID: 34585731 PMCID: PMC8500058 DOI: 10.1093/database/baab057] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/17/2021] [Revised: 08/14/2021] [Accepted: 09/11/2021] [Indexed: 12/24/2022]
Abstract
The severe acute respiratory syndrome coronavirus 2, which causes coronavirus disease 2019 (COVID-19), has disrupted normal functioning throughout the world since early 2020 and continues to do so. Nonetheless, researchers across the globe took up the pandemic as a challenge to discover an effective cure, in the form of either a drug or a vaccine. This resulted in an unprecedented surge of experimental and computational data and publications, whose findings were often made available as databases (DBs) and tools. Over 160 such DBs and more than 80 software tools were developed; they are uncharacterized, unannotated, deployed at different uniform resource locators and hard to reach through a normal web search. Moreover, most of the DBs/tools are described in preprints and are either underutilized or unrecognized because they do not reach the top Google search hits. Hence, there was a need to crawl and characterize these DBs and create a compendium for easy referencing. The current article is one such concerted effort to create a COVID-19 resource compendium (COVIDium) that helps researchers find suitable DBs and tools for their studies. COVIDium classifies the DBs and tools into 11 broad categories for quick navigation. It also provides end-users with some generic hit terms to filter the DB entries for quick access to the resources. Additionally, the DB provides a Tracker Dashboard, Neuro Resources, references to COVID-19 datasets and protein-protein interactions. This compendium will be periodically updated to accommodate new resources. Database URL: COVIDium is accessible through http://kraza.in/covidium/.
Affiliation(s)
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Jerusalem St 11, Safed, Zefat 1320611, Israel
- Sahar Qazi
  - Department of Computer Science, Jamia Millia Islamia, Maulana Mohammad Ali Jauhar Marg, Jamia Nagar, Okhla, New Delhi 110025, India
- Adil Manzoor Bhat
  - Department of Computer Science, Jamia Millia Islamia, Maulana Mohammad Ali Jauhar Marg, Jamia Nagar, Okhla, New Delhi 110025, India
- Khalid Raza
  - Department of Computer Science, Jamia Millia Islamia, Maulana Mohammad Ali Jauhar Marg, Jamia Nagar, Okhla, New Delhi 110025, India
7
Turewicz M, Frericks-Zipper A, Stepath M, Schork K, Ramesh S, Marcus K, Eisenacher M. BIONDA: a free database for a fast information on published biomarkers. Bioinformatics Advances 2021; 1:vbab015. [PMID: 36700097 PMCID: PMC9710600 DOI: 10.1093/bioadv/vbab015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/26/2021] [Revised: 07/11/2021] [Indexed: 01/28/2023]
Abstract
Summary Because of the steadily increasing and already manually unmanageable total number of biomarker-related articles in biomedical research, there is a need for intelligent systems that extract all relevant information from biomedical texts and provide it as structured information to researchers in a user-friendly way. To address this, BIONDA was implemented as a free text mining-based online database for molecular biomarkers including genes, proteins and miRNAs and for all kinds of diseases. The contained structured information on published biomarkers is extracted automatically from Europe PMC publication abstracts and high-quality sources like UniProt and Disease Ontology. This allows frequent content updates. Availability and implementation BIONDA is freely accessible via a user-friendly web application at http://bionda.mpc.ruhr-uni-bochum.de. The current BIONDA code is available at GitHub via https://github.com/mpc-bioinformatics/bionda. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Michael Turewicz
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Anika Frericks-Zipper
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Markus Stepath
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Karin Schork
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Spoorti Ramesh
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Katrin Marcus
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
- Martin Eisenacher
  - Medizinisches Proteom-Center, Ruhr University Bochum, Bochum 44801, Germany; Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum 44801, Germany
8
Taha K, Davuluri R, Yoo P, Spencer J. Personizing the prediction of future susceptibility to a specific disease. PLoS One 2021; 16:e0243127. [PMID: 33406077 PMCID: PMC7787538 DOI: 10.1371/journal.pone.0243127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/25/2019] [Accepted: 11/17/2020] [Indexed: 01/22/2023] Open
Abstract
A traceable biomarker is a member of a disease's molecular pathway. A disease may be associated with several molecular pathways. Each combination of the molecular pathways to which detected traceable biomarkers belong may indicate that the disease will be elicited within a different future time frame. Based on this notion, we introduce a novel methodology for personalizing an individual's degree of future susceptibility to a specific disease. We implemented the methodology in a working system called Susceptibility Degree to a Disease Predictor (SDDP). For a specific disease d, let S be the set of molecular pathways to which the traceable biomarkers detected in most patients of d belong. For the same disease d, let S' be the set of molecular pathways to which the traceable biomarkers detected in a certain individual belong. SDDP is able to infer the subset S'' ⊆ {S-S'} of undetected molecular pathways for the individual. Thus, SDDP can infer undetected molecular pathways of a disease for an individual based on the few molecular pathways detected in that individual. SDDP can also help infer the combination of molecular pathways in the set {S'+S''} whose traceable biomarkers are collectively indicative of the disease. SDDP is composed of the following four components: information extractor, interrelationship between molecular pathways modeler, logic inferencer, and risk indicator. The information extractor takes advantage of the exponential increase of biomedical literature to automatically extract the common traceable biomarkers for a specific disease. The interrelationship between molecular pathways modeler models the hierarchical interrelationships between the molecular pathways of the traceable biomarkers. The logic inferencer transforms the hierarchical interrelationships between the molecular pathways into rule-based specifications. It employs the specification rules and the inference rules for predicate logic to infer as many undetected molecular pathways of a disease as possible for an individual. The risk indicator outputs a risk indicator value that reflects the individual's degree of future susceptibility to the disease. We evaluated SDDP by comparing it experimentally with other methods. Results revealed marked improvement.
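The relation between S, S' and S'' can be made concrete with a small sketch. It assumes a simple forward-chaining loop over if-then rules as a stand-in for the logic inferencer; the actual SDDP rule format is not given in the abstract, so the function name, the rule encoding, and the pathway identifiers below are hypothetical.

```python
def infer_undetected(S, S_detected, rules):
    """Forward-chain over rules to infer S'', a subset of S - S'.

    S:          pathways associated with the disease in most patients.
    S_detected: pathways detected for one individual (S').
    rules:      list of (premises, conclusion) pairs, read as
                "if every premise pathway is present, so is the conclusion".
    """
    known = set(S_detected)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    # Keep only newly inferred pathways that belong to the disease set S.
    return (known - set(S_detected)) & set(S)


# Hypothetical pathway identifiers and rules.
S = {"p1", "p2", "p3", "p4"}
S_detected = {"p1"}
rules = [
    ({"p1"}, "p2"),
    ({"p1", "p2"}, "p3"),
    ({"p5"}, "p4"),  # never fires: p5 is not detected or inferred
]
S_inferred = infer_undetected(S, S_detected, rules)
```

Here a single detected pathway yields two inferred ones, which mirrors the abstract's claim that undetected pathways can be inferred from a few detected ones.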
Affiliation(s)
- Kamal Taha
  - Department of Electrical and Computer Science, Khalifa University, Abu Dhabi, UAE
- Ramana Davuluri
  - Department of Biomedical Informatics, School of Medicine and College of Engineering and Applied Sciences, Stony Brook University, Stony Brook, New York, United States of America
- Paul Yoo
  - Department of Computer Science & Information Systems, University of London, Birkbeck College, London, United Kingdom
- Jesse Spencer
  - Department of Pathology, University of Utah, Salt Lake City, Utah, United States of America
9
Choudhari JK, Chatterjee T, Gupta S, Garcia-Garcia JG, Vera-González J. Network Biology Approaches in Ophthalmological Diseases: A Case Study of Glaucoma. Systems Medicine 2021. [DOI: 10.1016/b978-0-12-801238-3.11586-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022] Open
10
Penteado CAS, Batista TBD, Chaiben CL, Bonacin BG, Ventura TMO, Dionizio A, Couto Souza PH, Buzalaf MAR, Azevedo-Alanis LR. Salivary protein candidates for biomarkers of oral disorders in alcohol and tobacco dependents. Oral Dis 2020; 26:1200-1208. [PMID: 32237000 DOI: 10.1111/odi.13337] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/09/2019] [Revised: 02/20/2020] [Accepted: 03/19/2020] [Indexed: 12/21/2022]
Abstract
OBJECTIVES To evaluate the oral condition of alcohol and tobacco dependents and identify salivary protein candidates for biomarkers of oral disorders. SUBJECTS AND METHODS Thirty-three male volunteers were evaluated for alcohol abuse rehabilitation; nine were selected for proteomic analysis. Intraoral examination was performed, and non-stimulated saliva was collected. Salivary proteins were extracted and processed for analysis. A list of proteins identified in saliva was generated from the database and manually revised, obtaining the total number of candidate biomarkers for oral disorders. RESULTS The mean age (n = 33) was 42.94 ± 8.61 years. Fourteen (42.4%) subjects presented with 23 oral mucosa changes, and 31 (94%) had dental plaque. A total of 282 proteins were found in saliva (n = 9), of which 26 were identified as candidates for biomarkers of oral disorders. After manual review, 21 proteins were selected. The highest number of candidates for biomarkers was associated with carcinoma of head and neck (n = 10), nasopharyngeal carcinoma (n = 6), and periodontal disease (n = 6). CONCLUSION Alcohol and tobacco dependents showed gingival inflammation, and less than half of them showed oral mucosa changes. Twenty-one protein candidates for biomarkers of oral disorders were identified in saliva. The two major oral disorders in number of candidates for biomarkers were head and neck cancer and periodontal disease.
Affiliation(s)
- Thiago Beltrami Dias Batista
  - Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
- Cassiano Lima Chaiben
  - Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
- Bruna Guedes Bonacin
  - Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
- Aline Dionizio
  - Bauru School of Dentistry, University of São Paulo, Bauru, SP, Brazil
- Paulo Henrique Couto Souza
  - Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
- Luciana Reis Azevedo-Alanis
  - Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
11
DES-ROD: Exploring Literature to Develop New Links between RNA Oxidation and Human Diseases. Oxidative Medicine and Cellular Longevity 2020; 2020:5904315. [PMID: 32308806 PMCID: PMC7142358 DOI: 10.1155/2020/5904315] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/09/2020] [Accepted: 02/21/2020] [Indexed: 12/27/2022]
Abstract
Normal cellular physiology and biochemical processes require undamaged RNA molecules. However, RNAs are frequently subjected to oxidative damage. Overproduction of reactive oxygen species (ROS) leads to RNA oxidation and disturbs redox (oxidation-reduction reaction) homeostasis. When oxidative damage affects RNA carrying protein-coding information, this may result in the synthesis of aberrant proteins as well as a lower efficiency of translation. Both of these, as well as imbalanced redox homeostasis, may lead to numerous human diseases. The number of studies on the effects of RNA oxidative damage in mammals is increasing every year, owing to the understanding that this oxidation fundamentally leads to numerous human diseases. To enable researchers in this field to explore information relevant to RNA oxidation and its effects on human diseases, we developed DES-ROD, an online knowledgebase that contains processed information from 298,603 relevant documents consisting of PubMed abstracts and PubMed Central full-text articles. The system utilizes concepts/terms from 38 curated thematic dictionaries mapped to the analyzed documents. Researchers can explore enriched concepts, as well as enriched pairs of putatively associated concepts. In this way, one can explore mutual relationships between any combination of two concepts from the dictionaries used. Dictionaries cover a wide range of biomedical topics, such as human genes and proteins, pathways, Gene Ontology categories, mutations, noncoding RNAs, enzymes, toxins, metabolites, and diseases. This makes insights into different facets of the effects of RNA oxidation and the control of this process possible. The usefulness of the DES-ROD system is demonstrated by case studies on some known information, as well as potentially novel information involving RNA oxidation and diseases. DES-ROD is the first knowledgebase based on text and data mining that focuses on the exploration of RNA oxidation and human diseases.
12
Barman RK, Mukhopadhyay A, Maulik U, Das S. Identification of infectious disease-associated host genes using machine learning techniques. BMC Bioinformatics 2019; 20:736. [PMID: 31881961 PMCID: PMC6935192 DOI: 10.1186/s12859-019-3317-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 10/31/2019] [Accepted: 12/16/2019] [Indexed: 02/06/2023] Open
Abstract
Background With the global spread of multidrug resistance in pathogenic microbes, infectious diseases emerge as a key public health concern of the recent time. Identification of host genes associated with infectious diseases will improve our understanding about the mechanisms behind their development and help to identify novel therapeutic targets. Results We developed a machine learning techniques-based classification approach to identify infectious disease-associated host genes by integrating sequence and protein interaction network features. Among different methods, Deep Neural Networks (DNN) model with 16 selected features for pseudo-amino acid composition (PAAC) and network properties achieved the highest accuracy of 86.33% with sensitivity of 85.61% and specificity of 86.57%. The DNN classifier also attained an accuracy of 83.33% on a blind dataset and a sensitivity of 83.1% on an independent dataset. Furthermore, to predict unknown infectious disease-associated host genes, we applied the proposed DNN model to all reviewed proteins from the database. Seventy-six out of 100 highly-predicted infectious disease-associated genes from our study were also found in experimentally-verified human-pathogen protein-protein interactions (PPIs). Finally, we validated the highly-predicted infectious disease-associated genes by disease and gene ontology enrichment analysis and found that many of them are shared by one or more of the other diseases, such as cancer, metabolic and immune related diseases. Conclusions To the best of our knowledge, this is the first computational method to identify infectious disease-associated host genes. The proposed method will help large-scale prediction of host genes associated with infectious-diseases. 
However, our results indicated that for small datasets, advanced DNN-based method does not offer significant advantage over the simpler supervised machine learning techniques, such as Support Vector Machine (SVM) or Random Forest (RF) for the prediction of infectious disease-associated host genes. Significant overlap of infectious disease with cancer and metabolic disease on disease and gene ontology enrichment analysis suggests that these diseases perturb the functions of the same cellular signaling pathways and may be treated by drugs that tend to reverse these perturbations. Moreover, identification of novel candidate genes associated with infectious diseases would help us to explain disease pathogenesis further and develop novel therapeutics.
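The accuracy, sensitivity and specificity figures quoted above all derive from a binary confusion matrix. A minimal sketch of those standard definitions follows; the counts are illustrative only and are not taken from the study, and the function name is a placeholder.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Fraction of all genes classified correctly.
        "accuracy": (tp + tn) / total,
        # Fraction of true disease-associated genes that were recovered.
        "sensitivity": tp / (tp + fn),
        # Fraction of non-associated genes correctly rejected.
        "specificity": tn / (tn + fp),
    }


# Illustrative counts only (not from the paper).
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Reporting sensitivity and specificity alongside accuracy matters when the positive and negative classes differ in size, since accuracy alone can hide poor recall on the smaller class.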
Affiliation(s)
- Ranjan Kumar Barman
  - Biomedical Informatics Centre, ICMR-National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India; Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Anirban Mukhopadhyay
  - Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
- Ujjwal Maulik
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Santasabuj Das
  - Biomedical Informatics Centre, ICMR-National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India; Division of Clinical Medicine, ICMR-National Institute of Cholera and Enteric Diseases, P-33, C.I.T. Road Scheme XM, Beliaghata-700010, Kolkata, West Bengal, India
13
Batista TBD, Chaiben CL, Penteado CAS, Nascimento JMC, Ventura TMO, Dionizio A, Rosa EAR, Buzalaf MAR, Azevedo-Alanis LR. Salivary proteome characterization of alcohol and tobacco dependents. Drug Alcohol Depend 2019; 204:107510. [PMID: 31494441 DOI: 10.1016/j.drugalcdep.2019.06.013] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 04/04/2019] [Revised: 05/28/2019] [Accepted: 06/03/2019] [Indexed: 12/18/2022]
Abstract
BACKGROUND Alcohol and substances found in tobacco may alter salivary flow and the amount of salivary proteins. This study aimed to compare salivary proteins between alcohol-dependent smokers and controls. METHODS This is a case-control study of men older than 18 years of age, matched by age. The alcohol-dependent group was composed of heavy smokers and alcohol consumers. Unstimulated whole saliva was collected from all subjects. Analysis of digested peptides was performed in a mass spectrometer. Data were processed using ProteinLynx GlobalServer software. Results were obtained by searching the Homo sapiens database from the UniProt catalog. The search tool IBI-IMIM was used to identify candidate proteins for biomarkers. RESULTS The alcohol-dependent and control groups were composed of nine participants each, with mean ages of 36.89 ± 2.57 and 35.78 ± 1.64 years, respectively. A total of 404 salivary proteins were found across the two groups, 282 of them in the alcohol-dependent group. Among the 96 proteins present in both groups, 32 were up-regulated in the alcohol dependents (e.g. "Hemoglobin subunit beta" and "Forkhead box protein P2" were up-regulated at least 10-fold), 23 were down-regulated (e.g. "Statherin" and "RNA-binding protein 25" were down-regulated at least 10-fold), and 41 presented similar expression in both groups. Seventy-one proteins were candidates for biomarkers of disorders; 58 of them were present in alcohol dependents' saliva. The most common disorders were neoplasms and genetic, cardiovascular, metabolic and glandular diseases. CONCLUSIONS The salivary protein profile undergoes strong changes in alcohol and tobacco dependents. 34% of the salivary proteins present in alcohol and tobacco dependents were also present in controls; 14.5% of them were expressed in similar quantity.
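The two percentages in the conclusions can be reproduced from the counts given in the results, assuming both use the 282 proteins found in the alcohol-dependent group as the denominator (an assumption; the abstract does not state the denominator explicitly). The variable names are placeholders.

```python
# Counts reported in the abstract.
alcohol_total = 282      # proteins found in the alcohol-dependent group
shared = 96              # proteins present in both groups
up_regulated = 32
down_regulated = 23
similar = 41             # shared proteins with similar expression

# The three expression categories partition the shared proteins.
assert up_regulated + down_regulated + similar == shared

# Assumed denominator: proteins found in the alcohol-dependent group.
pct_shared = round(100 * shared / alcohol_total, 1)
pct_similar = round(100 * similar / alcohol_total, 1)

print(pct_shared, pct_similar)
```

Under that assumption the rounded values agree with the reported 34% and 14.5%.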
Affiliation(s)
- Thiago Beltrami Dias Batista
- Graduate student, Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
- Cassiano Lima Chaiben
- Graduate student, Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
- Carlos Antonio Schäffer Penteado
- Graduate student, Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
- Júlia Milena Carvalho Nascimento
- Undergraduate student, Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
- Talita Mendes Oliveira Ventura
- Graduate student, Bauru School of Dentistry, University of São Paulo, Alameda Doutor Octávio Pinheiro Brisolla, 9-75, Bauru, SP, 17012-901, Brazil.
- Aline Dionizio
- Graduate student, Bauru School of Dentistry, University of São Paulo, Alameda Doutor Octávio Pinheiro Brisolla, 9-75, Bauru, SP, 17012-901, Brazil.
- Edvaldo Antonio Ribeiro Rosa
- Full Professor, Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
- Marília Afonso Rabelo Buzalaf
- Full Professor, Bauru School of Dentistry, University of São Paulo, Alameda Doutor Octávio Pinheiro Brisolla, 9-75, Bauru, SP, 17012-901, Brazil.
- Luciana Reis Azevedo-Alanis
- Full Professor, Graduate Program in Dentistry, School of Life Sciences, Pontifícia Universidade Católica do Paraná, Rua Imaculada Conceição 1155, Curitiba, PR, 80215-901, Brazil.
|
14
|
Essack M, Salhi A, Stanimirovic J, Tifratene F, Bin Raies A, Hungler A, Uludag M, Van Neste C, Trpkovic A, Bajic VP, Bajic VB, Isenovic ER. Literature-Based Enrichment Insights into Redox Control of Vascular Biology. OXIDATIVE MEDICINE AND CELLULAR LONGEVITY 2019; 2019:1769437. [PMID: 31223421 PMCID: PMC6542245 DOI: 10.1155/2019/1769437] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/07/2019] [Revised: 04/11/2019] [Accepted: 05/02/2019] [Indexed: 02/07/2023]
Abstract
In cellular physiology and signaling, reactive oxygen species (ROS) play one of the most critical roles. ROS overproduction leads to cellular oxidative stress. This may lead to an irrecoverable imbalance of redox (oxidation-reduction reaction) function that deregulates redox homeostasis, which itself could lead to several diseases including neurodegenerative disease, cardiovascular disease, and cancers. In this study, we focus on the redox effects related to vascular systems in mammals. To support research in this domain, we developed an online knowledge base, DES-RedoxVasc, which enables exploration of information contained in the biomedical scientific literature. The DES-RedoxVasc system analyzed 233399 documents consisting of PubMed abstracts and PubMed Central full-text articles related to different aspects of redox biology in vascular systems. It allows researchers to explore enriched concepts from 28 curated thematic dictionaries, as well as literature-derived potential associations of pairs of such enriched concepts, where associations themselves are statistically enriched. For example, the system allows exploration of associations of pathways, diseases, mutations, genes/proteins, miRNAs, long ncRNAs, toxins, drugs, biological processes, molecular functions, etc. that allow for insights about different aspects of redox effects and control of processes related to the vascular system. Moreover, we deliver case studies about some existing or possibly novel knowledge regarding redox of vascular biology demonstrating the usefulness of DES-RedoxVasc. DES-RedoxVasc is the first compiled knowledge base using text mining for the exploration of this topic.
Affiliation(s)
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Adil Salhi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Julijana Stanimirovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
- Faroug Tifratene
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Arwa Bin Raies
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Arnaud Hungler
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Andreja Trpkovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
- Vladan P. Bajic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
- Vladimir B. Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
- Esma R. Isenovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
|
15
|
Hatz S, Spangler S, Bender A, Studham M, Haselmayer P, Lacoste AMB, Willis VC, Martin RL, Gurulingappa H, Betz U. Identification of pharmacodynamic biomarker hypotheses through literature analysis with IBM Watson. PLoS One 2019; 14:e0214619. [PMID: 30958864 PMCID: PMC6453528 DOI: 10.1371/journal.pone.0214619] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/21/2018] [Accepted: 03/16/2019] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Pharmacodynamic biomarkers are becoming increasingly valuable for assessing drug activity and target modulation in clinical trials. However, identifying quality biomarkers is challenging due to the increasing volume and heterogeneity of relevant data describing the biological networks that underlie disease mechanisms. A biological pathway network typically includes entities (e.g. genes, proteins and chemicals/drugs) as well as the relationships between these and is typically curated or mined from structured databases and textual co-occurrence data. We propose a hybrid Natural Language Processing and directed relationships-based network analysis approach using IBM Watson for Drug Discovery to rank all human genes and identify potential candidate biomarkers, requiring only an initial determination of a specific target-disease relationship. METHODS Through natural language processing of scientific literature, Watson for Drug Discovery creates a network of semantic relationships between biological concepts such as genes, drugs, and diseases. Using Bruton's tyrosine kinase as a case study, Watson for Drug Discovery's automatically extracted relationship network was compared with a prominent manually curated physical interaction network. Additionally, potential biomarkers for Bruton's tyrosine kinase inhibition were predicted using a matrix factorization approach and subsequently compared with expert-generated biomarkers. RESULTS Watson's natural language processing generated a relationship network matching 55 (86%) genes upstream of BTK and 98 (95%) genes downstream of Bruton's tyrosine kinase in a prominent manually curated physical interaction network. Matrix factorization analysis predicted 11 of 13 genes identified by Merck subject matter experts in the top 20% of Watson for Drug Discovery's 13,595 ranked genes, with 7 in the top 5%. 
CONCLUSION Taken together, these results suggest that Watson for Drug Discovery's automatic relationship network identifies the majority of upstream and downstream genes in biological pathway networks and can be used to help with the identification and prioritization of pharmacodynamic biomarker evaluation, accelerating the early phases of disease hypothesis generation.
Affiliation(s)
- Sonja Hatz
- Merck KGaA, Frankfurter Straße, Darmstadt, Germany
- Scott Spangler
- IBM Watson Health, Almaden, California, United States of America
- Andrew Bender
- EMD Serono, Middlesex Turnpike, Billerica, United States of America
- Matthew Studham
- EMD Serono, Middlesex Turnpike, Billerica, United States of America
- Van C. Willis
- IBM Watson Health, Cambridge, Massachusetts, United States of America
- Richard L. Martin
- IBM Watson Health, Cambridge, Massachusetts, United States of America
- Ulrich Betz
- Merck KGaA, Frankfurter Straße, Darmstadt, Germany
|
16
|
Furrer L, Jancso A, Colic N, Rinaldi F. OGER++: hybrid multi-type entity recognition. J Cheminform 2019; 11:7. [PMID: 30666476 PMCID: PMC6689863 DOI: 10.1186/s13321-018-0326-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 07/31/2018] [Accepted: 12/27/2018] [Indexed: 12/14/2022] Open
Abstract
Background We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step. Results We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively. Conclusions Combining knowledge-based and data-driven components allows creating a system with competitive performance in biomedical text mining.
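As a rough illustration of the dictionary look-up with normalization that this abstract describes, a minimal Python sketch follows. It is not the OGER++ implementation: the normalization rule (lowercase, drop everything except letters and digits), the toy dictionary, the concept IDs, and the names `normalize` and `annotate` are all assumptions made for the example, and the neural disambiguation step is omitted.

```python
import re

def normalize(term: str) -> str:
    """Lowercase and drop spaces, hyphens and punctuation so spelling variants collide."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

# Hypothetical toy dictionary: surface term -> concept ID (illustrative only).
DICTIONARY = {
    "NF-kappa B": "GENE:NFKB1",
    "interleukin 6": "GENE:IL6",
}
INDEX = {normalize(term): concept for term, concept in DICTIONARY.items()}

def annotate(text: str, max_tokens: int = 3):
    """Return (start, end, surface, concept_id) tuples, preferring the longest match."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest token window first, then shrink it.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            concept = INDEX.get(normalize(text[start:end]))
            if concept is not None:
                hits.append((start, end, text[start:end], concept))
                matched = n
                break
        i += matched if matched else 1
    return hits
```

Calling `annotate("Levels of NF kappaB and Interleukin-6 rose.")` links both variant spellings to their dictionary entries, because the variants and the dictionary terms normalize to the same key.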
Affiliation(s)
- Lenz Furrer
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Anna Jancso
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Nicola Colic
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland; Fondazione Bruno Kessler, Via Sommarive, 18, 38123, Trento, Italy.
|
17
|
Smaili FZ, Gao X, Hoehndorf R. OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction. Bioinformatics 2018; 35:2133-2140. [DOI: 10.1093/bioinformatics/bty933] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 08/05/2018] [Revised: 11/02/2018] [Accepted: 11/07/2018] [Indexed: 12/11/2022] Open
Affiliation(s)
- Fatima Zohra Smaili
- Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Xin Gao
- Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Robert Hoehndorf
- Computer, Electrical & Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
|
18
|
Li S, Liu X, Zhou Y, Acharya A, Savkovic V, Xu C, Wu N, Deng Y, Hu X, Li H, Haak R, Schmidt J, Shang W, Pan H, Shang R, Yu Y, Ziebolz D, Schmalz G. Shared genetic and epigenetic mechanisms between chronic periodontitis and oral squamous cell carcinoma. Oral Oncol 2018; 86:216-224. [DOI: 10.1016/j.oraloncology.2018.09.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 06/17/2018] [Revised: 09/15/2018] [Accepted: 09/28/2018] [Indexed: 12/11/2022]
|
19
|
Mishra S, Shah MI, Sarkar M, Asati N, Rout C. ILDgenDB: integrated genetic knowledge resource for interstitial lung diseases (ILDs). DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5035482. [PMID: 29897484 PMCID: PMC6007225 DOI: 10.1093/database/bay053] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/25/2018] [Accepted: 05/17/2018] [Indexed: 12/31/2022]
Abstract
Interstitial lung diseases (ILDs) are a diverse group of ∼200 acute and chronic pulmonary disorders that are characterized by variable amounts of inflammation, fibrosis and architectural distortion with substantial morbidity and mortality. Inaccurate and delayed diagnoses increase the risk, especially in developing countries. Studies have indicated the significant roles of genetic elements in ILD pathogenesis. Therefore, the first genetic knowledge resource, ILDgenDB, has been developed with the objective of providing ILD genetic data and their integrated analyses for a better understanding of disease pathogenesis and the identification of diagnostics-based biomarkers. This resource contains literature-curated disease candidate genes (DCGs) enriched with various regulatory elements that have been generated using an integrated bioinformatics workflow of database searches, literature mining and DCG–microRNA (miRNA)–single nucleotide polymorphism (SNP) association analyses. To provide statistical significance for disease-gene associations, an ILD-specificity index and hypergeometric test scores were also incorporated. Association analyses of miRNAs, SNPs and pathways responsible for the pathogenesis of different sub-classes of ILDs were also incorporated. A total of 299 manually verified DCGs and their significant associations with 1932 SNPs, 2966 miRNAs and 9170 miR-polymorphisms are provided. Furthermore, 216 literature-mined and proposed biomarkers were identified. The ILDgenDB resource provides user-friendly browsing and extensive query-based information retrieval systems. Additionally, this resource offers a graphical view of predicted DCGs–SNPs/miRNAs and literature-associated DCGs–ILDs interactions for each ILD to facilitate efficient data interpretation. The outcomes of the analyses suggest significant involvement of the immune system and defense mechanisms in ILD pathogenesis. This resource may potentially facilitate genetic-based disease monitoring and diagnosis.
Database URL: http://14.139.240.55/ildgendb/index.php
Affiliation(s)
- Smriti Mishra
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan, Himachal Pradesh 173234, India
- Mohammad I Shah
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan, Himachal Pradesh 173234, India
- Malay Sarkar
- Department of Pulmonary Medicine, Indira Gandhi Medical College, Shimla, Himachal Pradesh 171001, India
- Nimisha Asati
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan, Himachal Pradesh 173234, India
- Chittaranjan Rout
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan, Himachal Pradesh 173234, India
|
20
|
Bhasuran B, Natarajan J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018; 13:e0200699. [PMID: 30048465 PMCID: PMC6061985 DOI: 10.1371/journal.pone.0200699] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 01/30/2018] [Accepted: 07/02/2018] [Indexed: 12/26/2022] Open
Abstract
A wealth of knowledge concerning relations between genes and their associated diseases is present in the biomedical literature. Mining these biological associations from the literature can provide immense support to research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this process considerably. In this scenario, biomedical text mining is a crucial technology, and relation extraction shows promising results for studying genes associated with diseases. We addressed this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties, jointly learned with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. On evaluation, the machine learning approach shows promising F-measures of 85.34%, 83.93%, 87.39% and 85.57% on the EUADR, GAD, CoMAGC and PolySearch corpora, respectively. We strongly believe that the presented novel approach, combining a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and evaluated on four gold standard corpora, can act as a new baseline for future work in gene-disease relation extraction from the literature.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
- Jeyakumar Natarajan
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
- Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
|
21
|
Lee J, Song HJ, Yoon E, Park SB, Park SH, Seo JW, Park P, Choi J. Automated extraction of Biomarker information from pathology reports. BMC Med Inform Decis Mak 2018; 18:29. [PMID: 29783980 PMCID: PMC5963015 DOI: 10.1186/s12911-018-0609-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/12/2017] [Accepted: 04/27/2018] [Indexed: 02/06/2023] Open
Abstract
Background Pathology reports are written in free-text form, which precludes efficient data gathering. We aimed to overcome this limitation and design an automated system for extracting biomarker profiles from accumulated pathology reports. Methods We designed a new data model for representing biomarker knowledge. The automated system parses immunohistochemistry reports based on a “slide paragraph” unit defined as a set of immunohistochemistry findings obtained for the same tissue slide. Pathology reports are parsed using context-free grammar for immunohistochemistry, and using a tree-like structure for surgical pathology. The performance of the approach was validated on manually annotated pathology reports of 100 randomly selected patients managed at Seoul National University Hospital. Results High F-scores were obtained for parsing biomarker name and corresponding test results (0.999 and 0.998, respectively) from the immunohistochemistry reports, compared to relatively poor performance for parsing surgical pathology findings. However, applying the proposed approach to our single-center dataset revealed information on 221 unique biomarkers, which represents a richer result than biomarker profiles obtained based on the published literature. Owing to the data representation model, the proposed approach can associate biomarker profiles extracted from an immunohistochemistry report with corresponding pathology findings listed in one or more surgical pathology reports. Term variations are resolved by normalization to corresponding preferred terms determined by expanded dictionary look-up and text similarity-based search. Conclusions Our proposed approach for biomarker data extraction addresses key limitations regarding data representation and can handle reports prepared in the clinical setting, which often contain incomplete sentences, typographical errors, and inconsistent formatting. 
Electronic supplementary material The online version of this article (10.1186/s12911-018-0609-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jeongeun Lee
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul, Republic of Korea
- Hyun-Je Song
- School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
- Eunsil Yoon
- PAS1 team, TmaxSoft, Gyeonggi-do, Republic of Korea
- Seong-Bae Park
- School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
- Sung-Hye Park
- Department of Pathology, College of Medicine, Seoul National University, Seoul, Republic of Korea
- Jeong-Wook Seo
- Department of Pathology, College of Medicine, Seoul National University, Seoul, Republic of Korea
- Peom Park
- Department of Industrial Engineering, Ajou University, Suwon, Republic of Korea
- Jinwook Choi
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul, Republic of Korea; Department of Biomedical Engineering, College of Medicine, Seoul National University, Seoul, Republic of Korea.
|
22
|
Renganathan V. Text Mining in Biomedical Domain with Emphasis on Document Clustering. Healthc Inform Res 2017; 23:141-146. [PMID: 28875048 PMCID: PMC5572517 DOI: 10.4258/hir.2017.23.3.141] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 03/21/2017] [Revised: 07/16/2017] [Accepted: 07/17/2017] [Indexed: 12/19/2022] Open
Abstract
Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
|
23
|
Automated extraction of potential migraine biomarkers using a semantic graph. J Biomed Inform 2017; 71:178-189. [PMID: 28579531 DOI: 10.1016/j.jbi.2017.05.018] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 12/29/2016] [Revised: 04/03/2017] [Accepted: 05/23/2017] [Indexed: 01/20/2023]
Abstract
PROBLEM Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.
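The ranking-and-evaluation idea in this abstract can be sketched in a few lines of Python. This is only a toy under stated assumptions: the score used here (a plain count of edges linking a candidate to disease-related concepts) stands in for the study's actual ranking, which the abstract does not specify, and the function names and example data are hypothetical. ROC-AUC is computed as the fraction of (reference, non-reference) pairs ordered correctly, with ties counted as one half.

```python
def rank_by_connections(edges, disease_concepts, candidates):
    """Score each candidate by its number of edges to disease-related concepts."""
    score = {c: 0 for c in candidates}
    for a, b in edges:
        if a in score and b in disease_concepts:
            score[a] += 1
        if b in score and a in disease_concepts:
            score[b] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    ranking = sorted(candidates, key=lambda c: (-score[c], c))
    return ranking, score

def roc_auc(score, reference):
    """Probability that a reference compound outscores a non-reference one."""
    pos = [s for c, s in score.items() if c in reference]
    neg = [s for c, s in score.items() if c not in reference]
    if not pos or not neg:
        raise ValueError("need at least one reference and one non-reference item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy graph, not data from the study.
EDGES = [
    ("serotonin", "migraine"), ("serotonin", "aura"),
    ("CGRP", "migraine"), ("CGRP", "aura"), ("CGRP", "headache"),
    ("glucose", "diabetes"),
]
```

With `EDGES`, the disease concepts `{"migraine", "aura", "headache"}` and the candidates `["serotonin", "CGRP", "glucose"]`, the better-connected compounds rise to the top, and `roc_auc` summarizes how well that order agrees with a manually curated reference set.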
|
24
|
Abstract
Deciphering gene–disease association is a crucial step in designing therapeutic strategies against diseases. There are experimental methods for identifying gene–disease associations, such as genome-wide association studies and linkage analysis, but these can be expensive and time consuming. As a result, various in silico methods for predicting associations from these and other data have been developed using different approaches. In this article, we review some of the recent approaches to the computational prediction of gene–disease association. We look at recent advancements in algorithms, categorising them into those based on genome variation, networks, text mining, and crowdsourcing. We also look at some of the challenges faced in the computational prediction of gene–disease associations.
Affiliation(s)
- Kenneth Opap
- University of Cape Town, Cape Town, South Africa
|
25
|
Yoon BH, Kim SK, Kim SY. Use of Graph Database for the Integration of Heterogeneous Biological Data. Genomics Inform 2017; 15:19-27. [PMID: 28416946 PMCID: PMC5389944 DOI: 10.5808/gi.2017.15.1.19] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 11/29/2016] [Revised: 02/02/2017] [Accepted: 02/02/2017] [Indexed: 12/15/2022] Open
Abstract
Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
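To make the contrast with multiple-join queries concrete, the following Python sketch stores typed relationships as adjacency lists, so that a multi-hop question becomes a walk along stored edges rather than a chain of table joins. It is an in-memory toy, not Neo4j and not the authors' setup; the class name `Graph`, the relation labels and the example nodes are all hypothetical.

```python
from collections import defaultdict

class Graph:
    """Tiny in-memory property graph: node -> list of (relation, neighbor)."""

    def __init__(self):
        self.out = defaultdict(list)

    def add(self, src, relation, dst):
        """Store one directed, typed relationship."""
        self.out[src].append((relation, dst))

    def follow(self, start, relations):
        """Follow a fixed sequence of relation types from start; return the end nodes."""
        frontier = {start}
        for relation in relations:
            frontier = {
                dst
                for node in frontier
                for rel, dst in self.out.get(node, ())
                if rel == relation
            }
        return frontier

# Hypothetical example data (drug-target, protein-protein, gene-disease).
g = Graph()
g.add("drugA", "TARGETS", "P1")
g.add("P1", "INTERACTS_WITH", "P2")
g.add("P2", "ASSOCIATED_WITH", "diseaseX")
g.add("P1", "ASSOCIATED_WITH", "diseaseY")
```

Here `g.follow("drugA", ["TARGETS", "INTERACTS_WITH", "ASSOCIATED_WITH"])` visits only the neighbors of the current frontier at each hop, whereas the equivalent relational query would join three tables.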
Affiliation(s)
- Byoung-Ha Yoon
- Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea; Department of Functional Genomics, University of Science and Technology (UST), Daejeon 34113, Korea
- Seon-Kyu Kim
- Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea
- Seon-Young Kim
- Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea; Department of Functional Genomics, University of Science and Technology (UST), Daejeon 34113, Korea
|
26
|
Xi X, Li T, Huang Y, Sun J, Zhu Y, Yang Y, Lu ZJ. RNA Biomarkers: Frontier of Precision Medicine for Cancer. Noncoding RNA 2017; 3:ncrna3010009. [PMID: 29657281 PMCID: PMC5832009 DOI: 10.3390/ncrna3010009] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 12/30/2016] [Accepted: 02/13/2017] [Indexed: 12/15/2022] Open
Abstract
As an essential part of the central dogma, RNA delivers genetic and regulatory information and reflects cellular states. Based on high-throughput sequencing technologies, accumulating data show that various RNA molecules can serve as biomarkers for the diagnosis and prognosis of various diseases, such as cancer. In particular, extracellular RNAs (exRNAs), which are detectable in various bio-fluids such as serum, saliva and urine, are emerging as non-invasive biomarkers for earlier cancer diagnosis, monitoring of tumor progression, and prediction of therapy response. In this review, we summarize the latest studies on various types of RNA biomarkers, especially extracellular RNAs, in cancer diagnosis and prognosis, and illustrate several well-known RNA biomarkers of clinical utility. In addition, we describe and discuss general procedures and issues in investigating exRNA biomarkers, and perspectives on the utility of exRNAs in precision medicine.
Collapse
Affiliation(s)
- Xiaochen Xi
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
| | - Tianxiao Li
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
| | - Yiming Huang
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
- Jiahui Sun
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
- Yumin Zhu
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
- Yang Yang
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
- Zhi John Lu
- MOE Key Laboratory of Bioinformatics, Tsinghua-Peking Joint Center for Life Sciences, Center for Plant Biology and Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
27
Gutiérrez-Sacristán A, Bravo À, Portero-Tresserra M, Valverde O, Armario A, Blanco-Gandía M, Farré A, Fernández-Ibarrondo L, Fonseca F, Giraldo J, Leis A, Mané A, Mayer M, Montagud-Romero S, Nadal R, Ortiz J, Pavon FJ, Perez EJ, Rodríguez-Arias M, Serrano A, Torrens M, Warnault V, Sanz F, Furlong LI. Text mining and expert curation to develop a database on psychiatric diseases and their genes. Database (Oxford) 2017; 2017:3891487. [PMID: 29220439 PMCID: PMC5502359 DOI: 10.1093/database/bax043] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/28/2016] [Revised: 04/27/2017] [Accepted: 05/01/2017] [Indexed: 01/15/2023]
Abstract
Database URL: http://www.psygenet.org. PsyGeNET corpus: http://www.psygenet.org/ds/PsyGeNET/results/psygenetCorpus.tar.
Affiliation(s)
- Alba Gutiérrez-Sacristán
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
- Àlex Bravo
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
- Marta Portero-Tresserra
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Olga Valverde
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Antonio Armario
- Institut de Neurociències and Animal Physiology Unit, Universitat Autònoma de Barcelona (UAB), Barcelona, Spain
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- M.C. Blanco-Gandía
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
- Adriana Farré
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Lierni Fernández-Ibarrondo
- Programa de Cáncer (IMIM), Investigación Traslacional en Neoplasias Colorrectales, C/Dr. Aiguader 88, Barcelona, Spain
- Francina Fonseca
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Jesús Giraldo
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institut de Neurociències and Unitat de Bioestadística, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Angela Leis
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
- Anna Mané
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- M.A. Mayer
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
- Sandra Montagud-Romero
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
- Roser Nadal
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institut de Neurociències and Psychobiology Area, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Jordi Ortiz
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Neuroscience Institute and Department of Biochemistry and Molecular Biology, School of Medicine, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Francisco Javier Pavon
- Unidad de Gestión Clínica de Salud Mental, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Regional Universitario de Málaga, Málaga, Spain
- Ezequiel Jesús Perez
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Marta Rodríguez-Arias
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
- Antonia Serrano
- Unidad de Gestión Clínica de Salud Mental, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Regional Universitario de Málaga, Málaga, Spain
- Marta Torrens
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- Vincent Warnault
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Ferran Sanz
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
- Laura I. Furlong
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
28
Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res 2016; 45:D833-D839. [PMID: 27924018 PMCID: PMC5210640 DOI: 10.1093/nar/gkw943] [Citation(s) in RCA: 1482] [Impact Index Per Article: 185.3] [Received: 08/11/2016] [Revised: 09/29/2016] [Accepted: 10/18/2016] [Indexed: 12/12/2022] Open
Abstract
Information about the genetic basis of human diseases lies at the heart of precision medicine and drug discovery. However, to realize its full potential to support these goals, several problems, such as fragmentation, heterogeneity, availability and different conceptualization of the data, must be overcome. To provide the community with a resource free of these hurdles, we have developed DisGeNET (http://www.disgenet.org), one of the largest available collections of genes and variants involved in human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype-phenotype relationships. The information is accessible through a web interface, a Cytoscape App, an RDF SPARQL endpoint, scripts in several programming languages and an R package. DisGeNET is a versatile platform that can be used for different research purposes including the investigation of the molecular underpinnings of specific human diseases and their comorbidities, the analysis of the properties of disease genes, the generation of hypotheses on drug therapeutic action and drug adverse effects, the validation of computationally predicted disease genes and the evaluation of the performance of text-mining methods.
Affiliation(s)
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Alba Gutiérrez-Sacristán
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Jordi Deu-Pons
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Emilio Centeno
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Javier García-García
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Ferran Sanz
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences (DCEXS), Universitat Pompeu Fabra (UPF), C/Dr Aiguader 88, E-08003 Barcelona, Spain
29
Li P, Nie Y, Yu J. Fusing literature and full network data improves disease similarity computation. BMC Bioinformatics 2016; 17:326. [PMID: 27578323 PMCID: PMC5006367 DOI: 10.1186/s12859-016-1205-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 03/12/2016] [Accepted: 08/24/2016] [Indexed: 01/01/2023] Open
Abstract
Background: Identifying relatedness among diseases could help deepen understanding of the underlying pathogenic mechanisms of diseases, and facilitate drug repositioning projects. A number of methods for computing disease similarity have been developed; however, none of them were designed to utilize information from the entire protein interaction network, using instead only those interactions involving disease-causing genes. Most previously published methods required gene-disease association data; unfortunately, many diseases still have very few or no associated genes, which impeded broad adoption of those methods. In this study, we propose a new method (MedNetSim) for computing disease similarity by integrating medical literature and the protein interaction network. MedNetSim consists of a network-based method (NetSim), which employs the entire protein interaction network, and a MEDLINE-based method (MedSim), which computes disease similarity by mining the biomedical literature. Results: Among function-based methods, NetSim achieved the best performance. Its average AUC (area under the receiver operating characteristic curve) reached 95.2%. MedSim, whose performance was even comparable to some function-based methods, achieved the highest average AUC among all semantic-based methods. Integration of MedSim and NetSim (MedNetSim) further improved the average AUC to 96.4%. We further studied the effectiveness of different data sources. It was found that the quality of protein interaction data was more important than its volume. In contrast, a higher volume of gene-disease association data was more beneficial, even with lower reliability. Utilizing a higher volume of disease-related gene data further improved the average AUC of MedNetSim and NetSim to 97.5% and 96.7%, respectively. Conclusions: Integrating biomedical literature and the protein interaction network can be an effective way to compute disease similarity.
When sufficient disease-related gene data are lacking, literature-based methods such as MedSim can be a great addition to function-based algorithms. It may be beneficial to steer more resources toward studying gene-disease associations and improving the quality of protein interaction data. Disease similarities can be computed using the proposed methods at http://www.digintelli.com:8000/. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1205-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ping Li
- State Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Yaling Nie
- State Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Jingkai Yu
- State Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100190, China.
30
A functional module-based exploration between inflammation and cancer in esophagus. Sci Rep 2015; 5:15340. [PMID: 26489668 PMCID: PMC4614801 DOI: 10.1038/srep15340] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/08/2015] [Accepted: 09/23/2015] [Indexed: 12/26/2022] Open
Abstract
The contribution of inflammation to the progression of diverse human cancers is generally appreciated; however, exploration of the molecular links between inflammation and cancer in the esophagus is still at an early stage. In our study, we present a functional module-based approach, in combination with multiple data resources (gene expression, protein-protein interactions (PPI), and transcriptional and post-transcriptional regulation), to decipher the underlying links. By mapping differentially expressed disease genes, functional disease modules were identified. As indicated, the common genes and interactions tended to play important roles in linking inflammation and cancer. Based on crosstalk analysis, we demonstrated that, although most disease genes were not shared by both kinds of modules, they might act by participating in the same or similar functions to complete the molecular links. Additionally, we applied pivot analysis to extract significant regulators for each significant crosstalk module pair. As shown, pivot regulators might manipulate vital parts of the module subnetworks, and then work together to bridge inflammation and cancer in the esophagus. Collectively, based on our functional module analysis, we demonstrated that shared genes or interactions, significant crosstalk modules, and significant pivot regulators serve as different functional parts underlying the molecular links between inflammation and cancer in the esophagus.
31
Ernst P, Siu A, Weikum G. KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics 2015; 16:157. [PMID: 25971816 PMCID: PMC4448285 DOI: 10.1186/s12859-015-0549-5] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 08/28/2014] [Accepted: 03/25/2015] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Biomedical knowledge bases (KBs) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like protein-protein interactions or gene-drug relationships, or has solely addressed highly specific topics such as drug effects. RESULTS We address these three limitations with a versatile and scalable approach to automatic KB construction. Using a small number of seed facts for distant supervision of pattern-based extraction, we harvest a huge number of facts in an automated manner without requiring any explicit training. We extend previous techniques for pattern-based IE with confidence statistics, and we combine this recall-oriented stage with logical reasoning for consistency constraint checking to achieve high precision. To our knowledge, this is the first method that uses consistency checking for biomedical relations. Our approach can be easily extended to incorporate additional relations and constraints. We ran extensive experiments not only for scientific publications, but also for encyclopedic health portals and online communities, creating different KBs based on different configurations. We assess the size and quality of each KB, in terms of number of facts and precision. The best configured KB, KnowLife, contains more than 500,000 facts at a precision of 93% for 13 relations covering genes, organs, diseases, symptoms, treatments, as well as environmental and lifestyle risk factors. CONCLUSION KnowLife is a large knowledge base for health and life sciences, automatically constructed from different Web sources.
As a unique feature, KnowLife is harvested from different text genres such as scientific publications, health portals, and online communities. Thus, it has the potential to serve as a one-stop portal for a wide range of relations and use cases. To showcase its breadth and usefulness, we make the KnowLife KB accessible through the health portal (http://knowlife.mpi-inf.mpg.de).
Affiliation(s)
- Patrick Ernst
- Max-Planck-Institute for Informatics, Campus E1 4, Saarbrücken, 66123, Germany.
- Amy Siu
- Max-Planck-Institute for Informatics, Campus E1 4, Saarbrücken, 66123, Germany.
- Gerhard Weikum
- Max-Planck-Institute for Informatics, Campus E1 4, Saarbrücken, 66123, Germany.
32
Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford) 2015; 2015:bav028. [PMID: 25877637 PMCID: PMC4397996 DOI: 10.1093/database/bav028] [Citation(s) in RCA: 622] [Impact Index Per Article: 69.1] [Received: 10/17/2014] [Accepted: 03/09/2015] [Indexed: 11/25/2022]
Abstract
DisGeNET is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation. DisGeNET data can also be analysed via the DisGeNET Cytoscape plugin, and enriched with the annotations of other plugins of this popular network analysis software suite. Finally, the information contained in DisGeNET can be expanded and complemented using Semantic Web technologies and linked to a variety of resources already present in the Linked Data cloud. Hence, DisGeNET offers one of the most comprehensive collections of human gene-disease associations and a valuable set of tools for investigating the molecular mechanisms underlying diseases of genetic origin, designed to fulfill the needs of different user profiles, including bioinformaticians, biologists and health-care practitioners. Database URL: http://www.disgenet.org/
Affiliation(s)
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Jordi Deu-Pons
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Anna Bauer-Mehren
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Martin Baron
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Ferran Sanz
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, C/Dr Aiguader 88, E-08003 Barcelona, Spain, Roche Pharma Research and Early Development, pRED Informatics, Roche Innovation Center Penzberg, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany and Scientific & Business Information Services, Roche Diagnostics GmbH, Nonnenwald 2, 82377 Penzberg, Germany
33
Bravo À, Piñero J, Queralt-Rosinach N, Rautschka M, Furlong LI. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015; 16:55. [PMID: 25886734 PMCID: PMC4466840 DOI: 10.1186/s12859-015-0472-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 12.9] [Received: 07/24/2014] [Accepted: 01/19/2015] [Indexed: 11/23/2022] Open
Abstract
Background Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. Results By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. Conclusions BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. 
Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0472-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Núria Queralt-Rosinach
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Michael Rautschka
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
- Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain.
34
Kotłowska A. Application of Chemometric Techniques in Search of Clinically Applicable Biomarkers of Disease. Drug Dev Res 2014; 75:283-90. [DOI: 10.1002/ddr.21213] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 12/30/2022]
Affiliation(s)
- Alicja Kotłowska
- Department of Food Sciences; Faculty of Pharmacy; Medical University of Gdańsk; Gdańsk 80-416 Poland