1
Voskamp M, Vinhoven L, Stanke F, Hafkemeyer S, Nietert MM. Integrating Text Mining into the Curation of Disease Maps. Biomolecules 2022; 12:biom12091278. [PMID: 36139119 PMCID: PMC9496510 DOI: 10.3390/biom12091278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 09/02/2022] [Accepted: 09/07/2022] [Indexed: 11/16/2022]
Abstract
An adequate form of visualization is required to gain an overview of, and ultimately understand, the complex and diverse biological mechanisms of diseases. Recently, disease maps have been introduced for this purpose. A disease map is defined as a systems biological map or model that combines metabolic, signaling, and physiological pathways to create a comprehensive overview of known disease mechanisms. With the increase in publications describing biological interactions, the effort required to create and curate comprehensive disease maps is growing accordingly. Therefore, new computational approaches are needed to reduce the time that manual curation takes. Text mining algorithms can be used to analyse the natural language of scientific publications. These algorithms take human-readable text passages and convert them into a more ordered, machine-usable data structure. To support the creation of disease maps by text mining, we developed an interactive, user-friendly disease map viewer. The disease map viewer displays text mining results in a systems biology map, where the user can review them and either validate or reject identified interactions. Ultimately, the viewer brings together the time-saving advantages of text mining with the accuracy of manual data curation.
Affiliation(s)
- Malte Voskamp
- Department of Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, 37077 Göttingen, Germany
- Liza Vinhoven
- Department of Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, 37077 Göttingen, Germany
- Frauke Stanke
- Clinic for Pediatric Pneumology, Allergology and Neonatology, Hannover Medical School, Carl-Neuberg-Strasse 1, 30625 Hannover, Germany
- Biomedical Research in Endstage and Obstructive Lung Disease Hannover (BREATH), the German Center for Lung Research, Carl-Neuberg-Strasse 1, 30625 Hannover, Germany
- Manuel Manfred Nietert
- Department of Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, 37077 Göttingen, Germany
- CIDAS Campus Institute Data Science, Goldschmidtstraße 1, 37077 Göttingen, Germany
- Correspondence: Tel.: +49-551-39-14920
2
Zafeiropoulos H, Paragkamian S, Ninidakis S, Pavlopoulos GA, Jensen LJ, Pafilis E. PREGO: A Literature and Data-Mining Resource to Associate Microorganisms, Biological Processes, and Environment Types. Microorganisms 2022; 10:microorganisms10020293. [PMID: 35208748 PMCID: PMC8879827 DOI: 10.3390/microorganisms10020293] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/27/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/12/2022]
Abstract
To elucidate ecosystem functioning, it is fundamental to recognize what processes occur in which environments (where) and which microorganisms carry them out (who). Here, we present PREGO, a one-stop-shop knowledge base providing such associations. PREGO combines text mining and data integration techniques to mine such what-where-who associations from data and metadata scattered in the scientific literature and in public omics repositories. Microorganisms, biological processes, and environment types are identified and mapped to ontology terms from established community resources. Analyses of comentions in text and co-occurrences in metagenomics data/metadata are performed to extract associations and a level of confidence is assigned to each of them thanks to a scoring scheme. The PREGO knowledge base contains associations for 364,508 microbial taxa, 1090 environmental types, 15,091 biological processes, and 7971 molecular functions with a total of almost 58 million associations. These associations are available through a web portal, an Application Programming Interface (API), and bulk download. By exploring environments and/or processes associated with each other or with microbes, PREGO aims to assist researchers in design and interpretation of experiments and their results. To demonstrate PREGO’s capabilities, a thorough presentation of its web interface is given along with a meta-analysis of experimental results from a lagoon-sediment study of sulfur-cycle related microbes.
Affiliation(s)
- Haris Zafeiropoulos
- Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013 Heraklion, Crete, Greece
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Savvas Paragkamian
- Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013 Heraklion, Crete, Greece
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Stelios Ninidakis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Georgios A. Pavlopoulos
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center “Alexander Fleming”, 16672 Vari, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, 11527 Athens, Greece
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
- Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Correspondence: Tel.: +30-2810-337748
3
Ali I, Dreij K, Baker S, Högberg J, Korhonen A, Stenius U. Application of Text Mining in Risk Assessment of Chemical Mixtures: A Case Study of Polycyclic Aromatic Hydrocarbons (PAHs). Environmental Health Perspectives 2021; 129:67008. [PMID: 34165340 PMCID: PMC8318069 DOI: 10.1289/ehp6702] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/19/2019] [Revised: 05/07/2021] [Accepted: 05/10/2021] [Indexed: 05/08/2023]
Abstract
BACKGROUND: Cancer risk assessment of complex exposures, such as exposure to mixtures of polycyclic aromatic hydrocarbons (PAHs), is challenging due to the diverse biological activities of these compounds. With the help of text mining (TM), we have developed two TM tools, the latest iteration of the Cancer Risk Assessment using Biomedical literature tool (CRAB3) and a Cancer Hallmarks Analytics Tool (CHAT), that could be useful for automatic literature analyses in cancer risk assessment and research. Whereas CRAB3 analyses are based on carcinogenic modes of action (MOAs) and cover almost all the key characteristics of carcinogens, CHAT evaluates literature according to the hallmarks of cancer, that is, the alterations in cellular behavior that characterize the cancer cell. OBJECTIVES: The objective was to evaluate the usefulness of these tools to support cancer risk assessment by performing a case study of 22 European Union and U.S. Environmental Protection Agency priority PAHs and diesel exhaust and a case study of PAH interactions with silica. METHODS: We analyzed PubMed literature, comprising 57,498 references concerning priority PAHs and complex PAH mixtures, using CRAB3 and CHAT. RESULTS: CRAB3 analyses correctly identified similarities and differences in genotoxic and nongenotoxic MOAs of the 22 priority PAHs and grouped them according to their known carcinogenic potential. CHAT had the same capacity and complemented the CRAB3 output when comparing, for example, benzo[a]pyrene and dibenzo[a,l]pyrene. Both CRAB3 and CHAT analyses highlighted potentially interacting mechanisms within and across complex PAH mixtures and mechanisms of possible importance for interactions with silica. CONCLUSION: These data suggest that our TM approach can be useful in the hazard identification of PAHs and mixtures including PAHs. The tools can assist in grouping chemicals and identifying similarities and differences in carcinogenic MOAs and their interactions.
https://doi.org/10.1289/EHP6702.
Affiliation(s)
- Imran Ali
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
- Kristian Dreij
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
- Simon Baker
- Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK
- Johan Högberg
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
- Anna Korhonen
- Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK
- Ulla Stenius
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
4
Herrgårdh T, Madai VI, Kelleher JD, Magnusson R, Gustafsson M, Milani L, Gennemark P, Cedersund G. Hybrid modelling for stroke care: Review and suggestions of new approaches for risk assessment and simulation of scenarios. Neuroimage Clin 2021; 31:102694. [PMID: 34000646 PMCID: PMC8141769 DOI: 10.1016/j.nicl.2021.102694] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/04/2020] [Revised: 04/27/2021] [Accepted: 05/04/2021] [Indexed: 11/28/2022]
Abstract
Stroke is an example of a complex and multi-factorial disease involving multiple organs, timescales, and disease mechanisms. To deal with this complexity, and to realize Precision Medicine of stroke, mathematical models are needed. Such approaches include: 1) machine learning, 2) bioinformatic network models, and 3) mechanistic models. Since these three approaches have complementary strengths and weaknesses, a hybrid modelling approach combining them would be the most beneficial. However, no concrete approach ready to be implemented for a specific disease has been presented to date. In this paper, we both review the strengths and weaknesses of the three approaches, and propose a roadmap for hybrid modelling in the case of stroke care. We focus on two main tasks needed for the clinical setting: a) For stroke risk calculation, we propose a new two-step approach, where non-linear mixed effects models and bioinformatic network models yield biomarkers which are used as input to a machine learning model and b) For simulation of care scenarios, we propose a new four-step approach, which revolves around iterations between simulations of the mechanistic models and imputations of non-modelled or non-measured variables. We illustrate and discuss the different approaches in the context of Precision Medicine for stroke.
Affiliation(s)
- Tilda Herrgårdh
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden
- Vince I Madai
- Charité Lab for Artificial Intelligence in Medicine - CLAIM, Charité University Medicine Berlin, Germany; School of Computing and Digital Technology, Faculty of Computing, Engineering and the Built Environment, Birmingham City University, Birmingham, UK
- John D Kelleher
- ADAPT Research Centre, Technological University Dublin, Ireland
- Rasmus Magnusson
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
- Mika Gustafsson
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden
- Lili Milani
- Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia
- Peter Gennemark
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden; Drug Metabolism and Pharmacokinetics, Early Cardiovascular, Renal and Metabolism, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- Gunnar Cedersund
- Integrative Systems Biology, Department of Biomedical Engineering, Linköping University, 58185 Linköping, Sweden
5
Singh G, Papoutsoglou EA, Keijts-Lalleman F, Vencheva B, Rice M, Visser RG, Bachem CW, Finkers R. Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait. BMC Plant Biology 2021; 21:198. [PMID: 33894758 PMCID: PMC8070292 DOI: 10.1186/s12870-021-02943-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2020] [Accepted: 03/29/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND: Scientific literature carries a wealth of information crucial for research, but only a fraction of it is present as structured information in databases and can therefore be analyzed using traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and we investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes. RESULTS: We trained an NLP model based on a manually annotated corpus of 34 full-text potato articles to recognize relevant biological entities (genes, proteins, metabolites and traits) and relationships between them in text. This model detected biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles which focus on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum), and determined that the networks contained both previously known and contemporaneously unknown leads to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) two years before explicit statements of that connection appeared in the literature. CONCLUSIONS: Our time-based analysis indicates that network-assisted hypothesis generation shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.
Affiliation(s)
- Gurnoor Singh
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Mark Rice
- IBM Netherlands, Amsterdam, The Netherlands
- Richard G.F. Visser
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Christian W.B. Bachem
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Richard Finkers
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
6
Kaushik V, Plazzer J, Macrae F. Evaluation of literature searching tools for curation of mismatch repair gene variants in hereditary colon cancer. Advanced Genetics (Hoboken, N.J.) 2021; 2:e10039. [PMID: 36618447 PMCID: PMC9744508 DOI: 10.1002/ggn2.10039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/28/2020] [Revised: 01/12/2021] [Accepted: 01/14/2021] [Indexed: 01/11/2023]
Abstract
Pathogenic constitutional genomic variants in the mismatch repair (MMR) genes are the drivers of Lynch syndrome; optimal variant interpretation is required for the management of suspected and confirmed cases. The International Society for Hereditary Gastrointestinal Tumours (InSiGHT) provides expert classifications for MMR variants for the US National Human Genome Research Institute's (NHGRI) ClinGen initiative and interprets variants with discordant classifications and those of uncertain significance (VUSs). Given the onerous nature of extracting information related to variants, literature searching tools which harness artificial intelligence may aid in retrieving information to allow optimum variant classification. In this study, we described the nature of discordance in a sample of 80 variants from a list of variants requiring updating by InSiGHT for ClinGen by comparing their existing InSiGHT classifications with the various submissions for each variant on the US National Centre for Biotechnology Information's (NCBI) ClinVar database. To identify the potential value of a literature searching tool in extracting information related to classification, all variants were searched for independently using a traditional method (Google Scholar) and a literature searching tool (Mastermind). Descriptive statistics were used to compare the number of articles before and after screening for relevance and the number of relevant articles unique to either method. Relevance was defined as containing the variant in question as well as data informing variant interpretation. A total of 916 articles were returned by both methods, and Mastermind averaged four relevant articles per search compared to Google Scholar's three. Of relevant Mastermind articles, 193/308 (62.7%) were unique to it, compared to 87/202 (43.0%) for Google Scholar. For 24 variants, either or both methods found no information. All 6/80 (20%) variants with pathogenic or likely pathogenic InSiGHT classifications have newer VUS assertions on ClinVar. Our study demonstrated that, for a sample of variants with varying discordant interpretations, Mastermind was able to return, on average, a more relevant and unique set of literature. Google Scholar was able to retrieve information that Mastermind did not, which supports a conclusion that Mastermind could play a complementary role in literature searching for classification. This work will aid InSiGHT in its role of classifying MMR variants.
Affiliation(s)
- Varun Kaushik
- Melbourne Medical School, The University of Melbourne, Parkville, Victoria, Australia; Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
- John-Paul Plazzer
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
- Finlay Macrae
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia; Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, Victoria, Australia
7
Bao Y, Deng Z, Wang Y, Kim H, Armengol VD, Acevedo F, Ouardaoui N, Wang C, Parmigiani G, Barzilay R, Braun D, Hughes KS. Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31545655 DOI: 10.1200/cci.19.00042] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/15/2022]
Abstract
PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (the risk of cancer for germline mutation carriers) or the prevalence of germline genetic mutations. MATERIALS AND METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM), which learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN), which learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy (the percentage of papers that were correctly classified), whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.
Affiliation(s)
- Yujia Bao
- Massachusetts Institute of Technology, Boston, MA
- Yan Wang
- Massachusetts General Hospital, Boston, MA
- Heeyoon Kim
- Massachusetts Institute of Technology, Boston, MA
- Cathy Wang
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Giovanni Parmigiani
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Danielle Braun
- Harvard T.H. Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Kevin S Hughes
- Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA
8
Ogris C, Guala D, Sonnhammer ELL. FunCoup 4: new species, data, and visualization. Nucleic Acids Res 2019; 46:D601-D607. [PMID: 29165593 PMCID: PMC5755233 DOI: 10.1093/nar/gkx1138] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Received: 09/08/2017] [Accepted: 10/31/2017] [Indexed: 01/22/2023]
Abstract
This release of the FunCoup database (http://funcoup.sbc.su.se) is the fourth generation of one of the most comprehensive databases for genome-wide functional association networks. These functional associations are inferred by integrating various data types using a naive Bayesian algorithm and orthology-based information transfer across different species. This approach provides high coverage of the included genomes as well as high quality of inferred interactions. In this update of FunCoup we introduce four new eukaryotic species (Schizosaccharomyces pombe, Plasmodium falciparum, Bos taurus, and Oryza sativa) and open the database to the prokaryotic domain by including networks for Escherichia coli and Bacillus subtilis. The latter allows us to also introduce a new class of functional association between genes: co-occurrence in the same operon. We also supplemented the existing classes of functional association (metabolic, signaling, complex, and physical protein interaction) with up-to-date information. In this release we switched to InParanoid v8 as the source of orthology and the basis for calculation of phylogenetic profiles. While populating all other evidence types with new data, we introduce a new evidence type based on quantitative mass spectrometry data. Finally, the new JavaScript-based network viewer provides the user with an intuitive and responsive platform to further evaluate the results.
Affiliation(s)
- Christoph Ogris
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Dimitri Guala
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
9
Guala D, Ogris C, Müller N, Sonnhammer ELL. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2019; 21:1224-1237. [PMID: 31281921 PMCID: PMC7373183 DOI: 10.1093/bib/bbz064] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 02/05/2019] [Revised: 04/29/2019] [Accepted: 05/04/2019] [Indexed: 02/06/2023]
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
- Christoph Ogris
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Nikola Müller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
10
Couch D, Yu Z, Nam JH, Allen C, Ramos PS, da Silveira WA, Hunt KJ, Hazard ES, Hardiman G, Lawson A, Chung D. GAIL: An interactive webserver for inference and dynamic visualization of gene-gene associations based on gene ontology guided mining of biomedical literature. PLoS One 2019; 14:e0219195. [PMID: 31260503 PMCID: PMC6602258 DOI: 10.1371/journal.pone.0219195] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/23/2019] [Accepted: 06/18/2019] [Indexed: 01/08/2023]
Abstract
In systems biology, inference of functional associations among genes is compelling because the construction of functional association networks facilitates biomarker discovery. Specifically, such gene associations in humans can help identify putative biomarkers that can be used as diagnostic tools in treating patients. Although biomedical literature is considered a valuable data source for this task, only a limited number of webservers are currently available for mining gene-gene associations from the vast amount of biomedical literature using text mining techniques. Moreover, these webservers often have limited coverage of biomedical literature and lack efficient and user-friendly tools to interpret and visualize mined relationships among genes. To address these limitations, we developed GAIL (Gene-gene Association Inference based on biomedical Literature), an interactive webserver that infers human gene-gene associations from Gene Ontology (GO) guided biomedical literature mining and provides dynamic visualization of the resulting association networks and various gene set enrichment analysis tools. We evaluate the utility and performance of GAIL with applications to gene signatures associated with systemic lupus erythematosus and breast cancer. Results show that GAIL allows effective interrogation and visualization of gene-gene networks and their subnetworks, which facilitates biological understanding of gene-gene associations. GAIL is available at http://chunglab.io/GAIL/.
Affiliation(s)
- Daniel Couch
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Zhenning Yu
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Jin Hyun Nam
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Carter Allen
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Paula S. Ramos
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Department of Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Willian A. da Silveira
- Department of Pathology and Laboratory Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Kelly J. Hunt
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Edward S. Hazard
- Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Gary Hardiman
- Department of Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Andrew Lawson
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Dongjun Chung
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
11
Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, Beaty KA, Dehan E, Parikh B. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019; 138:109-124. [PMID: 30671672 PMCID: PMC6373233 DOI: 10.1007/s00439-019-01970-5] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 09/27/2018] [Accepted: 01/02/2019] [Indexed: 02/07/2023]
Abstract
In the field of cancer genomics, the broad availability of genetic information offered by next-generation sequencing technologies and rapid growth in biomedical publication has led to the advent of the big-data era. Integration of artificial intelligence (AI) approaches such as machine learning, deep learning, and natural language processing (NLP) to tackle the challenges of scalability and high dimensionality of data and to transform big data into clinically actionable knowledge is expanding and becoming the foundation of precision medicine. In this paper, we review the current status and future directions of AI application in cancer genomics within the context of workflows to integrate genomic analysis for precision cancer care. The existing solutions of AI and their limitations in cancer genetic testing and diagnostics such as variant calling and interpretation are critically analyzed. Publicly available tools or algorithms for key NLP technologies in the literature mining for evidence-based clinical recommendations are reviewed and compared. In addition, the present paper highlights the challenges to AI adoption in digital healthcare with regard to data requirements, algorithmic transparency, reproducibility, and real-world assessment, and discusses the importance of preparing patients and physicians for modern digitized healthcare. We believe that AI will remain the main driver to healthcare transformation toward precision medicine, yet the unprecedented challenges posed should be addressed to ensure safety and beneficial impact to healthcare.
Affiliation(s)
- Jia Xu
- IBM Watson Health, Cambridge, MA, USA.
| | | | - Shang Xue
- IBM Watson Health, Cambridge, MA, USA
| | | | | | - Fang Wang
- IBM Watson Health, Cambridge, MA, USA
|
12
|
Zhao N, Zheng G, Li J, Zhao HY, Lu C, Jiang M, Zhang C, Guo HT, Lu AP. Text Mining of Rheumatoid Arthritis and Diabetes Mellitus to Understand the Mechanisms of Chinese Medicine in Different Diseases with Same Treatment. Chin J Integr Med 2018; 24:777-784. [PMID: 29327123 DOI: 10.1007/s11655-018-2825-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Accepted: 03/25/2016] [Indexed: 11/28/2022]
Abstract
OBJECTIVE To identify the commonalities between rheumatoid arthritis (RA) and diabetes mellitus (DM) to understand the mechanisms of Chinese medicine (CM) in different diseases with the same treatment. METHODS A text mining approach was adopted to analyze the commonalities between RA and DM according to CM and biological elements. The major commonalities were subsequently verified in RA and DM rat models, in which herbal formula for the treatment of both RA and DM identified via text mining was used as the intervention. RESULTS Similarities were identified between RA and DM regarding the CM approach used for diagnosis and treatment, as well as the networks of biological activities affected by each disease, including the involvement of adhesion molecules, oxidative stress, cytokines, T-lymphocytes, apoptosis, and inflammation. The Ramulus Cinnamomi-Radix Paeoniae Alba-Rhizoma Anemarrhenae is an herbal combination used to treat RA and DM. This formula demonstrated similar effects on oxidative stress and inflammation in rats with collagen-induced arthritis, which supports the text mining results regarding the commonalities between RA and DM. CONCLUSION Commonalities between the biological activities involved in RA and DM were identified through text mining, and both RA and DM might be responsive to the same intervention at a specific stage.
Affiliation(s)
- Ning Zhao
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Guang Zheng
- School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Jian Li
- School of Basic Medical Sciences, Beijing University of Chinese Medicine, Beijing, 100029, China
| | - Hong-Yan Zhao
- Institute of Basic Theory for Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Cheng Lu
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Miao Jiang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Chi Zhang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Hong-Tao Guo
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Ai-Ping Lu
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China.
|
13
|
Fergadis A, Baziotis C, Pappas D, Papageorgiou H, Potamianos A. Hierarchical bi-directional attention-based RNNs for supporting document classification on protein-protein interactions affected by genetic mutations. Database (Oxford) 2018; 2018:5077305. [PMID: 30137284 PMCID: PMC6105093 DOI: 10.1093/database/bay076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/15/2018] [Accepted: 06/22/2018] [Indexed: 02/03/2023]
Abstract
In this paper, we describe a hierarchical bi-directional attention-based Recurrent Neural Network (RNN) as a reusable sequence encoder architecture, which is used as sentence and document encoder for document classification. The sequence encoder is composed of two bi-directional RNNs equipped with an attention mechanism that identifies and captures the most important elements, words or sentences, in a document, followed by a dense layer for the classification task. Our approach utilizes the hierarchical nature of documents, which are composed of sequences of sentences, which in turn are composed of sequences of words. In our model, we use word embeddings to project the words to a low-dimensional vector space. We leverage word embeddings trained on PubMed for initializing the embedding layer of our network. We apply this model to biomedical literature, specifically to paper abstracts published in PubMed. We argue that the title of the paper itself usually contains important information more salient than a typical sentence in the abstract. For this reason, we propose a shortcut connection that integrates the title vector representation directly into the final feature representation of the document. We concatenate the sentence vector that represents the title and the vectors of the abstract to the document feature vector used as input to the task classifier. With this system we participated in the Document Triage Task of the BioCreative VI Precision Medicine Track and achieved 0.6289 Precision, 0.7656 Recall and 0.6906 F1-score, with the Precision and F1-score ranking first among the participating systems. Database URL: https://github.com/afergadis/BC6PM-HRNN
Affiliation(s)
- Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece; Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
| | - Christos Baziotis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece; Department of Informatics, Athens University of Economics and Business, 76 Patission Str., Athens, Greece
| | - Dimitris Pappas
- Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece; Department of Informatics, Athens University of Economics and Business, 76 Patission Str., Athens, Greece
| | - Haris Papageorgiou
- Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
| | - Alexandros Potamianos
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece; Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
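As a quick arithmetic check on the scores reported in the Fergadis et al. abstract above, F1 is conventionally the harmonic mean of precision and recall. The Python sketch below assumes that standard definition (the abstract does not state the formula), and the function name `f1_score` is illustrative only. Because the reported precision and recall are themselves rounded to four decimals, the recomputed value should be expected to match the reported F1 only approximately.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        # Convention: F1 is taken as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract (rounded to four decimals there).
reported_precision = 0.6289
reported_recall = 0.7656
reported_f1 = 0.6906

computed_f1 = f1_score(reported_precision, reported_recall)

# Compare with a tolerance, since the inputs are rounded values.
difference = abs(computed_f1 - reported_f1)
print(f"computed F1 = {computed_f1:.6f}, |difference| = {difference:.6f}")
```

Under these assumptions the recomputed score agrees with the reported F1 to about three decimal places, which is consistent with rounding of the published precision and recall.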
|
14
|
Babtie AC, Stumpf MPH. How to deal with parameters for whole-cell modelling. J R Soc Interface 2017; 14:20170237. [PMID: 28768879 PMCID: PMC5582120 DOI: 10.1098/rsif.2017.0237] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 03/30/2017] [Accepted: 06/22/2017] [Indexed: 11/12/2022] Open
Abstract
Dynamical systems describing whole cells are on the verge of becoming a reality. But as models of reality, they are only useful if we have realistic parameters for the molecular reaction rates and cell physiological processes. There is currently no suitable framework to reliably estimate hundreds, let alone thousands, of reaction rate parameters. Here, we map out the relative weaknesses and promises of different approaches aimed at redressing this issue. While suitable procedures for estimation or inference of the whole (vast) set of parameters will, in all likelihood, remain elusive, some hope can be drawn from the fact that much of the cellular behaviour may be explained in terms of smaller sets of parameters. Identifying such parameter sets and assessing their behaviour is now becoming possible even for very large systems of equations, and we expect such methods to become central tools in the development and analysis of whole-cell models.
Affiliation(s)
- Ann C Babtie
- Department of Life Sciences, Imperial College London, London, UK
|
15
|
Gomez-Cabrero D, Tegnér J. Iterative Systems Biology for Medicine – Time for advancing from network signatures to mechanistic equations. Curr Opin Syst Biol 2017. [DOI: 10.1016/j.coisb.2017.05.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
|
16
|
Gomez-Cabrero D, Menche J, Vargas C, Cano I, Maier D, Barabási AL, Tegnér J, Roca J. From comorbidities of chronic obstructive pulmonary disease to identification of shared molecular mechanisms by data integration. BMC Bioinformatics 2016; 17:441. [PMID: 28185567 PMCID: PMC5133493 DOI: 10.1186/s12859-016-1291-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Deep mining of healthcare data has provided maps of comorbidity relationships between diseases. In parallel, integrative multi-omics investigations have generated high-resolution molecular maps of putative relevance for understanding disease initiation and progression. Yet, it is unclear how to advance an observation of comorbidity relations (one disease to others) to a molecular understanding of the driver processes and associated biomarkers. RESULTS Since Chronic Obstructive Pulmonary Disease (COPD) has emerged as a central hub in temporal comorbidity networks, we developed a systematic integrative data-driven framework to identify shared disease-associated genes and pathways, as a proxy for the underlying generative mechanisms inducing comorbidity. We integrated records from approximately 13 M patients from the Medicare database with disease-gene maps that we derived from several resources including a semantic-derived knowledge-base. Using rank-based statistics we not only recovered known comorbidities but also discovered a novel association between COPD and digestive diseases. Furthermore, our analysis provides the first set of COPD co-morbidity candidate biomarkers, including IL15, TNF and JUP, and characterizes their association to aging and life-style conditions, such as smoking and physical activity. CONCLUSIONS The developed framework provides novel insights into COPD and especially COPD co-morbidity associated mechanisms. The methodology could be used to discover and decipher the molecular underpinning of other comorbidity relationships and, furthermore, allow the identification of candidate co-morbidity biomarkers.
Affiliation(s)
- David Gomez-Cabrero
- Department of Medicine, Karolinska Institutet, Unit of Computational Medicine, Stockholm, 171 77, Sweden; Karolinska Institutet, Center for Molecular Medicine, Stockholm, 171 77, Sweden; Department of Medicine, Unit of Clinical Epidemiology, Karolinska University Hospital, Solna, L8, 17176, Sweden; Science for Life Laboratory, Solna, 17121, Sweden; Mucosal and Salivary Biology Division, King's College London Dental Institute, London, UK
| | - Jörg Menche
- Center for Complex Networks Research and Department of Physics, Northeastern University, Boston, MA, USA; Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Center for Network Science, Central European University, Budapest, Hungary
| | - Claudia Vargas
- Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Hospital Clinic de Barcelona, Universitat de Barcelona, Barcelona, Spain; Center for Biomedical Network Research in Respiratory Diseases (CIBERES), Madrid, Spain
| | - Isaac Cano
- Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Hospital Clinic de Barcelona, Universitat de Barcelona, Barcelona, Spain; Center for Biomedical Network Research in Respiratory Diseases (CIBERES), Madrid, Spain
| | | | - Albert-László Barabási
- Center for Complex Networks Research and Department of Physics, Northeastern University, Boston, MA, USA; Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Center for Network Science, Central European University, Budapest, Hungary; Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Jesper Tegnér
- Department of Medicine, Karolinska Institutet, Unit of Computational Medicine, Stockholm, 171 77, Sweden; Karolinska Institutet, Center for Molecular Medicine, Stockholm, 171 77, Sweden; Department of Medicine, Unit of Clinical Epidemiology, Karolinska University Hospital, Solna, L8, 17176, Sweden; Science for Life Laboratory, Solna, 17121, Sweden
| | - Josep Roca
- Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Hospital Clinic de Barcelona, Universitat de Barcelona, Barcelona, Spain; Center for Biomedical Network Research in Respiratory Diseases (CIBERES), Madrid, Spain
|
17
|
Tennant JP, Waldner F, Jacques DC, Masuzzo P, Collister LB, Hartgerink CHJ. The academic, economic and societal impacts of Open Access: an evidence-based review. F1000Res 2016; 5:632. [PMID: 27158456 DOI: 10.12688/f1000research.8460.1] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Accepted: 04/08/2016] [Indexed: 11/20/2022] Open
Abstract
Ongoing debates surrounding Open Access to the scholarly literature are multifaceted and complicated by disparate and often polarised viewpoints from engaged stakeholders. At the current stage, Open Access has become such a global issue that it is critical for all involved in scholarly publishing, including policymakers, publishers, research funders, governments, learned societies, librarians, and academic communities, to be well-informed on the history, benefits, and pitfalls of Open Access. In spite of this, there is a general lack of consensus regarding the potential pros and cons of Open Access at multiple levels. This review aims to be a resource for current knowledge on the impacts of Open Access by synthesizing important research in three major areas: academic, economic and societal. While there is clearly much scope for additional research, several key trends are identified, including a broad citation advantage for researchers who publish openly, as well as additional benefits to the non-academic dissemination of their work. The economic impact of Open Access is less well-understood, although it is clear that access to the research literature is key for innovative enterprises, and a range of governmental and non-governmental services. Furthermore, Open Access has the potential to save both publishers and research funders considerable amounts of financial resources, and can provide some economic benefits to traditionally subscription-based journals. The societal impact of Open Access is strong, in particular for advancing citizen science initiatives, and leveling the playing field for researchers in developing countries. Open Access supersedes all potential alternative modes of access to the scholarly literature through enabling unrestricted re-use, and long-term stability independent of financial constraints of traditional publishers that impede knowledge sharing. 
However, Open Access has the potential to become unsustainable for research communities if high-cost options are allowed to continue to prevail in a widely unregulated scholarly publishing market. Open Access remains only one of the multiple challenges that the scholarly publishing system is currently facing. Yet, it provides one foundation for increasing engagement with researchers regarding ethical standards of publishing and the broader implications of 'Open Research'.
Affiliation(s)
- Jonathan P Tennant
- Department of Earth Science and Engineering, Imperial College London, London, UK
| | - François Waldner
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Damien C Jacques
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Paola Masuzzo
- Medical Biotechnology Center, VIB, Ghent, Belgium; Department of Biochemistry, Ghent University, Ghent, Belgium
| | - Lauren B Collister
- University Library System, University of Pittsburgh, Pittsburgh, PA, USA
| | - Chris H J Hartgerink
- Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands
|
18
|
Tennant JP, Waldner F, Jacques DC, Masuzzo P, Collister LB, Hartgerink CHJ. The academic, economic and societal impacts of Open Access: an evidence-based review. F1000Res 2016; 5:632. [PMID: 27158456 PMCID: PMC4837983 DOI: 10.12688/f1000research.8460.3] [Citation(s) in RCA: 157] [Impact Index Per Article: 19.6] [Accepted: 09/20/2016] [Indexed: 12/22/2022] Open
Affiliation(s)
- Jonathan P Tennant
- Department of Earth Science and Engineering, Imperial College London, London, UK
| | - François Waldner
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Damien C Jacques
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Paola Masuzzo
- Medical Biotechnology Center, VIB, Ghent, Belgium; Department of Biochemistry, Ghent University, Ghent, Belgium
| | - Lauren B Collister
- University Library System, University of Pittsburgh, Pittsburgh, PA, USA
| | - Chris H J Hartgerink
- Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands
|
19
|
Tennant JP, Waldner F, Jacques DC, Masuzzo P, Collister LB, Hartgerink CHJ. The academic, economic and societal impacts of Open Access: an evidence-based review. F1000Res 2016; 5:632. [PMID: 27158456 PMCID: PMC4837983 DOI: 10.12688/f1000research.8460.2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Accepted: 09/20/2016] [Indexed: 12/02/2023] Open
Affiliation(s)
- Jonathan P. Tennant
- Department of Earth Science and Engineering, Imperial College London, London, UK
| | - François Waldner
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Damien C. Jacques
- Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
| | - Paola Masuzzo
- Medical Biotechnology Center, VIB, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
|
20
|
Abascal MF, Besso MJ, Rosso M, Mencucci MV, Aparicio E, Szapiro G, Furlong LI, Vazquez-Levin MH. CDH1/E-cadherin and solid tumors. An updated gene-disease association analysis using bioinformatics tools. Comput Biol Chem 2015; 60:9-20. [PMID: 26674224 DOI: 10.1016/j.compbiolchem.2015.10.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/24/2015] [Revised: 10/17/2015] [Accepted: 10/19/2015] [Indexed: 12/13/2022]
Abstract
Cancer is a group of diseases that causes millions of deaths worldwide. Among cancers, Solid Tumors (ST) stand out due to their high incidence and mortality rates. Disruption of cell-cell adhesion is highly relevant during tumor progression. Epithelial-cadherin (protein: E-cadherin, gene: CDH1) is a key molecule in cell-cell adhesion; its abnormal expression and/or function contributes to tumor progression, and it is altered in ST. A systematic study was carried out to gather and summarize current knowledge on CDH1/E-cadherin and ST using bioinformatics resources. The DisGeNET database was exploited to survey CDH1-associated diseases. Reported mutations in specific ST were obtained by interrogating COSMIC and IntOGen tools. CDH1 Single Nucleotide Polymorphisms (SNPs) were retrieved from the dbSNP database. DisGeNET analysis identified 609 genes annotated to ST, among which CDH1 was listed. Using CDH1 as query term, 26 disease concepts were found, 21 of which were neoplasms-related terms. Using DisGeNET ALL Databases, 172 disease concepts were identified. Of those, 80 ST disease-related terms were subjected to manual curation and 75/80 (93.75%) associations were validated. On selected ST, 489 CDH1 somatic mutations were listed in COSMIC and IntOGen databases. Breast neoplasms had the highest CDH1-mutation rate. CDH1 was positioned among the 20 genes with highest mutation frequency and was confirmed as driver gene in breast cancer. Over 14,000 SNPs for CDH1 were found in the dbSNP database. This report used DisGeNET to gather/compile current knowledge on gene-disease association for CDH1/E-cadherin and ST; data curation expanded the number of terms that relate them. An updated list of CDH1 somatic mutations was obtained with COSMIC and IntOGen databases and of SNPs from dbSNP. This information can be used to further understand the role of CDH1/E-cadherin in health and disease.
Affiliation(s)
- María Florencia Abascal
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - María José Besso
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - Marina Rosso
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - María Victoria Mencucci
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - Evangelina Aparicio
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - Gala Szapiro
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
| | - Laura Inés Furlong
- Research Programme on Biomedical Informatics (GRIB) (IMIM), DCEXS, Universitat Pompeu Fabra, C/Dr Aiguader 88, Zip Code 08003, Barcelona, Spain.
| | - Mónica Hebe Vazquez-Levin
- Laboratory of Cell-Cell Interaction in Cancer and Reproduction, Instituto de Biología & Medicina Experimental (IBYME), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Fundación IBYME (FIBYME), Vuelta de Obligado 2490, Zip Code C1428ADN, Buenos Aires, Argentina.
|
21
|
Kiela D, Guo Y, Stenius U, Korhonen A. Unsupervised discovery of information structure in biomedical documents. Bioinformatics 2015; 31:1084-92. [PMID: 25411329 DOI: 10.1093/bioinformatics/btu758] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/14/2014] [Accepted: 11/10/2014] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. RESULTS Our best unsupervised algorithm (multilevel-weighted graph clustering algorithm) performs very well on the task, obtaining over 0.70 F scores for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. AVAILABILITY AND IMPLEMENTATION The annotated corpus and software are available at http://www.cl.cam.ac.uk/~dk427/bio14info.html.
Affiliation(s)
- Douwe Kiela
- Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK and Institute of Environmental Medicine, Karolinska Institutet, Stockholm SE-171 77, Sweden
- Yufan Guo
- Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK and Institute of Environmental Medicine, Karolinska Institutet, Stockholm SE-171 77, Sweden
- Ulla Stenius
- Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK and Institute of Environmental Medicine, Karolinska Institutet, Stockholm SE-171 77, Sweden
- Anna Korhonen
- Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK and Institute of Environmental Medicine, Karolinska Institutet, Stockholm SE-171 77, Sweden
|
22
|
Tamaddoni-Nezhad A, Milani GA, Raybould A, Muggleton S, Bohan DA. Construction and Validation of Food Webs Using Logic-Based Machine Learning and Text Mining. Adv Ecol Res 2013. [DOI: 10.1016/b978-0-12-420002-9.00004-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 01/01/2023]
|
23
|
Approaches to verb subcategorization for biomedicine. J Biomed Inform 2012; 46:212-27. [PMID: 23276747 DOI: 10.1016/j.jbi.2012.12.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 03/14/2012] [Revised: 12/05/2012] [Accepted: 12/06/2012] [Indexed: 11/23/2022]
Abstract
Information about verb subcategorization frames (SCFs) is important to many tasks in natural language processing (NLP) and, in turn, text mining. Biomedicine has a need for high-quality SCF lexicons to support the extraction of information from the biomedical literature, which helps biologists to take advantage of the latest biomedical knowledge despite the overwhelming growth of that literature. Unfortunately, techniques for creating such resources for biomedical text are relatively undeveloped compared to general language. This paper serves as an introduction to subcategorization and existing approaches to acquisition, and provides motivation for developing techniques that address issues particularly important to biomedical NLP. First, we give the traditional linguistic definition of subcategorization, along with several related concepts. Second, we describe approaches to learning SCF lexicons from large data sets for general and biomedical domains. Third, we consider the crucial issue of linguistic variation between biomedical fields (subdomain variation). We demonstrate significant variation among subdomains, and find the variation does not simply follow patterns of general lexical variation. Finally, we note several requirements for future research in biomedical SCF lexicon acquisition: a high-quality gold standard, investigation of different definitions of subcategorization, and minimally-supervised methods that can learn subdomain-specific lexical usage without the need for extensive manual work.
|
24
|
Harmston N, Filsell W, Stumpf MPH. Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices. Bioinformatics 2011; 28:254-60. [PMID: 22135416 DOI: 10.1093/bioinformatics/btr640] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The scientific literature contains a wealth of information about biological systems. Manual curation lacks the scalability to extract this information due to the ever-increasing numbers of papers being published. The development and application of text mining technologies has been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping of gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method, which transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walk over the resulting graph, we can perform multi-class supervised classification allowing the assignment of taxonomy identifiers to individual gene mentions. The ability to achieve good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent, even in literature-derived data. RESULTS: Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F(1) of 0.76 and macro-F(1) of 0.36 on the publicly available DECA corpus. Re-curation of the DECA corpus was performed, with our method achieving 0.88 micro-F(1) and 0.51 macro-F(1). Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and its resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words classification'.
Affiliation(s)
- Nathan Harmston
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK
|
25
|
Carreira R, Carneiro S, Pereira R, Rocha M, Rocha I, Ferreira EC, Lourenço A. Semantic annotation of biological concepts interplaying microbial cellular responses. BMC Bioinformatics 2011; 12:460. [PMID: 22122862 PMCID: PMC3259143 DOI: 10.1186/1471-2105-12-460] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 05/18/2011] [Accepted: 11/28/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Automated extraction systems have become a time-saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes. RESULTS: Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules), proteins (transcription factors, enzymes and transporters), small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts) and compounds (most frequently annotated concepts), whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts. CONCLUSIONS: To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes. Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts.
Affiliation(s)
- Rafael Carreira
- Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
|
26
|
Verification of systems biology research in the age of collaborative competition. Nat Biotechnol 2011; 29:811-5. [DOI: 10.1038/nbt.1968] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.7] [Indexed: 11/08/2022]
|