1
Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization Using Large Language Models. ARXIV 2024:arXiv:2305.13338v3. [PMID: 37292480 PMCID: PMC10246080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLM models perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. 
Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
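Illustrative note: the standard enrichment analysis that this abstract contrasts with LLM summarization is commonly implemented as a one-sided hypergeometric over-representation test. The sketch below is not taken from the paper; the function name `enrichment_p_value` and the example counts are hypothetical and chosen only to show how a p-value for a single term could be computed.

```python
from math import comb


def enrichment_p_value(background: int, annotated: int, study: int, hits: int) -> float:
    """One-sided hypergeometric p-value for over-representation of one term.

    background: number of genes in the background set
    annotated:  background genes annotated with the term
    study:      number of genes in the input gene list
    hits:       genes in the input list annotated with the term
    Returns P(X >= hits) when `study` genes are drawn without replacement.
    """
    upper = min(annotated, study)
    total = comb(background, study)
    # Sum the probability of every outcome at least as extreme as the observed one.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, study - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 background genes, 200 annotated with the term,
    # 50 genes in the study list, 5 of which carry the annotation.
    print(enrichment_p_value(20000, 200, 50, 5))
```

In practice a p-value like this is computed for every candidate term and then corrected for multiple testing, which is the kind of score the abstract says LLM-based approaches cannot reliably deliver.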
Affiliation(s)
- Marcin P Joachimiak
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- J Harry Caufield
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Nomi L Harris
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Christopher J Mungall
- Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
2
Schilder BM, Murphy AE, Skene NG. rworkflows: automating reproducible practices for the R community. Nat Commun 2024; 15:149. [PMID: 38167858 PMCID: PMC10761765 DOI: 10.1038/s41467-023-44484-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 12/14/2023] [Indexed: 01/05/2024] Open
Abstract
Despite calls to improve reproducibility in research, achieving this goal remains elusive even within computational fields. Currently, >50% of R packages are distributed exclusively through GitHub. While the trend towards sharing open-source software has been revolutionary, GitHub does not have any default built-in checks for minimal coding standards or software usability. This makes it difficult to assess the current quality of R packages, or to consistently use them over time and across platforms. While GitHub-native solutions are technically possible, they require considerable time and expertise for each developer to write, implement, and maintain. To address this, we develop rworkflows, a suite of tools to make robust continuous integration and deployment (https://github.com/neurogenomics/rworkflows). rworkflows can be implemented by developers of all skill levels using a one-time R function call which has both sensible defaults and extensive options for customisation. Once implemented, any updates to the GitHub repository automatically trigger parallel workflows that install all software dependencies, run code checks, generate a dedicated documentation website, and deploy a publicly accessible containerised environment. By making the rworkflows suite free, automated, and simple to use, we aim to promote widespread adoption of reproducible practices across a continually growing R community.
Affiliation(s)
- Brian M Schilder
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK.
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
- Alan E Murphy
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK
- Nathan G Skene
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK.
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
3
Friedrichs M, Königs C. A web-based platform for the annotation and analysis of NAR-published databases. PLoS One 2023; 18:e0293134. [PMID: 37871106 PMCID: PMC10593211 DOI: 10.1371/journal.pone.0293134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 10/06/2023] [Indexed: 10/25/2023] Open
Abstract
Biological databases are essential resources for life science research, but finding and selecting the most relevant and up-to-date databases can be challenging due to the large number and diversity of available databases. The Nucleic Acids Research (NAR) journal publishes annual database issues that provide a comprehensive list of databases in the molecular biology domain. However, the information provided by NAR is limited and sometimes does not reflect the current status and quality of the databases. In this article, we present a web-based platform for the annotation and analysis of NAR-published databases. The platform allows users to manually curate and enrich the NAR entries with additional information such as availability, downloadability, source code links, cross-references, and duplicates. Statistics and visualizations on various aspects of the database landscape, such as recency, status, category, and curation history are also provided. Currently, it contains a total of 2,246 database entries of which 2,025 are unique with the majority updated within the last five years. Around 75% of all databases are still available and more than half provide a download option. Cross references to Database Commons are available for 1,889 entries. The platform is freely available online at https://nardbstatus.kalis-amts.de and aims to help researchers in database selection and decision-making. It also provides insights into the current state and challenges of a subset of all databases in the life sciences.
Affiliation(s)
- Marcel Friedrichs
- Bioinformatics / Medical Informatics Department, Bielefeld University, Bielefeld, NRW, Germany
- Cassandra Königs
- Bioinformatics / Medical Informatics Department, Bielefeld University, Bielefeld, NRW, Germany
4
Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL, Hill DP, Lee R, Mi H, Moxon S, Mungall CJ, Muruganugan A, Mushayahama T, Sternberg PW, Thomas PD, Van Auken K, Ramsey J, Siegele DA, Chisholm RL, Fey P, Aspromonte MC, Nugnes MV, Quaglia F, Tosatto S, Giglio M, Nadendla S, Antonazzo G, Attrill H, Dos Santos G, Marygold S, Strelets V, Tabone CJ, Thurmond J, Zhou P, Ahmed SH, Asanitthong P, Luna Buitrago D, Erdol MN, Gage MC, Ali Kadhum M, Li KYC, Long M, Michalak A, Pesala A, Pritazahra A, Saverimuttu SCC, Su R, Thurlow KE, Lovering RC, Logie C, Oliferenko S, Blake J, Christie K, Corbani L, Dolan ME, Drabkin HJ, Hill DP, Ni L, Sitnikov D, Smith C, Cuzick A, Seager J, Cooper L, Elser J, Jaiswal P, Gupta P, Jaiswal P, Naithani S, Lera-Ramirez M, Rutherford K, Wood V, De Pons JL, Dwinell MR, Hayman GT, Kaldunski ML, Kwitek AE, Laulederkind SJF, Tutaj MA, Vedi M, Wang SJ, D'Eustachio P, Aimo L, Axelsen K, Bridge A, Hyka-Nouspikel N, Morgat A, Aleksander SA, Cherry JM, Engel SR, Karra K, Miyasato SR, Nash RS, Skrzypek MS, Weng S, Wong ED, Bakker E, Berardini TZ, Reiser L, Auchincloss A, Axelsen K, Argoud-Puy G, Blatter MC, Boutet E, Breuza L, Bridge A, Casals-Casas C, Coudert E, Estreicher A, Livia Famiglietti M, Feuermann M, Gos A, Gruaz-Gumowski N, Hulo C, Hyka-Nouspikel N, Jungo F, Le Mercier P, Lieberherr D, Masson P, Morgat A, Pedruzzi I, Pourcel L, Poux S, Rivoire C, Sundaram S, Bateman A, Bowler-Barnett E, Bye-A-Jee H, Denny P, Ignatchenko A, Ishtiaq R, Lock A, Lussi Y, Magrane M, Martin MJ, Orchard S, Raposo P, Speretta E, Tyagi N, Warner K, Zaru R, Diehl AD, Lee R, Chan J, Diamantakis S, Raciti D, Zarowiecki M, Fisher M, James-Zorn C, Ponferrada V, Zorn A, Ramachandran S, Ruzicka L, Westerfield M. The Gene Ontology knowledgebase in 2023. Genetics 2023; 224:iyad031. 
[PMID: 36866529 PMCID: PMC10158837 DOI: 10.1093/genetics/iyad031] [Citation(s) in RCA: 389] [Impact Index Per Article: 389.0] [Received: 12/13/2022] [Revised: 02/10/2023] [Accepted: 02/11/2023] [Indexed: 03/04/2023] Open
Abstract
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO, a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations, evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs), mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
5
A review on method entities in the academic literature: extraction, evaluation, and application. Scientometrics 2022. [DOI: 10.1007/s11192-022-04332-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
6
Schindler D, Bensmann F, Dietze S, Krüger F. The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Comput Sci 2022; 8:e835. [PMID: 35111920 PMCID: PMC8771769 DOI: 10.7717/peerj-cs.835] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/08/2021] [Accepted: 12/07/2021] [Indexed: 06/06/2023]
Abstract
Science across all disciplines has become increasingly data-driven, leading to additional needs with respect to software for collecting, processing and analysing data. Thus, transparency about software used as part of the scientific process is crucial to understand provenance of individual research data and insights, is a prerequisite for reproducibility and can enable macro-analysis of the evolution of scientific methods over time. However, missing rigor in software citation practices renders the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices facilitated through an unprecedented knowledge graph of software mentions and affiliated metadata generated through supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions and outperforms the state-of-the-art significantly, leading to the most comprehensive corpus of 11.8 M software mentions that are described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. Whereas, to the best of our knowledge, this is the most comprehensive analysis of software use and citation at the time, all data and models are shared publicly to facilitate further research into scientific use and citation of software.
Affiliation(s)
- David Schindler
- Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Felix Bensmann
- GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
- Stefan Dietze
- GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
- Heinrich-Heine-University, Düsseldorf, Germany
- Frank Krüger
- Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Department Knowledge, Culture & Transformation, University of Rostock, Rostock, Germany
7
Cho Y, Kim JS, Dai YC, Gafforov Y, Lim YW. Taxonomic evaluation of Xylodon (Hymenochaetales, Basidiomycota) in Korea and sequence verification of the corresponding species in GenBank. PeerJ 2021; 9:e12625. [PMID: 34966599 PMCID: PMC8667721 DOI: 10.7717/peerj.12625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2021] [Accepted: 11/19/2021] [Indexed: 11/23/2022] Open
Abstract
Genus Xylodon consists of white-rot fungi that grow on both angiosperms and gymnosperms. With resupinate and adnate basidiomes, Xylodon species have been classified into other resupinate genera for a long time. Upon the integration of molecular assessments, the taxonomy of the genus has been revised multiple times over the years. However, the emendations were poorly reflected in studies and public sequence databases. In the present study, the genus Xylodon in Korea was evaluated using molecular and morphological analyses of 172 specimens collected in the period of 2011 to 2018. The host types and geographical distributions were also determined for species delimitation. Furthermore, public sequences that correspond to the Xylodon species in Korea were assessed to validate their identities. Nine Xylodon species were identified in Korea, with three species new to the country. Morphological differentiation and identification of some species were challenging, but all nine species were clearly divided into well-resolved clades in the phylogenetic analyses. Detailed species descriptions, phylogeny, and a key to Xylodon species in Korea are provided in the present study. A total of 646 public ITS and nrLSU sequences corresponding to the nine Xylodon species were found, of which 404 (73.1%) and 57 (61.3%), respectively, were misidentified or labeled with synonymous names. In many cases, sequences released before the report of new names have not been revised or updated. Revisions of these sequences are arranged in the present study. These amendments may be used to avoid misidentification in future sequence-based identifications and concurrently prevent the accumulation of misidentified sequences in GenBank.
Affiliation(s)
- Yoonhee Cho
- School of Biological Sciences and Institute of Microbiology, Seoul National University, Seoul, South Korea
- Ji Seon Kim
- School of Biological Sciences and Institute of Microbiology, Seoul National University, Seoul, South Korea
- Yu-Cheng Dai
- Institute of Microbiology, School of Ecology and Nature Conservation, Beijing Forestry University, Beijing, China
- Yusufjon Gafforov
- Laboratory of Mycology, Institute of Botany, Academy of Sciences of Republic of Uzbekistan, Tashkent, Uzbekistan
- Young Woon Lim
- School of Biological Sciences and Institute of Microbiology, Seoul National University, Seoul, South Korea
8
Zheng A, Zhao H, Luo Z, Feng C, Liu X, Ye Y. Improving On-line Scientific Resource Profiling by Exploiting Resource Citation Information in the Literature. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102638] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
9
Data set entity recognition based on distant supervision. ELECTRONIC LIBRARY 2021. [DOI: 10.1108/el-10-2020-0301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose
This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.
Design/methodology/approach
Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to apply supervised learning. Secondly, a bidirectional encoder representation from transformers (BERT)-based neural model was applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, were introduced to enhance the model generalisability and improve the recognition of data set entities.
Findings
In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.
Originality/value
This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that our approach effectively improves the recognition of data set entities, especially long-tailed data set entities.
10
Lánczky A, Győrffy B. Web-Based Survival Analysis Tool Tailored for Medical Research (KMplot): Development and Implementation. J Med Internet Res 2021; 23:e27633. [PMID: 34309564 PMCID: PMC8367126 DOI: 10.2196/27633] [Citation(s) in RCA: 844] [Impact Index Per Article: 281.3] [Received: 02/01/2021] [Revised: 02/19/2021] [Accepted: 05/06/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND
Survival analysis is a cornerstone of medical research, enabling the assessment of clinical outcomes for disease progression and treatment efficiency. Despite its central importance, no commonly used spreadsheet software can handle survival analysis and there is no web server available for its computation.
OBJECTIVE
Here, we introduce a web-based tool capable of performing univariate and multivariate Cox proportional hazards survival analysis using data generated by genomic, transcriptomic, proteomic, or metabolomic studies.
METHODS
We implemented different methods to establish cut-off values for the trichotomization or dichotomization of continuous data. The false discovery rate is computed to correct for multiple hypothesis testing. A multivariate analysis option enables comparing omics data with clinical variables.
RESULTS
We established a registration-free web-based survival analysis tool capable of performing univariate and multivariate survival analysis using any custom-generated data.
CONCLUSIONS
This tool fills a gap and will be an invaluable contribution to basic medical and clinical research.
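Illustrative note: the abstract states that a false discovery rate is computed to correct for multiple hypothesis testing but does not name the procedure. The sketch below assumes the Benjamini-Hochberg step-up method, a common choice; it is not the tool's actual code, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values, e.g. one per gene tested against survival.
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

A gene would then be reported as significant when its adjusted value falls below the chosen FDR threshold.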
Affiliation(s)
- András Lánczky
- Department of Bioinformatics, Semmelweis University, Budapest, Hungary
- TTK Lendület Cancer Biomarker Research Group, Institute of Enzymology, Research Centre for Natural Sciences, Budapest, Hungary
- Balázs Győrffy
- Department of Bioinformatics, Semmelweis University, Budapest, Hungary
- TTK Lendület Cancer Biomarker Research Group, Institute of Enzymology, Research Centre for Natural Sciences, Budapest, Hungary
11
Marsh JI, Hu H, Gill M, Batley J, Edwards D. Crop breeding for a changing climate: integrating phenomics and genomics with bioinformatics. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2021; 134:1677-1690. [PMID: 33852055 DOI: 10.1007/s00122-021-03820-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/22/2020] [Accepted: 03/18/2021] [Indexed: 05/05/2023]
Abstract
Safeguarding crop yields in a changing climate requires bioinformatics advances in harnessing data from vast phenomics and genomics datasets to translate research findings into climate smart crops in the field. Climate change and an additional 3 billion mouths to feed by 2050 raise serious concerns over global food security. Crop breeding and land management strategies will need to evolve to maximize the utilization of finite resources in coming years. High-throughput phenotyping and genomics technologies are providing researchers with the information required to guide and inform the breeding of climate smart crops adapted to the environment. Bioinformatics has a fundamental role to play in integrating and exploiting this fast accumulating wealth of data, through association studies to detect genomic targets underlying key adaptive climate-resilient traits. These data provide tools for breeders to tailor crops to their environment and can be introduced using advanced selection or genome editing methods. To effectively translate research into the field, genomic and phenomic information will need to be integrated into comprehensive clade-specific databases and platforms alongside accessible tools that can be used by breeders to inform the selection of climate adaptive traits. Here we discuss the role of bioinformatics in extracting, analysing, integrating and managing genomic and phenomic data to improve climate resilience in crops, including current, emerging and potential approaches, applications and bottlenecks in the research and breeding pipeline.
Affiliation(s)
- Jacob I Marsh
- School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, 6009, Australia
- Haifei Hu
- School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, 6009, Australia
- Mitchell Gill
- School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, 6009, Australia
- Jacqueline Batley
- School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, 6009, Australia
- David Edwards
- School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, 6009, Australia
12
Rosado E, Garcia-Remesal M, Paraiso-Medina S, Pazos A, Maojo V. Using Machine Learning to Collect and Facilitate Remote Access to Biomedical Databases: Development of the Biomedical Database Inventory. JMIR Med Inform 2021; 9:e22976. [PMID: 33629960 PMCID: PMC7952234 DOI: 10.2196/22976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2020] [Revised: 12/30/2020] [Accepted: 01/16/2021] [Indexed: 11/13/2022] Open
Abstract
Background
Currently, existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases.
Objective
To address this issue, we developed the Biomedical Database Inventory (BiDI), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a path to access them seamlessly.
Methods
We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1242 articles that included mentions of database publications. Such a data set was used along with transfer learning techniques to train an ensemble of deep learning natural language processing models targeted at database publication detection.
Results
The system obtained an F1 score of 0.929 on database detection, showing high precision and recall values. When applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble model also extracted the weblinks to the reported databases and discarded irrelevant links. For the extraction of weblinks, the model achieved a cross-validated F1 score of 0.908. We show two use cases: one related to “omics” and the other related to the COVID-19 pandemic.
Conclusions
BiDI enables access to biomedical resources over the internet and facilitates data-driven research and other scientific initiatives. The repository is openly available online and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (ie, biomedical and others).
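Illustrative note: the F1 scores reported in this abstract are the harmonic mean of precision and recall. The sketch below shows that calculation from raw counts; the function name `precision_recall_f1` and the counts are hypothetical and do not come from the BiDI evaluation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from detection counts.

    tp: correctly detected items (true positives)
    fp: items detected in error (false positives)
    fn: items that were missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for a database-mention detector.
    print(precision_recall_f1(tp=80, fp=20, fn=20))
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F1 implies that both precision and recall are high, which matches the abstract's reading of its 0.929 score.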
Affiliation(s)
- Eduardo Rosado
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Miguel Garcia-Remesal
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Sergio Paraiso-Medina
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
- Alejandro Pazos
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos - Imagen Médica y Diagnóstico Radiológico, Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruña, A Coruña, Spain
- Victor Maojo
- Biomedical Informatics Group, School of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
13
Du C, Cohoon J, Lopez P, Howison J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J Assoc Inf Sci Technol 2021. [DOI: 10.1002/asi.24454] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/08/2022]
Affiliation(s)
- Caifan Du
- University of Texas at Austin, Austin, Texas, USA
14
Drysdale R, Cook CE, Petryszak R, Baillie-Gerritsen V, Barlow M, Gasteiger E, Gruhl F, Haas J, Lanfear J, Lopez R, Redaschi N, Stockinger H, Teixeira D, Venkatesan A, Blomberg N, Durinx C, McEntyre J. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences. Bioinformatics 2020; 36:2636-2642. [PMID: 31950984 PMCID: PMC7446027 DOI: 10.1093/bioinformatics/btz959] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 04/16/2019] [Revised: 10/08/2019] [Accepted: 01/07/2020] [Indexed: 01/07/2023] Open
Abstract
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rachel Drysdale
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Charles E Cook
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Robert Petryszak
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mary Barlow
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Franziska Gruhl
- SIB Swiss Institute of Bioinformatics Quartier Sorge-Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Jürgen Haas
- SIB Swiss Institute of Bioinformatics & Biozentrum, University of Basel, 4056 Basel, Switzerland
- Jerry Lanfear
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Rodrigo Lopez
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Nicole Redaschi
- SIB Swiss Institute of Bioinformatics, CMU, 1211 Geneva, Switzerland
- Heinz Stockinger
- SIB Swiss Institute of Bioinformatics Quartier Sorge-Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Daniel Teixeira
- SIB Swiss Institute of Bioinformatics Quartier Sorge-Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Hôpitaux Universitaires de Genève, Rue Gabrielle-Perret-Gentil 4, 1205 Geneva, Switzerland
- Aravind Venkatesan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Niklas Blomberg
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Christine Durinx
- SIB Swiss Institute of Bioinformatics Quartier Sorge-Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
15
|
Screening and Identification of Differentially Expressed Genes Expressed among Left and Right Colon Adenocarcinoma. BIOMED RESEARCH INTERNATIONAL 2020; 2020:8465068. [PMID: 32420374 PMCID: PMC7201700 DOI: 10.1155/2020/8465068] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/01/2019] [Revised: 11/06/2019] [Accepted: 12/17/2019] [Indexed: 01/05/2023]
Abstract
Purpose: Colon adenocarcinoma (COAD) is the third most common malignancy globally and is further categorized as left colon adenocarcinoma (LCOAD) or right colon adenocarcinoma (RCOAD) depending on the location of the primary tumor. The therapeutic outcome and long-term prognosis for patients with COAD are less than satisfactory, and this may be associated with tumor location. Therefore, it is important to investigate the genetic differences in COAD at different sites. Patients and Methods: Public data associated with COAD were downloaded from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) were identified using R software (version 3.5.3), and functional annotation of DEGs was performed using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses. A protein-protein interaction network was constructed, hub genes were identified and analyzed, and data mining using Gene Expression Profiling Interactive Analysis (GEPIA) was conducted. Results: A total of 286 DEGs were identified between LCOAD and RCOAD. Additionally, 10 hub genes associated with COAD at different locations were screened, namely, CDKN2A, IGF1R, MDM2, SMAD3, SLC2A1, GRM5, PLCB4, FGFR1, UBE2V2, and TNFRSF10B. The expression of cyclin-dependent kinase inhibitor 2A (CDKN2A) and solute carrier family 2 member 1 (SLC2A1) was significantly associated with pathological stage (P < 0.05). COAD patients with high expression levels of CDKN2A exhibited poorer overall survival (OS) times than those with low expression levels (P < 0.05). Conclusion: CDKN2A expression was significantly different between LCOAD and RCOAD and was closely related to the prognosis of COAD. It is of great value for further understanding of the pathogenesis of LCOAD and RCOAD.
|
16
|
Capuccini M, Dahlö M, Toor S, Spjuth O. MaRe: Processing Big Data with application containers on Apache Spark. Gigascience 2020; 9:giaa042. [PMID: 32369166 PMCID: PMC7199472 DOI: 10.1093/gigascience/giaa042] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/09/2019] [Revised: 02/10/2020] [Accepted: 04/07/2020] [Indexed: 11/18/2022]
Abstract
BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability. CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
Affiliation(s)
- Marco Capuccini
  - Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden
  - Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
- Martin Dahlö
  - Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
  - Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden
  - Uppsala Multidisciplinary Center for Advanced Computational Science, Uppsala University, Box 337, 75105, Uppsala, Sweden
- Salman Toor
  - Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden
- Ola Spjuth
  - Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
|
17
|
Kruger F, Schindler D. A Literature Review on Methods for the Extraction of Usage Statements of Software and Data. Comput Sci Eng 2020. [DOI: 10.1109/mcse.2019.2943847] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/08/2022]
|
18
|
Harth A, Kirrane S, Ngonga Ngomo AC, Paulheim H, Rula A, Gentile AL, Haase P, Cochez M. Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. THE SEMANTIC WEB 2020. [PMCID: PMC7250610 DOI: 10.1007/978-3-030-49461-2_16] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
Knowledge about the software used in scientific investigations is necessary for different reasons, including provenance of the results, measuring software impact to attribute developers, and bibliometric software citation analysis in general. Additionally, providing information about whether and how the software and the source code are available allows an assessment about the state and role of open source software in science in general. While such analyses can be done manually, large scale analyses require the application of automated methods of information extraction and linking. In this paper, we present SoftwareKG—a knowledge graph that contains information about software mentions from more than 51,000 scientific articles from the social sciences. A silver standard corpus, created by a distant and weak supervision approach, and a gold standard corpus, created by manual annotation, were used to train an LSTM based neural network to identify software mentions in scientific articles. The model achieves a recognition rate of .82 F-score in exact matches. As a result, we identified more than 133,000 software mentions. For entity disambiguation, we used the public domain knowledge base DBpedia. Furthermore, we linked the entities of the knowledge graph to other knowledge bases such as the Microsoft Academic Knowledge Graph, the Software Ontology, and Wikidata. Finally, we illustrate how SoftwareKG can be used to assess the role of software in the social sciences.
Affiliation(s)
- Andreas Harth
  - University of Erlangen-Nuremberg, Nuremberg, Germany
- Sabrina Kirrane
  - Vienna University of Economics and Business, Vienna, Austria
- Anisa Rula
  - University of Milano-Bicocca, Milan, Italy
- Michael Cochez
  - Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
|
19
|
The Alliance of Genome Resources: Building a Modern Data Ecosystem for Model Organism Databases. Genetics 2019; 213:1189-1196. [PMID: 31796553 PMCID: PMC6893393 DOI: 10.1534/genetics.119.302523] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 07/15/2019] [Accepted: 10/11/2019] [Indexed: 12/17/2022]
Abstract
Model organisms are essential experimental platforms for discovering gene functions, defining protein and genetic networks, uncovering functional consequences of human genome variation, and for modeling human disease. For decades, researchers who use model organisms have relied on Model Organism Databases (MODs) and the Gene Ontology Consortium (GOC) for expertly curated annotations, and for access to integrated genomic and biological information obtained from the scientific literature and public data archives. Through the development and enforcement of data and semantic standards, these genome resources provide rapid access to the collected knowledge of model organisms in human readable and computation-ready formats that would otherwise require countless hours for individual researchers to assemble on their own. Since their inception, the MODs for the predominant biomedical model organisms [Mus sp (laboratory mouse), Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, and Rattus norvegicus] along with the GOC have operated as a network of independent, highly collaborative genome resources. In 2016, these six MODs and the GOC joined forces as the Alliance of Genome Resources (the Alliance). By implementing shared programmatic access methods and data-specific web pages with a unified "look and feel," the Alliance is tackling barriers that have limited the ability of researchers to easily compare common data types and annotations across model organisms. To adapt to the rapidly changing landscape for evaluating and funding core data resources, the Alliance is building a modern, extensible, and operationally efficient "knowledge commons" for model organisms using shared, modular infrastructure.
|
20
|
Palmblad M, Lamprecht AL, Ison J, Schwämmle V. Automated workflow composition in mass spectrometry-based proteomics. Bioinformatics 2019; 35:656-664. [PMID: 30060113 PMCID: PMC6378944 DOI: 10.1093/bioinformatics/bty646] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 01/29/2018] [Revised: 07/06/2018] [Accepted: 07/26/2018] [Indexed: 11/28/2022]
Abstract
Motivation: Numerous software utilities operating on mass spectrometry (MS) data are described in the literature and provide specific operations as building blocks for the assembly of on-purpose workflows. Working out which tools and combinations are applicable or optimal in practice is often hard. Thus researchers face difficulties in selecting practical and effective data analysis pipelines for a specific experimental design. Results: We provide a toolkit to support researchers in identifying, comparing and benchmarking multiple workflows from individual bioinformatics tools. Automated workflow composition is enabled by the tools’ semantic annotation in terms of the EDAM ontology. To demonstrate the practical use of our framework, we created and evaluated a number of logically and semantically equivalent workflows for four use cases representing frequent tasks in MS-based proteomics. Indeed we found that the results computed by the workflows could vary considerably, emphasizing the benefits of a framework that facilitates their systematic exploration. Availability and implementation: The project files and workflows are available from https://github.com/bio-tools/biotoolsCompose/tree/master/Automatic-Workflow-Composition. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, RC Leiden, The Netherlands
- Anna-Lena Lamprecht
  - Department of Information and Computing Sciences, Utrecht University, CC Utrecht, The Netherlands
- Jon Ison
  - National Life Science Supercomputing Center, Technical University of Denmark, Kongens Lyngby, Denmark
- Veit Schwämmle
  - Department of Biochemistry and Molecular Biology and VILLUM Center for Bioanalytical Sciences, University of Southern Denmark, Odense, Denmark
|
21
|
Wei Q, Zhang Y, Amith M, Lin R, Lapeyrolerie J, Tao C, Xu H. Recognizing software names in biomedical literature using machine learning. Health Informatics J 2019; 26:21-33. [PMID: 31566474 PMCID: PMC7334865 DOI: 10.1177/1460458219869490] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/02/2022]
Abstract
Software tools now are essential to research and applications in the biomedical domain. However, existing software repositories are mainly built using manual curation, which is time-consuming and unscalable. This study took the initiative to manually annotate software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Specifically, two strategies were proposed for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features of clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then created a biomedical software catalog with 19,557 entries using the developed system. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
Affiliation(s)
- Muhammad Amith
  - The University of Texas Health Science Center at Houston, USA
- Hua Xu
  - The University of Texas Health Science Center at Houston, USA
|
22
|
Abstract
Bioinformatics plays a key role in supporting the life sciences. In this work, we examine bioinformatics in Jordan, beginning with the current status of bioinformatics education and research, then exploring the challenges of advancing bioinformatics, and finally looking to the future for how Jordanian bioinformatics research may develop.
Affiliation(s)
- Qanita Bani Baker
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
- Maryam S. Nuser
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
  - Department of Information Systems, Yarmouk University, Irbid, Jordan
|
23
|
Dozmorov MG. GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software. Front Bioeng Biotechnol 2018; 6:198. [PMID: 30619845 PMCID: PMC6306043 DOI: 10.3389/fbioe.2018.00198] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 11/12/2018] [Accepted: 12/04/2018] [Indexed: 11/13/2022]
Abstract
Modern research is increasingly data-driven and reliant on bioinformatics software. Publication is a common way of introducing new software, but not all bioinformatics tools get published. Given that there are competing tools, it is important not merely to find the appropriate software but also to have a metric for judging its usefulness. A journal's impact factor has been shown to be a poor predictor of software popularity; consequently, focusing on publications in high-impact journals limits users' choices in finding useful bioinformatics tools. Free and open source software repositories on popular code sharing platforms such as GitHub provide another venue to follow the latest bioinformatics trends. The open source component of GitHub allows users to bookmark and copy repositories that are most useful to them. This Perspective aims to demonstrate the utility of GitHub "stars," "watchers," and "forks" (GitHub statistics) as a measure of software impact. We compiled lists of impactful bioinformatics software and analyzed commonly used impact metrics and GitHub statistics of 50 genomics-oriented bioinformatics tools. We present examples of community-selected best bioinformatics resources and show that GitHub statistics are distinct from the journal's impact factor (JIF), citation counts, and alternative metrics (Altmetrics, CiteScore) in capturing the level of community attention. We suggest the use of GitHub statistics as an unbiased measure of the usability of bioinformatics software complementing the traditional impact metrics.
Affiliation(s)
- Mikhail G. Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, United States
|
24
|
Russell PH, Johnson RL, Ananthan S, Harnke B, Carlson NE. A large-scale analysis of bioinformatics code on GitHub. PLoS One 2018; 13:e0205898. [PMID: 30379882 PMCID: PMC6209220 DOI: 10.1371/journal.pone.0205898] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 06/01/2018] [Accepted: 10/03/2018] [Indexed: 11/19/2022]
Abstract
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
Affiliation(s)
- Pamela H. Russell
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
- Rachel L. Johnson
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
- Shreyas Ananthan
  - High-Performance Algorithms and Complex Fluids, National Renewable Energy Laboratory, Golden, CO, United States of America
- Benjamin Harnke
  - Health Sciences Library, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
- Nichole E. Carlson
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
|
25
|
|
26
|
Kleinaki AS, Mytis-Gkometh P, Drosatos G, Efraimidis PS, Kaldoudi E. A Blockchain-Based Notarization Service for Biomedical Knowledge Retrieval. Comput Struct Biotechnol J 2018; 16:288-297. [PMID: 30181840 PMCID: PMC6120721 DOI: 10.1016/j.csbj.2018.08.002] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 04/29/2018] [Revised: 08/06/2018] [Accepted: 08/09/2018] [Indexed: 11/26/2022]
Abstract
Biomedical research and clinical decision depend increasingly on scientific evidence realized by a number of authoritative databases, mostly public and continually enriched via peer scientific contributions. Given the dynamic nature of biomedical evidence data and their usage in the sensitive domain of biomedical science, it is important to ensure retrieved data integrity and non-repudiation. In this work, we present a blockchain-based notarization service that uses smart digital contracts to seal a biomedical database query and the respective results. The goal is to ensure that retrieved data cannot be modified after retrieval and that the database cannot validly deny that the particular data has been provided as a result of a specific query. Biomedical evidence data versioning is also supported. The feasibility of the proposed notarization approach is demonstrated using a real blockchain infrastructure and is tested on two different biomedical evidence databases: a publicly available medical risk factor reference repository and on the PubMed database of biomedical literature references and abstracts.
Affiliation(s)
- Athina-Styliani Kleinaki
  - Dept. of Electrical and Computer Engineering, Democritus University of Thrace, Kimmeria, Xanthi 67100, Greece
- Petros Mytis-Gkometh
  - Dept. of Electrical and Computer Engineering, Democritus University of Thrace, Kimmeria, Xanthi 67100, Greece
- George Drosatos
  - School of Medicine, Democritus University of Thrace, Dragana, Alexandroupoli 68100, Greece
- Pavlos S Efraimidis
  - Dept. of Electrical and Computer Engineering, Democritus University of Thrace, Kimmeria, Xanthi 67100, Greece
- Eleni Kaldoudi
  - School of Medicine, Democritus University of Thrace, Dragana, Alexandroupoli 68100, Greece
|
27
|
Abstract
Ever return from a meeting feeling elated by all those exciting talks, yet unsure how all those presented glamorous and/or exciting tools can be useful in your research? Or do you have a great piece of software you want to share, yet only a handful of people visited your poster? We have all been there, and that is why we organized the Matchmaking for Computational and Experimental Biologists Session at the latest ISCB/GLBIO’2017 meeting in Chicago (May 15-17, 2017). The session exemplifies a novel approach, mimicking “matchmaking”, to encouraging communication, making connections and fostering collaborations between computational and non-computational biologists. More specifically, the session facilitates face-to-face communication between researchers with similar or differing research interests, which we feel are critical for promoting productive discussions and collaborations. To accomplish this, three short scheduled talks were delivered, focusing on RNA-seq, integration of clinical and genomic data, and chromatin accessibility analyses. Next, small-table developer-led discussions, modeled after speed-dating, enabled each developer (including the speakers) to introduce a specific tool and to engage potential users or other developers around the table. Notably, we asked the audience whether any other tool developers would want to showcase their tool and we thus added four developers as moderators of these small-table discussions. Given the positive feedback from the tool developers, we feel that this type of session is an effective approach for promoting valuable scientific discussion, and is particularly helpful in the context of conferences where the number of participants and activities could hamper such interactions.
Affiliation(s)
- Ewy Mathé
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Ben Busby
  - National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Helen Piontkivska
  - Department of Biological Sciences and School of Biomedical Sciences, Kent State University, Kent, OH, 44242, USA
|
28
|
Callahan A, Winnenburg R, Shah NH. U-Index, a dataset and an impact metric for informatics tools and databases. Sci Data 2018; 5:180043. [PMID: 29557976 PMCID: PMC5859919 DOI: 10.1038/sdata.2018.43] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/20/2017] [Accepted: 02/08/2018] [Indexed: 01/28/2023]
Abstract
Measuring the usage of informatics resources such as software tools and databases is essential to quantifying their impact, value and return on investment. We have developed a publicly available dataset of informatics resource publications and their citation network, along with an associated metric (u-Index) to measure informatics resources' impact over time. Our dataset differentiates the context in which citations occur to distinguish between 'awareness' and 'usage', and uses a citing universe of open access publications to derive citation counts for quantifying impact. Resources with a high ratio of usage citations to awareness citations are likely to be widely used by others and have a high u-Index score. We have pre-calculated the u-Index for nearly 100,000 informatics resources. We demonstrate how the u-Index can be used to track informatics resource impact over time. The method of calculating the u-Index metric, the pre-computed u-Index values, and the dataset we compiled to calculate the u-Index are publicly available.
Affiliation(s)
- Alison Callahan
  - Stanford Center for Biomedical Informatics Research, Stanford University, Medical School Office Building X215, Stanford, CA 94305, USA
- Rainer Winnenburg
  - Stanford Center for Biomedical Informatics Research, Stanford University, Medical School Office Building X215, Stanford, CA 94305, USA
- Nigam H Shah
  - Stanford Center for Biomedical Informatics Research, Stanford University, Medical School Office Building X215, Stanford, CA 94305, USA
|
29
|
Mytis-Gkometh P, Drosatos G, Efraimidis PS, Kaldoudi E. Notarization of Knowledge Retrieval from Biomedical Repositories Using Blockchain Technology. PRECISION MEDICINE POWERED BY PHEALTH AND CONNECTED HEALTH 2018. [DOI: 10.1007/978-981-10-7419-6_12] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 01/15/2023]
|
30
|
Rose R, Constantinides B, Tapinos A, Robertson DL, Prosperi M. Challenges in the analysis of viral metagenomes. Virus Evol 2016; 2:vew022. [PMID: 29492275 PMCID: PMC5822887 DOI: 10.1093/ve/vew022] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 11/17/2022]
Abstract
Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.
Affiliation(s)
- Rebecca Rose
  - BioInfoExperts, Norfolk, VA, USA
  - Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
- Bede Constantinides
  - BioInfoExperts, Norfolk, VA, USA
  - Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
- Avraam Tapinos
  - BioInfoExperts, Norfolk, VA, USA
  - Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
- David L Robertson
  - BioInfoExperts, Norfolk, VA, USA
  - Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
- Mattia Prosperi
  - BioInfoExperts, Norfolk, VA, USA
  - Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
|