1
|
Tang X, Qian B, Gao R, Chen J, Chen X, Gerstein MB. BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics 2024; 40:i266-i276. [PMID: 38940140 PMCID: PMC11211839 DOI: 10.1093/bioinformatics/btae230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%). AVAILABILITY AND IMPLEMENTATION All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
Collapse
Affiliation(s)
- Xiangru Tang
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Bill Qian
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Rick Gao
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Jiakang Chen
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Xinyun Chen
- Google Deepmind, Mountain View, CA 94043, United States
| | - Mark B Gerstein
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
- Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
- Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
| |
Collapse
|
2
|
Schilder BM, Murphy AE, Skene NG. rworkflows: automating reproducible practices for the R community. Nat Commun 2024; 15:149. [PMID: 38167858 PMCID: PMC10761765 DOI: 10.1038/s41467-023-44484-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Accepted: 12/14/2023] [Indexed: 01/05/2024] Open
Abstract
Despite calls to improve reproducibility in research, achieving this goal remains elusive even within computational fields. Currently, >50% of R packages are distributed exclusively through GitHub. While the trend towards sharing open-source software has been revolutionary, GitHub does not have any default built-in checks for minimal coding standards or software usability. This makes it difficult to assess the current quality R packages, or to consistently use them over time and across platforms. While GitHub-native solutions are technically possible, they require considerable time and expertise for each developer to write, implement, and maintain. To address this, we develop rworkflows; a suite of tools to make robust continuous integration and deployment ( https://github.com/neurogenomics/rworkflows ). rworkflows can be implemented by developers of all skill levels using a one-time R function call which has both sensible defaults and extensive options for customisation. Once implemented, any updates to the GitHub repository automatically trigger parallel workflows that install all software dependencies, run code checks, generate a dedicated documentation website, and deploy a publicly accessible containerised environment. By making the rworkflows suite free, automated, and simple to use, we aim to promote widespread adoption of reproducible practices across a continually growing R community.
Collapse
Affiliation(s)
- Brian M Schilder
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK.
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
| | - Alan E Murphy
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK
| | - Nathan G Skene
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, W12 0BZ, UK.
- UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
| |
Collapse
|
3
|
Samuel S, Mietchen D. Computational reproducibility of Jupyter notebooks from biomedical publications. Gigascience 2024; 13:giad113. [PMID: 38206590 PMCID: PMC10783158 DOI: 10.1093/gigascience/giad113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2022] [Revised: 08/09/2023] [Accepted: 12/08/2023] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. APPROACH We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion. RESULTS Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions. CONCLUSIONS We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
Collapse
Affiliation(s)
- Sheeba Samuel
- Heinz-Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Jena 07743, Germany
- Michael Stifel Center Jena, Jena 07743, Germany
| | - Daniel Mietchen
- Ronin Institute, Montclair 07043-2314, NJ, United States
- Institute for Globally Distributed Open Research and Education (IGDORE)
- FIZ Karlsruhe—Leibniz Institute for Information Infrastructure, Berlin 76344, Germany
| |
Collapse
|
4
|
Weberpals J, Wang SV. The FAIRification of research in real-world evidence: A practical introduction to reproducible analytic workflows using Git and R. Pharmacoepidemiol Drug Saf 2024; 33:e5740. [PMID: 38173166 DOI: 10.1002/pds.5740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 01/05/2024]
Abstract
Transparency and reproducibility are major prerequisites for conducting meaningful real-world evidence (RWE) studies that are fit for decision-making. Many advances have been made in the documentation and reporting of study protocols and results, but the principles for version control and sharing of analytic code in RWE are not yet as established as in other quantitative disciplines like computational biology and health informatics. In this practical tutorial, we aim to give an introduction to distributed version control systems (VCS) tailored toward the FAIR (Findable, Accessible, Interoperable, and Reproducible) implementation of RWE studies. To ease adoption, we provide detailed step-by-step instructions with practical examples on how the Git VCS and R programming language can be implemented into RWE study workflows to facilitate reproducible analyzes. We further discuss and showcase how these tools can be used to track changes, collaborate, disseminate, and archive RWE studies through dedicated project repositories that maintain a complete audit trail of all relevant study documents.
Collapse
Affiliation(s)
- Janick Weberpals
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
| | - Shirley V Wang
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
| |
Collapse
|
5
|
Downing T, Angelopoulos N. A primer on correlation-based dimension reduction methods for multi-omics analysis. J R Soc Interface 2023; 20:20230344. [PMID: 37817584 PMCID: PMC10565429 DOI: 10.1098/rsif.2023.0344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Accepted: 09/19/2023] [Indexed: 10/12/2023] Open
Abstract
The continuing advances of omic technologies mean that it is now more tangible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods when three or more omic datasets are available and which complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate emerging methods for multi-omics and integrating diverse omic datasets appropriately. This raises the opportunity of implementing population multi-omics with large sample sizes as omics technologies and our understanding improve.
Collapse
Affiliation(s)
- Tim Downing
- Pirbright Institute, Pirbright, Surrey, UK
- Department of Biotechnology, Dublin City University, Dublin, Ireland
| | | |
Collapse
|
6
|
Armato SG, Drukker K, Hadjiiski L. AI in medical imaging grand challenges: translation from competition to research benefit and patient care. Br J Radiol 2023; 96:20221152. [PMID: 37698542 PMCID: PMC10546459 DOI: 10.1259/bjr.20221152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2022] [Revised: 05/24/2023] [Accepted: 07/11/2023] [Indexed: 09/13/2023] Open
Abstract
Artificial intelligence (AI), in one form or another, has been a part of medical imaging for decades. The recent evolution of AI into approaches such as deep learning has dramatically accelerated the application of AI across a wide range of radiologic settings. Despite the promises of AI, developers and users of AI technology must be fully aware of its potential biases and pitfalls, and this knowledge must be incorporated throughout the AI system development pipeline that involves training, validation, and testing. Grand challenges offer an opportunity to advance the development of AI methods for targeted applications and provide a mechanism for both directing and facilitating the development of AI systems. In the process, a grand challenge centralizes (with the challenge organizers) the burden of providing a valid benchmark test set to assess performance and generalizability of participants' models and the collection and curation of image metadata, clinical/demographic information, and the required reference standard. The most relevant grand challenges are those designed to maximize the open-science nature of the competition, with code and trained models deposited for future public access. The ultimate goal of AI grand challenges is to foster the translation of AI systems from competition to research benefit and patient care. Rather than reference the many medical imaging grand challenges that have been organized by groups such as MICCAI, RSNA, AAPM, and grand-challenge.org, this review assesses the role of grand challenges in promoting AI technologies for research advancement and for eventual clinical implementation, including their promises and limitations.
Collapse
Affiliation(s)
- Samuel G Armato
- Department of Radiology, The University of Chicago, Chicago, Illinois, USA
| | - Karen Drukker
- Department of Radiology, The University of Chicago, Chicago, Illinois, USA
| | - Lubomir Hadjiiski
- Department of Radiology, University of Michigan, Ann Arbor, Michigan, USA
| |
Collapse
|
7
|
Pillay NS, Ross OA, Christoffels A, Bardien S. Current Status of Next-Generation Sequencing Approaches for Candidate Gene Discovery in Familial Parkinson´s Disease. Front Genet 2022; 13:781816. [PMID: 35299952 PMCID: PMC8921601 DOI: 10.3389/fgene.2022.781816] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Accepted: 01/12/2022] [Indexed: 11/13/2022] Open
Abstract
Parkinson’s disease is a neurodegenerative disorder with a heterogeneous genetic etiology. The advent of next-generation sequencing (NGS) technologies has aided novel gene discovery in several complex diseases, including PD. This Perspective article aimed to explore the use of NGS approaches to identify novel loci in familial PD, and to consider their current relevance. A total of 17 studies, spanning various populations (including Asian, Middle Eastern and European ancestry), were identified. All the studies used whole-exome sequencing (WES), with only one study incorporating both WES and whole-genome sequencing. It is worth noting how additional genetic analyses (including linkage analysis, haplotyping and homozygosity mapping) were incorporated to enhance the efficacy of some studies. Also, the use of consanguineous families and the specific search for de novo mutations appeared to facilitate the finding of causal mutations. Across the studies, similarities and differences in downstream analysis methods and the types of bioinformatic tools used, were observed. Although these studies serve as a practical guide for novel gene discovery in familial PD, these approaches have not significantly resolved the “missing heritability” of PD. We speculate that what is needed is the use of third-generation sequencing technologies to identify complex genomic rearrangements and new sequence variation, missed with existing methods. Additionally, the study of ancestrally diverse populations (in particular those of Black African ancestry), with the concomitant optimization and tailoring of sequencing and analytic workflows to these populations, are critical. Only then, will this pave the way for exciting new discoveries in the field.
Collapse
Affiliation(s)
- Nikita Simone Pillay
- South African National Bioinformatics Institute (SANBI), South African Medical Research Council Bioinformatics Unit, University of the Western Cape, Bellville, South Africa
| | - Owen A. Ross
- Department of Neuroscience, Mayo Clinic, Jacksonville, FL, United States
- Department of Clinical Genomics, Mayo Clinic, Jacksonville, FL, United States
| | - Alan Christoffels
- South African National Bioinformatics Institute (SANBI), South African Medical Research Council Bioinformatics Unit, University of the Western Cape, Bellville, South Africa
- Africa Centres for Disease Control and Prevention, African Union Headquarters, Addis Ababa, Ethiopia
| | - Soraya Bardien
- Division of Molecular Biology and Human Genetics, Department of Biomedical Sciences, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa
- South African Medical Research Council/Stellenbosch University Genomics of Brain Disorders Research Unit, Cape Town, South Africa
- *Correspondence: Soraya Bardien,
| |
Collapse
|
8
|
Schindler D, Bensmann F, Dietze S, Krüger F. The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Comput Sci 2022; 8:e835. [PMID: 35111920 PMCID: PMC8771769 DOI: 10.7717/peerj-cs.835] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2021] [Accepted: 12/07/2021] [Indexed: 06/06/2023]
Abstract
Science across all disciplines has become increasingly data-driven, leading to additional needs with respect to software for collecting, processing and analysing data. Thus, transparency about software used as part of the scientific process is crucial to understand provenance of individual research data and insights, is a prerequisite for reproducibility and can enable macro-analysis of the evolution of scientific methods over time. However, missing rigor in software citation practices renders the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices facilitated through an unprecedented knowledge graph of software mentions and affiliated metadata generated through supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions and outperforms the state-of-the-art significantly, leading to the most comprehensive corpus of 11.8 M software mentions that are described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. Whereas, to the best of our knowledge, this is the most comprehensive analysis of software use and citation at the time, all data and models are shared publicly to facilitate further research into scientific use and citation of software.
Collapse
Affiliation(s)
- David Schindler
- Institute of Communications Engineering, University of Rostock, Rostock, Germany
| | - Felix Bensmann
- GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
| | - Stefan Dietze
- GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
- Heinrich-Heine-University, Düsseldorf, Germany
| | - Frank Krüger
- Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Department Knowledge, Culture & Transformation, University of Rostock, Rostock, Germany
| |
Collapse
|
9
|
Johnson D, Batista D, Cochrane K, Davey RP, Etuk A, Gonzalez-Beltran A, Haug K, Izzo M, Larralde M, Lawson TN, Minotto A, Moreno P, Nainala VC, O'Donovan C, Pireddu L, Roger P, Shaw F, Steinbeck C, Weber RJM, Sansone SA, Rocca-Serra P. ISA API: An open platform for interoperable life science experimental metadata. Gigascience 2021; 10:giab060. [PMID: 34528664 PMCID: PMC8444265 DOI: 10.1093/gigascience/giab060] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2020] [Revised: 03/19/2021] [Accepted: 08/23/2021] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab-a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, the JSON serialization ISA-JSON was developed. RESULTS In this work, we present the ISA API, a Python library for the creation, editing, parsing, and validating of ISA-Tab and ISA-JSON formats by using a common data model engineered as Python object classes. We describe the ISA API feature set, early adopters, and its growing user community. CONCLUSIONS The ISA API provides users with rich programmatic metadata-handling functionality to support automation, a common interface, and an interoperable medium between the 2 ISA formats, as well as with other life science data formats required for depositing data in public databases.
Collapse
Affiliation(s)
- David Johnson
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
- Department of Informatics and Media, Uppsala University, Box 513, 75120 Uppsala, Sweden
| | - Dominique Batista
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
| | - Keeva Cochrane
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Robert P Davey
- Earlham Institute, Data infrastructure and algorithms, Norwich Research Park, Norwich NR4 7UZ, UK
| | - Anthony Etuk
- Earlham Institute, Data infrastructure and algorithms, Norwich Research Park, Norwich NR4 7UZ, UK
| | - Alejandra Gonzalez-Beltran
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
- Science and Technology Facilities Council, Scientific Computing Department, Rutherford Appleton Laboratory, Harwell Campus, Didcot, OX11 0QX, UK
| | - Kenneth Haug
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Genome Research Limited, Wellcome Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Saffron Walden, CB10 1RQ, UK
| | - Massimiliano Izzo
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
| | - Martin Larralde
- Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstraße 1, 69117 Heidelberg, Germany
| | - Thomas N Lawson
- School of Biosciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
| | - Alice Minotto
- Earlham Institute, Data infrastructure and algorithms, Norwich Research Park, Norwich NR4 7UZ, UK
| | - Pablo Moreno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Venkata Chandrasekhar Nainala
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Luca Pireddu
- Distributed Computing Group, CRS4: Center for Advanced Studies, Research & Development in Sardinia, Pula 09050, Italy
| | - Pierrick Roger
- CEA, LIST, Laboratory for Data Analysis and Systems’ Intelligence, MetaboHUB, Gif-Sur-Yvette F-91191, France
| | - Felix Shaw
- Earlham Institute, Data infrastructure and algorithms, Norwich Research Park, Norwich NR4 7UZ, UK
| | - Christoph Steinbeck
- Cheminformatics and Computational Metabolomics, Institute for Analytical Chemistry, Lessingstr. 8, 07743 Jena, Germany
| | - Ralf J M Weber
- School of Biosciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Phenome Centre Birmingham, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
| | - Susanna-Assunta Sansone
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
| | - Philippe Rocca-Serra
- Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK
| |
Collapse
|
10
|
Shajii A, Numanagić I, Leighton AT, Greenyer H, Amarasinghe S, Berger B. A Python-based programming language for high-performance computational genomics. Nat Biotechnol 2021; 39:1062-1064. [PMID: 34282326 PMCID: PMC8542382 DOI: 10.1038/s41587-021-00985-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Affiliation(s)
- Ariya Shajii
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Ibrahim Numanagić
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
| | - Alexander T Leighton
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Haley Greenyer
- Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
| | - Saman Amarasinghe
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
| |
Collapse
|
11
|
Du C, Cohoon J, Lopez P, Howison J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J Assoc Inf Sci Technol 2021. [DOI: 10.1002/asi.24454] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Affiliation(s)
- Caifan Du
- University of Texas at Austin Austin Texas USA
| | | | | | | |
Collapse
|
12
|
Estravis-Barcala M, Mattera MG, Soliani C, Bellora N, Opgenoorth L, Heer K, Arana MV. Molecular bases of responses to abiotic stress in trees. JOURNAL OF EXPERIMENTAL BOTANY 2020; 71:3765-3779. [PMID: 31768543 PMCID: PMC7316969 DOI: 10.1093/jxb/erz532] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/09/2019] [Accepted: 11/25/2019] [Indexed: 05/05/2023]
Abstract
Trees are constantly exposed to climate fluctuations, which vary with both time and geographic location. Environmental changes that are outside of the physiological favorable range usually negatively affect plant performance and trigger responses to abiotic stress. Long-living trees in particular have evolved a wide spectrum of molecular mechanisms to coordinate growth and development under stressful conditions, thus minimizing fitness costs. The ongoing development of techniques directed at quantifying abiotic stress has significantly increased our knowledge of physiological responses in woody plants. However, it is only within recent years that advances in next-generation sequencing and biochemical approaches have enabled us to begin to understand the complexity of the molecular systems that underlie these responses. Here, we review recent progress in our understanding of the molecular bases of drought and temperature stresses in trees, with a focus on functional, transcriptomic, epigenetic, and population genomic studies. In addition, we highlight topics that will contribute to progress in our understanding of the plastic and adaptive responses of woody plants to drought and temperature in a context of global climate change.
Collapse
Affiliation(s)
- Maximiliano Estravis-Barcala
- Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales, (Consejo Nacional de Investigaciones Científicas y Técnicas- Universidad Nacional del Comahue), San Carlos de Bariloche, Rio Negro, Argentina
| | - María Gabriela Mattera
- Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
| | - Carolina Soliani
- Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
| | - Nicolás Bellora
- Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales, (Consejo Nacional de Investigaciones Científicas y Técnicas- Universidad Nacional del Comahue), San Carlos de Bariloche, Rio Negro, Argentina
| | - Lars Opgenoorth
- Department of Ecology, Philipps University Marburg, Marburg, Germany
- Swiss Federal Research Institute WSL, BirmensdorfSwitzerland
| | - Katrin Heer
- Department of Conservation Biology, Philipps University Marburg, Marburg Germany
| | - María Verónica Arana
- Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
- Correspondence:
| |
Collapse
|
13
|
Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé EA. Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources. Metabolites 2020; 10:E202. [PMID: 32429287 PMCID: PMC7281435 DOI: 10.3390/metabo10050202] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Revised: 05/07/2020] [Accepted: 05/13/2020] [Indexed: 02/06/2023] Open
Abstract
As researchers are increasingly able to collect data on a large scale from multiple clinical and omics modalities, multi-omics integration is becoming a critical component of metabolomics research. This introduces a need for increased understanding by the metabolomics researcher of computational and statistical analysis methods relevant to multi-omics studies. In this review, we discuss common types of analyses performed in multi-omics studies and the computational and statistical methods that can be used for each type of analysis. We pinpoint the caveats and considerations for analysis methods, including required parameters, sample size and data distribution requirements, sources of a priori knowledge, and techniques for the evaluation of model accuracy. Finally, for the types of analyses discussed, we provide examples of the applications of corresponding methods to clinical and basic research. We intend that our review may be used as a guide for metabolomics researchers to choose effective techniques for multi-omics analyses relevant to their field of study.
Collapse
Affiliation(s)
- Tara Eicher
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
- Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
| | - Garrett Kinnebrew
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
- Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA;
- Bioinformatics Shared Resource Group, The Ohio State University, Columbus, OH 43210, USA
| | - Andrew Patt
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA;
- Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
| | - Kyle Spencer
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
- Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
- Nationwide Children’s Research Hospital, Columbus, OH 43210, USA
| | - Kevin Ying
- Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA;
- Molecular, Cellular and Developmental Biology Program, The Ohio State University, Columbus, OH 43210, USA
| | - Qin Ma
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
| | - Raghu Machiraju
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
- Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
- Department of Pathology, Wexner Medical Center, The Ohio State University, Columbus, OH 43210, USA
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH 43210, USA
| | - Ewy A. Mathé
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA; (T.E.); (G.K.); (K.S.); (Q.M.); (R.M.)
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA;
| |
Collapse
|
14
|
Hulsen T. The ten commandments of translational research informatics. DATA SCIENCE 2019. [DOI: 10.3233/ds-190020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Affiliation(s)
- Tim Hulsen
- Department of Professional Health Solutions & Services, Philips Research, Eindhoven, The Netherlands. E-mail:
| |
Collapse
|
15
|
Thelwall M, Nevill T. No evidence of citation bias as a determinant of STEM gender disparities in US biochemistry, genetics and molecular biology research. Scientometrics 2019. [DOI: 10.1007/s11192-019-03271-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
16
|
Georgeson P, Syme A, Sloggett C, Chung J, Dashnow H, Milton M, Lonsdale A, Powell D, Seemann T, Pope B. Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. Gigascience 2019; 8:giz109. [PMID: 31544213 PMCID: PMC6755254 DOI: 10.1093/gigascience/giz109] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2019] [Revised: 07/16/2019] [Accepted: 08/13/2019] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Bioinformatics software tools are often created ad hoc, frequently by people without extensive training in software development. In particular, for beginners, the barrier to entry in bioinformatics software development is high, especially if they want to adopt good programming practices. Even experienced developers do not always follow best practices. This results in the proliferation of poorer-quality bioinformatics software, leading to limited scalability and inefficient use of resources; lack of reproducibility, usability, adaptability, and interoperability; and erroneous or inaccurate results. FINDINGS We have developed Bionitio, a tool that automates the process of starting new bioinformatics software projects following recommended best practices. With a single command, the user can create a new well-structured project in 1 of 12 programming languages. The resulting software is functional, carrying out a prototypical bioinformatics task, and thus serves as both a working example and a template for building new tools. Key features include command-line argument parsing, error handling, progress logging, defined exit status values, a test suite, a version number, standardized building and packaging, user documentation, code documentation, a standard open source software license, software revision control, and containerization. CONCLUSIONS Bionitio serves as a learning aid for beginner-to-intermediate bioinformatics programmers and provides an excellent starting point for new projects. This helps developers adopt good programming practices from the beginning of a project and encourages high-quality tools to be developed more rapidly. This also benefits users because tools are more easily installed and consistent in their usage. Bionitio is released as open source software under the MIT License and is available at https://github.com/bionitio-team/bionitio.
Collapse
Affiliation(s)
- Peter Georgeson
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
| | - Anna Syme
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Royal Botanic Gardens Victoria, Birdwood Avenue, Melbourne, Victoria, Australia 3004
| | - Clare Sloggett
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
| | - Jessica Chung
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
| | - Harriet Dashnow
- Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
- School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
| | - Michael Milton
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Melbourne Genomics Health Alliance, Walter and Eliza Hall Institute, 1G Royal Parade, Parkville, Victoria, Australia 3052
| | - Andrew Lonsdale
- Bioinformatics, Murdoch Children's Research Institute, Royal Children's Hospital, Flemington Road, Parkville, Victoria, Australia 3052
- ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, Royal Parade, Parkville, Victoria, Australia 3052
| | - David Powell
- Monash Bioinformatics Platform, Biomedicine Discovery Institute, Faculty of Medicine, Nursing and Health Sciences, 15 Innovation Walk, Monash University, Clayton, Victoria, Australia 3800
| | - Torsten Seemann
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Department of Microbiology and Immunology, Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street Melbourne, Victoria, Australia 3000
| | - Bernard Pope
- Melbourne Bioinformatics, The University of Melbourne, 187 Grattan Street, Carlton, Victoria, Australia 3053
- Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Victorian Comprehensive Cancer Centre, 305 Grattan Street, Melbourne, Victoria, Australia 3000
- Department of Medicine, Central Clinical School, Monash University, Clayton, Victoria, Australia 3800
| |
Collapse
|
17
|
Abstract
Biobtree is a bioinformatics tool to search and map bioinformatics datasets via identifiers or special keywords such as species name. It processes large bioinformatics datasets using a specialized MapReduce-based solution with optimum computational and storage resource usage. It provides uniform and B+ tree-based database output, a web interface, web services and allows performing chain mapping queries between datasets. It can be used via a single executable file or alternatively it can be used via the R or Python-based wrapper packages which are additionally provided for easier integration into existing pipelines. Biobtree is open source and available at
GitHub.
Collapse
Affiliation(s)
- Tamer Gur
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD Cambridge, UK
| |
Collapse
|
18
|
A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9010150] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
The plethora of open source clinical software offers great reuse opportunities for developers to build clinical tools at lower cost and at a faster pace. However, the lack of research on open source clinical software poses a challenge for software reuse in clinical software development. This paper aims to help clinical developers better understand open source clinical software by conducting a thorough investigation of open source clinical software hosted on GitHub. We first developed a data pipeline that automatically collected and preprocessed GitHub data. Then, a deep analysis with several methods, such as statistical analysis, hypothesis testing, and topic modeling, was conducted to reveal the overall status and various characteristics of open source clinical software. There were 14,971 clinical-related GitHub repositories created during the last 10 years, with an average annual growth rate of 55%. Among them, 12,919 are open source clinical software. Our analysis unveiled a number of interesting findings: Popular open source clinical software in terms of the number of stars, most productive countries that contribute to the community, important factors that make an open source clinical software popular, and 10 main groups of open source clinical software. The results can assist both researchers and practitioners, especially newcomers, in understanding open source clinical software.
Collapse
|
19
|
Dozmorov MG. GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software. Front Bioeng Biotechnol 2018; 6:198. [PMID: 30619845 PMCID: PMC6306043 DOI: 10.3389/fbioe.2018.00198] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2018] [Accepted: 12/04/2018] [Indexed: 11/13/2022] Open
Abstract
Modern research is increasingly data-driven and reliant on bioinformatics software. Publication is a common way of introducing new software, but not all bioinformatics tools get published. Giving there are competing tools, it is important not merely to find the appropriate software, but have a metric for judging its usefulness. Journal's impact factor has been shown to be a poor predictor of software popularity; consequently, focusing on publications in high-impact journals limits user's choices in finding useful bioinformatics tools. Free and open source software repositories on popular code sharing platforms such as GitHub provide another venue to follow the latest bioinformatics trends. The open source component of GitHub allows users to bookmark and copy repositories that are most useful to them. This Perspective aims to demonstrate the utility of GitHub "stars," "watchers," and "forks" (GitHub statistics) as a measure of software impact. We compiled lists of impactful bioinformatics software and analyzed commonly used impact metrics and GitHub statistics of 50 genomics-oriented bioinformatics tools. We present examples of community-selected best bioinformatics resources and show that GitHub statistics are distinct from the journal's impact factor (JIF), citation counts, and alternative metrics (Altmetrics, CiteScore) in capturing the level of community attention. We suggest the use of GitHub statistics as an unbiased measure of the usability of bioinformatics software complementing the traditional impact metrics.
Collapse
Affiliation(s)
- Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, United States
| |
Collapse
|