51.
Abstract
It is now a decade since the International Commission on the Taxonomy of Fungi (ICTF) produced an overview of requirements and best practices for describing a new fungal species. In the meantime, the International Code of Nomenclature for algae, fungi, and plants (ICNafp) has replaced its former name (the International Code of Botanical Nomenclature) and introduced new formal requirements for the valid publication of species scientific names, including the separation of provisions specific to Fungi and organisms treated as fungi into a new Chapter F. Equally transformative have been changes in the data collection, data dissemination, and analytical tools available to mycologists. This paper provides an updated and expanded discussion of current publication requirements, along with best practices for the description of new fungal species, the publication of new names, and improving the accessibility of their associated metadata, that have developed over the last 10 years. Additionally, we provide: (1) model papers for different fungal groups and circumstances; (2) a checklist to simplify meeting (i) the requirements of the ICNafp to ensure the effective, valid and legitimate publication of names of new taxa, and (ii) minimally accepted standards for description; and (3) templates for preparing standardized species descriptions.
52.
Abstract
Workflows are the keystone of bioimage analysis, and the NEUBIAS (Network of European BioImage AnalystS) community is trying to gather the actors of this field and organize the information around them. One of its most recent outputs is the opening of the F1000Research NEUBIAS gateway, whose main objective is to offer a channel of publication for bioimage analysis workflows and associated resources. In this paper, we express some personal opinions and recommendations related to finding, handling and developing bioimage analysis workflows. The emergence of "big data" in bioimaging and resource-intensive analysis algorithms make local data storage and computing solutions a limiting factor. At the same time, the need for data sharing with collaborators and a general shift towards remote work have created new challenges and avenues for the execution and sharing of bioimage analysis workflows. These challenges include running workflows reproducibly in remote environments, in particular when their components come from different software packages, but also documenting them and linking their parameters and results by following the FAIR principles (Findable, Accessible, Interoperable, Reusable) to foster open and reproducible science. In this opinion paper, we give the reader directions for tackling these challenges and navigating this complex ecosystem, in order to find and use workflows, and to compare workflows addressing the same problem. We also discuss tools to run workflows in the cloud and on High Performance Computing resources, and suggest ways to make these workflows FAIR.
53.
Next-generation field courses: Integrating Open Science and online learning. Ecol Evol 2021; 11:3577-3587. PMID: 33898010; PMCID: PMC8057340; DOI: 10.1002/ece3.7009.
Abstract
As Open Science practices become more commonplace, there is a need for the next generation of scientists to be well versed in these aspects of scientific research. Yet, many training opportunities for early career researchers (ECRs) could better emphasize or integrate Open Science elements. Field courses provide opportunities for ECRs to apply theoretical knowledge, practice new methodological approaches, and gain an appreciation for the challenges of real-life research, and could provide an excellent platform for integrating training in Open Science practices. Our recent experience, as primarily ECRs engaged in a field course interrupted by COVID-19, led us to reflect on the potential to enhance learning outcomes in field courses by integrating Open Science practices and online learning components. Specifically, we highlight the opportunity for field courses to align teaching activities with the recent developments and trends in how we conduct research, including training in: publishing registered reports, collecting data using standardized methods, adopting high-quality data documentation, managing data through reproducible workflows, and sharing and publishing data through appropriate channels. We also discuss how field courses can use online tools to optimize time in the field, develop open access resources, and cultivate collaborations. By integrating these elements, we suggest that the next generation of field courses will offer excellent arenas for participants to adopt Open Science practices.
54.
A Second Look at FAIR in Proteomic Investigations. J Proteome Res 2021; 20:2182-2186. PMID: 33719446; DOI: 10.1021/acs.jproteome.1c00177.
Abstract
Proteomics is, by definition, comprehensive and large-scale, seeking to unravel ome-level protein features with phenotypic information on an entire system, an organ, cells, or organisms. This scope consistently involves and extends beyond single experiments. Multitudinous resources now exist to assist in making the results of proteomics experiments more findable, accessible, interoperable, and reusable (FAIR), yet many of these tools await adoption by our community. Here we highlight strategies for expanding the impact of proteomics data beyond single studies. We show how linking specific terminologies, identifiers, and text (words) can unify individual data points across a wide spectrum of studies and, more importantly, how this approach may reveal novel relationships. In this effort, we explain how data sets and methods can be rendered more linkable and how this maximizes their value. We also discuss how data linking strategies benefit stakeholders across the proteomics community and beyond.
55.
How FAIR are plant sciences in the twenty-first century? The pressing need for reproducibility in plant ecology and evolution. Proc Biol Sci 2021; 288:20202597. PMID: 33563121; DOI: 10.1098/rspb.2020.2597.
Abstract
The need for open, reproducible science is of growing concern in the twenty-first century, with multiple initiatives like the widely supported FAIR principles advocating for data to be Findable, Accessible, Interoperable and Reusable. Plant ecological and evolutionary studies are not exempt from the need to ensure that the data upon which their findings are based are accessible and allow for replication in accordance with the FAIR principles. However, authors commonly neglect the collection and curation of herbarium specimens, a foundational aspect of studies involving plants. Without publicly available specimens, huge numbers of studies that rely on the field identification of plants are fundamentally not reproducible. We argue that the collection and public availability of herbarium specimens is not only good botanical practice but is also fundamental in ensuring that plant ecological and evolutionary studies are replicable, and thus scientifically sound. Data repositories that adhere to the FAIR principles must make sure that the original data are traceable to and re-examinable at their empirical source. To secure replicability and adherence to the FAIR principles, substantial changes are needed to restore the practice of collecting and curating specimens, to educate students about their importance, and to properly fund the herbaria which house them.
56.
Impact of structural biologists and the Protein Data Bank on small-molecule drug discovery and development. J Biol Chem 2021; 296:100559. PMID: 33744282; PMCID: PMC8059052; DOI: 10.1016/j.jbc.2021.100559.
Abstract
The Protein Data Bank (PDB) is an international core data resource central to fundamental biology, biomedicine, bioenergy, and biotechnology/bioengineering. Now celebrating its 50th anniversary, the PDB houses >175,000 experimentally determined atomic structures of proteins, nucleic acids, and their complexes with one another and small molecules and drugs. The importance of three-dimensional (3D) biostructure information for research and education stems from the intimate link between molecular form and function evident throughout biology. Among the most prolific consumers of PDB data are biomedical researchers, who rely on the open access resource as the authoritative source of well-validated, expertly curated biostructures. This review recounts how the PDB grew from just seven protein structures to contain more than 49,000 structures of human proteins that have proven critical for understanding their roles in human health and disease. It then describes how these structures are used in academe and industry to validate drug targets, assess target druggability, characterize how tool compounds and other small molecules bind to drug targets, guide medicinal chemistry optimization of binding affinity and selectivity, and overcome challenges during preclinical drug development. Three case studies drawn from oncology exemplify how structural biologists and open access to PDB structures impacted recent regulatory approvals of antineoplastic drugs.
57.
RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. J Mol Biol 2020; 433:166704. PMID: 33186584; PMCID: PMC9093041; DOI: 10.1016/j.jmb.2020.11.003.
Abstract
The US Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) serves many millions of unique users worldwide by delivering experimentally determined 3D structures of biomolecules integrated with >40 external data resources via RCSB.org, application programming interfaces (APIs), and FTP downloads. Herein, we present the architectural redesign of RCSB PDB data delivery services that builds on existing PDBx/mmCIF data schemas. New data access APIs (data.rcsb.org) enable efficient delivery of all PDB archive data. A novel GraphQL-based API provides flexible, declarative data retrieval along with a simple-to-use REST API. A powerful new search system (search.rcsb.org) seamlessly integrates heterogeneous types of searches across the PDB archive. Searches may combine text attributes, protein or nucleic acid sequences, small-molecule chemical descriptors, 3D macromolecular shapes, and sequence motifs. The new RCSB.org architecture adheres to the FAIR Principles, empowering users to address a wide array of research problems in fundamental biology, biomedicine, biotechnology, bioengineering, and bioenergy.
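The combined search described above (text attributes plus other query types joined by boolean logic) can be sketched as a JSON payload of the kind the search service accepts. This is an illustrative sketch based on our reading of the public RCSB search API, not on this paper; the attribute name and schema details are assumptions and may differ from the current service:

```python
import json

def build_query(text: str, max_resolution: float) -> dict:
    """Combine a full-text search with a structured attribute filter,
    joined with boolean AND, as a search-service query payload."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                {   # free-text match across entry annotations
                    "type": "terminal",
                    "service": "full_text",
                    "parameters": {"value": text},
                },
                {   # structured attribute comparison (assumed attribute name)
                    "type": "terminal",
                    "service": "text",
                    "parameters": {
                        "attribute": "rcsb_entry_info.resolution_combined",
                        "operator": "less",
                        "value": max_resolution,
                    },
                },
            ],
        },
        "return_type": "entry",
    }

# Entries mentioning "thymidine kinase" solved at better than 2.0 Å.
print(json.dumps(build_query("thymidine kinase", 2.0), indent=2))
```

Such a payload would be POSTed to the search endpoint; the declarative node structure is what lets heterogeneous search types compose seamlessly.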
58.
Building a global genomics observatory: Using GEOME (the Genomic Observatories Metadatabase) to expedite and improve deposition and retrieval of genetic data and metadata for biodiversity research. Mol Ecol Resour 2020; 20:1458-1469. PMID: 33031625; DOI: 10.1111/1755-0998.13269.
Abstract
Genetic data represent a relatively new frontier for our understanding of global biodiversity. Ideally, such data should include both organismal DNA-based genotypes and the ecological context where the organisms were sampled. Yet most tools and standards for data deposition focus exclusively either on genetic or ecological attributes. The Genomic Observatories Metadatabase (GEOME: geome-db.org) provides an intuitive solution for maintaining links between genetic data sets stored by the International Nucleotide Sequence Database Collaboration (INSDC) and their associated ecological metadata. GEOME facilitates the deposition of raw genetic data to the INSDC's Sequence Read Archive (SRA) while maintaining persistent links to standards-compliant ecological metadata held in the GEOME database. This approach facilitates findable, accessible, interoperable and reusable data archival practices. Moreover, GEOME enables data management solutions for large collaborative groups and expedites batch retrieval of genetic data from the SRA. The article that follows describes how GEOME can enable genuinely open data workflows for researchers in the field of molecular ecology.
59.
The on-premise data sharing infrastructure e!DAL: Foster FAIR data for faster data acquisition. Gigascience 2020; 9:giaa107. PMID: 33090199; PMCID: PMC7580168; DOI: 10.1093/gigascience/giaa107.
Abstract
BACKGROUND The FAIR data principle as a commitment to support long-term research data management is widely accepted in the scientific community. Although the ELIXIR Core Data Resources and other established infrastructures provide comprehensive and long-term stable services and platforms for FAIR data management, a large quantity of research data is still hidden or at risk of getting lost. Currently, high-throughput plant genomics and phenomics technologies are producing research data in abundance, the storage of which is not covered by established core databases. This concerns the data volume, e.g., time series of images or high-resolution hyper-spectral data; the quality of data formatting and annotation, e.g., with regard to structure and annotation specifications of core databases; uncovered data domains; or organizational constraints prohibiting primary data storage outside institutional boundaries. RESULTS To share these potentially dark data in a FAIR way and master these challenges, the ELIXIR Germany/de.NBI service Plant Genomic and Phenomics Research Data Repository (PGP) implements a "bring the infrastructure to the data" approach, which allows research data to be kept in place and wrapped in a FAIR-aware software infrastructure. This article presents new features of the e!DAL infrastructure software and the PGP repository as a best practice on how to easily set up FAIR-compliant and intuitive research data services. Furthermore, the integration of the ELIXIR Authentication and Authorization Infrastructure (AAI) and data discovery services are introduced as means to lower technical barriers and to increase the visibility of research data. CONCLUSION The e!DAL software has matured into a powerful and FAIR-compliant infrastructure, while keeping the focus on flexible setup and integration into existing infrastructures and into the daily research process.
60.
Best Practices for Making Reproducible Biochemical Models. Cell Syst 2020; 11:109-120. PMID: 32853539; DOI: 10.1016/j.cels.2020.06.012.
Abstract
Like many scientific disciplines, dynamical biochemical modeling is hindered by irreproducible results. This limits the utility of biochemical models by making them difficult to understand, trust, or reuse. We comprehensively list the best practices that biochemical modelers should follow to build reproducible biochemical model artifacts (all data, model descriptions, and custom software used by the model) that can be understood and reused. The best practices provide advice for all steps of a typical biochemical modeling workflow in which a modeler collects data; constructs, trains, simulates, and validates the model; uses the predictions of a model to advance knowledge; and publicly shares the model artifacts. The best practices emphasize the benefits obtained by using standard tools and formats, and provide guidance to modelers who do not or cannot use standards in some stages of their modeling workflow. Adoption of these best practices will enhance the ability of researchers to reproduce, understand, and reuse biochemical models.
61.
Abstract
Wouldn't it be great if experimental data were findable wherever they were? If experimental data were accessible, regardless of the storage place and format? If experimental data were interoperable, independent of the author or their origin? If experimental data were reusable for further analysis without experimental repetition? The current state of the art of data acquisition in the laboratory is very diverse. Many different devices are used, analogue as well as digital ones. Usually, all experimental setups and observations are summarized in a handwritten lab notebook, independently of whether the sources are digital or analogue. To change this common way of laboratory data acquisition into a digital, modern one, electronic lab notebooks can be used. A challenge of science is to facilitate knowledge discovery by assisting humans and machines in their discovery of scientific data and their associated algorithms and workflows. FAIR describes a set of guiding principles to make data Findable, Accessible, Interoperable, and Reusable.
62.
Data Dissemination: Shortening the Long Tail of Traumatic Brain Injury Dark Data. J Neurotrauma 2019; 37:2414-2423. PMID: 30794049; DOI: 10.1089/neu.2018.6192.
Abstract
Translation of traumatic brain injury (TBI) research findings from bench to bedside involves aligning multi-species data across diverse data types including imaging and molecular biomarkers, histopathology, behavior, and functional outcomes. In this review we argue that TBI translation should be acknowledged for what it is: a problem of big data that can be addressed using modern data science approaches. We review the history of the term big data, tracing its origins in Internet technology as data that are "big" according to the "4Vs" of volume, velocity, variety, and veracity, and discuss how the term has transitioned into the mainstream of biomedical research. We argue that the problem of TBI translation fundamentally centers around data variety and that solutions to this problem can be found in modern machine learning and other cutting-edge analytical approaches. Throughout our discussion we highlight the need to pull data from diverse sources including unpublished data ("dark data") and "long-tail data" (small, specialty TBI datasets undergirding the published literature). We review a few early examples of published articles in both the pre-clinical and clinical TBI research literature to demonstrate how data reuse can drive new discoveries leading to translational therapies. Making TBI data resources more Findable, Accessible, Interoperable, and Reusable (FAIR) through better data stewardship has great potential to accelerate discovery and translation for the silent epidemic of TBI.
63.
Abstract
Publicly available gene expression datasets deposited in the Gene Expression Omnibus (GEO) are growing at an accelerating rate. Such datasets hold great value for knowledge discovery, particularly when integrated. Although numerous software platforms and tools have been developed to enable reanalysis and integration of individual GEO datasets or groups of datasets, large-scale reuse of those datasets is impeded by minimal requirements for standardized metadata at both the study and sample levels, as well as by the lack of uniform processing of the data across studies. Here, we review methodologies developed to facilitate the systematic curation and processing of publicly available gene expression datasets from GEO. We identify trends for advanced metadata curation and summarize approaches for reprocessing the data within the entire GEO repository.
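The curation step this abstract alludes to starts from GEO's plain-text SOFT format, in which metadata lines begin with "!". A minimal sketch of the normalization such pipelines perform (the snippet below is invented, but the "!Key = value" line shape follows GEO's SOFT convention):

```python
from collections import defaultdict

def parse_soft(text: str) -> dict:
    """Collect '!Key = value' metadata lines into a key -> [values] mapping.
    Repeated keys (common for sample characteristics) accumulate in order."""
    meta = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("!") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            meta[key].append(value.strip())
    return dict(meta)

# Invented SOFT-style snippet for illustration.
snippet = """\
!Series_title = Example expression profiling study
!Series_platform_id = GPL570
!Sample_characteristics_ch1 = tissue: liver
!Sample_characteristics_ch1 = disease state: control
"""
meta = parse_soft(snippet)
print(meta["Sample_characteristics_ch1"])
# → ['tissue: liver', 'disease state: control']
```

Systematic curation then maps the free-text values (e.g., "tissue: liver") onto controlled vocabularies, which is where most of the manual and machine-assisted effort reviewed above is spent.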
64.
Abstract
Data sharing, i.e. depositing data in research community-accessible repositories, is not becoming as rapidly widespread across the life science research community as hoped or expected. I consider the sociological and cultural context of research and lay out why the community should instead move to data publishing, with a focus on neuroscience data, and outline practical steps that can be taken to realize this goal.
65.
Abstract
Data sharing enables research communities to exchange findings and build upon the knowledge that arises from their discoveries. Areas of public and animal health as well as food safety would benefit from rapid data sharing when it comes to emergencies. However, ethical, regulatory and institutional challenges, as well as a lack of suitable platforms that provide an infrastructure for data sharing in structured formats, often lead to data not being shared, or at most shared in the form of supplementary materials in journal publications. Here, we describe an informatics platform that includes workflows for structured data storage, managing and pre-publication sharing of pathogen sequencing data and interpretations of its analysis with relevant stakeholders.
66.
How Structural Biologists and the Protein Data Bank Contributed to Recent FDA New Drug Approvals. Structure 2018; 27:211-217. PMID: 30595456; DOI: 10.1016/j.str.2018.11.007.
Abstract
Discovery and development of the 210 new molecular entities (NMEs; new drugs) approved by the US Food and Drug Administration from 2010 to 2016 were facilitated by 3D structural information generated by structural biologists worldwide and distributed on an open-access basis by the PDB. The molecular targets for 94% of these NMEs are known. The PDB archive contains 5,914 structures containing one of the known targets and/or a new drug, providing structural coverage for 88% of the recently approved NMEs across all therapeutic areas. More than half of the 5,914 structures were published and made available by the PDB at no charge, with no restrictions on usage, more than 10 years before drug approval. Citation analyses revealed that these 5,914 PDB structures significantly affected the very large body of publicly funded research reported in publications on the NME targets that motivated biopharmaceutical company investment in the discovery and development programs that produced the NMEs.
67.
FAANG, establishing metadata standards, validation and best practices for the farmed and companion animal community. Anim Genet 2018; 49:520-526. PMID: 30311252; PMCID: PMC6334167; DOI: 10.1111/age.12736.
Abstract
The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high quality functional annotation of animal genomes with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high quality and rich supporting metadata to describe the project's animals, specimens, cell cultures and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data descriptions, deposition and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of the Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high quality, open access and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main—or more permissive legacy—standards are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, combined with promotion of best practices in metadata implementation, FAANG aims to maximise effectiveness and inter‐comparability of assay data. This supports the community to create a rich genome‐to‐phenotype resource and promotes continuing improvements in animal data standards as a whole.
68.
OpenPVSignal: Advancing Information Search, Sharing and Reuse on Pharmacovigilance Signals via FAIR Principles and Semantic Web Technologies. Front Pharmacol 2018; 9:609. PMID: 29997499; PMCID: PMC6028717; DOI: 10.3389/fphar.2018.00609.
Abstract
Signal detection and management is a key activity in pharmacovigilance (PV). When a new PV signal is identified, the respective information is publicly communicated in the form of periodic newsletters or reports by organizations that monitor and investigate PV-related information (such as the World Health Organization and national PV centers). However, this type of communication does not allow for systematic access, discovery and explicit data interlinking and, therefore, does not facilitate automated data sharing and reuse. In this paper, we present OpenPVSignal, a novel ontology aiming to support the semantic enrichment and rigorous communication of PV signal information in a systematic way, focusing on two key aspects: (a) publishing signal information according to the FAIR (Findable, Accessible, Interoperable, and Re-usable) data principles, and (b) exploiting automatic reasoning capabilities upon the interlinked PV signal report data. OpenPVSignal is developed as a reusable, extendable and machine-understandable model based on Semantic Web standards/recommendations. In particular, it can be used to model PV signal report data focusing on: (a) heterogeneous data interlinking, (b) semantic and syntactic interoperability, (c) provenance tracking and (d) knowledge expressiveness. OpenPVSignal is built upon widely accepted semantic models, namely, the provenance ontology (PROV-O), the Micropublications semantic model, the Web Annotation Data Model (WADM), the Ontology of Adverse Events (OAE) and the Time ontology. To this end, we describe the design of OpenPVSignal and demonstrate its applicability as well as the reasoning capabilities enabled by its use. We also provide an evaluation of the model against the FAIR data principles. The applicability of OpenPVSignal is demonstrated by using PV signal information published in: (a) the World Health Organization's Pharmaceuticals Newsletter, (b) the Netherlands Pharmacovigilance Centre Lareb Web site and (c) the U.S. Food and Drug Administration (FDA) Drug Safety Communications, also available on the FDA Web site.
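To make the provenance-tracking idea concrete: a PV signal published as interlinked RDF can record which newsletter or report it was derived from using PROV-O. The sketch below is hypothetical (the `ex:` namespace and its property names are invented for illustration and are not OpenPVSignal's actual terms; only the `prov:` namespace is the standard W3C one):

```python
# Serialize one PV signal as PROV-O-annotated Turtle, in the spirit of
# OpenPVSignal's use of Semantic Web standards for data interlinking.
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex: <http://example.org/pvsignal/> .\n"
)

def signal_to_turtle(signal_id: str, drug: str, reaction: str, source: str) -> str:
    """Render one PV signal as Turtle, recording its provenance source."""
    return PREFIXES + (
        f"ex:{signal_id} a ex:Signal ;\n"
        f"    ex:suspectedDrug \"{drug}\" ;\n"
        f"    ex:adverseReaction \"{reaction}\" ;\n"
        f"    prov:wasDerivedFrom ex:{source} .\n"
    )

print(signal_to_turtle("sig001", "drugX", "hepatotoxicity", "WHO_Newsletter_2017_4"))
```

Once signals are expressed this way, a reasoner or SPARQL query can traverse `prov:wasDerivedFrom` links across reports, which is exactly the kind of automated discovery the plain newsletter format precludes.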
69.
RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Sci 2018; 27:316-330. PMID: 29067736; PMCID: PMC5734314; DOI: 10.1002/pro.3331.
Abstract
The Protein Data Bank (PDB) is one of two archival resources for experimental data central to biomedical research and education worldwide (the other key Primary Data Archive in biology being the International Nucleotide Sequence Database Collaboration). The PDB currently houses >134,000 atomic level biomolecular structures determined by crystallography, NMR spectroscopy, and 3D electron microscopy. It was established in 1971 as the first open-access, digital-data resource in biology, and is managed by the Worldwide Protein Data Bank partnership (wwPDB; wwpdb.org). US PDB operations are conducted by the RCSB Protein Data Bank (RCSB PDB; RCSB.org; Rutgers University and UC San Diego) and funded by NSF, NIH, and DoE. The RCSB PDB serves as the global Archive Keeper for the wwPDB. During calendar 2016, >591 million structure data files were downloaded from the PDB by Data Consumers working in every sovereign nation recognized by the United Nations. During this same period, the RCSB PDB processed >5300 new atomic level biomolecular structures plus experimental data and metadata coming into the archive from Data Depositors working in the Americas and Oceania. In addition, RCSB PDB served >1 million RCSB.org users worldwide with PDB data integrated with ∼40 external data resources providing rich structural views of fundamental biology, biomedicine, and energy sciences, and >600,000 PDB101.rcsb.org educational website users around the globe. RCSB PDB resources are described in detail together with metrics documenting the impact of access to PDB data on basic and applied research, clinical medicine, education, and the economy.
70.
Challenges for visualizing three-dimensional data in genomic browsers. FEBS Lett 2017; 591:2505-2519. PMID: 28771695; PMCID: PMC5638070; DOI: 10.1002/1873-3468.12778.
Abstract
Genomic interactions reveal the spatial organization of genomes and genomic domains, which is known to play key roles in cell function. Physical proximity can be represented as two-dimensional heat maps or matrices. From these, three-dimensional (3D) conformations of chromatin can be computed, revealing coherent structures that highlight the importance of nonsequential relationships across genomic features. Mainstream genomic browsers have been classically developed to display compact, stacked tracks based on a linear, sequential, per-chromosome coordinate system. Genome-wide comparative analysis demands new approaches to data access and new layouts for analysis. Legibility can be compromised when displaying track-aligned second-dimension matrices, which require greater screen space. Moreover, 3D representations of genomes defy vertical alignment in track-based genome browsers. Furthermore, investigation at previously unattainable levels of detail is revealing multiscale, multistate, time-dependent complexity. This article outlines how these challenges are currently handled in mainstream browsers as well as how novel techniques in visualization are being explored to address them. A set of requirements for coherent visualization of novel spatial genomic data is defined and the resulting potential for whole genome visualization is described.
71.
Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the Commons Framework Pilots workshop. J Biomed Inform 2017; 71:49-57. PMID: 28501646; DOI: 10.1016/j.jbi.2017.05.006.
Abstract
The volume and diversity of data in biomedical research have been rapidly increasing in recent years. While such data hold significant promise for accelerating discovery, their use entails many challenges, including the need for adequate computational infrastructure, secure processes for data sharing and access, tools that allow researchers to find and integrate diverse datasets, and standardized methods of analysis. These are just some elements of a complex ecosystem that needs to be built to support the rapid accumulation of these data. The NIH Big Data to Knowledge (BD2K) initiative aims to facilitate digitally enabled biomedical research. Within the BD2K framework, the Commons initiative is intended to establish a virtual environment that will facilitate the use, interoperability, and discoverability of shared digital objects used for research. The BD2K Commons Framework Pilots Working Group (CFPWG) was established to clarify goals and work on pilot projects that address existing gaps toward realizing the vision of the BD2K Commons. This report reviews highlights from a two-day meeting involving the BD2K CFPWG to provide insights on trends and considerations in advancing Big Data science for biomedical research in the United States.