1. Posch L, Panahiazar M, Dumontier M, Gevaert O. Predicting structured metadata from unstructured metadata. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018. [PMID: 28637268 PMCID: PMC4892825 DOI: 10.1093/database/baw080] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022]
Abstract
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation model (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be predicted most accurately using the TF-IDF approach, followed by LDA, with both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Database URL: http://www.yeastgenome.org/
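As a rough illustration of the TF-IDF side of this approach, the sketch below weights the words of a few free-text sample descriptions and predicts a structured term for a new description by nearest-neighbour cosine similarity. The example texts, labels and function names are hypothetical, and the nearest-neighbour rule is only a stand-in, since the abstract does not say which classifier was trained. For brevity the query is included when computing document frequencies, which a real pipeline would avoid.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(train_texts, train_labels, query):
    """Return the label of the training text most similar to the query."""
    vectors = tfidf(train_texts + [query])  # simplification: query shares the IDF
    query_vec = vectors[-1]
    best = max(range(len(train_texts)),
               key=lambda i: cosine(vectors[i], query_vec))
    return train_labels[best]


# Hypothetical unstructured descriptions and their structured "molecule" term.
texts = [
    "total rna extracted from liver tissue",
    "genomic dna isolated from blood",
    "rna from cultured cells total rna",
]
labels = ["RNA", "DNA", "RNA"]
print(predict(texts, labels, "genomic dna from tumour tissue"))
```

Words that occur in every description (here "from") receive a weight of zero, so the prediction is driven by the rarer, more informative terms.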
Affiliation(s)
- Lisa Posch
  - GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
  - Institute for Web Science and Technologies, University of Koblenz-Landau, Koblenz, Germany
- Maryam Panahiazar
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA, USA
- Michel Dumontier
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA, USA
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA, USA

2. Panahiazar M, Dumontier M, Gevaert O. Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO). J Biomed Inform 2017; 72:132-139. [PMID: 28625880 PMCID: PMC5643580 DOI: 10.1016/j.jbi.2017.06.017] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 01/26/2017] [Revised: 06/01/2017] [Accepted: 06/14/2017] [Indexed: 11/22/2022]
Abstract
A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3 million GEO records. We examined the quality of well supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the number of unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
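The evaluation measures named in this abstract can be sketched in a few lines. The example below computes accuracy, precision, recall and F-measure for one class value and compares a predictor against a majority vote baseline. The label lists and function names are made up for illustration and are not taken from the study.

```python
from collections import Counter


def majority_baseline(train_labels, n_predictions):
    """Always predict the most frequent label seen in training."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F-measure for a single class value."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical "molecule" values for four records.
y_true = ["RNA", "RNA", "DNA", "RNA"]
y_pred = ["RNA", "DNA", "DNA", "RNA"]
baseline = majority_baseline(y_true, len(y_true))

print(accuracy(y_true, y_pred), accuracy(y_true, baseline))
print(precision_recall_f1(y_true, y_pred, positive="RNA"))
```

With only a handful of possible values, a majority vote baseline can already look strong on accuracy, which is why per-class precision and recall are reported alongside it.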
Affiliation(s)
- Maryam Panahiazar
  - Stanford Center for Biomedical Informatics Research, Center for Data Annotation and Retrieval, Department of Medicine, Stanford University, Stanford, 94305, United States
- Michel Dumontier
  - Stanford Center for Biomedical Informatics Research, Center for Data Annotation and Retrieval, Department of Medicine, Stanford University, Stanford, 94305, United States
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research, Center for Data Annotation and Retrieval, Department of Medicine, Stanford University, Stanford, 94305, United States

3. Artaza H, Chue Hong N, Corpas M, Corpuz A, Hooft R, Jimenez RC, Leskošek B, Olivier BG, Stourac J, Svobodová Vařeková R, Van Parys T, Vaughan D. Top 10 metrics for life science software good practices. F1000Res 2016; 5. [PMID: 27635232 PMCID: PMC5007752 DOI: 10.12688/f1000research.9206.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Accepted: 07/15/2016] [Indexed: 11/20/2022] Open
Abstract
Metrics for assessing adoption of good development practices are a useful way to ensure that software is sustainable, reusable and functional. Sustainability means that the software used today will be available - and continue to be improved and supported - in the future. We report here an initial set of metrics that measure good practices in software development. This initiative differs from previously developed efforts in being a community-driven grassroots approach where experts from different organisations propose good software practices that have reasonable potential to be adopted by the communities they represent. We not only focus our efforts on understanding and prioritising good practices, we assess their feasibility for implementation and publish them here.
Affiliation(s)
- Haydee Artaza
  - The Earlham Institute & ELIXIR-UK, Norwich Research Park, Norwich, NR4 7UH, UK
- Neil Chue Hong
  - Software Sustainability Institute, University of Edinburgh, Edinburgh, EH9 3FD, UK
- Manuel Corpas
  - The Earlham Institute & ELIXIR-UK, Norwich Research Park, Norwich, NR4 7UH, UK
- Rob Hooft
  - DTL, PO Box 19245, Utrecht, 3501 DE, Netherlands
- Brane Leskošek
  - Institute for Biostatistics and Medical Informatics (IBMI), Faculty of Medicine, University of Ljubljana, Ljubljana, SI-1104, Slovenia
- Brett G Olivier
  - Systems Bioinformatics, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands
- Jan Stourac
  - Loschmidt Laboratories, Faculty of Science, Masaryk University, Brno, 625 00, Czech Republic
- Daniel Vaughan
  - EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK

4. Marchese Robinson RL, Lynch I, Peijnenburg W, Rumble J, Klaessig F, Marquardt C, Rauscher H, Puzyn T, Purian R, Åberg C, Karcher S, Vriens H, Hoet P, Hoover MD, Hendren CO, Harper SL. How should the completeness and quality of curated nanomaterial data be evaluated? NANOSCALE 2016; 8:9919-43. [PMID: 27143028 PMCID: PMC4899944 DOI: 10.1039/c5nr08944a] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Indexed: 05/19/2023]
Abstract
Nanotechnology is of increasing significance. Curation of nanomaterial data into electronic databases offers opportunities to better understand and predict nanomaterials' behaviour. This supports innovation in, and regulation of, nanotechnology. It is commonly understood that curated data need to be sufficiently complete and of sufficient quality to serve their intended purpose. However, assessing data completeness and quality is non-trivial in general and is arguably especially difficult in the nanoscience area, given its highly multidisciplinary nature. The current article, part of the Nanomaterial Data Curation Initiative series, addresses how to assess the completeness and quality of (curated) nanomaterial data. In order to address this key challenge, a variety of related issues are discussed: the meaning and importance of data completeness and quality, existing approaches to their assessment and the key challenges associated with evaluating the completeness and quality of curated nanomaterial data. Considerations which are specific to the nanoscience area and lessons which can be learned from other relevant scientific disciplines are considered. Hence, the scope of this discussion ranges from physicochemical characterisation requirements for nanomaterials and interference of nanomaterials with nanotoxicology assays to broader issues such as minimum information checklists, toxicology data quality schemes and computational approaches that facilitate evaluation of the completeness and quality of (curated) data. This discussion is informed by a literature review and a survey of key nanomaterial data curation stakeholders. Finally, drawing upon this discussion, recommendations are presented concerning the central question: how should the completeness and quality of curated nanomaterial data be evaluated?
Affiliation(s)
- Richard L. Marchese Robinson
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool, L3 3AF, United Kingdom
- Iseult Lynch
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Edgbaston, B15 2TT Birmingham, United Kingdom
- Willie Peijnenburg
  - National Institute of Public Health and the Environment (RIVM), Bilthoven, The Netherlands
  - Institute of Environmental Sciences, Leiden University, Leiden, The Netherlands
- John Rumble
  - R&R Data Services, 11 Montgomery Avenue, Gaithersburg MD 20877 USA
- Fred Klaessig
  - Pennsylvania Bio Nano Systems LLC, 3805 Old Easton Road, Doylestown, PA 18902
- Clarissa Marquardt
  - Institute of Applied Computer Sciences (IAI), Karlsruhe Institute of Technology (KIT), Hermann v. Helmholtz Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
- Hubert Rauscher
  - European Commission, Joint Research Centre, Institute for Health and Consumer Protection, Via Fermi 2749, 21027 Ispra (VA), Italy
- Tomasz Puzyn
  - Laboratory of Environmental Chemistry, University of Gdansk, Wita Stwosza 63, 80-308 Gdansk, Poland
- Ronit Purian
  - Faculty of Engineering, Tel Aviv University, Tel Aviv 69978 Israel
- Christoffer Åberg
  - Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands
- Sandra Karcher
  - Civil and Environmental Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890
- Hanne Vriens
  - Department of Public Health and Primary Care, K.U.Leuven, Faculty of Medicine, Unit Environment & Health – Toxicology, Herestraat 49 (O&N 706), Leuven, Belgium
- Peter Hoet
  - Department of Public Health and Primary Care, K.U.Leuven, Faculty of Medicine, Unit Environment & Health – Toxicology, Herestraat 49 (O&N 706), Leuven, Belgium
- Mark D. Hoover
  - National Institute for Occupational Safety and Health, 1095 Willowdale Road, Morgantown, WV 26505-2888
- Christine Ogilvie Hendren
  - Center for the Environmental Implications of NanoTechnology, Duke University, PO Box 90287 121 Hudson Hall, Durham NC 27708
- Stacey L. Harper
  - Department of Environmental and Molecular Toxicology, School of Chemical, Biological and Environmental Engineering, Oregon State University, 1007 ALS, Corvallis, OR 97331

5. Tenenbaum JD, Avillach P, Benham-Hutchins M, Breitenstein MK, Crowgey EL, Hoffman MA, Jiang X, Madhavan S, Mattison JE, Nagarajan R, Ray B, Shin D, Visweswaran S, Zhao Z, Freimuth RR. An informatics research agenda to support precision medicine: seven key areas. J Am Med Inform Assoc 2016; 23:791-5. [PMID: 27107452 PMCID: PMC4926738 DOI: 10.1093/jamia/ocv213] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Received: 08/31/2015] [Accepted: 12/24/2015] [Indexed: 01/22/2023] Open
Abstract
The recent announcement of the Precision Medicine Initiative by President Obama has brought precision medicine (PM) to the forefront for healthcare providers, researchers, regulators, innovators, and funders alike. As technologies continue to evolve and datasets grow in magnitude, a strong computational infrastructure will be essential to realize PM's vision of improved healthcare derived from personal data. In addition, informatics research and innovation affords a tremendous opportunity to drive the science underlying PM. The informatics community must lead the development of technologies and methodologies that will increase the discovery and application of biomedical knowledge through close collaboration between researchers, clinicians, and patients. This perspective highlights seven key areas that are in need of further informatics research and innovation to support the realization of PM.
Affiliation(s)
- Jessica D Tenenbaum
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School & Children's Hospital Informatics Program, Boston Children's Hospital, Boston, MA, USA
- Erin L Crowgey
  - Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE, USA
- Mark A Hoffman
  - Department of Biomedical & Health Informatics, University of Missouri - Kansas City, Children's Mercy Hospital, Kansas City, MO, USA
- Xia Jiang
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Subha Madhavan
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Innovation Center for Biomedical Informatics, Washington, DC, USA
- John E Mattison
  - Exponential Medicine, Singularity University; Internal Medicine, System Solutions at Kaiser Permanente, Pasadena, CA, USA
- Bisakha Ray
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY, USA
- Dmitriy Shin
  - Department of Pathology, MU Informatics Institute, University of Missouri, Columbia, MO, USA
- Shyam Visweswaran
  - Department of Biomedical Informatics and the Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Robert R Freimuth
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA

6. Arend D, Junker A, Scholz U, Schüler D, Wylie J, Lange M. PGP repository: a plant phenomics and genomics data publication infrastructure. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw033. [PMID: 27087305 PMCID: PMC4834206 DOI: 10.1093/database/baw033] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 11/05/2015] [Accepted: 02/26/2016] [Indexed: 11/22/2022]
Abstract
Plant genomics and phenomics represent the most promising tools for accelerating yield gains and overcoming emerging crop productivity bottlenecks. However, accessing this wealth of plant diversity requires the characterization of this material using state-of-the-art genomic, phenomic and molecular technologies and the release of subsequent research data via a long-term stable, open-access portal. Although several international consortia and public resource centres offer services for plant research data management, valuable digital assets remain unpublished and thus inaccessible to the scientific community. Recently, the Leibniz Institute of Plant Genetics and Crop Plant Research and the German Plant Phenotyping Network have jointly initiated the Plant Genomics and Phenomics Research Data Repository (PGP) as infrastructure to comprehensively publish plant research data. This covers in particular cross-domain datasets that are not being published in central repositories because of their volume or unsupported data scope, like image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry as well as software and documents. The repository is hosted at Leibniz Institute of Plant Genetics and Crop Plant Research using e!DAL as software infrastructure and a Hierarchical Storage Management System as data archival backend. A newly developed data submission tool was made available for the consortium that features a high level of automation to lower the barriers of data publication. After an internal review process, data are published as citable digital object identifiers and a core set of technical metadata is registered at DataCite. The e!DAL-embedded Web frontend generates a landing page for each dataset and supports interactive exploration. PGP is registered as a research data repository at BioSharing.org, re3data.org and OpenAIRE as a valid EU Horizon 2020 open data archive. The above features, the programmatic interface and the support of standard metadata formats enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable. Database URL: http://edal.ipk-gatersleben.de/repos/pgp/
Affiliation(s)
- Daniel Arend
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany
- Astrid Junker
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany
- Uwe Scholz
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany
- Danuta Schüler
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany
- Juliane Wylie
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany
- Matthias Lange
  - Leibniz Institute for Plant Genetics and Crop Plant Research (IPK), OT Gatersleben, Corrensstraße 3, Stadt Seeland, 06466, Gatersleben, Germany

7. Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. Data integration in biological research: an overview. JOURNAL OF BIOLOGICAL RESEARCH (THESSALONIKE, GREECE) 2015; 22:9. [PMID: 26336651 PMCID: PMC4557916 DOI: 10.1186/s40709-015-0032-5] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 04/20/2015] [Accepted: 08/10/2015] [Indexed: 11/16/2022]
Abstract
Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of the experimental findings. Often these activities are perceived as a role that bioinformaticians and computer scientists have to take with no or little input from the experimental biologist. On the contrary, biological researchers, being the producers and often the end users of such data, have a big role in enabling biological data integration. The quality and usefulness of data integration depend on the existence and adoption of standards, shared formats, and mechanisms that are suitable for biological researchers to submit and annotate the data, so it can be easily searchable, conveniently linked and consequently used for further biological analysis and discovery. Here, we provide background on what is data integration from a computational science point of view, how it has been applied to biological research, which key aspects contributed to its success and future directions.
Affiliation(s)
- Vasileios Lapatas
  - Department of Informatics, Ionian University, 7 Tsirigoti Square, Corfu, 49100 Greece
- Michalis Stefanidakis
  - Department of Informatics, Ionian University, 7 Tsirigoti Square, Corfu, 49100 Greece
- Allegra Via
  - Biocomputing Group, Sapienza University, Piazzale Aldo Moro 5, Rome, 00185 Italy

8. Carroll AJ, Zhang P, Whitehead L, Kaines S, Tcherkez G, Badger MR. PhenoMeter: A Metabolome Database Search Tool Using Statistical Similarity Matching of Metabolic Phenotypes for High-Confidence Detection of Functional Links. Front Bioeng Biotechnol 2015; 3:106. [PMID: 26284240 PMCID: PMC4518198 DOI: 10.3389/fbioe.2015.00106] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 05/27/2015] [Accepted: 07/10/2015] [Indexed: 12/14/2022] Open
Abstract
This article describes PhenoMeter (PM), a new type of metabolomics database search that accepts metabolite response patterns as queries and searches the MetaPhen database of reference patterns for responses that are statistically significantly similar or inverse for the purposes of detecting functional links. To identify a similarity measure that would detect functional links as reliably as possible, we compared the performance of four statistics in correctly top-matching metabolic phenotypes of Arabidopsis thaliana metabolism mutants affected in different steps of the photorespiration metabolic pathway to reference phenotypes of mutants affected in the same enzymes by independent mutations. The best performing statistic, the PM score, was a function of both Pearson correlation and Fisher's Exact Test of directional overlap. This statistic outperformed Pearson correlation, biweight midcorrelation and Fisher's Exact Test used alone. To demonstrate general applicability, we show that the PM reliably retrieved the most closely functionally linked response in the database when queried with responses to a wide variety of environmental and genetic perturbations. Attempts to match metabolic phenotypes between independent studies were met with varying success and possible reasons for this are discussed. Overall, our results suggest that integration of pattern-based search tools into metabolomics databases will aid functional annotation of newly recorded metabolic phenotypes analogously to the way sequence similarity search algorithms have aided the functional annotation of genes and proteins. PM is freely available at MetabolomeExpress (https://www.metabolome-express.org/phenometer.php).
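The abstract names the two ingredients of the PM score but not how they are combined, so the sketch below only computes those ingredients for a pair of response patterns: the Pearson correlation, and a one-sided Fisher's exact test on a 2x2 table of response directions. The layout of that table, the example log-ratio values and all function names are assumptions made for illustration; this is not the PM score itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def direction_table(query, reference):
    """Count metabolites by direction: (both up, up/down, down/up, both down)."""
    a = b = c = d = 0
    for q, r in zip(query, reference):
        if q == 0 or r == 0:
            continue  # assumption: unchanged metabolites are ignored
        if q > 0 and r > 0:
            a += 1
        elif q > 0:
            b += 1
        elif r > 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """P(count >= a) for the top-left cell under the hypergeometric null."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = math.comb(n, col1)
    upper = min(row1, col1)
    return sum(math.comb(row1, k) * math.comb(n - row1, col1 - k)
               for k in range(a, upper + 1)) / total


# Hypothetical log-ratio responses of four metabolites.
query = [1.0, 2.0, -1.0, -2.0]
reference = [0.5, 1.5, -0.5, -2.5]

r = pearson(query, reference)
p_value = fisher_one_sided(*direction_table(query, reference))
print(r, p_value)
```

Correlation captures how closely the magnitudes track each other, while the direction test only asks whether the same metabolites go up and down, which is presumably why the authors found a combination of the two more reliable than either alone.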
Affiliation(s)
- Adam J. Carroll
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Peng Zhang
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Lynne Whitehead
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Sarah Kaines
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Guillaume Tcherkez
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Murray R. Badger
  - College of Medicine, Biology and Environment, Research School of Biology, The Australian National University, Canberra, ACT, Australia

9. Tenenbaum JD, Sansone SA, Haendel M. A sea of standards for omics data: sink or swim? J Am Med Inform Assoc 2014; 21:200-3. [PMID: 24076747 PMCID: PMC3932466 DOI: 10.1136/amiajnl-2013-002066] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 05/31/2013] [Revised: 07/08/2013] [Accepted: 09/10/2013] [Indexed: 11/29/2022] Open
Abstract
In the era of Big Data, omic-scale technologies, and increasing calls for data sharing, it is generally agreed that the use of community-developed, open data standards is critical. Far less agreed upon is exactly which data standards should be used, the criteria by which one should choose a standard, or even what constitutes a data standard. It is impossible simply to choose a domain and have it naturally follow which data standards should be used in all cases. The 'right' standard to use often depends on the use case scenarios for a given project. Potential downstream applications for the data, however, may not always be apparent at the time the data are generated. Similarly, technology evolves, adding further complexity. Would-be standards adopters must strike a balance between planning for the future and minimizing the burden of compliance. Better tools and resources are required to help guide this balancing act.
Affiliation(s)
- Jessica D Tenenbaum
  - Duke Translational Medicine Institute, Duke University, Durham, North Carolina, USA
- Melissa Haendel
  - Library and Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA

10. Juty N, Le Novère N, Hermjakob H, Laibe C. Towards the collaborative curation of the registry underlying Identifiers.org. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat017. [PMID: 23584831 PMCID: PMC3625955 DOI: 10.1093/database/bat017] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 12/28/2022]
Abstract
The MIRIAM Registry (http://www.ebi.ac.uk/miriam/) records information about collections of data in the life sciences, as well as where they can be obtained. This information is used, in combination with the resolving infrastructure of Identifiers.org (http://identifiers.org/), to generate globally unique identifiers in the form of Uniform Resource Identifiers. These identifiers are now widely used to provide perennial cross-references and annotations. The growing demand for these identifiers results in a significant increase in curational efforts to maintain the underlying registry. This requires the design and implementation of an economically viable and sustainable solution able to cope with such expansion. We briefly describe the Registry, the current curation duties entailed, and our plans to extend and distribute this workload through collaborative and community efforts.
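To make the idea of a resolvable identifier concrete, the sketch below builds and parses URIs of the shape base/collection/accession. That shape, the `taxonomy` collection name and the helper names are assumptions used for illustration, since the abstract itself only states that the identifiers are URIs resolved through Identifiers.org.

```python
BASE = "http://identifiers.org"  # resolver base URL given in the abstract


def make_uri(collection, accession, base=BASE):
    """Join a collection namespace and a record accession into one URI."""
    return f"{base}/{collection}/{accession}"


def split_uri(uri, base=BASE):
    """Recover (collection, accession) from a URI built by make_uri."""
    prefix = base + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an identifier under {base}: {uri}")
    collection, _, accession = uri[len(prefix):].partition("/")
    if not collection or not accession:
        raise ValueError(f"missing collection or accession: {uri}")
    return collection, accession


# Hypothetical example: NCBI Taxonomy entry 9606.
uri = make_uri("taxonomy", "9606")
print(uri)
print(split_uri(uri))
```

Keeping the collection name in a curated registry, instead of hard-coding provider URLs, is what lets such identifiers stay stable when a data provider moves its pages.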
Affiliation(s)
- Nick Juty
  - Proteomics Services, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

11. Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E. Unraveling the Complexities of Life Sciences Data. BIG DATA 2013; 1:42-50. [PMID: 27447037 DOI: 10.1089/big.2012.1505] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab (kolkerlab.org) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Winston Haynes
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Larissa Stanberry
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Elizabeth Stewart
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Gregory Yandl
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Chris Howard
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 5 Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- William Broomall
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Natali Kolker
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Eugene Kolker
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 6 Departments of Biomedical Informatics & Medical Education and Pediatrics, University of Washington, Seattle, Washington
|
12
|
Brandizi M, Kurbatova N, Sarkans U, Rocca-Serra P. graph2tab, a library to convert experimental workflow graphs into tabular formats. Bioinformatics 2012; 28:1665-7. [PMID: 22556367 PMCID: PMC3371871 DOI: 10.1093/bioinformatics/bts258] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022] Open
Abstract
Motivations: Spreadsheet-like tabular formats are ever more popular in the biomedical field as a means of experimental reporting. The problem of converting the graph of an experimental workflow into a table-based representation occurs in many such formats and is not easy to solve. Results: We describe graph2tab, a library that implements methods to realise such a conversion in a size-optimised way. Our solution is generic and can be adapted to specific cases of data exporters or data converters that need to be implemented. Availability and Implementation: The library source code and documentation are available at http://github.com/ISA-tools/graph2tab. Contact: brandizi@ebi.ac.uk. Supplementary Information: A supplementary document describes the theoretical and technical details about the library implementation.
Affiliation(s)
- Marco Brandizi
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK.
|
13
|
Field D, Amaral-Zettler L, Cochrane G, Cole JR, Dawyndt P, Garrity GM, Gilbert J, Glöckner FO, Hirschman L, Karsch-Mizrachi I, Klenk HP, Knight R, Kottmann R, Kyrpides N, Meyer F, San Gil I, Sansone SA, Schriml LM, Sterk P, Tatusova T, Ussery DW, White O, Wooley J. The Genomic Standards Consortium. PLoS Biol 2011; 9:e1001088. [PMID: 21713030 PMCID: PMC3119656 DOI: 10.1371/journal.pbio.1001088] [Citation(s) in RCA: 135] [Impact Index Per Article: 10.4] [Indexed: 11/30/2022] Open
Abstract
A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.
Affiliation(s)
- Dawn Field
- Centre for Ecology & Hydrology, Maclean Building, Crowmarsh Gifford, Wallingford, Oxfordshire, United Kingdom.
|
14
|
Kettner C, Field D, Sansone SA, Taylor C, Aerts J, Binns N, Blake A, Britten CM, de Marco A, Fostel J, Gaudet P, González-Beltrán A, Hardy N, Hellemans J, Hermjakob H, Juty N, Leebens-Mack J, Maguire E, Neumann S, Orchard S, Parkinson H, Piel W, Ranganathan S, Rocca-Serra P, Santarsiero A, Shotton D, Sterk P, Untergasser A, Whetzel PL. Meeting Report from the Second "Minimum Information for Biological and Biomedical Investigations" (MIBBI) workshop. Stand Genomic Sci 2010; 3:259-66. [PMID: 21304730 PMCID: PMC3035314 DOI: 10.4056/sigs.147362] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/20/2022] Open
Abstract
This report summarizes the proceedings of the second workshop of the 'Minimum Information for Biological and Biomedical Investigations' (MIBBI) consortium, held on Dec 1-2, 2010, in Rüdesheim, Germany, through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
Affiliation(s)
- Dawn Field
- Centre for Ecology & Hydrology, Oxfordshire UK
- Chris Taylor
- The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- Jan Aerts
- Faculty of Engineering - ESAT/SCD, Leuven University, Leuven-Heverlee, Belgium
- Nigel Binns
- Division of Pathway Medicine, University of Edinburgh Medical School, Edinburgh, UK
- Andrew Blake
- MRC Harwell, Harwell Science and Innovation Campus, Oxfordshire, UK
- Cedrik M. Britten
- Medical Department, University Medical Center, Johannes Gutenberg University-Mainz, Mainz, DE
- Ario de Marco
- Consortium for Genomic Technology, Milano, Italy
- University of Nova Gorica, Nova Gorica, Slovenia
- Alejandra González-Beltrán
- Computational and Systems Medicine and Department of Computer Science, University College London, London, UK
- Nigel Hardy
- Department of Computer Science, Aberystwyth University, Aberystwyth, UK
- Jan Hellemans
- Center for Medical Genetics, Ghent University, Ghent, Belgium
- Henning Hermjakob
- The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- Nick Juty
- The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- Jim Leebens-Mack
- Department of Plant Biology, University of Georgia, Athens, GA, U.S.A
- Eamonn Maguire
- University of Oxford, Oxford e-Research Centre, Oxfordshire, UK
- Steffen Neumann
- Department of Stress- and Developmental Biology, Institute for Plant Biochemistry, Halle, DE
- Sandra Orchard
- The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- Helen Parkinson
- The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- William Piel
- Peabody Museum of Natural History, Yale University, New Haven, CT, U.S.A
- Shoba Ranganathan
- Macquarie University, Sydney NSW, Australia
- National University of Singapore, Singapore
- Annapaola Santarsiero
- The Mario Negri Institute for Pharmacological Research, Cancer Pharmacology, 20156 Milan, Italy
- David Shotton
- Image Bioinformatics Research Group, Department of Zoology, University of Oxford, Oxford, UK
- Peter Sterk
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
- Andreas Untergasser
- Zentrum für Molekulare Biologie der Universität Heidelberg, Heidelberg, Germany
- Patricia L. Whetzel
- The National Center for Biomedical Ontology / Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, U.S.A
|
15
|
Field D, Kottmann R, Sterk P. The first special issue of Standards in Genomic Sciences from the Genomic Standards Consortium. Stand Genomic Sci 2010; 3:214-5. [PMID: 21304721 PMCID: PMC3035305 DOI: 10.4056/sigs.1493697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022] Open
|
16
|
Field D, Sansone S, Delong EF, Sterk P, Friedberg I, Kottmann R, Hirschman L, Garrity G, Cochrane G, Wooley J, Meyer F, Hunter S, White O. Meeting Report: Metagenomics, Metadata and MetaAnalysis (M3) at ISMB 2010. Stand Genomic Sci 2010; 3:232-4. [PMID: 21304724 PMCID: PMC3035302 DOI: 10.4056/sigs.1383476] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/20/2022] Open
Abstract
This report summarizes the proceedings of the first day of the Metagenomics, Metadata and MetaAnalysis (M3) workshop held at the Intelligent Systems for Molecular Biology 2010 conference. The second day, which was dedicated to the inaugural meeting of the BioSharing initiative, is presented in a separate report. The Genomic Standards Consortium (GSC) hosted the first day of this Special Interest Group (SIG) at ISMB to continue exploring the bottlenecks and emerging solutions for obtaining biological insights through large-scale comparative analysis of metagenomic datasets. The M3 SIG included invited and selected talks and a panel discussion at the end of the day involving the plenary speakers. Further information about the GSC and its range of activities can be found at http://gensc.org. Information about the newly established BioSharing effort can be found at http://biosharing.org/.
|