1
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2022; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/10/2022] [Indexed: 11/20/2022] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Affiliation(s)
- Marco Fabbri
  - European Commission Joint Research Centre, Ispra, Italy
- Guy Van den Eede
  - European Commission Joint Research Centre, Ispra, Italy
  - European Commission Joint Research Centre, Geel, Belgium
- Erik Alm
  - The European Centre for Disease Prevention and Control, Stockholm, Sweden
- Derya Aytan-Aktug
  - National Food Institute, Technical University of Denmark, Lyngby, Denmark
- Catherine Carrillo
  - Ottawa Laboratory – Carling, Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Kok-Gan Chan
  - International Genome Centre, Jiangsu University, Zhenjiang, China
  - Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Teresa Coque
  - Servicio de Microbiología, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain
  - Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Carlos III Health Institute, Madrid, Spain
- Ivo Gut
  - Centro Nacional de Análisis Genómico, Centre for Genomic Regulation (CNAG-CRG), Barcelona Institute of Technology, Barcelona, Spain
  - Universitat Pompeu Fabra, Barcelona, Spain
- Paul Hammer
  - BIOMES. NGS GmbH c/o Technische Hochschule Wildau, Wildau, Germany
- Gemma L. Kay
  - Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- Jean-Yves Madec
  - Unité Antibiorésistance et Virulence Bactériennes, ANSES Site de Lyon, Lyon, France
- Alison E. Mather
  - Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
  - University of East Anglia, Norwich, UK
- Thierry Naas
  - French-NRC for CPEs, Service de Bactériologie-Hygiène, Hôpital de Bicêtre, Le Kremlin-Bicêtre, France
- Silke Peter
  - Institute of Medical Microbiology and Hygiene, University of Tübingen, Tübingen, Germany
- Arthur Pightling
  - Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, USA
- John Rossen
  - Department of Medical Microbiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Robert Schlaberg
  - Department of Pathology, University of Utah, Salt Lake City, UT, USA
- Kevin Vanneste
  - Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
- Lukas M. Weber
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
  - Present address: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
2
Perez-Riverol Y, Bai J, Bandla C, García-Seisdedos D, Hewapathirana S, Kamatchinathan S, Kundu D, Prakash A, Frericks-Zipper A, Eisenacher M, Walzer M, Wang S, Brazma A, Vizcaíno J. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res 2022; 50:D543-D552. [PMID: 34723319 PMCID: PMC8728295 DOI: 10.1093/nar/gkab1038] [Citation(s) in RCA: 2877] [Impact Index Per Article: 1438.5] [Received: 09/11/2021] [Revised: 10/12/2021] [Accepted: 10/14/2021] [Indexed: 12/12/2022] Open
Abstract
The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.
Affiliation(s)
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jingwen Bai
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Chakradhar Bandla
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- David García-Seisdedos
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Suresh Hewapathirana
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Selvakumar Kamatchinathan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Deepti J Kundu
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ananth Prakash
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Anika Frericks-Zipper
  - Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany
  - Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany
- Martin Eisenacher
  - Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany
  - Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany
- Mathias Walzer
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Shengbo Wang
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alvis Brazma
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
3
Klie A, Tsui BY, Mollah S, Skola D, Dow M, Hsu CN, Carter H. Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition. Database: The Journal of Biological Databases and Curation 2021; 2021:6259052. [PMID: 33914028 PMCID: PMC8083811 DOI: 10.1093/database/baab021] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/14/2020] [Revised: 03/11/2021] [Accepted: 04/16/2021] [Indexed: 11/14/2022]
Abstract
High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
Affiliation(s)
- Adam Klie
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Brian Y Tsui
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Shamim Mollah
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Genetics, Washington University in St. Louis, St. Louis, MO 63130, USA
- Dylan Skola
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Michelle Dow
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Chun-Nan Hsu
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Neurosciences, University of California San Diego, La Jolla, CA 92093, USA
- Hannah Carter
  - Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA
4
Das J, Barman Mandal S. Classification of Homo sapiens gene behavior using linear discriminant analysis fused with minimum entropy mapping. Med Biol Eng Comput 2021; 59:673-691. [PMID: 33595791 DOI: 10.1007/s11517-021-02324-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2020] [Accepted: 01/18/2021] [Indexed: 11/25/2022]
Abstract
Classification of Homo sapiens gene behavior employing computational biology is a recent research trend. But monitoring gene activity profiles and genetic behavior from the alphabetic DNA sequence using a non-invasive method is a tremendous challenge in functional genomics. The present paper addresses this issue and attempts to differentiate Homo sapiens genes using the linear discriminant analysis (LDA) method. Annotated protein coding sequences of Homo sapiens genes, collected from NCBI, are taken as test samples. A minimum entropy-based mapping (MEM) technique helps to extract the most information from the numerical DNA sequences. The proposed LDA technique has successfully classified Homo sapiens genes based on the following features: composition of hydrophilic amino acids, dominance of the amino acid arginine, and magnitude and size of individual amino acids. The proposed algorithm is successfully tested on 84 Homo sapiens healthy and cancer genes of the prostate and breast cells. Classification performance of the proposed LDA technique is judged by sensitivity (89.12%), specificity (91.9%), accuracy (90.87%), F1 score (92.03%), Matthews' correlation coefficient (81.04%), and miss rate (9.12%), and it outperforms four other existing classifiers. The results are cross-validated through Rayleigh PDF and mutual information techniques. The Fisher test, 2-sample T-test, and relative entropy test are used to verify the efficacy of the present classifier.
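The performance measures named in this abstract all follow from a binary confusion matrix. The sketch below is only an illustration of the standard definitions, not the authors' evaluation code; the function name `classification_metrics` and the example counts are assumptions made up for this example and are unrelated to the paper's data.

```python
from math import sqrt


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # true positive rate (recall)
    specificity = tn / (tn + fp)           # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when a marginal is empty
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    miss_rate = fn / (tp + fn)             # false negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        "miss_rate": miss_rate,
    }


# Hypothetical counts, chosen only to exercise the formulas
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a single confusion matrix, miss rate is the complement of sensitivity; values averaged over several runs or classes need not add up in this way.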
Affiliation(s)
- Joyshri Das
  - Institute of Radio Physics & Electronics, University of Calcutta, Kolkata, India
- Soma Barman Mandal
  - Institute of Radio Physics & Electronics, University of Calcutta, Kolkata, India
5
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2021; 10:80. [DOI: 10.12688/f1000research.39214.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 02/02/2021] [Indexed: 01/12/2023] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
6
Zondervan NA, Martins Dos Santos VAP, Suarez-Diez M, Saccenti E. Phenotype and multi-omics comparison of Staphylococcus and Streptococcus uncovers pathogenic traits and predicts zoonotic potential. BMC Genomics 2021; 22:102. [PMID: 33541265 PMCID: PMC7860044 DOI: 10.1186/s12864-021-07388-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/01/2020] [Accepted: 01/13/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND: Staphylococcus and Streptococcus species can cause many different diseases, ranging from mild skin infections to life-threatening necrotizing fasciitis. Both genera consist of commensal species that colonize the skin and nose of humans and animals, some of which can display a pathogenic phenotype. RESULTS: We compared 235 Staphylococcus and 315 Streptococcus genomes based on their protein domain content. We show the relationships between protein persistence and essentiality by integrating essentiality predictions from two metabolic models and essentiality measurements from six large-scale transposon mutagenesis experiments. We identified clusters of strains within species based on proteins associated with similar biological processes. We built Random Forest classifiers that predicted the zoonotic potential. Furthermore, we identified shared attributes between Staphylococcus aureus and Streptococcus pyogenes that allow them to cause necrotizing fasciitis. CONCLUSIONS: Differences observed in clustering of strains based on functional groups of proteins correlate with phenotypes such as host tropism, capability to infect multiple hosts and drug resistance. Our method provides a solid basis towards large-scale prediction of phenotypes based on genomic information.
Affiliation(s)
- Niels A Zondervan
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708WE, Wageningen, Netherlands
- Vitor A P Martins Dos Santos
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708WE, Wageningen, Netherlands
  - LifeGlimmer GmBH, Markelstraße 38, 12163, Berlin, Germany
- Maria Suarez-Diez
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708WE, Wageningen, Netherlands
- Edoardo Saccenti
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708WE, Wageningen, Netherlands
7
Roy T, Bhattacharjee P. Performance analysis of melanoma classifier using electrical modeling technique. Med Biol Eng Comput 2020; 58:2443-2454. [PMID: 32770290 DOI: 10.1007/s11517-020-02241-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/19/2020] [Accepted: 07/27/2020] [Indexed: 11/25/2022]
Abstract
An efficient and novel modeling approach is proposed in this paper for identifying proteins or genes involved in melanoma skin cancer. Two types of classifiers are modeled, based on the chemical structure and hydropathy property of amino acids. These classifiers are further implemented using an NI LabVIEW-based hardware kit to observe the real-time response for proper diagnosis. The phase responses, pole-zero diagrams, and transient responses are examined to screen out the genes related to melanoma from healthy genes. The performance of the proposed classifier is measured using various performance metrics such as accuracy, sensitivity, and specificity. The classifier is tested along with a color code scheme on skin genes and shows its superiority over traditional methods by achieving 94% classification accuracy with 96% sensitivity. Graphical abstract: An equivalent electrical model is developed for designing a melanoma classifier. Initially, each amino acid is modeled using an RC passive circuit depending on its physicochemical structure and hydropathy nature, to form a gene structure model. The melanoma-related genes are detected by phase, transient, and color code analysis.
Affiliation(s)
- Tanusree Roy
  - Department of Electrical and Electronics Engineering, University of Engineering and Management, Kolkata, 700135, India
- Pranabesh Bhattacharjee
  - Department of Electrical and Electronics Engineering, University of Engineering and Management, Kolkata, 700135, India
8
Abstract
Today, Scientific Data is refining its standards for new submissions describing nucleic acid sequence data.
9
Manzoni C, Kia DA, Vandrovcova J, Hardy J, Wood NW, Lewis PA, Ferrari R. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform 2019; 19:286-302. [PMID: 27881428 PMCID: PMC6018996 DOI: 10.1093/bib/bbw114] [Citation(s) in RCA: 376] [Impact Index Per Article: 75.2] [Received: 07/28/2016] [Indexed: 02/07/2023] Open
Abstract
Advances in the technologies and informatics used to generate and process large biological data sets (omics data) are promoting a critical shift in the study of biomedical sciences. While genomics, transcriptomics and proteomics, coupled with bioinformatics and biostatistics, are gaining momentum, they are still, for the most part, assessed individually with distinct approaches generating monothematic rather than integrated knowledge. As other areas of biomedical sciences, including metabolomics, epigenomics and pharmacogenomics, are moving towards the omics scale, we are witnessing the rise of inter-disciplinary data integration strategies to support a better understanding of biological systems and eventually the development of successful precision medicine. This review cuts across the boundaries between genomics, transcriptomics and proteomics, summarizing how omics data are generated, analysed and shared, and provides an overview of the current strengths and weaknesses of this global approach. This work intends to target students and researchers seeking knowledge outside of their field of expertise and fosters a leap from the reductionist to the global-integrative analytical approach in research.
Affiliation(s)
- Claudia Manzoni
  - School of Pharmacy, University of Reading, Whiteknights, Reading, United Kingdom
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- Demis A Kia
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- Jana Vandrovcova
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- John Hardy
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- Nicholas W Wood
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- Patrick A Lewis
  - School of Pharmacy, University of Reading, Whiteknights, Reading, United Kingdom
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
- Raffaele Ferrari
  - Department Molecular Neuroscience, UCL Institute of Neurology, London, United Kingdom
10
Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Sci Data 2019; 6:190021. [PMID: 30778255 PMCID: PMC6380228 DOI: 10.1038/sdata.2019.21] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 08/02/2018] [Accepted: 01/18/2019] [Indexed: 11/08/2022] Open
Abstract
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
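The kind of check this abstract describes, testing whether binary or numeric fields hold values of the required type, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' analysis code: the field names `is_tumor` and `age`, the accepted boolean spellings, and the helper names are all hypothetical.

```python
import math


def is_valid_boolean(value: str) -> bool:
    """Accept a small, assumed set of spellings for a binary field."""
    return value.strip().lower() in {"true", "false", "yes", "no", "0", "1"}


def is_valid_number(value: str) -> bool:
    """Accept any string that parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


# Hypothetical field requirements; real repositories define their own
RULES = {"is_tumor": is_valid_boolean, "age": is_valid_number}


def find_invalid(records, rules):
    """Return (record index, field name, offending value) for each violation."""
    problems = []
    for index, record in enumerate(records):
        for field, check in rules.items():
            if field in record and not check(record[field]):
                problems.append((index, field, record[field]))
    return problems


# Made-up sample records for illustration only
records = [
    {"is_tumor": "yes", "age": "42"},
    {"is_tumor": "maybe", "age": "N/A"},
]
print(find_invalid(records, RULES))
```

Keeping the rules in a plain mapping from field name to predicate makes it easy to add further field types without changing the scanning loop.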
Affiliation(s)
- Rafael S. Gonçalves
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA
- Mark A. Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA
11
Brandizi M, Melnichuk O, Bild R, Kohlmayer F, Rodriguez-Castro B, Spengler H, Kuhn KA, Kuchinke W, Ohmann C, Mustonen T, Linden M, Nyrönen T, Lappalainen I, Brazma A, Sarkans U. Orchestrating differential data access for translational research: a pilot implementation. BMC Med Inform Decis Mak 2017; 17:30. [PMID: 28330491 PMCID: PMC5363029 DOI: 10.1186/s12911-017-0424-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 10/06/2016] [Accepted: 03/03/2017] [Indexed: 01/30/2023] Open
Abstract
Background: Translational researchers need robust IT solutions to access a range of data types, varying from public data sets to pseudonymised patient information with restricted access, provided on a case-by-case basis. The reason for this complication is that managing access policies to sensitive human data must consider issues of data confidentiality, identifiability, extent of consent, and data usage agreements. All these ethical, social and legal aspects must be incorporated into a differential management of restricted access to sensitive data. Methods: In this paper we present a pilot system that uses several common open source software components in a novel combination to coordinate access to heterogeneous biomedical data repositories containing open data (open access) as well as sensitive data (restricted access) in the domain of biobanking and biosample research. Our approach is based on a digital identity federation and software to manage resource access entitlements. Results: Open source software components were assembled and configured in such a way that they allow for different ways of restricted access according to the protection needs of the data. We have tested the resulting pilot infrastructure and assessed its performance, feasibility and reproducibility. Conclusions: Common open source software components are sufficient to allow for the creation of a secure system for differential access to sensitive data. The implementation of this system is exemplary for researchers facing similar requirements for restricted access data. Here we report experience and lessons learnt of our pilot implementation, which may be useful for similar use cases. Furthermore, we discuss possible extensions for more complex scenarios.
Collapse
Affiliation(s)
- Marco Brandizi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.
| | - Olga Melnichuk
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
| | - Raffael Bild
- Chair of Medical Informatics, Institute of Medical Statistics and Epidemiology, University Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
| | - Florian Kohlmayer
- Chair of Medical Informatics, Institute of Medical Statistics and Epidemiology, University Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
| | - Benedicto Rodriguez-Castro
- Chair of Medical Informatics, Institute of Medical Statistics and Epidemiology, University Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
| | - Helmut Spengler
- Chair of Medical Informatics, Institute of Medical Statistics and Epidemiology, University Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
| | - Klaus A Kuhn
- Chair of Medical Informatics, Institute of Medical Statistics and Epidemiology, University Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
| | - Wolfgang Kuchinke
- Heinrich-Heine Universität Düsseldorf, Coordination Centre for Clinical Trials, Düsseldorf, Germany
| | - Christian Ohmann
- European Clinical Research Infrastructure Network (ECRIN), Düsseldorf, Germany
| | | | | | | | | | - Alvis Brazma
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
| | - Ugis Sarkans
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.
|
12
|
De Sousa PA, Steeg R, Wachter E, Bruce K, King J, Hoeve M, Khadun S, McConnachie G, Holder J, Kurtz A, Seltmann S, Dewender J, Reimann S, Stacey G, O'Shea O, Chapman C, Healy L, Zimmermann H, Bolton B, Rawat T, Atkin I, Veiga A, Kuebler B, Serano BM, Saric T, Hescheler J, Brüstle O, Peitz M, Thiele C, Geijsen N, Holst B, Clausen C, Lako M, Armstrong L, Gupta SK, Kvist AJ, Hicks R, Jonebring A, Brolén G, Ebneth A, Cabrera-Socorro A, Foerch P, Geraerts M, Stummann TC, Harmon S, George C, Streeter I, Clarke L, Parkinson H, Harrison PW, Faulconbridge A, Cherubin L, Burdett T, Trigueros C, Patel MJ, Lucas C, Hardy B, Predan R, Dokler J, Brajnik M, Keminer O, Pless O, Gribbon P, Claussen C, Ringwald A, Kreisel B, Courtney A, Allsopp TE. Rapid establishment of the European Bank for induced Pluripotent Stem Cells (EBiSC) - the Hot Start experience. Stem Cell Res 2017; 20:105-114. [PMID: 28334554 DOI: 10.1016/j.scr.2017.03.002] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 11/17/2016] [Revised: 02/17/2017] [Accepted: 03/03/2017] [Indexed: 10/20/2022] Open
Abstract
A fast-track "Hot Start" process was implemented to launch the European Bank for Induced Pluripotent Stem Cells (EBiSC) to provide early release of a range of established control and disease-linked human induced pluripotent stem cell (hiPSC) lines. Established practice amongst consortium members was surveyed to arrive at harmonised and publicly accessible Standard Operating Procedures (SOPs) for tissue procurement, bio-sample tracking, iPSC expansion, cryopreservation, qualification and distribution to the research community. These were implemented to create a quality-managed foundational collection of lines and associated data made available for distribution. Here we report on the successful outcome of this experience and the workflow for banking and facilitating access to an otherwise disparate European resource, with lessons to benefit the international research community. ETOC: The report focuses on the EBiSC experience of rapidly establishing an operational capacity to procure, bank and distribute a foundational collection of established hiPSC lines. It validates the feasibility and defines the challenges of harnessing and integrating the capability and productivity of centres across Europe using commonly available resources currently in the field.
Affiliation(s)
- Paul A De Sousa
- Centre for Clinical Brain Sciences, Chancellors Building, 49 Little France Crescent, University of Edinburgh, Edinburgh EH16 4SB, UK; Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK.
| | - Rachel Steeg
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Elisabeth Wachter
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Kevin Bruce
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Jason King
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Marieke Hoeve
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Shalinee Khadun
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - George McConnachie
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Julie Holder
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Andreas Kurtz
- Charité - Universitätsmedizin Berlin, Berlin-Brandenburg Center for Regenerative Therapies, Augustenburger Platz, Berlin 13353, Germany
| | - Stefanie Seltmann
- Charité - Universitätsmedizin Berlin, Berlin-Brandenburg Center for Regenerative Therapies, Augustenburger Platz, Berlin 13353, Germany
| | - Johannes Dewender
- Charité - Universitätsmedizin Berlin, Berlin-Brandenburg Center for Regenerative Therapies, Augustenburger Platz, Berlin 13353, Germany
| | - Sascha Reimann
- Charité - Universitätsmedizin Berlin, Berlin-Brandenburg Center for Regenerative Therapies, Augustenburger Platz, Berlin 13353, Germany
| | - Glyn Stacey
- UK Stem Cell Bank, Division of Advanced Therapies, National Institute for Biological Standards and Control, Medicines and Healthcare Products Regulatory Authority, Blanche Lane, South Mimms, Hertfordshire, ENG 3GQ, UK
| | - Orla O'Shea
- UK Stem Cell Bank, Division of Advanced Therapies, National Institute for Biological Standards and Control, Medicines and Healthcare Products Regulatory Authority, Blanche Lane, South Mimms, Hertfordshire, ENG 3GQ, UK
| | - Charlotte Chapman
- UK Stem Cell Bank, Division of Advanced Therapies, National Institute for Biological Standards and Control, Medicines and Healthcare Products Regulatory Authority, Blanche Lane, South Mimms, Hertfordshire, ENG 3GQ, UK
| | - Lyn Healy
- UK Stem Cell Bank, Division of Advanced Therapies, National Institute for Biological Standards and Control, Medicines and Healthcare Products Regulatory Authority, Blanche Lane, South Mimms, Hertfordshire, ENG 3GQ, UK
| | - Heiko Zimmermann
- Fraunhofer Institute for Biomedical Engineering (IBMT), Josef-von-Fraunhofer-Weg 1, 66280 Sulzbach, Germany; Molecular & Cellular Biotechnology/Nanotechnology, Saarland University, Campus, 66123 Saarbrücken, Germany
| | - Bryan Bolton
- European Collection of Authenticated Cell Cultures, Public Health England, Porton Down, Salisbury SP4 0JG, UK
| | - Trisha Rawat
- European Collection of Authenticated Cell Cultures, Public Health England, Porton Down, Salisbury SP4 0JG, UK
| | - Isobel Atkin
- European Collection of Authenticated Cell Cultures, Public Health England, Porton Down, Salisbury SP4 0JG, UK
| | - Anna Veiga
- Barcelona Stem Cell Bank, Centre for Regenerative Medicine in Barcelona, C/Dr. Aiguader 88, 08003 Barcelona, Spain
| | - Bernd Kuebler
- Barcelona Stem Cell Bank, Centre for Regenerative Medicine in Barcelona, C/Dr. Aiguader 88, 08003 Barcelona, Spain
| | - Blanca Miranda Serano
- Andalusian Public Health Care System, Avda Conocimiento sn, 18100 Armilla, Granada, Spain
| | - Tomo Saric
- Centre for Physiology and Pathophysiology, Institute for Neurophysiology, Medical Faculty, University of Cologne, 50931 Cologne, Germany
| | - Jürgen Hescheler
- Centre for Physiology and Pathophysiology, Institute for Neurophysiology, Medical Faculty, University of Cologne, 50931 Cologne, Germany
| | - Oliver Brüstle
- Institute of Reconstructive Neurobiology, LIFE & BRAIN Centre, University of Bonn, Sigmund-Freud-Strasse 25, 53105 Bonn, Germany
| | - Michael Peitz
- Institute of Reconstructive Neurobiology, LIFE & BRAIN Centre, University of Bonn, Sigmund-Freud-Strasse 25, 53105 Bonn, Germany
| | - Cornelia Thiele
- Institute of Reconstructive Neurobiology, LIFE & BRAIN Centre, University of Bonn, Sigmund-Freud-Strasse 25, 53105 Bonn, Germany
| | - Niels Geijsen
- Hubrecht Institute for developmental biology and stem cell research, Royal Netherlands Academy of Arts and Sciences (KNAW), Utrecht University, Department of Clinical Sciences of Companion Animals and UMC Utrecht, 3584CT Utrecht, The Netherlands
| | - Bjørn Holst
- Bioneer A/S, Kogle Alle 2, DK-2970 Hørsholm, Denmark
| | | | - Majlinda Lako
- Institute for Genetic Medicine, University of Newcastle, Newcastle NE1 3BZ, United Kingdom
| | - Lyle Armstrong
- Institute for Genetic Medicine, University of Newcastle, Newcastle NE1 3BZ, United Kingdom
| | - Shailesh K Gupta
- AstraZeneca, R&D, Innovative Medicines, Discovery Sciences, Reagents and Assay Development, HC3006, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Alexander J Kvist
- AstraZeneca, R&D, Innovative Medicines, Discovery Sciences, Reagents and Assay Development, HC3006, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Ryan Hicks
- AstraZeneca, R&D, Innovative Medicines, Discovery Sciences, Reagents and Assay Development, HC3006, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Anna Jonebring
- AstraZeneca, R&D, Innovative Medicines, Discovery Sciences, Reagents and Assay Development, HC3006, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Gabriella Brolén
- AstraZeneca, R&D, Innovative Medicines, Discovery Sciences, Reagents and Assay Development, HC3006, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Andreas Ebneth
- Janssen Research & Development (A Division of Janssen Pharmaceutica N.V), Neuroscience Therapeutic Area, Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Alfredo Cabrera-Socorro
- Janssen Research & Development (A Division of Janssen Pharmaceutica N.V), Neuroscience Therapeutic Area, Turnhoutseweg 30, 2340 Beerse, Belgium
| | - Patrik Foerch
- UCB Biopharma (since May 2014), Discovery Research, Chemin du Foriest, Braine l'Alleud B-1420, Belgium
| | - Martine Geraerts
- UCB Biopharma (since May 2014), Discovery Research, Chemin du Foriest, Braine l'Alleud B-1420, Belgium
| | | | - Shawn Harmon
- University of Edinburgh School of Law, Old College, South Bridge, Edinburgh EH8 9YL, UK
| | - Carol George
- University of Edinburgh School of Law, Old College, South Bridge, Edinburgh EH8 9YL, UK
| | - Ian Streeter
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Peter W Harrison
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Adam Faulconbridge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Luca Cherubin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Tony Burdett
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Cesar Trigueros
- Inbiomed, P° Mikeletegi, 81, 20009 San Sebastián, Gipuzkoa, Spain
| | - Minal J Patel
- Cellular Generation and Phenotyping (CGaP) facility, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
| | - Christa Lucas
- Cellular Generation and Phenotyping (CGaP) facility, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
| | - Barry Hardy
- Douglas Connect, Technology Park Basel, Hochbergerstrasse 60C, 4057 Basel, Switzerland
| | - Rok Predan
- Douglas Connect, Technology Park Basel, Hochbergerstrasse 60C, 4057 Basel, Switzerland
| | - Joh Dokler
- Douglas Connect, Technology Park Basel, Hochbergerstrasse 60C, 4057 Basel, Switzerland
| | - Maja Brajnik
- Douglas Connect, Technology Park Basel, Hochbergerstrasse 60C, 4057 Basel, Switzerland
| | - Oliver Keminer
- Fraunhofer IME ScreeningPort, Schnackenburgallee 114, D-22525 Hamburg, Germany
| | - Ole Pless
- Fraunhofer IME ScreeningPort, Schnackenburgallee 114, D-22525 Hamburg, Germany
| | - Philip Gribbon
- Fraunhofer IME ScreeningPort, Schnackenburgallee 114, D-22525 Hamburg, Germany
| | - Carsten Claussen
- Fraunhofer IME ScreeningPort, Schnackenburgallee 114, D-22525 Hamburg, Germany
| | | | - Beate Kreisel
- ARTTIC, 58A rue du Dessous des Berges, F-75013 Paris, France
| | - Aidan Courtney
- Roslin Cells Ltd(1), Head office, Nine Edinburgh Bioquarter, 9 Little France Rd, Edinburgh EH16 4UX, UK; EBiSC banking facility, Babraham Research Campus, B260 Meditrina, Cambridge CB22 3AT, UK
| | - Timothy E Allsopp
- Pfizer Ltd (Neusentis), The Portway Building, Granta Park, Great Abington, Cambridge, CB21 6GS, UK
|
13
|
Krauth C, Kuchinke W, Eckert M, Bergmann R, Braasch B, Karakoyun T, Ohmann C. Clinical Trial Information Mediator. J Biomed Inform 2016; 63:157-168. [DOI: 10.1016/j.jbi.2016.08.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 04/13/2016] [Revised: 07/04/2016] [Accepted: 08/07/2016] [Indexed: 11/28/2022]
|
14
|
Chang WE, Peterson MW, Garay CD, Korves T. Pathogen metadata platform: software for accessing and analyzing pathogen strain information. BMC Bioinformatics 2016; 17:379. [PMID: 27634291 PMCID: PMC5025631 DOI: 10.1186/s12859-016-1231-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/16/2015] [Accepted: 08/26/2016] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: Pathogen metadata includes information about where and when a pathogen was collected and the type of environment it came from. Along with genomic nucleotide sequence data, this metadata is growing rapidly and becoming a valuable resource not only for research but for biosurveillance and public health. However, current freely available tools for analyzing this data are geared towards bioinformaticians and/or do not provide summaries and visualizations needed to readily interpret results. RESULTS: We designed a platform to easily access and summarize data about pathogen samples. The software includes a PostgreSQL database that captures metadata useful for disease outbreak investigations, and scripts for downloading and parsing data from NCBI BioSample and BioProject into the database. The software provides a user interface to query metadata and obtain standardized results in an exportable, tab-delimited format. To visually summarize results, the user interface provides a 2D histogram for user-selected metadata types and mapping of geolocated entries. The software is built on the LabKey data platform, an open-source data management platform, which enables developers to add functionalities. We demonstrate the use of the software in querying for a pathogen serovar and for genome sequence identifiers. CONCLUSIONS: This software enables users to create a local database for pathogen metadata, populate it with data from NCBI, easily query the data, and obtain visual summaries. Some of the components, such as the database, are modular and can be incorporated into other data platforms. The source code is freely available for download at https://github.com/wchangmitre/bioattribution.
Affiliation(s)
- Wenling E. Chang
- Data Analytics Department, The MITRE Corporation, 2280 Historic Decatur Rd, San Diego, CA 92106 USA
| | - Matthew W. Peterson
- Data Analytics Department, The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
| | - Christopher D. Garay
- Data Analytics Department, The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
| | - Tonia Korves
- Data Analytics Department, The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
|
15
|
Clarke L, Fairley S, Zheng-Bradley X, Streeter I, Perry E, Lowy E, Tassé AM, Flicek P. The international Genome sample resource (IGSR): A worldwide collection of genome variation incorporating the 1000 Genomes Project data. Nucleic Acids Res 2016; 45:D854-D859. [PMID: 27638885 PMCID: PMC5210610 DOI: 10.1093/nar/gkw829] [Citation(s) in RCA: 154] [Impact Index Per Article: 19.3] [Received: 08/16/2016] [Accepted: 09/08/2016] [Indexed: 01/09/2023] Open
Abstract
The International Genome Sample Resource (IGSR; http://www.internationalgenome.org) expands in data type and population diversity the resources from the 1000 Genomes Project. IGSR represents the largest open collection of human variation data and provides easy access to these resources. IGSR was established in 2015 to maintain and extend the 1000 Genomes Project data, which has been widely used as a reference set of human variation and by researchers developing analysis methods. IGSR has mapped all of the 1000 Genomes sequence to the newest human reference (GRCh38), and will release updated variant calls to ensure maximal usefulness of the existing data. IGSR is collecting new structural variation data on the 1000 Genomes samples from long read sequencing and other technologies, and will collect relevant functional data into a single comprehensive resource. IGSR is extending coverage with new populations sequenced by collaborating groups. Here, we present the new data and analysis that IGSR has made available. We have also introduced a new data portal that increases discoverability of our data—previously only browseable through our FTP site—by focusing on particular samples, populations or data sets of interest.
Affiliation(s)
- Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Susan Fairley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Xiangqun Zheng-Bradley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ian Streeter
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Emily Perry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ernesto Lowy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anne-Marie Tassé
- Public Population Project in Genomics and Society, McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
16
|
Tuggle CK, Giuffra E, White SN, Clarke L, Zhou H, Ross PJ, Acloque H, Reecy JM, Archibald A, Bellone RR, Boichard M, Chamberlain A, Cheng H, Crooijmans RPMA, Delany ME, Finno CJ, Groenen MAM, Hayes B, Lunney JK, Petersen JL, Plastow GS, Schmidt CJ, Song J, Watson M. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes. Anim Genet 2016; 47:528-33. [PMID: 27453069 PMCID: PMC5082551 DOI: 10.1111/age.12466] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Accepted: 05/21/2016] [Indexed: 12/18/2022]
Abstract
The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO‐FAANG) Workshop in Washington, DC on October 7–8, 2015. This consortium is a grass‐roots organization formed to advance the annotation of newly assembled genomes of domesticated and non‐model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO‐FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta‐data analysis standards were established.
Affiliation(s)
- Christopher K Tuggle
- Department of Animal Science, Iowa State University, 806 Stange Road, Ames, IA, 50011, USA.
| | - Elisabetta Giuffra
- GABI, INRA, AgroParisTech, Université Paris-Saclay, 78350, Jouy-en-Josas, France.
| | - Stephen N White
- USDA-ARS Animal Disease Research Unit, Pullman, WA, 99164, USA; Department of Veterinary Microbiology & Pathology, Washington State University, Pullman, WA, 99164, USA; Center for Reproductive Biology, Washington State University, Pullman, WA, 99164, USA
| | - Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Huaijun Zhou
- Department of Animal Science, University of California, Davis, CA, 95616, USA
| | - Pablo J Ross
- Department of Animal Science, University of California, Davis, CA, 95616, USA
| | - Hervé Acloque
- INRA, UMR1388 Génétique, Physiologie et Systèmes d'Elevage, F-31326, Castanet Tolosan, France
| | - James M Reecy
- Department of Animal Science, Iowa State University, 806 Stange Road, Ames, IA, 50011, USA
| | - Alan Archibald
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Edinburgh, EH29 9RG, UK
| | - Rebecca R Bellone
- Department of Population Health and Reproduction, Veterinary Genetics Laboratory, School of Veterinary Medicine, University of California-Davis, Davis, CA, USA
| | - Michèle Boichard
- GABI, INRA, AgroParisTech, Université Paris-Saclay, 78350, Jouy-en-Josas, France
| | - Amanda Chamberlain
- Department of Economic Development, Jobs, Transport and Resources, Agribiosciences Building, Bundoora, Australia
| | - Hans Cheng
- Avian Disease and Oncology Laboratory, USDA, ARS, East Lansing, MI, 48823, USA
| | - Richard P M A Crooijmans
- Animal Breeding and Genomics Centre, Wageningen University, PO Box 338, 6700 AH Wageningen, The Netherlands
| | - Mary E Delany
- Department of Animal Science, University of California, Davis, CA, 95616, USA
| | - Carrie J Finno
- Department of Population Health and Reproduction, University of California, Davis, CA, 95616, USA
| | - Martien A M Groenen
- Animal Breeding and Genomics Centre, Wageningen University, PO Box 338, 6700 AH Wageningen, The Netherlands
| | - Ben Hayes
- Queensland Alliance for Agriculture and Food Innovation, Centre for Animal Science, The University of Queensland, St. Lucia, 4072, Queensland, Australia
| | - Joan K Lunney
- Animal Parasitic Diseases Laboratory, BARC, ARS, USDA, Beltsville, MD, 20705, USA
| | - Jessica L Petersen
- Department of Agricultural, Food, and Nutritional Science, University of Alberta, Edmonton, AB, Canada
| | - Graham S Plastow
- Department of Agricultural, Food, and Nutritional Science, University of Alberta, Edmonton, AB, Canada
| | - Carl J Schmidt
- Department of Animal and Food Sciences, University of Delaware, Newark, DE, 19716, USA
| | - Jiuzhou Song
- Department of Animal and Avian Sciences, University of Maryland, College Park, MD, 20742, USA
| | - Mick Watson
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Edinburgh, EH29 9RG, UK
|
17
|
Abstract
Cancer classification based on site of origin is a very significant research issue for the prediction and treatment of cancer. This paper addresses the problem of cancer classification for Homo sapiens genes composed of amino acid chains. The cancer gene network is realized by equivalent electrical circuits based on the hydrophilic/hydrophobic property of amino acids, and a classifier is modeled to determine the cancer origin. The phase value, peak gain value and shape of the Nyquist curve of the network model are investigated to characterize different types of cancer gene origins. The model achieves 81.09% classification accuracy and proves to be more sensitive and simpler, since it shows 69% better performance compared to the existing nucleotide-based method. The proposed classifier successfully predicts the site of origin of 93 cancer gene samples.
|
18
|
Kerksick CM, Tsatsakis AM, Hayes AW, Kafantaris I, Kouretas D. How can bioinformatics and toxicogenomics assist the next generation of research on physical exercise and athletic performance. J Strength Cond Res 2015; 29:270-8. [PMID: 25353080 DOI: 10.1519/jsc.0000000000000730] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
Abstract
The past 2-3 decades have seen an explosion in analytical areas related to "omic" technologies. These advancements have reached a point where they can be, and are being, applied in exercise physiology and sport performance research. They have enabled researchers to analyze extremely large data sets that provide amounts of information never before available. Although these "omic" technologies offer exciting possibilities, the analytical costs and time required to complete the statistical approaches are substantial. The areas of exercise physiology and sport performance continue to witness an exponential growth of published studies using any combination of these techniques. Because more investigators within these traditionally applied science disciplines use these approaches, the need for efficient, thoughtful, and accurate extraction of information from electronic databases is paramount. As before, these disciplines can learn much from other disciplines that have already developed software and technologies to rapidly enhance the quality of results received when searching for key information. In addition, further development and interest in areas such as toxicogenomics could aid in the development and identification of more accurate testing programs for illicit drugs, performance enhancing drugs abused in sport, and better therapeutic outcomes from prescribed drug use. This review is intended to offer a discussion related to how bioinformatics approaches may assist the new generation of "omic" research in areas related to exercise physiology and toxicogenomics. Consequently, more focus will be placed on popular tools that are already available for analyzing such complex data and highlighting additional strategies and considerations that can further aid in developing new tools and data management approaches to assist future research in this field. It is our contention that introducing more scientists to how this type of work can complement existing experimental approaches within exercise physiology and sport performance will foster additional discussion and stimulate new research in these areas.
Affiliation(s)
- Chad M Kerksick
- 1Department of Exercise Science, School of Sport, Recreation and Exercise Sciences, Lindenwood University, St. Charles, Missouri; 2Department of Forensic Sciences and Toxicology, Laboratory of Toxicology, Medical School, University of Crete, Heraklion, Greece; 3Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts; 4Spherix Consulting, Inc., Bethesda, Maryland; and 5Department of Biochemistry and Biotechnology, University of Thessaly, Larissa, Greece
|
19
|
Perez-Riverol Y, Alpi E, Wang R, Hermjakob H, Vizcaíno JA. Making proteomics data accessible and reusable: current state of proteomics databases and repositories. Proteomics 2015; 15:930-49. [PMID: 25158685 PMCID: PMC4409848 DOI: 10.1002/pmic.201400302] [Citation(s) in RCA: 141] [Impact Index Per Article: 15.7] [Received: 07/01/2014] [Revised: 08/06/2014] [Accepted: 08/22/2014] [Indexed: 01/10/2023]
Abstract
Compared to other data-intensive disciplines such as genomics, public deposition and storage of MS-based proteomics data are still less developed due to, among other reasons, the inherent complexity of the data and the variety of data types and experimental workflows. In order to address this need, several public repositories for MS proteomics experiments have been developed, each with different purposes in mind. The most established resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, and the PRIDE database. Additionally, there are other useful (in many cases recently developed) resources such as ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), Chorus, MaxQB, PeptideAtlas SRM Experiment Library (PASSEL), Model Organism Protein Expression Database (MOPED), and the Human Proteinpedia. In addition, the ProteomeXchange consortium has been recently developed to enable better integration of public repositories and the coordinated sharing of proteomics information, maximizing its benefit to the scientific community. Here, we will review each of the major proteomics resources independently and some tools that enable the integration, mining and reuse of the data. We will also discuss some of the major challenges and current pitfalls in the integration and sharing of the data.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
|
20
|
Seltmann S, Lekschas F, Müller R, Stachelscheid H, Bittner MS, Zhang W, Kidane L, Seriola A, Veiga A, Stacey G, Kurtz A. hPSCreg--the human pluripotent stem cell registry. Nucleic Acids Res 2015; 44:D757-63. [PMID: 26400179 PMCID: PMC4702942 DOI: 10.1093/nar/gkv963] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 08/14/2015] [Accepted: 09/11/2015] [Indexed: 12/22/2022] Open
Abstract
The human pluripotent stem cell registry (hPSCreg), accessible at http://hpscreg.eu, is a public registry and data portal for human embryonic and induced pluripotent stem cell lines (hESC and hiPSC). Since their first isolation the number of hESC lines has steadily increased to over 3000 and new iPSC lines are generated in a rapidly growing number of laboratories as a result of their potentially broad applicability in biomedicine and drug testing. Many of these lines are deposited in stem cell banks, which are globally established to store tens of thousands of lines from healthy and diseased donors. The Registry provides comprehensive and standardized biological and legal information as well as tools to search and compare information from multiple hPSC sources and hence addresses a translational research need. To facilitate unambiguous identification over different resources, hPSCreg automatically creates a unique standardized name for each cell line registered. In addition to biological information, hPSCreg stores extensive data about ethical standards regarding cell sourcing and conditions for application and privacy protection. hPSCreg is the first global registry that holds both, manually validated scientific and ethical information on hPSC lines, and provides access by means of a user-friendly, mobile-ready web application.
Affiliation(s)
- Stefanie Seltmann
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany
- Fritz Lekschas
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany
- Robert Müller
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany
- Harald Stachelscheid
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany; Berlin Institute of Health-Stem Cell Core Facility, 13353 Berlin, Germany
- Marie-Sophie Bittner
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany
- Weiping Zhang
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany
- Luam Kidane
  - National Institute for Biological Standards and Control, South Mimms EN63QG, UK
- Anna Seriola
  - Center of Regenerative Medicine in Barcelona, Barcelona Stem Cell Bank, Barcelona 08003, Spain
- Anna Veiga
  - Center of Regenerative Medicine in Barcelona, Barcelona Stem Cell Bank, Barcelona 08003, Spain
- Glyn Stacey
  - National Institute for Biological Standards and Control, South Mimms EN63QG, UK
- Andreas Kurtz
  - Berlin-Brandenburg Center for Regenerative Therapies, Charité University Medicine Berlin, Berlin, 13353, Germany; Seoul National University, College of Veterinary Medicine and Research Institute for Veterinary Science, Seoul 151-742, Republic of Korea
21
|
Spjuth O, Krestyaninova M, Hastings J, Shen HY, Heikkinen J, Waldenberger M, Langhammer A, Ladenvall C, Esko T, Persson MÅ, Heggland J, Dietrich J, Ose S, Gieger C, Ried JS, Peters A, Fortier I, de Geus EJC, Klovins J, Zaharenko L, Willemsen G, Hottenga JJ, Litton JE, Karvanen J, Boomsma DI, Groop L, Rung J, Palmgren J, Pedersen NL, McCarthy MI, van Duijn CM, Hveem K, Metspalu A, Ripatti S, Prokopenko I, Harris JR. Harmonising and linking biomedical and clinical data across disparate data archives to enable integrative cross-biobank research. Eur J Hum Genet 2015; 24:521-8. [PMID: 26306643 PMCID: PMC4929882 DOI: 10.1038/ejhg.2015.165] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 09/01/2014] [Revised: 05/18/2015] [Accepted: 06/18/2015] [Indexed: 12/04/2022] Open
Abstract
A wealth of biospecimen samples are stored in modern globally distributed biobanks. Biomedical researchers worldwide need to be able to combine the available resources to improve the power of large-scale studies. A prerequisite for this effort is to be able to search and access, in an integrated manner, phenotypic, clinical and other information about samples that are currently stored at biobanks. However, privacy issues together with heterogeneous information systems and the lack of agreed-upon vocabularies have made specimen searching across multiple biobanks extremely challenging. We describe three case studies where we have linked samples and sample descriptions in order to facilitate global searching of available samples for research. The use cases include the ENGAGE (European Network for Genetic and Genomic Epidemiology) consortium comprising at least 39 cohorts, the SUMMIT (surrogate markers for micro- and macro-vascular hard endpoints for innovative diabetes tools) consortium and a pilot for data integration between a Swedish clinical health registry and a biobank. We used the Sample avAILability (SAIL) method for data linking: first, we created harmonised variables, and then we annotated and made searchable information on the number of specimens available in individual biobanks for various phenotypic categories. By operating on this categorised availability data we sidestep many obstacles related to privacy that arise when handling real values and show that harmonised and annotated records about data availability across disparate biomedical archives provide a key methodological advance in pre-analysis exchange of information between biobanks, that is, during the project planning phase.
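The idea of sharing categorised availability counts rather than real values can be illustrated with a minimal Python sketch. It is not the SAIL implementation: the BMI cut-offs, the biobank names and the helper functions `harmonise_bmi` and `availability` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def harmonise_bmi(bmi: float) -> str:
    """Map a raw BMI value to a coarse, shared category (hypothetical cut-offs)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def availability(records):
    """Count specimens per (biobank, category); the raw values are not retained."""
    counts = Counter()
    for biobank, bmi in records:
        counts[(biobank, harmonise_bmi(bmi))] += 1
    return dict(counts)

# Made-up example records: (biobank name, raw BMI value)
records = [
    ("biobank_A", 22.1),
    ("biobank_A", 31.4),
    ("biobank_B", 27.0),
    ("biobank_A", 23.9),
]
summary = availability(records)
print(summary)
```

Only the per-category counts in `summary` would be exchanged in such a scheme, which is why the abstract describes the approach as sidestepping privacy obstacles tied to real values.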
Affiliation(s)
- Ola Spjuth
  - Department of Medical Epidemiology and Biostatistics, Swedish e-Science Research Centre, Karolinska Institutet, Stockholm, Sweden; Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Maria Krestyaninova
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK; Uniquer Sarl, rue de la Mercerie, Lausanne, Switzerland
- Janna Hastings
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK
- Huei-Yi Shen
  - Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Biomedicum Helsinki 2U, Helsinki, Finland
- Jani Heikkinen
  - Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Biomedicum Helsinki 2U, Helsinki, Finland
- Melanie Waldenberger
  - Institute of Epidemiology II, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
- Arnulf Langhammer
  - Department of Public Health and General Practice, HUNT Research Centre, Norwegian University of Science and Technology, Levanger, Norway
- Claes Ladenvall
  - Department of Clinical Sciences, Diabetes and Endocrinology, Lund University, Lund, Sweden; Lund University Diabetes Centre, CRC at Skåne University Hospital, Malmö, Sweden
- Tõnu Esko
  - Estonian Genome Center, University of Tartu, Tartu, Estonia
- Mats-Åke Persson
  - Department of Clinical Sciences, Diabetes and Endocrinology, Lund University, Lund, Sweden; Lund University Diabetes Centre, CRC at Skåne University Hospital, Malmö, Sweden
- Jon Heggland
  - Department of Public Health and General Practice, HUNT Research Centre, Norwegian University of Science and Technology, Levanger, Norway
- Joern Dietrich
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK
- Sandra Ose
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK
- Christian Gieger
  - Institute of Epidemiology II, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
- Janina S Ried
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg
- Annette Peters
  - Institute of Epidemiology II, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
- Isabel Fortier
  - McGill University Health Centre, Montreal, Quebec, Canada
- Eco J C de Geus
  - Department of Biological Psychology, FGB, VU University, Amsterdam, The Netherlands
- Janis Klovins
  - Latvian Genome Data Base (LGDB), Latvian Biomedical Research and Study Centre, Ratsupites 1 k-1, Riga, Latvia
- Linda Zaharenko
  - Latvian Genome Data Base (LGDB), Latvian Biomedical Research and Study Centre, Ratsupites 1 k-1, Riga, Latvia
- Gonneke Willemsen
  - Department of Biological Psychology, FGB, VU University, Amsterdam, The Netherlands
- Jouke-Jan Hottenga
  - Department of Biological Psychology, FGB, VU University, Amsterdam, The Netherlands
- Jan-Eric Litton
  - Department of Medical Epidemiology and Biostatistics, Swedish e-Science Research Centre, Karolinska Institutet, Stockholm, Sweden; BBMRI-ERIC, Neue Stiftingtalstrasse 2/B/6, Graz, Austria
- Juha Karvanen
  - National Institute for Health and Welfare, Helsinki, Finland; University of Jyvaskyla, Jyväskylä, Finland
- Dorret I Boomsma
  - Department of Biological Psychology, FGB, VU University, Amsterdam, The Netherlands
- Leif Groop
  - Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Biomedicum Helsinki 2U, Helsinki, Finland; Department of Clinical Sciences, Diabetes and Endocrinology, Lund University, Lund, Sweden
- Johan Rung
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK; Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Juni Palmgren
  - Department of Medical Epidemiology and Biostatistics, Swedish e-Science Research Centre, Karolinska Institutet, Stockholm, Sweden; Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Biomedicum Helsinki 2U, Helsinki, Finland
- Nancy L Pedersen
  - Department of Medical Epidemiology and Biostatistics, Swedish e-Science Research Centre, Karolinska Institutet, Stockholm, Sweden
- Mark I McCarthy
  - Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Headington, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; Oxford NIHR Biomedical Research Centre, Churchill Hospital, Headington, Oxford, UK
- Kristian Hveem
  - Department of Public Health and General Practice, HUNT Research Centre, Norwegian University of Science and Technology, Levanger, Norway
- Samuli Ripatti
  - Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Biomedicum Helsinki 2U, Helsinki, Finland; Department of Public Health, Faculty of Medicine, University of Helsinki, Helsinki, Finland; Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Inga Prokopenko
  - Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Headington, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK; Department of Genomics of Common Disease, School of Public Health, Imperial College London, London, UK
- Jennifer R Harris
  - Division of Epidemiology, Department of Genes and Environment, The Norwegian Institute of Public Health, Oslo, Norway
22
|
Lappalainen I, Almeida-King J, Kumanduri V, Senf A, Spalding JD, Ur-Rehman S, Saunders G, Kandasamy J, Caccamo M, Leinonen R, Vaughan B, Laurent T, Rowland F, Marin-Garcia P, Barker J, Jokinen P, Torres AC, de Argila JR, Llobet OM, Medina I, Puy MS, Alberich M, de la Torre S, Navarro A, Paschall J, Flicek P. The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 2015; 47:692-5. [PMID: 26111507 PMCID: PMC5426533 DOI: 10.1038/ng.3312] [Citation(s) in RCA: 240] [Impact Index Per Article: 26.7] [Indexed: 12/15/2022]
Abstract
The European Genome-phenome Archive (EGA) is a permanent archive that promotes distribution and sharing of genetic and phenotype data consented for specific approved uses, but not fully open public distribution. The EGA follows strict protocols for information management, data storage, security and dissemination. Authorized access to the data is managed in partnership with the data providing organizations. The EGA includes major reference data collections for human genetics research.
Affiliation(s)
- Ilkka Lappalainen
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Jeff Almeida-King
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Vasudev Kumanduri
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Alexander Senf
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- John Dylan Spalding
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Saif Ur-Rehman
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Gary Saunders
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Jag Kandasamy
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Mario Caccamo
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Rasko Leinonen
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Brendan Vaughan
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Thomas Laurent
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Francis Rowland
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Pablo Marin-Garcia
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Jonathan Barker
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Petteri Jokinen
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Ignacio Medina
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Arcadi Navarro
  - [1] Centre for Genomic Regulation, Barcelona, Spain. [2] Institute of Evolutionary Biology, Universitat Pompeu Fabra-Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain. [3] Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Justin Paschall
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
- Paul Flicek
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Hinxton, UK
23
|
Ara T, Enomoto M, Arita M, Ikeda C, Kera K, Yamada M, Nishioka T, Ikeda T, Nihei Y, Shibata D, Kanaya S, Sakurai N. Metabolonote: a wiki-based database for managing hierarchical metadata of metabolome analyses. Front Bioeng Biotechnol 2015; 3:38. [PMID: 25905099 PMCID: PMC4388006 DOI: 10.3389/fbioe.2015.00038] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 11/12/2014] [Accepted: 03/13/2015] [Indexed: 01/04/2023] Open
Abstract
Metabolomics – technology for comprehensive detection of small molecules in an organism – lags behind the other “omics” in terms of publication and dissemination of experimental data. Among the reasons for this are the difficulty of precisely recording information about complicated analytical experiments (metadata), the existence of various databases with their own metadata descriptions, and the low reusability of the published data, resulting in submitters (the researchers who generate the data) being insufficiently motivated. To tackle these issues, we developed Metabolonote, a Semantic MediaWiki-based database designed specifically for managing metabolomic metadata. We also defined a metadata and data description format, called “Togo Metabolome Data” (TogoMD), with an ID system that is required for unique access to each level of the tree-structured metadata such as study purpose, sample, analytical method, and data analysis. Separation of the management of metadata from that of data and permission to attach related information to the metadata provide advantages for submitters, readers, and database developers. The metadata are enriched with information such as links to comparable data, thereby functioning as a hub of related data resources. They also enhance not only readers’ understanding and use of data but also submitters’ motivation to publish the data. The metadata are computationally shared among other systems via APIs, which facilitate the construction of novel databases by database developers. A permission system that allows publication of immature metadata and feedback from readers also helps submitters to improve their metadata. Hence, this aspect of Metabolonote, as a metadata preparation tool, is complementary to high-quality and persistent data repositories such as MetaboLights. A total of 808 metadata for analyzed data obtained from 35 biological species are published currently. Metabolonote and related tools are available free of cost at http://metabolonote.kazusa.or.jp/.
Affiliation(s)
- Takeshi Ara
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan; National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan
- Mitsuo Enomoto
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan; National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan
- Masanori Arita
  - National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan; RIKEN Center for Sustainable Resource Science, Yokohama, Japan
- Chiaki Ikeda
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan; National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan
- Kota Kera
  - Department of Research & Development, Kazusa DNA Research Institute, Kisarazu, Japan
- Manabu Yamada
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan; National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan
- Takaaki Nishioka
  - National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan; Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
- Tasuku Ikeda
  - National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan; Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
- Yoshito Nihei
  - National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan; Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
- Daisuke Shibata
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan
- Shigehiko Kanaya
  - National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan; Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
- Nozomu Sakurai
  - Department of Technology Development, Kazusa DNA Research Institute, Kisarazu, Japan; National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), Tokyo, Japan
24
|
Griss J, Perez-Riverol Y, Hermjakob H, Vizcaíno JA. Identifying novel biomarkers through data mining-a realistic scenario? Proteomics Clin Appl 2015; 9:437-43. [PMID: 25347964 PMCID: PMC4833187 DOI: 10.1002/prca.201400107] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 08/07/2014] [Revised: 10/08/2014] [Accepted: 10/21/2014] [Indexed: 12/12/2022]
Abstract
In this article we discuss the requirements to use data mining of published proteomics datasets to assist proteomics-based biomarker discovery, the use of external data integration to solve the issue of inadequate small sample sizes and finally, we try to estimate the probability that new biomarkers will be identified through data mining alone.
Affiliation(s)
- Johannes Griss
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; Division of Immunology, Allergy and Infectious Diseases, Department of Dermatology, Medical University of Vienna, Austria
25
|
Malladi VS, Erickson DT, Podduturi NR, Rowe LD, Chan ET, Davidson JM, Hitz BC, Ho M, Lee BT, Miyasato S, Roe GR, Simison M, Sloan CA, Strattan JS, Tanaka F, Kent WJ, Cherry JM, Hong EL. Ontology application and use at the ENCODE DCC. Database: The Journal of Biological Databases and Curation 2015; 2015:bav010. [PMID: 25776021 PMCID: PMC4360730 DOI: 10.1093/database/bav010] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022]
Abstract
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a catalog of genomic annotations. To date, the project has generated over 4000 experiments across more than 350 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory network and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All ENCODE experimental data, metadata and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage and distribution to community resources and the scientific community. As the volume of data increases, the organization of experimental details becomes increasingly complicated and demands careful curation to identify related experiments. Here, we describe the ENCODE DCC’s use of ontologies to standardize experimental metadata. We discuss how ontologies, when used to annotate metadata, provide improved searching capabilities and facilitate the ability to find connections within a set of experiments. Additionally, we provide examples of how ontologies are used to annotate ENCODE metadata and how the annotations can be identified via ontology-driven searches at the ENCODE portal. As genomic datasets grow larger and more interconnected, standardization of metadata becomes increasingly vital to allow for exploration and comparison of data between different scientific projects. Database URL: https://www.encodeproject.org/
Affiliation(s)
- Venkat S Malladi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Drew T Erickson
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Nikhil R Podduturi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Laurence D Rowe
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Esther T Chan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Jean M Davidson
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Benjamin C Hitz
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Marcus Ho
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Brian T Lee
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Stuart Miyasato
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Gregory R Roe
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Matt Simison
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Cricket A Sloan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- J Seth Strattan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Forrest Tanaka
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- W James Kent
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- J Michael Cherry
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Eurie L Hong
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
26
|
Roy T, Barman S. Performance Analysis of Network Model to Identify Healthy and Cancerous Colon Genes. IEEE J Biomed Health Inform 2015; 20:710-6. [PMID: 25730835 DOI: 10.1109/jbhi.2015.2408366] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
Modeling of cancerous and healthy Homo sapiens colon genes using an electrical network is proposed to study their behavior. In this paper, the individual amino acid models are designed using the hydropathy index of the amino acid side chain. The phase and magnitude responses of genes are examined to distinguish cancerous from healthy genes. The performance of the proposed modeling technique is judged using various performance metrics such as accuracy, sensitivity, and specificity. The network model performance increases with frequency, which is analyzed using the receiver operating characteristic curve. The accuracy of the model is tested on colon genes and reaches a maximum of 97% at a frequency of 10 MHz.
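For readers unfamiliar with the metrics named in this abstract, the following Python sketch shows their standard definitions from a binary confusion matrix. The counts are invented for illustration and are not the paper's data, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all genes classified correctly
        "sensitivity": tp / (tp + fn),   # share of cancerous genes detected
        "specificity": tn / (tn + fp),   # share of healthy genes correctly passed
    }

# Invented counts: 50 cancerous genes (40 detected), 50 healthy genes (45 passed)
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

A receiver operating characteristic curve, as mentioned in the abstract, plots sensitivity against 1 minus specificity as a decision threshold is varied.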
27
|
Abbey DA, Funt J, Lurie-Weinberger MN, Thompson DA, Regev A, Myers CL, Berman J. YMAP: a pipeline for visualization of copy number variation and loss of heterozygosity in eukaryotic pathogens. Genome Med 2014; 6:100. [PMID: 25505934 PMCID: PMC4263066 DOI: 10.1186/s13073-014-0100-8] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Received: 07/24/2014] [Accepted: 10/30/2014] [Indexed: 12/13/2022] Open
Abstract
The design of effective antimicrobial therapies for serious eukaryotic pathogens requires a clear understanding of their highly variable genomes. To facilitate analysis of copy number variations, single nucleotide polymorphisms and loss of heterozygosity events in these pathogens, we developed a pipeline for analyzing diverse genome-scale datasets from microarray, deep sequencing, and restriction site associated DNA sequence experiments for clinical and laboratory strains of Candida albicans, the most prevalent human fungal pathogen. The YMAP pipeline (http://lovelace.cs.umn.edu/Ymap/) automatically illustrates genome-wide information in a single intuitive figure and is readily modified for the analysis of other pathogens with small genomes.
Affiliation(s)
- Darren A Abbey
  - Department of Genetics, Cell Biology and Development, University of Minnesota, 6-160 Jackson Hall, Minneapolis, MN 55415 USA
- Jason Funt
  - Broad Institute of MIT and Harvard University, 415 Main Street, Cambridge, MA 02142 USA
- Mor N Lurie-Weinberger
  - Department of Molecular Microbiology and Biotechnology, Tel Aviv University, 418 Britannia Building, Ramat Aviv, 69978 Israel
- Dawn A Thompson
  - Broad Institute of MIT and Harvard University, 415 Main Street, Cambridge, MA 02142 USA
- Aviv Regev
  - Broad Institute of MIT and Harvard University, 415 Main Street, Cambridge, MA 02142 USA
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN 55455 USA
- Judith Berman
  - Department of Genetics, Cell Biology and Development, University of Minnesota, 6-160 Jackson Hall, Minneapolis, MN 55415 USA; Department of Molecular Microbiology and Biotechnology, Tel Aviv University, 418 Britannia Building, Ramat Aviv, 69978 Israel
28
|
Wimalaratne SM, Grenon P, Hermjakob H, Le Novère N, Laibe C. BioModels linked dataset. BMC Systems Biology 2014; 8:91. [PMID: 25182954 PMCID: PMC4423647 DOI: 10.1186/s12918-014-0091-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 05/06/2014] [Accepted: 07/18/2014] [Indexed: 11/17/2022]
Abstract
Background: BioModels Database is a reference repository of mathematical models used in biology. Models are stored as SBML files on a file system and metadata is provided in a relational database. Models can be retrieved through a web interface and programmatically via web services. In addition to those more traditional ways to access information, Linked Data using Semantic Web technologies (such as the Resource Description Framework, RDF) is becoming an increasingly popular means to describe and expose biologically relevant data. Results: We present the BioModels Linked Dataset, which exposes the models’ content as a dereferenceable interlinked dataset. BioModels Linked Dataset makes use of the wealth of annotations available within a large number of manually curated models to link and integrate data and models from other resources. Conclusions: The BioModels Linked Dataset provides users with a dataset interoperable with other semantic web resources. It supports powerful search queries, some of which were not previously available to users, and allows integration of data from multiple resources. This provides a distributed platform to find similar models for comparison, processing and enrichment.
Collapse
Affiliation(s)
- Sarala M Wimalaratne
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Pierre Grenon
- CHIME, The Farr Institute of Health Informatics Research, London, NW1 2DA, UK.
- Henning Hermjakob
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Nicolas Le Novère
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK; Babraham Institute, Babraham Research Campus, Cambridge, CB22 3AT, UK.
- Camille Laibe
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
|
29
|
Federhen S, Clark K, Barrett T, Parkinson H, Ostell J, Kodama Y, Mashima J, Nakamura Y, Cochrane G, Karsch-Mizrachi I. Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records. Stand Genomic Sci 2014; 9:1275-7. [PMID: 25197497 PMCID: PMC4149001 DOI: 10.4056/sigs.4851102] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 12/20/2022] Open
Abstract
Microbial genome sequence submissions to the International Nucleotide Sequence Database Collaboration (INSDC) have been annotated with organism names that include the strain identifier. Each of these strain-level names has been assigned a unique 'taxid' in the NCBI Taxonomy Database. With the significant growth in genome sequencing, it is not possible to continue with the curation of strain-level taxids. In January 2014, NCBI will cease assigning strain-level taxids. Instead, submitters are encouraged to provide strain information and rich metadata with their submission to the sequence database, BioProject and BioSample.
Affiliation(s)
- Scott Federhen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Karen Clark
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Tanya Barrett
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
- James Ostell
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Yuichi Kodama
- DDBJ Center, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima, Japan
- Jun Mashima
- DDBJ Center, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima, Japan
- Yasukazu Nakamura
- DDBJ Center, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima, Japan
- Guy Cochrane
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
- Ilene Karsch-Mizrachi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|
30
|
Jupp S, Malone J, Bolleman J, Brandizi M, Davies M, Garcia L, Gaulton A, Gehant S, Laibe C, Redaschi N, Wimalaratne SM, Martin M, Le Novère N, Parkinson H, Birney E, Jenkinson AM. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014; 30:1338-9. [PMID: 24413672 PMCID: PMC3998127 DOI: 10.1093/bioinformatics/btt765] [Citation(s) in RCA: 117] [Impact Index Per Article: 11.7] [Indexed: 01/18/2023] Open
Abstract
Motivation: Resource description framework (RDF) is an emerging technology for describing, publishing and linking life science data. As a major provider of bioinformatics data and services, the European Bioinformatics Institute (EBI) is committed to making data readily accessible to the community in ways that meet existing demand. The EBI RDF platform has been developed to meet an increasing demand to coordinate RDF activities across the institute and provides a new entry point to querying and exploring integrated resources available at the EBI. Availability: http://www.ebi.ac.uk/rdf Contact: jupp@ebi.ac.uk
Affiliation(s)
- Simon Jupp
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1211 Geneve, Switzerland
|
31
|
Brooksbank C, Bergman MT, Apweiler R, Birney E, Thornton J. The European Bioinformatics Institute's data resources 2014. Nucleic Acids Res 2014; 42:D18-25. [PMID: 24271396 PMCID: PMC3964968 DOI: 10.1093/nar/gkt1206] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Received: 09/17/2013] [Revised: 11/01/2013] [Accepted: 11/04/2013] [Indexed: 12/18/2022] Open
Abstract
Molecular Biology has been at the heart of the 'big data' revolution from its very beginning, and the need for access to biological data is a common thread running from the 1965 publication of Dayhoff's 'Atlas of Protein Sequence and Structure' through the Human Genome Project in the late 1990s and early 2000s to today's population-scale sequencing initiatives. The European Bioinformatics Institute (EMBL-EBI; http://www.ebi.ac.uk) is one of three organizations worldwide that provide free access to comprehensive, integrated molecular data sets. Here, we summarize the principles underpinning the development of these public resources and provide an overview of EMBL-EBI's database collection to complement the reviews of individual databases provided elsewhere in this issue.
Affiliation(s)
- Catherine Brooksbank
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mary Todd Bergman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rolf Apweiler
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Janet Thornton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
32
|
Faulconbridge A, Burdett T, Brandizi M, Gostev M, Pereira R, Vasant D, Sarkans U, Brazma A, Parkinson H. Updates to BioSamples database at European Bioinformatics Institute. Nucleic Acids Res 2013; 42:D50-2. [PMID: 24265224 PMCID: PMC3965081 DOI: 10.1093/nar/gkt1081] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 01/08/2023] Open
Abstract
The BioSamples database at the EBI (http://www.ebi.ac.uk/biosamples) provides an integration point for BioSamples information between technology specific databases at the EBI, projects such as ENCODE and reference collections such as cell lines. The database delivers a unified query interface and API to query sample information across EBI's databases and provides links back to assay databases. Sample groups are used to manage related samples, e.g. those from an experimental submission, or a single reference collection. Infrastructural improvements include a new user interface with ontological and key word queries, a new query API, a new data submission API, complete RDF data download and a supporting SPARQL endpoint, accessioning at the point of submission to the European Nucleotide Archive and European Genotype Phenotype Archives and improved query response times.
Affiliation(s)
- Adam Faulconbridge
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
33
|
Pakseresht N, Alako B, Amid C, Cerdeño-Tárraga A, Cleland I, Gibson R, Goodgame N, Gur T, Jang M, Kay S, Leinonen R, Li W, Liu X, Lopez R, McWilliam H, Oisel A, Pallreddy S, Plaister S, Radhakrishnan R, Rivière S, Rossello M, Senf A, Silvester N, Smirnov D, Squizzato S, ten Hoopen P, Toribio AL, Vaughan D, Zalunin V, Cochrane G. Assembly information services in the European Nucleotide Archive. Nucleic Acids Res 2013; 42:D38-43. [PMID: 24214989 PMCID: PMC3965037 DOI: 10.1093/nar/gkt1082] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 12/02/2022] Open
Abstract
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the world public domain nucleotide sequence data output. ENA content covers a spectrum of data types including raw reads, assembly data and functional annotation. ENA has faced a dramatic growth in genome assembly submission rates, data volumes and complexity of datasets. This has prompted a broad reworking of assembly submission services, for which we now reach the end of a major programme of work and many enhancements have already been made available over the year to components of the submission service. In this article, we briefly review ENA content and growth over 2013, describe our rapidly developing services for genome assembly information and outline further major developments over the last year.
Affiliation(s)
- Nima Pakseresht
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
34
|
Kosuge T, Mashima J, Kodama Y, Fujisawa T, Kaminuma E, Ogasawara O, Okubo K, Takagi T, Nakamura Y. DDBJ progress report: a new submission system for leading to a correct annotation. Nucleic Acids Res 2013; 42:D44-9. [PMID: 24194602 PMCID: PMC3964987 DOI: 10.1093/nar/gkt1066] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 01/17/2023] Open
Abstract
The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. This database content is shared with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ launched a new nucleotide sequence submission system for receiving traditional nucleotide sequence. We expect that the new submission system will be useful for many submitters to input accurate annotation and reduce the time needed for data input. In addition, DDBJ has started a new service, the Japanese Genotype–phenotype Archive (JGA), with our partner institute, the National Bioscience Database Center (NBDC). JGA permanently archives and shares all types of individual human genetic and phenotypic data. We also introduce improvements in the DDBJ services and databases made during the past year.
Affiliation(s)
- Takehide Kosuge
- DDBJ Center, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan and National Bioscience Database Center, Japan Science and Technology Agency, Tokyo 102-8666, Japan
|
35
|
Developing translational research infrastructure and capabilities associated with cancer clinical trials. Expert Rev Mol Med 2013; 15:e11. [PMID: 24074187 DOI: 10.1017/erm.2013.12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/27/2022]
Abstract
The integration of molecular information in clinical decision making is becoming a reality. These changes are shaping the way clinical research is conducted, and as reality sets in, the challenges in conducting, managing and organising multi-disciplinary research become apparent. Clinical trials provide a platform to conduct translational research (TR) within the context of high quality clinical data accrual. Integrating TR objectives in trials allows the execution of pivotal studies that provide clinical evidence for biomarker-driven treatment strategies, targets early drug development trials to a homogeneous and well-defined patient population, supports the development of companion diagnostics and provides an opportunity for deepening our understanding of cancer biology and mechanisms of drug action. To achieve these goals within a clinical trial, developing translational research infrastructure and capabilities (TRIC) plays a critical catalytic role for translating preclinical data into successful clinical research and development. TRIC represents a technical platform, dedicated resources and access to expertise promoting high quality standards, logistical and operational support and unified streamlined procedures under an appropriate governance framework. TRIC promotes integration of multiple disciplines including biobanking, laboratory analysis, molecular data, informatics, statistical analysis and dissemination of results, which are all required for successful TR projects and scientific progress. Such a supporting infrastructure is absolutely essential in order to promote high quality robust research, avoid duplication and coordinate resources. Lack of such infrastructure, we would argue, is one reason for the limited effect of TR in clinical practice beyond clinical trials.
|
36
|
Gopinath G, Hari K, Jain R, Mammel M, Kothary M, Franco A, Grim C, Jarvis K, Sathyamoorthy V, Hu L, Datta A, Patel I, Jackson S, Gangiredla J, Kotewicz M, LeClerc J, Wekell M, McCardell B, Solomotis M, Tall B. The Pathogen-annotated Tracking Resource Network (PATRN) system: A web-based resource to aid food safety, regulatory science, and investigations of foodborne pathogens and disease. Food Microbiol 2013; 34:303-18. [DOI: 10.1016/j.fm.2013.01.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 09/17/2012] [Revised: 12/23/2012] [Accepted: 01/07/2013] [Indexed: 01/14/2023]
|
37
|
Abstract
Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.
Affiliation(s)
- Johan Rung
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
38
|
Rustici G, Kolesnikov N, Brandizi M, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Ison J, Keays M, Kurbatova N, Malone J, Mani R, Mupo A, Pedro Pereira R, Pilicheva E, Rung J, Sharma A, Tang YA, Ternent T, Tikhonov A, Welter D, Williams E, Brazma A, Parkinson H, Sarkans U. ArrayExpress update--trends in database growth and links to data analysis tools. Nucleic Acids Res 2012. [PMID: 23193272 PMCID: PMC3531147 DOI: 10.1093/nar/gks1174] [Citation(s) in RCA: 299] [Impact Index Per Article: 24.9] [Indexed: 12/16/2022] Open
Abstract
The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.
Affiliation(s)
- Gabriella Rustici
- Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK.
|
39
|
Ogasawara O, Mashima J, Kodama Y, Kaminuma E, Nakamura Y, Okubo K, Takagi T. DDBJ new system and service refactoring. Nucleic Acids Res 2012. [PMID: 23180790 PMCID: PMC3531146 DOI: 10.1093/nar/gks1152] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 01/15/2023] Open
Abstract
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) maintains a primary nucleotide sequence database and provides analytical resources for biological information to researchers. This database content is exchanged with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). Resources provided by the DDBJ include traditional nucleotide sequence data released in the form of 27 316 452 entries or 16 876 791 557 base pairs (as of June 2012), and raw reads of new generation sequencers in the sequence read archive (SRA). A Japanese researcher published his own genome sequence via DDBJ-SRA on 31 July 2012. To cope with the ongoing genomic data deluge, in March 2012, our previous computer system was totally replaced by a commodity cluster-based system that boasts 122.5 TFlops of CPU capacity and 5 PB of storage space. During this upgrade, it was considered crucial to replace and refactor substantial portions of the DDBJ software systems as well. As a result of the replacement process, which took more than 2 years to perform, we have achieved significant improvements in system performance.
Affiliation(s)
- Osamu Ogasawara
- DDBJ Center, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan.
|
40
|
Nakamura Y, Cochrane G, Karsch-Mizrachi I. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res 2012. [PMID: 23180798 PMCID: PMC3531182 DOI: 10.1093/nar/gks1084] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Indexed: 11/15/2022] Open
Abstract
The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org), one of the longest-standing global alliances of biological data archives, captures, preserves and provides comprehensive public domain nucleotide sequence information. Three partners of the INSDC work in cooperation to establish formats for data and metadata and protocols that facilitate reliable data submission to their databases and support continual data exchange around the world. In this article, the current status of the INSDC and its updates for 2012 are presented. Among the items discussed at the 2012 international collaboration meeting, the BioSample database and changes in submission are described.
Affiliation(s)
- Yasukazu Nakamura
- DDBJ Center, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima 411-8510, Japan.
|
41
|
Hanash S, Schliekelman M, Zhang Q, Taguchi A. Integration of proteomics into systems biology of cancer. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2012; 4:327-37. [PMID: 22407608 DOI: 10.1002/wsbm.1169] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/21/2022]
Abstract
Deciphering the complexity and heterogeneity of cancer benefits from integration of proteomic-level data into systems biology efforts. The opportunities available as a result of advances in proteomic technologies, the successes to date, and the challenges involved in integrating diverse datasets are addressed in this review.
Affiliation(s)
- S Hanash
- Molecular Diagnostics Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
|
42
|
Using genome-wide expression profiling to define gene networks relevant to the study of complex traits: from RNA integrity to network topology. International Review of Neurobiology 2012. [PMID: 23195313 DOI: 10.1016/b978-0-12-398323-7.00005-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/24/2022]
Abstract
Postgenomic studies of the function of genes and their role in disease have now become an area of intense study since efforts to define the raw sequence material of the genome have largely been completed. The use of whole-genome approaches such as microarray expression profiling and, more recently, RNA-sequence analysis of transcript abundance has allowed an unprecedented look at the workings of the genome. However, the accurate derivation of such high-throughput data and their analysis in terms of biological function has been critical to truly leveraging the postgenomic revolution. This chapter will describe an approach that focuses on the use of gene networks to both organize and interpret genomic expression data. Such networks, derived from statistical analysis of large genomic datasets and the application of multiple bioinformatics data resources, potentially allow the identification of key control elements for networks associated with human disease, and thus may lead to derivation of novel therapeutic approaches. However, as discussed in this chapter, the leveraging of such networks cannot occur without a thorough understanding of the technical and statistical factors influencing the derivation of genomic expression data. Thus, while the catch phrase may be "it's the network … stupid," an understanding of factors extending from RNA isolation to genomic profiling technique, multivariate statistics, and bioinformatics is critical to defining fully useful gene networks for study of complex biology.
|
43
|
Galperin MY, Fernández-Suárez XM. The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res 2011; 40:D1-8. [PMID: 22144685 PMCID: PMC3245068 DOI: 10.1093/nar/gkr1196] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Indexed: 11/13/2022] Open
Abstract
The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
44
|
Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, Yaschenko E, Ostell J. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res 2011; 40:D57-63. [PMID: 22139929 PMCID: PMC3245069 DOI: 10.1093/nar/gkr1163] [Citation(s) in RCA: 215] [Impact Index Per Article: 16.5] [Indexed: 01/09/2023] Open
Abstract
As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.
Affiliation(s)
- Tanya Barrett
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892, USA
|