1. LeRoy NJ, Khoroshevskyi O, O’Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. bioRxiv 2024:2023.08.15.551388. [PMID: 37645717 PMCID: PMC10462087 DOI: 10.1101/2023.08.15.551388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Background: As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet sharing metadata is also important, and in some ways has a wider scope than sharing data itself. Results: Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. Availability: https://pephub.databio.org.
Affiliation(s)
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Aaron O’Brien
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Rafał Stępień
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
- Alip Arslan
  - Department of Computer Science, School of Engineering, University of Virginia, 22908, Charlottesville VA
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville VA
  - School of Data Science, University of Virginia, 22904, Charlottesville VA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville VA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville VA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, 22908, Charlottesville VA
  - Child Health Research Center, School of Medicine, University of Virginia, 22908, Charlottesville VA
2. Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 DOI: 10.3390/bioengineering11030263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024]
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
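The three retrieval tasks named in this abstract can be pictured with a minimal sketch. It is not the authors' implementation: the vectors are made-up 3-dimensional toy values, the names `regionset_A`, `label_x`, and so on are hypothetical, and cosine similarity is assumed as the distance measure because the abstract only says "embedding distance computations".

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Return candidate names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical co-embeddings: region sets and metadata labels share one space.
region_set_embeddings = {
    "regionset_A": [0.9, 0.1, 0.0],
    "regionset_B": [0.1, 0.9, 0.1],
    "regionset_C": [0.8, 0.3, 0.1],
}
label_embeddings = {
    "label_x": [1.0, 0.0, 0.0],
    "label_y": [0.0, 1.0, 0.0],
}

# Task 1: retrieve region sets related to a query label.
hits_for_label = rank_by_similarity(label_embeddings["label_x"],
                                    region_set_embeddings)

# Task 2: suggest labels for a region set already in the database.
suggested_labels = rank_by_similarity(region_set_embeddings["regionset_B"],
                                      label_embeddings)

# Task 3: retrieve region sets similar to a query region set.
others = {name: vec for name, vec in region_set_embeddings.items()
          if name != "regionset_A"}
similar_sets = rank_by_similarity(region_set_embeddings["regionset_A"], others)

print(hits_for_label)    # ['regionset_A', 'regionset_C', 'regionset_B']
print(suggested_labels)  # ['label_y', 'label_x']
print(similar_sets)      # ['regionset_C', 'regionset_B']
```

Because labels and region sets live in the same space, one ranking function serves all three tasks; only the choice of query and candidate pool changes.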
Affiliation(s)
- Erfaneh Gharavi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan J LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Guangtao Zheng
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Donald E Brown
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
3. LeRoy NJ, Khoroshevskyi O, O'Brien A, Stępień R, Arslan A, Sheffield NC. PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata. Gigascience 2024; 13:giae033. [PMID: 38991851 PMCID: PMC11238423 DOI: 10.1093/gigascience/giae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 02/07/2024] [Accepted: 05/21/2024] [Indexed: 07/13/2024]
Abstract
BACKGROUND As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important and in some ways has a wider scope than sharing data themselves. RESULTS Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural-language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural-language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. AVAILABILITY https://pephub.databio.org.
Affiliation(s)
- Nathan J LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Aaron O'Brien
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Rafał Stępień
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Alip Arslan
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
4. Dumschott K, Dörpholz H, Laporte MA, Brilhaus D, Schrader A, Usadel B, Neumann S, Arnaud E, Kranz A. Ontologies for increasing the FAIRness of plant research data. Front Plant Sci 2023; 14:1279694. [PMID: 38098789 PMCID: PMC10720748 DOI: 10.3389/fpls.2023.1279694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/15/2023] [Indexed: 12/17/2023]
Abstract
The importance of improving the FAIRness (findability, accessibility, interoperability, reusability) of research data is undeniable, especially in the face of the large, complex datasets currently being produced by omics technologies. Facilitating the integration of a dataset with other types of data increases the likelihood of reuse and the potential for answering novel research questions. Ontologies are a useful tool for semantically tagging datasets, as adding relevant metadata increases the understanding of how data were produced and increases their interoperability. Ontologies provide concepts for a particular domain as well as the relationships between concepts. By tagging data with ontology terms, data become both human- and machine-interpretable, allowing for increased reuse and interoperability. However, the task of identifying ontologies relevant to a particular research domain or technology is challenging, especially within the diverse realm of fundamental plant research. In this review, we outline the ontologies most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments within metadata frameworks, such as Investigation-Study-Assay (ISA). We also outline repositories and platforms most useful for identifying applicable ontologies or finding ontology terms.
Affiliation(s)
- Kathryn Dumschott
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
- Hannah Dörpholz
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
- Marie-Angélique Laporte
  - Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
- Dominik Brilhaus
  - Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Andrea Schrader
  - Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), University of Cologne, Cologne, Germany
- Björn Usadel
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
  - Institute for Biological Data Science & Cluster of Excellence on Plant Sciences (CEPLAS), Faculty of Mathematics and Life Sciences, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Steffen Neumann
  - Program Center MetaCom, Leibniz Institute of Plant Biochemistry, Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
- Elizabeth Arnaud
  - Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
- Angela Kranz
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
5. Sheffield NC, LeRoy NJ, Khoroshevskyi O. Challenges to sharing sample metadata in computational genomics. Front Genet 2023; 14:1154198. [PMID: 37287537 PMCID: PMC10243526 DOI: 10.3389/fgene.2023.1154198] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/02/2023] [Accepted: 05/09/2023] [Indexed: 06/09/2023]
Affiliation(s)
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - School of Data Science, University of Virginia, Charlottesville, VA, United States
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
- Oleksandr Khoroshevskyi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, United States
6. Cascianelli S, Barbera C, Ulla AA, Grassi E, Lupo B, Pasini D, Bertotti A, Trusolino L, Medico E, Isella C, Masseroli M. Multi-label transcriptional classification of colorectal cancer reflects tumor cell population heterogeneity. Genome Med 2023; 15:37. [PMID: 37189167 PMCID: PMC10184353 DOI: 10.1186/s13073-023-01176-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Accepted: 03/31/2023] [Indexed: 05/17/2023]
Abstract
BACKGROUND Transcriptional classification has been used to stratify colorectal cancer (CRC) into molecular subtypes with distinct biological and clinical features. However, it is not clear whether such subtypes represent discrete, mutually exclusive entities or molecular/phenotypic states with potential overlap. Therefore, we focused on the CRC Intrinsic Subtype (CRIS) classifier and evaluated whether assigning multiple CRIS subtypes to the same sample provides additional clinically and biologically relevant information. METHODS A multi-label version of the CRIS classifier (multiCRIS) was applied to newly generated RNA-seq profiles from 606 CRC patient-derived xenografts (PDXs), together with human CRC bulk and single-cell RNA-seq datasets. Biological and clinical associations of single- and multi-label CRIS were compared. Finally, a machine learning-based multi-label CRIS predictor (ML2CRIS) was developed for single-sample classification. RESULTS Surprisingly, about half of the CRC cases could be significantly assigned to more than one CRIS subtype. Single-cell RNA-seq analysis revealed that multiple CRIS membership can be a consequence of the concomitant presence of cells of different CRIS classes or, less frequently, of cells with a hybrid phenotype. Multi-label assignments were found to improve prediction of CRC prognosis and response to treatment. Finally, the ML2CRIS classifier was validated as retaining the same biological and clinical associations in the context of single-sample classification. CONCLUSIONS These results show that CRIS subtypes retain their biological and clinical features even when concomitantly assigned to the same CRC sample. This approach could potentially be extended to other cancer types and classification systems.
Affiliation(s)
- Silvia Cascianelli
  - Department of Electronics, Information and Bioengineering, Politecnico Di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy
- Chiara Barbera
  - Department of Electronics, Information and Bioengineering, Politecnico Di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy
- Alexandra Ambra Ulla
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
- Elena Grassi
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Barbara Lupo
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Diego Pasini
  - Department of Experimental Oncology, IEO, European Institute of Oncology IRCCS, Via Adamello 16, 20139, Milan, Italy
  - Department of Health Sciences, University of Milan, Via A. Di Rudini 8, 20142, Milan, Italy
- Andrea Bertotti
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Livio Trusolino
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Enzo Medico
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Claudio Isella
  - Department of Oncology, University of Turin, S.P. 142, Km 3.95, 10060, Candiolo (TO), Turin, Italy
  - Candiolo Cancer Institute, FPO-IRCCS, S.P. 142, Km 3.95, 10060, Candiolo (TO), Italy
- Marco Masseroli
  - Department of Electronics, Information and Bioengineering, Politecnico Di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy
7. Bernasconi A, Canakoglu A, Comolli F. Processing genome-wide association studies within a repository of heterogeneous genomic datasets. BMC Genom Data 2023; 24:13. [PMID: 36869294 PMCID: PMC9985298 DOI: 10.1186/s12863-023-01111-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 02/02/2023] [Indexed: 03/05/2023]
Abstract
BACKGROUND Genome Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants - typically single-nucleotide polymorphisms (SNPs) - in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions. RESULTS To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multi-sample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, and epigenetic signals. CONCLUSIONS As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and associated system. Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows.
Affiliation(s)
- Anna Bernasconi
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Arif Canakoglu
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Federico Comolli
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
8. Alfonsi T, Bernasconi A, Canakoglu A, Masseroli M. Genomic data integration and user-defined sample-set extraction for population variant analysis. BMC Bioinformatics 2022; 23:401. [PMID: 36175857 PMCID: PMC9520931 DOI: 10.1186/s12859-022-04927-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Accepted: 09/13/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. In particular, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics. RESULTS Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. The example use cases of biological interest provided show the relevance, power and ease of use of the API functionalities. CONCLUSIONS The proposed data integration pipeline and data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current trend toward large (cross)nation-wide sequencing and variation initiatives, we expect an ever-growing need for the kind of computational support proposed here.
Affiliation(s)
- Tommaso Alfonsi
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
- Anna Bernasconi
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
- Arif Canakoglu
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
  - Dipartimento di Anestesia, Rianimazione ed Emergenza-Urgenza, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Policlinico di Milano, Via Francesco Sforza, 35, 20122, Milan, Italy
- Marco Masseroli
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy
9. Waldrop AM, Cheadle JB, Bradford K, Preiss A, Chew R, Holt JR, Kebede Y, Braswell N, Watson M, Hench V, Crerar A, Ball CM, Schreep C, Linebaugh PJ, Hiles H, Boyles R, Bizon C, Krishnamurthy A, Cox S. Dug: a semantic search engine leveraging peer-reviewed knowledge to query biomedical data repositories. Bioinformatics 2022; 38:3252-3258. [PMID: 35441678 PMCID: PMC9991886 DOI: 10.1093/bioinformatics/btac284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2021] [Revised: 03/04/2022] [Accepted: 04/15/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION As the number of public data resources continues to grow, identifying relevant datasets across heterogeneous repositories is becoming critical to answering scientific questions. To help researchers navigate this data landscape, we developed Dug: a semantic search tool for biomedical datasets utilizing evidence-based relationships from curated knowledge graphs to find relevant datasets and explain why those results are returned. RESULTS Developed through the National Heart, Lung and Blood Institute's (NHLBI) BioData Catalyst ecosystem, Dug has indexed more than 15 911 study variables from public datasets. On a manually curated search dataset, Dug's total recall (total relevant results/total results) of 0.79 outperformed default Elasticsearch's total recall of 0.76. When using synonyms or related concepts as search queries, Dug (0.36) far outperformed Elasticsearch (0.14) in terms of total recall with no significant loss in the precision of its top results. AVAILABILITY AND IMPLEMENTATION Dug is freely available at https://github.com/helxplatform/dug. An example Dug deployment is also available for use at https://search.biodatacatalyst.renci.org/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexander M Waldrop
  - Center for Genomics, Bioinformatics, and Translational Research, RTI International, Research Triangle Park, NC 27709-2194, USA
- John B Cheadle
  - Research Computing Division, RTI International, Research Triangle Park, NC 27709-2194, USA
- Kira Bradford
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Alexander Preiss
  - Center for Data Science, RTI International, Research Triangle Park, NC 27709-2194, USA
- Robert Chew
  - Center for Data Science, RTI International, Research Triangle Park, NC 27709-2194, USA
- Jonathan R Holt
  - Center for Data Science, RTI International, Research Triangle Park, NC 27709-2194, USA
- Yaphet Kebede
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Nathan Braswell
  - Research Computing Division, RTI International, Research Triangle Park, NC 27709-2194, USA
- Matt Watson
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Virginia Hench
  - Center for Genomics, Bioinformatics, and Translational Research, RTI International, Research Triangle Park, NC 27709-2194, USA
- Andrew Crerar
  - Center for Genomics, Bioinformatics, and Translational Research, RTI International, Research Triangle Park, NC 27709-2194, USA
- Chris M Ball
  - Research Computing Division, RTI International, Research Triangle Park, NC 27709-2194, USA
- Carl Schreep
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- P J Linebaugh
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Hannah Hiles
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Rebecca Boyles
  - Research Computing Division, RTI International, Research Triangle Park, NC 27709-2194, USA
- Chris Bizon
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
- Ashok Krishnamurthy
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7548, USA
- Steve Cox
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7568, USA
10. Serna Garcia G, Leone M, Bernasconi A, Carman MJ. GeMI: interactive interface for transformer-based Genomic Metadata Integration. Database (Oxford) 2022; 2022:6600540. [PMID: 35657113 PMCID: PMC9216561 DOI: 10.1093/database/baac036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2022] [Revised: 03/26/2022] [Accepted: 04/26/2022] [Indexed: 11/15/2022]
Abstract
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases.
Database URL
http://gmql.eu/gemi/
Affiliation(s)
- Giuseppe Serna Garcia
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Michele Leone
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Anna Bernasconi
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Mark J Carman
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
|
11
|
Cilibrasi L, Pinoli P, Bernasconi A, Canakoglu A, Chiara M, Ceri S. ViruClust: direct comparison of SARS-CoV-2 genomes and genetic variants in space and time. Bioinformatics 2022; 38:1988-1994. [PMID: 35040923 DOI: 10.1093/bioinformatics/btac030] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/24/2021] [Revised: 12/24/2021] [Accepted: 01/13/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g. those increasing in frequency locally and/or globally. However, these analyses may be complex, requiring the integration of different data and resources. We argue that there is a need for streamlined access to up-to-date and high-quality genome sequencing data from different geographic regions/countries, and that a robust and consistent framework for the evaluation/comparison of the results is currently lacking. RESULTS: To overcome these limitations, we developed ViruClust, a novel tool for the comparison of SARS-CoV-2 genomic sequences and lineages in space and time. ViruClust is made available through a powerful and intuitive web-based user interface. Sophisticated large-scale analyses can be executed with a few clicks, even by users without any computational background. To demonstrate potential applications of our method, we applied ViruClust to conduct a thorough study of the evolution of the most prevalent lineage of the Delta SARS-CoV-2 variant, and derived relevant observations. By allowing the seamless integration of different types of functional annotations and the direct comparison of viral genomes and genetic variants in space and time, ViruClust represents a highly valuable resource for monitoring the evolution of SARS-CoV-2, facilitating the identification of variants and/or mutations of potential concern. AVAILABILITY AND IMPLEMENTATION: ViruClust is openly available at http://gmql.eu/viruclust/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Luca Cilibrasi
  - Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, 20133 Milano, Italy
- Pietro Pinoli
  - Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, 20133 Milano, Italy
- Anna Bernasconi
  - Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, 20133 Milano, Italy
- Arif Canakoglu
  - Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, 20133 Milano, Italy
- Matteo Chiara
  - Department of BioSciences, University of Milano, 20133 Milano, Italy
- Stefano Ceri
  - Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, 20133 Milano, Italy
|
12
|
Alfonsi T, Pinoli P, Canakoglu A. High Performance Integration Pipeline for Viral and Epitope Sequences. BIOTECH 2022; 11:biotech11010007. [PMID: 35822815 PMCID: PMC9245902 DOI: 10.3390/biotech11010007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/31/2022] [Revised: 03/08/2022] [Accepted: 03/15/2022] [Indexed: 11/28/2022]
Abstract
With the spread of COVID-19, sequencing laboratories started to share hundreds of sequences daily. However, the lack of a commonly agreed standard across deposition databases hindered the exploration and study of all the viral sequences collected worldwide in a practical and homogeneous way. During the first months of the pandemic, we developed an automatic procedure to collect, transform, and integrate viral sequences of SARS-CoV-2, MERS, SARS-CoV, Ebola, and Dengue from four major database institutions (NCBI, COG-UK, GISAID, and NMDC). This data pipeline allowed the creation of the data exploration interfaces VirusViz and EpiSurf, as well as ViruSurf, one of the largest databases of integrated viral sequences. Almost two years after the first release of the repository, the original pipeline underwent a thorough refinement process and became more efficient, scalable, and general (currently, it also includes epitopes from the IEDB). Thanks to these improvements, we constantly update and expand our integrated repository, encompassing about 9.1 million SARS-CoV-2 sequences at present (March 2022). This pipeline made it possible to design and develop fundamental resources for any researcher interested in understanding the biological mechanisms behind the viral infection. In addition, it plays a crucial role in many analytic and visualization tools, such as ViruSurf, EpiSurf, VirusViz, and VirusLab.
Affiliation(s)
- Tommaso Alfonsi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Arif Canakoglu
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
  - Policlinico di Milano Ospedale Maggiore, Fondazione IRCCS Ca’ Granda, Via Francesco Sforza, 35, 20122 Milano, Italy
|
13
|
Ulrich H, Kock-Schoppenhauer AK, Deppenwiese N, Gött R, Kern J, Lablans M, Majeed RW, Stöhr MR, Stausberg J, Varghese J, Dugas M, Ingenerf J. Understanding the Nature of Metadata: Systematic Review. J Med Internet Res 2022; 24:e25440. [PMID: 35014967 PMCID: PMC8790684 DOI: 10.2196/25440] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/02/2020] [Revised: 01/28/2021] [Accepted: 10/14/2021] [Indexed: 01/11/2023]
Abstract
Background: Metadata are created to describe the corresponding data in a detailed and unambiguous way and are used for various applications in different research areas, for example, data identification and classification. However, a clear definition of metadata is crucial for further use. Unfortunately, extensive experience with the processing and management of metadata has shown that the term “metadata” and its use are not always unambiguous. Objective: This study aimed to understand the definition of metadata and the challenges resulting from metadata reuse. Methods: A systematic literature search was performed in this study following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting on systematic reviews. Five research questions were identified to streamline the review process, addressing metadata characteristics, metadata standards, use cases, and problems encountered. This review was preceded by a harmonization process to achieve a general understanding of the terms used. Results: The harmonization process resulted in a clear set of definitions for metadata processing focusing on data integration. The subsequent literature review was conducted by 10 reviewers with different backgrounds, using the harmonized definitions. This study included 81 peer-reviewed papers from the last decade after applying various filtering steps to identify the most relevant papers. The 5 research questions could be answered, resulting in a broad overview of the standards, use cases, problems, and corresponding solutions for the application of metadata in different research areas. Conclusions: Metadata can be a powerful tool for identifying, describing, and processing information, but its meaningful creation is costly and challenging. This review process uncovered many standards, use cases, problems, and solutions for dealing with metadata. The presented harmonized definitions and the new schema have the potential to improve the classification and generation of metadata by creating a shared understanding of metadata and its context.
Affiliation(s)
- Hannes Ulrich
  - IT Center for Clinical Research, University of Lübeck, Lübeck, Germany
  - Institute of Medical Informatics, University of Lübeck, Lübeck, Germany
- Noemi Deppenwiese
  - Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Robert Gött
  - Department Epidemiology of Health Care and Community Health, Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
- Jori Kern
  - Federated Information Systems, German Cancer Research Center, Heidelberg, Germany
  - Complex Data Processing in Medical Informatics, University Medical Center Mannheim, Mannheim, Germany
- Martin Lablans
  - Federated Information Systems, German Cancer Research Center, Heidelberg, Germany
  - Complex Data Processing in Medical Informatics, University Medical Center Mannheim, Mannheim, Germany
- Raphael W Majeed
  - Universities of Giessen and Marburg Lung Center, German Center for Lung Research, Justus-Liebig-University, Giessen, Germany
  - Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
- Mark R Stöhr
  - Universities of Giessen and Marburg Lung Center, German Center for Lung Research, Justus-Liebig-University, Giessen, Germany
- Jürgen Stausberg
  - Institute of Medical Informatics, Biometry and Epidemiology, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany
- Julian Varghese
  - Institute of Medical Informatics, University of Münster, Münster, Germany
- Martin Dugas
  - Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
- Josef Ingenerf
  - IT Center for Clinical Research, University of Lübeck, Lübeck, Germany
  - Institute of Medical Informatics, University of Lübeck, Lübeck, Germany
|
14
|
Bernasconi A, Cascianelli S. Scenarios for the Integration of Microarray Gene Expression Profiles in COVID-19-Related Studies. Methods Mol Biol 2022; 2401:195-215. [PMID: 34902130 DOI: 10.1007/978-1-0716-1839-4_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The COVID-19 pandemic has heavily affected many aspects of our lives. At this time, genomic research is concerned with exploiting available datasets and knowledge to fuel discovery on this novel disease. Studies that can precisely characterize the gene expression profiles of human hosts infected by SARS-CoV-2 are of significant relevance. However, few such experiments have been produced to date or made publicly available online. Thus, it is of paramount importance that data analysts explore all possibilities to integrate information coming from similar viruses and related diseases; interestingly, microarray gene profile experiments become extremely valuable for this purpose. This chapter reviews the aspects that should be considered when integrating transcriptomics data, considering mainly samples infected by different viruses and combining various data types as well as the extracted knowledge. It describes a series of scenarios from studies performed in the literature and suggests other possible directions of noteworthy integration.
Affiliation(s)
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Silvia Cascianelli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
|
15
|
Bernasconi A, Canakoglu A, Masseroli M, Ceri S. META-BASE: A Novel Architecture for Large-Scale Genomic Metadata Integration. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:543-557. [PMID: 32750853 DOI: 10.1109/tcbb.2020.2998954] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/11/2023]
Abstract
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, learning their specific metadata definitions as a result of long and tedious efforts, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and present the resulting repository (already integrating several important sources), which is exposed by means of practical user interfaces to respond to biological researchers' needs.
|
16
|
Bernasconi A, Cilibrasi L, Al Khalaf R, Alfonsi T, Ceri S, Pinoli P, Canakoglu A. EpiSurf: metadata-driven search server for analyzing amino acid changes within epitopes of SARS-CoV-2 and other viral species. Database (Oxford) 2021; 2021:baab059. [PMID: 34585726 PMCID: PMC8500151 DOI: 10.1093/database/baab059] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/20/2021] [Revised: 07/27/2021] [Accepted: 09/16/2021] [Indexed: 11/21/2022]
Abstract
EpiSurf is a Web application for selecting viral populations of interest and then analyzing how their amino acid changes are distributed along epitopes. Viral sequences are searched within ViruSurf, which stores curated metadata and amino acid changes imported from the most widely used deposition sources for viral databases (GenBank, COVID-19 Genomics UK (COG-UK) and Global initiative on sharing all influenza data (GISAID)). Epitopes are searched within the open source Immune Epitope Database or directly proposed by users by indicating their start and stop positions in the context of a given viral protein. Amino acid changes of selected populations are joined with epitopes of interest; a result table summarizes, for each epitope, statistics about the overlapping amino acid changes and about the sequences carrying such alterations. The results may also be inspected by the VirusViz Web application; epitope regions are highlighted within the given viral protein, and changes can be comparatively inspected. For sequences mutated within the epitope, we also offer a complete view of the distribution of amino acid changes, optionally grouped by the location, collection date or lineage. Thanks to these functionalities, EpiSurf supports the user-friendly testing of epitope conservancy within selected populations of interest, which can be of utmost relevance for designing vaccines, drugs or serological assays. EpiSurf is available at two endpoints. Database URL: http://gmql.eu/episurf/ (for searching GenBank and COG-UK sequences) and http://gmql.eu/episurf_gisaid/ (for GISAID sequences).
Affiliation(s)
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Luca Cilibrasi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Ruba Al Khalaf
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Tommaso Alfonsi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Stefano Ceri
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
- Arif Canakoglu
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy
|
17
|
Bernasconi A, Gulino A, Alfonsi T, Canakoglu A, Pinoli P, Sandionigi A, Ceri S. VirusViz: comparative analysis and effective visualization of viral nucleotide and amino acid variants. Nucleic Acids Res 2021; 49:e90. [PMID: 34107016 PMCID: PMC8344854 DOI: 10.1093/nar/gkab478] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/26/2021] [Revised: 05/11/2021] [Accepted: 05/24/2021] [Indexed: 12/27/2022]
Abstract
Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.
Affiliation(s)
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Andrea Gulino
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Tommaso Alfonsi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Arif Canakoglu
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Anna Sandionigi
  - Quantia Consulting S.r.l., Via Petrarca 20, 22066, Mariano Comense, Como, Italy
- Stefano Ceri
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
|
18
|
Bernasconi A, Canakoglu A, Masseroli M, Pinoli P, Ceri S. A review on viral data sources and search systems for perspective mitigation of COVID-19. Brief Bioinform 2021; 22:664-675. [PMID: 33348368 PMCID: PMC7799334 DOI: 10.1093/bib/bbaa359] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/01/2020] [Revised: 10/09/2020] [Accepted: 11/09/2020] [Indexed: 12/26/2022]
Abstract
With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of the COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.
|
19
|
Canakoglu A, Pinoli P, Bernasconi A, Alfonsi T, Melidis DP, Ceri S. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Res 2021; 49:D817-D824. [PMID: 33045721 PMCID: PMC7778888 DOI: 10.1093/nar/gkaa846] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 08/05/2020] [Revised: 09/06/2020] [Accepted: 09/21/2020] [Indexed: 11/16/2022]
Abstract
ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.eu/virusurf_gisaid/, offers a subset of these functionalities. Given the current pandemic outbreak, SARS-CoV-2 data are collected from the four sources; but ViruSurf contains other virus species harmful to humans, including SARS-CoV, MERS-CoV, Ebola and Dengue. The database is centered on sequences, described from their biological, technological and organizational dimensions. In addition, the analytical dimension characterizes the sequence in terms of its annotations and variants. The web interface enables expressing complex search queries in a simple way; arbitrary search queries can freely combine conditions on attributes from the four dimensions, extracting the resulting sequences. Several example queries on the database confirm and possibly improve results from recent research papers; results can be recomputed over time and upon selected populations. Effective search over large and curated sequence data may enable faster responses to future threats that could arise from new viruses.
Affiliation(s)
- Arif Canakoglu
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Tommaso Alfonsi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
- Damianos P Melidis
  - L3S Research Center, Leibniz University Hannover, Appelstr. 9a, 30167 Hannover, Germany
- Stefano Ceri
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
|
20
|
Canakoglu A, Pinoli P, Gulino A, Nanni L, Masseroli M, Ceri S. Federated sharing and processing of genomic datasets for tertiary data analysis. Brief Bioinform 2020; 22:5868062. [PMID: 34020536 DOI: 10.1093/bib/bbaa091] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/06/2020] [Revised: 04/05/2020] [Accepted: 04/27/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: With the spread of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations need to share NGS data resources and to easily access and process comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.
Affiliation(s)
- Andrea Gulino
  - Computer Science and Engineering at Politecnico di Milano
- Luca Nanni
  - Computer Science and Engineering at Politecnico di Milano
|
21
|
Bernasconi A, Canakoglu A, Masseroli M, Ceri S. The road towards data integration in human genomics: players, steps and interactions. Brief Bioinform 2020; 22:30-44. [PMID: 32496509 DOI: 10.1093/bib/bbaa080] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/21/2019] [Revised: 03/09/2020] [Accepted: 04/18/2020] [Indexed: 12/15/2022]
Abstract
Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats. In this paper, we first describe a technological pipeline from data production to data integration; we then propose a taxonomy of genomic data players (based on the distinction between contributors, repository hosts, consortia, integrators and consumers) and apply the taxonomy to describe about 30 important players in genomic data management. We specifically focus on the integrator players and analyse the issues in solving the genomic data integration challenges, as well as evaluate the computational environments that they provide to follow up data integration by means of visualization and analysis tools.
|