1
Ogle C, Reddick D, McKnight C, Biggs T, Pauly R, Ficklin SP, Feltus FA, Shannigrahi S. Named Data Networking for Genomics Data Management and Integrated Workflows. Front Big Data 2021; 4:582468. [PMID: 33748749 PMCID: PMC7968724 DOI: 10.3389/fdata.2021.582468] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/12/2020] [Accepted: 01/04/2021] [Indexed: 11/25/2022]
Abstract
Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), DNA Data Bank of Japan (DDBJ), European Bioinformatics Institute (EBI), and NASA’s GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as federation of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speedup in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss the properties of NDN that can benefit it. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
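The name-based retrieval and in-network caching described in this abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation and does not use any real NDN library; the `Node` class, its fields, and the content name `/genomics/example/sample1/reads` are all hypothetical and chosen only for illustration. It assumes a request (an "interest") is forwarded by longest-prefix match on the content name, and that each node caches the data it relays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy NDN-style node (hypothetical): a cache plus a name-prefix forwarding table."""
    name: str
    published: Dict[str, bytes] = field(default_factory=dict)      # data this node produces
    fib: Dict[str, "Node"] = field(default_factory=dict)           # name prefix -> next hop
    content_store: Dict[str, bytes] = field(default_factory=dict)  # in-network cache

    def express_interest(self, data_name: str, trace: List[str]) -> Optional[bytes]:
        """Return the data for data_name, recording every node visited in trace."""
        trace.append(self.name)
        # 1. Answer from the local cache if a copy already passed through this node.
        if data_name in self.content_store:
            return self.content_store[data_name]
        # 2. Answer directly if this node is the producer of the data.
        if data_name in self.published:
            return self.published[data_name]
        # 3. Otherwise forward by longest-prefix match on the content name.
        components = data_name.strip("/").split("/")
        for n in range(len(components), 0, -1):
            prefix = "/" + "/".join(components[:n])
            next_hop = self.fib.get(prefix)
            if next_hop is not None:
                data = next_hop.express_interest(data_name, trace)
                if data is not None:
                    self.content_store[data_name] = data  # cache on the return path
                return data
        return None  # no route for this name


# Hypothetical content name; the paper's actual naming scheme is not reproduced here.
DATA_NAME = "/genomics/example/sample1/reads"

repo = Node("repo", published={DATA_NAME: b"ACGT"})
router = Node("router", fib={"/genomics": repo})

first_trace: List[str] = []
second_trace: List[str] = []
first = router.express_interest(DATA_NAME, first_trace)
second = router.express_interest(DATA_NAME, second_trace)
```

In this toy run the first request travels from the router to the repository, while the second is answered from the router's cache without reaching the repository. The requester never supplies an address for the repository, only the content name.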
Affiliation(s)
- Cameron Ogle: School of Computing, Clemson University, Clemson, SC, United States
- David Reddick: Department of Computer Science, Tennessee Tech University, Cookeville, TN, United States
- Coleman McKnight: Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States
- Tyler Biggs: Department of Horticulture, Washington State University, Pullman, WA, United States
- Rini Pauly: Biomedical Data Science and Informatics Program, Clemson, SC, United States
- Stephen P Ficklin: Department of Horticulture, Washington State University, Pullman, WA, United States
- F Alex Feltus: Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States; Biomedical Data Science and Informatics Program, Clemson, SC, United States; Center for Human Genetics, Clemson University, Greenwood, SC, United States
- Susmit Shannigrahi: Department of Computer Science, Tennessee Tech University, Cookeville, TN, United States
2
Testa U, Pelosi E, Castelli G. Genetic Alterations in Renal Cancers: Identification of The Mechanisms Underlying Cancer Initiation and Progression and of Therapeutic Targets. Medicines (Basel, Switzerland) 2020; 7:E44. [PMID: 32751108 PMCID: PMC7459851 DOI: 10.3390/medicines7080044] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/22/2020] [Revised: 07/19/2020] [Accepted: 07/24/2020] [Indexed: 12/26/2022]
Abstract
The three most frequent sporadic types of renal cell cancer (RCC) are clear-cell RCC (CCRCC, 70-75%), papillary RCC (PRCC, 10-15%), and chromophobe RCC (CHRCC, 5%). Hereditary cases account for about 5% of all cases of RCC and are caused by germline pathogenic variants. Herein, we review how a better understanding of the molecular biology of RCCs has driven the inception of new diagnostic and therapeutic approaches. Genomic research has identified relevant genetic alterations associated with each RCC subtype. Molecular studies have clearly shown that CCRCC is universally initiated by Von Hippel Lindau (VHL) gene dysregulation, followed by different types of additional genetic events involving epigenetic regulatory genes, which dictate disease progression, aggressiveness, and differential response to treatments. The understanding of the molecular mechanisms that underlie the development and progression of RCC has considerably expanded treatment options; genomic data might guide treatment by enabling patients to be matched with therapeutics that specifically target the genetic alterations present in their tumors. These new targeted treatments have led to a moderate improvement in the survival of metastatic RCC patients. Ongoing studies based on the combination of immunotherapeutic agents (immune checkpoint inhibitors) with VEGF inhibitors are expected to further improve the survival of these patients.
Affiliation(s)
- Ugo Testa: Department of Oncology, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161 Rome, Italy; (E.P.); (G.C.)
3
Poehlman WL, Schnabel EL, Chavan SA, Frugoli JA, Feltus FA. Identifying Temporally Regulated Root Nodulation Biomarkers Using Time Series Gene Co-Expression Network Analysis. Frontiers in Plant Science 2019; 10:1409. [PMID: 31737022 PMCID: PMC6836625 DOI: 10.3389/fpls.2019.01409] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/20/2019] [Accepted: 10/11/2019] [Indexed: 06/10/2023]
Abstract
Root nodulation results from a symbiotic relationship between a plant host and Rhizobium bacteria. Synchronized gene expression patterns over the course of rhizobial infection result in activation of pathways that are unique but overlapping with the highly conserved pathways that enable mycorrhizal symbiosis. We performed RNA sequencing of 30 Medicago truncatula root maturation zone samples at five distinct time points. These samples included plants inoculated with Sinorhizobium medicae and control plants that did not receive any Rhizobium. Following gene expression quantification, we identified 1,758 differentially expressed genes at various time points. We constructed a gene co-expression network (GCN) from the same data and identified link community modules (LCMs) that were composed entirely of differentially expressed genes at specific time points post-inoculation. One LCM included genes that were up-regulated at 24 h following inoculation, suggesting an activation of allergen family genes and carbohydrate-binding gene products in response to Rhizobium. We also identified two LCMs that were composed entirely of genes that were down-regulated at 24 and 48 h post-inoculation. The identity of the genes in these modules suggests that down-regulating specific genes at 24 h may result in decreased jasmonic acid production with an increase in cytokinin production. At 48 h, coordinated down-regulation of a specific set of genes involved in lipid biosynthesis may play a role in nodulation. We show that GCN-LCM analysis is an effective method to preliminarily identify polygenic candidate biomarkers of root nodulation and develop hypotheses for future discovery.