1
Baharav TZ, Tse D, Salzman J. OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery. Proc Natl Acad Sci U S A 2024; 121:e2304671121. [PMID: 38564640 PMCID: PMC11009617 DOI: 10.1073/pnas.2304671121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 02/08/2024] [Indexed: 04/04/2024]
Abstract
Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's X² consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's X² in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
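For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of the classical Pearson statistic for a contingency table. It is not the OASIS statistic, whose definition is not given in this record; the function name `pearson_chi2` and the toy tables are illustrative assumptions.

```python
def pearson_chi2(table):
    """Classical Pearson statistic for a table given as a list of rows of counts.

    Illustrative helper only (hypothetical name); this is the baseline test,
    not OASIS.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Toy tables (made-up counts): the first has proportional rows, the second does not.
independent_table = [[10, 20], [30, 60]]
skewed_table = [[10, 20], [20, 10]]
print(pearson_chi2(independent_table), pearson_chi2(skewed_table))
```

The statistic is zero when every row has the same column proportions and grows as the rows diverge; turning it into a P-value classically relies on an asymptotic chi-squared approximation, which is the finite-sample limitation the abstract refers to.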
Affiliation(s)
- Tavor Z. Baharav
- Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115
- David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- Julia Salzman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305
- Department of Biochemistry, Stanford University, Stanford, CA 94305
- Department of Statistics (by courtesy), Stanford University, Stanford, CA 94305
2
Olbrich M, Bartels L, Wohlers I. Sequencing technologies and hardware-accelerated parallel computing transform computational genomics research. Front Bioinform 2024; 4:1384497. [PMID: 38567256 PMCID: PMC10985184 DOI: 10.3389/fbinf.2024.1384497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 03/07/2024] [Indexed: 04/04/2024]
Affiliation(s)
- Michael Olbrich
- Center for Biotechnology, Khalifa University for Science and Technology, Abu Dhabi, United Arab Emirates
- Lennart Bartels
- Biomolecular Data Science in Pneumology, Research Center Borstel, Borstel, Germany
- Inken Wohlers
- Biomolecular Data Science in Pneumology, Research Center Borstel, Borstel, Germany
- University of Lübeck, Lübeck, Germany
3
Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 DOI: 10.3390/bioengineering11030263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024]
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
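The retrieval tasks described in the abstract reduce to nearest-neighbour search in a shared embedding space. The sketch below illustrates that idea with cosine similarity, which is an assumption here: the abstract says only "embedding distance computations". The embeddings, names, and the helper `retrieve` are hypothetical toy values, not output of the authors' system.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, database, top_k=2):
    """Return the top_k (name, score) pairs in `database` most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical co-embeddings: region sets and metadata labels share one space.
region_set_embeddings = {
    "set_A": [0.9, 0.1, 0.0],
    "set_B": [0.0, 1.0, 0.2],
    "set_C": [0.8, 0.3, 0.1],
}
label_embeddings = {"liver": [1.0, 0.0, 0.0], "brain": [0.0, 1.0, 0.0]}

# Task 1: retrieve region sets for a query label.
print(retrieve(label_embeddings["liver"], region_set_embeddings))
# Task 2: suggest labels for a database region set.
print(retrieve(region_set_embeddings["set_B"], label_embeddings, top_k=1))
```

Because labels and region sets live in the same space, the same function serves all three tasks; only the query and the searched collection change.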
Affiliation(s)
- Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
4
Emes RD, Pirooznia M, Zou Q, Pellegrini M. Editorial: Insights in computational genomics: 2022. Front Genet 2023; 14:1256011. [PMID: 37554406 PMCID: PMC10406376 DOI: 10.3389/fgene.2023.1256011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 07/17/2023] [Indexed: 08/10/2023]
Affiliation(s)
- Mehdi Pirooznia
- School of Medicine, Johns Hopkins University, Baltimore, MD, United States
- Pharmaceutical Data Sciences, R&D Johnson & Johnson, Boston, MA, United States
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
5
Ansaloni F, Gustincich S, Sanges R. In silico characterisation of minor wave genes and LINE-1s transcriptional dynamics at murine zygotic genome activation. Front Cell Dev Biol 2023; 11:1124266. [PMID: 37389353 PMCID: PMC10300423 DOI: 10.3389/fcell.2023.1124266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Accepted: 06/05/2023] [Indexed: 07/01/2023]
Abstract
Introduction: In mouse, the zygotic genome activation (ZGA) is coordinated by MERVL elements, a class of LTR retrotransposons. In addition to MERVL, another class of retrotransposons, LINE-1 elements, recently came under the spotlight as key regulators of murine ZGA. In particular, LINE-1 transcripts seem to be required to switch off the transcriptional program started by MERVL sequences, suggesting an antagonistic interplay between LINE-1 and MERVL pathways. Methods: To better investigate the activities of LINE-1 and MERVL elements at ZGA, we integrated publicly available transcriptomics (RNA-seq), chromatin accessibility (ATAC-seq) and Pol-II binding (Stacc-seq) datasets and characterised the transcriptional and epigenetic dynamics of such elements during murine ZGA. Results: We identified two likely distinct transcriptional activities characterising the murine zygotic genome at ZGA onset. On the one hand, our results confirmed that ZGA minor wave genes are preferentially transcribed from MERVL-rich and gene-dense genomic compartments, such as gene clusters. On the other hand, we identified a set of evolutionarily young and likely transcriptionally autonomous LINE-1s located in intergenic and gene-poor regions showing, at the same stage, features such as open chromatin and RNA Pol II binding, suggesting that they are, at least, poised for transcription. Discussion: These results suggest that, across evolution, transcription of two different classes of transposable elements, MERVLs and LINE-1s, has likely been confined to genic and intergenic regions, respectively, in order to maintain and regulate two successive transcriptional programs at ZGA.
Affiliation(s)
- Federico Ansaloni
- Area of Neuroscience, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy
- Central RNA Laboratory, Istituto Italiano di Tecnologia—IIT, Genova, Italy
- Stefano Gustincich
- Central RNA Laboratory, Istituto Italiano di Tecnologia—IIT, Genova, Italy
- Remo Sanges
- Area of Neuroscience, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy
- Central RNA Laboratory, Istituto Italiano di Tecnologia—IIT, Genova, Italy
6
Koesterich J, An JY, Inoue F, Sohota A, Ahituv N, Sanders SJ, Kreimer A. Characterization of De Novo Promoter Variants in Autism Spectrum Disorder with Massively Parallel Reporter Assays. Int J Mol Sci 2023; 24:3509. [PMID: 36834916 PMCID: PMC9959321 DOI: 10.3390/ijms24043509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 01/13/2023] [Accepted: 02/03/2023] [Indexed: 02/12/2023]
Abstract
Autism spectrum disorder (ASD) is a common, complex, and highly heritable condition with contributions from both common and rare genetic variations. While disruptive, rare variants in protein-coding regions clearly contribute to symptoms, the role of rare non-coding variants remains unclear. Variants in these regions, including promoters, can alter downstream RNA and protein quantity; however, the functional impacts of specific variants observed in ASD cohorts remain largely uncharacterized. Here, we analyzed 3600 de novo mutations in promoter regions previously identified by whole-genome sequencing of autistic probands and neurotypical siblings to test the hypothesis that mutations in cases have a greater functional impact than those in controls. We leveraged massively parallel reporter assays (MPRAs) to detect transcriptional consequences of these variants in neural progenitor cells and identified 165 functionally high-confidence de novo variants (HcDNVs). While these HcDNVs are enriched for markers of active transcription, disruption to transcription factor binding sites, and open chromatin, we did not identify differences in functional impact based on ASD diagnostic status.
Affiliation(s)
- Justin Koesterich
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854, USA
- Department of Cell and Developmental Biology, Rutgers University, Piscataway, NJ 08854, USA
- Joon-Yong An
- Department of Psychiatry and Behavioral Sciences, Weill Institute for Neuroscience, University of California, San Francisco, CA 94143, USA
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
- BK21FOUR R&E Center for Learning Health Systems, Korea University, Seoul 02841, Republic of Korea
- Fumitaka Inoue
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California, San Francisco, CA 94158, USA
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto 606-8501, Japan
- Ajuni Sohota
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA
- Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California, San Francisco, CA 94158, USA
- Stephan J. Sanders
- Department of Psychiatry and Behavioral Sciences, Weill Institute for Neuroscience, University of California, San Francisco, CA 94143, USA
- Institute for Human Genetics, University of California, San Francisco, CA 94158, USA
- Institute for Developmental and Regenerative Medicine, Old Road Campus, Roosevelt Dr, Headington, Oxford OX3 7TY, UK
- Anat Kreimer
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854, USA
- Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, NJ 08854, USA
7
Nabeel Asim M, Ali Ibrahim M, Fazeel A, Dengel A, Ahmed S. DNA-MP: a generalized DNA modifications predictor for multiple species based on powerful sequence encoding method. Brief Bioinform 2023; 24:6931721. [PMID: 36528802 DOI: 10.1093/bib/bbac546] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/14/2022] [Revised: 11/06/2022] [Accepted: 11/12/2022] [Indexed: 12/23/2022]
Abstract
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for particular type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, type-specific and generalized modifications predictors produce limited performance across multiple species mainly due to the use of ineffective sequence encoding methods. This paper presents a generalized computational approach, "DNA-MP", that more precisely predicts three different DNA modifications across multiple species. The proposed DNA-MP approach makes use of a powerful encoding method, "position specific nucleotides occurrence based on modification and non-modification class densities normalized difference" (POCD-ND), to generate the statistical representations of DNA sequences and a deep forest classifier for modifications prediction. The POCD-ND encoder generates statistical representations by extracting position-specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with the 32 most widely used encoding methods on 17 benchmark DNA modifications prediction datasets of 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms the 32 existing encoders. Furthermore, taken together over 5-fold cross-validation benchmark datasets and independent test sets, the proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modifications predictors by an average accuracy of 7% across 4mC datasets, 1.35% across 5hmC datasets and 10% for 6mA datasets.
To support the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Ahtisham Fazeel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
8
Nelson TM, Ghosh S, Postler TS. L-RAPiT: A Cloud-Based Computing Pipeline for the Analysis of Long-Read RNA Sequencing Data. Int J Mol Sci 2022; 23:15851. [PMID: 36555493 PMCID: PMC9781625 DOI: 10.3390/ijms232415851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2022] [Revised: 12/07/2022] [Accepted: 12/11/2022] [Indexed: 12/15/2022]
Abstract
Long-read sequencing (LRS) has been adopted to meet a wide variety of research needs, ranging from the construction of novel transcriptome annotations to the rapid identification of emerging virus variants. Amongst other advantages, LRS preserves more information about RNA at the transcript level than conventional high-throughput sequencing, including far more accurate and quantitative records of splicing patterns. New studies with LRS datasets are being published at an exponential rate, generating a vast reservoir of information that can be leveraged to address a host of different research questions. However, mining such publicly available data in a tailored fashion is currently not easy, as the available software tools typically require familiarity with the command-line interface, which constitutes a significant obstacle to many researchers. Additionally, different research groups utilize different software packages to perform LRS analysis, which often prevents a direct comparison of published results across different studies. To address these challenges, we have developed the Long-Read Analysis Pipeline for Transcriptomics (L-RAPiT), a user-friendly, free pipeline requiring no dedicated computational resources or bioinformatics expertise. L-RAPiT can be implemented directly through Google Colaboratory, a system based on the open-source Jupyter notebook environment, and allows for the direct analysis of transcriptomic reads from Oxford Nanopore and PacBio LRS machines. This new pipeline enables the rapid, convenient, and standardized analysis of publicly available or newly generated LRS datasets.
9
Thommana A, Shakya M, Gandhi J, Fung CK, Chain PSG, Maljkovic Berry I, Conte MA. Intrahost SARS-CoV-2 k-mer Identification Method (iSKIM) for Rapid Detection of Mutations of Concern Reveals Emergence of Global Mutation Patterns. Viruses 2022; 14:2128. [PMID: 36298683 PMCID: PMC9609618 DOI: 10.3390/v14102128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/23/2022] [Accepted: 09/24/2022] [Indexed: 11/27/2022]
Abstract
Despite unprecedented global sequencing and surveillance of SARS-CoV-2, timely identification of the emergence and spread of novel variants of concern (VoCs) remains a challenge. Several million raw genome sequencing runs are now publicly available. We sought to survey these datasets for intrahost variation to study emerging mutations of concern. We developed iSKIM ("intrahost SARS-CoV-2 k-mer identification method") to relatively quickly and efficiently screen the many SARS-CoV-2 datasets to identify intrahost mutations belonging to lineages of concern. Certain mutations surged in frequency as intrahost minor variants just prior to, or while lineages of concern arose. The Spike N501Y change common to several VoCs was found as a minor variant in 834 samples as early as October 2020. This coincides with the timing of the first detected samples with this mutation in the Alpha/B.1.1.7 and Beta/B.1.351 lineages. Using iSKIM, we also found that Spike L452R was detected as an intrahost minor variant as early as September 2020, prior to the observed rise of the Epsilon/B.1.429/B.1.427 lineages in late 2020. iSKIM rapidly screens for mutations of interest in raw data, prior to genome assembly, and can be used to detect increases in intrahost variants, potentially providing an early indication of novel variant spread.
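The general idea of screening raw reads for a mutation with k-mers, before any assembly, can be sketched as follows. This is an illustration under stated assumptions, not the iSKIM implementation: the function name `screen_reads`, the 5-mers, and the reads are made-up toy values, and reverse-complement matching and sequencing-error handling are deliberately omitted.

```python
def screen_reads(reads, ref_kmer, alt_kmer):
    """Count reads carrying the reference or the mutant k-mer.

    Returns (ref_count, alt_count, alt_fraction). Hypothetical helper for
    illustration; exact substring matching only.
    """
    ref_count = sum(1 for read in reads if ref_kmer in read)
    alt_count = sum(1 for read in reads if alt_kmer in read)
    total = ref_count + alt_count
    alt_fraction = alt_count / total if total else 0.0
    return ref_count, alt_count, alt_fraction


# Toy data: both k-mers cover the same site and differ at one base.
reads = ["GGACGTTCC", "TTACGTTAA", "CCACGTTGG", "GGACGATCC", "AAAAAAAA"]
ref_count, alt_count, alt_fraction = screen_reads(reads, ref_kmer="ACGTT", alt_kmer="ACGAT")

# A site is an intrahost minor variant here if the mutant allele is present
# but supported by fewer than half of the informative reads.
is_minor_variant = 0.0 < alt_fraction < 0.5
print(ref_count, alt_count, alt_fraction, is_minor_variant)
```

Tracking `alt_fraction` per sample over collection dates is the kind of signal the abstract describes as a possible early indication of variant spread.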
Affiliation(s)
- Ashley Thommana
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA
- Montgomery Blair High School, Silver Spring, MD 20901, USA
- Migun Shakya
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Jaykumar Gandhi
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA
- Christian K. Fung
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA
- Patrick S. G. Chain
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Irina Maljkovic Berry
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA
- Integrated Research Facility, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Frederick, MD 21702, USA
- Matthew A. Conte
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA
10
Thommana A, Shakya M, Gandhi J, Fung CK, Chain PSG, Berry IM, Conte MA. Intrahost SARS-CoV-2 k-mer identification method (iSKIM) for rapid detection of mutations of concern reveals emergence of global mutation patterns. bioRxiv 2022:2022.08.16.504117. [PMID: 36032969 PMCID: PMC9413717 DOI: 10.1101/2022.08.16.504117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Despite unprecedented global sequencing and surveillance of SARS-CoV-2, timely identification of the emergence and spread of novel variants of concern (VoCs) remains a challenge. Several million raw genome sequencing runs are now publicly available. We sought to survey these datasets for intrahost variation to study emerging mutations of concern. We developed iSKIM ("intrahost SARS-CoV-2 k-mer identification method") to relatively quickly and efficiently screen the many SARS-CoV-2 datasets to identify intrahost mutations belonging to lineages of concern. Certain mutations surged in frequency as intrahost minor variants just prior to, or while lineages of concern arose. The Spike N501Y change common to several VoCs was found as a minor variant in 834 samples as early as October 2020. This coincides with the timing of the first detected samples with this mutation in the Alpha/B.1.1.7 and Beta/B.1.351 lineages. Using iSKIM, we also found that Spike L452R was detected as an intrahost minor variant as early as September 2020, prior to the observed rise of the Epsilon/B.1.429/B.1.427 lineages in late 2020. iSKIM rapidly screens for mutations of interest in raw data, prior to genome assembly, and can be used to detect increases in intrahost variants, potentially providing an early indication of novel variant spread.
Affiliation(s)
- Ashley Thommana
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD, USA
- Montgomery Blair High School, Silver Spring, MD, USA
- Migun Shakya
- Los Alamos National Laboratory, Biosecurity and Public Health Group, Bioscience Division, Los Alamos, NM, USA
- Jaykumar Gandhi
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD, USA
- Christian K Fung
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD, USA
- Patrick S G Chain
- Los Alamos National Laboratory, Biosecurity and Public Health Group, Bioscience Division, Los Alamos, NM, USA
- Irina Maljkovic Berry
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD, USA
- Integrated Research Facility, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Frederick, MD, USA
- Matthew A Conte
- Viral Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD, USA
11
Orlov YL, Tatarinova TV, Oparina NY, Galieva ER, Baranova AV. Editorial: Bioinformatics of Genome Regulation, Volume I. Front Genet 2021; 12:803273. [PMID: 34938326 PMCID: PMC8687738 DOI: 10.3389/fgene.2021.803273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2021] [Accepted: 11/08/2021] [Indexed: 11/23/2022]
Affiliation(s)
- Yuriy L Orlov
- Institute of Digital Medicine, I.M.Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
- Agrarian and Technological Institute, Peoples' Friendship University of Russia (RUDN University), Moscow, Russia
- Life Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Nina Y Oparina
- Institute of Medicine, University of Gothenburg, Göteborg, Sweden
- Elvira R Galieva
- Life Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Ancha V Baranova
- School of Systems Biology, George Mason University, Fairfax, VA, United States
- Research Centre for Medical Genetics, Moscow, Russia
12
Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 2021; 38:5825-5829. [PMID: 34597405 PMCID: PMC8662613 DOI: 10.1093/molbev/msab293] [Citation(s) in RCA: 915] [Impact Index Per Article: 305.0] [Indexed: 12/13/2022]
Abstract
Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: 1) de novo gene prediction from raw contigs, 2) built-in pairwise orthology prediction, 3) fast protein domain discovery, and 4) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.
Affiliation(s)
- Carlos P Cantalapiedra
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
- Ana Hernández-Plaza
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
- Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
- Yonsei Frontier Lab (YFL), Yonsei University, Seoul, South Korea
- Jaime Huerta-Cepas
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
13
Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 2021; 38:5825-5829. [PMID: 34597405 DOI: 10.1101/2021.06.03.446934] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Indexed: 05/23/2023]
Abstract
Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: 1) de novo gene prediction from raw contigs, 2) built-in pairwise orthology prediction, 3) fast protein domain discovery, and 4) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.
Affiliation(s)
- Carlos P Cantalapiedra
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
- Ana Hernández-Plaza
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
- Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
- Yonsei Frontier Lab (YFL), Yonsei University, Seoul, South Korea
- Jaime Huerta-Cepas
- Centro de Biotecnologia y Genomica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain
14
Orlov YL, Anashkina AA, Tatarinova TV, Baranova AV. Editorial: Bioinformatics of Genome Regulation, Volume II. Front Genet 2021; 12:795257. [PMID: 34819949 PMCID: PMC8606529 DOI: 10.3389/fgene.2021.795257] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/14/2021] [Accepted: 10/25/2021] [Indexed: 12/17/2022]
Affiliation(s)
- Yuriy L Orlov
- The Digital Health Institute, I.M.Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia; Agrobiotechnology Department, Agrarian and Technological Institute, Peoples' Friendship University of Russia (RUDN University), Moscow, Russia
- Anastasia A Anashkina
- The Digital Health Institute, I.M.Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia; Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
- Ancha V Baranova
- School of Systems Biology, George Mason University, Fairfax, VA, United States; Research Centre for Medical Genetics, Moscow, Russia
15
Orchard P, Kyono Y, Hensley J, Kitzman JO, Parker SCJ. Quantification, Dynamic Visualization, and Validation of Bias in ATAC-Seq Data with ataqv. Cell Syst 2020; 10:298-306.e4. [PMID: 32213349 DOI: 10.1016/j.cels.2020.02.009] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 05/15/2019] [Revised: 11/15/2019] [Accepted: 02/25/2020] [Indexed: 12/17/2022]
Abstract
The assay for transposase-accessible chromatin using sequencing (ATAC-seq) has become the preferred method for mapping chromatin accessibility due to its time and input material efficiency. However, it can be difficult to evaluate data quality and identify sources of technical bias across samples. Here, we present ataqv, a computational toolkit for efficiently measuring, visualizing, and comparing quality control (QC) results across samples and experiments. We use ataqv to analyze 2,009 public ATAC-seq datasets; their QC metrics display a 10-fold range. Tn5 dosage experiments and statistical modeling show that technical variation in the ratio of Tn5 transposase to nuclei and sequencing flowcell density induces systematic bias in ATAC-seq data by changing the enrichment of reads across functional genomic annotations including promoters, enhancers, and transcription-factor-bound regions, with the notable exception of CTCF. ataqv can be integrated into existing computational pipelines and is freely available at https://github.com/ParkerLab/ataqv/.
16
Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, Moscati A, Wojcik GL, Shemirani R, Beckmann ND, Cohain A, Sorokin EP, Park DS, Ambite JL, Ellis S, Auton A, Bottinger EP, Cho JH, Loos RJF, Abul-Husn NS, Zaitlen NA, Gignoux CR, Kenny EE. Toward a fine-scale population health monitoring system. Cell 2021; 184:2068-2083.e11. [PMID: 33861964 DOI: 10.1016/j.cell.2021.03.034] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 09/23/2019] [Revised: 11/18/2020] [Accepted: 03/12/2021] [Indexed: 12/22/2022]
Abstract
Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens and environmental factors impacting specific sub-populations. Here, we propose a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens. Using data from a diverse biobank in New York City, we identified 17 communities sharing recent genetic ancestry. We observed 1,177 health outcomes that were statistically associated with a specific group and demonstrated significant differences in the segregation of genetic variants contributing to Mendelian diseases. We also demonstrated that fine-scale population structure can impact the prediction of complex disease risk within groups. This work reinforces the utility of linking genomic data to EHRs and provides a framework toward fine-scale monitoring of population health.
Affiliation(s)
- Gillian M Belbin
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Sinead Cullina
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Stephane Wenric
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Emily R Soper
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Benjamin S Glicksberg
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Denis Torre
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Arden Moscati
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Genevieve L Wojcik
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Ruhollah Shemirani
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
- Noam D Beckmann
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Ariella Cohain
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Elena P Sorokin
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Danny S Park
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Jose-Luis Ambite
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
- Steve Ellis
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Adam Auton
- Department of Genetics, Albert Einstein College of Medicine, New York, NY 10461, USA
| | -
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | -
- Regeneron Genetics Center, Tarrytown, New York, NY 10591, USA
- Erwin P Bottinger
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Judy H Cho
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Ruth J F Loos
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Noura S Abul-Husn
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Noah A Zaitlen
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA 90033, USA
- Christopher R Gignoux
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
17
Augustyn DR, Wyciślik Ł, Mrozek D. Perspectives of using Cloud computing in integrative analysis of multi-omics data. Brief Funct Genomics 2021; 20:198-206. [PMID: 33676373 DOI: 10.1093/bfgp/elab007] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/02/2020] [Revised: 01/25/2021] [Accepted: 01/26/2021] [Indexed: 12/11/2022]
Abstract
Integrative analysis of multi-omics data is usually computationally demanding. It frequently requires building complex, multi-step analysis pipelines, applying dedicated techniques for data processing and combining several data sources. These efforts lead to a better understanding of life processes, current health state or the effects of therapeutic activities. However, many omics data analysis solutions focus only on a selected problem, disease, type of data or organism. Moreover, they are implemented for general-purpose scientific computational platforms that most often do not easily scale the calculations natively. These features are not conducive to advances in understanding genotype-phenotype relationships. Fortunately, with new technological paradigms, including Cloud computing, virtualization and containerization, these functionalities could be orchestrated for easy scaling and for building independent analysis pipelines for omics data. Solutions can therefore be re-used for purposes for which they were not primarily designed. This paper presents perspectives on using Cloud computing advances and the containerization approach for this purpose. We first review how the Cloud computing model is utilized in multi-omics data analysis and show the weak points of the adopted solutions. Then, we introduce containerization concepts, which allow both scaling and linking of functional services designed for various purposes. Finally, using the Bioconductor software package as an example, we present a verified concept model of a universal solution that shows potential for performing integrative analysis of multiple omics data sources.
Affiliation(s)
- Dariusz R Augustyn
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
- Łukasz Wyciślik
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
- Dariusz Mrozek
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
18
Holub AS, Bouley RA, Petreaca RC, Husbands AY. Identifying Cancer-Relevant Mutations in the DLC START Domain Using Evolutionary and Structure-Function Analyses. Int J Mol Sci 2020; 21:E8175. [PMID: 33142932 DOI: 10.3390/ijms21218175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/21/2020] [Revised: 10/22/2020] [Accepted: 10/30/2020] [Indexed: 01/05/2023]
Abstract
Rho GTPase signaling promotes proliferation, invasion, and metastasis in a broad spectrum of cancers. Rho GTPase activity is regulated by the deleted in liver cancer (DLC) family of bona fide tumor suppressors, which directly inactivate Rho GTPases by stimulating GTP hydrolysis. In addition to a RhoGAP domain, DLC proteins contain a StAR-related lipid transfer (START) domain. START domains in other organisms bind hydrophobic small molecules and can regulate interacting partners or co-occurring domains through a variety of mechanisms. In the case of DLC proteins, their START domain appears to contribute to tumor-suppressive activity. However, the nature of this START-directed mechanism, as well as the identities of relevant functional residues, remain virtually unknown. Using the Catalogue of Somatic Mutations in Cancer (COSMIC) dataset and evolutionary and structure-function analyses, we identify several conserved residues likely to be required for START-directed regulation of DLC-1 and DLC-2 tumor-suppressive capabilities. This pan-cancer analysis shows that conserved residues of both START domains are highly overrepresented in cancer cells from a wide range of tissues. Interestingly, in DLC-1 and DLC-2, three of these residues form multiple interactions at the tertiary structural level. Furthermore, mutation of any of these residues is predicted to disrupt interactions and thus destabilize the START domain. As such, these mutations would not have emerged from traditional hotspot scans of COSMIC. We propose that evolutionary and structure-function analyses are an underutilized strategy which could be used to unmask cancer-relevant mutations within COSMIC. Our data also suggest DLC-1 and DLC-2 as high-priority candidates for development of novel therapeutics that target their START domain.
19
Rey F, Pandini C, Barzaghini B, Messa L, Giallongo T, Pansarasa O, Gagliardi S, Brilli M, Zuccotti GV, Cereda C, Raimondi MT, Carelli S. Dissecting the Effect of a 3D Microscaffold on the Transcriptome of Neural Stem Cells with Computational Approaches: A Focus on Mechanotransduction. Int J Mol Sci 2020; 21:E6775. [PMID: 32942778 DOI: 10.3390/ijms21186775] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/30/2020] [Revised: 09/05/2020] [Accepted: 09/14/2020] [Indexed: 12/16/2022]
Abstract
3D cell cultures are becoming increasingly important in the field of regenerative medicine due to their ability to mimic the cellular physiological microenvironment. Among the different types of 3D scaffolds, we focus on the Nichoid, a miniaturized scaffold with a structure inspired by the natural staminal niche. The Nichoid can activate cellular responses simply by subjecting the cells to mechanical stimuli. This kind of influence results in different cellular morphology and organization, but the molecular bases of these changes remain largely unknown. Through an RNA-Seq approach on murine neural precursor stem cells expanded inside the Nichoid, we investigated the deregulated genes and pathways, showing that the Nichoid causes alterations in genes strongly connected to mechanobiological functions. Moreover, we fully dissected this mechanism, highlighting how the changes start at the membrane level, with subsequent alterations in the cytoskeleton, signaling pathways, and metabolism, all leading to a final alteration in gene expression. The results shown here demonstrate that the Nichoid influences the biological and genetic response of stem cells through specific alterations of cellular signaling. The characterization of these pathways elucidates the role of mechanical manipulation on stem cells, with possible implications for regenerative medicine applications.
20
Abstract
Next-generation sequencing approaches have fundamentally changed the types of questions that can be asked about gene function and regulation. With the goal of approaching truly genome-wide quantifications of all the interaction partners and downstream effects of particular genes, these quantitative assays have allowed for an unprecedented level of detail in exploring biological interactions. However, many challenges remain in our ability to accurately describe and quantify the interactions that take place in those hard to reach and extremely repetitive regions of our genome comprised mostly of transposable elements (TEs). Tools dedicated to TE-derived sequences have lagged behind, making the inclusion of these sequences in genome-wide analyses difficult. Recent improvements, both computational and experimental, allow for the better inclusion of TE sequences in genomic assays and a renewed appreciation for the importance of TE biology. This review will discuss the recent improvements that have been made in the computational analysis of TE-derived sequences as well as the areas where such analysis still proves difficult. This article is part of a discussion meeting issue 'Crossroads between transposons and gene regulation'.
Affiliation(s)
- Kathryn O'Neill
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- David Brocks
- Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel
- Molly Gale Hammell
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
21
Díaz-Gay M, Franch-Expósito S, Arnau-Collell C, Park S, Supek F, Muñoz J, Bonjoch L, Gratacós-Mulleras A, Sánchez-Rojas PA, Esteban-Jurado C, Ocaña T, Cuatrecasas M, Vila-Casadesús M, Lozano JJ, Parra G, Laurie S, Beltran S, Castells A, Bujanda L, Cubiella J, Balaguer F, Castellví-Bel S. Integrated Analysis of Germline and Tumor DNA Identifies New Candidate Genes Involved in Familial Colorectal Cancer. Cancers (Basel) 2019; 11:cancers11030362. [PMID: 30871259 PMCID: PMC6468873 DOI: 10.3390/cancers11030362] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/25/2019] [Revised: 03/08/2019] [Accepted: 03/09/2019] [Indexed: 12/29/2022]
Abstract
Colorectal cancer (CRC) shows aggregation in some families but no alterations in the known hereditary CRC genes. We aimed to identify new candidate genes which are potentially involved in germline predisposition to familial CRC. An integrated analysis of germline and tumor whole-exome sequencing data was performed in 18 unrelated CRC families. Deleterious single nucleotide variants (SNV), short insertions and deletions (indels), copy number variants (CNVs) and loss of heterozygosity (LOH) were assessed as candidates for first germline or second somatic hits. Candidate tumor suppressor genes were selected when alterations were detected in both germline and somatic DNA, fulfilling Knudson’s two-hit hypothesis. Somatic mutational profiling and signature analysis were also performed. A series of germline-somatic variant pairs were detected. In all cases, the first hit was presented as a rare SNV/indel, whereas the second hit was either a different SNV (3 genes) or LOH affecting the same gene (141 genes). BRCA2, BLM, ERCC2, RECQL, REV3L and RIF1 were among the most promising candidate genes for germline CRC predisposition. The identification of new candidate genes involved in familial CRC could be achieved by our integrated analysis. Further functional studies and replication in additional cohorts are required to confirm the selected candidates.
Affiliation(s)
- Marcos Díaz-Gay
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Sebastià Franch-Expósito
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Coral Arnau-Collell
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Solip Park
- Systems Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, 08003 Barcelona, Spain;
- Fran Supek
- Institut de Recerca Biomedica (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028 Barcelona, Spain;
- Jenifer Muñoz
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Laia Bonjoch
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Anna Gratacós-Mulleras
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Paula A. Sánchez-Rojas
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Clara Esteban-Jurado
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Teresa Ocaña
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Maria Vila-Casadesús
- Bioinformatics Platform, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08036 Barcelona, Spain; (M.V.-C.); (J.J.L.)
- Juan José Lozano
- Bioinformatics Platform, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), 08036 Barcelona, Spain; (M.V.-C.); (J.J.L.)
- Genis Parra
- Centre Nacional d’Anàlisi Genòmica-Centre de Regulació Genòmica (CNAG-CRG), Parc Científic de Barcelona, 08028 Barcelona, Spain; (G.P.); (S.L.); (S.B.)
- Steve Laurie
- Centre Nacional d’Anàlisi Genòmica-Centre de Regulació Genòmica (CNAG-CRG), Parc Científic de Barcelona, 08028 Barcelona, Spain; (G.P.); (S.L.); (S.B.)
- Sergi Beltran
- Centre Nacional d’Anàlisi Genòmica-Centre de Regulació Genòmica (CNAG-CRG), Parc Científic de Barcelona, 08028 Barcelona, Spain; (G.P.); (S.L.); (S.B.)
- EPICOLON Consortium
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Gastroenterology Department, Hospital Donostia-Instituto Biodonostia, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Basque Country University (UPV/EHU), 20014 San Sebastián, Spain;
- Gastroenterology Department, Complexo Hospitalario Universitario de Ourense, Instituto de Investigación Sanitaria Galicia Sur, 32005 Ourense, Spain;
- Antoni Castells
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Luis Bujanda
- Gastroenterology Department, Hospital Donostia-Instituto Biodonostia, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Basque Country University (UPV/EHU), 20014 San Sebastián, Spain;
- Joaquín Cubiella
- Gastroenterology Department, Complexo Hospitalario Universitario de Ourense, Instituto de Investigación Sanitaria Galicia Sur, 32005 Ourense, Spain;
- Francesc Balaguer
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Sergi Castellví-Bel
- Gastroenterology Department, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Hospital Clínic, 08036 Barcelona, Spain; (M.D.-G.); (S.F.-E.); (C.A.-C.); (J.M.); (L.B.); (A.G.-M.); (P.A.S.-R.); (C.E.-J.); (T.O.); (A.C.); (F.B.)
- Correspondence: ; Tel.: +34-93227-5400 (ext. 4183)
22
Metzis V, Steinhauser S, Pakanavicius E, Gouti M, Stamataki D, Ivanovitch K, Watson T, Rayon T, Mousavy Gharavy SN, Lovell-Badge R, Luscombe NM, Briscoe J. Nervous System Regionalization Entails Axial Allocation before Neural Differentiation. Cell 2018; 175:1105-1118.e17. [PMID: 30343898 PMCID: PMC6218657 DOI: 10.1016/j.cell.2018.09.040] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 12/04/2017] [Revised: 07/06/2018] [Accepted: 09/19/2018] [Indexed: 01/28/2023]
Abstract
Neural induction in vertebrates generates a CNS that extends the rostral-caudal length of the body. The prevailing view is that neural cells are initially induced with anterior (forebrain) identity; caudalizing signals then convert a proportion to posterior fates (spinal cord). To test this model, we used chromatin accessibility to define how cells adopt region-specific neural fates. Together with genetic and biochemical perturbations, this identified a developmental time window in which genome-wide chromatin-remodeling events preconfigure epiblast cells for neural induction. Contrary to the established model, this revealed that cells commit to a regional identity before acquiring neural identity. This “primary regionalization” allocates cells to anterior or posterior regions of the nervous system, explaining how cranial and spinal neurons are generated at appropriate axial positions. These findings prompt a revision to models of neural induction and support the proposed dual evolutionary origin of the vertebrate CNS.
Highlights: Chromatin accessibility defines neural progenitor identity; a limited developmental window exists to establish spinal cord competency; cells acquire axial identity prior to neural identity.
Affiliation(s)
- Mina Gouti
- The Francis Crick Institute, London NW1 1AT, UK; Max-Delbrück Center for Molecular Medicine, Berlin 13092, Germany
- Nicholas M Luscombe
- The Francis Crick Institute, London NW1 1AT, UK; UCL Genetics Institute, Department of Genetics Evolution and Environment, University College London, London WC1E 6BT, UK; Okinawa Institute of Science and Technology Graduate University, Onna-son, Kunigami-gun, Okinawa 904-0495, Japan
23
Robinson M, Glusman G. Genotype Fingerprints Enable Fast and Private Comparison of Genetic Testing Results for Research and Direct-to-Consumer Applications. Genes (Basel) 2018; 9:E481. [PMID: 30287784 DOI: 10.3390/genes9100481] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/31/2018] [Revised: 09/29/2018] [Accepted: 10/02/2018] [Indexed: 11/18/2022]
Abstract
Genetic testing has expanded out of the research laboratory into medical practice and the direct-to-consumer market. Rapid analysis of the resulting genotype data now has a significant impact. We present a method for summarizing personal genotypes as ‘genotype fingerprints’ that meets these needs. Genotype fingerprints can be derived from any single nucleotide polymorphism-based assay, and remain comparable as chip designs evolve to higher marker densities. We demonstrate that these fingerprints support distinguishing among types of relationships between closely related individuals, and distinguishing closely related individuals from other individuals of the same background population. Through extremely rapid comparisons, they also support high-throughput identification of identical genotypes, identification of individuals in known background populations, and de novo separation of subpopulations within a large cohort. Although fingerprints do not preserve anonymity, they provide a useful degree of privacy by summarizing a genotype while preventing reconstruction of individual marker states. Genotype fingerprints are therefore well-suited as a format for public aggregation of genetic information to support ancestry and relatedness determination without revealing personal health risk status.
24
Abstract
Genomic instability, although frequently deleterious, is also an important mechanism for microbial adaptation to environmental change. Although widely studied in bacteria, in archaea the effect of genomic instability on organism phenotypes and fitness remains unclear. Here we use DNA segmentation methods to detect and quantify genome-wide copy number variation (CNV) in large compendia of high-throughput datasets in a model archaeal species, Halobacterium salinarum. CNV hotspots were identified throughout the genome. Some hotspots were strongly associated with changes in gene expression, suggesting a mechanism for phenotypic innovation. In contrast, CNV hotspots in other genomic loci left expression unchanged, suggesting buffering of certain phenotypes. The correspondence of CNVs with gene expression was validated with strain- and condition-matched transcriptomics and DNA quantification experiments at specific loci. Significant correlation of CNV hotspot locations with the positions of known insertion sequence (IS) elements suggested a mechanism for generating genomic instability. Given the efficient recombination capabilities in H. salinarum despite stability at the single nucleotide level, these results suggest that genomic plasticity mediated by IS element activity can provide a source of phenotypic innovation in extreme environments.
Affiliation(s)
- Keely A Dulmage
- University Program in Genetics and Genomics, Duke University, Durham, NC, USA; Biology Department, Duke University, Durham, NC, USA
- Amy K Schmid
- University Program in Genetics and Genomics, Duke University, Durham, NC, USA; Biology Department, Duke University, Durham, NC, USA; Center for Genomics and Computational Biology, Duke University, Durham, NC 27708, USA
25
Abstract
Integrins are contributors to remodeling of the extracellular matrix and cell migration. Integrins participate in the assembly of the actin cytoskeleton, regulate growth factor signaling pathways and cell proliferation, and control cell motility. In solid tumors, integrins are involved in promoting metastasis to distant sites, and angiogenesis. Integrins are a key target in cancer therapy and imaging. Integrin antagonists have proven successful in halting invasion and migration of tumors. Overexpressed integrins are prime anti-cancer drug targets. To streamline the development of specific integrin cancer therapeutics, we curated data to predict which integrin heterodimers are plausible therapeutic targets against 17 different solid tumors. Computational analysis of The Cancer Genome Atlas (TCGA) gene expression data revealed a set of integrin targets that are differentially expressed in tumors. Filtered by FPKM (Fragments Per Kilobase of transcript per Million mapped reads) expression level, overexpressed subunits were paired into heterodimeric protein targets. By comparing the RNA-seq differential expression results with immunohistochemistry (IHC) data, overexpressed integrin subunits were validated. Biologics and small molecule drug compounds against these identified overexpressed subunits and heterodimeric receptors are potential therapeutics against these cancers. In addition, high-affinity and high-specificity ligands against these integrins can serve as efficient vehicles for delivery of cancer drugs, nanotherapeutics, or imaging probes against cancer.
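The FPKM filter mentioned in this abstract rests on a standard normalization: fragment counts are scaled by transcript length (in kilobases) and by sequencing depth (in millions of mapped fragments). The following is a minimal sketch of that formula only. The function name `fpkm` and the example numbers are illustrative assumptions and do not come from the study, whose actual filtering thresholds are not given here.

```python
def fpkm(fragments: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    FPKM = fragments / (length_bp / 1e3) / (total_mapped / 1e6)
         = fragments * 1e9 / (length_bp * total_mapped)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 100 fragments on a 2,000 bp transcript
# in a library of 10 million mapped fragments.
example = fpkm(100, 2_000, 10_000_000)
print(example)
```

Because the value is normalized for both length and depth, a threshold on FPKM can be applied uniformly across transcripts and samples, which is presumably why it is used as the filter here.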
Affiliation(s)
- Adith S Arun
- Department of Biochemistry and Molecular Medicine, University of California Davis School of Medicine, UC Davis NCI-Designated Comprehensive Cancer Center, Sacramento, CA 95817, USA
- Clifford G Tepper
- Department of Biochemistry and Molecular Medicine, University of California Davis School of Medicine, UC Davis NCI-Designated Comprehensive Cancer Center, Sacramento, CA 95817, USA
- Kit S Lam
- Department of Biochemistry and Molecular Medicine, University of California Davis School of Medicine, UC Davis NCI-Designated Comprehensive Cancer Center, Sacramento, CA 95817, USA
26
Kang J, Rancati T, Lee S, Oh JH, Kerns SL, Scott JG, Schwartz R, Kim S, Rosenstein BS. Machine Learning and Radiogenomics: Lessons Learned and Future Directions. Front Oncol 2018; 8:228. [PMID: 29977864 PMCID: PMC6021505 DOI: 10.3389/fonc.2018.00228] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 02/16/2018] [Accepted: 06/04/2018] [Indexed: 12/25/2022] Open
Abstract
Due to the rapid increase in the availability of patient data, there is significant interest in precision medicine that could facilitate the development of a personalized treatment plan for each patient on an individual basis. Radiation oncology is particularly suited for predictive machine learning (ML) models due to the enormous amount of diagnostic data used as input and therapeutic data generated as output. An emerging field in precision radiation oncology that can take advantage of ML approaches is radiogenomics, which is the study of the impact of genomic variations on the sensitivity of normal and tumor tissue to radiation. Currently, patients undergoing radiotherapy are treated using uniform dose constraints specific to the tumor and surrounding normal tissues. This is suboptimal in many ways. First, the dose that can be delivered to the target volume may be insufficient for control but is constrained by the surrounding normal tissue, as dose escalation can lead to significant morbidity and rare. Second, two patients with nearly identical dose distributions can have substantially different acute and late toxicities, resulting in lengthy treatment breaks and suboptimal control, or chronic morbidities leading to poor quality of life. Despite significant advances in radiogenomics, the magnitude of the genetic contribution to radiation response far exceeds our current understanding of individual risk variants. In the field of genomics, ML methods are being used to extract harder-to-detect knowledge, but these methods have yet to fully penetrate radiogenomics. Hence, the goal of this publication is to provide an overview of ML as it applies to radiogenomics. We begin with a brief history of radiogenomics and its relationship to precision medicine. We then introduce ML and compare it to statistical hypothesis testing to reflect on shared lessons and to avoid common pitfalls. Current ML approaches to genome-wide association studies are examined. 
The application of ML specifically to radiogenomics is next presented. We end with important lessons for the proper integration of ML into radiogenomics.
Affiliation(s)
- John Kang
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Tiziana Rancati
- Prostate Cancer Program, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Sangkyu Lee
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Jung Hun Oh
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Sarah L. Kerns
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Jacob G. Scott
- Department of Translational Hematology and Oncology Research, Cleveland Clinic, Cleveland, OH, United States
- Department of Radiation Oncology, Cleveland Clinic, Cleveland, OH, United States
- Russell Schwartz
- Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States
- Seyoung Kim
- Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
- Barry S. Rosenstein
- Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
27
Shmakov SA, Makarova KS, Wolf YI, Severinov KV, Koonin EV. Systematic prediction of genes functionally linked to CRISPR-Cas systems by gene neighborhood analysis. Proc Natl Acad Sci U S A 2018; 115:E5307-16. [PMID: 29784811 DOI: 10.1073/pnas.1803440115] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Indexed: 01/18/2023] Open
Abstract
The CRISPR-Cas systems of bacterial and archaeal adaptive immunity consist of direct repeat arrays separated by unique spacers and multiple CRISPR-associated (cas) genes encoding proteins that mediate all stages of the CRISPR response. In addition to the relatively small set of core cas genes that are typically present in all CRISPR-Cas systems of a given (sub)type and are essential for the defense function, numerous genes occur in CRISPR-cas loci only sporadically. Some of these have been shown to perform various ancillary roles in CRISPR response, but the functional relevance of most remains unknown. We developed a computational strategy for systematically detecting genes that are likely to be functionally linked to CRISPR-Cas. The approach is based on a "CRISPRicity" metric that measures the strength of CRISPR association for all protein-coding genes from sequenced bacterial and archaeal genomes. Uncharacterized genes with CRISPRicity values comparable to those of cas genes are considered candidate CRISPR-linked genes. We describe additional criteria to predict functional relevance for genes in the candidate set and identify 79 genes as strong candidates for functional association with CRISPR-Cas systems. A substantial majority of these CRISPR-linked genes reside in type III CRISPR-cas loci, which implies exceptional functional versatility of type III systems. Numerous candidate CRISPR-linked genes encode integral membrane proteins suggestive of tight membrane association of CRISPR-Cas systems, whereas many others encode proteins implicated in various signal transduction pathways. These predictions provide ample material for improving annotation of CRISPR-cas loci and experimental characterization of previously unsuspected aspects of CRISPR-Cas system functionality.
28
Glusman G, Mauldin DE, Hood LE, Robinson M. Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints. Front Genet 2017; 8:136. [PMID: 29018478 PMCID: PMC5623000 DOI: 10.3389/fgene.2017.00136] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 07/14/2017] [Accepted: 09/12/2017] [Indexed: 01/01/2023] Open
Abstract
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into “genome fingerprints” via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicative sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics.
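The timing figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that "all-against-all" means each unordered pair of genomes is compared exactly once, which the abstract does not state explicitly. The variable names are illustrative.

```python
# Figures taken from the abstract.
n_genomes = 2504
seconds_per_comparison = 21e-6  # 21 microseconds

# Assumption: each unordered pair is compared once, giving n(n-1)/2 comparisons.
n_pairs = n_genomes * (n_genomes - 1) // 2
estimated_seconds = n_pairs * seconds_per_comparison

print(n_pairs)                      # number of pairwise comparisons
print(round(estimated_seconds, 1))  # estimated total runtime in seconds
```

Under that assumption the estimate comes to roughly 66 s for about 3.1 million comparisons, which agrees with the reported 67 s once rounding of the per-comparison time is taken into account.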
Affiliation(s)
- Leroy E Hood
- Institute for Systems Biology, Seattle, WA, United States
- Max Robinson
- Institute for Systems Biology, Seattle, WA, United States
29
Abstract
Genome rearrangement problems have been extensively studied due to their importance in biology. Most studied models assumed a single copy per gene. However, in reality, duplicated genes are common, most notably in cancer. In this study, we make a step toward handling duplicated genes by considering a model that allows the atomic operations of cut, join, and whole chromosome duplication. Given two linear genomes, [Formula: see text] with one copy per gene and [Formula: see text] with two copies per gene, we give a linear time algorithm for computing a shortest sequence of operations transforming [Formula: see text] into [Formula: see text] such that all intermediate genomes are linear. We also show that computing an optimal sequence with fewest duplications is NP-hard.
Affiliation(s)
- Ron Zeira
- Blavatnik School of Computer Science, Tel Aviv University, Tel-Aviv, Israel
- Ron Shamir
- Blavatnik School of Computer Science, Tel Aviv University, Tel-Aviv, Israel
30
Karp PD, Latendresse M, Paley SM, Krummenacker M, Ong QD, Billington R, Kothari A, Weaver D, Lee T, Subhraveti P, Spaulding A, Fulcher C, Keseler IM, Caspi R. Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology. Brief Bioinform 2015; 17:877-90. [PMID: 26454094 DOI: 10.1093/bib/bbv079] [Citation(s) in RCA: 203] [Impact Index Per Article: 22.6] [Received: 05/07/2015] [Indexed: 11/15/2022] Open
Abstract
Pathway Tools is a bioinformatics software environment with a broad set of capabilities. The software provides genome-informatics tools such as a genome browser, sequence alignments, a genome-variant analyzer and comparative-genomics operations. It offers metabolic-informatics tools, such as metabolic reconstruction, quantitative metabolic modeling, prediction of reaction atom mappings and metabolic route search. Pathway Tools also provides regulatory-informatics tools, such as the ability to represent and visualize a wide range of regulatory interactions. This article outlines the advances in Pathway Tools in the past 5 years. Major additions include components for metabolic modeling, metabolic route search, computation of atom mappings and estimation of compound Gibbs free energies of formation; addition of editors for signaling pathways, for genome sequences and for cellular architecture; storage of gene essentiality data and phenotype data; display of multiple alignments, and of signaling and electron-transport pathways; and development of Python and web-services application programming interfaces. Scientists around the world have created more than 9800 Pathway/Genome Databases by using Pathway Tools, many of which are curated databases for important model organisms.
31
Guzina J, Djordjevic M. Bioinformatics as a first-line approach for understanding bacteriophage transcription. Bacteriophage 2015; 5:e1062588. [PMID: 26442194 DOI: 10.1080/21597081.2015.1062588] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/28/2015] [Revised: 06/09/2015] [Accepted: 06/09/2015] [Indexed: 01/21/2023]
Abstract
The current approach to understanding bacteriophage transcription strategies during infection combines experimental and bioinformatics methods, which is often time- and resource-consuming. Given the exponentially growing number of sequenced bacteriophage genomes, it becomes sensible to ask to what extent one can understand bacteriophage transcription by using bioinformatics methods alone. We here argue that a suitable choice of computational methods may provide a highly efficient first-line approach for understanding bacteriophage transcription.
Affiliation(s)
- Jelena Guzina
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Belgrade, Serbia
- Marko Djordjevic
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Belgrade, Serbia
32
Paten B, Diekhans M, Druker BJ, Friend S, Guinney J, Gassner N, Guttman M, Kent WJ, Mantey P, Margolin AA, Massie M, Novak AM, Nothaft F, Pachter L, Patterson D, Smuga-Otto M, Stuart JM, Van't Veer L, Wold B, Haussler D. The NIH BD2K center for big data in translational genomics. J Am Med Inform Assoc 2015; 22:1143-7. [PMID: 26174866 DOI: 10.1093/jamia/ocv047] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 02/16/2015] [Accepted: 04/20/2015] [Indexed: 11/14/2022] Open
Abstract
The world's genomics data will never be stored in a single repository - rather, it will be distributed among many sites in many countries. No one site will have enough data to explain genotype to phenotype relationships in rare diseases; therefore, sites must share data. To accomplish this, the genetics community must forge common standards and protocols to make sharing and computing data among many sites a seamless activity. Through the Global Alliance for Genomics and Health, we are pioneering the development of shared application programming interfaces (APIs) to connect the world's genome repositories. In parallel, we are developing an open source software stack (ADAM) that uses these APIs. This combination will create a cohesive genome informatics ecosystem. Using containers, we are facilitating the deployment of this software in a diverse array of environments. Through benchmarking efforts and big data driver projects, we are ensuring ADAM's performance and utility.
Affiliation(s)
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Brian J Druker
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Stephen Friend
- Sage Bionetworks, Fairview Ave North, Seattle 98109, WA, USA
- Justin Guinney
- Sage Bionetworks, Fairview Ave North, Seattle 98109, WA, USA
- Nadine Gassner
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Mitchell Guttman
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- W James Kent
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Patrick Mantey
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA; Jack Baskin School of Engineering, University of California, Santa Cruz, CA, USA
- Adam A Margolin
- Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
- Matt Massie
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA
- Adam M Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Frank Nothaft
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA
- Lior Pachter
- Department of Mathematics, University of California Berkeley, Berkeley, CA, USA; Department of Molecular & Cellular Biology, University of California Berkeley, Berkeley, CA, USA
- David Patterson
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA
- Maciej Smuga-Otto
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Joshua M Stuart
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Laura Van't Veer
- Department of Laboratory Medicine, University of California, San Francisco, CA, USA
- Barbara Wold
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- David Haussler
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA; Howard Hughes Medical Institute, Bethesda, MD, USA
33
Emmert-Streib F, Dehmer M, Haibe-Kains B. Untangling statistical and biological models to understand network inference: the need for a genomics network ontology. Front Genet 2014; 5:299. [PMID: 25221572 PMCID: PMC4148777 DOI: 10.3389/fgene.2014.00299] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 06/28/2014] [Accepted: 08/12/2014] [Indexed: 12/31/2022] Open
Abstract
In this paper, we shed light on approaches that are currently used to infer networks from gene expression data with respect to their biological meaning. As we will show, the biological interpretation of these networks depends on the chosen theoretical perspective. For this reason, we distinguish a statistical perspective from a mathematical modeling perspective and elaborate on their differences and implications. Our results indicate the imperative need for a genomic network ontology in order to avoid increasing confusion about the biological interpretation of inferred networks, a confusion that can be further amplified by approaches that integrate multiple data sets or data types.
Affiliation(s)
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Matthias Dehmer
- Institute for Bioinformatics and Translational Research, UMIT, Hall in Tyrol, Austria
- Benjamin Haibe-Kains
- Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
34
Emmert-Streib F, Dehmer M, Haibe-Kains B. Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Front Cell Dev Biol 2014; 2:38. [PMID: 25364745 PMCID: PMC4207011 DOI: 10.3389/fcell.2014.00038] [Citation(s) in RCA: 108] [Impact Index Per Article: 10.8] [Received: 06/07/2014] [Accepted: 07/29/2014] [Indexed: 11/13/2022] Open
Abstract
In recent years gene regulatory networks (GRNs) have attracted a lot of interest and many methods have been introduced for their statistical inference from gene expression data. However, despite their popularity, GRNs are widely misunderstood. For this reason, we provide in this paper a general discussion and perspective of gene regulatory networks. Specifically, we discuss their meaning, the consistency among different network inference methods, ensemble methods, the assessment of GRNs, the estimated number of existing GRNs and their usage in different application domains. Furthermore, we discuss open questions and necessary steps in order to utilize gene regulatory networks in a clinical context and for personalized medicine.
Affiliation(s)
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Matthias Dehmer
- Institute for Bioinformatics and Translational Research, UMIT, Hall in Tyrol, Austria
- Benjamin Haibe-Kains
- Bioinformatics and Computational Genomics Laboratory, Department of Medical Biophysics, Princess Margaret Cancer Centre, University of Toronto, Canada
35
Haibe-Kains B, Emmert-Streib F. Quantitative assessment and validation of network inference methods in bioinformatics. Front Genet 2014; 5:221. [PMID: 25076966 PMCID: PMC4099936 DOI: 10.3389/fgene.2014.00221] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/17/2014] [Accepted: 06/26/2014] [Indexed: 11/30/2022] Open
Affiliation(s)
- Benjamin Haibe-Kains
- Bioinformatics and Computational Genomics, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Medical Biophysics Department, University of Toronto, Toronto, ON, Canada
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, Queen's University Belfast, Belfast, UK
36
Kennedy B, Kronenberg Z, Hu H, Moore B, Flygare S, Reese MG, Jorde LB, Yandell M, Huff C. Using VAAST to Identify Disease-Associated Variants in Next-Generation Sequencing Data. Curr Protoc Hum Genet 2014; 81:6.14.1-6.14.25. [PMID: 24763993 PMCID: PMC4137768 DOI: 10.1002/0471142905.hg0614s81] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 01/06/2023]
Abstract
The VAAST pipeline is specifically designed to identify disease-associated alleles in next-generation sequencing data. In the protocols presented in this paper, we outline the best practices for variant prioritization using VAAST. Examples and test data are provided for case-control, small pedigree, and large pedigree analyses. These protocols will teach users the fundamentals of VAAST, VAAST 2.0, and pVAAST analyses.
Affiliation(s)
- Brett Kennedy
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Zev Kronenberg
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Hao Hu
- Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas
- Barry Moore
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Steven Flygare
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Lynn B. Jorde
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Mark Yandell
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah
- Chad Huff
- Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas
37
Emmert-Streib F, de Matos Simoes R, Mullan P, Haibe-Kains B, Dehmer M. The gene regulatory network for breast cancer: integrated regulatory landscape of cancer hallmarks. Front Genet 2014; 5:15. [PMID: 24550935 PMCID: PMC3909882 DOI: 10.3389/fgene.2014.00015] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 12/06/2013] [Accepted: 01/15/2014] [Indexed: 12/22/2022] Open
Abstract
In this study, we infer the breast cancer gene regulatory network from gene expression data. This network is obtained from the application of the BC3Net inference algorithm to a large-scale gene expression data set consisting of 351 patient samples. In order to elucidate the functional relevance of the inferred network, we perform a Gene Ontology (GO) analysis for its structural components. Our analysis reveals that the most significant GO terms we find for the breast cancer network represent functional modules of biological processes that are described by known cancer hallmarks, including translation, immune response, cell cycle, organelle fission, mitosis, cell adhesion, RNA processing, RNA splicing and response to wounding. Furthermore, by using a curated list of census cancer genes, we find an enrichment in these functional modules. Finally, we study cooperative effects of chromosomes based on information of interacting genes in the breast cancer network. We find that chromosome 21 is most coactive with other chromosomes. To our knowledge this is the first study investigating the genome-scale breast cancer network.
Affiliation(s)
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Ricardo de Matos Simoes
- Computational Biology and Machine Learning Laboratory, Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Paul Mullan
- Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Benjamin Haibe-Kains
- Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Matthias Dehmer
- Institute for Bioinformatics and Translational Research, UMIT, Eduard Wallnoefer Zentrum 1, Hall in Tyrol, Austria
38
Li X, Fleetwood AD, Bayas C, Bilwes AM, Ortega DR, Falke JJ, Zhulin IB, Crane BR. The 3.2 Å resolution structure of a receptor: CheA:CheW signaling complex defines overlapping binding sites and key residue interactions within bacterial chemosensory arrays. Biochemistry 2013; 52:3852-65. [PMID: 23668907 PMCID: PMC3694592 DOI: 10.1021/bi400383e] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Indexed: 01/07/2023]
Abstract
Bacterial chemosensory arrays are composed of extended networks of chemoreceptors (also known as methyl-accepting chemotaxis proteins, MCPs), the histidine kinase CheA, and the adaptor protein CheW. Models of these arrays have been developed from cryoelectron microscopy, crystal structures of binary and ternary complexes, NMR spectroscopy, mutational data, and biochemical studies. A new 3.2 Å resolution crystal structure of a Thermotoga maritima MCP protein interaction region in complex with the CheA kinase-regulatory module (P4-P5) and adaptor protein CheW provides sufficient detail to define residue contacts at the interfaces formed among the three proteins. As in a previous 4.5 Å resolution structure, CheA-P5 and CheW interact through conserved hydrophobic surfaces at the ends of their β-barrels to form pseudo 6-fold symmetric rings in which the two proteins alternate around the circumference. The interface between P5 subdomain 1 and CheW subdomain 2 was anticipated from previous studies, whereas the related interface between CheW subdomain 1 and P5 subdomain 2 has only been observed in these ring assemblies. The receptor forms an unexpected structure in that the helical hairpin tip of each subunit has "unzipped" into a continuous α-helix; four such helices associate into a bundle, and the tetramers bridge adjacent P5-CheW rings in the lattice through interactions with both P5 and CheW. P5 and CheW each bind a receptor helix with a groove of conserved hydrophobic residues between subdomains 1 and 2. P5 binds the receptor helix N-terminal to the tip region (lower site), whereas CheW binds the same helix with inverted polarity near the bundle end (upper site). Sequence comparisons among different evolutionary classes of chemotaxis proteins show that the binding partners undergo correlated changes at key residue positions that involve the lower site. Such evolutionary analyses argue that both CheW and P5 bind to the receptor tip at overlapping positions.
Computational genomics further reveal that two distinct CheW proteins in Thermotogae utilize the analogous recognition motifs to couple different receptor classes to the same CheA kinase. Important residues for function previously identified by mutagenesis, chemical modification and biophysical approaches also map to these same interfaces. Thus, although the native CheW-receptor interaction is not observed in the present crystal structure, the bioinformatics and previous data predict key features of this interface. The companion study of the P5-receptor interface in native arrays (accompanying paper Piasta et al. (2013) Biochemistry, DOI: 10.1021/bi400385c) shows that, despite the non-native receptor fold in the present crystal structure, the local helix-in-groove contacts of the crystallographic P5-receptor interaction are present in native arrays and are essential for receptor regulation of kinase activity.
Affiliation(s)
- Xiaoxiao Li
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, United States
- Aaron D. Fleetwood
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States, and Department of Microbiology, University of Tennessee, Knoxville, TN 37996, United States
- Camille Bayas
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, United States
- Alexandrine M. Bilwes
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, United States
- Davi R. Ortega
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States, and Department of Microbiology, University of Tennessee, Knoxville, TN 37996, United States
- Igor B. Zhulin
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States, and Department of Microbiology, University of Tennessee, Knoxville, TN 37996, United States. To whom correspondence should be addressed: B.R.C., Tel (607) 254-8634; I.B.Z., Tel (865) 201-1860
- Brian R. Crane
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, United States. To whom correspondence should be addressed: B.R.C., Tel (607) 254-8634; I.B.Z., Tel (865) 201-1860
|
39
|
Abstract
Cloud computing services have emerged as a cost-effective alternative to cluster systems as the number of genomes and the computational power required to analyze them have increased in recent years. Here we introduce the Microsoft Azure platform, with detailed execution steps and a cost comparison with Amazon Web Services.
Affiliation(s)
- Insik Kim
- Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA; School of Electrical and Computer Engineering, Ulsan National Institute of Technology, Ulsan, Korea
|