201
Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, Hussain M, Phillips AD, Cooper DN. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017. [PMID: 28349240 DOI: 10.1007/s00439-017-1779-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/01/2022]
Abstract
The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that underlie, or are closely associated with, human inherited disease. At the time of writing (March 2017), the database contained in excess of 203,000 different gene lesions identified in over 8000 genes manually curated from over 2600 journals. With new mutation entries currently accumulating at a rate exceeding 17,000 per annum, HGMD represents de facto the central unified gene/disease-oriented repository of heritable mutations causing human genetic disease used worldwide by researchers, clinicians, diagnostic laboratories and genetic counsellors, and is an essential tool for the annotation of next-generation sequencing data. The public version of HGMD (http://www.hgmd.org) is freely available to registered users from academic institutions and non-profit organisations whilst the subscription version (HGMD Professional) is available to academic, clinical and commercial users under license via QIAGEN Inc.
Affiliation(s)
- Peter D Stenson: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Matthew Mort: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Edward V Ball: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Katy Evans: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Matthew Hayden: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Sally Heywood: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Michelle Hussain: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Andrew D Phillips: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- David N Cooper: School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
202
Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, Hussain M, Phillips AD, Cooper DN. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017; 136:665-677. [PMID: 28349240 PMCID: PMC5429360 DOI: 10.1007/s00439-017-1779-6] [Citation(s) in RCA: 905] [Impact Index Per Article: 129.3] [Received: 02/24/2017] [Accepted: 03/14/2017] [Indexed: 02/06/2023]
203
Moustafa A, Xie C, Kirkness E, Biggs W, Wong E, Turpaz Y, Bloom K, Delwart E, Nelson KE, Venter JC, Telenti A. The blood DNA virome in 8,000 humans. PLoS Pathog 2017; 13:e1006292. [PMID: 28328962 PMCID: PMC5378407 DOI: 10.1371/journal.ppat.1006292] [Citation(s) in RCA: 199] [Impact Index Per Article: 28.4] [Received: 12/08/2016] [Revised: 04/03/2017] [Accepted: 03/14/2017] [Indexed: 02/06/2023] [Open Access]
Abstract
The characterization of the blood virome is important for the safety of blood-derived transfusion products, and for the identification of emerging pathogens. We explored non-human sequence data from whole-genome sequencing of blood from 8,240 individuals, none of whom were ascertained for any infectious disease. Viral sequences were extracted from the pool of sequence reads that did not map to the human reference genome. Analyses sifted through close to 1 Petabyte of sequence data and performed 0.5 trillion similarity searches. With a lower bound for identification of 2 viral genomes/100,000 cells, we mapped sequences to 94 different viruses, including sequences from 19 human DNA viruses, proviruses and RNA viruses (herpesviruses, anelloviruses, papillomaviruses, three polyomaviruses, adenovirus, HIV, HTLV, hepatitis B, hepatitis C, parvovirus B19, and influenza virus) in 42% of the study participants. Of possible relevance to transfusion medicine, we identified Merkel cell polyomavirus in 49 individuals, papillomavirus in blood of 13 individuals, parvovirus B19 in 6 individuals, and the presence of herpesvirus 8 in 3 individuals. The presence of DNA sequences from two RNA viruses was unexpected: the hepatitis C virus sequence is indicative of an integration event, while the influenza virus sequence resulted from immunization with a DNA vaccine. Age, sex and ancestry contributed significantly to the prevalence of infection. The remaining 75 viruses mostly reflect extensive contamination of commercial reagents and contamination from the environment. These technical problems represent a major challenge for the identification of novel human pathogens. Increasing availability of human whole-genome sequences will contribute substantial amounts of data on the composition of the normal and pathogenic human blood virome. Distinguishing contaminants from real human viruses is challenging. Novel sequencing technologies offer insight into the virome in human samples.
Here, we identify the viral DNA sequences in blood of over 8,000 individuals undergoing whole genome sequencing. This approach serves to identify 94 viruses; however, many are shown to reflect widespread DNA contamination of commercial reagents or to be of environmental origin. While this significantly limits the reliable identification of novel viruses infecting humans, we could confidently detect sequences and quantify abundance of 19 human viruses in 42% of individuals. Ancestry, sex, and age were important determinants of viral prevalence. This large study calls attention to the challenge of interpreting next generation sequencing data for the identification of novel viruses. However, it serves to categorize the abundance of human DNA viruses using an unbiased technique.
Affiliation(s)
- Ahmed Moustafa: Human Longevity Inc., San Diego, California, United States of America
- Chao Xie: Human Longevity Singapore Pte. Ltd., Singapore
- Ewen Kirkness: Human Longevity Inc., San Diego, California, United States of America
- William Biggs: Human Longevity Inc., San Diego, California, United States of America
- Emily Wong: Human Longevity Inc., San Diego, California, United States of America
- Kenneth Bloom: Human Longevity Inc., San Diego, California, United States of America
- Eric Delwart: Blood Systems Research Institute, Department of Laboratory Medicine, University of California San Francisco, San Francisco, California, United States of America
- Karen E. Nelson: J. Craig Venter Institute, La Jolla, California, United States of America
- J. Craig Venter (corresponding author): Human Longevity Inc., San Diego, California, United States of America; J. Craig Venter Institute, La Jolla, California, United States of America
- Amalio Telenti (corresponding author): Human Longevity Inc., San Diego, California, United States of America; J. Craig Venter Institute, La Jolla, California, United States of America
204
Whole-genome sequencing identifies common-to-rare variants associated with human blood metabolites. Nat Genet 2017; 49:568-578. [PMID: 28263315 DOI: 10.1038/ng.3809] [Citation(s) in RCA: 268] [Impact Index Per Article: 38.3] [Received: 06/29/2016] [Accepted: 02/10/2017] [Indexed: 02/07/2023]
Abstract
Genetic factors modifying the blood metabolome have been investigated through genome-wide association studies (GWAS) of common genetic variants and through exome sequencing. We conducted a whole-genome sequencing study of common, low-frequency and rare variants to associate genetic variations with blood metabolite levels using comprehensive metabolite profiling in 1,960 adults. We focused the analysis on 644 metabolites with consistent levels across three longitudinal data collections. Genetic sequence variations at 101 loci were associated with the levels of 246 (38%) metabolites (P ≤ 1.9 × 10⁻¹¹). We identified 113 (10.7%) among 1,054 unrelated individuals in the cohort who carried heterozygous rare variants likely influencing the function of 17 genes. Thirteen of the 17 genes are associated with inborn errors of metabolism or other pediatric genetic conditions. This study extends the map of loci influencing the metabolome and highlights the importance of heterozygous rare variants in determining abnormal blood metabolic phenotypes in adults.
205
Diversity in non-repetitive human sequences not found in the reference genome. Nat Genet 2017; 49:588-593. [PMID: 28250455 DOI: 10.1038/ng.3801] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 10/10/2016] [Accepted: 02/03/2017] [Indexed: 12/15/2022]
Abstract
Genomes usually contain some non-repetitive sequences that are missing from the reference genome and occur only in a population subset. Such non-repetitive, non-reference (NRNR) sequences have remained largely unexplored in terms of their characterization and downstream analyses. Here we describe 3,791 breakpoint-resolved NRNR sequence variants called using PopIns from whole-genome sequence data of 15,219 Icelanders. We found that over 95% of the 244 NRNR sequences that are 200 bp or longer are present in chimpanzees, indicating that they are ancestral. Furthermore, 149 variant loci are in linkage disequilibrium (r² > 0.8) with a genome-wide association study (GWAS) catalog marker, suggesting disease relevance. Additionally, we report an association (P = 3.8 × 10⁻⁸, odds ratio (OR) = 0.92) with myocardial infarction (23,360 cases, 300,771 controls) for a 766-bp NRNR sequence variant. Our results underline the importance of including variation of all complexity levels when searching for variants that associate with disease.
206
Allendorf FW. Genetics and the conservation of natural populations: allozymes to genomes. Mol Ecol 2017; 26:420-430. [DOI: 10.1111/mec.13948] [Citation(s) in RCA: 180] [Impact Index Per Article: 25.7] [Received: 10/17/2016] [Accepted: 11/28/2016] [Indexed: 12/14/2022]
Affiliation(s)
- Fred W. Allendorf: Division of Biological Sciences, University of Montana, Missoula, MT 59812, USA
207
Freedman JE, Miano JM. Challenges and Opportunities in Linking Long Noncoding RNAs to Cardiovascular, Lung, and Blood Diseases. Arterioscler Thromb Vasc Biol 2016; 37:21-25. [PMID: 27856459 DOI: 10.1161/atvbaha.116.308513] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 09/25/2016] [Accepted: 11/04/2016] [Indexed: 01/16/2023]
Abstract
The new millennium heralds an unanticipated surge of genomic information, most notably an expansive class of long noncoding RNAs (lncRNAs). These transcripts, which now outnumber all protein-coding genes, often exhibit the same characteristics as mRNAs (RNA polymerase II-dependent, 5' methyl-capped, multiexonic, polyadenylated); yet, they do not encode stable, well-conserved proteins. Elucidating the function of all relevant lncRNAs in heart, vasculature, lung, and blood is essential for generating a complete interactome in these tissues. This is particularly evident because an increasing number of investigators perform RNA-sequencing experiments where, typically, annotated lncRNAs exhibit impressive changes in gene expression. How does one go about evaluating an lncRNA when the sequence of the transcript lends no insight into how it may function within a cell type? Here, we provide a brief overview for the rational study of lncRNAs.
Affiliation(s)
- Jane E Freedman: Memorial Heart and Vascular Center, University of Massachusetts Medical School, Worcester
- Joseph M Miano: Aab Cardiovascular Research Institute, University of Rochester School of Medicine and Dentistry, NY
208
209
It takes a genome to understand a village: Population scale precision medicine. Proc Natl Acad Sci U S A 2016; 113:12344-12346. [PMID: 27791179 DOI: 10.1073/pnas.1615329113] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022] [Open Access]
210
Mao Q, Ciotlos S, Zhang RY, Ball MP, Chin R, Carnevali P, Barua N, Nguyen S, Agarwal MR, Clegg T, Connelly A, Vandewege W, Zaranek AW, Estep PW, Church GM, Drmanac R, Peters BA. The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes. Gigascience 2016; 5:42. [PMID: 27724973 PMCID: PMC5057367 DOI: 10.1186/s13742-016-0148-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 05/20/2016] [Accepted: 09/19/2016] [Indexed: 02/01/2023] [Open Access]
Abstract
Background: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information.
Findings: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics’ Long Fragment Read technology. Here, we present the experimental whole genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics’ standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphisms database or the 1000 Genomes Project Phase 3 data.
Conclusions: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function.
Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0148-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Qing Mao: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Serban Ciotlos: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Rebecca Yu Zhang: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Madeleine P Ball: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA; PersonalGenomes.org, 423 Brookline Avenue, #323, Boston, MA, 02215, USA
- Robert Chin: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Paolo Carnevali: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Nina Barua: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Staci Nguyen: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Misha R Agarwal: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA
- Tom Clegg: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA; Curoverse Inc., 212 Elm St, 3rd Floor, Somerville, MA, 02144, USA
- Abram Connelly: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA; Curoverse Inc., 212 Elm St, 3rd Floor, Somerville, MA, 02144, USA
- Ward Vandewege: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA; Curoverse Inc., 212 Elm St, 3rd Floor, Somerville, MA, 02144, USA
- Alexander Wait Zaranek: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA; Curoverse Inc., 212 Elm St, 3rd Floor, Somerville, MA, 02144, USA
- Preston W Estep: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA
- George M Church: Harvard Personal Genome Project, Harvard Medical School, NRB 238, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA
- Radoje Drmanac: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA; BGI-Shenzhen, Shenzhen, 518083, China
- Brock A Peters: Complete Genomics, Inc., 2071 Stierlin Ct., Mountain View, CA, 94043, USA; BGI-Shenzhen, Shenzhen, 518083, China
211
Popitsch N, Schuh A, Taylor JC. ReliableGenome: annotation of genomic regions with high/low variant calling concordance. Bioinformatics 2016; 33:155-160. [PMID: 27605105 PMCID: PMC5903559 DOI: 10.1093/bioinformatics/btw587] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/27/2016] [Revised: 08/12/2016] [Accepted: 09/04/2016] [Indexed: 12/30/2022] [Open Access]
Abstract
Motivation: The increasing adoption of clinical whole-genome resequencing (WGS) demands highly accurate and reproducible variant calling (VC) methods. The observed discordance between state-of-the-art VC pipelines, however, indicates that the current practice still suffers from non-negligible numbers of false positive and negative SNV and INDEL calls that were shown to be enriched among discordant calls but also in genomic regions with low sequence complexity.
Results: Here, we describe our method ReliableGenome (RG) for partitioning genomes into high and low concordance regions with respect to a set of surveyed VC pipelines. Our method combines call sets derived by multiple pipelines from arbitrary numbers of datasets and interpolates expected concordance for genomic regions without data. By applying RG to 219 deep human WGS datasets, we demonstrate that VC concordance depends predominantly on genomic context rather than the actual sequencing data, which manifests in high recurrence of regions that can/cannot be reliably genotyped by a single method. This enables the application of pre-computed regions to other data created with comparable sequencing technology and software. RG outperforms comparable efforts in predicting VC concordance and false positive calls in low-concordance regions, which underlines its usefulness for variant filtering, annotation and prioritization. RG allows focusing resource-intensive algorithms (e.g. consensus calling methods) on the smaller, discordant share of the genome (20–30%), which might result in increased overall accuracy at reasonable costs. Our method and analysis of discordant calls may further be useful for development, benchmarking and optimization of VC algorithms and for the relative comparison of call sets between different studies/pipelines.
Availability and Implementation: RG was implemented in Java; source code and binaries are freely available for non-commercial use at https://github.com/popitsch/wtchg-rg/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Niko Popitsch: Wellcome Trust Centre of Human Genetics, University of Oxford, Oxford OX3 7BN, UK; National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, The Churchill Hospital, Old Road OX3 7LE, UK
- Anna Schuh: National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, The Churchill Hospital, Old Road OX3 7LE, UK; Department of Oncology, University of Oxford, Oxford OX3 7DQ, UK
- Jenny C Taylor: Wellcome Trust Centre of Human Genetics, University of Oxford, Oxford OX3 7BN, UK; National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, The Churchill Hospital, Old Road OX3 7LE, UK