51
Beck CR, Carvalho CMB, Akdemir ZC, Sedlazeck FJ, Song X, Meng Q, Hu J, Doddapaneni H, Chong Z, Chen ES, Thornton PC, Liu P, Yuan B, Withers M, Jhangiani SN, Kalra D, Walker K, English AC, Han Y, Chen K, Muzny DM, Ira G, Shaw CA, Gibbs RA, Hastings PJ, Lupski JR. Megabase Length Hypermutation Accompanies Human Structural Variation at 17p11.2. Cell 2019; 176:1310-1324.e10. [PMID: 30827684 PMCID: PMC6438178 DOI: 10.1016/j.cell.2019.01.045] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 06/18/2018] [Revised: 11/06/2018] [Accepted: 01/25/2019] [Indexed: 01/16/2023]
Abstract
DNA rearrangements resulting in human genome structural variants (SVs) are caused by diverse mutational mechanisms. We used long- and short-read sequencing technologies to investigate end products of de novo chromosome 17p11.2 rearrangements and query the molecular mechanisms underlying both recurrent and non-recurrent events. Evidence for an increased rate of clustered single-nucleotide variant (SNV) mutation in cis with non-recurrent rearrangements was found. Indel and SNV formation are associated with both copy-number gains and losses of 17p11.2, occur up to ∼1 Mb away from the breakpoint junctions, and favor C > G transversion substitutions; results suggest that single-stranded DNA is formed during the genesis of the SV and provide compelling support for a microhomology-mediated break-induced replication (MMBIR) mechanism for SV formation. Our data show an additional mutational burden of MMBIR consisting of hypermutation confined to the locus and manifesting as SNVs and indels predominantly within genes.
Affiliation(s)
- Christine R Beck
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Zeynep C Akdemir
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Xiaofei Song
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Qingchang Meng
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Jianhong Hu
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Zechen Chong
- Department of Genetics and the Informatics Institute, the University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Edward S Chen
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Philip C Thornton
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Pengfei Liu
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Bo Yuan
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Marjorie Withers
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Divya Kalra
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Adam C English
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Yi Han
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Ken Chen
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Donna M Muzny
- Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- Grzegorz Ira
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Chad A Shaw
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA
- Richard A Gibbs
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA; Human Genome Sequencing Center, BCM, Houston, TX 77030, USA
- P J Hastings
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA; Dan L. Duncan Comprehensive Cancer Center, BCM, Houston, TX 77030, USA.
- James R Lupski
- Department of Molecular and Human Genetics, BCM, Houston, TX 77030, USA; Human Genome Sequencing Center, BCM, Houston, TX 77030, USA; Department of Pediatrics, BCM, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA; Dan L. Duncan Comprehensive Cancer Center, BCM, Houston, TX 77030, USA.
52
Gabur I, Chawla HS, Snowdon RJ, Parkin IAP. Connecting genome structural variation with complex traits in crop plants. Theor Appl Genet 2019; 132:733-750. [PMID: 30448864 DOI: 10.1007/s00122-018-3233-0] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 08/15/2018] [Accepted: 11/07/2018] [Indexed: 05/05/2023]
Abstract
Structural genome variation is a major determinant of useful trait diversity. We describe how genome analysis methods are enabling discovery of trait-associated structural variants and their potential impact on breeding. As our understanding of complex crop genomes continues to grow, there is growing evidence that structural genome variation plays a major role in determining traits important for breeding and agriculture. Identifying the extent and impact of structural variants in crop genomes is becoming increasingly feasible with ongoing advances in the sophistication of genome sequencing technologies, particularly as it becomes easier to generate accurate long sequence reads on a genome-wide scale. In this article, we discuss the origins of structural genome variation in crops from ancient and recent genome duplication and polyploidization events and review high-throughput methods to assay such variants in crop populations in order to find associations with phenotypic traits. There is increasing evidence from such studies that gene presence-absence and copy number variation resulting from segmental chromosome exchanges may be at the heart of adaptive variation of crops to counter abiotic and biotic stress factors. We present examples from major crops that demonstrate the potential of pangenomic diversity as a key resource for future plant breeding for resilience and sustainability.
Affiliation(s)
- Iulian Gabur
- Department of Plant Breeding, Justus Liebig University, Heinrich-Buff-Ring 26-32, 35392, Giessen, Germany
- Harmeet Singh Chawla
- Department of Plant Breeding, Justus Liebig University, Heinrich-Buff-Ring 26-32, 35392, Giessen, Germany
- Rod J Snowdon
- Department of Plant Breeding, Justus Liebig University, Heinrich-Buff-Ring 26-32, 35392, Giessen, Germany.
- Isobel A P Parkin
- Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, S7N 0X2, Canada
53
Comprehensive structural variation genome map of individuals carrying complex chromosomal rearrangements. PLoS Genet 2019; 15:e1007858. [PMID: 30735495 PMCID: PMC6368290 DOI: 10.1371/journal.pgen.1007858] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 04/27/2018] [Accepted: 11/28/2018] [Indexed: 11/19/2022]
Abstract
Complex chromosomal rearrangements (CCRs) are rearrangements involving more than two chromosomes or more than two breakpoints. Whole genome sequencing (WGS) allows for outstanding high resolution characterization on the nucleotide level in unique sequences of such rearrangements, but problems remain for mapping breakpoints in repetitive regions of the genome, which are known to be prone to rearrangements. Hence, multiple complementary WGS experiments are sometimes needed to solve the structures of CCRs. We have studied three individuals with CCRs: Case 1 and Case 2 presented with de novo karyotypically balanced, complex interchromosomal rearrangements (46,XX,t(2;8;15)(q35;q24.1;q22) and 46,XY,t(1;10;5)(q32;p12;q31)), and Case 3 presented with a de novo, extremely complex intrachromosomal rearrangement on chromosome 1. Molecular cytogenetic investigation revealed cryptic deletions in the breakpoints of chromosome 2 and 8 in Case 1, and on chromosome 10 in Case 2, explaining their clinical symptoms. In Case 3, 26 breakpoints were identified using WGS, disrupting five known disease genes. All rearrangements were subsequently analyzed using optical maps, linked-read WGS, and short-read WGS. In conclusion, we present a case series of three unique de novo CCRs where we by combining the results from the different technologies fully solved the structure of each rearrangement. The power in combining short-read WGS with long-molecule sequencing or optical mapping in these unique de novo CCRs in a clinical setting is demonstrated.
54
Massonnet M, Morales-Cruz A, Minio A, Figueroa-Balderas R, Lawrence DP, Travadon R, Rolshausen PE, Baumgartner K, Cantu D. Whole-Genome Resequencing and Pan-Transcriptome Reconstruction Highlight the Impact of Genomic Structural Variation on Secondary Metabolite Gene Clusters in the Grapevine Esca Pathogen Phaeoacremonium minimum. Front Microbiol 2018; 9:1784. [PMID: 30150972 PMCID: PMC6099105 DOI: 10.3389/fmicb.2018.01784] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 01/23/2018] [Accepted: 07/16/2018] [Indexed: 12/30/2022]
Abstract
The Ascomycete fungus Phaeoacremonium minimum is one of the primary causal agents of Esca, a widespread and damaging grapevine trunk disease. Variation in virulence among Pm. minimum isolates has been reported, but the underlying genetic basis of the phenotypic variability remains unknown. The goal of this study was to characterize intraspecific genetic diversity and explore its potential impact on virulence functions associated with secondary metabolism, cellular transport, and cell wall decomposition. We generated a chromosome-scale genome assembly, using single molecule real-time sequencing, and resequenced the genomes and transcriptomes of multiple isolates to identify sequence and structural polymorphisms. Numerous insertion and deletion events were found for a total of about 1 Mbp in each isolate. Structural variation in this extremely gene dense genome frequently caused presence/absence polymorphisms of multiple adjacent genes, mostly belonging to biosynthetic clusters associated with secondary metabolism. Because of the observed intraspecific diversity in gene content due to structural variation we concluded that a transcriptome reference developed from a single isolate is insufficient to represent the virulence factor repertoire of the species. We therefore compiled a pan-transcriptome reference of Pm. minimum comprising a non-redundant set of 15,245 protein-coding sequences. Using naturally infected field samples expressing Esca symptoms, we demonstrated that mapping of meta-transcriptomics data on a multi-species reference that included the Pm. minimum pan-transcriptome allows the profiling of an expanded set of virulence factors, including variable genes associated with secondary metabolism and cellular transport.
Affiliation(s)
- Mélanie Massonnet
- Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Abraham Morales-Cruz
- Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Andrea Minio
- Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Rosa Figueroa-Balderas
- Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
- Daniel P. Lawrence
- Department of Plant Pathology, University of California, Davis, Davis, CA, United States
- Renaud Travadon
- Department of Plant Pathology, University of California, Davis, Davis, CA, United States
- Philippe E. Rolshausen
- Department of Botany and Plant Sciences, University of California, Riverside, Riverside, CA, United States
- Kendra Baumgartner
- Crops Pathology and Genetics Research Unit, Agricultural Research Service, United States Department of Agriculture, Davis, CA, United States
- Dario Cantu
- Department of Viticulture and Enology, University of California, Davis, Davis, CA, United States
55
Xia LC, Ai D, Lee H, Andor N, Li C, Zhang NR, Ji HP. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 2018; 7:5049476. [PMID: 29982625 PMCID: PMC6057526 DOI: 10.1093/gigascience/giy081] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/22/2018] [Revised: 05/22/2018] [Accepted: 06/26/2018] [Indexed: 11/29/2022]
Abstract
Background Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, there are only a few data simulators that provide structural variants in silico and even fewer that provide variants with different allelic fraction and haplotypes. Findings We developed SVEngine, an open-source tool to address this need. SVEngine simulates next-generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file, and/or a clonal phylogeny tree file (NEWICK) as input. Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs), and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine's flexible design process enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions, and translocations. Finally, SVEngine simulates sequence data that replicate the characteristics of a sequencing library with mixed sizes of DNA insert molecules. To improve the compute speed, SVEngine is highly parallelized to reduce the simulation time. Conclusions We demonstrated the versatile features of SVEngine and its improved runtime comparisons with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift, and neighboring hanging read pairs for representative variant types. 
Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use.
Affiliation(s)
- Li Charlie Xia
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Department of Statistics, the Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 18014
- Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, 30 Xueyuan Road, Haidian District, Beijing 100083 P. R. China
- Hojoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Noemi Andor
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Chao Li
- School of Mathematics and Physics, University of Science and Technology Beijing, 30 Xueyuan Road, Haidian District, Beijing 100083 P. R. China
- Nancy R Zhang
- Department of Statistics, the Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 18014
- Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305
- Stanford Genome Technology Center, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304
56
Barseghyan H, Délot EC, Vilain E. New technologies to uncover the molecular basis of disorders of sex development. Mol Cell Endocrinol 2018; 468:60-69. [PMID: 29655603 PMCID: PMC7249677 DOI: 10.1016/j.mce.2018.04.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 01/10/2018] [Revised: 04/06/2018] [Accepted: 04/06/2018] [Indexed: 02/04/2023]
Abstract
The elegant developmental biology experiments conducted in the 1940s by French physiologist Alfred Jost demonstrated that the sexual phenotype of a mammalian embryo depended on whether the embryonic gonad develops into a testis or not. In humans, anomalies in the processes that regulate development of chromosomal, gonadal or anatomic sex result in a spectrum of conditions termed Disorders/Differences of Sex Development (DSD). Each of these conditions is rare, and understanding of their genetic etiology is still incomplete. Historically, DSD diagnoses have been difficult to establish due to the lack of standardization of anatomical and endocrine phenotyping procedures as well as genetic testing. Yet, a definitive diagnosis is critical for optimal management of the medical and psychosocial challenges associated with DSD conditions. The advent in the clinical realm of next-generation sequencing methods, with constantly decreasing price and turnaround time, has revolutionized the diagnostic process. Here we review the successes and limitations of the genetic methods currently available for DSD diagnosis, including Sanger sequencing, karyotyping, exome sequencing and chromosomal microarrays. While exome sequencing provides higher diagnostic rates, many patients still remain undiagnosed. Newer approaches, such as whole-genome sequencing and whole-genome mapping, along with gene expression studies, have the potential to identify novel DSD-causing genes and significantly increase total diagnostic yield, hopefully shortening the patient's journey to an accurate diagnosis and enhancing health-related quality-of-life outcomes for patients and families.
Affiliation(s)
- Hayk Barseghyan
- Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA.
- Emmanuèle C Délot
- Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA.
- Eric Vilain
- Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA.
57
Zhu T, Hu Z, Rodriguez JC, Deal KR, Dvorak J, Vogel JP, Liu Z, Luo MC. Analysis of Brachypodium genomes with genome-wide optical maps. Genome 2018; 61:559-565. [PMID: 29883550 DOI: 10.1139/gen-2018-0013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/22/2022]
Abstract
Brachypodium distachyon (n = 5) is a diploid and has been widely used as a genetic model. Brachypodium stacei (n = 10) and B. hybridum (n = 15) are species that are related to B. distachyon, leading to a hypothesis that they are part of a polyploid series based on x = 5. Several lines of evidence suggest that this hypothesis is incorrect and that the genomes of the three taxa may have evolved by a more complex process. We constructed an optical whole-genome BioNano genome (BNG) map for each species and did pairwise alignment of the BNG maps. The maps showed that B. distachyon and B. stacei are both diploid, in spite of B. stacei having twice as many chromosomes as B. distachyon, and that B. hybridum is an allopolyploid formed from hybridization between B. distachyon and B. stacei. This study also demonstrated the use of BNG maps in the detection and quantification of structural variants among the genomes.
Affiliation(s)
- Tingting Zhu
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Zhaorong Hu
- Department of Plant Sciences, University of California, Davis, CA 95616, USA; State Key Laboratory for Agrobiotechnology, Key Laboratory of Crop Heterosis Utilization (MOE), China Agricultural University, Beijing, 100193, China
- Juan C Rodriguez
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Karin R Deal
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
- Jan Dvorak
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
- John P Vogel
- DOE Joint Genome Institute, 2800 Mitchell Dr., Walnut Creek, CA 94598, USA
- Zhiyong Liu
- State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- Ming-Cheng Luo
- Department of Plant Sciences, University of California, Davis, CA 95616, USA
58
Fang L, Hu J, Wang D, Wang K. NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data. BMC Bioinformatics 2018; 19:180. [PMID: 29792160 PMCID: PMC5966861 DOI: 10.1186/s12859-018-2207-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 08/08/2017] [Accepted: 05/15/2018] [Indexed: 01/23/2023]
Abstract
BACKGROUND Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. RESULTS In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5 to 94.1% for deletions and 87.9 to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. CONCLUSIONS Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
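The abstract above compares call sets by precision, recall, and F1 score. As a minimal sketch of how those three metrics relate, the following computes them from true-positive, false-positive, and false-negative counts. The function name and the example counts are hypothetical illustrations and are not taken from the NextSV paper or its code.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a call set scored against a truth set.

    true_pos:  calls that match a truth-set variant
    false_pos: calls with no matching truth-set variant
    false_neg: truth-set variants that were not called
    """
    called = true_pos + false_pos
    expected = true_pos + false_neg
    # Guard against empty call sets or empty truth sets.
    precision = true_pos / called if called else 0.0
    recall = true_pos / expected if expected else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a stringent call set can score a higher F1 than a sensitive one even when its recall is lower, which matches the trade-off the abstract describes.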
Affiliation(s)
- Li Fang
- Grandomics Biosciences, Beijing, 102206 China
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104 USA
- Jiang Hu
- Grandomics Biosciences, Beijing, 102206 China
- Depeng Wang
- Grandomics Biosciences, Beijing, 102206 China
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104 USA
- Previous address: Department of Biomedical Informatics and Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032 USA
59
Lan T, Lin H, Zhu W, Laurent TCAM, Yang M, Liu X, Wang J, Wang J, Yang H, Xu X, Guo X. Deep whole-genome sequencing of 90 Han Chinese genomes. Gigascience 2018; 6:1-7. [PMID: 28938720 PMCID: PMC5603764 DOI: 10.1093/gigascience/gix067] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 10/08/2016] [Accepted: 07/20/2017] [Indexed: 12/30/2022]
Abstract
Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (∼80×) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.
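The variant counts quoted in this abstract can be cross-checked with simple arithmetic. The sketch below only restates the numbers given above and confirms that the three novel low-frequency categories add up to the reported total. The variable names are illustrative, and the per-class shares are derived here for convenience; they are not figures reported by the authors.

```python
# Counts as quoted in the abstract (all variants identified in the 90 samples).
identified = {"SNP": 12_568_804, "InDel": 2_074_210, "SV": 26_142}

# Novel low-frequency variants (minor allele frequency < 5%), as quoted.
novel = {"SNP": 5_813_503, "InDel": 1_169_199, "SV": 17_927}
reported_novel_total = 7_000_629

# The three novel categories should sum to the reported total.
novel_total = sum(novel.values())
print("sum of novel categories:", novel_total)
print("matches reported total:", novel_total == reported_novel_total)

# Derived share of each class that is novel and low-frequency (not a reported figure).
for kind, count in identified.items():
    print(f"{kind}: {novel[kind] / count:.1%} of identified variants")
```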
Affiliation(s)
- Tianming Lan
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China
- Haoxiang Lin
- BGI Genomics, BGI-Shenzhen, Building NO. 7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen, 518083, China
- Wenjuan Zhu
- BGI Genomics, BGI-Shenzhen, Building NO. 7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen, 518083, China
- Tellier Christian Asker Melchior Laurent
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China; Department of Biology, University of Copenhagen, Nørregade 10, PO Box 2177 1017 Copenhagen, Denmark
- Mengcheng Yang
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China
- Xin Liu
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China
- Jun Wang
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China; Department of Biology, University of Copenhagen, Nørregade 10, PO Box 2177 1017 Copenhagen, Denmark
- Jian Wang
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China; James D. Watson Institute of Genome Sciences, 866 Yuhangtang Road, Hangzhou, Zhejiang Province, 310058, P. R. China
- Huanming Yang
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China; James D. Watson Institute of Genome Sciences, 866 Yuhangtang Road, Hangzhou, Zhejiang Province, 310058, P. R. China
- Xun Xu
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China
- Xiaosen Guo
- BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China; Department of Biology, University of Copenhagen, Nørregade 10, PO Box 2177 1017 Copenhagen, Denmark; Shenzhen Key Laboratory of Neurogenomics, BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China
60
Telenti A, Lippert C, Chang PC, DePristo M. Deep learning of genomic variation and regulatory network data. Hum Mol Genet 2018; 27:R63-R71. [PMID: 29648622 PMCID: PMC6499235 DOI: 10.1093/hmg/ddy115] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 01/09/2018] [Revised: 03/26/2018] [Accepted: 03/27/2018] [Indexed: 02/07/2023]
Abstract
The human genome is now investigated through high-throughput functional assays, and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g. deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of genome and exome data sets, modeling of population data and machine-learning strategies to solve complex genomic sequence regions. The review also portrays the rapid adoption of artificial intelligence/deep neural networks in genomics; in particular, deep learning approaches are well suited to model the complex dependencies in the regulatory landscape of the genome, and to provide predictors for genetic variant calling and interpretation.
Affiliation(s)
- Amalio Telenti
- Scripps Translational Science Institute, The Scripps Research Institute, La Jolla, CA 92037, USA
61
Whole Genome Sequencing of Greater Amberjack (Seriola dumerili) for SNP Identification on Aligned Scaffolds and Genome Structural Variation Analysis Using Parallel Resequencing. Int J Genomics 2018; 2018:7984292. [PMID: 29785397 PMCID: PMC5896239 DOI: 10.1155/2018/7984292] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 10/23/2017] [Revised: 01/04/2018] [Accepted: 01/14/2018] [Indexed: 01/30/2023]
Abstract
Greater amberjack (Seriola dumerili) is distributed in tropical and temperate waters worldwide and is an important aquaculture fish. We carried out de novo sequencing of the greater amberjack genome to construct a reference genome sequence to identify single nucleotide polymorphisms (SNPs) for breeding amberjack by marker-assisted or gene-assisted selection as well as to identify functional genes for biological traits. We obtained 200 times coverage and constructed a high-quality genome assembly using next generation sequencing technology. The assembled sequences were aligned onto a yellowtail (Seriola quinqueradiata) radiation hybrid (RH) physical map by sequence homology. A total of 215 of the longest amberjack sequences, with a total length of 622.8 Mbp (92% of the total length of the genome scaffolds), were lined up on the yellowtail RH map. We resequenced the whole genomes of 20 greater amberjacks and mapped the resulting sequences onto the reference genome sequence. About 186,000 nonredundant SNPs were successfully ordered on the reference genome. Further, we found differences in the genome structural variations between two greater amberjack populations using BreakDancer. We also analyzed the greater amberjack transcriptome and mapped the annotated sequences onto the reference genome sequence.
|
62
|
Next-Generation Sequencing and Mutational Analysis: Implications for Genes Encoding LINC Complex Proteins. Methods Mol Biol 2018; 1840:321-336. [PMID: 30141054 DOI: 10.1007/978-1-4939-8691-0_22] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/30/2022]
Abstract
Targeted panel, whole exome, or whole genome DNA sequencing using next-generation sequencing (NGS) allows for extensive high-throughput investigation of molecular machines/systems such as the LINC complex. This includes the identification of genetic variants in humans that cause disease, as is the case for some genes encoding LINC complex proteins. The relatively low cost and high speed of the sequencing process results in large datasets at various stages of analysis and interpretation. For those not intimately familiar with the process, interpretation of the data might prove challenging. This review lays out the most important and most commonly used materials and methods of NGS. It also discusses data analysis and potential pitfalls one might encounter because of peculiarities of the laboratory methodology or data analysis pipelines.
|
63
|
Enhancer adoption caused by genomic insertion elicits interdigital Shh expression and syndactyly in mouse. Proc Natl Acad Sci U S A 2017; 115:1021-1026. [PMID: 29255029 PMCID: PMC5798340 DOI: 10.1073/pnas.1713339115] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 12/16/2022]
Abstract
In this study, we reexamined an old mouse mutant named Hammer toe (Hm), which arose spontaneously almost a half century ago and exhibits a limb phenotype with webbing. We revealed that a 150-kb noncoding genomic fragment that was originally located in chromosome 14 has been inserted into a genomic region proximal to Sonic hedgehog (Shh), located in chromosome 5. This inserted fragment possesses enhancer activity to induce Shh expression in the interdigital regions in Hm, which in turn down-regulates bone morphogenetic protein signaling and eventually results in syndactyly and web formation. Since the donor fragment residing in chromosome 14 has enhancer activity to induce interdigital gene expression, the Hm mutation appears to be an archetypal case of enhancer adoption. Acquisition of new cis-regulatory elements (CREs) can cause alteration of developmental gene regulation and may introduce morphological novelty in evolution. Although structural variation in the genome generated by chromosomal rearrangement is one possible source of new CREs, only a few examples are known, except for cases of retrotransposition. In this study, we show the acquisition of novel regulatory sequences as a result of large genomic insertion in the spontaneous mouse mutation Hammer toe (Hm). Hm mice exhibit syndactyly with webbing, due to suppression of interdigital cell death in limb development. We reveal that, in the Hm genome, a 150-kb noncoding DNA fragment from chromosome 14 is inserted into the region upstream of the Sonic hedgehog (Shh) promoter in chromosome 5. Phenotyping of mouse embryos with a series of CRISPR/Cas9-aided partial deletion of the 150-kb insert clearly indicated that two different regions are necessary for the syndactyly phenotype of Hm. We found that each of the two regions contains at least one enhancer for interdigital regulation. 
These results show that a set of enhancers brought by the large genomic insertion elicits the interdigital Shh expression and the Hm phenotype. Transcriptome analysis indicates that ectopic expression of Shh up-regulates Chordin (Chrd) that antagonizes bone morphogenetic protein signaling in the interdigital region. Indeed, Chrd-overexpressing transgenic mice recapitulated syndactyly with webbing. Thus, the Hm mutation provides an insight into enhancer acquisition as a source of creation of novel gene regulation.
|
64
|
Li L, Leung AKY, Kwok TP, Lai YYY, Pang IK, Chung GTY, Mak ACY, Poon A, Chu C, Li M, Wu JJK, Lam ET, Cao H, Lin C, Sibert J, Yiu SM, Xiao M, Lo KW, Kwok PY, Chan TF, Yip KY. OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps. Genome Biol 2017; 18:230. [PMID: 29195502 PMCID: PMC5709945 DOI: 10.1186/s13059-017-1356-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 06/14/2017] [Accepted: 11/03/2017] [Indexed: 12/20/2022]
Abstract
We present a new method, OMSV, for accurately and comprehensively identifying structural variations (SVs) from optical maps. OMSV detects both homozygous and heterozygous SVs, SVs of various types and sizes, and SVs with or without creating or destroying restriction sites. We show that OMSV has high sensitivity and specificity, with clear performance gains over the latest method. Applying OMSV to a human cell line, we identified hundreds of SVs >2 kbp, with 68% of them missed by sequencing-based callers. Independent experimental validation confirmed the high accuracy of these SVs. The OMSV software is available at http://yiplab.cse.cuhk.edu.hk/omsv/.
Affiliation(s)
- Le Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Alden King-Yung Leung
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Tsz-Piu Kwok
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Yvonne Y Y Lai
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Iris K Pang
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Grace Tin-Yun Chung
- Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Angel C Y Mak
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Annie Poon
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Catherine Chu
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Menglu Li
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Jacob J K Wu
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Han Cao
- BioNano Genomics, San Diego, California, USA
- Chin Lin
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Justin Sibert
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, USA
- Siu-Ming Yiu
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Ming Xiao
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, USA
- Kwok-Wai Lo
- Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Pui-Yan Kwok
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA; Institute for Human Genetics, University of California San Francisco, San Francisco, California, USA
- Ting-Fung Chan
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
|
65
|
Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J, Korzelius J, de Bruijn E, Cuppen E, Talkowski ME, Marschall T, de Ridder J, Kloosterman WP. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat Commun 2017; 8:1326. [PMID: 29109544 PMCID: PMC5673902 DOI: 10.1038/s41467-017-01343-4] [Citation(s) in RCA: 233] [Impact Index Per Article: 33.3] [Received: 06/30/2017] [Accepted: 09/07/2017] [Indexed: 01/08/2023]
Abstract
Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline, NanoSV. We demonstrate that nanopore long reads are superior to short reads with regard to detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine the parental origin of all de novo chromothripsis breakpoints and to resolve the structure of these complex rearrangements. Additionally, genome-wide surveillance of inherited SVs reveals novel variants, missed in short-read data sets, a large proportion of which are retrotransposon insertions. We provide a first exploration of patient genome sequencing with a nanopore sequencer and demonstrate the value of long-read sequencing in mapping and phasing of SVs for both clinical and research applications.
Affiliation(s)
- Mircea Cretu Stancu
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Markus J van Roosmalen
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Ivo Renkens
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Marleen M Nieboer
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Sjors Middelkamp
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Joep de Ligt
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Giulia Pregno
- Medical Genetics Unit, Department of Clinical and Biological Sciences, University of Torino, Orbassano, 10043, Italy
- Daniela Giachino
- Medical Genetics Unit, Department of Clinical and Biological Sciences, University of Torino, Orbassano, 10043, Italy
- Giorgia Mandrile
- Medical Genetics Unit, Department of Clinical and Biological Sciences, University of Torino, Orbassano, 10043, Italy
- Jose Espejo Valle-Inclan
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Jerome Korzelius
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Ewart de Bruijn
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Edwin Cuppen
- Department of Genetics and Cancer Genomics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Michael E Talkowski
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, 02114, USA
- Department of Neurology, Harvard Medical School, Boston, MA, 02115, USA
- Program in Population and Medical Genetics and Stanley Center for Psychiatric Research, The Broad Institute of M.I.T. and Harvard, Cambridge, MA, 02142, USA
- Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123, Saarbrücken, Germany
- Max Planck Institute for Informatics, 66123, Saarbrücken, Germany
- Jeroen de Ridder
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
- Wigard P Kloosterman
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, 3584 CG, Utrecht, The Netherlands
|
66
|
Barseghyan H, Tang W, Wang RT, Almalvez M, Segura E, Bramble MS, Lipson A, Douine ED, Lee H, Délot EC, Nelson SF, Vilain E. Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis. Genome Med 2017; 9:90. [PMID: 29070057 PMCID: PMC5655859 DOI: 10.1186/s13073-017-0479-0] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 03/21/2017] [Accepted: 10/10/2017] [Indexed: 11/13/2022]
Abstract
Background: Massively parallel DNA sequencing, such as exome sequencing, has become a routine clinical procedure to identify pathogenic variants responsible for a patient’s phenotype. Exome sequencing has the capability of reliably identifying inherited and de novo single-nucleotide variants, small insertions, and deletions. However, due to the use of 100–300-bp fragment reads, this platform is not well powered to sensitively identify moderate to large structural variants (SV), such as insertions, deletions, inversions, and translocations. Methods: To overcome these limitations, we used next-generation mapping (NGM) to image high molecular weight double-stranded DNA molecules (megabase size) with fluorescent tags in nanochannel arrays for de novo genome assembly. We investigated the capacity of this NGM platform to identify pathogenic SV in a series of patients diagnosed with Duchenne muscular dystrophy (DMD), due to large deletions, insertion, and inversion involving the DMD gene. Results: We identified deletion, duplication, and inversion breakpoints within DMD. The sizes of deletions were in the range of 45–250 Kbp, whereas the one identified insertion was approximately 13 Kbp in size. This method refined the location of the break points within introns for cases with deletions compared to current polymerase chain reaction (PCR)-based clinical techniques. Heterozygous SV were detected in the known carrier mothers of the DMD patients, demonstrating the ability of the method to ascertain carrier status for large SV. The method was also able to identify a 5.1-Mbp inversion involving the DMD gene, previously identified by RNA sequencing. Conclusions: We showed the ability of NGM technology to detect pathogenic structural variants otherwise missed by PCR-based techniques or chromosomal microarrays. NGM is poised to become a new tool in the clinical genetic diagnostic strategy and research due to its ability to sensitively identify large genomic variations.
Affiliation(s)
- Hayk Barseghyan
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA
- Wilson Tang
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Richard T Wang
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Miguel Almalvez
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA
- Eva Segura
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Matthew S Bramble
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA
- Allen Lipson
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Emilie D Douine
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Hane Lee
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Emmanuèle C Délot
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Department of Pediatrics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA
- Stanley F Nelson
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Eric Vilain
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Department of Pediatrics, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA; Center for Genetic Medicine Research, Children's National Health System, Children's Research Institute, Washington, DC, 20010, USA
|
67
|
Hampton OA, English AC, Wang M, Salerno WJ, Liu Y, Muzny DM, Han Y, Wheeler DA, Worley KC, Lupski JR, Gibbs RA. SVachra: a tool to identify genomic structural variation in mate pair sequencing data containing inward and outward facing reads. BMC Genomics 2017; 18:691. [PMID: 28984202 PMCID: PMC5629590 DOI: 10.1186/s12864-017-4021-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/03/2022]
Abstract
Background: Characterization of genomic structural variation (SV) is essential to expanding the research and clinical applications of genome sequencing. Reliance upon short DNA fragment paired end sequencing has yielded a wealth of single nucleotide variants and internal sequencing read insertions-deletions, at the cost of limited SV detection. Multi-kilobase DNA fragment mate pair sequencing has supplemented the void in SV detection, but introduced new analytic challenges requiring SV detection tools specifically designed for mate pair sequencing data. Here, we introduce SVachra – Structural Variation Assessment of CHRomosomal Aberrations, a breakpoint calling program that identifies large insertions-deletions, inversions, inter- and intra-chromosomal translocations utilizing both inward and outward facing read types generated by mate pair sequencing.
Results: We demonstrate SVachra’s utility by executing the program on large-insert (Illumina Nextera) mate pair sequencing data from the personal genome of a single subject (HS1011). An additional data set of long-read (Pacific BioSciences RSII) was also generated to validate SV calls from SVachra and other comparison SV calling programs. SVachra exhibited the highest validation rate and reported the widest distribution of SV types and size ranges when compared to other SV callers. Conclusions: SVachra is a highly specific breakpoint calling program that exhibits a more unbiased SV detection methodology than other callers.
Affiliation(s)
- Oliver A Hampton
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Mark Wang
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- William J Salerno
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Yue Liu
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Yi Han
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- David A Wheeler
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- Kim C Worley
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
- James R Lupski
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Pediatrics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Texas Children's Hospital, 6621 Fanin Street, Houston, TX, 77030, USA
- Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, USA
|
68
|
Sedlazeck FJ, Dhroso A, Bodian DL, Paschall J, Hermes F, Zook JM. Tools for annotation and comparison of structural variation. F1000Res 2017; 6:1795. [PMID: 29123647 PMCID: PMC5668921 DOI: 10.12688/f1000research.12516.1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Accepted: 11/02/2017] [Indexed: 11/20/2022]
Abstract
The impact of structural variants (SVs) on a variety of organisms and diseases like cancer has become increasingly evident. Methods for SV detection when studying genomic differences across cells, individuals or populations are being actively developed. Currently, just a few methods are available to compare different SV callsets, and no specialized methods are available to annotate SVs that account for the unique characteristics of these variant types. Here, we introduce SURVIVOR_ant, a tool that compares types and breakpoints for candidate SVs from different callsets and enables fast comparison of SVs to genomic features such as genes and repetitive regions, as well as to previously established SV datasets such as from the 1000 Genomes Project. As proof of concept we compared 16 SV callsets generated by different SV calling methods on a single genome, the Genome in a Bottle sample HG002 (Ashkenazi son), and annotated the SVs with gene annotations, 1000 Genomes Project SV calls, and four different types of repetitive regions. Computation time to annotate 134,528 SVs with 33,954 annotations was 22 seconds on a laptop.
Affiliation(s)
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Andi Dhroso
- Worcester Polytechnic Institute, Worcester, MA, USA
- Dale L Bodian
- Inova Translational Medicine Institute, Inova Health System, Falls Church, VA, USA
- Justin M Zook
- Genome-scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD, USA
|
69
|
Jensen JM, Villesen P, Friborg RM, Mailund T, Besenbacher S, Schierup MH. Assembly and analysis of 100 full MHC haplotypes from the Danish population. Genome Res 2017; 27:1597-1607. [PMID: 28774965 PMCID: PMC5580718 DOI: 10.1101/gr.218891.116] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/25/2016] [Accepted: 07/21/2017] [Indexed: 01/05/2023]
Abstract
Genes in the major histocompatibility complex (MHC, also known as HLA) play a critical role in the immune response, and variation within the extended 4-Mb region shows association with major risks of many diseases. Yet, deciphering the underlying causes of these associations is difficult because the MHC is the most polymorphic region of the genome with a complex linkage disequilibrium structure. Here, we reconstruct full MHC haplotypes from de novo assembled trios without relying on a reference genome and perform evolutionary analyses. We report 100 full MHC haplotypes and call a large set of structural variants in the regions for future use in imputation with GWAS data. We also present the first complete analysis of the recombination landscape in the entire region and show how balancing selection at classical genes has linked effects on the frequency of variants throughout the region.
Affiliation(s)
- Jacob M Jensen
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark
- Palle Villesen
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark; Department of Clinical Medicine, Aarhus University, 8200 Aarhus N., Denmark
- Rune M Friborg
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark
- Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark
- Søren Besenbacher
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark; Department of Molecular Medicine, Aarhus University Hospital, Skejby, 8200 Aarhus N., Denmark
- Mikkel H Schierup
- Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C., Denmark; Department of Bioscience, Aarhus University, 8000 Aarhus C., Denmark
|
70
|
Ghurye J, Pop M, Koren S, Bickhart D, Chin CS. Scaffolding of long read assemblies using long range contact information. BMC Genomics 2017; 18:527. [PMID: 28701198 PMCID: PMC5508778 DOI: 10.1186/s12864-017-3879-z] [Citation(s) in RCA: 138] [Impact Index Per Article: 19.7] [Received: 01/28/2017] [Accepted: 06/20/2017] [Indexed: 11/28/2022]
Abstract
BACKGROUND: Long read technologies have revolutionized de novo genome assembly by generating contigs orders of magnitude longer than those of short read assemblies. Although assembly contiguity has increased, it usually does not reconstruct a full chromosome or an arm of the chromosome, resulting in an unfinished chromosome level assembly. To increase the contiguity of the assembly to the chromosome level, different strategies are used which exploit long range contact information between chromosomes in the genome. METHODS: We develop a scalable and computationally efficient scaffolding method that can boost the assembly contiguity to a large extent using genome-wide chromatin interaction data such as Hi-C. RESULTS: We demonstrate an algorithm that uses Hi-C data for longer-range scaffolding of de novo long read genome assemblies. We tested our methods on the human and goat genome assemblies. We compare our scaffolds with the scaffolds generated by LACHESIS based on various metrics. CONCLUSION: Our new algorithm SALSA produces more accurate scaffolds compared to the existing state of the art method LACHESIS.
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, 20742 College Park, Maryland, USA
- Mihai Pop
- Department of Computer Science, University of Maryland, 20742 College Park, Maryland, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 21702 Bethesda, Maryland, USA
- Derek Bickhart
- Cell Wall Biology and Utilization Research, US Dairy Forage Research Center, 53706 Madison, Wisconsin, USA
|
71
|
Abstract
Purpose: Current clinical genomics assays primarily utilize short-read sequencing (SRS), but SRS has limited ability to evaluate repetitive regions and structural variants. Long-read sequencing (LRS) has complementary strengths, and we aimed to determine whether LRS could offer a means to identify overlooked genetic variation in patients undiagnosed by SRS. Methods: We performed low-coverage genome LRS to identify structural variants in a patient who presented with multiple neoplasia and cardiac myxomata, in whom the results of targeted clinical testing and genome SRS were negative. Results: This LRS approach yielded 6,971 deletions and 6,821 insertions > 50 bp. Filtering for variants that are absent in an unrelated control and overlap a disease gene coding exon identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. RNA sequencing demonstrated decreased PRKAR1A expression. The deletion was classified as pathogenic based on guidelines for interpretation of sequence variants. Conclusion: This first successful application of genome LRS to identify a pathogenic variant in a patient suggests that LRS has significant potential for the identification of disease-causing structural variation. Larger studies will ultimately be required to evaluate the potential clinical utility of LRS.
|
72
|
Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2017; 17:333-51. [PMID: 27184599 PMCID: PMC10373632 DOI: 10.1038/nrg.2016.49] [Citation(s) in RCA: 2147] [Impact Index Per Article: 306.7] [Indexed: 12/11/2022]
Abstract
Since the completion of the human genome project in 2003, extraordinary progress has been made in genome sequencing technologies, which has led to a decreased cost per megabase and an increase in the number and diversity of sequenced genomes. An astonishing complexity of genome architecture has been revealed, bringing these sequencing technologies to even greater advancements. Some approaches maximize the number of bases sequenced in the least amount of time, generating a wealth of data that can be used to understand increasingly complex phenotypes. Alternatively, other approaches now aim to sequence longer contiguous pieces of DNA, which are essential for resolving structurally complex regions. These and other strategies are providing researchers and clinicians a variety of tools to probe genomes in greater depth, leading to an enhanced understanding of how genome sequence variants underlie phenotype and disease.
Affiliation(s)
- Sara Goodwin
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - John D McPherson
- Department of Biochemistry and Molecular Medicine; and the Comprehensive Cancer Center, University of California, Davis, California 95817, USA
|
73
|
Smith M. DNA Sequence Analysis in Clinical Medicine, Proceeding Cautiously. Front Mol Biosci 2017; 4:24. [PMID: 28516087 PMCID: PMC5413496 DOI: 10.3389/fmolb.2017.00024] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/08/2017] [Accepted: 04/07/2017] [Indexed: 12/03/2022]
Abstract
Delineation of underlying genomic and genetic factors in a specific disease may be valuable in establishing a definitive diagnosis and may guide patient management and counseling. In addition, genetic information may be useful in identification of at risk family members. Gene mapping and initial genome sequencing data enabled the development of microarrays to analyze genomic variants. The goal of this review is to consider different generations of sequencing techniques and their application to exome sequencing and whole genome sequencing and their clinical applications. In recent decades, exome sequencing has primarily been used in patient studies. Discussed in some detail, are important measures that have been developed to standardize variant calling and to assess pathogenicity of variants. Examples of cases where exome sequencing has facilitated diagnosis and led to improved medical management are presented. Whole genome sequencing and its clinical relevance are presented particularly in the context of analysis of nucleotide and structural genomic variants in large population studies and in certain patient cohorts. Applications involving analysis of cell free DNA in maternal blood for prenatal diagnosis of specific autosomal trisomies are reviewed. Applications of DNA sequencing to diagnosis and therapeutics of cancer are presented. Also discussed are important recent diagnostic applications of DNA sequencing in cancer, including analysis of tumor derived cell free DNA and exosomes that are present in body fluids. Insights gained into underlying pathogenetic mechanisms of certain complex common diseases, including schizophrenia, macular degeneration, neurodegenerative disease are presented. The relevance of different types of variants, rare, uncommon, and common to disease pathogenesis, and the continuum of causality, are addressed. 
Pharmacogenetic variants detected by DNA sequence analysis are gaining in importance and are particularly relevant to personalized and precision medicine.
Affiliation(s)
- Moyra Smith
- Genetics and Genomic Medicine, Pediatrics, School of Medicine, University of California, Irvine, CA, USA
|
74
|
Couldrey C, Keehan M, Johnson T, Tiplady K, Winkelman A, Littlejohn MD, Scott A, Kemper KE, Hayes B, Davis SR, Spelman RJ. Detection and assessment of copy number variation using PacBio long-read and Illumina sequencing in New Zealand dairy cattle. J Dairy Sci 2017; 100:5472-5478. [PMID: 28456410 DOI: 10.3168/jds.2016-12199] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 10/24/2016] [Accepted: 03/12/2017] [Indexed: 11/19/2022]
Abstract
Single nucleotide polymorphisms have been the DNA variant of choice for genomic prediction, largely because of the ease of single nucleotide polymorphism genotype collection. In contrast, structural variants (SV), which include copy number variants (CNV), translocations, insertions, and inversions, have eluded easy detection and characterization, particularly in nonhuman species. However, evidence increasingly shows that SV not only contribute a substantial proportion of genetic variation but also have significant influence on phenotypes. Here we present the discovery of CNV in a prominent New Zealand dairy bull using long-read PacBio (Pacific Biosciences, Menlo Park, CA) sequencing technology and the Sniffles SV discovery tool (version 0.0.1; https://github.com/fritzsedlazeck/Sniffles). The CNV identified from long reads were compared with CNV discovered in the same bull from Illumina sequencing using CNVnator (read depth-based tool; Illumina Inc., San Diego, CA) as a means of validation. Subsequently, further validation was undertaken using whole-genome Illumina sequencing of 556 cattle representing the wider New Zealand dairy cattle population. Very limited overlap was observed in CNV discovered from the 2 sequencing platforms, in part because of the differences in size of CNV detected. Only a few CNV were therefore able to be validated using this approach. However, the ability to use CNVnator to genotype the 557 cattle for copy number across all regions identified as putative CNV allowed a genome-wide assessment of transmission level of copy number based on pedigree. The more highly transmissible a putative CNV region was observed to be, the more likely the distribution of copy number was multimodal across the 557 sequenced animals. Furthermore, visual assessment of highly transmissible CNV regions provided evidence supporting the presence of CNV across the sequenced animals. 
This transmission-based approach was able to confirm a subset of CNV that segregates in the New Zealand dairy cattle population. Genome-wide identification and validation of CNV is an important step toward their inclusion in genomic selection strategies.
Affiliation(s)
- C Couldrey
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - M Keehan
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - T Johnson
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - K Tiplady
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - A Winkelman
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - M D Littlejohn
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - A Scott
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - K E Kemper
- Institute for Molecular Bioscience, University of Queensland, St Lucia 4072, Queensland, Australia
| | - B Hayes
- Centre for Animal Science, University of Queensland, St Lucia 4072, Queensland, Australia
| | - S R Davis
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
| | - R J Spelman
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand 3240
|
75
|
Sekizuka T, Kawanishi M, Ohnishi M, Shima A, Kato K, Yamashita A, Matsui M, Suzuki S, Kuroda M. Elucidation of quantitative structural diversity of remarkable rearrangement regions, shufflons, in IncI2 plasmids. Sci Rep 2017; 7:928. [PMID: 28424528 PMCID: PMC5430464 DOI: 10.1038/s41598-017-01082-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 11/21/2016] [Accepted: 03/20/2017] [Indexed: 12/30/2022]
Abstract
A multiple DNA inversion system, the shufflon, exists in incompatibility (Inc) I1 and I2 plasmids. The shufflon generates variants of the PilV protein, a minor component of the thin pilus. The shufflon is one of the most difficult regions for de novo genome assembly because of its structural diversity even in an isolated bacterial clone. We determined complete genome sequences, including those of IncI2 plasmids carrying mcr-1, of three Escherichia coli strains using single-molecule, real-time (SMRT) sequencing and Illumina sequencing. The sequences assembled using only SMRT sequencing contained misassembled regions in the shufflon. A hybrid analysis using SMRT and Illumina sequencing resolved the misassembled region and revealed that the three IncI2 plasmids, excluding the shufflon region, were highly conserved. Moreover, the abundance ratio of whole-shufflon structures could be determined by quantitative structural variation analysis of the SMRT data, suggesting that a remarkable heterogeneity of whole-shufflon structural variations exists in IncI2 plasmids. These findings indicate that remarkable rearrangement regions should be validated using both long-read and short-read sequencing data and that the structural variation of PilV in the shufflon might be closely related to phenotypic heterogeneity of plasmid-mediated transconjugation involved in horizontal gene transfer even in bacterial clonal populations.
Affiliation(s)
- Tsuyoshi Sekizuka
- Pathogen Genomics Center, National Institute of Infectious Diseases, 1-23-1 Toyama, Shinjyuku-ku, Tokyo, 162-8640, Japan
| | - Michiko Kawanishi
- Assay Division II, Bacterial Assay Section, National Veterinary Assay Laboratory, Ministry of Agriculture, Forestry and Fisheries, 1-15-1 Tokura, Kokubunji-shi, 185-8511, Tokyo, Japan
| | - Mamoru Ohnishi
- Ohnishi Laboratory of Veterinary Microbiology, 10-3-3 Nishirokujyouminami, Shibetsugunnakashibetsu-cho, 086-1106, Hokkaido, Japan
| | - Ayaka Shima
- Department of Bacteriology II, National Institute of Infectious Diseases, 4-7-1 Gakuen, Musashimurayama-shi, Tokyo, 208-0011, Japan
| | - Kengo Kato
- Pathogen Genomics Center, National Institute of Infectious Diseases, 1-23-1 Toyama, Shinjyuku-ku, Tokyo, 162-8640, Japan
| | - Akifumi Yamashita
- Pathogen Genomics Center, National Institute of Infectious Diseases, 1-23-1 Toyama, Shinjyuku-ku, Tokyo, 162-8640, Japan
| | - Mari Matsui
- Department of Bacteriology II, National Institute of Infectious Diseases, 4-7-1 Gakuen, Musashimurayama-shi, Tokyo, 208-0011, Japan
| | - Satowa Suzuki
- Department of Bacteriology II, National Institute of Infectious Diseases, 4-7-1 Gakuen, Musashimurayama-shi, Tokyo, 208-0011, Japan
| | - Makoto Kuroda
- Pathogen Genomics Center, National Institute of Infectious Diseases, 1-23-1 Toyama, Shinjyuku-ku, Tokyo, 162-8640, Japan
|
76
|
Chakravorty S, Hegde M. Gene and Variant Annotation for Mendelian Disorders in the Era of Advanced Sequencing Technologies. Annu Rev Genomics Hum Genet 2017; 18:229-256. [PMID: 28415856 DOI: 10.1146/annurev-genom-083115-022545] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Indexed: 11/09/2022]
Abstract
Comprehensive annotations of genetic and noncoding regions and corresponding accurate variant classification for Mendelian diseases are the next big challenge in the new genomic era of personalized medicine. Progress in the development of faster and more accurate pipelines for genome annotation and variant classification will lead to the discovery of more novel disease associations and candidate therapeutic targets. This ultimately will facilitate better patient recruitment in clinical trials. In this review, we describe the trends in research at the intersection of basic and clinical genomics that aims to increase understanding of overall genomic complexity, complex inheritance patterns of disease, and patient-phenotype-specific genomic associations. We describe the emerging field of translational functional genomics, which integrates other functional "-omics" approaches that support next-generation sequencing genomic data in order to facilitate personalized diagnostics, disease management, biomarker discovery, and medicine. We also discuss the utility of this integrated approach for diagnostic clinics and medical databases and its role in the future of personalized medicine.
Affiliation(s)
- Samya Chakravorty
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia 30322
| | - Madhuri Hegde
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia 30322
|
77
|
Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, Fulton RS, Kremitzki M, Magrini V, Markovic C, McGrath S, Steinberg KM, Auger K, Chow W, Collins J, Harden G, Hubbard T, Pelan S, Simpson JT, Threadgold G, Torrance J, Wood JM, Clarke L, Koren S, Boitano M, Peluso P, Li H, Chin CS, Phillippy AM, Durbin R, Wilson RK, Flicek P, Eichler EE, Church DM. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 2017; 27:849-864. [PMID: 28396521 PMCID: PMC5411779 DOI: 10.1101/gr.213611.116] [Citation(s) in RCA: 533] [Impact Index Per Article: 76.1] [Received: 07/29/2016] [Accepted: 03/14/2017] [Indexed: 11/24/2022]
Abstract
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
Affiliation(s)
- Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Tina Graves-Lindsay
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Kerstin Howe
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Nathan Bouk
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Hsiu-Chuan Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Paul A Kitts
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Derek Albracht
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Robert S Fulton
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Milinn Kremitzki
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Vincent Magrini
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Chris Markovic
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Sean McGrath
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | | | - Kate Auger
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - William Chow
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Joanna Collins
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Glenn Harden
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Timothy Hubbard
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Sarah Pelan
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Jared T Simpson
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Glen Threadgold
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - James Torrance
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Jonathan M Wood
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Sergey Koren
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | | | - Paul Peluso
- Pacific Biosciences, Menlo Park, California 94025, USA
| | - Heng Li
- Broad Institute, Cambridge, Massachusetts 02142, USA
| | | | - Adam M Phillippy
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Richard Durbin
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Richard K Wilson
- McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA.,Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
| | - Deanna M Church
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
78
|
Jain A, Dorfman KD. Simulations of knotting of DNA during genome mapping. Biomicrofluidics 2017; 11:024117. [PMID: 28798853 PMCID: PMC5533507 DOI: 10.1063/1.4979605] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 02/03/2017] [Accepted: 03/21/2017] [Indexed: 05/28/2023]
Abstract
Genome mapping involves the confinement of long DNA molecules, in excess of 150 kilobase pairs, in nanochannels near the circa 50 nm persistence length of DNA. The fidelity of the map relies on the assumption that the DNA is linearized by channel confinement, which assumes the absence of knots. We have computed the probability of forming different knot types and the size of these knots for long chains (approximately 164 kilobase pairs) via pruned-enriched Rosenbluth method simulations of a discrete wormlike chain model of DNA in channel sizes ranging from 35 nm to 60 nm. Compared to prior simulations of short DNA in similar confinement, these long molecules exhibit both complex knots, with up to seven crossings, and multiple knots per chain. The knotting probability is a very strong function of channel size, ranging from 0.3% to 60%, and rationalized in the context of Odijk's theory for confined semiflexible chains. Overall, the knotting probability and knot size obtained from these equilibrium measurements are not consistent with experimental measurements of the properties of anomalously bright regions along the DNA backbone during genome mapping experiments. This result suggests that these events in experiments are either knots formed during the processing of the DNA prior to injection into the nanochannel or regions of locally high DNA concentration without a topological constraint. If so, knots during genome mapping are not an intrinsic problem for genome mapping technology.
Affiliation(s)
- Aashish Jain
- Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, 421 Washington Ave. SE, Minneapolis, Minnesota 55455, USA
| | - Kevin D Dorfman
- Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, 421 Washington Ave. SE, Minneapolis, Minnesota 55455, USA
|
79
|
Abstract
Deciphering the genetic basis of human disease requires a comprehensive knowledge of genetic variants irrespective of their class or frequency. Although an impressive number of human genetic variants have been catalogued, a large fraction of the genetic difference that distinguishes two human genomes is still not understood at the base-pair level. This is because the emphasis has been on single-nucleotide variation as opposed to less tractable and more complex genetic variants, including indels and structural variants. The latter, we propose, will have a large impact on human phenotypes but require a more systematic assessment of genomes at deeper coverage and alternate sequencing and mapping technologies.
|
80
|
Pausch H, MacLeod IM, Fries R, Emmerling R, Bowman PJ, Daetwyler HD, Goddard ME. Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle. Genet Sel Evol 2017; 49:24. [PMID: 28222685 PMCID: PMC5320806 DOI: 10.1186/s12711-017-0301-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 11/14/2016] [Accepted: 02/14/2017] [Indexed: 12/11/2022]
Abstract
Background: The availability of dense genotypes and whole-genome sequence variants from various sources offers the opportunity to compile large datasets consisting of tens of thousands of individuals with genotypes at millions of polymorphic sites that may enhance the power of genomic analyses. The imputation of missing genotypes ensures that all individuals have genotypes for a shared set of variants. Results: We evaluated the accuracy of imputation from dense genotypes to whole-genome sequence variants in 249 Fleckvieh and 450 Holstein cattle using Minimac and FImpute. The sequence variants of a subset of the animals were reduced to the variants that were included on the Illumina BovineHD genotyping array and subsequently inferred in silico using either within- or multi-breed reference populations. The accuracy of imputation varied considerably across chromosomes and dropped at regions where the bovine genome contains segmental duplications. Depending on the imputation strategy, the correlation between imputed and true genotypes ranged from 0.898 to 0.952. The accuracy of imputation was higher with Minimac than FImpute particularly for variants with a low minor allele frequency. Using a multi-breed reference population increased the accuracy of imputation, particularly when FImpute was used to infer genotypes. When the sequence variants were imputed using Minimac, the true genotypes were more correlated to predicted allele dosages than best-guess genotypes. The computing costs to impute 23,256,743 sequence variants in 6958 animals were ten-fold higher with Minimac than FImpute. Association studies with imputed sequence variants revealed seven quantitative trait loci (QTL) for milk fat percentage. Two causal mutations in the DGAT1 and GHR genes were the most significantly associated variants at two QTL on chromosomes 14 and 20 when Minimac was used to infer genotypes.
Conclusions: The population-based imputation of millions of sequence variants in large cohorts is computationally feasible and provides accurate genotypes. However, the accuracy of imputation is low in regions where the genome contains large segmental duplications or the coverage with array-derived single nucleotide polymorphisms is poor. Using a reference population that includes individuals from many breeds increases the accuracy of imputation particularly at low-frequency variants. Considering allele dosages rather than best-guess genotypes as explanatory variables is advantageous to detect causal mutations in association studies with imputed sequence variants.
Affiliation(s)
- Hubert Pausch
- Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, 3083, Australia
| | - Iona M MacLeod
- Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, 3083, Australia
| | - Ruedi Fries
- Chair of Animal Breeding, Technische Universitaet Muenchen, 85354, Freising, Germany
| | - Reiner Emmerling
- Institute of Animal Breeding, Bavarian State Research Center for Agriculture, 85586, Grub, Germany
| | - Phil J Bowman
- Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, 3083, Australia.,School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Hans D Daetwyler
- Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, 3083, Australia.,School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Michael E Goddard
- Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, 3083, Australia.,Faculty of Veterinary and Agricultural Sciences, University of Melbourne, Melbourne, VIC, 3010, Australia
|
81
|
Huddleston J, Chaisson MJP, Steinberg KM, Warren W, Hoekzema K, Gordon D, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L, Peluso P, Boitano M, Chin CS, Korlach J, Wilson RK, Eichler EE. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res 2016; 27:677-685. [PMID: 27895111 PMCID: PMC5411763 DOI: 10.1101/gr.214007.116] [Citation(s) in RCA: 215] [Impact Index Per Article: 26.9] [Received: 08/01/2016] [Accepted: 11/15/2016] [Indexed: 01/07/2023]
Abstract
In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
Affiliation(s)
- John Huddleston
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Mark J P Chaisson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Karyn Meltz Steinberg
  - McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
- Wes Warren
  - McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- David Gordon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Tina A Graves-Lindsay
  - McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Zev N Kronenberg
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Laura Vives
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Paul Peluso
  - Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA
- Matthew Boitano
  - Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA
- Chen-Shin Chin
  - Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA
- Jonas Korlach
  - Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA
- Richard K Wilson
  - Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
|
82
|
Abstract
We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
|
83
|
Hoban S, Kelley JL, Lotterhos KE, Antolin MF, Bradburd G, Lowry DB, Poss ML, Reed LK, Storfer A, Whitlock MC. Finding the Genomic Basis of Local Adaptation: Pitfalls, Practical Solutions, and Future Directions. Am Nat 2016; 188:379-97. [PMID: 27622873 PMCID: PMC5457800 DOI: 10.1086/688018] [Citation(s) in RCA: 431] [Impact Index Per Article: 53.9] [Indexed: 12/15/2022]
Abstract
Uncovering the genetic and evolutionary basis of local adaptation is a major focus of evolutionary biology. The recent development of cost-effective methods for obtaining high-quality genome-scale data makes it possible to identify some of the loci responsible for adaptive differences among populations. Two basic approaches for identifying putatively locally adaptive loci have been developed and are broadly used: one that identifies loci with unusually high genetic differentiation among populations (differentiation outlier methods) and one that searches for correlations between local population allele frequencies and local environments (genetic-environment association methods). Here, we review the promises and challenges of these genome scan methods, including correcting for the confounding influence of a species' demographic history, biases caused by missing aspects of the genome, matching scales of environmental data with population structure, and other statistical considerations. In each case, we make suggestions for best practices for maximizing the accuracy and efficiency of genome scans to detect the underlying genetic basis of local adaptation. With attention to their current limitations, genome scan methods can be an important tool in finding the genetic basis of adaptive evolutionary change.
Affiliation(s)
- Sean Hoban
  - Morton Arboretum, Lisle, Illinois 60532; and National Institute for Mathematical and Biological Synthesis (NIMBioS), Knoxville, Tennessee 37966
- Joanna L. Kelley
  - School of Biological Sciences, Washington State University, Pullman, Washington 99164
- Katie E. Lotterhos
  - Department of Marine and Environmental Sciences, Northeastern University Marine Science Center, Nahant, Massachusetts 01908
- Michael F. Antolin
  - Department of Biology, Colorado State University, Fort Collins, Colorado 80523
- Gideon Bradburd
  - Museum of Vertebrate Zoology and Department of Environmental Science, Policy, and Management, University of California, Berkeley, California 94720
- David B. Lowry
  - Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824
- Mary L. Poss
  - Department of Biology and Veterinary and Biomedical Sciences, Penn State University, University Park, Pennsylvania 16802
- Laura K. Reed
  - Department of Biological Sciences, University of Alabama, Tuscaloosa, Alabama 35406
- Andrew Storfer
  - School of Biological Sciences, Washington State University, Pullman, Washington 99164
|
84
|
Du C, Pusey BN, Adams CJ, Lau CC, Bone WP, Gahl WA, Markello TC, Adams DR. Explorations to improve the completeness of exome sequencing. BMC Med Genomics 2016; 9:56. [PMID: 27568008 PMCID: PMC5002202 DOI: 10.1186/s12920-016-0216-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 11/30/2015] [Accepted: 08/05/2016] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Exome sequencing has advanced to clinical practice and proven useful for obtaining molecular diagnoses in rare diseases. In approximately 75% of cases, however, a clinical exome study does not produce a definitive molecular diagnosis. These residual cases comprise a new diagnostic challenge for the genetics community. The Undiagnosed Diseases Program of the National Institutes of Health routinely utilizes exome sequencing for refractory clinical cases. Our preliminary data suggest that disease-causing variants may be missed by current standard-of-care clinical exome analysis. Such false negatives reflect limitations in experimental design, technical performance, and data analysis. RESULTS: We present examples from our datasets to quantify the analytical performance associated with current practices, and explore strategies to improve the completeness of data analysis. In particular, we focus on patient ascertainment, exome capture, inclusion of intronic variants, and evaluation of medium-sized structural variants. CONCLUSIONS: The strategies we present may recover previously-missed, disease causing variants in second-pass exome analysis. Understanding the limitations of the current clinical exome search space provides a rational basis to improve methods for disease variant detection using genome-scale sequencing techniques.
Affiliation(s)
- Chen Du
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- Barbara N Pusey
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- Christopher J Adams
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- C Christopher Lau
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- William P Bone
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- William A Gahl
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- Thomas C Markello
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
- David R Adams
  - NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA.
|
85
|
Fawcett GL, Karina Eterovic A. Identification of Genomic Somatic Variants in Cancer: From Discovery to Actionability. Adv Clin Chem 2016; 78:123-162. [PMID: 28057186 DOI: 10.1016/bs.acc.2016.07.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 02/07/2023]
Abstract
The perfect method to discover and validate actionable somatic variants in cancer has not yet been developed, yet significant progress has been made toward this goal. There have been huge increases in the throughput and cost of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequencing technologies that have led to the burgeoning possibility of using sequencing data in clinical settings. Discovery of somatic mutations is relatively simple and has been improved recently due to laboratory methods optimization, bioinformatics algorithms development, and the expansion of various databases of population genomic information. Tiered systems of evidence evaluation are currently being used to classify genomic variants for clinicians to more rapidly and accurately determine actionability of these aberrations. These efforts are complicated by the intricacies of communicating sequencing results to physicians and supporting its biological relevance, emphasizing the need for increasing education of clinicians and administrators, and the ongoing development of ethical standards for dealing with incidental results. This chapter will focus on general aspects of DNA and RNA tumor sequencing technologies, data analysis and interpretation, assessment of biological and clinical relevance of genomic aberrations, ethical aspects of germline sequencing, and how these factors impact cancer personalized care.
Affiliation(s)
- G L Fawcett
  - Institute for Personalized Cancer Therapy (IPCT) at University of Texas M.D. Anderson Cancer Center, Houston, TX, United States
- A Karina Eterovic
  - Institute for Personalized Cancer Therapy (IPCT) at University of Texas M.D. Anderson Cancer Center, Houston, TX, United States.
|
86
|
Yuen RKC, Merico D, Cao H, Pellecchia G, Alipanahi B, Thiruvahindrapuram B, Tong X, Sun Y, Cao D, Zhang T, Wu X, Jin X, Zhou Z, Liu X, Nalpathamkalam T, Walker S, Howe JL, Wang Z, MacDonald JR, Chan A, D'Abate L, Deneault E, Siu MT, Tammimies K, Uddin M, Zarrei M, Wang M, Li Y, Wang J, Wang J, Yang H, Bookman M, Bingham J, Gross SS, Loy D, Pletcher M, Marshall CR, Anagnostou E, Zwaigenbaum L, Weksberg R, Fernandez BA, Roberts W, Szatmari P, Glazer D, Frey BJ, Ring RH, Xu X, Scherer SW. Genome-wide characteristics of de novo mutations in autism. NPJ Genom Med 2016; 1:160271-1602710. [PMID: 27525107 PMCID: PMC4980121 DOI: 10.1038/npjgenmed.2016.27] [Citation(s) in RCA: 150] [Impact Index Per Article: 18.8] [Indexed: 12/22/2022] Open
Abstract
De novo mutations (DNMs) are important in Autism Spectrum Disorder (ASD), but so far analyses have mainly been on the ~1.5% of the genome encoding genes. Here, we performed whole genome sequencing (WGS) of 200 ASD parent-child trios and characterized germline and somatic DNMs. We confirmed that the majority of germline DNMs (75.6%) originated from the father, and these increased significantly with paternal age only (p=4.2×10⁻¹⁰). However, when clustered DNMs (those within 20kb) were found in ASD, not only did they mostly originate from the mother (p=7.7×10⁻¹³), but they could also be found adjacent to de novo copy number variations (CNVs) where the mutation rate was significantly elevated (p=2.4×10⁻²⁴). By comparing DNMs detected in controls, we found a significant enrichment of predicted damaging DNMs in ASD cases (p=8.0×10⁻⁹; OR=1.84), of which 15.6% (p=4.3×10⁻³) and 22.5% (p=7.0×10⁻⁵) were in the non-coding or genic non-coding, respectively. The non-coding elements most enriched for DNM were untranslated regions of genes, boundaries involved in exon-skipping and DNase I hypersensitive regions. Using microarrays and a novel outlier detection test, we also found aberrant methylation profiles in 2/185 (1.1%) of ASD cases. These same individuals carried independently identified DNMs in the ASD risk- and epigenetic- genes DNMT3A and ADNP. Our data begins to characterize different genome-wide DNMs, and highlight the contribution of non-coding variants, to the etiology of ASD.
Affiliation(s)
- Ryan K C Yuen
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Daniele Merico
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Giovanna Pellecchia
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Babak Alipanahi
  - Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
- Bhooma Thiruvahindrapuram
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Xin Tong
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Yuhui Sun
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Tao Zhang
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Xueli Wu
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Xin Jin
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Ze Zhou
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Thomas Nalpathamkalam
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Susan Walker
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Jennifer L Howe
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Zhuozhi Wang
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Jeffrey R MacDonald
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Ada Chan
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Lia D'Abate
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Eric Deneault
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Michelle T Siu
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Kristiina Tammimies
  - Center of Neurodevelopmental Disorders (KIND), Pediatric Neuropsychiatry Unit, Karolinska Institutet, Stockholm, Sweden
- Mohammed Uddin
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Mehdi Zarrei
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Jun Wang
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Jian Wang
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Dion Loy
  - Google, Mountain View, California, USA
- Christian R Marshall
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Department of Molecular Genetics, Paediatric Laboratory Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada
- Evdokia Anagnostou
  - Bloorview Research Institute, University of Toronto, Toronto, Ontario, Canada
- Lonnie Zwaigenbaum
  - Department of Pediatrics, University of Alberta, Edmonton, Alberta, Canada
- Rosanna Weksberg
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Bridget A Fernandez
  - Disciplines of Genetics and Medicine, Memorial University of Newfoundland, St. John's, Newfoundland, Canada; Provincial Medical Genetic Program, Eastern Health, St. John's, Newfoundland, Canada
- Wendy Roberts
  - Autism Research Unit, The Hospital for Sick Children, Toronto, Ontario, Canada
- Peter Szatmari
  - Autism Research Unit, The Hospital for Sick Children, Toronto, Ontario, Canada; Child Youth and Family Services, Centre for Addiction and Mental Health, Toronto, Ontario, Canada; Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
- David Glazer
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Brendan J Frey
  - Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada; Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Xun Xu
  - BGI-Shenzhen, Yantian, Shenzhen, China
- Stephen W Scherer
  - The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada; Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada; McLaughlin Centre, University of Toronto, Toronto, Ontario, Canada
|
87
|
Yuan B, Neira J, Gu S, Harel T, Liu P, Briceño I, Elsea SH, Gómez A, Potocki L, Lupski JR. Nonrecurrent PMP22-RAI1 contiguous gene deletions arise from replication-based mechanisms and result in Smith-Magenis syndrome with evident peripheral neuropathy. Hum Genet 2016; 135:1161-74. [PMID: 27386852 DOI: 10.1007/s00439-016-1703-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/25/2016] [Accepted: 06/21/2016] [Indexed: 11/29/2022]
Abstract
Hereditary neuropathy with liability to pressure palsies (HNPP) and Smith-Magenis syndrome (SMS) are genomic disorders associated with deletion copy number variants involving chromosome 17p12 and 17p11.2, respectively. Nonallelic homologous recombination (NAHR)-mediated recurrent deletions are responsible for the majority of HNPP and SMS cases; the rearrangement products encompass the key dosage-sensitive genes PMP22 and RAI1, respectively, and result in haploinsufficiency for these genes. Less frequently, nonrecurrent genomic rearrangements occur at this locus. Contiguous gene duplications encompassing both PMP22 and RAI1, i.e., PMP22-RAI1 duplications, have been investigated, and replication-based mechanisms rather than NAHR have been proposed for these rearrangements. In the current study, we report molecular and clinical characterizations of six subjects with the reciprocal phenomenon of deletions spanning both genes, i.e., PMP22-RAI1 deletions. Molecular studies utilizing high-resolution array comparative genomic hybridization and breakpoint junction sequencing identified mutational signatures that were suggestive of replication-based mechanisms. Systematic clinical studies revealed features consistent with SMS, including features of intellectual disability, speech and gross motor delays, behavioral problems and ocular abnormalities. Five out of six subjects presented clinical signs and/or objective electrophysiologic studies of peripheral neuropathy. Clinical profiling may improve the clinical management of this unique group of subjects, as the peripheral neuropathy can be more severe or of earlier onset as compared to SMS patients having the common recurrent deletion. Moreover, the current study, in combination with the previous report of PMP22-RAI1 duplications, contributes to the understanding of rare complex phenotypes involving multiple dosage-sensitive genes from a genetic mechanistic standpoint.
Affiliation(s)
- Bo Yuan
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Juanita Neira
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Shen Gu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Tamar Harel
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Pengfei Liu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Ignacio Briceño
  - Instituto de Genética Humana, Facultad de Medicina, Pontificia Universidad Javeriana, Bogotá, Colombia
  - Instituto de Referencia Andino, Bogotá, Colombia
  - Facultad de Medicina, Universidad de La Sabana, Chía, Colombia
- Sarah H Elsea
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Alberto Gómez
  - Instituto de Genética Humana, Facultad de Medicina, Pontificia Universidad Javeriana, Bogotá, Colombia
  - Instituto de Referencia Andino, Bogotá, Colombia
- Lorraine Potocki
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children's Hospital, Houston, TX, 77030, USA
- James R Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Texas Children's Hospital, Houston, TX, 77030, USA.
|
88
|
Vembar SS, Seetin M, Lambert C, Nattestad M, Schatz MC, Baybayan P, Scherf A, Smith ML. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing. DNA Res 2016; 23:339-51. [PMID: 27345719 PMCID: PMC4991835 DOI: 10.1093/dnares/dsw022] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 11/24/2015] [Accepted: 05/10/2016] [Indexed: 01/03/2023] Open
Abstract
The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.
Affiliation(s)
- Shruthi Sridhar Vembar
  - Unité Biologie des Interactions Hôte-Parasite, Département de Parasites et Insectes Vecteurs, Institut Pasteur, Paris 75015, France
  - CNRS, ERL 9195, Paris 75015, France
  - INSERM, Unit U1201, Paris 75015, France
- Michael C Schatz
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Artur Scherf
  - Unité Biologie des Interactions Hôte-Parasite, Département de Parasites et Insectes Vecteurs, Institut Pasteur, Paris 75015, France
  - CNRS, ERL 9195, Paris 75015, France
  - INSERM, Unit U1201, Paris 75015, France
|
89
|
Xia LC, Sakshuwong S, Hopmans ES, Bell JM, Grimes SM, Siegmund DO, Ji HP, Zhang NR. A genome-wide approach for detecting novel insertion-deletion variants of mid-range size. Nucleic Acids Res 2016; 44:e126. [PMID: 27325742 PMCID: PMC5009736 DOI: 10.1093/nar/gkw481] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/12/2015] [Accepted: 05/15/2016] [Indexed: 11/14/2022] Open
Abstract
We present SWAN, a statistical framework for robust detection of genomic structural variants in next-generation sequencing data and an analysis of mid-range size insertion and deletions (<10 Kb) for whole genome analysis and DNA mixtures. To identify these mid-range size events, SWAN collectively uses information from read-pair, read-depth and one end mapped reads through statistical likelihoods based on Poisson field models. SWAN also uses soft-clip/split read remapping to supplement the likelihood analysis and determine variant boundaries. The accuracy of SWAN is demonstrated by in silico spike-ins and by identification of known variants in the NA12878 genome. We used SWAN to identify a series of novel set of mid-range insertion/deletion detection that were confirmed by targeted deep re-sequencing. An R package implementation of SWAN is open source and freely available.
Affiliation(s)
- Li C Xia
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 18014, USA
- Sukolsak Sakshuwong
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Erik S Hopmans
  - Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- John M Bell
  - Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- Susan M Grimes
  - Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- David O Siegmund
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Hanlee P Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Stanford Genome Technology Centre, Stanford University, Palo Alto, CA 94304, USA
- Nancy R Zhang
  - Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 18014, USA
|
90
|
Mason-Suares H, Landry L, Lebo MS. Detecting Copy Number Variation via Next Generation Technology. CURRENT GENETIC MEDICINE REPORTS 2016. [DOI: 10.1007/s40142-016-0091-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
|
91
|
Lupski JR. Clinical genomics: from a truly personal genome viewpoint. Hum Genet 2016; 135:591-601. [PMID: 27221143 DOI: 10.1007/s00439-016-1682-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/03/2016] [Accepted: 05/11/2016] [Indexed: 12/23/2022]
Abstract
The path to Clinical Genomics is punctuated by our understanding of what types of DNA structural and sequence variation contribute to disease, the many technical challenges to detect such variation genome-wide, and the initial struggles to interpret personal genome variation in the context of disease. This review describes one perspective of the development of clinical genomics; whereas the experimental challenges, and hurdles to overcoming them, might be deemed readily apparent, the non-technical issues for clinical implementation may be less obvious. Some of these latter challenges, including: (1) informed consent, (2) privacy, (3) what constitutes potentially pathogenic variation contributing to disease, (4) disease penetrance in populations, and (5) the genetic architecture of disease, and the struggles sometimes faced for solutions, are highlighted using illustrative examples.
Affiliation(s)
- James R Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, 604B, One Baylor Plaza, Houston, TX, 77030, USA.
  - Department of Pediatrics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
  - Texas Children's Hospital, Houston, TX, 77030, USA.
|
92
|
Friedrich SM, Zec HC, Wang TH. Analysis of single nucleic acid molecules in micro- and nano-fluidics. LAB ON A CHIP 2016; 16:790-811. [PMID: 26818700 PMCID: PMC4767527 DOI: 10.1039/c5lc01294e] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 05/24/2023]
Abstract
Nucleic acid analysis has enhanced our understanding of biological processes and disease progression, elucidated the association of genetic variants and disease, and led to the design and implementation of new treatment strategies. These diverse applications require analysis of a variety of characteristics of nucleic acid molecules: size or length, detection or quantification of specific sequences, mapping of the general sequence structure, full sequence identification, analysis of epigenetic modifications, and observation of interactions between nucleic acids and other biomolecules. Strategies that can detect rare or transient species, characterize population distributions, and analyze small sample volumes enable the collection of richer data from biosamples. Platforms that integrate micro- and nano-fluidic operations with high sensitivity single molecule detection facilitate manipulation and detection of individual nucleic acid molecules. In this review, we will highlight important milestones and recent advances in single molecule nucleic acid analysis in micro- and nano-fluidic platforms. We focus on assessment modalities for single nucleic acid molecules and highlight the role of micro- and nano-structures and fluidic manipulation. We will also briefly discuss future directions and the current limitations and obstacles impeding even faster progress toward these goals.
Affiliation(s)
- Sarah M Friedrich
  - Biomedical Engineering Department, Johns Hopkins University, Baltimore, MD 21218, USA.
- Helena C Zec
  - Mechanical Engineering Department, Johns Hopkins University, Baltimore, MD 21218, USA
- Tza-Huei Wang
  - Biomedical Engineering Department, Johns Hopkins University, Baltimore, MD 21218, USA.
  - Mechanical Engineering Department, Johns Hopkins University, Baltimore, MD 21218, USA
|
93
|
Guan P, Sung WK. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods 2016; 102:36-49. [PMID: 26845461 DOI: 10.1016/j.ymeth.2016.01.020] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 10/18/2015] [Revised: 01/09/2016] [Accepted: 01/31/2016] [Indexed: 12/11/2022] Open
Abstract
Structural variations (SVs) are mutations in the genome of size at least fifty nucleotides. They contribute to the phenotypic differences among healthy individuals, cause severe diseases and even cancers by breaking or linking genes. Thus, it is crucial to systematically profile SVs in the genome. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed due to the significant cost reduction of NGS experiments and their ability to unbiasedly detect SVs to the base-pair resolution. These SV detection methods vary in both sensitivity and specificity, since they use different SV-property-dependent and library-property-dependent features. As a result, predictions from different SV callers are often inconsistent. Besides, the noises in the data (both platform-specific sequencing error and artificial chimeric reads) impede the specificity of SV detection. Poorly characterized regions in the human genome (e.g., repeat regions) greatly impact the reads mapping and in turn affect the SV calling accuracy. Calling of complex SVs requires specialized SV callers. Apart from accuracy, processing speed of SV caller is another factor deciding its usability. Knowing the pros and cons of different SV calling techniques and the objectives of the biological study are essential for biologists and bioinformaticians to make informed decisions. This paper describes different components in the SV calling pipeline and reviews the techniques used by existing SV callers. Through simulation study, we also demonstrate that library properties, especially insert size, greatly impact the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies.
Affiliation(s)
- Peiyong Guan: School of Computing, National University of Singapore, 117543, Singapore
- Wing-Kin Sung: School of Computing, National University of Singapore, 117543, Singapore; Computational & Mathematical Biology Group, Genome Institute of Singapore, 138672, Singapore
94
Norris AL, Workman RE, Fan Y, Eshleman JR, Timp W. Nanopore sequencing detects structural variants in cancer. Cancer Biol Ther 2016; 17:246-53. [PMID: 26787508 PMCID: PMC4848001 DOI: 10.1080/15384047.2016.1139236] [Citation(s) in RCA: 96] [Impact Index Per Article: 12.0] [Received: 08/12/2015] [Revised: 12/08/2015] [Accepted: 01/01/2016] [Indexed: 11/21/2022]
Abstract
Despite advances in sequencing, structural variants (SVs) remain difficult to reliably detect due to the short read length (<300 bp) of 2nd generation sequencing. Not only do the reads (or paired-end reads) need to straddle a breakpoint, but repetitive elements often lead to ambiguities in the alignment of short reads. We propose to use the long-reads (up to 20 kb) possible with 3rd generation sequencing, specifically nanopore sequencing on the MinION. Nanopore sequencing relies on a similar concept to a Coulter counter, reading the DNA sequence from the change in electrical current resulting from a DNA strand being forced through a nanometer-sized pore embedded in a membrane. Though nanopore sequencing currently has a relatively high mismatch rate that precludes base substitution and small frameshift mutation detection, its accuracy is sufficient for SV detection because of its long reads. In fact, long reads in some cases may improve SV detection efficiency. We have tested nanopore sequencing to detect a series of well-characterized SVs, including large deletions, inversions, and translocations that inactivate the CDKN2A/p16 and SMAD4/DPC4 tumor suppressor genes in pancreatic cancer. Using PCR amplicon mixes, we have demonstrated that nanopore sequencing can detect large deletions, translocations and inversions at dilutions as low as 1:100, with as few as 500 reads per sample. Given the speed, small footprint, and low capital cost, nanopore sequencing could become the ideal tool for the low-level detection of cancer-associated SVs needed for molecular relapse, early detection, or therapeutic monitoring.
Affiliation(s)
- Alexis L. Norris: Departments of Pathology and Oncology, The Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Rachael E. Workman: Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Yunfan Fan: Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- James R. Eshleman: Departments of Pathology and Oncology, The Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Winston Timp: Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
95
Parikh H, Mohiyuddin M, Lam HYK, Iyer H, Chen D, Pratt M, Bartha G, Spies N, Losert W, Zook JM, Salit M. svclassify: a method to establish benchmark structural variant calls. BMC Genomics 2016; 17:64. [PMID: 26772178 PMCID: PMC4715349 DOI: 10.1186/s12864-016-2366-2] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 07/18/2015] [Accepted: 01/05/2016] [Indexed: 01/24/2023]
Abstract
Background: The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). High-quality benchmark small variant calls for the pilot National Institute of Standards and Technology (NIST) Reference Material (NA12878) have been developed by the Genome in a Bottle Consortium, but no similar high-quality benchmark SV calls exist for this genome. Since SV callers output highly discordant results, we developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (svclassify) calculates annotations from one or more aligned bam files from many high-throughput sequencing technologies, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives.
Results: We first used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions. We then used svclassify to cluster and classify these deletions as well as a set of high-confidence deletions from the 1000 Genomes Project and a set of breakpoint-resolved complex insertions from Spiral Genetics. We find that likely SVs cluster separately from likely non-SVs based on our annotations, and that the SVs cluster into different types of deletions. We then developed a supervised one-class classification method that uses a training set of random non-SV regions to determine whether candidate SVs have abnormal annotations different from most of the genome. To test this classification method, we use our pedigree-based breakpoint-resolved SVs, SVs validated by the 1000 Genomes Project, and assembly-based breakpoint-resolved insertions, along with semi-automated visualization using svviz.
Conclusions: We find that candidate SVs with high scores from multiple technologies have high concordance with PCR validation and an orthogonal consensus method MetaSV (99.7% concordant), and candidate SVs with low scores are questionable. We distribute a set of 2676 high-confidence deletions and 68 high-confidence insertions with high svclassify scores from these call sets for benchmarking SV callers. We expect these methods to be particularly useful for establishing high-confidence SV calls for benchmark samples that have been characterized by multiple technologies.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2366-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hemang Parikh: Genome-Scale Measurements Group, Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8313, Gaithersburg, MD, 20899, USA; Dakota Consulting Inc., 1110 Bonifant Street, Suite 310, Silver Spring, MD, 20910, USA
- Hugo Y K Lam: Bina Technologies, Roche Sequencing, Redwood City, CA, 94065, USA
- Hariharan Iyer: Statistical Engineering Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
- Desu Chen: Institute for Research in Electronics and Applied Physics, University of Maryland, College Park, MD, 20742, USA
- Mark Pratt: Personalis Inc., 1350 Willow Road, Suite 202, Menlo Park, CA, 94025, USA
- Gabor Bartha: Personalis Inc., 1350 Willow Road, Suite 202, Menlo Park, CA, 94025, USA
- Noah Spies: Genome-Scale Measurements Group, Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8313, Gaithersburg, MD, 20899, USA; Department of Pathology, Stanford University, Stanford, CA, USA
- Wolfgang Losert: Institute for Research in Electronics and Applied Physics, University of Maryland, College Park, MD, 20742, USA
- Justin M Zook: Genome-Scale Measurements Group, Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8313, Gaithersburg, MD, 20899, USA
- Marc Salit: Genome-Scale Measurements Group, Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8313, Gaithersburg, MD, 20899, USA; Bioengineering Department, Stanford University, Stanford, CA, USA
96
Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, Merrill RM, Joron M, Mallet J, Dasmahapatra KK, Jiggins CD. Major Improvements to the Heliconius melpomene Genome Assembly Used to Confirm 10 Chromosome Fusion Events in 6 Million Years of Butterfly Evolution. G3 (Bethesda) 2016; 6:695-708. [PMID: 26772750 PMCID: PMC4777131 DOI: 10.1534/g3.115.023655] [Citation(s) in RCA: 95] [Impact Index Per Article: 11.9] [Received: 10/14/2015] [Accepted: 01/06/2016] [Indexed: 12/30/2022]
Abstract
The Heliconius butterflies are a widely studied adaptive radiation of 46 species spread across Central and South America, several of which are known to hybridize in the wild. Here, we present a substantially improved assembly of the Heliconius melpomene genome, developed using novel methods that should be applicable to improving other genome assemblies produced using short read sequencing. First, we whole-genome-sequenced a pedigree to produce a linkage map incorporating 99% of the genome. Second, we incorporated haplotype scaffolds extensively to produce a more complete haploid version of the draft genome. Third, we incorporated ∼20x coverage of Pacific Biosciences sequencing, and scaffolded the haploid genome using an assembly of this long-read sequence. These improvements result in a genome of 795 scaffolds, 275 Mb in length, with an N50 length of 2.1 Mb, an N50 number of 34, and with 99% of the genome placed, and 84% anchored on chromosomes. We use the new genome assembly to confirm that the Heliconius genome underwent 10 chromosome fusions since the split with its sister genus Eueides, over a period of about 6 million yr.
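The two N50 figures quoted in the abstract above can be illustrated with a minimal sketch. It assumes the common definition: sort scaffolds from longest to shortest, accumulate lengths until at least half of the total assembly length is covered, then report the length of the last scaffold added (N50 length) and how many scaffolds were needed (N50 number, also called L50). The function name `n50_stats` and the toy lengths are illustrative only and are not taken from the paper.

```python
def n50_stats(lengths):
    """Return (N50 length, N50 number) for a list of scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    # Longest scaffolds first, so the running sum reaches 50% as early as possible.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the shortest scaffold needed to cover half the assembly;
            # 'count' is how many scaffolds it took to get there.
            return length, count


# Toy assembly (hypothetical lengths, arbitrary units): total = 300, half = 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50_length, n50_number = n50_stats(toy_lengths)
print(n50_length, n50_number)
```

Read this way, the reported N50 length of 2.1 Mb with an N50 number of 34 means that the 34 longest scaffolds, each at least 2.1 Mb long, together cover at least half of the 275 Mb assembly.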
Affiliation(s)
- John W Davey: Department of Zoology, University of Cambridge, CB2 3EJ, United Kingdom
- Mathieu Chouteau: Centre d'Ecologie Fonctionnelle et Evolutive, UMR 5175 CNRS - EPHE - Université de Montpellier - Université Paul Valéry, 34293 Montpellier 5, France
- Sarah L Barker: Department of Zoology, University of Cambridge, CB2 3EJ, United Kingdom
- Luana Maroja: Department of Biology, Williams College, Williamstown, Massachusetts, 01267
- Simon W Baxter: School of Biological Sciences, University of Adelaide, SA 5005 Australia
- Fraser Simpson: Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, WC1E 6BT, United Kingdom
- Mathieu Joron: Centre d'Ecologie Fonctionnelle et Evolutive, UMR 5175 CNRS - EPHE - Université de Montpellier - Université Paul Valéry, 34293 Montpellier 5, France
- James Mallet: Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, WC1E 6BT, United Kingdom
- Kanchon K Dasmahapatra: Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, WC1E 6BT, United Kingdom
- Chris D Jiggins: Department of Zoology, University of Cambridge, CB2 3EJ, United Kingdom
97
Yuan B, Harel T, Gu S, Liu P, Burglen L, Chantot-Bastaraud S, Gelowani V, Beck C, Carvalho C, Cheung S, Coe A, Malan V, Munnich A, Magoulas P, Potocki L, Lupski J. Nonrecurrent 17p11.2p12 Rearrangement Events that Result in Two Concomitant Genomic Disorders: The PMP22-RAI1 Contiguous Gene Duplication Syndrome. Am J Hum Genet 2015; 97:691-707. [PMID: 26544804 DOI: 10.1016/j.ajhg.2015.10.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 06/30/2015] [Accepted: 10/05/2015] [Indexed: 12/31/2022]
Abstract
The genomic duplication associated with Potocki-Lupski syndrome (PTLS) maps in close proximity to the duplication associated with Charcot-Marie-Tooth disease type 1A (CMT1A). PTLS is characterized by hypotonia, failure to thrive, reduced body weight, intellectual disability, and autistic features. CMT1A is a common autosomal dominant distal symmetric peripheral polyneuropathy. The key dosage-sensitive genes RAI1 and PMP22 are respectively associated with PTLS and CMT1A. Recurrent duplications accounting for the majority of subjects with these conditions are mediated by nonallelic homologous recombination between distinct low-copy repeat (LCR) substrates. The LCRs flanking a contiguous genomic interval encompassing both RAI1 and PMP22 do not share extensive homology; thus, duplications encompassing both loci are rare and potentially generated by a different mutational mechanism. We characterized genomic rearrangements that simultaneously duplicate PMP22 and RAI1, including nine potential complex genomic rearrangements, in 23 subjects by high-resolution array comparative genomic hybridization and breakpoint junction sequencing. Insertions and microhomologies were found at the breakpoint junctions, suggesting potential replicative mechanisms for rearrangement formation. At the breakpoint junctions of these nonrecurrent rearrangements, enrichment of repetitive DNA sequences was observed, indicating that they might predispose to genomic instability and rearrangement. Clinical evaluation revealed blended PTLS and CMT1A phenotypes with a potential earlier onset of neuropathy. Moreover, additional clinical findings might be observed due to the extra duplicated material included in the rearrangements. Our genomic analysis suggests replicative mechanisms as a predominant mechanism underlying PMP22-RAI1 contiguous gene duplications and provides further evidence supporting the role of complex genomic architecture in genomic instability.
98
Rhoads A, Au KF. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 2015; 13:278-89. [PMID: 26542840 PMCID: PMC4678779 DOI: 10.1016/j.gpb.2015.08.002] [Citation(s) in RCA: 1162] [Impact Index Per Article: 129.1] [Received: 05/28/2015] [Revised: 08/06/2015] [Accepted: 08/11/2015] [Indexed: 12/15/2022]
Abstract
Single-molecule, real-time sequencing developed by Pacific BioSciences offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well-suited for unsolved problems in genome, transcriptome, and epigenetics research. The highly-contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its ability to sequence full-length transcripts or fragments with significant lengths. Additionally, PacBio’s sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. In addition to using PacBio sequencing alone, many hybrid sequencing strategies have been developed to make use of more accurate short reads in conjunction with PacBio long reads. In general, hybrid sequencing strategies are more affordable and scalable especially for small-size laboratories than using PacBio Sequencing alone. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone.
Affiliation(s)
- Anthony Rhoads: Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA
- Kin Fai Au: Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA; Department of Internal Medicine, University of Iowa, Iowa City, IA 52242, USA
99
Genome-Wide Structural Variation Detection by Genome Mapping on Nanochannel Arrays. Genetics 2015; 202:351-62. [PMID: 26510793 PMCID: PMC4701098 DOI: 10.1534/genetics.115.183483] [Citation(s) in RCA: 88] [Impact Index Per Article: 9.8] [Received: 10/05/2014] [Accepted: 10/28/2015] [Indexed: 01/06/2023]
Abstract
Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against that derived from the reference human genome, and identified structural variations that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential of disrupting gene function or regulation.
100
Mu JC, Tootoonchi Afshar P, Mohiyuddin M, Chen X, Li J, Bani Asadi N, Gerstein MB, Wong WH, Lam HYK. Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods. Sci Rep 2015; 5:14493. [PMID: 26412485 PMCID: PMC4585973 DOI: 10.1038/srep14493] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 06/19/2015] [Accepted: 08/28/2015] [Indexed: 11/09/2022]
Abstract
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
Affiliation(s)
- John C. Mu: Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA
- Xi Chen: Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Jian Li: Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA
- Mark B. Gerstein: Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Wing H. Wong: Department of Statistics, Stanford University, Stanford, CA 94305, USA; Department of Health Research and Policy, Stanford University, Stanford, CA 94305, USA
- Hugo Y. K. Lam: Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA