1
Kersten O, Star B, Krabberød AK, Atmore LM, Tørresen OK, Anker-Nilssen T, Descamps S, Strøm H, Johansson US, Sweet PR, Jakobsen KS, Boessenkool S. Hybridization of Atlantic puffins in the Arctic coincides with 20th-century climate change. Science Advances 2023; 9:eadh1407. [PMID: 37801495 PMCID: PMC10558128 DOI: 10.1126/sciadv.adh1407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 09/06/2023] [Indexed: 10/08/2023]
Abstract
The Arctic is experiencing the fastest rates of global warming, leading to shifts in the distribution of its biota and increasing the potential for hybridization. However, genomic evidence of recent hybridization events in the Arctic remains unexpectedly rare. Here, we use whole-genome sequencing of contemporary and 122-year-old historical specimens to investigate the origin of an Arctic hybrid population of Atlantic puffins (Fratercula arctica) on Bjørnøya, Norway. We show that the hybridization between the High Arctic, large-bodied subspecies F. a. naumanni and the temperate, smaller-sized subspecies F. a. arctica began as recently as six generations ago due to an unexpected southward range expansion of F. a. naumanni. Moreover, we find a significant temporal loss of genetic diversity across Arctic and temperate puffin populations. Our observations provide compelling genomic evidence of the impacts of recent distributional shifts and loss of diversity in Arctic communities during the 20th century.
Affiliation(s)
- Oliver Kersten
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Bastiaan Star
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Anders K. Krabberød
  - Section for Genetics and Evolutionary Biology (Evogene), Department of Biosciences, University of Oslo, Oslo, Norway
- Lane M. Atmore
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Ole K. Tørresen
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Hallvard Strøm
  - Norwegian Polar Institute, Fram Centre, Langnes, Tromsø, Norway
- Paul R. Sweet
  - American Museum of Natural History, New York, NY, USA
- Kjetill S. Jakobsen
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Sanne Boessenkool
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
2
Zivanovic A, Miller J, Munro S, Knutson T, Li Y, Passow C, Simonaitis P, Lynch M, Oseth L, Zhao S, Feng F, Wikström P, Corey E, Morrissey C, Henzler C, Raphael B, Dehm S. Co-evolution of AR gene copy number and structural complexity in endocrine therapy resistant prostate cancer. NAR Cancer 2023; 5:zcad045. [PMID: 37636316 PMCID: PMC10448862 DOI: 10.1093/narcan/zcad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 07/17/2023] [Accepted: 08/09/2023] [Indexed: 08/29/2023] Open
Abstract
Androgen receptor (AR) inhibition is standard of care for advanced prostate cancer (PC). However, efficacy is limited by progression to castration-resistant PC (CRPC), usually due to AR re-activation via mechanisms that include AR amplification and structural rearrangement. These two classes of AR alterations often co-occur in CRPC tumors, but it is unclear whether this reflects intercellular or intracellular heterogeneity of AR. Resolving this is important for developing new therapies and predictive biomarkers. Here, we analyzed 41 CRPC tumors and 6 patient-derived xenografts (PDXs) using linked-read DNA-sequencing, and identified 7 tumors that developed complex, multiply-rearranged AR gene structures in conjunction with very high AR copy number. Analysis of PDX models by optical genome mapping and fluorescence in situ hybridization showed that AR residing on extrachromosomal DNA (ecDNA) was an underlying mechanism, and was associated with elevated levels and diversity of AR expression. This study identifies co-evolution of AR gene copy number and structural complexity via ecDNA as a mechanism associated with endocrine therapy resistance.
Affiliation(s)
- Andrej Zivanovic
  - Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
- Jeffrey T Miller
  - Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, USA
- Sarah A Munro
  - Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, USA
- Todd P Knutson
  - Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, USA
- Yingming Li
  - Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
- Courtney N Passow
  - University of Minnesota Genomics Center, University of Minnesota, Minneapolis, MN, USA
- Pijus Simonaitis
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Molly Lynch
  - Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
- LeAnn Oseth
  - Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
- Shuang G Zhao
  - Department of Human Oncology, University of Wisconsin-Madison, Madison, WI, USA
  - Carbone Cancer Center, University of Wisconsin-Madison, Madison, WI, USA
  - William S. Middleton Memorial Veterans Hospital, Madison, WI, USA
- Felix Y Feng
  - Departments of Radiation Oncology, Urology, and Medicine, University of California San Francisco, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California at San Francisco, San Francisco, CA, USA
- Pernilla Wikström
  - Department of Medical Biosciences, Pathology, Umeå University, Umeå, Sweden
- Eva Corey
  - Department of Urology, University of Washington, Seattle, WA, USA
- Colm Morrissey
  - Department of Urology, University of Washington, Seattle, WA, USA
- Christine Henzler
  - Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, USA
- Benjamin J Raphael
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Scott M Dehm
  - Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
  - Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN, USA
  - Department of Urology, University of Minnesota, Minneapolis, MN, USA
3
Groza C, Chen X, Pacis A, Simon MM, Pramatarova A, Aracena KA, Pastinen T, Barreiro LB, Bourque G. Genome graphs detect human polymorphisms in active epigenomic state during influenza infection. Cell Genomics 2023; 3:100294. [PMID: 37228750 PMCID: PMC10203048 DOI: 10.1016/j.xgen.2023.100294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 07/26/2022] [Accepted: 03/09/2023] [Indexed: 05/27/2023]
Abstract
Genetic variants, including mobile element insertions (MEIs), are known to impact the epigenome. We hypothesized that genome graphs, which encapsulate genetic diversity, could reveal missing epigenomic signals. To test this, we sequenced the epigenome of monocyte-derived macrophages from 35 ancestrally diverse individuals before and after influenza infection, allowing us to investigate the role of MEIs in immunity. We characterized genetic variants and MEIs using linked reads and built a genome graph. Mapping epigenetic data revealed 2.3%-3% novel peaks for H3K4me1, H3K27ac chromatin immunoprecipitation sequencing (ChIP-seq), and ATAC-seq. Additionally, the use of a genome graph modified some quantitative trait loci estimates and revealed 375 polymorphic MEIs in an active epigenomic state. Among these is an AluYh3 polymorphism whose chromatin state changed after infection and was associated with the expression of TRIM25, a gene that restricts influenza RNA synthesis. Our results demonstrate that graph genomes can reveal regulatory regions that would have been overlooked by other approaches.
Affiliation(s)
- Cristian Groza
  - Quantitative Life Sciences, McGill University, Montréal, QC, Canada
- Xun Chen
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Alain Pacis
  - Canadian Centre for Computational Genomics, McGill University, Montréal, QC, Canada
- Marie-Michelle Simon
  - Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Albena Pramatarova
  - Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
- Tomi Pastinen
  - Genomic Medicine Center, Children’s Mercy Hospital and Research Institute, Kansas City, MO, USA
- Luis B. Barreiro
  - Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
  - Committee on Immunology, University of Chicago, Chicago, IL, USA
- Guillaume Bourque
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
  - Canadian Centre for Computational Genomics, McGill University, Montréal, QC, Canada
  - Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada
  - Human Genetics, McGill University, Montréal, QC, Canada
4
Mortensen Ó, Thomsen E, Lydersen LN, Apol KD, Weihe P, Steig BÁ, Andorsdóttir G, Als TD, Gregersen NO. FarGen: Elucidating the distribution of coding variants in the isolated population of the Faroe Islands. Eur J Hum Genet 2023; 31:329-337. [PMID: 36404349 PMCID: PMC9995356 DOI: 10.1038/s41431-022-01227-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/17/2022] [Revised: 09/30/2022] [Accepted: 10/27/2022] [Indexed: 11/22/2022] Open
Abstract
Here we present results from FarGen Phase I exomes. This dataset is based on the FarGen cohort, which consists of 1,541 individuals from the isolated population of the Faroe Islands. The purpose of this cohort is to serve as a reference catalog of coding variants, and to conduct population genetic studies to better understand the genetic contribution to various diseases in the Faroese population. The first whole-exome data set comprises 465 individuals, in whom a total of 148,267 genetic variants were discovered. Principal component analysis indicates that the population is isolated and weakly structured. The distribution of variants in various functional classes was compared with populations in the gnomAD dataset; the results indicated that the proportions were consistent across the cohorts, but, probably due to the small sample size, the FarGen dataset contained relatively few rare variants. We identified 19 variants that are classified as pathogenic or likely pathogenic in ClinVar; several of these variants are associated with monogenic diseases with increased prevalence in the Faroe Islands. The results support previous studies, which indicate that the Faroe Islands is an isolated and weakly structured population. Future studies may elucidate the significance of the 19 pathogenic variants that were identified. The FarGen Phase I dataset is an important step for genetic research in the Faroese population, and the next phase of FarGen will increase the sample size and broaden the scope.
Affiliation(s)
- Ólavur Mortensen
  - The Genetic Biobank of the Faroe Islands, Tórshavn, Faroe Islands
- Elisabet Thomsen
  - The Genetic Biobank of the Faroe Islands, Tórshavn, Faroe Islands
- Katrin D Apol
  - The Genetic Biobank of the Faroe Islands, Tórshavn, Faroe Islands
- Pál Weihe
  - Department of Occupational Medicine and Public Health, National Hospital of the Faroe Islands, Tórshavn, Faroe Islands
- Bjarni Á Steig
  - Medical Department, National Hospital of the Faroe Islands, Tórshavn, Faroe Islands
- Guðrið Andorsdóttir
  - The Genetic Biobank of the Faroe Islands, Tórshavn, Faroe Islands
  - Centre of Health Science, Faculty of Health, University of the Faroe Islands, Tórshavn, Faroe Islands
- Thomas D Als
  - Department of Biomedicine, Aarhus University, Aarhus, Denmark
  - The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - Center for Genomics and Personalized Medicine, Aarhus, Denmark
- Noomi O Gregersen
  - The Genetic Biobank of the Faroe Islands, Tórshavn, Faroe Islands
  - Centre of Health Science, Faculty of Health, University of the Faroe Islands, Tórshavn, Faroe Islands
5
Qi Y, Gu S, Zhang Y, Guo L, Xu M, Cheng X, Wang O, Sun Y, Chen J, Fang X, Liu X, Deng L, Fan G. MetaTrass: A high-quality metagenome assembler of the human gut microbiome by cobarcoding sequencing reads. iMeta 2022; 1:e46. [PMID: 38867906 PMCID: PMC10989976 DOI: 10.1002/imt2.46] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/28/2022] [Revised: 06/28/2022] [Accepted: 07/20/2022] [Indexed: 06/14/2024]
Abstract
Metagenomic evidence of great genetic diversity within the nonconserved regions of human gut microbial genomes calls for new methods to elucidate species-level variability at high resolution. However, current approaches cannot meet this methodological challenge. In this study, we propose an efficient binning-first-and-assembly-later strategy, named MetaTrass, to recover high-quality species-resolved genomes based on public reference genomes and the single-tube long fragment read (stLFR) technology, which enables cobarcoding. MetaTrass can generate genomes with longer contiguity, higher completeness, and lower contamination than those produced by conventional assembly-first-and-binning-later strategies. In a simulation study on a mock microbial community, MetaTrass showed the potential to improve assembly contiguity from kb to Mb without loss of accuracy, compared with other methods based on next-generation sequencing technology. From four human fecal samples, MetaTrass successfully retrieved 178 high-quality genomes, whereas the best-performing conventional strategy recovered only 58. Most importantly, these high-quality genomes confirmed the high level of genetic diversity among different samples and unveiled much more. MetaTrass was designed to work with metagenomic reads sequenced by stLFR technology, but is also applicable to other types of cobarcoding libraries. With its high capability for assembling high-quality genomes from metagenomic data sets, MetaTrass seeks to facilitate the study of the spatial characteristics and dynamics of complex microbial communities at enhanced resolution. The open-source code of MetaTrass is available at https://github.com/BGI-Qingdao/MetaTrass.
Affiliation(s)
- Yanwei Qi
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Shengqiang Gu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Lidong Guo
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mengyang Xu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
- Xiaofang Cheng
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - MGI, BGI‐Shenzhen, Shenzhen, China
- Ou Wang
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - MGI, BGI‐Shenzhen, Shenzhen, China
- Ying Sun
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
- Xiaodong Fang
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
  - BGI Genomics, BGI‐Shenzhen, Shenzhen, China
- Xin Liu
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Li Deng
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
- Guangyi Fan
  - BGI‐Qingdao, BGI‐Shenzhen, Qingdao, China
  - State Key Laboratory of Agricultural Genomics, BGI‐Shenzhen, Shenzhen, China
  - China National GeneBank, BGI‐Shenzhen, Shenzhen, China
  - BGI‐Shenzhen, BGI‐Shenzhen, Shenzhen, China
6
Palmada-Flores M, Orkin JD, Haase B, Mountcastle J, Bertelsen MF, Fedrigo O, Kuderna LFK, Jarvis ED, Marques-Bonet T. A high-quality, long-read genome assembly of the endangered ring-tailed lemur (Lemur catta). Gigascience 2022; 11:6562532. [PMID: 35365833 PMCID: PMC8975718 DOI: 10.1093/gigascience/giac026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2021] [Revised: 01/14/2022] [Accepted: 02/19/2022] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND: The ring-tailed lemur (Lemur catta) is a charismatic strepsirrhine primate endemic to Madagascar. These lemurs are of particular interest, given their status as a flagship species and widespread publicity in the popular media. Unfortunately, a recent population decline has resulted in the census population decreasing to <2,500 individuals in the wild, and in the species' classification as endangered by the IUCN. As is the case for most strepsirrhine primates, only a limited amount of genomic research has been conducted on L. catta, in part owing to the lack of genomic resources.
RESULTS: We generated a new high-quality reference genome assembly for L. catta (mLemCat1) that conforms to the standards of the Vertebrate Genomes Project. This new long-read assembly is composed of Pacific Biosciences continuous long reads (CLR data), Optical Mapping Bionano reads, Arima HiC data, and 10X linked reads. The contiguity and completeness of the assembly are extremely high, with scaffold and contig N50 values of 90.982 and 10.570 Mb, respectively. Additionally, when compared to other high-quality primate assemblies, L. catta has the lowest reported number of Alu elements, which results predominantly from a lack of AluS and AluY elements.
CONCLUSIONS: mLemCat1 is an excellent genomic resource not only for the ring-tailed lemur community, but also for other members of the Lemuridae family, and is the first very long read assembly for a strepsirrhine.
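The N50 statistic quoted in this abstract can be illustrated with a minimal sketch. It assumes the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name `n50` and the toy lengths are hypothetical and are not taken from the mLemCat1 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

With these toy lengths the total is 300, and the two longest sequences (80 + 70) already reach half of it, so the sketch reports 70. Under this definition, a higher N50 for the same total length indicates a more contiguous assembly.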
Affiliation(s)
- Marc Palmada-Flores
  - Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona 08003, Spain
- Joseph D Orkin
  - Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona 08003, Spain
  - Département d'anthropologie, Université de Montréal, Montréal, QC H3T 1N8, Canada
- Bettina Haase
  - The Vertebrate Genomes Lab, The Rockefeller University, New York, NY 10065, USA
- Mads F Bertelsen
  - Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg C 1870, Denmark
  - Center for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg 1870, Denmark
- Olivier Fedrigo
  - The Vertebrate Genomes Lab, The Rockefeller University, New York, NY 10065, USA
- Lukas F K Kuderna
  - Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona 08003, Spain
- Erich D Jarvis
  - The Vertebrate Genomes Lab, The Rockefeller University, New York, NY 10065, USA
  - Center for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg 1870, Denmark
  - Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
  - Laboratory of Neurogenetics of Language, The Rockefeller University, NY 10065, USA
- Tomas Marques-Bonet
  - Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona 08003, Spain
  - Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain
  - CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona 08028, Spain
  - Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès 08193, Spain
7
Mueller JC, Botero-Delgadillo E, Espíndola-Hernández P, Gilsenan C, Ewels P, Gruselius J, Kempenaers B. Local selection signals in the genome of Blue tits emphasize regulatory and neuronal evolution. Mol Ecol 2022; 31:1504-1514. [PMID: 34995389 DOI: 10.1111/mec.16345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Revised: 11/18/2021] [Accepted: 12/15/2021] [Indexed: 11/30/2022]
Abstract
Understanding the genomic landscape of adaptation is central to the understanding of microevolution in wild populations. Genomic targets of selection and the underlying genomic mechanisms of adaptation can be elucidated by genome-wide scans for past selective sweeps or by scans for direct fitness associations. We sequenced and assembled 150 haplotypes of 75 Blue tits (Cyanistes caeruleus) of a single central-European population by a linked-read technology. We used these genome data in combination with coalescent simulations (1) to estimate an historical effective population size of ~250,000, which recently declined to ~10,000, and (2) to identify genome-wide distributed selective sweeps of beneficial variants most likely originating from standing genetic variation (soft sweeps). The genes linked to these soft sweeps, but also the ones linked to hard sweeps based on new beneficial mutants, showed a significant enrichment for functions associated with gene expression and transcription regulation. This emphasizes the importance of regulatory evolution in the population's adaptive history. Soft sweeps were further enriched for genes related to axon and synapse development, indicating the significance of neuronal connectivity changes in the brain potentially linked to behavioural adaptations. A previous scan of heterozygosity-fitness correlations revealed a consistent negative effect on arrival date at the breeding site for a single microsatellite in the MDGA2 gene. Here, we used the haplotype structure around this microsatellite to explain the effect as a local and direct outbreeding effect of a gene involved in synapse development.
Affiliation(s)
- Jakob C Mueller
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Esteban Botero-Delgadillo
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Pamela Espíndola-Hernández
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Carol Gilsenan
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Phil Ewels
  - Science for Life Laboratory (SciLifeLab), Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Joel Gruselius
  - Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, Stockholm, Sweden
  - Current address: Vanadis Diagnostics, PerkinElmer, Sollentuna, Sweden
- Bart Kempenaers
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
8
Tarabichi M, Demeulemeester J, Verfaillie A, Flanagan AM, Van Loo P, Konopka T. A pan-cancer landscape of somatic mutations in non-unique regions of the human genome. Nat Biotechnol 2021; 39:1589-1596. [PMID: 34282324 PMCID: PMC7612106 DOI: 10.1038/s41587-021-00971-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/16/2020] [Accepted: 06/02/2021] [Indexed: 12/27/2022]
Abstract
A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.
Affiliation(s)
- Maxime Tarabichi
  - The Francis Crick Institute, London, UK
  - Institute for Interdisciplinary Research, Université Libre de Bruxelles, Brussels, Belgium
- Jonas Demeulemeester
  - The Francis Crick Institute, London, UK
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Adrienne M Flanagan
  - Research Department of Pathology, Cancer Institute, University College London, London, UK
  - Department of Cellular and Molecular Pathology, Royal National Orthopaedic Hospital NHS Trust, Stanmore, UK
- Tomasz Konopka
  - The Francis Crick Institute, London, UK
  - William Harvey Research Institute, Queen Mary University of London, London, UK
9
Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Sci Data 2021; 8:296. [PMID: 34753956 PMCID: PMC8578599 DOI: 10.1038/s41597-021-01077-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/23/2021] [Accepted: 10/11/2021] [Indexed: 11/08/2022] Open
Abstract
With the rapid advancement of sequencing technologies, next generation sequencing (NGS) analysis has been widely applied in cancer genomics research. More recently, NGS has been adopted in clinical oncology to advance personalized medicine. Clinical applications of precision oncology require accurate tests that can distinguish tumor-specific mutations from artifacts introduced during NGS processes or data analysis. Therefore, there is an urgent need both to develop best practices for cancer mutation detection using NGS and to establish standard reference data sets for systematically measuring accuracy and reproducibility across platforms and methods. Within the SEQC2 consortium context, we established paired tumor-normal reference samples and generated whole-genome (WGS) and whole-exome sequencing (WES) data using sixteen library protocols and seven sequencing platforms at six different centers. We systematically interrogated somatic mutations in the reference samples to identify factors affecting detection reproducibility and accuracy in cancer genomes. These large cross-platform/site WGS and WES datasets, generated from well-characterized reference samples, will represent a powerful resource for benchmarking NGS technologies and bioinformatics pipelines, and for cancer genomics studies.
10
Hiltunen M, Ryberg M, Johannesson H. ARBitR: an overlap-aware genome assembly scaffolder for linked reads. Bioinformatics 2021; 37:2203-2205. [PMID: 33216122 PMCID: PMC8352505 DOI: 10.1093/bioinformatics/btaa975] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/23/2020] [Revised: 10/22/2020] [Accepted: 11/10/2020] [Indexed: 12/02/2022] Open
Abstract
Summary: Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken.
Availability and implementation: ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Markus Hiltunen
  - Department of Organismal Biology, Uppsala University, 75236 Uppsala, Sweden
- Martin Ryberg
  - Department of Organismal Biology, Uppsala University, 75236 Uppsala, Sweden
- Hanna Johannesson
  - Department of Organismal Biology, Uppsala University, 75236 Uppsala, Sweden
11
Tedersoo L, Albertsen M, Anslan S, Callahan B. Perspectives and Benefits of High-Throughput Long-Read Sequencing in Microbial Ecology. Appl Environ Microbiol 2021; 87:e0062621. [PMID: 34132589 PMCID: PMC8357291 DOI: 10.1128/aem.00626-21] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Indexed: 12/11/2022] Open
Abstract
Short-read, high-throughput sequencing (HTS) methods have yielded numerous important insights into microbial ecology and function. Yet, in many instances short-read HTS techniques are suboptimal, for example, by providing insufficient phylogenetic resolution or low integrity of assembled genomes. Single-molecule and synthetic long-read (SLR) HTS methods have successfully ameliorated these limitations. In addition, nanopore sequencing has generated a number of unique analysis opportunities, such as rapid molecular diagnostics and direct RNA sequencing, and both Pacific Biosciences (PacBio) and nanopore sequencing support detection of epigenetic modifications. Although initially suffering from relatively low sequence quality, recent advances have greatly improved the accuracy of long-read sequencing technologies. In spite of great technological progress in recent years, the long-read HTS methods (PacBio and nanopore sequencing) are still relatively costly, require large amounts of high-quality starting material, and commonly need specific solutions in various analysis steps. Despite these challenges, long-read sequencing technologies offer high-quality, cutting-edge alternatives for testing hypotheses about microbiome structure and functioning as well as assembly of eukaryote genomes from complex environmental DNA samples.
Affiliation(s)
- Leho Tedersoo
- Mycology and Microbiology Center, University of Tartu, Tartu, Estonia
| | - Mads Albertsen
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark
| | - Sten Anslan
- Mycology and Microbiology Center, University of Tartu, Tartu, Estonia
- Braunschweig University of Technology, Zoological Institute, Braunschweig, Germany
| | - Benjamin Callahan
- Department of Population Health and Pathobiology, College of Veterinary Medicine and Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA
12
Yang C, Zhou Y, Marcus S, Formenti G, Bergeron LA, Song Z, Bi X, Bergman J, Rousselle MMC, Zhou C, Zhou L, Deng Y, Fang M, Xie D, Zhu Y, Tan S, Mountcastle J, Haase B, Balacco J, Wood J, Chow W, Rhie A, Pippel M, Fabiszak MM, Koren S, Fedrigo O, Freiwald WA, Howe K, Yang H, Phillippy AM, Schierup MH, Jarvis ED, Zhang G. Evolutionary and biomedical insights from a marmoset diploid genome assembly. Nature 2021; 594:227-233. [PMID: 33910227 PMCID: PMC8189906 DOI: 10.1038/s41586-021-03535-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 09/20/2020] [Accepted: 04/12/2021] [Indexed: 01/23/2023]
Abstract
The accurate and complete assembly of both haplotype sequences of a diploid organism is essential to understanding the role of variation in genome functions, phenotypes and diseases [1]. Here, using a trio-binning approach, we present a high-quality, diploid reference genome, with both haplotypes assembled independently at the chromosome level, for the common marmoset (Callithrix jacchus), a primate model system that is widely used in biomedical research [2,3]. The full spectrum of heterozygosity between the two haplotypes involves 1.36% of the genome, much higher than the 0.13% indicated by the standard estimation based on single-nucleotide heterozygosity alone. The de novo mutation rate is 0.43 × 10⁻⁸ per site per generation, and the paternally inherited genome acquired twice as many mutations as the maternally inherited one. Our diploid assembly enabled us to discover a recent expansion of the sex-differentiation region and unique evolutionary changes in the marmoset Y chromosome. In addition, we identified many genes with signatures of positive selection that might have contributed to the evolution of Callithrix biological features. Brain-related genes were highly conserved between marmosets and humans, although several genes experienced lineage-specific copy number variations or diversifying selection, with implications for the use of marmosets as a model system.
Affiliation(s)
- Chentao Yang
- BGI-Shenzhen, Shenzhen, China.,Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | - Stephanie Marcus
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Giulio Formenti
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.,Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Lucie A Bergeron
- Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Zhenzhen Song
- University of the Chinese Academy of Sciences, Beijing, China
| | | | - Juraj Bergman
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
| | | | | | | | - Yuan Deng
- BGI-Shenzhen, Shenzhen, China.,Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | - Duo Xie
- BGI-Shenzhen, Shenzhen, China
| | | | | | | | - Bettina Haase
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Jennifer Balacco
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | | | | | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Martin Pippel
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.,Center for Systems Biology, Dresden, Germany
| | | | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Olivier Fedrigo
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
| | - Winrich A Freiwald
- Laboratory of Neural Systems, The Rockefeller University, New York, NY, USA.,Center for Brains, Minds and Machines (CBMM), The Rockefeller University, New York, NY, USA
| | | | - Huanming Yang
- BGI-Shenzhen, Shenzhen, China.,University of the Chinese Academy of Sciences, Beijing, China.,James D. Watson Institute of Genome Sciences, Hangzhou, China.,Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen, China
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Erich D Jarvis
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.,Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.,Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Guojie Zhang
- Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark.,State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
13
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, Lee C, Ko BJ, Chaisson M, Gedman GL, Cantin LJ, Thibaud-Nissen F, Haggerty L, Bista I, Smith M, Haase B, Mountcastle J, Winkler S, Paez S, Howard J, Vernes SC, Lama TM, Grutzner F, Warren WC, Balakrishnan CN, Burt D, George JM, Biegler MT, Iorns D, Digby A, Eason D, Robertson B, Edwards T, Wilkinson M, Turner G, Meyer A, Kautt AF, Franchini P, Detrich HW, Svardal H, Wagner M, Naylor GJP, Pippel M, Malinsky M, Mooney M, Simbirsky M, Hannigan BT, Pesout T, Houck M, Misuraca A, Kingan SB, Hall R, Kronenberg Z, Sović I, Dunn C, Ning Z, Hastie A, Lee J, Selvaraj S, Green RE, Putnam NH, Gut I, Ghurye J, Garrison E, Sims Y, Collins J, Pelan S, Torrance J, Tracey A, Wood J, Dagnew RE, Guan D, London SE, Clayton DF, Mello CV, Friedrich SR, Lovell PV, Osipova E, Al-Ajli FO, Secomandi S, Kim H, Theofanopoulou C, Hiller M, Zhou Y, Harris RS, Makova KD, Medvedev P, Hoffman J, Masterson P, Clark K, Martin F, Howe K, Flicek P, Walenz BP, Kwak W, Clawson H, Diekhans M, Nassar L, Paten B, Kraus RHS, Crawford AJ, Gilbert MTP, Zhang G, Venkatesh B, Murphy RW, Koepfli KP, Shapiro B, Johnson WE, Di Palma F, Marques-Bonet T, Teeling EC, Warnow T, Graves JM, Ryder OA, Haussler D, O'Brien SJ, Korlach J, Lewin HA, Howe K, Myers EW, Durbin R, Phillippy AM, Jarvis ED. Towards complete and error-free genome assemblies of all vertebrate species. Nature 2021; 592:737-746. [PMID: 33911273 PMCID: PMC8081667 DOI: 10.1038/s41586-021-03451-0] [Citation(s) in RCA: 824] [Impact Index Per Article: 274.7] [Received: 05/22/2020] [Accepted: 03/12/2021] [Indexed: 02/02/2023]
Abstract
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Shane A McCarthy
- Department of Genetics, University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Cambridge, UK
| | - Olivier Fedrigo
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
| | - Joana Damas
- The Genome Center, University of California Davis, Davis, CA, USA
| | - Giulio Formenti
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Marcela Uliano-Silva
- Leibniz Institute for Zoo and Wildlife Research, Department of Evolutionary Genetics, Berlin, Germany
- Berlin Center for Genomics in Biodiversity Research, Berlin, Germany
| | | | | | - Juwan Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - Chul Lee
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - Byung June Ko
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
| | - Mark Chaisson
- University of Southern California, Los Angeles, CA, USA
| | - Gregory L Gedman
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Lindsey J Cantin
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Iliana Bista
- Department of Genetics, University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Cambridge, UK
| | | | - Bettina Haase
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
| | | | - Sylke Winkler
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- DRESDEN-concept Genome Center, Dresden, Germany
| | - Sadye Paez
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | | | - Sonja C Vernes
- Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
- School of Biology, University of St Andrews, St Andrews, UK
| | - Tanya M Lama
- University of Massachusetts Cooperative Fish and Wildlife Research Unit, Amherst, MA, USA
| | - Frank Grutzner
- School of Biological Science, The Environment Institute, University of Adelaide, Adelaide, South Australia, Australia
| | - Wesley C Warren
- Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | | | - Dave Burt
- UQ Genomics, University of Queensland, Brisbane, Queensland, Australia
| | - Julia M George
- Department of Biological Sciences, Clemson University, Clemson, SC, USA
| | - Matthew T Biegler
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - David Iorns
- The Genetic Rescue Foundation, Wellington, New Zealand
| | - Andrew Digby
- Kākāpō Recovery, Department of Conservation, Invercargill, New Zealand
| | - Daryl Eason
- Kākāpō Recovery, Department of Conservation, Invercargill, New Zealand
| | - Bruce Robertson
- Department of Zoology, University of Otago, Dunedin, New Zealand
| | | | - Mark Wilkinson
- Department of Life Sciences, Natural History Museum, London, UK
| | - George Turner
- School of Natural Sciences, Bangor University, Gwynedd, UK
| | - Axel Meyer
- Department of Biology, University of Konstanz, Konstanz, Germany
| | - Andreas F Kautt
- Department of Biology, University of Konstanz, Konstanz, Germany
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
| | - Paolo Franchini
- Department of Biology, University of Konstanz, Konstanz, Germany
| | - H William Detrich
- Department of Marine and Environmental Sciences, Northeastern University Marine Science Center, Nahant, MA, USA
| | - Hannes Svardal
- Department of Biology, University of Antwerp, Antwerp, Belgium
- Naturalis Biodiversity Center, Leiden, The Netherlands
| | - Maximilian Wagner
- Institute of Biology, Karl-Franzens University of Graz, Graz, Austria
| | - Gavin J P Naylor
- Florida Museum of Natural History, University of Florida, Gainesville, FL, USA
| | - Martin Pippel
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Center for Systems Biology, Dresden, Germany
| | - Milan Malinsky
- Wellcome Sanger Institute, Cambridge, UK
- Zoological Institute, University of Basel, Basel, Switzerland
| | | | | | | | - Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | | | | | | | | | | | - Ivan Sović
- Pacific Biosciences, Menlo Park, CA, USA
- Digital BioLogic, Ivanić-Grad, Croatia
| | | | - Zemin Ning
- Wellcome Sanger Institute, Cambridge, UK
| | | | - Joyce Lee
- Bionano Genomics, San Diego, CA, USA
| | | | - Richard E Green
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Santa Cruz, CA, USA
| | | | - Ivo Gut
- CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Jay Ghurye
- Dovetail Genomics, Santa Cruz, CA, USA
- Department of Computer Science, University of Maryland College Park, College Park, MD, USA
| | - Erik Garrison
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Ying Sims
- Wellcome Sanger Institute, Cambridge, UK
| | | | | | | | | | | | | | - Dengfeng Guan
- Department of Genetics, University of Cambridge, Cambridge, UK
- School of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin, China
| | - Sarah E London
- Department of Psychology, Institute for Mind and Biology, University of Chicago, Chicago, IL, USA
| | - David F Clayton
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
| | - Claudio V Mello
- Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR, USA
| | - Samantha R Friedrich
- Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR, USA
| | - Peter V Lovell
- Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR, USA
| | - Ekaterina Osipova
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Center for Systems Biology, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Farooq O Al-Ajli
- Monash University Malaysia Genomics Facility, School of Science, Selangor Darul Ehsan, Malaysia
- Tropical Medicine and Biology Multidisciplinary Platform, Monash University Malaysia, Selangor Darul Ehsan, Malaysia
- Qatar Falcon Genome Project, Doha, Qatar
| | | | - Heebal Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- eGnome, Inc., Seoul, Republic of Korea
| | | | - Michael Hiller
- LOEWE Centre for Translational Biodiversity Genomics, Frankfurt, Germany
- Senckenberg Research Institute, Frankfurt, Germany
- Goethe-University, Faculty of Biosciences, Frankfurt, Germany
| | | | - Robert S Harris
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Center for Medical Genomics, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
| | - Paul Medvedev
- Center for Medical Genomics, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
| | - Jinna Hoffman
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
| | - Patrick Masterson
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
| | - Karen Clark
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
| | - Fergal Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Kevin Howe
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Woori Kwak
- eGnome, Inc., Seoul, Republic of Korea
- Hoonygen, Seoul, Korea
| | - Hiram Clawson
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Luis Nassar
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Robert H S Kraus
- Department of Biology, University of Konstanz, Konstanz, Germany
- Department of Migration, Max Planck Institute of Animal Behavior, Radolfzell, Germany
| | - Andrew J Crawford
- Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - M Thomas P Gilbert
- Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- University Museum, NTNU, Trondheim, Norway
| | - Guojie Zhang
- China National Genebank, BGI-Shenzhen, Shenzhen, China
- Villum Center for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
| | - Byrappa Venkatesh
- Institute of Molecular and Cell Biology, A*STAR, Biopolis, Singapore, Singapore
| | - Robert W Murphy
- Centre for Biodiversity, Royal Ontario Museum, Toronto, Ontario, Canada
| | - Klaus-Peter Koepfli
- Smithsonian Conservation Biology Institute, Center for Species Survival, National Zoological Park, Washington, DC, USA
| | - Beth Shapiro
- Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Warren E Johnson
- Smithsonian Conservation Biology Institute, Center for Species Survival, National Zoological Park, Washington, DC, USA
- The Walter Reed Biosystematics Unit, Museum Support Center MRC-534, Smithsonian Institution, Suitland, MD, USA
- Walter Reed Army Institute of Research, Silver Spring, MD, USA
| | - Federica Di Palma
- Department of Biological Sciences, Earlham Institute, University of East Anglia, Norwich, UK
| | - Tomas Marques-Bonet
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona, Spain
- Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, Spain
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
- Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Emma C Teeling
- School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
| | - Tandy Warnow
- Department of Computer Science, The University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | | | - Oliver A Ryder
- San Diego Zoo Global, Escondido, CA, USA
- Department of Evolution, Behavior, and Ecology, University of California San Diego, La Jolla, CA, USA
| | - David Haussler
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Stephen J O'Brien
- Laboratory of Genomics Diversity-Center for Computer Technologies, ITMO University, St. Petersburg, Russian Federation
- Guy Harvey Oceanographic Center, Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, Fort Lauderdale, FL, USA
| | | | - Harris A Lewin
- The Genome Center, University of California Davis, Davis, CA, USA
- Department of Evolution and Ecology, University of California Davis, Davis, CA, USA
- John Muir Institute for the Environment, University of California Davis, Davis, CA, USA
| | | | - Eugene W Myers
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.
- Center for Systems Biology, Dresden, Germany.
- Faculty of Computer Science, Technical University Dresden, Dresden, Germany.
| | - Richard Durbin
- Department of Genetics, University of Cambridge, Cambridge, UK.
- Wellcome Sanger Institute, Cambridge, UK.
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Erich D Jarvis
- Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA.
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
14
Guo L, Xu M, Wang W, Gu S, Zhao X, Chen F, Wang O, Xu X, Seim I, Fan G, Deng L, Liu X. SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme. BMC Bioinformatics 2021; 22:158. [PMID: 33765921 PMCID: PMC7993450 DOI: 10.1186/s12859-021-04081-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/28/2020] [Accepted: 03/16/2021] [Indexed: 12/30/2022] Open
Abstract
Background: Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly.
Results: In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also used a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single-tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349-fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled from next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder.
Conclusions: SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04081-z.
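The abstract above mentions a scaffold graph built from Jaccard similarity between co-barcoded contigs. The following is a minimal illustrative sketch of that idea only, not SLR-superscaffolder's actual implementation: the function names, the toy barcode sets, and the similarity threshold are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def scaffold_edges(contig_barcodes: dict, threshold: float) -> list:
    """Return (contig_u, contig_v, similarity) for every contig pair whose
    barcode sets share at least `threshold` Jaccard similarity.
    Hypothetical helper: real scaffolders also infer order and orientation."""
    edges = []
    for (u, bu), (v, bv) in combinations(sorted(contig_barcodes.items()), 2):
        sim = jaccard(bu, bv)
        if sim >= threshold:
            edges.append((u, v, sim))
    return edges


# Toy data (assumed): barcodes observed on reads mapped to each contig.
contig_barcodes = {
    "contigA": {"bc1", "bc2", "bc3", "bc4"},
    "contigB": {"bc3", "bc4", "bc5"},
    "contigC": {"bc9"},
}

edges = scaffold_edges(contig_barcodes, threshold=0.3)
```

Here contigA and contigB share two of five distinct barcodes, so they are linked, while contigC shares none and stays unlinked. Contigs that derive from the same long DNA fragment tend to share barcodes, which is why barcode overlap can serve as a proximity signal.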
Affiliation(s)
- Lidong Guo
- BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China.,BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Mengyang Xu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
| | - Wenchao Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China
| | - Shengqiang Gu
- BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
| | - Xia Zhao
- MGI, BGI-Shenzhen, Shenzhen, 518083, China
| | - Fang Chen
- MGI, BGI-Shenzhen, Shenzhen, 518083, China
| | - Ou Wang
- BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
| | - Inge Seim
- Integrative Biology Laboratory, College of Life Sciences, Nanjing Normal University, Nanjing, 210046, China.,School of Biology and Environmental Science, Queensland University of Technology, Brisbane, 4000, Australia
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
| | - Li Deng
- BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
| | - Xin Liu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China.,BGI-Shenzhen, Shenzhen, 518083, China.,China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
15
Xu M, Guo L, Du X, Li L, Peters BA, Deng L, Wang O, Chen F, Wang J, Jiang Z, Han J, Ni M, Yang H, Xu X, Liu X, Huang J, Fan G. Accurate Haplotype-Resolved Assembly Reveals The Origin Of Structural Variants For Human Trios. Bioinformatics 2021; 37:2095-2102. [PMID: 33538292 PMCID: PMC8613828 DOI: 10.1093/bioinformatics/btab068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/27/2020] [Revised: 12/07/2020] [Accepted: 01/28/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Achieving a near-complete understanding of how the genome of an individual affects the phenotypes of that individual requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. RESULTS: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova), and are comparable to those of a trio-binning-based, third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997% (Q65)). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. AVAILABILITY: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
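The trio-binning idea described above, sorting reads by parental origin before assembly, can be sketched as follows. This is a simplified illustration under stated assumptions, not HAST's implementation: it classifies individual sequences by counting parent-specific k-mers, whereas the abstract describes working with co-barcoded reads. All function names, the tiny k value, and the toy sequences are hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k: int):
    """Return (maternal-only, paternal-only) k-mer sets.
    K-mers present in both parents carry no haplotype signal and are dropped."""
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    return mat - pat, pat - mat


def classify_read(read: str, mat_only: set, pat_only: set, k: int) -> str:
    """Assign a read to the parent whose specific k-mers it contains more of."""
    ks = kmers(read, k)
    m, p = len(ks & mat_only), len(ks & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # tie, or no informative k-mers


# Toy parental sequences (assumed) differing at a single position.
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGA"], ["ACGTTCGA"], K)

labels = [classify_read(r, mat_only, pat_only, K) for r in ("GTACG", "GTTCG", "ACGT")]
```

The third read contains only a k-mer shared by both parents, so it stays unassigned. In practice k is much larger than in this toy case, so that k-mers are nearly unique in the genome, and each resulting bin is then assembled independently.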
Affiliation(s)
- Mengyang Xu
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | - Lidong Guo
- BGI-QingDao, Qingdao, 266555, China.,BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
| | - Xiao Du
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | - Lei Li
- BGI-QingDao, Qingdao, 266555, China.,School of Future Technology, University of Chinese Academy of Sciences, Beijing, 101408, China
| | - Brock A Peters
- BGI-Shenzhen, Shenzhen, 518083, China.,Complete Genomics Inc, 2904 Orchard Pkwy, San Jose, California, 95134, USA
| | - Li Deng
- BGI-QingDao, Qingdao, 266555, China
| | - Ou Wang
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Fang Chen
- MGI, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jun Wang
- BGI-QingDao, Qingdao, 266555, China
| | | | | | - Ming Ni
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China
| | | | - Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Xin Liu
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jie Huang
- National Institutes for food and drug Control (NIFDC), No.2 Tiantan Xili, Dongcheng District, Beijing, 10050, China
| | - Guangyi Fan
- BGI-QingDao, Qingdao, 266555, China.,BGI-Shenzhen, Shenzhen, 518083, China.,State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, 518083, China
16
Noninvasive prenatal test of single-gene disorders by linked-read direct haplotyping: application in various diseases. Eur J Hum Genet 2020; 29:463-470. [PMID: 33235377 DOI: 10.1038/s41431-020-00759-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/04/2020] [Revised: 08/26/2020] [Accepted: 10/20/2020] [Indexed: 11/08/2022] Open
Abstract
Direct haplotyping enables noninvasive prenatal testing (NIPT) without analyzing the proband, which is a promising strategy for pregnancies at risk of an inherited single-gene disorder. Here, we aimed to expand the scope of single-gene disorders to which NIPT using linked-read direct haplotyping would be applicable. Three families at risk of myotonic dystrophy type 1, lipoid congenital adrenal hyperplasia, and Fukuyama congenital muscular dystrophy were recruited. All cases exhibited distinct characteristics that are often encountered as hurdles (i.e., repeat expansion, identical variants in both parents, and novel variants with retrotransposon insertion) in the universal clinical application of NIPT. Direct haplotyping of parental genomes was performed by linked-read sequencing, combined with allele-specific PCR if necessary. Target DMPK, STAR, and FKTN genes in the maternal plasma DNA were sequenced. Posterior risk calculations and an Anderson-Darling test were performed to deduce the maternal and paternal inheritance, respectively. In all cases, we could predict the inheritance of the maternal mutant allele with >99.9% confidence, while paternal mutant alleles were not predicted to be inherited. Our study indicates that direct haplotyping and posterior risk calculation can be applied with subtle modifications to NIPT for the detection of an expanded range of diseases.
|
17
|
Lutgen D, Ritter R, Olsen R, Schielzeth H, Gruselius J, Ewels P, García JT, Shirihai H, Schweizer M, Suh A, Burri R. Linked‐read sequencing enables haplotype‐resolved resequencing at population scale. Mol Ecol Resour 2020; 20:1311-1322. [DOI: 10.1111/1755-0998.13192] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/16/2020] [Revised: 04/25/2020] [Accepted: 05/06/2020] [Indexed: 11/28/2022]
Affiliation(s)
- Dave Lutgen
- Department of Population Ecology Institute of Ecology and Evolution Friedrich Schiller University Jena Jena Germany
| | - Raphael Ritter
- Department of Population Ecology Institute of Ecology and Evolution Friedrich Schiller University Jena Jena Germany
| | - Remi‐André Olsen
- Science for Life Laboratory Department of Biochemistry and Biophysics Stockholm University Solna Sweden
| | - Holger Schielzeth
- Department of Population Ecology Institute of Ecology and Evolution Friedrich Schiller University Jena Jena Germany
| | - Joel Gruselius
- Science for Life Laboratory Department of Biosciences and Nutrition Karolinska Institutet Stockholm Sweden
| | - Philip Ewels
- Science for Life Laboratory Department of Biochemistry and Biophysics Stockholm University Solna Sweden
| | - Jesús T. García
- Instituto de Investigación en Recursos Cinegéticos (IREC) CSIC‐UCLM‐JCCM Ciudad Real Spain
| | | | - Manuel Schweizer
- Natural History Museum Bern Bern Switzerland
- Institute of Ecology and Evolution University of Bern Bern Switzerland
| | - Alexander Suh
- Department of Organismal Biology – Systematic Biology Evolutionary Biology Centre (EBC) Uppsala University Uppsala Sweden
| | - Reto Burri
- Department of Population Ecology Institute of Ecology and Evolution Friedrich Schiller University Jena Jena Germany
| |
|
18
|
Zlitni S, Bishara A, Moss EL, Tkachenko E, Kang JB, Culver RN, Andermann TM, Weng Z, Wood C, Handy C, Ji HP, Batzoglou S, Bhatt AS. Strain-resolved microbiome sequencing reveals mobile elements that drive bacterial competition on a clinical timescale. Genome Med 2020; 12:50. [PMID: 32471482 PMCID: PMC7260799 DOI: 10.1186/s13073-020-00747-0] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 12/19/2019] [Accepted: 05/11/2020] [Indexed: 02/08/2023]
Abstract
BACKGROUND Populations of closely related microbial strains can be simultaneously present in bacterial communities such as the human gut microbiome. We recently developed a de novo genome assembly approach that uses read cloud sequencing to provide more complete microbial genome drafts, enabling precise differentiation and tracking of strain-level dynamics across metagenomic samples. In this case study, we present a proof-of-concept using read cloud sequencing to describe bacterial strain diversity in the gut microbiome of one hematopoietic cell transplantation patient over a 2-month time course and highlight temporal strain variation of gut microbes during therapy. The treatment was accompanied by diet changes and administration of multiple immunosuppressants and antimicrobials. METHODS We conducted short-read and read cloud metagenomic sequencing of DNA extracted from four longitudinal stool samples collected during the course of treatment of one hematopoietic cell transplantation (HCT) patient. After applying read cloud metagenomic assembly to discover strain-level sequence variants in these complex microbiome samples, we performed metatranscriptomic analysis to investigate differential expression of antibiotic resistance genes. Finally, we validated predictions from the genomic and metatranscriptomic findings through in vitro antibiotic susceptibility testing and whole genome sequencing of isolates derived from the patient stool samples. RESULTS During the 56-day longitudinal time course that was studied, the patient's microbiome was profoundly disrupted and eventually dominated by Bacteroides caccae. Comparative analysis of B. caccae genomes obtained using read cloud sequencing together with metagenomic RNA sequencing allowed us to identify differences in substrain populations over time. Based on this, we predicted that particular mobile element integrations likely resulted in increased antibiotic resistance, which we further supported using in vitro antibiotic susceptibility testing. CONCLUSIONS We find read cloud assembly to be useful in identifying key structural genomic strain variants within a metagenomic sample. These strains have fluctuating relative abundance over relatively short time periods in human microbiomes. We also find specific structural genomic variations that are associated with increased antibiotic resistance over the course of clinical treatment.
Affiliation(s)
- Soumaya Zlitni
- Departments of Genetics, Stanford University, Stanford, CA USA
- Department of Medicine, Division of Hematology, Stanford University, 269 Campus Drive, MC5156, Stanford, CA 94305 USA
| | - Alex Bishara
- Departments of Genetics, Stanford University, Stanford, CA USA
- Department of Computer Science, Stanford University, Stanford, CA USA
| | - Eli L. Moss
- Departments of Genetics, Stanford University, Stanford, CA USA
- Department of Medicine, Division of Hematology, Stanford University, 269 Campus Drive, MC5156, Stanford, CA 94305 USA
| | - Ekaterina Tkachenko
- Departments of Genetics, Stanford University, Stanford, CA USA
- Department of Medicine, Division of Hematology, Stanford University, 269 Campus Drive, MC5156, Stanford, CA 94305 USA
| | | | | | - Tessa M. Andermann
- Department of Medicine, Division of Infectious Diseases, University of North Carolina, Chapel Hill, USA
| | - Ziming Weng
- Department of Pathology, Stanford University School of Medicine, Stanford, CA USA
| | - Christina Wood
- Division of Oncology, Department of Medicine, Stanford University, Stanford, CA USA
| | - Christine Handy
- Division of Oncology, Department of Medicine, Stanford University, Stanford, CA USA
| | - Hanlee P. Ji
- Division of Oncology, Department of Medicine, Stanford University, Stanford, CA USA
| | - Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, CA USA
| | - Ami S. Bhatt
- Departments of Genetics, Stanford University, Stanford, CA USA
- Department of Medicine, Division of Hematology, Stanford University, 269 Campus Drive, MC5156, Stanford, CA 94305 USA
| |
|
19
|
Abstract
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
Affiliation(s)
- Steve S Ho
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| |
|
20
|
Rheinbay E, Nielsen MM, Abascal F, Wala JA, Shapira O, Tiao G, Hornshøj H, Hess JM, Juul RI, Lin Z, Feuerbach L, Sabarinathan R, Madsen T, Kim J, Mularoni L, Shuai S, Lanzós A, Herrmann C, Maruvka YE, Shen C, Amin SB, Bandopadhayay P, Bertl J, Boroevich KA, Busanovich J, Carlevaro-Fita J, Chakravarty D, Chan CWY, Craft D, Dhingra P, Diamanti K, Fonseca NA, Gonzalez-Perez A, Guo Q, Hamilton MP, Haradhvala NJ, Hong C, Isaev K, Johnson TA, Juul M, Kahles A, Kahraman A, Kim Y, Komorowski J, Kumar K, Kumar S, Lee D, Lehmann KV, Li Y, Liu EM, Lochovsky L, Park K, Pich O, Roberts ND, Saksena G, Schumacher SE, Sidiropoulos N, Sieverling L, Sinnott-Armstrong N, Stewart C, Tamborero D, Tubio JMC, Umer HM, Uusküla-Reimand L, Wadelius C, Wadi L, Yao X, Zhang CZ, Zhang J, Haber JE, Hobolth A, Imielinski M, Kellis M, Lawrence MS, von Mering C, Nakagawa H, Raphael BJ, Rubin MA, Sander C, Stein LD, Stuart JM, Tsunoda T, Wheeler DA, Johnson R, Reimand J, Gerstein M, Khurana E, Campbell PJ, López-Bigas N, Weischenfeldt J, Beroukhim R, Martincorena I, Pedersen JS, Getz G. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 2020; 578:102-111. [PMID: 32025015 PMCID: PMC7054214 DOI: 10.1038/s41586-020-1965-x] [Citation(s) in RCA: 352] [Impact Index Per Article: 88.0] [Received: 01/19/2018] [Accepted: 12/02/2019] [Indexed: 01/28/2023]
Abstract
The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
Affiliation(s)
- Esther Rheinbay
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Morten Muhlig Nielsen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | | | - Jeremiah A Wala
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, USA
| | - Ofer Shapira
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Grace Tiao
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Henrik Hornshøj
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | - Julian M Hess
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Randi Istrup Juul
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | - Ziao Lin
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard University, Cambridge, MA, USA
| | - Lars Feuerbach
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Radhakrishnan Sabarinathan
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
| | - Tobias Madsen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | - Jaegil Kim
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Loris Mularoni
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
| | - Shimin Shuai
- Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Andrés Lanzós
- Department for BioMedical Research, University of Bern, Bern, Switzerland
- Graduate School of Cellular and Biomedical Sciences, University of Bern, Bern, Switzerland
- Department of Medical Oncology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Carl Herrmann
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Bioquant Center, Institute of Pharmacy and Molecular Biotechnology, University of Heidelberg, Heidelberg, Germany
| | - Yosef E Maruvka
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
| | - Ciyue Shen
- Department of Cell Biology, Harvard Medical School, Boston, MA, USA
- cBio Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Samirkumar B Amin
- Department of Genomic Medicine, University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX, USA
| | - Pratiti Bandopadhayay
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Johanna Bertl
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | - Keith A Boroevich
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - John Busanovich
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Joana Carlevaro-Fita
- Department for BioMedical Research, University of Bern, Bern, Switzerland
- Graduate School of Cellular and Biomedical Sciences, University of Bern, Bern, Switzerland
- Department of Medical Oncology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Dimple Chakravarty
- Department of Genitourinary Medical Oncology - Research, Division of Cancer Medicine, University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Calvin Wing Yiu Chan
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
| | - David Craft
- Department of Radiation Oncology, Harvard Medical School, Massachusetts General Hospital, Boston, MA, USA
| | - Priyanka Dhingra
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
| | - Klev Diamanti
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
| | - Nuno A Fonseca
- European Bioinformatics Institute, European Molecular Biology Laboratory, Hinxton, UK
| | - Abel Gonzalez-Perez
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
| | - Qianyun Guo
- Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark
| | - Mark P Hamilton
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, USA
| | - Nicholas J Haradhvala
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
| | - Chen Hong
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
| | - Keren Isaev
- Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
| | - Todd A Johnson
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Malene Juul
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
| | - Andre Kahles
- Division of Computational Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
| | - Abdullah Kahraman
- Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
| | - Youngwook Kim
- Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, South Korea
| | - Jan Komorowski
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
- Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
| | - Kiran Kumar
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Sushant Kumar
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - Donghoon Lee
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - Kjong-Van Lehmann
- Division of Computational Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
| | - Yilong Li
- SBGD Inc, Cambridge, MA, USA
- Department of Haematology, University of Cambridge, Cambridge, UK
| | - Eric Minwei Liu
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
| | - Lucas Lochovsky
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Keunchil Park
- Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, South Korea
| | - Oriol Pich
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
| | - Nicola D Roberts
- Department of Haematology, University of Cambridge, Cambridge, UK
| | - Gordon Saksena
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Steven E Schumacher
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Nikos Sidiropoulos
- Biotech Research & Innovation Centre (BRIC), The Finsen Laboratory, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
| | - Lina Sieverling
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
| | | | - Chip Stewart
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - David Tamborero
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
| | - Jose M C Tubio
- Department of Zoology, Genetics and Physical Anthropology, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Centre for Research in Molecular Medicine and Chronic Diseases (CIMUS), Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- The Biomedical Research Centre (CINBIO), Universidade de Vigo, Vigo, Spain
| | - Husen M Umer
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
- Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institute, Stockholm, Sweden
| | - Liis Uusküla-Reimand
- Genetics and Genome Biology Program, SickKids Research Institute, Toronto, Ontario, Canada
- Department of Gene Technology, Tallinn University of Technology, Tallinn, Estonia
| | - Claes Wadelius
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Lina Wadi
- Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | | | - Cheng-Zhong Zhang
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Jing Zhang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - James E Haber
- Department of Biology and Rosenstiel Basic Medical Sciences Research Center, Brandeis University, Waltham, MA, USA
| | - Asger Hobolth
- Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark
| | - Marcin Imielinski
- New York Genome Center, New York, NY, USA
- Department of Pathology and Laboratory Medicine, and Englander Institute for Precision Medicine, and Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
| | - Manolis Kellis
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | - Michael S Lawrence
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
| | - Christian von Mering
- Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
| | - Hidewaki Nakagawa
- Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Tokyo, Japan
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| | - Mark A Rubin
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA
- Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY, USA
- Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
| | - Chris Sander
- Department of Cell Biology, Harvard Medical School, Boston, MA, USA
- cBio Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Lincoln D Stein
- Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Joshua M Stuart
- Center for Biomolecular Science and Engineering, University of California at Santa Cruz, Santa Cruz, CA, USA
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan
| | - David A Wheeler
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Rory Johnson
- Department for BioMedical Research, University of Bern, Bern, Switzerland
- Department of Medical Oncology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Jüri Reimand
- Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
- Department of Computer Science, Yale University, New Haven, CT, USA
| | - Ekta Khurana
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
- Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY, USA
- Meyer Cancer Center, Weill Cornell Medicine, New York, NY, USA
| | - Peter J Campbell
- Wellcome Trust Sanger Institute, Hinxton, UK
- Department of Haematology, University of Cambridge, Cambridge, UK
| | - Núria López-Bigas
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
| | - Joachim Weischenfeldt
- Biotech Research & Innovation Centre (BRIC), The Finsen Laboratory, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | - Rameen Beroukhim
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, USA
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
| | | | - Jakob Skou Pedersen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
- Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark
| | - Gad Getz
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, USA
- Harvard Medical School, Boston, MA, USA
- Department of Pathology, Massachusetts General Hospital, Boston, MA, USA
| |
|
21
|
Zhang L, Zhou X, Weng Z, Sidow A. De novo diploid genome assembly for genome-wide structural variant detection. NAR Genom Bioinform 2019; 2:lqz018. [PMID: 33575568 PMCID: PMC7671403 DOI: 10.1093/nargab/lqz018] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/08/2019] [Revised: 10/09/2019] [Accepted: 12/02/2019] [Indexed: 12/30/2022]
Abstract
Detection of structural variants (SVs) on the basis of read alignment to a reference genome remains a difficult problem. De novo assembly, traditionally used to generate reference genomes, offers an alternative for SV detection. However, it has not been applied broadly to human genomes because of fundamental limitations of short-fragment approaches and high cost of long-read technologies. We here show that 10× linked-read sequencing supports accurate SV detection. We examined variants in six de novo 10× assemblies with diverse experimental parameters from two commonly used human cell lines: NA12878 and NA24385. The assemblies are effective for detecting mid-size SVs, which were discovered by simple pairwise alignment of the assemblies’ contigs to the reference (hg38). Our study also shows that the base-pair level SV breakpoint accuracy is high, with a majority of SVs having precisely correct sizes and breakpoints. Setting the ancestral state of SV loci by comparing to ape orthologs allows inference of the actual molecular mechanism (insertion or deletion) causing the mutation. In about half of cases, the mechanism is the opposite of the reference-based call. We uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimp. Overall, we show that de novo assembly of 10× linked-read data can achieve cost-effective SV detection for personal genomes.
Affiliation(s)
- Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong; Department of Pathology, 300 Pasteur Dr, Stanford University, Stanford, CA 94305, USA; Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Xin Zhou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Ziming Weng
- Department of Pathology, 300 Pasteur Dr, Stanford University, Stanford, CA 94305, USA
| | - Arend Sidow
- Department of Pathology, 300 Pasteur Dr, Stanford University, Stanford, CA 94305, USA; Department of Genetics, 300 Pasteur Dr, Stanford University, Stanford, CA 94305, USA
| |
|
22
|
Fang L, Kao C, Gonzalez MV, Mafra FA, Pellegrino da Silva R, Li M, Wenzel SS, Wimmer K, Hakonarson H, Wang K. LinkedSV for detection of mosaic structural variants from linked-read exome and genome sequencing data. Nat Commun 2019; 10:5585. [PMID: 31811119 PMCID: PMC6898185 DOI: 10.1038/s41467-019-13397-7] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 03/12/2019] [Accepted: 11/07/2019] [Indexed: 02/01/2023]
Abstract
Linked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve detection and breakpoint identification for structural variants (SVs). Here we present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrate that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease-causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.
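The barcode-overlap signal mentioned in this abstract can be illustrated with a minimal sketch. It is not LinkedSV's actual statistic: the function name and the normalization by the smaller barcode set are assumptions made for illustration. The intuition is that two genomic windows far apart on the reference should share few barcodes, so a high shared fraction hints that the windows are adjacent in the sample genome.

```python
def barcode_overlap(barcodes_a, barcodes_b):
    """Fraction of shared barcodes between two genomic windows,
    normalized by the smaller window (hypothetical scoring choice)."""
    a, b = set(barcodes_a), set(barcodes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Barcodes of reads mapped to two windows that are distant on the reference.
window_1 = ["bc1", "bc2", "bc3"]
window_2 = ["bc2", "bc3", "bc4", "bc5"]
print(round(barcode_overlap(window_1, window_2), 3))
```

In practice such a score would need to be compared against a background expectation, since unrelated windows can share barcodes by chance when many molecules carry the same barcode.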
Affiliation(s)
- Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Charlly Kao
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Michael V Gonzalez
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Fernanda A Mafra
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Mingyao Li
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Sören-Sebastian Wenzel
- Institute of Human Genetics, Department for Genetics and Pharmacology, Medical University of Innsbruck, Innsbruck, Austria
- Katharina Wimmer
- Institute of Human Genetics, Department for Genetics and Pharmacology, Medical University of Innsbruck, Innsbruck, Austria
- Hakon Hakonarson
- Department of Pediatrics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
23
Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 2019; 7:219-226.e5. [PMID: 30138581 PMCID: PMC6214366 DOI: 10.1016/j.cels.2018.07.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 03/03/2018] [Revised: 05/03/2018] [Accepted: 07/10/2018] [Indexed: 12/30/2022]
Abstract
Sequencing technologies are capturing longer-range genomic information at lower error rates, enabling alignment to genomic regions that are inaccessible with short reads. However, many methods are unable to align reads to much of the genome, recognized as important in disease, and thus report erroneous results in downstream analyses. We introduce EMA, a novel two-tiered statistical binning model for barcoded read alignment, that first probabilistically maps reads to potentially multiple "read clouds" and then within clouds by newly exploiting the non-uniform read densities characteristic of barcoded read sequencing. EMA substantially improves downstream accuracy over existing methods, including phasing and genotyping on 10x data, with fewer false variant calls in nearly half the time. EMA effectively resolves particularly challenging alignments in genomic regions that contain nearby homologous elements, uncovering variants in the pharmacogenomically important CYP2D region, and clinically important genes C4 (schizophrenia) and AMY1A (obesity), which go undetected by existing methods. Our work provides a framework for future generation sequencing.
24
Douglas GM, Langille MGI. Current and Promising Approaches to Identify Horizontal Gene Transfer Events in Metagenomes. Genome Biol Evol 2019; 11:2750-2766. [PMID: 31504488 PMCID: PMC6777429 DOI: 10.1093/gbe/evz184] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Accepted: 08/19/2019] [Indexed: 12/16/2022]
Abstract
High-throughput shotgun metagenomics sequencing has enabled the profiling of myriad natural communities. These data are commonly used to identify gene families and pathways that were potentially gained or lost in an environment and which may be involved in microbial adaptation. Despite the widespread interest in these events, there are no established best practices for identifying gene gain and loss in metagenomics data. Horizontal gene transfer (HGT) represents several mechanisms of gene gain that are especially of interest in clinical microbiology due to the rapid spread of antibiotic resistance genes in natural communities. Several additional mechanisms of gene gain and loss, including gene duplication, gene loss-of-function events, and de novo gene birth are also important to consider in the context of metagenomes but have been less studied. This review is largely focused on detecting HGT in prokaryotic metagenomes, but methods for detecting these other mechanisms are first discussed. For this article to be self-contained, we provide a general background on HGT and the different possible signatures of this process. Lastly, we discuss how improved assembly of genomes from metagenomes would be the most straight-forward approach for improving the inference of gene gain and loss events. Several recent technological advances could help improve metagenome assemblies: long-read sequencing, determining the physical proximity of contigs, optical mapping of short sequences along chromosomes, and single-cell metagenomics. The benefits and limitations of these advances are discussed and open questions in this area are highlighted.
Affiliation(s)
- Gavin M Douglas
- Department of Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia, Canada
- Morgan G I Langille
- Department of Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia, Canada
25
Darby CA, Fitch JR, Brennan PJ, Kelly BJ, Bir N, Magrini V, Leonard J, Cottrell CE, Gastier-Foster JM, Wilson RK, Mardis ER, White P, Langmead B, Schatz MC. Samovar: Single-Sample Mosaic Single-Nucleotide Variant Calling with Linked Reads. iScience 2019; 18:1-10. [PMID: 31271967 PMCID: PMC6609817 DOI: 10.1016/j.isci.2019.05.037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/09/2019] [Revised: 05/06/2019] [Accepted: 05/24/2019] [Indexed: 12/25/2022]
Abstract
Linked-read sequencing greatly improves haplotype assembly over standard paired-end analysis. The detection of mosaic single-nucleotide variants benefits from haplotype assembly when the model is informed by the mapping between constituent reads and linked reads. Samovar evaluates haplotype-discordant reads identified through linked-read sequencing, thus enabling phasing and mosaic variant detection across the entire genome. Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics. Samovar calls mosaic single-nucleotide variants (SNVs) within a single sample with accuracy comparable with what previously required trios or matched tumor/normal pairs and outperforms single-sample mosaic variant callers at minor allele frequency 5%-50% with at least 30X coverage. Samovar finds somatic variants in both tumor and normal whole-genome sequencing from 13 pediatric cancer cases that can be corroborated with high recall with whole exome sequencing. Samovar is available open-source at https://github.com/cdarby/samovar under the MIT license.
Affiliation(s)
- Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- James R Fitch
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Patrick J Brennan
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Benjamin J Kelly
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Natalie Bir
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Vincent Magrini
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Jeffrey Leonard
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA; Department of Neurosurgery, Nationwide Children's Hospital, Columbus, OH, USA
- Catherine E Cottrell
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Julie M Gastier-Foster
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Richard K Wilson
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Elaine R Mardis
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Peter White
- The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Biology, Johns Hopkins University, Baltimore, MD, USA; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
26
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2019; 19:329-346. [PMID: 29599501 DOI: 10.1038/s41576-018-0003-4] [Citation(s) in RCA: 291] [Impact Index Per Article: 58.2] [Indexed: 01/11/2023]
Abstract
Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
Affiliation(s)
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Hayan Lee
- Department of Genetics, Stanford University, Stanford, CA, USA
- Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
27
Ghurye J, Koren S, Small ST, Redmond S, Howell P, Phillippy AM, Besansky NJ. A chromosome-scale assembly of the major African malaria vector Anopheles funestus. Gigascience 2019; 8:giz063. [PMID: 31157884 PMCID: PMC6545970 DOI: 10.1093/gigascience/giz063] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 12/10/2018] [Revised: 03/28/2019] [Accepted: 05/06/2019] [Indexed: 12/04/2022]
Abstract
BACKGROUND: Anopheles funestus is one of the 3 most consequential and widespread vectors of human malaria in tropical Africa. However, the lack of a high-quality reference genome has hindered the association of phenotypic traits with their genetic basis in this important mosquito. FINDINGS: Here we present a new high-quality A. funestus reference genome (AfunF3) assembled using 240× coverage of long-read single-molecule sequencing for contigging, combined with 100× coverage of short-read Hi-C data for chromosome scaffolding. The assembled contigs total 446 Mbp of sequence and contain substantial duplication due to alternative alleles present in the sequenced pool of mosquitos from the FUMOZ colony. Using alignment and depth-of-coverage information, these contigs were deduplicated to a 211 Mbp primary assembly, which is closer to the expected haploid genome size of 250 Mbp. This primary assembly consists of 1,053 contigs organized into 3 chromosome-scale scaffolds with an N50 contig size of 632 kbp and an N50 scaffold size of 93.811 Mbp, representing a 100-fold improvement in continuity versus the current reference assembly, AfunF1. CONCLUSION: This highly contiguous and complete A. funestus reference genome assembly will serve as an improved basis for future studies of genomic variation and organization in this important disease vector.
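The N50 values quoted in the abstract above follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that calculation follows; it is not taken from the paper, and the function name `n50` and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembly size is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in kbp (not the AfunF3 data).
example_contigs = [900, 700, 632, 400, 200, 100]
print(n50(example_contigs))  # prints 700
```

Because N50 depends only on the length distribution, it says nothing about correctness of the assembly, which is why the abstract also reports deduplication and total size.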
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, 8125 Paint Branch Drive, College Park, MD 20742, USA
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, MD 20892, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, MD 20892, USA
- Scott T Small
- Eck Institute for Global Health and Department of Biological Sciences, University of Notre Dame, 317 Galvin Life Science Center, Notre Dame, IN 46556, USA
- Seth Redmond
- Infectious Disease and Microbiome Program, Broad Institute, 415 Main Street, Cambridge, MA 02142, USA
- Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, 665 Huntington Avenue, Boston, MA 02115, USA
- Paul Howell
- Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, GA 30329, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 9000 Rockville Pike, Bethesda, MD 20892, USA
- Nora J Besansky
- Eck Institute for Global Health and Department of Biological Sciences, University of Notre Dame, 317 Galvin Life Science Center, Notre Dame, IN 46556, USA
28
Tian S, Yan H, Klee EW, Kalmbach M, Slager SL. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Brief Bioinform 2019; 19:893-904. [PMID: 28407084 PMCID: PMC6169673 DOI: 10.1093/bib/bbx037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/12/2016] [Accepted: 03/08/2017] [Indexed: 12/30/2022]
Abstract
Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Eric W Klee
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA
- Michael Kalmbach
- Division of Information Management and Analytics, Department of Information Technology, Mayo Clinic, USA
- Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
29
Marks P, Garcia S, Barrio AM, Belhocine K, Bernate J, Bharadwaj R, Bjornson K, Catalanotti C, Delaney J, Fehr A, Fiddes IT, Galvin B, Heaton H, Herschleb J, Hindson C, Holt E, Jabara CB, Jett S, Keivanfar N, Kyriazopoulou-Panagiotopoulou S, Lek M, Lin B, Lowe A, Mahamdallie S, Maheshwari S, Makarewicz T, Marshall J, Meschi F, O'Keefe CJ, Ordonez H, Patel P, Price A, Royall A, Ruark E, Seal S, Schnall-Levin M, Shah P, Stafford D, Williams S, Wu I, Xu AW, Rahman N, MacArthur D, Church DM. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res 2019; 29:635-645. [PMID: 30894395 PMCID: PMC6442396 DOI: 10.1101/gr.234443.118] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Received: 01/09/2018] [Accepted: 02/21/2019] [Indexed: 02/07/2023]
Abstract
Large-scale population analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, these short-read approaches fail to give a complete picture of a genome. They struggle to identify structural events, cannot access repetitive regions, and fail to resolve the human genome into haplotypes. Here, we describe an approach that retains long-range information while maintaining the advantages of short reads. Starting from ∼1 ng of high molecular weight DNA, we produce barcoded short-read libraries. Novel informatic approaches allow for the barcoded short reads to be associated with their original long molecules producing a novel data type known as "Linked-Reads". This approach allows for simultaneous detection of small and large variants from a single library. In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. Linked-Reads allow mapping to 38 Mb of sequence not accessible to short reads, adding sequence in 423 difficult-to-sequence genes including disease-relevant genes STRC, SMN1, and SMN2. Both Linked-Read whole-genome and whole-exome sequencing identify complex structural variations, including balanced events and single exon deletions and duplications. Further, Linked-Reads extend the region of high-confidence calls by 68.9 Mb. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
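To make concrete how barcoded short reads can be associated with their original long molecules, the following Python sketch groups aligned reads that share a barcode and lie close together on the same chromosome. This is a simple distance heuristic chosen for illustration, not the method used by the authors; the function name `group_into_molecules`, the `max_gap` threshold of 50 kb, and the example reads are all assumptions.

```python
from collections import defaultdict


def group_into_molecules(reads, max_gap=50_000):
    """Cluster (barcode, chrom, pos) reads into putative source molecules.

    Reads with the same barcode on the same chromosome are joined into one
    molecule while consecutive positions are at most `max_gap` bases apart.
    Returns a list of (barcode, chrom, start, end, read_count) tuples.
    """
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules


# Hypothetical aligned reads: (barcode, chromosome, leftmost position).
example_reads = [
    ("AAAC", "chr1", 1_000),
    ("AAAC", "chr1", 21_000),
    ("AAAC", "chr1", 500_000),
    ("GGTT", "chr1", 1_500),
]
for molecule in group_into_molecules(example_reads):
    print(molecule)
```

A fixed gap threshold is the simplest possible choice; it will merge two molecules that happen to share a barcode and overlap, which is one reason production tools use statistical models instead.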
Affiliation(s)
- Adrian Fehr
- 10x Genomics, Pleasanton, California 94566, USA
- Esty Holt
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
- Monkol Lek
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Bill Lin
- 10x Genomics, Pleasanton, California 94566, USA
- Adam Lowe
- 10x Genomics, Pleasanton, California 94566, USA
- Shazia Mahamdallie
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
- Jamie Marshall
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Elise Ruark
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
- Sheila Seal
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
- Preyas Shah
- 10x Genomics, Pleasanton, California 94566, USA
- Indira Wu
- 10x Genomics, Pleasanton, California 94566, USA
- Nazneen Rahman
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
- Daniel MacArthur
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
30
Zhou B, Arthur JG, Ho SS, Pattni R, Huang Y, Wong WH, Urban AE. Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci Data 2018; 5:180261. [PMID: 30561434 PMCID: PMC6298255 DOI: 10.1038/sdata.2018.261] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/23/2018] [Accepted: 10/04/2018] [Indexed: 12/30/2022]
Abstract
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively parallel DNA sequencing platform. The original Venter genome sequence is a very high-quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200 bp and 350 bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages, respectively) as well as 2 kb, 5 kb, and 12 kb mate-pair libraries (49x, 122x, and 145x physical coverages, respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
Affiliation(s)
- Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Joseph G. Arthur
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
- Steve S. Ho
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Reenal Pattni
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Yiling Huang
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Wing H. Wong
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
- Alexander E. Urban
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Tashia and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Palo Alto, California 94305, USA
31
Bishara A, Moss EL, Kolmogorov M, Parada AE, Weng Z, Sidow A, Dekas AE, Batzoglou S, Bhatt AS. High-quality genome sequences of uncultured microbes by assembly of read clouds. Nat Biotechnol 2018; 36:nbt.4266. [PMID: 30320765 PMCID: PMC6465186 DOI: 10.1038/nbt.4266] [Citation(s) in RCA: 70] [Impact Index Per Article: 11.7] [Received: 02/23/2018] [Accepted: 08/28/2018] [Indexed: 01/08/2023]
Abstract
Although shotgun metagenomic sequencing of microbiome samples enables partial reconstruction of strain-level community structure, obtaining high-quality microbial genome drafts without isolation and culture remains difficult. Here, we present an application of read clouds, short-read sequences tagged with long-range information, to microbiome samples. We present Athena, a de novo assembler that uses read clouds to improve metagenomic assemblies. We applied this approach to sequence stool samples from two healthy individuals and compared it with existing short-read and synthetic long-read metagenomic sequencing techniques. Read-cloud metagenomic sequencing and Athena assembly produced the most comprehensive individual genome drafts with high contiguity (>200-kb N50, fewer than ten contigs), even for bacteria with relatively low (20×) raw short-read-sequence coverage. We also sequenced a complex marine-sediment sample and generated 24 intermediate-quality genome drafts (>70% complete, <10% contaminated), nine of which were complete (>90% complete, <5% contaminated). Our approach allows for culture-free generation of high-quality microbial genome drafts by using a single shotgun experiment.
Affiliation(s)
- Alex Bishara
- Department of Computer Science, Stanford University, Stanford, California, USA
- Department of Medicine (Hematology, Blood and Marrow Transplantation) and Department of Genetics, Stanford University, Stanford, California, USA
- Eli L. Moss
- Department of Medicine (Hematology, Blood and Marrow Transplantation) and Department of Genetics, Stanford University, Stanford, California, USA
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA
- Alma E. Parada
- Department of Earth System Science, Stanford University, Stanford, CA, USA
- Ziming Weng
- Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Arend Sidow
- Department of Medicine (Hematology, Blood and Marrow Transplantation) and Department of Genetics, Stanford University, Stanford, California, USA
- Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Anne E. Dekas
- Department of Earth System Science, Stanford University, Stanford, CA, USA
- Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, California, USA
- Ami S. Bhatt
- Department of Medicine (Hematology, Blood and Marrow Transplantation) and Department of Genetics, Stanford University, Stanford, California, USA
32
Zhou X, Batzoglou S, Sidow A, Zhang L. HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data. BMC Genomics 2018; 19:467. [PMID: 29914369 PMCID: PMC6006847 DOI: 10.1186/s12864-018-4867-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/17/2017] [Accepted: 06/13/2018] [Indexed: 12/30/2022]
Abstract
BACKGROUND: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software tools such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. RESULTS: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80 to 99% of false positives regardless of how large the candidate DNM set is. CONCLUSIONS: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
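The filtering rule described in the abstract above (discard a candidate DNM when either haplotype, considered on its own, looks heterozygous) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HAPDeNovo's implementation: the function names, the read-count input format, and the allele-fraction thresholds `low` and `high` are all hypothetical.

```python
def haploid_call(ref_reads, alt_reads, low=0.2, high=0.8):
    """Call a genotype from the reads assigned to ONE haplotype.

    A single haplotype should carry only one allele, so an intermediate
    alternate-allele fraction is reported as "het" (a suspicious call).
    The thresholds are illustrative assumptions.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return "no_call"
    alt_fraction = alt_reads / total
    if alt_fraction <= low:
        return "ref"
    if alt_fraction >= high:
        return "alt"
    return "het"


def keep_candidate(hap1_counts, hap2_counts):
    """Keep a candidate DNM only if neither haplotype is called "het".

    Each argument is a (ref_reads, alt_reads) tuple for one haplotype.
    """
    calls = (haploid_call(*hap1_counts), haploid_call(*hap2_counts))
    return "het" not in calls


# Hypothetical candidates: alt allele confined to one haplotype (kept)
# versus mixed alleles within a single haplotype (filtered out).
print(keep_candidate((0, 12), (15, 0)))  # prints True
print(keep_candidate((6, 7), (14, 0)))   # prints False
```

The design intuition is that a true heterozygous variant places the alternate allele on exactly one haplotype, so mixed alleles within a haplotype point to mapping or sequencing artifacts.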
Affiliation(s)
- Xin Zhou
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA
- Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA
- Arend Sidow
- Department of Pathology, Stanford University School of Medicine, Stanford, California, 94305, USA; Department of Genetics, Stanford University School of Medicine, Stanford, California, 94305, USA
- Lu Zhang
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA; Department of Pathology, Stanford University School of Medicine, Stanford, California, 94305, USA
33
Shajii A, Numanagić I, Berger B. Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2018; 10812:280-282. [PMID: 29888346 PMCID: PMC5989713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Affiliation(s)
- Ariya Shajii
- Computer Science and AI Lab, MIT, Cambridge, MA, USA
- Ibrahim Numanagić
- Computer Science and AI Lab, MIT, Cambridge, MA, USA
- Department of Mathematics, MIT, Cambridge, MA, USA
- Bonnie Berger
- Computer Science and AI Lab, MIT, Cambridge, MA, USA
- Department of Mathematics, MIT, Cambridge, MA, USA
34
Bansal V. An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments. Bioinformatics 2018; 34:155-162. [PMID: 29036419 PMCID: PMC5870854 DOI: 10.1093/bioinformatics/btx436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2017] [Revised: 03/16/2017] [Accepted: 07/04/2017] [Indexed: 11/14/2022]
Abstract
Motivation The short read lengths of current high-throughput sequencing technologies limit the ability to recover long-range haplotype information. Dilution pool methods for preparing DNA sequencing libraries from high molecular weight DNA fragments enable the recovery of long DNA fragments from short sequence reads. These approaches require computational methods for identifying the DNA fragments using aligned sequence reads and assembling the fragments into long haplotypes. Although a number of computational methods have been developed for haplotype assembly, the problem of identifying DNA fragments from dilution pool sequence data has not received much attention. Results We formulate the problem of detecting DNA fragments from dilution pool sequencing experiments as a genome segmentation problem and develop an algorithm that uses dynamic programming to optimize a likelihood function derived from a generative model for the sequence reads. This algorithm uses an iterative approach to automatically infer the mean background read depth and the number of fragments in each pool. Using simulated data, we demonstrate that our method, FragmentCut, has 25-30% greater sensitivity compared with an HMM based method for fragment detection and can also detect overlapping fragments. On a whole-genome human fosmid pool dataset, the haplotypes assembled using the fragments identified by FragmentCut had greater N50 length, 16.2% lower switch error rate and 35.8% lower mismatch error rate compared with two existing methods. We further demonstrate the greater accuracy of our method using two additional dilution pool datasets. Availability and implementation FragmentCut is available from https://bansal-lab.github.io/software/FragmentCut. Contact vibansal@ucsd.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
|
35
|
Lauschke VM, Milani L, Ingelman-Sundberg M. Pharmacogenomic Biomarkers for Improved Drug Therapy—Recent Progress and Future Developments. AAPS JOURNAL 2017; 20:4. [DOI: 10.1208/s12248-017-0161-x] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Received: 07/24/2017] [Accepted: 10/06/2017] [Indexed: 12/13/2022]
|
36
|
Teeling EC, Vernes SC, Dávalos LM, Ray DA, Gilbert MTP, Myers E. Bat Biology, Genomes, and the Bat1K Project: To Generate Chromosome-Level Genomes for All Living Bat Species. Annu Rev Anim Biosci 2017; 6:23-46. [PMID: 29166127 DOI: 10.1146/annurev-animal-022516-022811] [Citation(s) in RCA: 121] [Impact Index Per Article: 17.3] [Indexed: 11/09/2022]
Abstract
Bats are unique among mammals, possessing some of the rarest mammalian adaptations, including true self-powered flight, laryngeal echolocation, exceptional longevity, unique immunity, contracted genomes, and vocal learning. They provide key ecosystem services, pollinating tropical plants, dispersing seeds, and controlling insect pest populations, thus driving healthy ecosystems. They account for more than 20% of all living mammalian diversity, and their crown-group evolutionary history dates back to the Eocene. Despite their great numbers and diversity, many species are threatened and endangered. Here we announce Bat1K, an initiative to sequence the genomes of all living bat species (n∼1,300) to chromosome-level assembly. The Bat1K genome consortium unites bat biologists (>148 members as of writing), computational scientists, conservation organizations, genome technologists, and any interested individuals committed to a better understanding of the genetic and evolutionary mechanisms that underlie the unique adaptations of bats. Our aim is to catalog the unique genetic diversity present in all living bats to better understand the molecular basis of their unique adaptations; uncover their evolutionary history; link genotype with phenotype; and ultimately better understand, promote, and conserve bats. Here we review the unique adaptations of bats and highlight how chromosome-level genome assemblies can uncover the molecular basis of these traits. We present a novel sequencing and assembly strategy and review the striking societal and scientific benefits that will result from the Bat1K initiative.
Affiliation(s)
- Emma C Teeling
- School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Sonja C Vernes
- Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, 6500 AH, The Netherlands; Donders Centre for Cognitive Neuroimaging, Nijmegen, 6525 EN, The Netherlands
- Liliana M Dávalos
- Department of Ecology and Evolution, Stony Brook University, Stony Brook, New York 11794-5245, USA
- David A Ray
- Department of Biological Sciences, Texas Tech University, Lubbock, Texas 79409, USA
- M Thomas P Gilbert
- Natural History Museum of Denmark, University of Copenhagen, 1350 Copenhagen, Denmark; University Museum, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Eugene Myers
- Max Planck Institute for Molecular Cell Biology and Genetics, 01307 Dresden, Germany
- *Full list of Bat1K Consortium members in Supplemental Appendix
|
37
|
Elyanow R, Wu HT, Raphael BJ. Identifying structural variants using linked-read sequencing data. Bioinformatics 2017; 34:353-360. [PMID: 29112732 DOI: 10.1093/bioinformatics/btx712] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 08/11/2017] [Revised: 10/24/2017] [Accepted: 11/02/2017] [Indexed: 12/25/2022]
Abstract
MOTIVATION Structural variation, including large deletions, duplications, inversions, translocations and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (∼5 to 10) of DNA molecules, each ∼50 Kbp in length, with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants. RESULTS We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in an individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification, including two recent methods that also analyze linked reads, on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes. AVAILABILITY AND IMPLEMENTATION Software is available at compbio.cs.brown.edu/software. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
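To make the barcoding idea in this abstract concrete, the sketch below groups aligned reads that share a barcode into candidate molecules ("read clouds"). It is only an illustration of how barcodes carry long-range information, not NAIBR's probabilistic model: the function name, the `(barcode, chrom, pos)` input layout, and the `max_gap` threshold are assumptions introduced here.

```python
from collections import defaultdict


def group_read_clouds(reads, max_gap=50_000):
    """Cluster reads with the same barcode into candidate source molecules.

    `reads` is an iterable of (barcode, chrom, pos) tuples (hypothetical layout).
    Reads sharing a barcode and chromosome are split into separate clouds
    wherever consecutive positions are more than `max_gap` bases apart;
    `max_gap` is an illustrative assumption.
    Returns a list of (barcode, chrom, start, end, read_count) tuples.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    clouds = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large for one molecule: close the current cloud.
                clouds.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        clouds.append((barcode, chrom, start, prev, count))
    return clouds


reads = [
    ("A", "chr1", 100),
    ("A", "chr1", 200),
    ("A", "chr1", 100_000),
    ("B", "chr1", 150),
]
for cloud in group_read_clouds(reads):
    print(cloud)
```

In this toy input, barcode `A` yields two clouds because its third read lies far from the first two, which is the kind of signal a structural variant caller can then interpret alongside other evidence.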
Affiliation(s)
- Rebecca Elyanow
- Center for Computational Molecular Biology, Brown University, Providence, RI, USA
- Hsin-Ta Wu
- Center for Computational Molecular Biology, Brown University, Providence, RI, USA
- Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, USA
|
38
|
Abstract
MOTIVATION Despite rapid progress in sequencing technology, assembling de novo the genomes of new species as well as reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and the inability of current assembly paradigms to cope with combinations of short and long reads. RESULTS Here, we introduce Architect, a new de novo scaffolder aimed at SLR technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead it assembles genomes directly from the SLR's underlying short reads, which we refer to as read clouds. This enables a 4- to 20-fold reduction in sequencing requirements and a 5-fold increase in assembly contiguity on both genomic and metagenomic datasets relative to state-of-the-art assembly strategies aimed directly at fully subassembled long reads. AVAILABILITY AND IMPLEMENTATION Our source code is freely available at https://github.com/kuleshov/architect CONTACT kuleshov@stanford.edu.
Affiliation(s)
- Volodymyr Kuleshov
- Department of Computer Science, Stanford University; Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Michael P Snyder
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
|
39
|
Genome-wide reconstruction of complex structural variants using read clouds. Nat Methods 2017; 14:915-920. [PMID: 28714986 PMCID: PMC5578891 DOI: 10.1038/nmeth.4366] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Received: 11/07/2016] [Accepted: 06/15/2017] [Indexed: 12/16/2022]
Abstract
In read cloud approaches, microfluidic partitioning of long genomic DNA fragments and barcoding of shorter fragments derived from these fragments retains long-range information in short sequencing reads. This combination of short reads with long-range information represents a powerful alternative to single-molecule long-read sequencing. We develop Genome-wide Reconstruction of Complex Structural Variants (GROC-SVs) for SV detection and assembly from read cloud data and apply this method to Illumina-sequenced 10x Genomics sarcoma and breast cancer data sets. Compared with short-fragment sequencing, GROC-SVs substantially improves the specificity of breakpoint detection at comparable sensitivity. This approach also performs sequence assembly across multiple breakpoints simultaneously, enabling the reconstruction of events exhibiting remarkable complexity. We show that chromothriptic rearrangements occurred before copy number amplifications, and that rates of single-nucleotide variants and SVs are not correlated. Our results support the use of read cloud approaches to advance the characterization of large and complex structural variation.
|
40
|
Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, Borrill P, Kettleborough G, Heavens D, Chapman H, Lipscombe J, Barker T, Lu FH, McKenzie N, Raats D, Ramirez-Gonzalez RH, Coince A, Peel N, Percival-Alwyn L, Duncan O, Trösch J, Yu G, Bolser DM, Namaati G, Kerhornou A, Spannagl M, Gundlach H, Haberer G, Davey RP, Fosker C, Palma FD, Phillips AL, Millar AH, Kersey PJ, Uauy C, Krasileva KV, Swarbreck D, Bevan MW, Clark MD. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res 2017; 27:885-896. [PMID: 28420692 PMCID: PMC5411782 DOI: 10.1101/gr.217117.116] [Citation(s) in RCA: 244] [Impact Index Per Article: 34.9] [Received: 10/13/2016] [Accepted: 03/14/2017] [Indexed: 01/16/2023]
Abstract
Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known genome rearrangements and identified one novel rearrangement. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
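The scaffold N50 quoted in this abstract is a standard contiguity statistic: the largest length L such that scaffolds of length at least L together contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative and not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This scaffold pushes the cumulative sum past half the assembly.
            return length


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # → 8
print(n50([5, 4, 3, 2, 1]))           # → 4
```

Because N50 depends on the total assembled length, values are only directly comparable between assemblies of similar total size.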
Affiliation(s)
- Tom Barker
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Fu-Hao Lu
- John Innes Centre, Norwich, NR4 7UH, United Kingdom
- Dina Raats
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Ned Peel
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- Owen Duncan
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Josua Trösch
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Guotai Yu
- John Innes Centre, Norwich, NR4 7UH, United Kingdom
- Dan M Bolser
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Guy Namaati
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Arnaud Kerhornou
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Manuel Spannagl
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Heidrun Gundlach
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Georg Haberer
- Plant Genome and Systems Biology, Helmholtz Center Munich, 85764 Neuherberg, Germany
- Robert P Davey
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- Federica Di Palma
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- A Harvey Millar
- ARC Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley Western Australia 6009, Australia
- Paul J Kersey
- EMBL European Bioinformatics Institute, Hinxton, CB10 1SD, United Kingdom
- Ksenia V Krasileva
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- The Sainsbury Laboratory, Norwich, NR4 7UH, United Kingdom
- David Swarbreck
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
- Matthew D Clark
- Earlham Institute, Norwich, NR4 7UZ, United Kingdom
- University of East Anglia, Norwich, NR4 7TJ, United Kingdom
|
41
|
Human Y chromosome copy number variation in the next generation sequencing era and beyond. Hum Genet 2017; 136:591-603. [PMID: 28378101 PMCID: PMC5418319 DOI: 10.1007/s00439-017-1788-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/14/2017] [Accepted: 03/25/2017] [Indexed: 11/16/2022]
Abstract
The human Y chromosome provides a fertile ground for structural rearrangements owing to its haploidy and high content of repeated sequences. The methodologies used for copy number variation (CNV) studies have developed over the years. Low-throughput techniques based on direct observation of rearrangements were developed early on, and are still used, often to complement array-based or sequencing approaches, which have limited power in regions with high repeat content and specifically in the presence of long, identical repeats, such as those found in human sex chromosomes. Some specific rearrangements have been investigated for decades; because of their effects on fertility, or their outstanding evolutionary features, the interest in these has not diminished. However, following the flourishing of large-scale genomics, several studies have investigated CNVs across the whole chromosome. These studies sometimes employ data generated within large genomic projects such as the DDD study or the 1000 Genomes Project, and often survey large samples of healthy individuals without any prior selection. Novel technologies based on sequencing long molecules, and combinations of technologies, promise to stimulate the study of Y-CNVs in the immediate future.
|
42
|
Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing. PLoS One 2016; 11:e0147229. [PMID: 26789840 PMCID: PMC4720449 DOI: 10.1371/journal.pone.0147229] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 09/19/2015] [Accepted: 12/30/2015] [Indexed: 12/26/2022]
Abstract
Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.
|