1
|
Schreiber M, Jayakodi M, Stein N, Mascher M. Plant pangenomes for crop improvement, biodiversity and evolution. Nat Rev Genet 2024; 25:563-577. [PMID: 38378816 DOI: 10.1038/s41576-024-00691-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/14/2023] [Indexed: 02/22/2024]
Abstract
Plant genome sequences catalogue genes and the genetic elements that regulate their expression. Such inventories further research aims as diverse as mapping the molecular basis of trait diversity in domesticated plants or inquiries into the origin of evolutionary innovations in flowering plants millions of years ago. The transformative technological progress of DNA sequencing in the past two decades has enabled researchers to sequence ever more genomes with greater ease. Pangenomes - complete sequences of multiple individuals of a species or higher taxonomic unit - have now entered the geneticists' toolkit. The genomes of crop plants and their wild relatives are being studied with translational applications in breeding in mind. But pangenomes are applicable also in ecological and evolutionary studies, as they help classify and monitor biodiversity across the tree of life, deepen our understanding of how plant species diverged and show how plants adapt to changing environments or new selection pressures exerted by human beings.
Collapse
Affiliation(s)
- Mona Schreiber
- Department of Biology, University of Marburg, Marburg, Germany
| | - Murukarthick Jayakodi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
Collapse
|
2
|
Sarwal V, Lee S, Yang J, Sankararaman S, Chaisson M, Eskin E, Mangul S. VISTA: an integrated framework for structural variant discovery. Brief Bioinform 2024; 25:bbae462. [PMID: 39297879 PMCID: PMC11411772 DOI: 10.1093/bib/bbae462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Revised: 08/27/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] Open
Abstract
Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome. Still, this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn's disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Since identifying structural variants is an important problem in genetics, several specialized computational techniques have been developed to detect structural variants directly from sequencing data. With advances in whole-genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect a full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools which ignore the length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, along with an in-house polymerase chain reaction (PCR)-validated mouse gold standard set. VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized can attain 100% precision and the highest sensitivity among other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.
Collapse
Affiliation(s)
- Varuni Sarwal
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
| | - Seungmo Lee
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
| | - Jianzhi Yang
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
| | - Sriram Sankararaman
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
| | - Mark Chaisson
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
| | - Serghei Mangul
- Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, United States
| |
Collapse
|
3
|
Cheng PL, Wang H, Dombroski BA, Farrell JJ, Horng I, Chung T, Tosto G, Kunkle BW, Bush WS, Vardarajan B, Schellenberg GD, Lee WP. A Specialized Reference Panel with Structural Variants Integration for Improving Genotype Imputation in Alzheimer's Disease and Related Dementias (ADRD). MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.22.24310827. [PMID: 39108532 PMCID: PMC11302603 DOI: 10.1101/2024.07.22.24310827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 08/12/2024]
Abstract
We developed an imputation panel for Alzheimer's disease (AD) and related dementias (ADRD) using whole-genome sequencing (WGS) data from the Alzheimer's Disease Sequencing Project (ADSP). Recognizing the significant associations between structural variants (SVs) and AD, and their underrepresentation in existing public reference panels, our panel uniquely integrates single nucleotide variants (SNVs), short insertions and deletions (indels), and SVs. This panel enhances the imputation of disease susceptibility, including rare AD-associated SNVs, indels, and SVs, onto genotype array data, offering a cost-effective alternative to whole-genome sequencing while significantly augmenting statistical power. Notably, we discovered 10 rare indels nominal significant related to AD that are absent in the TOPMed-r2 panel and identified three suggestive significant (p-value < 1E-05) AD-associated SVs in the genes EXOC3L2 and DMPK, were identified. These findings provide new insights into AD genetics and underscore the critical role of imputation panels in advancing our understanding of complex diseases like ADRD.
Collapse
Affiliation(s)
- Po-Liang Cheng
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Hui Wang
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Beth A Dombroski
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - John J Farrell
- Biomedical Genetics, Department of Medicine, Boston University Medical School, Boston, MA, USA
| | - Iris Horng
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Tingting Chung
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Giuseppe Tosto
- Taub Institute for Research on Alzheimer's Disease and the Aging Brain, College of Physicians and Surgeons, Columbia University, NY 10032, USA
- Department of Neurology, College of Physicians and Surgeons, Columbia University and the New York Presbyterian Hospital, NY 10032, USA
| | - Brian W Kunkle
- John P Hussman Institute for Human Genomics, Miami, FL, USA
- John T Macdonald Department of Human Genetics, Miami, FL, USA
| | - William S Bush
- Cleveland Institute for Computational Biology, Cleveland, OH, USA
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
| | - Badri Vardarajan
- Taub Institute for Research on Alzheimer's Disease and the Aging Brain, College of Physicians and Surgeons, Columbia University, NY 10032, USA
- Department of Neurology, College of Physicians and Surgeons, Columbia University and the New York Presbyterian Hospital, NY 10032, USA
| | - Gerard D Schellenberg
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Wan-Ping Lee
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
4
|
Patel-Tupper D, Kelikian A, Leipertz A, Maryn N, Tjahjadi M, Karavolias NG, Cho MJ, Niyogi KK. Multiplexed CRISPR-Cas9 mutagenesis of rice PSBS1 noncoding sequences for transgene-free overexpression. SCIENCE ADVANCES 2024; 10:eadm7452. [PMID: 38848363 PMCID: PMC11160471 DOI: 10.1126/sciadv.adm7452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/04/2023] [Accepted: 05/03/2024] [Indexed: 06/09/2024]
Abstract
Understanding CRISPR-Cas9's capacity to produce native overexpression (OX) alleles would accelerate agronomic gains achievable by gene editing. To generate OX alleles with increased RNA and protein abundance, we leveraged multiplexed CRISPR-Cas9 mutagenesis of noncoding sequences upstream of the rice PSBS1 gene. We isolated 120 gene-edited alleles with varying non-photochemical quenching (NPQ) capacity in vivo-from knockout to overexpression-using a high-throughput screening pipeline. Overexpression increased OsPsbS1 protein abundance two- to threefold, matching fold changes obtained by transgenesis. Increased PsbS protein abundance enhanced NPQ capacity and water-use efficiency. Across our resolved genetic variation, we identify the role of 5'UTR indels and inversions in driving knockout/knockdown and overexpression phenotypes, respectively. Complex structural variants, such as the 252-kb duplication/inversion generated here, evidence the potential of CRISPR-Cas9 to facilitate significant genomic changes with negligible off-target transcriptomic perturbations. Our results may inform future gene-editing strategies for hypermorphic alleles and have advanced the pursuit of gene-edited, non-transgenic rice plants with accelerated relaxation of photoprotection.
Collapse
Affiliation(s)
- Dhruv Patel-Tupper
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Howard Hughes Medical Institute, University of California, Berkeley, CA 94720, USA
| | - Armen Kelikian
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Anna Leipertz
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Nina Maryn
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Michelle Tjahjadi
- Innovative Genomics Institute, University of California, Berkeley, CA 94720, USA
| | - Nicholas G. Karavolias
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Innovative Genomics Institute, University of California, Berkeley, CA 94720, USA
| | - Myeong-Je Cho
- Innovative Genomics Institute, University of California, Berkeley, CA 94720, USA
| | - Krishna K. Niyogi
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Howard Hughes Medical Institute, University of California, Berkeley, CA 94720, USA
- Innovative Genomics Institute, University of California, Berkeley, CA 94720, USA
- Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| |
Collapse
|
5
|
Xue Z, Zhou A, Zhu X, Li L, Zhu H, Jin X, Wang J. NIPT-PG: empowering non-invasive prenatal testing to learn from population genomics through an incremental pan-genomic approach. Brief Bioinform 2024; 25:bbae266. [PMID: 38836702 DOI: 10.1093/bib/bbae266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2024] [Revised: 05/03/2024] [Accepted: 05/21/2024] [Indexed: 06/06/2024] Open
Abstract
Non-invasive prenatal testing (NIPT) is a quite popular approach for detecting fetal genomic aneuploidies. However, due to the limitations on sequencing read length and coverage, NIPT suffers a bottleneck on further improving performance and conducting earlier detection. The errors mainly come from reference biases and population polymorphism. To break this bottleneck, we proposed NIPT-PG, which enables the NIPT algorithm to learn from population data. A pan-genome model is introduced to incorporate variant and polymorphic loci information from tested population. Subsequently, we proposed a sequence-to-graph alignment method, which considers the read mis-match rates during the mapping process, and an indexing method using hash indexing and adjacency lists to accelerate the read alignment process. Finally, by integrating multi-source aligned read and polymorphic sites across the pan-genome, NIPT-PG obtains a more accurate z-score, thereby improving the accuracy of chromosomal aneuploidy detection. We tested NIPT-PG on two simulated datasets and 745 real-world cell-free DNA sequencing data sets from pregnant women. Results demonstrate that NIPT-PG outperforms the standard z-score test. Furthermore, combining experimental and theoretical analyses, we demonstrate the probably approximately correct learnability of NIPT-PG. In summary, NIPT-PG provides a new perspective for fetal chromosomal aneuploidies detection. NIPT-PG may have broad applications in clinical testing, and its detection results can serve as a reference for false positive samples approaching the critical threshold.
Collapse
Affiliation(s)
- Zhengfa Xue
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an 710049, China
| | - Aifen Zhou
- Institute of Maternal and Child Health, Wuhan Children's Hospital (Wuhan Maternal and Child Health care Hospital), Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430015, China
- Department of Obstetrics, Wuhan Children's Hospital (Wuhan Maternal and Child Health care Hospital), Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430015, China
| | - Xiaoyan Zhu
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an 710049, China
| | - Linxuan Li
- BGI Research, Shenzhen 518083, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | | | - Xin Jin
- BGI Research, Shenzhen 518083, China
- School of Medicine, South China University of Technology, Guangzhou 510006, China
| | - Jiayin Wang
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an 710049, China
| |
Collapse
|
6
|
Shi J, Jia Z, Sun J, Wang X, Zhao X, Zhao C, Liang F, Song X, Guan J, Jia X, Yang J, Chen Q, Yu K, Jia Q, Wu J, Wang D, Xiao Y, Xu X, Liu Y, Wu S, Zhong Q, Wu J, Cui S, Bo X, Wu Z, Park M, Kellis M, He K. Structural variants involved in high-altitude adaptation detected using single-molecule long-read sequencing. Nat Commun 2023; 14:8282. [PMID: 38092772 PMCID: PMC10719358 DOI: 10.1038/s41467-023-44034-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Accepted: 11/27/2023] [Indexed: 12/17/2023] Open
Abstract
Structural variants (SVs), accounting for a larger fraction of the genome than SNPs/InDels, are an important pool of genetic variation, enabling environmental adaptations. Here, we perform long-read sequencing data of 320 Tibetan and Han samples and show that SVs are highly involved in high-altitude adaptation. We expand the landscape of global SVs, apply robust models of selection and population differentiation combining SVs, SNPs and InDels, and use epigenomic analyses to predict enhancers, target genes and biological functions. We reveal diverse Tibetan-specific SVs affecting the regulatory circuitry of biological functions, including the hypoxia response, energy metabolism and pulmonary function. We find a Tibetan-specific deletion disrupts a super-enhancer and downregulates EPAS1 using enhancer reporter, cellular knock-out and DNA pull-down assays. Our study expands the global SV landscape, reveals the role of gene-regulatory circuitry rewiring in human adaptation, and illustrates the diverse functional roles of SVs in human biology.
Collapse
Affiliation(s)
- Jinlong Shi
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
- Beijing Key Laboratory for Precision Medicine of Chronic Heart Failure, Chinese PLA General Hospital, Beijing, China
| | - Zhilong Jia
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
- Beijing Key Laboratory for Precision Medicine of Chronic Heart Failure, Chinese PLA General Hospital, Beijing, China
- Medical Artificial Intelligence Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Jinxiu Sun
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
| | - Xiaoreng Wang
- Laboratory of Nuclear and Radiation Injury, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- State Key Laboratory of Experimental Hematology, Beijing, 100853, China
| | - Xiaojing Zhao
- Beijing Key Laboratory for Precision Medicine of Chronic Heart Failure, Chinese PLA General Hospital, Beijing, China
- Translational Medicine Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Chenghui Zhao
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
- Research Center for Biomedical Engineering, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Fan Liang
- NextOmics Biosciences Inc, Wuhan, 430000, China
| | - Xinyu Song
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
- Medical Artificial Intelligence Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Jiawei Guan
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
| | - Xue Jia
- Laboratory of Nuclear and Radiation Injury, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Jing Yang
- Laboratory of Nuclear and Radiation Injury, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
| | - Qi Chen
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
| | - Kang Yu
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
| | - Qian Jia
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
| | - Jing Wu
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
| | - Depeng Wang
- NextOmics Biosciences Inc, Wuhan, 430000, China
| | - Yuhui Xiao
- NextOmics Biosciences Inc, Wuhan, 430000, China
| | - Xiaoman Xu
- NextOmics Biosciences Inc, Wuhan, 430000, China
| | - Yinzhe Liu
- NextOmics Biosciences Inc, Wuhan, 430000, China
| | - Shijing Wu
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
| | - Qin Zhong
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China
| | - Jue Wu
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China
| | - Saijia Cui
- Beijing Key Laboratory for Precision Medicine of Chronic Heart Failure, Chinese PLA General Hospital, Beijing, China
| | - Xiaochen Bo
- Beijing Institute of Radiation Medicine, Beijing, 100850, China
| | | | | | - Manolis Kellis
- Massachusetts Institute of Technology; MIT Computer Science and Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard, Cambridge, 02139, MA, USA
| | - Kunlun He
- Medical Big Data Research Center, Medical Innovation Research Division of Chinese PLA General Hospital, Beijing, 100853, China.
- National Engineering Research Center of Medical Big Data, Chinese PLA General Hospital, Beijing, 100853, China.
- Key Laboratory of Biomedical Engineering and Translational Medicine, Ministry of Industry and Information Technology, Chinese PLA General Hospital, Beijing, 100853, China.
- Beijing Key Laboratory for Precision Medicine of Chronic Heart Failure, Chinese PLA General Hospital, Beijing, China.
| |
Collapse
|
7
|
Antinucci M, Comas D, Calafell F. Population history modulates the fitness effects of Copy Number Variation in the Roma. Hum Genet 2023; 142:1327-1343. [PMID: 37311904 PMCID: PMC10449987 DOI: 10.1007/s00439-023-02579-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2023] [Accepted: 06/02/2023] [Indexed: 06/15/2023]
Abstract
We provide the first whole genome Copy Number Variant (CNV) study addressing Roma, along with reference populations from South Asia, the Middle East and Europe. Using CNV calling software for short-read sequence data, we identified 3171 deletions and 489 duplications. Taking into account the known population history of the Roma, as inferred from whole genome nucleotide variation, we could discern how this history has shaped CNV variation. As expected, patterns of deletion variation, but not duplication, in the Roma followed those obtained from single nucleotide polymorphisms (SNPs). Reduced effective population size resulting in slightly relaxed natural selection may explain our observation of an increase in intronic (but not exonic) deletions within Loss of Function (LoF)-intolerant genes. Over-representation analysis for LoF-intolerant gene sets hosting intronic deletions highlights a substantial accumulation of shared biological processes in Roma, intriguingly related to signaling, nervous system and development features, which may be related to the known profile of private disease in the population. Finally, we show the link between deletions and known trait-related SNPs reported in the genome-wide association study (GWAS) catalog, which exhibited even frequency distributions among the studied populations. This suggests that, in general human populations, the strong association between deletions and SNPs associated to biomedical conditions and traits could be widespread across continental populations, reflecting a common background of potentially disease/trait-related CNVs.
Collapse
Affiliation(s)
- Marco Antinucci
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
| | - David Comas
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
| | - Francesc Calafell
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain.
| |
Collapse
|
8
|
Soto DC, Uribe-Salazar JM, Shew CJ, Sekar A, McGinty S, Dennis MY. Genomic structural variation: A complex but important driver of human evolution. AMERICAN JOURNAL OF BIOLOGICAL ANTHROPOLOGY 2023; 181 Suppl 76:118-144. [PMID: 36794631 PMCID: PMC10329998 DOI: 10.1002/ajpa.24713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/02/2022] [Revised: 01/21/2023] [Accepted: 02/05/2023] [Indexed: 02/17/2023]
Abstract
Structural variants (SVs)-including duplications, deletions, and inversions of DNA-can have significant genomic and functional impacts but are technically difficult to identify and assay compared with single-nucleotide variants. With the aid of new genomic technologies, it has become clear that SVs account for significant differences across and within species. This phenomenon is particularly well-documented for humans and other primates due to the wealth of sequence data available. In great apes, SVs affect a larger number of nucleotides than single-nucleotide variants, with many identified SVs exhibiting population and species specificity. In this review, we highlight the importance of SVs in human evolution by (1) how they have shaped great ape genomes resulting in sensitized regions associated with traits and diseases, (2) their impact on gene functions and regulation, which subsequently has played a role in natural selection, and (3) the role of gene duplications in human brain evolution. We further discuss how to incorporate SVs in research, including the strengths and limitations of various genomic approaches. Finally, we propose future considerations in integrating existing data and biospecimens with the ever-expanding SV compendium propelled by biotechnology advancements.
Collapse
Affiliation(s)
- Daniela C. Soto
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - José M. Uribe-Salazar
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Colin J. Shew
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Aarthi Sekar
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Sean McGinty
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| | - Megan Y. Dennis
- Genome Center, MIND Institute, and Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, USA
- Integrative Genetics and Genomics Graduate Group, University of California, Davis, CA, USA
| |
Collapse
|
9
|
Wen S, Wang M, Qian X, Li Y, Wang K, Choi J, Pennesi ME, Yang P, Marra M, Koenekoop RK, Lopez I, Matynia A, Gorin M, Sui R, Yao F, Goetz K, Porto FBO, Chen R. Systematic assessment of the contribution of structural variants to inherited retinal diseases. Hum Mol Genet 2023; 32:2005-2015. [PMID: 36811936 PMCID: PMC10244226 DOI: 10.1093/hmg/ddad032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Revised: 01/03/2023] [Accepted: 02/11/2023] [Indexed: 02/24/2023] Open
Abstract
Despite increasing success in determining genetic diagnosis for patients with inherited retinal diseases (IRDs), mutations in about 30% of the IRD cases remain unclear or unsettled after targeted gene panel or whole exome sequencing. In this study, we aimed to investigate the contributions of structural variants (SVs) to settling the molecular diagnosis of IRD with whole-genome sequencing (WGS). A cohort of 755 IRD patients whose pathogenic mutations remain undefined were subjected to WGS. Four SV calling algorithms including include MANTA, DELLY, LUMPY and CNVnator were used to detect SVs throughout the genome. All SVs identified by any one of these four algorithms were included for further analysis. AnnotSV was used to annotate these SVs. SVs that overlap with known IRD-associated genes were examined with sequencing coverage, junction reads and discordant read pairs. Polymerase Chain Reaction (PCR) followed by Sanger sequencing was used to further confirm the SVs and identify the breakpoints. Segregation of the candidate pathogenic alleles with the disease was performed when possible. A total of 16 candidate pathogenic SVs were identified in 16 families, including deletions and inversions, representing 2.1% of patients with previously unsolved IRDs. Autosomal dominant, autosomal recessive and X-linked inheritance of disease-causing SVs were observed in 12 different genes. Among these, SVs in CLN3, EYS and PRPF31 were found in multiple families. Our study suggests that the contribution of SVs detected by short-read WGS is about 0.25% of our IRD patient cohort and is significantly lower than that of single nucleotide changes and small insertions and deletions.
Collapse
Affiliation(s)
- Shu Wen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Meng Wang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Xinye Qian
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Yumei Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Keqing Wang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jongsu Choi
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Mark E Pennesi
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
| | - Paul Yang
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
| | - Molly Marra
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR 97239, USA
| | - Robert K Koenekoop
- McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, Montreal, Quebec, H4A 3S5, Canada
| | - Irma Lopez
- McGill Ocular Genetics Laboratory and Centre, Department of Paediatric Surgery, Human Genetics, and Ophthalmology, McGill University Health Centre, Montreal, Quebec, H4A 3S5, Canada
| | - Anna Matynia
- Jules Stein Eye Institute, Los Angeles, CA 90095, USA
- Ophthalmology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, CA 90095, USA
| | - Michael Gorin
- Jules Stein Eye Institute, Los Angeles, CA 90095, USA
- Ophthalmology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, CA 90095, USA
| | - Ruifang Sui
- Department of Ophthalmology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, 100005, China
| | - Fengxia Yao
- Medical Research Center, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, 100005, China
| | - Kerry Goetz
- Office of the Director, National Eye Institute/National Institutes of Health, Bethesda, MD 20892, USA
| | - Fernanda Belga Ottoni Porto
- INRET Clínica e Centro de Pesquisa, Belo Horizonte, Minas Gerais, 30150270, Brazil
- Department of Ophthalmology, Santa Casa de Misericórdia de Belo Horizonte, Belo Horizonte, Minas Gerais, 30150221, Brazil
- Centro Oftalmológico de Minas Gerais, Belo Horizonte, Minas Gerais, 30180070, Brazil
| | - Rui Chen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| |
Collapse
|
10
|
Wen S, Wang M, Qian X, Li Y, Wang K, Choi J, Pennesi ME, Yang P, Marra M, Koenekoop RK, Lopez I, Matynia A, Gorin M, Sui R, Yao F, Goetz K, Porto FBO, Chen R. Systematic assessment of the contribution of structural variants to inherited retinal diseases. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.02.522522. [PMID: 36789417 PMCID: PMC9928032 DOI: 10.1101/2023.01.02.522522] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Despite increasing success in determining genetic diagnosis for patients with inherited retinal diseases (IRDs), mutations in about 30% of the IRD cases remain unclear or unsettled after targeted gene panel or whole exome sequencing. In this study, we aimed to investigate the contributions of structural variants (SVs) to settling the molecular diagnosis of IRD with whole-genome sequencing (WGS). A cohort of 755 IRD patients whose pathogenic mutations remain undefined was subjected to WGS. Four SV calling algorithms including include MANTA, DELLY, LUMPY, and CNVnator were used to detect SVs throughout the genome. All SVs identified by any one of these four algorithms were included for further analysis. AnnotSV was used to annotate these SVs. SVs that overlap with known IRD-associated genes were examined with sequencing coverage, junction reads, and discordant read pairs. PCR followed by Sanger sequencing was used to further confirm the SVs and identify the breakpoints. Segregation of the candidate pathogenic alleles with the disease was performed when possible. In total, sixteen candidate pathogenic SVs were identified in sixteen families, including deletions and inversions, representing 2.1% of patients with previously unsolved IRDs. Autosomal dominant, autosomal recessive, and X-linked inheritance of disease-causing SVs were observed in 12 different genes. Among these, SVs in CLN3, EYS, PRPF31 were found in multiple families. Our study suggests that the contribution of SVs detected by short-read WGS is about 0.25% of our IRD patient cohort and is significantly lower than that of single nucleotide changes and small insertions and deletions.
Collapse
|
11
|
Li Q, Yan B, Lam TW, Luo R. Assembly-free discovery of human novel sequences using long reads. DNA Res 2022; 29:dsac039. [PMID: 36308393 PMCID: PMC9700288 DOI: 10.1093/dnares/dsac039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2022] [Revised: 10/19/2022] [Accepted: 10/27/2022] [Indexed: 09/10/2024] Open
Abstract
DNA sequences that are absent in the human reference genome are classified as novel sequences. The discovery of these missed sequences is crucial for exploring the genomic diversity of populations and understanding the genetic basis of human diseases. However, various DNA lengths of reads generated from different sequencing technologies can significantly affect the results of novel sequences. In this work, we designed an assembly-free novel sequence (AF-NS) approach to identify novel sequences from Oxford Nanopore Technology long reads. Among the newly detected sequences using AF-NS, more than 95% were omitted from those using long-read assemblers and 85% were not present in short reads of Illumina. We identified the common novel sequences among all the samples and revealed their association with the binding motifs of transcription factors. Regarding the placements of the novel sequences, we found about 70% enriched in repeat regions and generated 430 for one specific subpopulation that might be related to their evolution. Our study demonstrates the advance of the assembly-free approach to capture more novel sequences over other assembler based methods. Combining the long-read data with powerful analytical methods can be a robust way to improve the completeness of novel sequences.
Collapse
Affiliation(s)
- Qiuhui Li
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Bin Yan
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| |
Collapse
|
12
|
Schuy J, Grochowski CM, Carvalho CMB, Lindstrand A. Complex genomic rearrangements: an underestimated cause of rare diseases. Trends Genet 2022; 38:1134-1146. [PMID: 35820967 PMCID: PMC9851044 DOI: 10.1016/j.tig.2022.06.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2022] [Revised: 05/12/2022] [Accepted: 06/06/2022] [Indexed: 01/24/2023]
Abstract
Complex genomic rearrangements (CGRs) are known contributors to disease but are often missed during routine genetic screening. Identifying CGRs requires (i) identifying copy number variants (CNVs) concurrently with inversions, (ii) phasing multiple breakpoint junctions incis, as well as (iii) detecting and resolving structural variants (SVs) within repeats. We demonstrate how combining cytogenetics and new sequencing methodologies is being successfully applied to gain insights into the genomic architecture of CGRs. In addition, we review CGR patterns and molecular features revealed by studying constitutional genomic disorders. These data offer invaluable lessons to individuals interested in investigating CGRs, evaluating their clinical relevance and frequency, as well as assessing their impact(s) on rare genetic diseases.
Collapse
Affiliation(s)
- Jakob Schuy
- Department of Molecular Medicine and Surgery and Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden
| | | | - Claudia M B Carvalho
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA; Pacific Northwest Research Institute, Seattle, WA, USA
| | - Anna Lindstrand
- Department of Molecular Medicine and Surgery and Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden; Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden.
| |
Collapse
|
13
|
Otsuki A, Okamura Y, Ishida N, Tadaka S, Takayama J, Kumada K, Kawashima J, Taguchi K, Minegishi N, Kuriyama S, Tamiya G, Kinoshita K, Katsuoka F, Yamamoto M. Construction of a trio-based structural variation panel utilizing activated T lymphocytes and long-read sequencing technology. Commun Biol 2022; 5:991. [PMID: 36127505 PMCID: PMC9489684 DOI: 10.1038/s42003-022-03953-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2021] [Accepted: 09/06/2022] [Indexed: 11/13/2022] Open
Abstract
Long-read sequencing technology enable better characterization of structural variants (SVs). To adapt the technology to population-scale analyses, one critical issue is to obtain sufficient amount of high-molecular-weight genomic DNA. Here, we propose utilizing activated T lymphocytes, which can be established efficiently in a biobank to stably supply high-grade genomic DNA sufficiently. We conducted nanopore sequencing of 333 individuals constituting 111 trios with high-coverage long-read sequencing data (depth 22.2x, N50 of 25.8 kb) and identified 74,201 SVs. Our trio-based analysis revealed that more than 95% of the SVs were concordant with Mendelian inheritance. We also identified SVs associated with clinical phenotypes, all of which appear to be stably transmitted from parents to offspring. Our data provide a catalog of SVs in the general Japanese population, and the applied approach using the activated T-lymphocyte resource will contribute to biobank-based human genetic studies focusing on SVs at the population scale. Long-read sequencing on activated T-cells from a sample of 333 Japanese individuals (representing 111 parent-offspring trios) provides a useful reference dataset of structural variation in the Japanese population.
Collapse
Affiliation(s)
- Akihito Otsuki
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Department of Medical Biochemistry, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
| | - Yasunobu Okamura
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Noriko Ishida
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Shu Tadaka
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Jun Takayama
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Nihonbashi 1-chome Mitsui Building 15 F, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, 103-0027, Japan.,Department of AI and Innovative Medicine, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
| | - Kazuki Kumada
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Junko Kawashima
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Keiko Taguchi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Department of Medical Biochemistry, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Naoko Minegishi
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Shinichi Kuriyama
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Gen Tamiya
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Nihonbashi 1-chome Mitsui Building 15 F, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, 103-0027, Japan.,Department of AI and Innovative Medicine, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
| | - Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Department of Medical Biochemistry, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Graduate School of Information Sciences, Tohoku University, 6-3-09 Aramaki Aza-Aoba, Aoba-ku, Sendai, Miyagi, 980-8579, Japan
| | - Fumiki Katsuoka
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Masayuki Yamamoto
- Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan. .,Department of Medical Biochemistry, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan. .,Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan.
| |
Collapse
|
14
|
Zhou Y, Yang L, Han X, Han J, Hu Y, Li F, Xia H, Peng L, Boschiero C, Rosen BD, Bickhart DM, Zhang S, Guo A, Van Tassell CP, Smith TPL, Yang L, Liu GE. Assembly of a pangenome for global cattle reveals missing sequences and novel structural variations, providing new insights into their diversity and evolutionary history. Genome Res 2022; 32:1585-1601. [PMID: 35977842 PMCID: PMC9435747 DOI: 10.1101/gr.276550.122] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2022] [Accepted: 07/21/2022] [Indexed: 02/03/2023]
Abstract
A cattle pangenome representation was created based on the genome sequences of 898 cattle representing 57 breeds. The pangenome identified 83 Mb of sequence not found in the cattle reference genome, representing 3.1% novel sequence compared with the 2.71-Gb reference. A catalog of structural variants developed from this cattle population identified 3.3 million deletions, 0.12 million inversions, and 0.18 million duplications. Estimates of breed ancestry and hybridization between cattle breeds using insertion/deletions as markers were similar to those produced by single nucleotide polymorphism-based analysis. Hundreds of deletions were observed to have stratification based on subspecies and breed. For example, an insertion of a Bov-tA1 repeat element was identified in the first intron of the APPL2 gene and correlated with cattle breed geographic distribution. This insertion falls within a segment overlapping predicted enhancer and promoter regions of the gene, and could affect important traits such as immune response, olfactory functions, cell proliferation, and glucose metabolism in muscle. The results indicate that pangenomes are a valuable resource for studying diversity and evolutionary history, and help to delineate how domestication, trait-based breeding, and adaptive introgression have shaped the cattle genome.
Collapse
Affiliation(s)
- Yang Zhou
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Lv Yang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Xiaotao Han
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Jiazheng Han
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Yan Hu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Fan Li
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Han Xia
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Lingwei Peng
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Clarissa Boschiero
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
| | - Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
| | - Derek M Bickhart
- Dairy Forage Research Center, ARS USDA, Madison, Wisconsin 53706, USA
| | - Shujun Zhang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - Aizhen Guo
- The State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, China
| | - Curtis P Van Tassell
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
| | - Timothy P L Smith
- U.S. Meat Animal Research Center, ARS USDA, Clay Center, Nebraska 68933, USA
| | - Liguo Yang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China
| | - George E Liu
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Maryland 20705, USA
| |
Collapse
|
15
|
Li Z, Jiang X, Fang M, Bai Y, Liu S, Huang S, Jin X. CMDB: the comprehensive population genome variation database of China. Nucleic Acids Res 2022; 51:D890-D895. [PMID: 35871305 PMCID: PMC9825573 DOI: 10.1093/nar/gkac638] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2022] [Accepted: 07/22/2022] [Indexed: 01/30/2023] Open
Abstract
A high-quality genome variation database derived from a large-scale population is one of the most important infrastructures for genomics, clinical and translational medicine research. Here, we developed the Chinese Millionome Database (CMDB), a database that contains 9.04 million single nucleotide variants (SNV) with allele frequency information derived from low-coverage (0.06×-0.1×) whole-genome sequencing (WGS) data of 141 431 unrelated healthy Chinese individuals. These individuals were recruited from 31 out of the 34 administrative divisions in China, covering Han and 36 other ethnic minorities. CMDB, housing the WGS data of a multi-ethnic Chinese population featuring wide geographical distribution, has become the most representative and comprehensive Chinese population genome database to date. Researchers can quickly search for variant, gene or genomic regions to obtain the variant information, including mutation basic information, allele frequency, genic annotation and overview of frequencies in global populations. Furthermore, the CMDB also provides information on the association of the variants with a range of phenotypes, including height, BMI, maternal age and twin pregnancy. Based on these data, researchers can conduct meta-analysis of related phenotypes. CMDB is freely available at https://db.cngb.org/cmdb/.
Collapse
Affiliation(s)
| | | | | | - Yong Bai
- BGI-Shenzhen, Shenzhen518083, Guangdong, China
| | - Siyang Liu
- BGI-Shenzhen, Shenzhen518083, Guangdong, China
| | - Shujia Huang
- Correspondence may also be addressed to Shujia Huang.
| | - Xin Jin
- To whom correspondence should be addressed.
| |
Collapse
|
16
|
Sarwal V, Niehus S, Ayyala R, Kim M, Sarkar A, Chang S, Lu A, Rajkumar N, Darci-Maher N, Littman R, Chhugani K, Soylev A, Comarova Z, Wesel E, Castellanos J, Chikka R, Distler MG, Eskin E, Flint J, Mangul S. A comprehensive benchmarking of WGS-based deletion structural variant callers. Brief Bioinform 2022; 23:bbac221. [PMID: 35753701 PMCID: PMC9294411 DOI: 10.1093/bib/bbac221] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2021] [Revised: 04/30/2022] [Accepted: 05/11/2022] [Indexed: 01/10/2023] Open
Abstract
Advances in whole-genome sequencing (WGS) promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance as the SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
Collapse
Affiliation(s)
- Varuni Sarwal
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
| | - Sebastian Niehus
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany
- Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Charitéplatz 1, 10117 Berlin, Germany
| | - Ram Ayyala
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Minyoung Kim
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089
| | - Aditya Sarkar
- School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175001, India
| | - Sei Chang
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Angela Lu
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Neha Rajkumar
- Department of Bioengineering, Department of Bioengineering, University of California Los Angeles, Los Angeles, CA, 90095
| | - Nicholas Darci-Maher
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Russell Littman
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Karishma Chhugani
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California 1985 Zonal Avenue Los Angeles, CA 90089-9121
| | - Arda Soylev
- Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey
| | - Zoia Comarova
- Department Civil and Environmental Engineering, University of Southern California, Los Angeles, CA, United States
| | - Emily Wesel
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Jacqueline Castellanos
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Rahul Chikka
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Margaret G Distler
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Drive South, Box 708822, Los Angeles, CA, 90095, USA
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, 73-235 CHS, Los Angeles, CA, 90095, USA
| | - Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, 760 Westwood Plaza, Los Angeles, CA 90095, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California 1985 Zonal Avenue Los Angeles, CA 90089-9121
| |
Collapse
|
17
|
Lian Q, Chen Y, Chang F, Fu Y, Qi J. inGAP-family: Accurate Detection of Meiotic Recombination Loci and Causal Mutations by Filtering Out Artificial Variants due to Genome Complexities. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:524-535. [PMID: 33711466 PMCID: PMC9801030 DOI: 10.1016/j.gpb.2019.11.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Revised: 09/04/2019] [Accepted: 11/08/2019] [Indexed: 01/26/2023]
Abstract
Accurately identifying DNA polymorphisms can bridge the gap between phenotypes and genotypes and is essential for molecular marker assisted genetic studies. Genome complexities, including large-scale structural variations, bring great challenges to bioinformatic analysis for obtaining high-confidence genomic variants, as sequence differences between non-allelic loci of two or more genomes can be misinterpreted as polymorphisms. It is important to correctly filter out artificial variants to avoid false genotyping or estimation of allele frequencies. Here, we present an efficient and effective framework, inGAP-family, to discover, filter, and visualize DNA polymorphisms and structural variants (SVs) from alignment of short reads. Applying this method to polymorphism detection on real datasets shows that elimination of artificial variants greatly facilitates the precise identification of meiotic recombination points as well as causal mutations in mutant genomes or quantitative trait loci. In addition, inGAP-family provides a user-friendly graphical interface for detecting polymorphisms and SVs, further evaluating predicted variants and identifying mutations related to genotypes. It is accessible at https://sourceforge.net/projects/ingap-family/.
Collapse
|
18
|
The Thousand Polish Genomes-A Database of Polish Variant Allele Frequencies. Int J Mol Sci 2022; 23:ijms23094532. [PMID: 35562925 PMCID: PMC9104289 DOI: 10.3390/ijms23094532] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Revised: 04/13/2022] [Accepted: 04/14/2022] [Indexed: 02/05/2023] Open
Abstract
Although Slavic populations account for over 4.5% of world inhabitants, no centralised, open-source reference database of genetic variation of any Slavic population exists to date. Such data are crucial for clinical genetics, biomedical research, as well as archeological and historical studies. The Polish population, which is homogenous and sedentary in its nature but influenced by many migrations of the past, is unique and could serve as a genetic reference for the Slavic nations. In this study, we analysed whole genomes of 1222 Poles to identify and genotype a wide spectrum of genomic variation, such as small and structural variants, runs of homozygosity, mitochondrial haplogroups, and de novo variants. Common variant analyses showed that the Polish cohort is highly homogenous and shares ancestry with other European populations. In rare variant analyses, we identified 32 autosomal-recessive genes with significantly different frequencies of pathogenic alleles in the Polish population as compared to the non-Finish Europeans, including C2, TGM5, NUP93, C19orf12, and PROP1. The allele frequencies for small and structural variants, calculated for 1076 unrelated individuals, are released publicly as The Thousand Polish Genomes database, and will contribute to the worldwide genomic resources available to researchers and clinicians.
Collapse
|
19
|
Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]
|
20
|
Smetana J, Brož P. National Genome Initiatives in Europe and the United Kingdom in the Era of Whole-Genome Sequencing: A Comprehensive Review. Genes (Basel) 2022; 13:556. [PMID: 35328109 PMCID: PMC8953625 DOI: 10.3390/genes13030556] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Revised: 03/17/2022] [Accepted: 03/18/2022] [Indexed: 12/04/2022] Open
Abstract
Identification of genomic variability in population plays an important role in the clinical diagnostics of human genetic diseases. Thanks to rapid technological development in the field of massive parallel sequencing technologies, also known as next-generation sequencing (NGS), complex genomic analyses are now easier and cheaper than ever before, which consequently leads to more effective utilization of these techniques in clinical practice. However, interpretation of data from NGS is still challenging due to several issues caused by natural variability of DNA sequences in human populations. Therefore, development and realization of projects focused on description of genetic variability of local population (often called "national or digital genome") with a NGS technique is one of the best approaches to address this problem. The next step of the process is to share such data via publicly available databases. Such databases are important for the interpretation of variants with unknown significance or (likely) pathogenic variants in rare diseases or cancer or generally for identification of pathological variants in a patient's genome. In this paper, we have compiled an overview of published results of local genome sequencing projects from United Kingdom and Europe together with future plans and perspectives for newly announced ones.
Collapse
Affiliation(s)
- Jan Smetana
- Institute of Food Science and Biotechnology, Faculty of Chemistry, Brno University of Technology, 61200 Brno, Czech Republic
| | - Petr Brož
- Department of Genetics and Molecular Biology, Institute of Experimental Biology, Faculty of Science, Masaryk University, 61137 Brno, Czech Republic;
| |
Collapse
|
21
|
Whole-genome sequencing of 1,171 elderly admixed individuals from São Paulo, Brazil. Nat Commun 2022; 13:1004. [PMID: 35246524 PMCID: PMC8897431 DOI: 10.1038/s41467-022-28648-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2020] [Accepted: 01/21/2022] [Indexed: 02/07/2023] Open
Abstract
As whole-genome sequencing (WGS) becomes the gold standard tool for studying population genomics and medical applications, data on diverse non-European and admixed individuals are still scarce. Here, we present a high-coverage WGS dataset of 1,171 highly admixed elderly Brazilians from a census-based cohort, providing over 76 million variants, of which ~2 million are absent from large public databases. WGS enables identification of ~2,000 previously undescribed mobile element insertions without previous description, nearly 5 Mb of genomic segments absent from the human genome reference, and over 140 alleles from HLA genes absent from public resources. We reclassify and curate pathogenicity assertions for nearly four hundred variants in genes associated with dominantly-inherited Mendelian disorders and calculate the incidence for selected recessive disorders, demonstrating the clinical usefulness of the present study. Finally, we observe that whole-genome and HLA imputation could be significantly improved compared to available datasets since rare variation represents the largest proportion of input from WGS. These results demonstrate that even smaller sample sizes of underrepresented populations bring relevant data for genomic studies, especially when exploring analyses allowed only by WGS. Whole genome sequencing (WGS) data on non-European and admixed individuals remains scarce. Here, the authors analyse WGS data from 1,171 admixed elderly Brazilians from a census cohort, characterising population-specific genetic variation and exploring the clinical utility of this expanded dataset.
Collapse
|
22
|
Valls-Margarit J, Galván-Femenía I, Matías-Sánchez D, Blay N, Puiggròs M, Carreras A, Salvoro C, Cortés B, Amela R, Farre X, Lerga-Jaso J, Puig M, Sánchez-Herrero J, Moreno V, Perucho M, Sumoy L, Armengol L, Delaneau O, Cáceres M, de Cid R, Torrents D. GCAT|Panel, a comprehensive structural variant haplotype map of the Iberian population from high-coverage whole-genome sequencing. Nucleic Acids Res 2022; 50:2464-2479. [PMID: 35176773 PMCID: PMC8934637 DOI: 10.1093/nar/gkac076] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2021] [Revised: 12/24/2021] [Accepted: 02/09/2022] [Indexed: 11/17/2022] Open
Abstract
The combined analysis of haplotype panels with phenotype clinical cohorts is a common approach to explore the genetic architecture of human diseases. However, genetic studies are mainly based on single nucleotide variants (SNVs) and small insertions and deletions (indels). Here, we contribute to fill this gap by generating a dense haplotype map focused on the identification, characterization, and phasing of structural variants (SVs). By integrating multiple variant identification methods and Logistic Regression Models (LRMs), we present a catalogue of 35 431 441 variants, including 89 178 SVs (≥50 bp), 30 325 064 SNVs and 5 017 199 indels, across 785 Illumina high coverage (30x) whole-genomes from the Iberian GCAT Cohort, containing a median of 3.52M SNVs, 606 336 indels and 6393 SVs per individual. The haplotype panel is able to impute up to 14 360 728 SNVs/indels and 23 179 SVs, showing a 2.7-fold increase for SVs compared with available genetic variation panels. The value of this panel for SVs analysis is shown through an imputed rare Alu element located in a new locus associated with Mononeuritis of lower limb, a rare neuromuscular disease. This study represents the first deep characterization of genetic variation within the Iberian population and the first operational haplotype panel to systematically include the SVs into genome-wide genetic studies.
Collapse
Affiliation(s)
| | | | | | - Natalia Blay
- Genomes for Life-GCAT lab Group, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona 08916, Spain
| | - Montserrat Puiggròs
- Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
| | - Anna Carreras
- Genomes for Life-GCAT lab Group, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona 08916, Spain
| | - Cecilia Salvoro
- Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
| | - Beatriz Cortés
- Genomes for Life-GCAT lab Group, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona 08916, Spain
| | - Ramon Amela
- Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
| | - Xavier Farre
- Genomes for Life-GCAT lab Group, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona 08916, Spain
| | - Jon Lerga-Jaso
- Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
| | - Marta Puig
- Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
| | - Jose Francisco Sánchez-Herrero
- High Content Genomics and Bioinformatics Unit, Institute for Health Science Research Germans Trias i Pujol (IGTP), 08916 Badalona, Spain
| | - Victor Moreno
- Catalan Institute of Oncology, Hospitalet del Llobregat, 08908, Spain
- Bellvitge Biomedical Research Institute (IDIBELL), Hospitalet del Llobregat, 08908, Spain
- CIBER Epidemiología y Salud Pública (CIBERESP), Madrid 28029, Spain
- Universitat de Barcelona (UB), Barcelona 08007, Spain
| | - Manuel Perucho
- Sanford Burnham Prebys Medical Discovery Institute (SBP), La Jolla, CA 92037, USA
- Cancer Genetics and Epigenetics, Program of Predictive and Personalized Medicine of Cancer (PMPPC), Health Science Research Institute Germans Trias i Pujol (IGTP), Badalona 08916, Spain
| | - Lauro Sumoy
- High Content Genomics and Bioinformatics Unit, Institute for Health Science Research Germans Trias i Pujol (IGTP), 08916 Badalona, Spain
| | - Lluís Armengol
- Quantitative Genomic Medicine Laboratories (qGenomics), Esplugues del Llobregat, 08950, Spain
| | - Olivier Delaneau
- Department of Computational Biology, University of Lausanne, Génopode, 1015 Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), University of Lausanne, Quartier Sorge – Batiment Amphipole, 1015 Lausanne, Switzerland
| | - Mario Cáceres
- Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
- ICREA, Barcelona 08010, Spain
| | - Rafael de Cid
- Correspondence may also be addressed to Rafael de Cid. Tel: +34 930330542;
| | - David Torrents
- To whom correspondence should be addressed. Tel: +34 934134074;
| |
Collapse
|
23
|
Henriksen RA, Jenjaroenpun P, Sjøstrøm IB, Jensen KR, Prada-Luengo I, Wongsurawat T, Nookaew I, Regenberg B. Circular DNA in the human germline and its association with recombination. Mol Cell 2022; 82:209-217.e7. [PMID: 34951964 PMCID: PMC10707452 DOI: 10.1016/j.molcel.2021.11.027] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 08/24/2021] [Accepted: 11/23/2021] [Indexed: 12/24/2022]
Abstract
Extrachromosomal circular DNA (eccDNA) is common in somatic tissue, but its existence and effects in the human germline are unexplored. We used microscopy, long-read DNA sequencing, and new analytic methods to document thousands of eccDNAs from human sperm. EccDNAs derived from all genomic regions and mostly contained a single DNA fragment, although some consisted of multiple fragments. The generation of eccDNA inversely correlates with the meiotic recombination rate, and chromosomes with high coding-gene density and Alu element abundance form the least eccDNA. Analysis of insertions in human genomes further indicates that eccDNA can persist in the human germline when the circular molecules reinsert themselves into the chromosomes. Our results suggest that eccDNA has transient and permanent effects on the germline. They explain how differences in the physical and genetic map might arise and offer an explanation of how Alu elements coevolved with genes to protect genome integrity against deleterious mutations producing eccDNA.
Collapse
Affiliation(s)
- Rasmus Amund Henriksen
- Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Ida Borup Sjøstrøm
- Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | - Iñigo Prada-Luengo
- Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Intawat Nookaew
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Birgitte Regenberg
- Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| |
Collapse
|
24
|
The correctness of large scale analysis of genomic data. FOUNDATIONS OF COMPUTING AND DECISION SCIENCES 2021. [DOI: 10.2478/fcds-2021-0024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Abstract
Implementing a large genomic project is a demanding task, also from the computer science point of view. Besides collecting many genome samples and sequencing them, there is processing of a huge amount of data at every stage of their production and analysis. Efficient transfer and storage of the data is also an important issue. During the execution of such a project, there is a need to maintain work standards and control quality of the results, which can be difficult if a part of the work is carried out externally. Here, we describe our experience with such data quality analysis on a number of levels - from an obvious check of the quality of the results obtained, to examining consistency of the data at various stages of their processing, to verifying, as far as possible, their compatibility with the data describing the sample.
Collapse
|
25
|
Zhang T, Li Q, Dong B, Liang X, Jia M, Bai J, Yu J, Fu S. Genetic Polymorphism of Drug Metabolic Gene CYPs, VKORC1, NAT2, DPYD and CHST3 of Five Ethnic Minorities in Heilongjiang Province, Northeast China. Pharmgenomics Pers Med 2021; 14:1537-1547. [PMID: 34876832 PMCID: PMC8643223 DOI: 10.2147/pgpm.s339854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2021] [Accepted: 11/05/2021] [Indexed: 11/23/2022] Open
Abstract
Introduction Genetic variability in genes encoding drug-metabolizing enzymes may contribute to the heterogeneity of drug responses in different populations. Extensive research in pharmacogenomics in major populations around the world provides us with a great deal of information about drug-related genetic polymorphisms. Objective The purpose of this study was to detect the genetic variation of drug-metabolism-related genes in the five ethnic minorities Daur, Hezhen, Ewenki, Mongolian and Manchu in China, and to analyze the distribution differences among ethnic groups. Methods We genotyped 32 SNPs of drug metabolism genes in 882 healthy Chinese volunteers from five ethnic groups. The genotype frequency and allele frequency of the five ethnic groups were calculated, and the different variants among the five ethnic groups were compared by chi-square test. Genetic parameters were analyzed using Popgene software. The genetic structure of five ethnic minorities was analyzed by principal component analysis, and compared with 26 populations. Results We found that SNPs of genes related to drug metabolism existed diversity in different populations. Among them, rs8192766 and rs9419082 in CYP2E1 showed statistical differences between Daur and Manchu, and NAT2 rs1801280 showed statistical differences between Hezhen and Mongolian. In addition, the five populations we studied had the smallest differences with EAS populations. There was haplotype diversity in CHST3, VKORC1, CYP1A2 and CYP2E1 genes in the five ethnic minorities, and these haplotype polymorphisms were related to the use of corresponding drug doses. Cluster analysis shows that the five ethnic minorities in Heilongjiang Province are clustered together with the EAS populations. Conclusion These results suggest that understanding the diversity of drug-related genetic markers is critical for individualized drug gene therapy programs in ethnic minorities in China as well as in populations highly mixed with these ethnic groups.
Collapse
Affiliation(s)
- Tingting Zhang
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China
| | - Qiuyan Li
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China.,Editorial Department of International Journal of Genetics, Harbin Medical University, Harbin, People's Republic of China
| | - Bonan Dong
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China
| | - Xiao Liang
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China
| | - Mansha Jia
- Scientific Research Centre, The Second Affiliated Hospital of Harbin Medical University, Harbin, People's Republic of China
| | - Jing Bai
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China
| | - Jingcui Yu
- Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China.,Scientific Research Centre, The Second Affiliated Hospital of Harbin Medical University, Harbin, People's Republic of China
| | - Songbin Fu
- Laboratory of Medical Genetics, Harbin Medical University, Harbin, People's Republic of China.,Key Laboratory of Preservation of Human Genetic Resources and Disease Control in China (Harbin Medical University), Ministry of Education, Harbin, People's Republic of China
| |
Collapse
|
26
|
Zverinova S, Guryev V. Variant calling: Considerations, practices, and developments. Hum Mutat 2021; 43:976-985. [PMID: 34882898 PMCID: PMC9545713 DOI: 10.1002/humu.24311] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2021] [Revised: 11/02/2021] [Accepted: 12/03/2021] [Indexed: 11/10/2022]
Abstract
The success of many clinical, association, or population genetics studies critically relies on properly performed variant calling step. The variety of modern genomics protocols, techniques, and platforms makes our choices of methods and algorithms difficult and there is no "one size fits all" solution for study design and data analysis. In this review, we discuss considerations that need to be taken into account while designing the study and preparing for the experiments. We outline the variety of variant types that can be detected using sequencing approaches and highlight some specific requirements and basic principles of their detection. Finally, we cover interesting developments that enable variant calling for a broad range of applications in the genomics field. We conclude by discussing technological and algorithmic advances that have the potential to change the ways of calling DNA variants in the nearest future.
Collapse
Affiliation(s)
- Stepanka Zverinova
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Centre Groningen, Groningen, The Netherlands
| | - Victor Guryev
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Centre Groningen, Groningen, The Netherlands
| |
Collapse
|
27
|
Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021; 38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Sebastian Niehus
- Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany
| | | | - Bjarni V Halldórsson
- deCODE Genetics, Reykjavík 102, Iceland,Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland
| | - Birte Kehr
- To whom correspondence should be addressed. or
| |
Collapse
|
28
|
Abstract
The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome.
Collapse
Affiliation(s)
- Karen H Miga
- UC Santa Cruz Genomics Institute and Department of Biomedical Engineering, University of California, Santa Cruz, California 95064, USA;
| | - Ting Wang
- Department of Genetics, Edison Family Center for Genome Sciences and Systems Biology, and McDonnell Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63110, USA;
| |
Collapse
|
29
|
Li Q, Tian S, Yan B, Liu CM, Lam TW, Li R, Luo R. Building a Chinese pan-genome of 486 individuals. Commun Biol 2021; 4:1016. [PMID: 34462542 PMCID: PMC8405635 DOI: 10.1038/s42003-021-02556-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2020] [Accepted: 08/13/2021] [Indexed: 02/07/2023] Open
Abstract
Pan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation still missing from current human reference sequences is still unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp common sequences obtained from the 486 individuals improved the accuracy of variant calling and mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.
Collapse
Affiliation(s)
- Qiuhui Li
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Shilin Tian
- Novogene Bioinformatics Institute, Beijing, China
| | - Bin Yan
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Chi Man Liu
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Hong Kong, China.
| | - Ruiqiang Li
- Novogene Bioinformatics Institute, Beijing, China.
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, China.
| |
Collapse
|
30
|
McDonald TL, Zhou W, Castro CP, Mumm C, Switzenberg JA, Mills RE, Boyle AP. Cas9 targeted enrichment of mobile elements using nanopore sequencing. Nat Commun 2021; 12:3586. [PMID: 34117247 PMCID: PMC8196195 DOI: 10.1038/s41467-021-23918-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2021] [Accepted: 05/25/2021] [Indexed: 02/05/2023] Open
Abstract
Mobile element insertions (MEIs) are repetitive genomic sequences that contribute to genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9-targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on-targeted signals and exhibiting a 13.4-54x enrichment over whole-genome approaches. We show an individual flow cell can recover most MEIs (97% L1Hs, 93% AluYb, 51% AluYa, 99% SVA_F, and 65% SVA_E). We identify seventeen non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.
Collapse
Affiliation(s)
- Torrin L McDonald
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Weichen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Christopher P Castro
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Camille Mumm
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Jessica A Switzenberg
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| | - Alan P Boyle
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| |
Collapse
|
31
|
Giles HH, Hegde MR, Lyon E, Stanley CM, Kerr ID, Garlapow ME, Eggington JM. The Science and Art of Clinical Genetic Variant Classification and Its Impact on Test Accuracy. Annu Rev Genomics Hum Genet 2021; 22:285-307. [PMID: 33900788 DOI: 10.1146/annurev-genom-121620-082709] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Clinical genetic variant classification science is a growing subspecialty of clinical genetics and genomics. The field's continued improvement is essential for the success of precision medicine in both germline (hereditary) and somatic (oncology) contexts. This review focuses on variant classification for DNA next-generation sequencing tests. We first summarize current limitations in variant discovery and definition, and then describe the current five- and four-tier classification systems outlined in dominant standards and guideline publications for germline and somatic tests, respectively. We then discuss measures of variant classification discordance and the field's bias for positive results, as well as considerations for panel size and population screening in the context of estimates of positive predictive value thatincorporate estimated variant classification imperfections. Finally, we share opinions on the current state of variant classification from some of the authors of the most widely used standards and guideline publications and from other domain experts.
Collapse
Affiliation(s)
- Hunter H Giles
- Center for Genomic Interpretation, Sandy, Utah 84092, USA; , ,
| | - Madhuri R Hegde
- PerkinElmer Genomics, Waltham, Massachusetts 02450, USA; .,Department of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
| | - Elaine Lyon
- HudsonAlpha Clinical Services Lab, Huntsville, Alabama 35806, USA;
| | - Christine M Stanley
- C2i Genomics, Cambridge, Massachusetts 02139, USA.,Variantyx, Framingham, Massachusetts 01701, USA;
| | | | | | | |
Collapse
|
32
|
Charon C, Allodji R, Meyer V, Deleuze JF. Impact of pre- and post-variant filtration strategies on imputation. Sci Rep 2021; 11:6214. [PMID: 33737531 PMCID: PMC7973508 DOI: 10.1038/s41598-021-85333-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2020] [Accepted: 02/22/2021] [Indexed: 01/04/2023] Open
Abstract
Quality control (QC) methods for genome-wide association studies and fine mapping are commonly used for imputation, however they result in loss of many single nucleotide polymorphisms (SNPs). To investigate the consequences of filtration on imputation, we studied the direct effects on the number of markers, their allele frequencies, imputation quality scores and post-filtration events. We pre-phrased 1031 genotyped individuals from diverse ethnicities and compared the imputed variants to 1089 NCBI recorded individuals for additional validation. Without QC-based variant pre-filtration, we observed no impairment in the imputation of SNPs that failed QC whereas with pre-filtration there was an overall loss of information. Significant differences between frequencies with and without pre-filtration were found only in the range of very rare (5E-04-1E-03) and rare variants (1E-03-5E-03) (p < 1E-04). Increasing the post-filtration imputation quality score from 0.3 to 0.8 reduced the number of single nucleotide variants (SNVs) < 0.001 2.5 fold with or without QC pre-filtration and halved the number of very rare variants (5E-04). Thus, to maintain confidence and enough SNVs, we propose here a two-step filtering procedure which allows less stringent filtering prior to imputation and post-imputation in order to increase the number of very rare and rare variants compared to conservative filtration methods.
Collapse
Affiliation(s)
- Céline Charon
- CEA Paris-Saclay, Institut François Jacob, Centre National de Recherche en Génomique Humaine, 2 rue Gaston Crémieux, Evry, 91057, France.
| | - Rodrigue Allodji
- Radiation Epidemiology Group CESP, Inserm Unit 1018, Gustave Roussy Université Paris Saclay, 114 rue Edouard Vaillant, Villejuif, 94805, France
| | - Vincent Meyer
- CEA Paris-Saclay, Institut François Jacob, Centre National de Recherche en Génomique Humaine, 2 rue Gaston Crémieux, Evry, 91057, France
| | - Jean-François Deleuze
- CEA Paris-Saclay, Institut François Jacob, Centre National de Recherche en Génomique Humaine, 2 rue Gaston Crémieux, Evry, 91057, France
| |
Collapse
|
33
|
Feng X, Li H. Higher Rates of Processed Pseudogene Acquisition in Humans and Three Great Apes Revealed by Long-Read Assemblies. Mol Biol Evol 2021; 38:2958-2966. [PMID: 33681998 DOI: 10.1093/molbev/msab062] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
LINE-1-mediated retrotransposition of protein-coding mRNAs is an active process in modern humans for both germline and somatic genomes. Prior works that surveyed human data mostly relied on detecting discordant mappings of paired-end short reads, or exon junctions contained in short reads. Moreover, there have been few genome-wide comparisons between gene retrocopies in great apes and humans. In this study, we introduced a more sensitive and accurate method to identify processed pseudogenes. Our method utilizes long-read assemblies, and more importantly, is able to provide full-length retrocopy sequences as well as flanking regions which are missed by short-read based methods. From 22 human individuals, we pinpointed 40 processed pseudogenes that are not present in the human reference genome GRCh38 and identified 17 pseudogenes that are in GRCh38 but absent from some input individuals. This represents a significantly higher discovery rate than previous reports (39 pseudogenes not in the reference genome out of 939 individuals). We also provided an overview of lineage-specific retrocopies in chimpanzee, gorilla, and orangutan genomes.
Collapse
Affiliation(s)
- Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| |
Collapse
|
34
|
Torkamaneh D, Belzile F. Accurate Imputation of Untyped Variants from Deep Sequencing Data. Methods Mol Biol 2021; 2243:271-281. [PMID: 33606262 DOI: 10.1007/978-1-0716-1103-6_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/09/2023]
Abstract
The quality, statistical power, and resolution of genome-wide association studies (GWAS) are largely dependent on the comprehensiveness of genotypic data. Over the last few years, despite the constant decrease in the price of sequencing, whole-genome sequencing (WGS) of association panels comprising a large number of samples remains cost-prohibitive. Therefore, most GWAS populations are still genotyped using low-coverage genotyping methods resulting in incomplete datasets. Imputation of untyped variants is a powerful method to maximize the number of SNPs identified in study samples, it increases the power and resolution of GWAS and allows to integrate genotyping datasets obtained from various sources. Here, we describe the key concepts underlying imputation of untyped variants, including the architecture of reference panels, and review some of the associated challenges and how these can be addressed. We also discuss the need and available methods to rigorously assess the accuracy of imputed data prior to their use in any genetic study.
Collapse
Affiliation(s)
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, Canada.,Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada.,Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
| | - François Belzile
- Département de Phytologie, Université Laval, Québec City, QC, Canada. .,Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada.
| |
Collapse
|
35
|
Chen L, Pryce JE, Hayes BJ, Daetwyler HD. Investigating the Effect of Imputed Structural Variants from Whole-Genome Sequence on Genome-Wide Association and Genomic Prediction in Dairy Cattle. Animals (Basel) 2021; 11:ani11020541. [PMID: 33669735 PMCID: PMC7922624 DOI: 10.3390/ani11020541] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2021] [Revised: 02/09/2021] [Accepted: 02/12/2021] [Indexed: 02/06/2023] Open
Abstract
Simple Summary Structural variants are large changes to the DNA sequences that differ from individual to individual. We discovered and quality-controlled a set of 24,908 structural variants and used a technique called imputation to infer them into 35,588 Holstein and Jersey cattle. We then investigated whether the structural variants affected key dairy cattle traits such as milk production, fertility and overall conformation. Structural variants explained generally less than 10 percent of the phenotypic variation in these traits. Four of the structural variants were significantly associated with dairy cattle production traits. However, the inclusion of the structural variants in the genomic prediction model did not increase genomic prediction accuracy. Abstract Structural variations (SVs) are large DNA segments of deletions, duplications, copy number variations, inversions and translocations in a re-sequenced genome compared to a reference genome. They have been found to be associated with several complex traits in dairy cattle and could potentially help to improve genomic prediction accuracy of dairy traits. Imputation of SVs was performed in individuals genotyped with single-nucleotide polymorphism (SNP) panels without the expense of sequencing them. In this study, we generated 24,908 high-quality SVs in a total of 478 whole-genome sequenced Holstein and Jersey cattle. We imputed 4489 SVs with R2 > 0.5 into 35,568 Holstein and Jersey dairy cattle with 578,999 SNPs with two pipelines, FImpute and Eagle2.3-Minimac3. Genome-wide association studies for production, fertility and overall type with these 4489 SVs revealed four significant SVs, of which two were highly linked to significant SNP. We also estimated the variance components for SNP and SV models for these traits using genomic best linear unbiased prediction (GBLUP). Furthermore, we assessed the effect on genomic prediction accuracy of adding SVs to GBLUP models. The estimated percentage of genetic variance captured by SVs for production traits was up to 4.57% for milk yield in bulls and 3.53% for protein yield in cows. Finally, no consistent increase in genomic prediction accuracy was observed when including SVs in GBLUP.
Collapse
Affiliation(s)
- Long Chen
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia; (L.C.); (J.E.P.); (B.J.H.)
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
| | - Jennie E. Pryce
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia; (L.C.); (J.E.P.); (B.J.H.)
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
| | - Ben J. Hayes
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia; (L.C.); (J.E.P.); (B.J.H.)
- Queensland Alliance for Agriculture and Food Innovation, Centre for Animal Science, The University of Queensland, St. Lucia, QLD 4067, Australia
| | - Hans D. Daetwyler
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia; (L.C.); (J.E.P.); (B.J.H.)
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
- Correspondence:
| |
Collapse
|
36
|
Y chromosome structural variation in infertile men detected by targeted next-generation sequencing. J Assist Reprod Genet 2021; 38:941-948. [PMID: 33454900 DOI: 10.1007/s10815-020-02031-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2020] [Accepted: 12/08/2020] [Indexed: 01/21/2023] Open
Abstract
PURPOSE To provide a validated method to identify copy number variation (CNV) in regions of the Y chromosome of infertile men by next-generation sequencing (NGS). METHODS Semen analysis was used to determine the quality of semen and diagnose infertility. Deletion of the azoospermia factor (AZF) region in the Y chromosome was detected by a routine sequence-tagged-site PCR (STS-PCR) method. We then used the NGS method to detect CNV in the AZF region, including deletions and duplications. RESULTS A total of 326 samples from male infertility patients, family members, and sperm donors were studied between January 2011 and May 2017. AZF microdeletions were detected in 120 patients by STS-PCR, and these results were consistent with the results from NGS. In addition, of the 160 patients and male family members who had no microdeletions detected by STS-PCR, 51 cases were found to exhibit Y chromosome structural variations by the NGS method (31.88%, 51/160). No microdeletions were found in 46 donors by STS-PCR, but the NGS method revealed 11 of these donors (23.91%, 11/46) carried structural variations, which were mainly in the AZFc region, including partial deletions and duplications. CONCLUSION The established NGS method can replace the conventional STS-PCR method to detect Y chromosome microdeletions. The NGS method can detect CNV, such as partial deletion or duplication, and provide details of the abnormal range and size of variations.
Collapse
|
37
|
Lee YG, Lee JY, Kim J, Kim YJ. Insertion variants missing in the human reference genome are widespread among human populations. BMC Biol 2020; 18:167. [PMID: 33187521 PMCID: PMC7666470 DOI: 10.1186/s12915-020-00894-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2020] [Accepted: 10/09/2020] [Indexed: 01/07/2023] Open
Abstract
Background Structural variants comprise diverse genomic arrangements including deletions, insertions, inversions, and translocations, which can generally be detected in humans through sequence comparison to the reference genome. Among structural variants, insertions are the least frequently identified variants, mainly due to ascertainment bias in the reference genome, lack of previous sequence knowledge, and low complexity of typical insertion sequences. Though recent developments in long-read sequencing deliver promise in annotating individual non-reference insertions, population-level catalogues on non-reference insertion variants have not been identified and the possible functional roles of these hidden variants remain elusive. Results To detect non-reference insertion variants, we developed a pipeline, InserTag, which generates non-reference contigs by local de novo assembly and then infers the full-sequence of insertion variants by tracing contigs from non-human primates and other human genome assemblies. Application of the pipeline to data from 2535 individuals of the 1000 Genomes Project helped identify 1696 non-reference insertion variants and re-classify the variants as retention of ancestral sequences or novel sequence insertions based on the ancestral state. Genotyping of the variants showed that individuals had, on average, 0.92-Mbp sequences missing from the reference genome, 92% of the variants were common (allele frequency > 5%) among human populations, and more than half of the variants were major alleles. Among human populations, African populations were the most divergent and had the most non-reference sequences, which was attributed to the greater prevalence of high-frequency insertion variants. The subsets of insertion variants were in high linkage disequilibrium with phenotype-associated SNPs and showed signals of recent continent-specific selection. Conclusions Non-reference insertion variants represent an important type of genetic variation in the human population, and our developed pipeline, InserTag, provides the frameworks for the detection and genotyping of non-reference sequences missing from human populations. Supplementary information Supplementary information accompanies this paper at 10.1186/s12915-020-00894-1.
Collapse
Affiliation(s)
- Young-Gun Lee
- Department of Integrated Omics for Biomedical Science, WCU Graduate School, Yonsei University, Seoul, Republic of Korea
| | - Jin-Young Lee
- Department of Biochemistry, College of Life Science and Technology, Yonsei University, Seoul, Republic of Korea
| | - Junhyong Kim
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
| | - Young-Joon Kim
- Department of Integrated Omics for Biomedical Science, WCU Graduate School, Yonsei University, Seoul, Republic of Korea. .,Department of Biochemistry, College of Life Science and Technology, Yonsei University, Seoul, Republic of Korea.
| |
Collapse
|
38
|
Crysnanto D, Pausch H. Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery. Genome Biol 2020; 21:184. [PMID: 32718320 PMCID: PMC7385871 DOI: 10.1186/s13059-020-02105-0%0a%0a] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 07/14/2020] [Indexed: 09/28/2023] Open
Abstract
BACKGROUND The current bovine genomic reference sequence was assembled from a Hereford cow. The resulting linear assembly lacks diversity because it does not contain allelic variation, a drawback of linear references that causes reference allele bias. High nucleotide diversity and the separation of individuals by hundreds of breeds make cattle ideally suited to investigate the optimal composition of variation-aware references. RESULTS We augment the bovine linear reference sequence (ARS-UCD1.2) with variants filtered for allele frequency in dairy (Brown Swiss, Holstein) and dual-purpose (Fleckvieh, Original Braunvieh) cattle breeds to construct either breed-specific or pan-genome reference graphs using the vg toolkit. We find that read mapping is more accurate to variation-aware than linear references if pre-selected variants are used to construct the genome graphs. Graphs that contain random variants do not improve read mapping over the linear reference sequence. Breed-specific augmented and pan-genome graphs enable almost similar mapping accuracy improvements over the linear reference. We construct a whole-genome graph that contains the Hereford-based reference sequence and 14 million alleles that have alternate allele frequency greater than 0.03 in the Brown Swiss cattle breed. Our novel variation-aware reference facilitates accurate read mapping and unbiased sequence variant genotyping for SNPs and Indels. CONCLUSIONS We develop the first variation-aware reference graph for an agricultural animal ( https://doi.org/10.5281/zenodo.3759712 ). Our novel reference structure improves sequence read mapping and variant genotyping over the linear reference. Our work is a first step towards the transition from linear to variation-aware reference structures in species with high genetic diversity and many sub-populations.
Collapse
|
39
|
Crysnanto D, Pausch H. Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery. Genome Biol 2020; 21:184. [PMID: 32718320 PMCID: PMC7385871 DOI: 10.1186/s13059-020-02105-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 07/14/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND The current bovine genomic reference sequence was assembled from a Hereford cow. The resulting linear assembly lacks diversity because it does not contain allelic variation, a drawback of linear references that causes reference allele bias. High nucleotide diversity and the separation of individuals by hundreds of breeds make cattle ideally suited to investigate the optimal composition of variation-aware references. RESULTS We augment the bovine linear reference sequence (ARS-UCD1.2) with variants filtered for allele frequency in dairy (Brown Swiss, Holstein) and dual-purpose (Fleckvieh, Original Braunvieh) cattle breeds to construct either breed-specific or pan-genome reference graphs using the vg toolkit. We find that read mapping is more accurate to variation-aware than linear references if pre-selected variants are used to construct the genome graphs. Graphs that contain random variants do not improve read mapping over the linear reference sequence. Breed-specific augmented and pan-genome graphs enable almost similar mapping accuracy improvements over the linear reference. We construct a whole-genome graph that contains the Hereford-based reference sequence and 14 million alleles that have alternate allele frequency greater than 0.03 in the Brown Swiss cattle breed. Our novel variation-aware reference facilitates accurate read mapping and unbiased sequence variant genotyping for SNPs and Indels. CONCLUSIONS We develop the first variation-aware reference graph for an agricultural animal ( https://doi.org/10.5281/zenodo.3759712 ). Our novel reference structure improves sequence read mapping and variant genotyping over the linear reference. Our work is a first step towards the transition from linear to variation-aware reference structures in species with high genetic diversity and many sub-populations.
Collapse
|
40
|
Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C, Buyske S, Matise TC, Muzny DM, Zody MC, Lander ES, Dutcher SK, Stitziel NO, Hall IM. Mapping and characterization of structural variation in 17,795 human genomes. Nature 2020; 583:83-89. [PMID: 32460305 PMCID: PMC7547914 DOI: 10.1038/s41586-020-2371-0] [Citation(s) in RCA: 159] [Impact Index Per Article: 39.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2018] [Accepted: 05/18/2020] [Indexed: 12/18/2022]
Abstract
A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline1 to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.
Collapse
Affiliation(s)
- Haley J Abel
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
| | - David E Larson
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
- Department of Medicine, Washington University School of Medicine, St Louis, MO, USA
| | - Colby Chiang
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
| | - Indraniel Das
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
| | - Krishna L Kanchi
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
| | - Ryan M Layer
- BioFrontiers Institute, University of Colorado, Boulder, CO, USA
- Department of Computer Science, University of Colorado, Boulder, CO, USA
| | - Benjamin M Neale
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - William J Salerno
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Steven Buyske
- Department of Statistics, Rutgers University, Piscataway, NJ, USA
| | - Tara C Matise
- Department of Genetics, Rutgers University, Piscataway, NJ, USA
| | - Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Eric S Lander
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Susan K Dutcher
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
| | - Nathan O Stitziel
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
- Department of Medicine, Washington University School of Medicine, St Louis, MO, USA
| | - Ira M Hall
- McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA.
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA.
- Department of Medicine, Washington University School of Medicine, St Louis, MO, USA.
| |
Collapse
|
41
|
Louzada S, Algady W, Weyell E, Zuccherato LW, Brajer P, Almalki F, Scliar MO, Naslavsky MS, Yamamoto GL, Duarte YAO, Passos-Bueno MR, Zatz M, Yang F, Hollox EJ. Structural variation of the malaria-associated human glycophorin A-B-E region. BMC Genomics 2020; 21:446. [PMID: 32600246 PMCID: PMC7325229 DOI: 10.1186/s12864-020-06849-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Accepted: 06/18/2020] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Approximately 5% of the human genome shows common structural variation, which is enriched for genes involved in the immune response and cell-cell interactions. A well-established region of extensive structural variation is the glycophorin gene cluster, comprising three tandemly-repeated regions about 120 kb in length and carrying the highly homologous genes GYPA, GYPB and GYPE. Glycophorin A (encoded by GYPA) and glycophorin B (encoded by GYPB) are glycoproteins present at high levels on the surface of erythrocytes, and they have been suggested to act as decoy receptors for viral pathogens. They are receptors for the invasion of the protist parasite Plasmodium falciparum, a causative agent of malaria. A particular complex structural variant, called DUP4, creates a GYPB-GYPA fusion gene known to confer resistance to malaria. Many other structural variants exist across the glycophorin gene cluster, and they remain poorly characterised. RESULTS Here, we analyse sequences from 3234 diploid genomes from across the world for structural variation at the glycophorin locus, confirming 15 variants in the 1000 Genomes project cohort, discovering 9 new variants, and characterising a selection of these variants using fibre-FISH and breakpoint mapping at the sequence level. We identify variants predicted to create novel fusion genes and a common inversion duplication variant at appreciable frequencies in West Africans. We show that almost all variants can be explained by non-allelic homologous recombination and by comparing the structural variant breakpoints with recombination hotspot maps, confirm the importance of a particular meiotic recombination hotspot on structural variant formation in this region. CONCLUSIONS We identify and validate large structural variants in the human glycophorin A-B-E gene cluster which may be associated with different clinical aspects of malaria.
Collapse
Affiliation(s)
- Sandra Louzada
- Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Present address: Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro (UTAD), Vila Real, Portugal
- Present address: BioISI - Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, Lisbon, Portugal
| | - Walid Algady
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
| | - Eleanor Weyell
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
| | - Luciana W Zuccherato
- Department of Pathology, Faculty of Medicine, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Paulina Brajer
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
| | - Faisal Almalki
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
| | - Marilia O Scliar
- Human Genome and Stem Cell Research Center, Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | - Michel S Naslavsky
- Human Genome and Stem Cell Research Center, Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | - Guilherme L Yamamoto
- Human Genome and Stem Cell Research Center, Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | - Yeda A O Duarte
- School of Nursing, Universidade de São Paulo, São Paulo, Brazil
| | - Maria Rita Passos-Bueno
- Human Genome and Stem Cell Research Center, Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | - Mayana Zatz
- Human Genome and Stem Cell Research Center, Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | | | - Edward J Hollox
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK.
| |
Collapse
|
42
|
Jakubosky D, Smith EN, D'Antonio M, Jan Bonder M, Young Greenwald WW, D'Antonio-Chronowska A, Matsui H, Stegle O, Montgomery SB, DeBoever C, Frazer KA. Discovery and quality analysis of a comprehensive set of structural variants and short tandem repeats. Nat Commun 2020; 11:2928. [PMID: 32522985 PMCID: PMC7287045 DOI: 10.1038/s41467-020-16481-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2019] [Accepted: 05/05/2020] [Indexed: 02/07/2023] Open
Abstract
Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assemble a set of 719 deep whole genome sequencing (WGS) samples (mean 42×) from 477 distinct individuals which we use to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We use 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and develop a systematic filtering strategy to create of one of the most complete and well characterized maps of SVs and STRs to date.
Collapse
Affiliation(s)
- David Jakubosky
- Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, 92093-0419, USA
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093-0419, USA
| | - Erin N Smith
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA
| | - Matteo D'Antonio
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Marc Jan Bonder
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - William W Young Greenwald
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
| | | | - Hiroko Matsui
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Oliver Stegle
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany
| | - Stephen B Montgomery
- Department of Pathology, Stanford University, Stanford, CA, 94305, USA
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
| | - Christopher DeBoever
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Kelly A Frazer
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA.
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA.
| |
Collapse
|
43
|
Linder RA, Majumder A, Chakraborty M, Long A. Two Synthetic 18-Way Outcrossed Populations of Diploid Budding Yeast with Utility for Complex Trait Dissection. Genetics 2020; 215:323-342. [PMID: 32241804 PMCID: PMC7268983 DOI: 10.1534/genetics.120.303202] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2020] [Accepted: 03/31/2020] [Indexed: 02/07/2023] Open
Abstract
Advanced-generation multiparent populations (MPPs) are a valuable tool for dissecting complex traits, having more power than genome-wide association studies to detect rare variants and higher resolution than F2 linkage mapping. To extend the advantages of MPPs in budding yeast, we describe the creation and characterization of two outbred MPPs derived from 18 genetically diverse founding strains. We carried out de novo assemblies of the genomes of the 18 founder strains, such that virtually all variation segregating between these strains is known, and represented those assemblies as Santa Cruz Genome Browser tracks. We discovered complex patterns of structural variation segregating among the founders, including a large deletion within the vacuolar ATPase VMA1, several different deletions within the osmosensor MSB2, a series of deletions and insertions at PRM7 and the adjacent BSC1, as well as copy number variation at the dehydrogenase ALD2 Resequenced haploid recombinant clones from the two MPPs have a median unrecombined block size of 66 kb, demonstrating that the population is highly recombined. We pool-sequenced the two MPPs to 3270× and 2226× coverage and demonstrated that we can accurately estimate local haplotype frequencies using pooled data. We further downsampled the pool-sequenced data to ∼20-40× and showed that local haplotype frequency estimates remained accurate, with median error rates 0.8 and 0.6% at 20× and 40×, respectively. Haplotypes frequencies are estimated much more accurately than SNP frequencies obtained directly from the same data. Deep sequencing of the two populations revealed that 10 or more founders are present at a detectable frequency for > 98% of the genome, validating the utility of this resource for the exploration of the role of standing variation in the architecture of complex traits.
Collapse
Affiliation(s)
- Robert A Linder
- Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, California 92697-2525
| | - Arundhati Majumder
- Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, California 92697-2525
| | - Mahul Chakraborty
- Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, California 92697-2525
| | - Anthony Long
- Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, California 92697-2525
| |
Collapse
|
44
|
Abstract
A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline1 to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.
Collapse
|
45
|
Collins RL, Brand H, Karczewski KJ, Zhao X, Alföldi J, Francioli LC, Khera AV, Lowther C, Gauthier LD, Wang H, Watts NA, Solomonson M, O'Donnell-Luria A, Baumann A, Munshi R, Walker M, Whelan CW, Huang Y, Brookings T, Sharpe T, Stone MR, Valkanas E, Fu J, Tiao G, Laricchia KM, Ruano-Rubio V, Stevens C, Gupta N, Cusick C, Margolin L, Taylor KD, Lin HJ, Rich SS, Post WS, Chen YDI, Rotter JI, Nusbaum C, Philippakis A, Lander E, Gabriel S, Neale BM, Kathiresan S, Daly MJ, Banks E, MacArthur DG, Talkowski ME. A structural variation reference for medical and population genetics. Nature 2020; 581:444-451. [PMID: 32461652 PMCID: PMC7334194 DOI: 10.1038/s41586-020-2287-8] [Citation(s) in RCA: 516] [Impact Index Per Article: 129.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2019] [Accepted: 03/31/2020] [Indexed: 12/16/2022]
Abstract
Structural variants (SVs) rearrange large segments of DNA1 and can have profound consequences in evolution and human disease2,3. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD)4 have become integral in the interpretation of single-nucleotide variants (SNVs)5. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage6. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings7. This SV resource is freely distributed via the gnomAD browser8 and will have broad utility in population genetics, disease-association studies, and diagnostic screening. A large empirical assessment of sequence-resolved structural variants from 14,891 genomes across diverse global populations in the Genome Aggregation Database (gnomAD) provides a reference map for disease-association studies, population genetics, and diagnostic screening.
Collapse
Affiliation(s)
- Ryan L Collins
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Division of Medical Sciences, Harvard Medical School, Boston, MA, USA
| | - Harrison Brand
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Konrad J Karczewski
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Xuefang Zhao
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Jessica Alföldi
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Laurent C Francioli
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.,Department of Medicine, Harvard Medical School, Boston, MA, USA
| | - Amit V Khera
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Chelsea Lowther
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Laura D Gauthier
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Harold Wang
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Nicholas A Watts
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Matthew Solomonson
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Anne O'Donnell-Luria
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Alexander Baumann
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ruchi Munshi
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Mark Walker
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Yongqing Huang
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ted Brookings
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ted Sharpe
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Matthew R Stone
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Elise Valkanas
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Division of Medical Sciences, Harvard Medical School, Boston, MA, USA
| | - Jack Fu
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Grace Tiao
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Kristen M Laricchia
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | | | - Christine Stevens
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Namrata Gupta
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Caroline Cusick
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Lauren Margolin
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | | | - Kent D Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Henry J Lin
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Stephen S Rich
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Wendy S Post
- Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Yii-Der Ida Chen
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Chad Nusbaum
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Cellarity Inc., Cambridge, MA, USA
| | - Anthony Philippakis
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Eric Lander
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Department of Systems Biology, Harvard Medical School, Boston, MA, USA.,Department of Biology, MIT, Cambridge, MA, USA
| | - Stacey Gabriel
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Benjamin M Neale
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.,Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Sekar Kathiresan
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Department of Medicine, Harvard Medical School, Boston, MA, USA.,Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
| | - Mark J Daly
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.,Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Eric Banks
- Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Daniel G MacArthur
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.,Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.,Department of Medicine, Harvard Medical School, Boston, MA, USA.,Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia.,Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, Australia
| | - Michael E Talkowski
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. .,Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA. .,Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA. .,Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| |
Collapse
|
46
|
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Pangenome Graphs. Annu Rev Genomics Hum Genet 2020; 21:139-162. [PMID: 32453966 DOI: 10.1146/annurev-genom-120219-080406] [Citation(s) in RCA: 100] [Impact Index Per Article: 25.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linearreference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Collapse
Affiliation(s)
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Adam M Novak
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jonas A Sibbesen
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Simon Heumos
- Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
| | - Ali Ghaffaari
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Glenn Hickey
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Xian Chang
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Josiah D Seaman
- Royal Botanic Gardens, Kew, Richmond TW9 3AB, United Kingdom.,School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
| | - Robin Rounthwaite
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Jana Ebler
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Mikko Rautiainen
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School for Computer Science, Saarland University, 66123 Saarbrücken, Germany
| | - Shilpa Garg
- Departments of Genetics and Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, 66123 Saarbrücken, Germany
| | - Jouni Sirén
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| | - Erik Garrison
- Genomics Institute, University of California, Santa Cruz, California 95064, USA;
| |
Collapse
|
47
|
Determining the impact of uncharacterized inversions in the human genome by droplet digital PCR. Genome Res 2020; 30:724-735. [PMID: 32424072 PMCID: PMC7263195 DOI: 10.1101/gr.255273.119] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2019] [Accepted: 04/17/2020] [Indexed: 12/20/2022]
Abstract
Despite the interest in characterizing genomic variation, the presence of large repeats at the breakpoints hinders the analysis of many structural variants. This is especially problematic for inversions, since there is typically no gain or loss of DNA. Here, we tested novel linkage-based droplet digital PCR (ddPCR) assays to study 20 inversions ranging from 3.1 to 742 kb flanked by inverted repeats (IRs) up to 134 kb long. Of those, we validated 13 inversions predicted by different genome-wide techniques. In addition, we obtained new experimental human population information across 95 African, European, and East Asian individuals for 16 inversions, including four already validated variants without high-throughput genotyping methods. Through comparison with previous data, independent replicates and both inversion breakpoints, we demonstrate that the technique is highly accurate and reproducible. Most studied inversions are widespread across continents, and their frequency is negatively correlated with genetic length. Moreover, all except two show clear signs of being recurrent, and we could better define the factors affecting recurrence levels and estimate the inversion rate across the genome. Finally, the generated genotypes have allowed us to check inversion functional effects, validating gene expression differences reported before for two inversions and finding new candidate associations. Therefore, the developed methodology makes it possible to screen these and other complex genomic variants quickly in a large number of samples for the first time, highlighting the importance of direct genotyping to assess their potential consequences and clinical implications.
Collapse
|
48
|
Abstract
Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.
Collapse
Affiliation(s)
- Rachel M Sherman
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
| | - Steven L Salzberg
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
| |
Collapse
|
49
|
Abstract
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
Collapse
Affiliation(s)
- Steve S Ho
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| |
Collapse
|
50
|
Alanko J, Bannai H, Cazaux B, Peterlongo P, Stoye J. Finding all maximal perfect haplotype blocks in linear time. Algorithms Mol Biol 2020; 15:2. [PMID: 32055252 PMCID: PMC7008532 DOI: 10.1186/s13015-020-0163-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2019] [Accepted: 01/28/2020] [Indexed: 11/10/2022] Open
Abstract
Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
Collapse
Affiliation(s)
- Jarno Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Hideo Bannai
- Department of Informatics, Kyushu University, Fukuoka, Japan
| | - Bastien Cazaux
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | | | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
| |
Collapse
|