1
Schreiber M, Jayakodi M, Stein N, Mascher M. Plant pangenomes for crop improvement, biodiversity and evolution. Nat Rev Genet 2024; 25:563-577. [PMID: 38378816 DOI: 10.1038/s41576-024-00691-4] [Accepted: 12/14/2023] [Indexed: 02/22/2024]
Abstract
Plant genome sequences catalogue genes and the genetic elements that regulate their expression. Such inventories further research aims as diverse as mapping the molecular basis of trait diversity in domesticated plants or inquiries into the origin of evolutionary innovations in flowering plants millions of years ago. The transformative technological progress of DNA sequencing in the past two decades has enabled researchers to sequence ever more genomes with greater ease. Pangenomes - complete sequences of multiple individuals of a species or higher taxonomic unit - have now entered the geneticists' toolkit. The genomes of crop plants and their wild relatives are being studied with translational applications in breeding in mind. But pangenomes are applicable also in ecological and evolutionary studies, as they help classify and monitor biodiversity across the tree of life, deepen our understanding of how plant species diverged and show how plants adapt to changing environments or new selection pressures exerted by human beings.
Affiliation(s)
- Mona Schreiber
  - Department of Biology, University of Marburg, Marburg, Germany
- Murukarthick Jayakodi
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Nils Stein
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
  - Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany.
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
2
Lysenkova Wiklander M, Arvidsson G, Bunikis I, Lundmark A, Raine A, Marincevic-Zuniga Y, Gezelius H, Bremer A, Feuk L, Ameur A, Nordlund J. A multiomic characterization of the leukemia cell line REH using short- and long-read sequencing. Life Sci Alliance 2024; 7:e202302481. [PMID: 38777370 PMCID: PMC11111970 DOI: 10.26508/lsa.202302481] [Received: 11/14/2023] [Revised: 05/02/2024] [Accepted: 05/02/2024] [Indexed: 05/25/2024] Open
Abstract
The B-cell acute lymphoblastic leukemia (ALL) cell line REH, with the t(12;21) ETV6::RUNX1 translocation, is known to have a complex karyotype defined by a series of large-scale chromosomal rearrangements. Taken from a 15-year-old at relapse, the cell line offers a practical model for the study of pediatric B-ALL. In recent years, short- and long-read DNA and RNA sequencing have emerged as a complement to karyotyping techniques in the resolution of structural variants in an oncological context. Here, we explore the integration of long-read PacBio and Oxford Nanopore whole-genome sequencing, IsoSeq RNA sequencing, and short-read Illumina sequencing to create a detailed genomic and transcriptomic characterization of the REH cell line. Whole-genome sequencing clarified the molecular traits of disrupted ALL-associated genes including CDKN2A, PAX5, BTG1, VPREB1, and TBL1XR1, as well as the glucocorticoid receptor NR3C1. Meanwhile, transcriptome sequencing identified seven fusion genes within the genomic breakpoints. Together, our extensive whole-genome investigation makes high-quality open-source data available to the leukemia genomics community.
Affiliation(s)
- Mariya Lysenkova Wiklander
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
- Gustav Arvidsson
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
- Ignas Bunikis
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Anders Lundmark
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
- Amanda Raine
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Yanara Marincevic-Zuniga
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Henrik Gezelius
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Anna Bremer
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
  - https://ror.org/01apvbh93 Department of Clinical Genetics, Uppsala University Hospital, Uppsala, Sweden
- Lars Feuk
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Adam Ameur
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
- Jessica Nordlund
  - https://ror.org/048a87296 Department of Medical Sciences, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 SciLifeLab, Uppsala University, Uppsala, Sweden
  - https://ror.org/048a87296 National Genomics Infrastructure, Uppsala University, Uppsala, Sweden
3
Yu Z, Coorens THH, Uddin MM, Ardlie KG, Lennon N, Natarajan P. Genetic variation across and within individuals. Nat Rev Genet 2024; 25:548-562. [PMID: 38548833 DOI: 10.1038/s41576-024-00709-x] [Accepted: 02/09/2024] [Indexed: 04/12/2024]
Abstract
Germline variation and somatic mutation are intricately connected and together shape human traits and disease risks. Germline variants are present from conception, but they vary between individuals and accumulate over generations. By contrast, somatic mutations accumulate throughout life in a mosaic manner within an individual due to intrinsic and extrinsic sources of mutations and selection pressures acting on cells. Recent advancements, such as improved detection methods and increased resources for association studies, have drastically expanded our ability to investigate germline and somatic genetic variation and compare underlying mutational processes. A better understanding of the similarities and differences in the types, rates and patterns of germline and somatic variants, as well as their interplay, will help elucidate the mechanisms underlying their distinct yet interlinked roles in human health and biology.
Affiliation(s)
- Zhi Yu
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Cardiovascular Research Center and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Md Mesbah Uddin
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Cardiovascular Research Center and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Niall Lennon
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Pradeep Natarajan
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Cardiovascular Research Center and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
  - Department of Medicine, Harvard Medical School, Boston, MA, USA.
4
Kolesnikov A, Cook D, Nattestad M, Brambrink L, McNulty B, Gorzynski J, Goenka S, Ashley EA, Jain M, Miga KH, Paten B, Chang PC, Carroll A, Shafin K. Local read haplotagging enables accurate long-read small variant calling. Nat Commun 2024; 15:5907. [PMID: 39003259 PMCID: PMC11246426 DOI: 10.1038/s41467-024-50079-5] [Received: 09/20/2023] [Accepted: 06/28/2024] [Indexed: 07/15/2024] Open
Abstract
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance with multiple sequencing platforms, including the PacBio Revio system and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation simplifies long-read variant calling with DeepVariant.
Affiliation(s)
- Daniel Cook
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Brandy McNulty
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Miten Jain
  - Northeastern University, Boston, MA, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Pi-Chuan Chang
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Andrew Carroll
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA.
- Kishwar Shafin
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA.
5
Wang M, Huang Y, Liu K, Wang Z, Zhang M, Yuan H, Duan S, Wei L, Yao H, Sun Q, Zhong J, Tang R, Chen J, Sun Y, Li X, Su H, Yang Q, Hu L, Yun L, Yang J, Nie S, Cai Y, Yan J, Zhou K, Wang C, Zhu B, Liu C, He G. Multiple Human Population Movements and Cultural Dispersal Events Shaped the Landscape of Chinese Paternal Heritage. Mol Biol Evol 2024; 41:msae122. [PMID: 38885310 PMCID: PMC11232699 DOI: 10.1093/molbev/msae122] [Received: 08/29/2023] [Revised: 05/30/2024] [Accepted: 06/13/2024] [Indexed: 06/20/2024] Open
Abstract
Large-scale genomic projects and ancient DNA innovations have ushered in a new paradigm for exploring human evolutionary history. However, the genetic legacy of spatiotemporally diverse ancient Eurasians within Chinese paternal lineages remains unresolved. Here, we report an integrated Y-chromosome genomic database encompassing 15,563 individuals from both modern and ancient Eurasians, including 919 newly reported individuals, to investigate the Chinese paternal genomic diversity. The high-resolution, time-stamped phylogeny reveals multiple diversification events and extensive expansions in the early and middle Neolithic. We identify four major ancient population movements, each associated with technological innovations that have shaped the Chinese paternal landscape. First, the expansion of early East Asians and millet farmers from the Yellow River Basin predominantly carrying O2/D subclades significantly influenced the formation of the Sino-Tibetan people and facilitated the permanent settlement of the Tibetan Plateau. Second, the dispersal of rice farmers from the Yangtze River Valley carrying O1 and certain O2 sublineages reshapes the genetic makeup of southern Han Chinese, as well as the Tai-Kadai, Austronesian, Hmong-Mien, and Austroasiatic people. Third, the Neolithic Siberian Q/C paternal lineages originated and proliferated among hunter-gatherers on the Mongolian Plateau and the Amur River Basin, leaving a significant imprint on the gene pools of northern China. Fourth, the J/G/R paternal lineages derived from western Eurasia, which were initially spread by Yamnaya-related steppe pastoralists, maintain their presence primarily in northwestern China. Overall, our research provides comprehensive genetic evidence elucidating the significant impact of interactions with culturally distinct ancient Eurasians on the patterns of paternal diversity in modern Chinese populations.
Affiliation(s)
- Mengge Wang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - Center for Archaeological Science, Sichuan University, Chengdu 610000, China
  - Faculty of Forensic Medicine, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Yuguo Huang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
- Kaijun Liu
  - School of International Tourism and Culture, Guizhou Normal University, Guiyang 550025, China
  - MoFang Human Genome Research Institute, Tianfu Software Park, Chengdu, Sichuan 610042, China
- Zhiyong Wang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Forensic Medicine, Kunming Medical University, Kunming 650500, China
- Menghan Zhang
  - Institute of Modern Languages and Linguistics, Fudan University, Shanghai 200433, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Haibing Yuan
  - Center for Archaeological Science, Sichuan University, Chengdu 610000, China
- Shuhan Duan
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Basic Medical Sciences, North Sichuan Medical College, Nanchong 637100, China
- Lanhai Wei
  - School of Ethnology and Anthropology, Institute of Humanities and Human Sciences, Inner Mongolia Normal University, Hohhot 010022, China
- Hongbing Yao
  - Belt and Road Research Center for Forensic Molecular Anthropology, Gansu University of Political Science and Law, Lanzhou 730000, China
- Qiuxia Sun
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - Department of Forensic Medicine, College of Basic Medicine, Chongqing Medical University, Chongqing 400331, China
- Jie Zhong
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
- Renkuan Tang
  - Department of Forensic Medicine, College of Basic Medicine, Chongqing Medical University, Chongqing 400331, China
- Jing Chen
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Forensic Medicine, Shanxi Medical University, Jinzhong 030001, China
- Yuntao Sun
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
- Xiangping Li
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Forensic Medicine, Kunming Medical University, Kunming 650500, China
- Haoran Su
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Laboratory Medicine and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China
- Qingxin Yang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - School of Forensic Medicine, Kunming Medical University, Kunming 650500, China
- Liping Hu
  - School of Forensic Medicine, Kunming Medical University, Kunming 650500, China
- Libing Yun
  - Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
- Junbao Yang
  - Institute of Basic Medicine and Forensic Medicine, North Sichuan Medical College and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China
- Shengjie Nie
  - School of Forensic Medicine, Kunming Medical University, Kunming 650500, China
- Yan Cai
  - School of Laboratory Medicine and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China
- Jiangwei Yan
  - School of Forensic Medicine, Shanxi Medical University, Jinzhong 030001, China
- Kun Zhou
  - MoFang Human Genome Research Institute, Tianfu Software Park, Chengdu, Sichuan 610042, China
- Chuanchao Wang
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen 361005, China
- Bofeng Zhu
  - Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong 510515, China
- Chao Liu
  - Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China
  - Anti-Drug Technology Center of Guangdong Province, Guangzhou 510230, China
- Guanglin He
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China
  - Center for Archaeological Science, Sichuan University, Chengdu 610000, China
6
Di Maio S, Zöscher P, Weissensteiner H, Forer L, Schachtl-Riess JF, Amstler S, Streiter G, Pfurtscheller C, Paulweber B, Kronenberg F, Coassin S, Schönherr S. Resolving intra-repeat variation in medically relevant VNTRs from short-read sequencing data using the cardiovascular risk gene LPA as a model. Genome Biol 2024; 25:167. [PMID: 38926899 PMCID: PMC11201333 DOI: 10.1186/s13059-024-03316-5] [Received: 06/28/2023] [Accepted: 06/18/2024] [Indexed: 06/28/2024] Open
Abstract
BACKGROUND: Variable number tandem repeats (VNTRs) are highly polymorphic DNA regions harboring many potentially disease-causing variants. However, VNTRs often appear unresolved ("dark") in variation databases due to their repetitive nature. One particularly complex and medically relevant VNTR is the KIV-2 VNTR located in the cardiovascular disease gene LPA, which encompasses up to 70% of the coding sequence. RESULTS: Using the highly complex LPA gene as a model, we develop a computational approach to resolve intra-repeat variation in VNTRs from widely available short-read sequencing data. We apply the approach to six protein-coding VNTRs in 2504 samples from the 1000 Genomes Project and develop an optimized method for the LPA KIV-2 VNTR that discriminates the confounding KIV-2 subtypes upfront. This results in an F1-score improvement of up to 2.1-fold compared to previously published strategies. Finally, we analyze the LPA VNTR in > 199,000 UK Biobank samples, detecting > 700 KIV-2 mutations. This approach reveals new strong Lp(a)-lowering effects for KIV-2 variants, with a protective effect against coronary artery disease, and also validates previous findings based on tagging SNPs. CONCLUSIONS: Our approach paves the way for reliable variant detection in VNTRs at scale, and we show that it is transferable to other dark regions, which will help unlock medical information hidden in VNTRs.
Affiliation(s)
- Silvia Di Maio
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Peter Zöscher
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Hansi Weissensteiner
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Lukas Forer
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Stephan Amstler
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Gertraud Streiter
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Cathrin Pfurtscheller
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Bernhard Paulweber
  - Department of Internal Medicine I, Paracelsus Medical University/Salzburger Landeskliniken, Salzburg, Austria
- Florian Kronenberg
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Stefan Coassin
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
- Sebastian Schönherr
  - Institute of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria.
7
Hemstrom W, Grummer JA, Luikart G, Christie MR. Next-generation data filtering in the genomics era. Nat Rev Genet 2024:10.1038/s41576-024-00738-6. [PMID: 38877133 DOI: 10.1038/s41576-024-00738-6] [Accepted: 04/25/2024] [Indexed: 06/16/2024]
Abstract
Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering - removing sequencing bases, reads, genetic variants and/or individuals from a dataset - to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy-Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
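To make the filter types named in this abstract concrete, here is a minimal Python sketch of two of them: a minor allele frequency (MAF) filter and a per-locus missing-data filter. It is not taken from the review. The function names, the genotype encoding (alternate-allele counts 0, 1 or 2 for diploid samples, with `None` for a missing call) and the default thresholds are all illustrative assumptions.

```python
from typing import Dict, List, Optional

Genotypes = List[Optional[int]]  # assumed encoding: alt-allele count per sample, None = missing


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Return the minor allele frequency among called diploid genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def missing_rate(genotypes: Genotypes) -> float:
    """Return the fraction of samples with no genotype call at this locus."""
    if not genotypes:
        return 1.0
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_loci(
    loci: Dict[str, Genotypes],
    min_maf: float = 0.05,      # hypothetical threshold, not a recommendation
    max_missing: float = 0.2,   # hypothetical threshold, not a recommendation
) -> Dict[str, Genotypes]:
    """Keep loci that pass both the missing-data and the MAF threshold."""
    return {
        name: genotypes
        for name, genotypes in loci.items()
        if missing_rate(genotypes) <= max_missing
        and minor_allele_frequency(genotypes) >= min_maf
    }


if __name__ == "__main__":
    example = {
        "snp1": [0, 1, 2, 1, 0],           # common variant, fully called
        "snp2": [0, 0, 0, 0, 0],           # monomorphic
        "snp3": [None, None, 1, 0, 2],     # many missing calls
    }
    print(sorted(filter_loci(example)))
```

Because the thresholds are explicit parameters, the same dataset can be re-filtered at several cut-offs to see how much a downstream statistic moves, which is the kind of sensitivity the abstract describes.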
Affiliation(s)
- William Hemstrom
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
- Jared A Grummer
  - Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA
- Gordon Luikart
  - Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA
- Mark R Christie
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
  - Department of Forestry and Natural Resources, Purdue University, West Lafayette, IN, USA.
8
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
9
English AC, Dolzhenko E, Ziaei Jam H, McKenzie SK, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol 2024:10.1038/s41587-024-02225-z. [PMID: 38671154 DOI: 10.1038/s41587-024-02225-z] [Received: 10/30/2023] [Accepted: 03/28/2024] [Indexed: 04/28/2024]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Affiliation(s)
- Adam C English
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Helyaneh Ziaei Jam
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Wouter De Coster
  - Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
  - Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Jonghun Park
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Bida Gu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Melissa Gymrek
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
  - Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Mark J P Chaisson
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
  - Department of Computer Science, Rice University, Houston, TX, USA.
10
Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. bioRxiv 2024:2024.04.15.589602. [PMID: 38659907 PMCID: PMC11042298 DOI: 10.1101/2024.04.15.589602] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
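The Mendelian Inheritance Error (MIE) rate that this abstract reports can be illustrated with a short Python sketch. It is not TrioTrain's or DeepVariant's code. It assumes diploid, autosomal genotypes given as pairs of allele indices with no missing calls, and it reports the rate as a fraction rather than a percentage. All names are hypothetical.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]                    # assumed: unphased pair of allele indices
Trio = Tuple[Genotype, Genotype, Genotype]    # (child, father, mother) at one site


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other from the mother."""
    first, second = child
    return (first in father and second in mother) or (second in father and first in mother)


def mie_rate(trio_sites: Iterable[Trio]) -> float:
    """Fraction of trio sites whose child genotype violates Mendelian inheritance."""
    sites = list(trio_sites)
    if not sites:
        return 0.0
    errors = sum(
        not is_mendelian_consistent(child, father, mother)
        for child, father, mother in sites
    )
    return errors / len(sites)


if __name__ == "__main__":
    sites = [
        ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from father, 1 from mother
        ((1, 1), (0, 0), (0, 1)),  # error: father cannot transmit allele 1
        ((0, 0), (0, 1), (0, 1)),  # consistent
        ((0, 1), (0, 0), (0, 0)),  # error: neither parent carries allele 1
    ]
    print(f"MIE rate: {mie_rate(sites):.2f}")
```

A real pipeline would also need to handle missing genotypes, sex chromosomes and genuine de novo mutations, which this sketch deliberately ignores.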
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA
11
Kosugi S, Terao C. Comparative evaluation of SNVs, indels, and structural variations detected with short- and long-read sequencing data. Hum Genome Var 2024; 11:18. [PMID: 38632226 PMCID: PMC11024196 DOI: 10.1038/s41439-024-00276-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/12/2024] [Accepted: 03/20/2024] [Indexed: 04/19/2024] Open
Abstract
Short- and long-read sequencing technologies are routinely used to detect DNA variants, including SNVs, indels, and structural variations (SVs). However, the differences in the quality and quantity of variants detected between short- and long-read data are not fully understood. In this study, we comprehensively evaluated the variant calling performance of short- and long-read-based SNV, indel, and SV detection algorithms (6 for SNVs, 12 for indels, and 13 for SVs) using a novel evaluation framework incorporating manual visual inspection. The results showed that indel-insertion calls greater than 10 bp were poorly detected by short-read-based detection algorithms compared to long-read-based algorithms; however, the recall and precision of SNV and indel-deletion detection were similar between short- and long-read data. The recall of SV detection with short-read-based algorithms was significantly lower in repetitive regions, especially for small- to intermediate-sized SVs, than that detected with long-read-based algorithms. In contrast, the recall and precision of SV detection in nonrepetitive regions were similar between short- and long-read data. These findings suggest the need for refined strategies, such as incorporating multiple variant detection algorithms, to generate a more complete set of variants using short-read data.
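The recall and precision figures discussed in this abstract can be illustrated with a short Python sketch. It scores a call set against a truth set by exact matching on (chromosome, position, reference allele, alternate allele). Exact matching is a simplifying assumption made here; the study itself used an evaluation framework with manual visual inspection. All names and variants below are hypothetical.

```python
def precision_recall_f1(called, truth):
    """Score a call set against a truth set using exact variant-key matching."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # calls confirmed by the truth set
    false_pos = len(called - truth)   # calls absent from the truth set
    false_neg = len(truth - called)   # truth variants that were missed
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: one SNV is missed-insertion-free, one insertion is missed, one call is spurious.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "T", "TCA"),   # insertion missed by the caller below
    ("chr2", 40, "GAT", "G"),
    ("chr2", 900, "C", "T"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr2", 40, "GAT", "G"),
    ("chr2", 900, "C", "T"),
    ("chr3", 5, "G", "A"),       # false positive
}

precision, recall, f1 = precision_recall_f1(calls, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Stratifying the inputs by variant type, size or repeat context before scoring gives the kind of per-category comparison the abstract describes.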
Affiliation(s)
- Shunichi Kosugi
  - Center for Genome Informatics, Research Organization of Information and Systems, Joint Support-Center for Data Science Research, Shizuoka, Japan.
  - Advanced Genomics Center, National Institute of Genetics, Shizuoka, Japan.
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan.
- Chikashi Terao
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
  - The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
12
Krenn M, Wagner M, Zulehner G, Weng R, Jäger F, Keritam O, Sener M, Brücke C, Milenkovic I, Langer A, Buchinger D, Habersam R, Mayerhanser K, Brugger M, Brunet T, Jacob M, Graf E, Berutti R, Cetin H, Hoefele J, Winkelmann J, Zimprich F, Rath J. Next-generation sequencing and comprehensive data reassessment in 263 adult patients with neuromuscular disorders: insights into the gray zone of molecular diagnoses. J Neurol 2024; 271:1937-1946. [PMID: 38127101 PMCID: PMC10972933 DOI: 10.1007/s00415-023-12101-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 11/03/2023] [Accepted: 11/04/2023] [Indexed: 12/23/2023]
Abstract
BACKGROUND: Neuromuscular disorders (NMDs) are heterogeneous conditions with a considerable fraction attributed to monogenic defects. Despite the advancements in genomic medicine, many patients remain without a diagnosis. Here, we investigate whether a comprehensive reassessment strategy improves the diagnostic outcomes.
METHODS: We analyzed 263 patients with NMD phenotypes that underwent diagnostic exome or genome sequencing at our tertiary referral center between 2015 and 2023. We applied a comprehensive reassessment encompassing variant reclassification, re-phenotyping and NGS data reanalysis. Multivariable logistic regression was performed to identify predictive factors associated with a molecular diagnosis.
RESULTS: Initially, a molecular diagnosis was identified in 53 cases (20%), while an additional 23 (9%) had findings of uncertain significance. Following comprehensive reassessment, the diagnostic yield increased to 23%, revealing 44 distinct monogenic etiologies. Reasons for newly obtained molecular diagnoses were variant reclassifications in 7 and NGS data reanalysis in 3 cases including one recently described disease-gene association (DNAJB4). Male sex reduced the odds of receiving a molecular diagnosis (OR 0.42; 95%CI 0.21-0.82), while a positive family history (OR 5.46; 95%CI 2.60-11.76) and a myopathy phenotype (OR 2.72; 95%CI 1.11-7.14) increased the likelihood. 7% were resolved through targeted genetic testing or classified as acquired etiologies.
CONCLUSION: Our findings reinforce the use of NGS in NMDs of suspected monogenic origin. We show that a comprehensive reassessment enhances diagnostic accuracy. However, one needs to be aware that genetic diagnoses are often made with uncertainty and can even be downgraded based on new evidence.
Affiliation(s)
- Martin Krenn
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Matias Wagner
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
  - Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany
- Gudrun Zulehner
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Rosa Weng
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Fiona Jäger
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Omar Keritam
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Merve Sener
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Christof Brücke
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Ivan Milenkovic
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Agnes Langer
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Dominic Buchinger
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Richard Habersam
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Katharina Mayerhanser
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
- Melanie Brugger
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
- Theresa Brunet
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
  - Department of Pediatric Neurology, Developmental Medicine and Social Pediatrics, Dr. Von Hauner's Children's Hospital, University of Munich, Munich, Germany
- Maureen Jacob
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
- Elisabeth Graf
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
- Riccardo Berutti
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
  - Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany
- Hakan Cetin
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Julia Hoefele
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
- Juliane Winkelmann
  - Institute of Human Genetics, Klinikum Rechts Der Isar, School of Medicine, Technical University of Munich, Munich, Germany
  - Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany
- Fritz Zimprich
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria
- Jakob Rath
  - Department of Neurology, Medical University of Vienna, Waehringer Guertel 18-20, 1090, Vienna, Austria.
  - Comprehensive Center for Clinical Neurosciences and Mental Health, Medical University of Vienna, Vienna, Austria.
13
Ergun MA, Cinal O, Bakışlı B, Emül AA, Baysan M. COSAP: Comparative Sequencing Analysis Platform. BMC Bioinformatics 2024; 25:130. [PMID: 38532317 DOI: 10.1186/s12859-024-05756-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 03/20/2024] [Indexed: 03/28/2024] Open
Abstract
BACKGROUND: Recent improvements in sequencing technologies enabled detailed profiling of genomic features. These technologies mostly rely on short reads which are merged and compared to reference genome for variant identification. These operations should be done with computers due to the size and complexity of the data. The need for analysis software resulted in many programs for mapping, variant calling and annotation steps. Currently, most programs are either expensive enterprise software with proprietary code which makes access and verification very difficult or open-access programs that are mostly based on command-line operations without user interfaces and extensive documentation. Moreover, a high level of disagreement is observed among popular mapping and variant calling algorithms in multiple studies, which makes relying on a single algorithm unreliable. User-friendly open-source software tools that offer comparative analysis are an important need considering the growth of sequencing technologies.
RESULTS: Here, we propose Comparative Sequencing Analysis Platform (COSAP), an open-source platform that provides popular sequencing algorithms for SNV, indel, structural variant calling, copy number variation, microsatellite instability and fusion analysis and their annotations. COSAP is packed with a fully functional user-friendly web interface and a backend server which allows full independent deployment for both individual and institutional scales. COSAP is developed as a workflow management system and designed to enhance cooperation among scientists with different backgrounds. It is publicly available at https://cosap.bio and https://github.com/MBaysanLab/cosap/. The source code of the frontend and backend services can be found at https://github.com/MBaysanLab/cosap-webapi/ and https://github.com/MBaysanLab/cosap_frontend/ respectively. All services are packed as Docker containers as well. Pipelines that combine algorithms can be customized and new algorithms can be added with minimal coding through modular structure.
CONCLUSIONS: COSAP simplifies and speeds up the process of DNA sequencing analyses providing commonly used algorithms for SNV, indel, structural variant calling, copy number variation, microsatellite instability and fusion analysis as well as their annotations. COSAP is packed with a fully functional user-friendly web interface and a backend server which allows full independent deployment for both individual and institutional scales. Standardized implementations of popular algorithms in a modular platform make comparisons much easier to assess the impact of alternative pipelines which is crucial in establishing reproducibility of sequencing analyses.
Affiliation(s)
- Mehmet Arif Ergun
  - Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey
- Omer Cinal
  - Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey
- Berkant Bakışlı
  - Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey
- Abdullah Asım Emül
  - Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey
- Mehmet Baysan
  - Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey.
14
Paulin LF, Fan J, O'Neill K, Pleasance E, Porter VL, Jones SJM, Sedlazeck FJ. The benefit of a complete reference genome for cancer structural variant analysis. medRxiv 2024:2024.03.15.24304369. [PMID: 38562786 PMCID: PMC10984048 DOI: 10.1101/2024.03.15.24304369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
The complexities of cancer genomes are becoming more easily interpreted due to advancements in sequencing technologies and improved bioinformatic analysis. Structural variants (SVs) represent an important subset of somatic events in tumors. While detection of SVs has been markedly improved by the development of long-read sequencing, somatic variant identification and annotation remains challenging. We hypothesized that use of a completed human reference genome (CHM13-T2T) would improve somatic SV calling. Our findings in a tumour/normal matched benchmark sample and two patient samples show that the CHM13-T2T improves SV detection and prioritization accuracy compared to GRCh38, with a notable reduction in false positive calls. We also overcame the lack of annotation resources for CHM13-T2T by lifting over CHM13-T2T-aligned reads to the GRCh38 genome, therefore combining both improved alignment and advanced annotations. In this process, we assessed the current SV benchmark set for COLO829/COLO829BL across four replicates sequenced at different centers with different long-read technologies. We discovered instability of this cell line across these replicates; 346 SVs (1.13%) were only discoverable in a single replicate. We identify 49 somatic SVs, which appear to be stable as they are consistently present across the four replicates. As such, we propose this consensus set as an updated benchmark for somatic SV calling and include both GRCh38 and CHM13-T2T coordinates in our benchmark. The benchmark is available at: 10.5281/zenodo.10819636. Our work demonstrates new approaches to optimize somatic SV prioritization in cancer with potential improvements in other genetic diseases.
Affiliation(s)
- Luis F Paulin
  - Human Genome Sequencing Center Baylor College of Medicine, Houston, TX, USA
- Jeremy Fan
  - Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC, Canada
- Kieran O'Neill
  - Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC, Canada
- Erin Pleasance
  - Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC, Canada
- Vanessa L Porter
  - Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
- Steven J M Jones
  - Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Fritz J Sedlazeck
  - Human Genome Sequencing Center Baylor College of Medicine, Houston, TX, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
15
Brlek P, Bulić L, Bračić M, Projić P, Škaro V, Shah N, Shah P, Primorac D. Implementing Whole Genome Sequencing (WGS) in Clinical Practice: Advantages, Challenges, and Future Perspectives. Cells 2024; 13:504. [PMID: 38534348 DOI: 10.3390/cells13060504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 03/04/2024] [Accepted: 03/11/2024] [Indexed: 03/28/2024] Open
Abstract
The integration of whole genome sequencing (WGS) into all aspects of modern medicine represents the next step in the evolution of healthcare. Using this technology, scientists and physicians can observe the entire human genome comprehensively, generating a plethora of new sequencing data. Modern computational analysis entails advanced algorithms for variant detection, as well as complex models for classification. Data science and machine learning play a crucial role in the processing and interpretation of results, using enormous databases and statistics to discover new and support current genotype-phenotype correlations. In clinical practice, this technology has greatly enabled the development of personalized medicine, approaching each patient individually and in accordance with their genetic and biochemical profile. The most propulsive areas include rare disease genomics, oncogenomics, pharmacogenomics, neonatal screening, and infectious disease genomics. Another crucial application of WGS lies in the field of multi-omics, working towards the complete integration of human biomolecular data. Further technological development of sequencing technologies has led to the birth of third and fourth-generation sequencing, which include long-read sequencing, single-cell genomics, and nanopore sequencing. These technologies, alongside their continued implementation into medical research and practice, show great promise for the future of the field of medicine.
Affiliation(s)
- Petar Brlek
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
  - School of Medicine, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
- Luka Bulić
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
- Matea Bračić
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
- Petar Projić
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
- Nidhi Shah
  - Dartmouth Hitchcock Medical Center, Lebanon, NH 03766, USA
- Parth Shah
  - Dartmouth Hitchcock Medical Center, Lebanon, NH 03766, USA
- Dragan Primorac
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
  - School of Medicine, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
  - Medical School, University of Split, 21000 Split, Croatia
  - Eberly College of Science, The Pennsylvania State University, State College, PA 16802, USA
  - The Henry C. Lee College of Criminal Justice and Forensic Sciences, University of New Haven, West Haven, CT 06516, USA
  - REGIOMED Kliniken, 96450 Coburg, Germany
  - Medical School, University of Rijeka, 51000 Rijeka, Croatia
  - Faculty of Dental Medicine and Health, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
  - Medical School, University of Mostar, 88000 Mostar, Bosnia and Herzegovina
  - National Forensic Sciences University, Gujarat 382007, India
16
Song B, Buckler ES, Stitzer MC. New whole-genome alignment tools are needed for tapping into plant diversity. Trends Plant Sci 2024; 29:355-369. [PMID: 37749022 DOI: 10.1016/j.tplants.2023.08.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 07/19/2023] [Accepted: 08/23/2023] [Indexed: 09/27/2023]
Abstract
Genome alignment is one of the most foundational methods for genome sequence studies. With rapid advances in sequencing and assembly technologies, these newly assembled genomes present challenges for alignment tools to meet the increased complexity and scale. Plant genome alignment is technologically challenging because of frequent whole-genome duplications (WGDs) as well as chromosome rearrangements and fractionation, high nucleotide diversity, widespread structural variation, and high transposable element (TE) activity causing large proportions of repeat elements. We summarize classical pairwise and multiple genome alignment (MGA) methods, and highlight techniques that are widely used or are being developed by the plant research community. We also outline the remaining challenges for precise genome alignment and the interpretation of alignment results in plants.
Affiliation(s)
- Baoxing Song
  - National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong 261325, China; Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi 712100, China.
- Edward S Buckler
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA; Agricultural Research Service, United States Department of Agriculture, Ithaca, NY 14853, USA
- Michelle C Stitzer
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA.
17
Zheng Z, Zhu M, Zhang J, Liu X, Hou L, Liu W, Yuan S, Luo C, Yao X, Liu J, Yang Y. A sequence-aware merger of genomic structural variations at population scale. Nat Commun 2024; 15:960. [PMID: 38307885 PMCID: PMC10837428 DOI: 10.1038/s41467-024-45244-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 01/18/2024] [Indexed: 02/04/2024] Open
Abstract
Merging structural variations (SVs) at the population level presents a significant challenge, yet it is essential for conducting comprehensive genotypic analyses, especially in the era of pangenomics. Here, we introduce PanPop, a tool that utilizes an advanced sequence-aware SV merging algorithm to efficiently merge SVs of various types. We demonstrate that PanPop can merge and optimize the majority of multiallelic SVs into informative biallelic variants. We show its superior precision and lower rates of missing data compared to alternative software solutions. Our approach not only enables the filtering of SVs by leveraging multiple SV callers for enhanced accuracy but also facilitates the accurate merging of large-scale population SVs. These capabilities of PanPop will help to accelerate future SV-related studies.
Affiliation(s)
- Zeyu Zheng
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Mingjia Zhu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jin Zhang
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinfeng Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Liqiang Hou
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Wenyu Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Shuai Yuan
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Changhong Luo
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Xinhao Yao
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China
- Jianquan Liu
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China.
- Yongzhi Yang
  - State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Ecology, Lanzhou University, Lanzhou, China.
18
Charron P, Kang M. VariantDetective: an accurate all-in-one pipeline for detecting consensus bacterial SNPs and SVs. Bioinformatics 2024; 40:btae066. [PMID: 38366603 PMCID: PMC10898327 DOI: 10.1093/bioinformatics/btae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 01/16/2024] [Accepted: 02/14/2024] [Indexed: 02/18/2024] Open
Abstract
MOTIVATION: Genomic variations comprise a spectrum of alterations, ranging from single nucleotide polymorphisms (SNPs) to large-scale structural variants (SVs), which play crucial roles in bacterial evolution and species diversification. Accurately identifying SNPs and SVs is beneficial for subsequent evolutionary and epidemiological studies. This study presents VariantDetective (VD), a novel, user-friendly, and all-in-one pipeline combining SNP and SV calling to generate consensus genomic variants using multiple tools.
RESULTS: The VD pipeline accepts various file types as input to initiate SNP and/or SV calling, and benchmarking results demonstrate VD's robustness and high accuracy across multiple tested datasets when compared to existing variant calling approaches.
AVAILABILITY AND IMPLEMENTATION: The source code, test data, and relevant information for VD are freely accessible at https://github.com/OLF-Bioinformatics/VariantDetective under the MIT License.
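The idea of a consensus call set built from multiple tools can be sketched in a few lines of Python. The example below keeps a variant only when at least `min_support` callers report it. This voting rule, the caller names and the variant keys are assumptions made for illustration; the abstract does not describe how VariantDetective itself combines calls, so this should not be read as its implementation.

```python
from collections import Counter


def consensus_variants(callsets, min_support=2):
    """Return variants reported by at least `min_support` callers, in sorted order.

    `callsets` maps a caller name to an iterable of hashable variant keys.
    """
    support = Counter()
    for variants in callsets.values():
        support.update(set(variants))  # set() so each caller votes once per variant
    return sorted(v for v, n in support.items() if n >= min_support)


# Hypothetical toy data: three callers, keys are (contig, position, ref, alt).
callsets = {
    "caller_a": [("contig1", 120, "A", "T"), ("contig1", 560, "G", "C")],
    "caller_b": [("contig1", 120, "A", "T"), ("contig1", 900, "C", "A")],
    "caller_c": [("contig1", 120, "A", "T"), ("contig1", 560, "G", "C")],
}

for variant in consensus_variants(callsets, min_support=2):
    print(variant)
```

Raising `min_support` trades recall for precision: requiring all three callers here would keep only the variant at position 120.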
Affiliation(s)
- Philippe Charron
  - Ottawa Laboratory-Fallowfield, Canadian Food Inspection Agency, 3851 Fallowfield Road, Nepean, Ontario K2J 4S1, Canada
- Mingsong Kang
  - Ottawa Laboratory-Fallowfield, Canadian Food Inspection Agency, 3851 Fallowfield Road, Nepean, Ontario K2J 4S1, Canada
19
Groza C, Schwendinger-Schreck C, Cheung WA, Farrow EG, Thiffault I, Lake J, Rizzo WB, Evrony G, Curran T, Bourque G, Pastinen T. Pangenome graphs improve the analysis of structural variants in rare genetic diseases. Nat Commun 2024; 15:657. [PMID: 38253606 PMCID: PMC10803329 DOI: 10.1038/s41467-024-44980-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 01/10/2024] [Indexed: 01/24/2024] Open
Abstract
Rare DNA alterations that cause heritable diseases are only partially resolvable by clinical next-generation sequencing due to the difficulty of detecting structural variation (SV) in all genomic contexts. Long-read, high fidelity genome sequencing (HiFi-GS) detects SVs with increased sensitivity and enables assembling personal and graph genomes. We leverage standard reference genomes, public assemblies (n = 94) and a large collection of HiFi-GS data from a rare disease program (Genomic Answers for Kids, GA4K, n = 574 assemblies) to build a graph genome representing a unified SV callset in GA4K, identify common variation and prioritize SVs that are more likely to cause genetic disease (MAF < 0.01). Using graphs, we obtain a higher level of reproducibility than the standard reference approach. We observe over 200,000 SV alleles unique to GA4K, including nearly 1000 rare variants that impact coding sequence. With improved specificity for rare SVs, we isolate 30 candidate SVs in phenotypically prioritized genes, including known disease SVs. We isolate a novel diagnostic SV in KMT2E, demonstrating use of personal assemblies coupled with pangenome graphs for rare disease genomics. The community may interrogate our pangenome with additional assemblies to discover new SVs within the allele frequency spectrum relevant to genetic diseases.
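The rarity filter mentioned in this abstract (MAF < 0.01) can be illustrated with a small Python sketch. It computes the frequency of each SV allele from an allele count and keeps those below the threshold. For alleles this rare, the SV allele frequency and the minor allele frequency coincide. The SV identifiers, counts and cohort size are hypothetical and do not come from the GA4K data.

```python
def allele_frequency(allele_count, total_alleles):
    """Frequency of an allele among all genotyped chromosomes."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return allele_count / total_alleles


def rare_svs(sv_allele_counts, total_alleles, threshold=0.01):
    """Return IDs of SVs that were observed at least once and fall below `threshold`."""
    return [
        sv_id
        for sv_id, count in sv_allele_counts.items()
        if count > 0 and allele_frequency(count, total_alleles) < threshold
    ]


# Hypothetical toy cohort: 500 diploid samples, so 1000 chromosomes.
total_alleles = 1000
sv_allele_counts = {
    "DEL_0001": 3,    # 0.3%, rare
    "INS_0002": 250,  # 25%, common
    "DUP_0003": 20,   # 2%, above the threshold
    "DEL_0004": 1,    # 0.1%, rare singleton
}

print(rare_svs(sv_allele_counts, total_alleles))
```

In practice the frequencies would come from genotyping a population-scale callset, which is the role the pangenome graph plays in the study.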
Affiliation(s)
- Cristian Groza
  - Quantitative Life Sciences, McGill University, Montréal, QC, Canada
- Warren A Cheung
  - Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- Emily G Farrow
  - Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- Isabelle Thiffault
  - Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA
- William B Rizzo
  - Child Health Research Institute, Department of Pediatrics, Nebraska Medical Center, Omaha, NE, USA
- Gilad Evrony
  - Center for Human Genetics and Genomics, Department of Pediatrics, Neuroscience & Physiology, New York University Grossman School of Medicine, New York, NY, USA
- Tom Curran
  - Children's Mercy Research Institute, Kansas City, MO, USA
- Guillaume Bourque
  - Canadian Center for Computational Genomics, McGill University, Montréal, QC, Canada.
  - Department of Human Genetics, McGill University, Montréal, QC, Canada.
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan.
  - Victor Phillip Dahdaleh Institute of Genomic Medicine at McGill University, Montréal, QC, Canada.
- Tomi Pastinen
  - Genomic Medicine Center, Children's Mercy Hospital and Research Institute, KC, MO, USA.
20
Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges. Brief Bioinform 2024; 25:bbad508. [PMID: 38271481 PMCID: PMC10810331 DOI: 10.1093/bib/bbad508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/18/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Affiliation(s)
- Yury A Barbitoff
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
| | - Mikhail O Ushakov
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Tatyana E Lazareva
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Yulia A Nasykhova
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Andrey S Glotov
- Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
| | - Alexander V Predeus
- Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
|
21
|
Behera S, Catreux S, Rossi M, Truong S, Huang Z, Ruehle M, Visvanath A, Parnaby G, Roddey C, Onuchic V, Cameron DL, English A, Mehtalia S, Han J, Mehio R, Sedlazeck FJ. Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms. bioRxiv 2024:2024.01.02.573821. [PMID: 38260545 PMCID: PMC10802302 DOI: 10.1101/2024.01.02.573821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Research and medical genomics require comprehensive and scalable solutions to drive the discovery of novel disease targets, evolutionary drivers, and genetic markers with clinical significance. This necessitates a framework to identify all types of variants independent of their size (e.g., SNV/SV) or location (e.g., repeats). Here we present DRAGEN, which utilizes novel methods based on multigenomes, hardware acceleration, and machine learning-based variant detection to provide novel insights into individual genomes with ~30 min computation time (from raw reads to variant detection). DRAGEN outperforms all other state-of-the-art methods in speed and accuracy across all variant types (SNV, indel, STR, SV, CNV) and further incorporates specialized methods to obtain key insights in medically relevant genes (e.g., HLA, SMN, GBA). We showcase DRAGEN across 3,202 genomes and demonstrate its scalability, accuracy, and innovations to further advance the integration of comprehensive genomics for research and medical applications.
Affiliation(s)
- Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | | | | | | | | | | | | | | | | | | | - Adam English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | | | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, TX, USA
- Department of Computer Science, Rice University, TX, USA
|
22
|
English A, Dolzhenko E, Jam HZ, Mckenzie S, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Benchmarking of small and large variants across tandem repeats. bioRxiv 2023:2023.10.29.564632. [PMID: 37961319 PMCID: PMC10634962 DOI: 10.1101/2023.10.29.564632] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 11/15/2023]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ∼24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.
|
23
|
Yang C, Zhou Y, Song Y, Wu D, Zeng Y, Nie L, Liu P, Zhang S, Chen G, Xu J, Zhou H, Zhou L, Qian X, Liu C, Tan S, Zhou C, Dai W, Xu M, Qi Y, Wang X, Guo L, Fan G, Wang A, Deng Y, Zhang Y, Jin J, He Y, Guo C, Guo G, Zhou Q, Xu X, Yang H, Wang J, Xu S, Mao Y, Jin X, Ruan J, Zhang G. The complete and fully-phased diploid genome of a male Han Chinese. Cell Res 2023; 33:745-761. [PMID: 37452091 PMCID: PMC10542383 DOI: 10.1038/s41422-023-00849-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 05/06/2023] [Accepted: 06/29/2023] [Indexed: 07/18/2023] Open
Abstract
Since the release of the complete human genome, the priority of human genomic study has been shifting towards closing gaps in ethnic diversity. Here, we present a fully phased and well-annotated diploid human genome from a Han Chinese male individual (CN1), in which the assemblies of both haplotypes achieve the telomere-to-telomere (T2T) level. Comparison of this diploid genome with the CHM13 haploid T2T genome revealed significant variations in the centromere. Outside the centromere, we discovered 11,413 structural variations, including numerous novel ones. We also detected thousands of CN1 alleles that have accumulated high substitution rates and a few that have been under positive selection in the East Asian population. Further, we found that CN1 outperforms CHM13 as a reference genome in mapping and variant calling for the East Asian population owing to the distinct structural variants of the two references. Comparison of SNP calling for a large cohort of 8869 Chinese genomes using CN1 and CHM13 as references, respectively, showed that reference bias profoundly impacts rare SNP calling, with nearly 2 million rare SNPs miscalled depending on the reference genome. Finally, applying CN1 as a reference, we discovered 5.80 Mb and 4.21 Mb of putative introgression sequences from Neanderthal and Denisovan, respectively, including many East Asian-specific ones undetected using CHM13 as the reference. Our analyses reveal the advantages of using CN1 as a reference for population genomic and paleo-genomic studies. This complete genome will serve as an alternative reference for future genomic studies on the East Asian population.
Affiliation(s)
- Chentao Yang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yang Zhou
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI Research-Wuhan, BGI, Wuhan, Hubei, China
| | - Yanni Song
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Dongya Wu
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Institute of Crop Science & Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
| | - Yan Zeng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Lei Nie
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Guangji Chen
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Jinjin Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Hongling Zhou
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Long Zhou
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xiaobo Qian
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Chenlu Liu
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | | | | | - Wei Dai
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Mengyang Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yanwei Qi
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Xiaobo Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Lidong Guo
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Aijun Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yuan Deng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yong Zhang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Yunqiu He
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Chunxue Guo
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Hangzhou, Hangzhou, Zhejiang, China
| | - Guoji Guo
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Qing Zhou
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
- Human Phenome Institute, Zhangjiang Fudan International Innovation Center, and Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xin Jin
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China.
| | - Guojie Zhang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China.
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China.
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China.
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China.
|
24
|
Kolesnikov A, Cook D, Nattestad M, McNulty B, Gorzynski J, Goenka S, Ashley EA, Jain M, Miga KH, Paten B, Chang PC, Carroll A, Shafin K. Local read haplotagging enables accurate long-read small variant calling. bioRxiv 2023:2023.09.07.556731. [PMID: 37745389 PMCID: PMC10515762 DOI: 10.1101/2023.09.07.556731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance with multiple sequencing platforms, including the PacBio Revio system and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for long-read sequencing platforms.
Affiliation(s)
| | - Daniel Cook
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA
| | | | - Brandy McNulty
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
| | | | | | | | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
|
25
|
Aradhya S, Facio FM, Metz H, Manders T, Colavin A, Kobayashi Y, Nykamp K, Johnson B, Nussbaum RL. Applications of artificial intelligence in clinical laboratory genomics. Am J Med Genet C Semin Med Genet 2023; 193:e32057. [PMID: 37507620 DOI: 10.1002/ajmg.c.32057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 07/30/2023]
Abstract
The transition from analog to digital technologies in clinical laboratory genomics is ushering in an era of "big data" in ways that will exceed human capacity to rapidly and reproducibly analyze those data using conventional approaches. Accurately evaluating complex molecular data to facilitate timely diagnosis and management of genomic disorders will require supportive artificial intelligence methods. These are already being introduced into clinical laboratory genomics to identify variants in DNA sequencing data, predict the effects of DNA variants on protein structure and function to inform clinical interpretation of pathogenicity, link phenotype ontologies to genetic variants identified through exome or genome sequencing to help clinicians reach diagnostic answers faster, correlate genomic data with tumor staging and treatment approaches, utilize natural language processing to identify critical published medical literature during analysis of genomic data, and use interactive chatbots to identify individuals who qualify for genetic testing or to provide pre-test and post-test education. With careful and ethical development and validation of artificial intelligence for clinical laboratory genomics, these advances are expected to significantly enhance the abilities of geneticists to translate complex data into clearly synthesized information for clinicians to use in managing the care of their patients at scale.
Affiliation(s)
- Swaroop Aradhya
- Invitae Corporation, San Francisco, California, USA
- Adjunct Clinical Faculty, Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
| | | | - Hillery Metz
- Invitae Corporation, San Francisco, California, USA
| | - Toby Manders
- Invitae Corporation, San Francisco, California, USA
| | | | | | - Keith Nykamp
- Invitae Corporation, San Francisco, California, USA
| | | | - Robert L Nussbaum
- Invitae Corporation, San Francisco, California, USA
- Volunteer Faculty, School of Medicine, University of California San Francisco, San Francisco, California, USA
|
26
|
Yu H, Zheng Z, Su J, Lam TW, Luo R. Boosting variant-calling performance with multi-platform sequencing data using Clair3-MP. BMC Bioinformatics 2023; 24:308. [PMID: 37537536 PMCID: PMC10401749 DOI: 10.1186/s12859-023-05434-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 07/31/2023] [Indexed: 08/05/2023] Open
Abstract
BACKGROUND: With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially by integrating Oxford Nanopore and Illumina sequencing data. RESULTS: We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). We not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as large low-complexity regions and segmental and collapsed duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP. CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.
Affiliation(s)
- Huijing Yu
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Zhenxian Zheng
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
| | - Junhao Su
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
| | - Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China.
|
27
|
Wilton R, Szalay AS. Short-read aligner performance in germline variant identification. Bioinformatics 2023; 39:btad480. [PMID: 37527006 PMCID: PMC10421969 DOI: 10.1093/bioinformatics/btad480] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/07/2023] [Revised: 06/01/2023] [Accepted: 07/31/2023] [Indexed: 08/03/2023] Open
Abstract
MOTIVATION: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools. RESULTS: In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners (BWA-MEM, Bowtie 2, and Arioc) in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.
Affiliation(s)
- Richard Wilton
- Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
| | - Alexander S Szalay
- Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States
|