1. Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. bioRxiv 2024:2024.04.15.589602. [PMID: 38659907; PMCID: PMC11042298; DOI: 10.1101/2024.04.15.589602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
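The Mendelian Inheritance Error (MIE) rate reported in this abstract can be illustrated with a minimal sketch. It assumes diploid genotypes given as allele pairs and counts a site as an error when the offspring genotype cannot be formed by taking one allele from each parent. The function names are hypothetical, and the preprint's exact definition (for example, which sites are counted and how missing genotypes are handled) is not stated in the abstract.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child genotype can be built from one maternal
    and one paternal allele. Genotypes are pairs of allele codes, e.g. (0, 1)."""
    return any(
        sorted((m, f)) == sorted(child)
        for m, f in product(mother, father)
    )


def mie_rate(trio_sites):
    """Fraction of trio sites whose offspring genotype violates Mendelian
    inheritance. `trio_sites` is an iterable of (child, mother, father)."""
    sites = list(trio_sites)
    if not sites:
        raise ValueError("at least one trio site is required")
    errors = sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in sites
    )
    return errors / len(sites)


# Illustrative (made-up) genotypes: (child, mother, father)
example_sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # error: mother cannot pass allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 0), (0, 0)),  # error: neither parent carries allele 1
]
example_rate = mie_rate(example_sites)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the 0.03 percent figure quoted above.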
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA

2. Kosugi S, Terao C. Comparative evaluation of SNVs, indels, and structural variations detected with short- and long-read sequencing data. Hum Genome Var 2024; 11:18. [PMID: 38632226; PMCID: PMC11024196; DOI: 10.1038/s41439-024-00276-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/12/2024] [Accepted: 03/20/2024] [Indexed: 04/19/2024] Open
Abstract
Short- and long-read sequencing technologies are routinely used to detect DNA variants, including SNVs, indels, and structural variations (SVs). However, the differences in the quality and quantity of variants detected between short- and long-read data are not fully understood. In this study, we comprehensively evaluated the variant calling performance of short- and long-read-based SNV, indel, and SV detection algorithms (6 for SNVs, 12 for indels, and 13 for SVs) using a novel evaluation framework incorporating manual visual inspection. The results showed that indel-insertion calls greater than 10 bp were poorly detected by short-read-based detection algorithms compared to long-read-based algorithms; however, the recall and precision of SNV and indel-deletion detection were similar between short- and long-read data. The recall of SV detection with short-read-based algorithms was significantly lower in repetitive regions, especially for small- to intermediate-sized SVs, than that detected with long-read-based algorithms. In contrast, the recall and precision of SV detection in nonrepetitive regions were similar between short- and long-read data. These findings suggest the need for refined strategies, such as incorporating multiple variant detection algorithms, to generate a more complete set of variants using short-read data.
Affiliation(s)
- Shunichi Kosugi
  - Center for Genome Informatics, Research Organization of Information and Systems, Joint Support-Center for Data Science Research, Shizuoka, Japan
  - Advanced Genomics Center, National Institute of Genetics, Shizuoka, Japan
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
- Chikashi Terao
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
  - The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan

3. Connor R, Shakya M, Yarmosh DA, Maier W, Martin R, Bradford R, Brister JR, Chain PSG, Copeland CA, di Iulio J, Hu B, Ebert P, Gunti J, Jin Y, Katz KS, Kochergin A, LaRosa T, Li J, Li PE, Lo CC, Rashid S, Maiorova ES, Xiao C, Zalunin V, Purcell L, Pruitt KD. Recommendations for Uniform Variant Calling of SARS-CoV-2 Genome Sequence across Bioinformatic Workflows. Viruses 2024; 16:430. [PMID: 38543795; PMCID: PMC10975397; DOI: 10.3390/v16030430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 02/12/2024] [Accepted: 02/16/2024] [Indexed: 04/01/2024] Open
Abstract
Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence data as input. We compare and characterize these discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.
Affiliation(s)
- Ryan Connor
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Migun Shakya
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; (M.S.); (P.S.G.C.); (B.H.); (P.-E.L.); (C.-C.L.)
- David A. Yarmosh
  - American Type Culture Collection, Manassas, VA 20110, USA; (D.A.Y.); (R.B.); (S.R.)
  - BEI Resources, Manassas, VA 20110, USA
- Wolfgang Maier
  - Galaxy Europe Team, University of Freiburg, 79085 Freiburg, Germany
- Ross Martin
  - Clinical Virology Department, Gilead Sciences, Foster City, CA 94404, USA; (R.M.); (J.L.); (E.S.M.)
- Rebecca Bradford
  - American Type Culture Collection, Manassas, VA 20110, USA; (D.A.Y.); (R.B.); (S.R.)
  - BEI Resources, Manassas, VA 20110, USA
- J. Rodney Brister
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Patrick S. G. Chain
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; (M.S.); (P.S.G.C.); (B.H.); (P.-E.L.); (C.-C.L.)
- Julia di Iulio
  - Vir Biotechnology Inc., San Francisco, CA 94158, USA; (J.d.I.); (L.P.)
- Bin Hu
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; (M.S.); (P.S.G.C.); (B.H.); (P.-E.L.); (C.-C.L.)
- Philip Ebert
  - Eli Lilly and Company, Indianapolis, IN 46225, USA
- Jonathan Gunti
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Yumi Jin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Kenneth S. Katz
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Andrey Kochergin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Tré LaRosa
  - Deloitte Consulting LLP, Rosslyn, VA 22209, USA; (C.A.C.); (T.L.)
- Jiani Li
  - Clinical Virology Department, Gilead Sciences, Foster City, CA 94404, USA; (R.M.); (J.L.); (E.S.M.)
- Po-E Li
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; (M.S.); (P.S.G.C.); (B.H.); (P.-E.L.); (C.-C.L.)
- Chien-Chi Lo
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; (M.S.); (P.S.G.C.); (B.H.); (P.-E.L.); (C.-C.L.)
- Sujatha Rashid
  - American Type Culture Collection, Manassas, VA 20110, USA; (D.A.Y.); (R.B.); (S.R.)
- Evguenia S. Maiorova
  - Clinical Virology Department, Gilead Sciences, Foster City, CA 94404, USA; (R.M.); (J.L.); (E.S.M.)
- Chunlin Xiao
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Vadim Zalunin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)
- Lisa Purcell
  - Vir Biotechnology Inc., San Francisco, CA 94158, USA; (J.d.I.); (L.P.)
- Kim D. Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; (R.C.); (J.R.B.); (J.G.); (Y.J.); (K.S.K.); (A.K.); (C.X.); (V.Z.)

4. Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417; DOI: 10.1080/10408363.2023.2259466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco

5. Wang L, Yan X, Wu H, Wang F, Zhong Z, Zheng G, Xiao Q, Wu K, Na W. Selection Signal Analysis Reveals Hainan Yellow Cattle Are Being Selectively Bred for Heat Tolerance. Animals (Basel) 2024; 14:775. [PMID: 38473160; DOI: 10.3390/ani14050775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2024] [Revised: 02/24/2024] [Accepted: 02/27/2024] [Indexed: 03/14/2024] Open
Abstract
Hainan yellow cattle are indigenous Zebu cattle from southern China known for their tolerance of heat and strong resistance to disease. Generations of adaptation to the tropical environment of southern China and decades of artificial breeding have left identifiable selection signals in their genomic makeup. However, information on the selection signatures of Hainan yellow cattle is scarce. Herein, we compared the genomes of Hainan yellow cattle with those of Zebu, Qinchuan, Nanyang, and Yanbian cattle breeds using the composite likelihood ratio (CLR) method, Tajima's D, and runs of homozygosity (ROHs), each of which may provide evidence of the genes responsible for heat tolerance in Hainan yellow cattle. The CLR, Tajima's D, and ROH methods screened 5210, 1972, and 1290 single nucleotide polymorphisms (SNPs), respectively, and 453, 450, and 325 genes, respectively, were identified near these SNPs. These genes were significantly enriched in 65 Gene Ontology (GO) functional terms and 11 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (corrected p < 0.05). Five genes (adenosylhomocysteinase-like 2, DnaJ heat shock protein family (Hsp40) member C3, heat shock protein family A (Hsp70) member 1A, CD53 molecule, and zinc finger and BTB domain containing 12) were recognized as candidate genes associated with heat tolerance. With further functional verification of these genes, the results may improve understanding of the genetic mechanism of heat tolerance in Hainan yellow cattle and lay the foundation for subsequent studies on heat stress in this breed.
Affiliation(s)
- Liuhao Wang
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Xuehao Yan
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Hongfen Wu
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Feifan Wang
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Ziqi Zhong
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Gang Zheng
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Qian Xiao
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Kebang Wu
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China
- Wei Na
  - School of Tropical Agriculture and Forestry, Hainan University, Haikou 570228, China

6. Zhou Y, Kathiresan N, Yu Z, Rivera LF, Yang Y, Thimma M, Manickam K, Chebotarov D, Mauleon R, Chougule K, Wei S, Gao T, Green CD, Zuccolo A, Xie W, Ware D, Zhang J, McNally KL, Wing RA. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol 2024; 22:13. [PMID: 38273258; PMCID: PMC10809545; DOI: 10.1186/s12915-024-01820-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 01/09/2024] [Indexed: 01/27/2024] Open
Abstract
Background: Single-nucleotide polymorphisms (SNPs) are the most widely studied form of molecular genetic variation. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The Genome Analysis Toolkit (GATK) is one of the most widely used publicly available SNP calling tools, but high-performance computing versions of this tool have yet to become widely available and affordable. Results: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms, from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy, with results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). Conclusions: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable, as demonstrated for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.
Affiliation(s)
- Yong Zhou
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
  - Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA
- Nagarajan Kathiresan
  - KAUST Supercomputing Laboratory (KSL), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Zhichao Yu
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Luis F Rivera
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Yujian Yang
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Manjula Thimma
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Keerthana Manickam
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Dmytro Chebotarov
  - International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Ramil Mauleon
  - International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Kapeel Chougule
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Sharon Wei
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Tingting Gao
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Carl D Green
  - Information Technology Department, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Andrea Zuccolo
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
  - Crop Science Research Center (CSRC), Scuola Superiore Sant'Anna, Pisa, 56127, Italy
- Weibo Xie
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Doreen Ware
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
  - USDA ARS NEA Plant, Soil & Nutrition Laboratory Research Unit, Ithaca, NY, 14853, USA
- Jianwei Zhang
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Kenneth L McNally
  - International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Rod A Wing
  - Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
  - Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA
  - International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines

7. Kawai Y, Watanabe Y, Omae Y, Miyahara R, Khor SS, Noiri E, Kitajima K, Shimanuki H, Gatanaga H, Hata K, Hattori K, Iida A, Ishibashi-Ueda H, Kaname T, Kanto T, Matsumura R, Miyo K, Noguchi M, Ozaki K, Sugiyama M, Takahashi A, Tokuda H, Tomita T, Umezawa A, Watanabe H, Yoshida S, Goto YI, Maruoka Y, Matsubara Y, Niida S, Mizokami M, Tokunaga K. Exploring the genetic diversity of the Japanese population: Insights from a large-scale whole genome sequencing analysis. PLoS Genet 2023; 19:e1010625. [PMID: 38060463; PMCID: PMC10703243; DOI: 10.1371/journal.pgen.1010625] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 01/21/2023] [Accepted: 10/24/2023] [Indexed: 12/18/2023] Open
Abstract
The Japanese archipelago is a terminal location for human migration, and the contemporary Japanese people represent a unique population whose genomic diversity has been shaped by multiple migrations from Eurasia. We analyzed the genomic characteristics that define the genetic makeup of the modern Japanese population from a population genetics perspective from the genomic data of 9,287 samples obtained by high-coverage whole-genome sequencing (WGS) by the National Center Biobank Network. The dataset comprised populations from the Ryukyu Islands and other parts of the Japanese archipelago (Hondo). The Hondo population underwent two episodes of population decline during the Jomon period, corresponding to the Late Neolithic, and the Edo period, corresponding to the Early Modern era, while the Ryukyu population experienced a population decline during the shell midden period of the Late Neolithic in this region. Haplotype analysis suggested increased allele frequencies for genes related to alcohol and fatty acid metabolism, which were reported as loci that had experienced positive natural selection. Two genes related to alcohol metabolism were found to be 12,500 years out of phase with the time when they began to increase in the allele frequency; this finding indicates that the genomic diversity of Japanese people has been shaped by events closely related to agriculture and food production.
Affiliation(s)
- Yosuke Kawai
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Yusuke Watanabe
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Yosuke Omae
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan
- Reiko Miyahara
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan
- Seik-Soon Khor
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Eisei Noiri
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan
- Koji Kitajima
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan
  - Department of Data Science Center for Clinical Sciences, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Hideyuki Shimanuki
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan
  - Department of Data Science Center for Clinical Sciences, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Hiroyuki Gatanaga
  - AIDS Clinical Center, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Kenichiro Hata
  - Department of Maternal-Fetal Biology, National Center for Child Health and Development, Setagaya-ku, Tokyo, Japan
- Kotaro Hattori
  - Department of Bioresources, Medical Genome Center, National Center of Neurology and Psychiatry, Kodaira, Tokyo, Japan
- Aritoshi Iida
  - Department of Clinical Genome Analysis, Medical Genome Center, National Center of Neurology and Psychiatry, Kodaira, Tokyo, Japan
- Tadashi Kaname
  - Department of Genome Medicine, National Center for Child Health and Development, Setagaya-ku, Tokyo, Japan
- Tatsuya Kanto
  - Department of Liver Disease, Research Center for Hepatitis and Immunology, National Center for Global Health and Medicine, Ichikawa, Chiba, Japan
- Ryo Matsumura
  - Department of Bioresources, Medical Genome Center, National Center of Neurology and Psychiatry, Kodaira, Tokyo, Japan
- Kengo Miyo
  - Center for Medical Informatics Intelligence, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Michio Noguchi
  - NCVC Biobank, National Cerebral and Cardiovascular Center, Suita, Osaka, Japan
- Kouichi Ozaki
  - Medical Genome Center, Research Institute, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
  - RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Masaya Sugiyama
  - Department of Viral Pathogenesis and Controls, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Ayako Takahashi
  - NCVC Biobank, National Cerebral and Cardiovascular Center, Suita, Osaka, Japan
- Haruhiko Tokuda
  - Core Facility Administration, Research Institute, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
  - Department of Metabolic Research, Research Institute, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
  - Department of Clinical Laboratory, Hospital, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
- Tsutomu Tomita
  - NCVC Biobank, National Cerebral and Cardiovascular Center, Suita, Osaka, Japan
- Akihiro Umezawa
  - Center for Regenerative Medicine, Research Institute, National Center for Child Health and Development, Setagaya-ku, Tokyo, Japan
- Hiroshi Watanabe
  - Core Facility Administration, Research Institute, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
  - Innovation Center for Translational Research, Hospital, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
- Sumiko Yoshida
  - Department of Bioresources, Medical Genome Center, National Center of Neurology and Psychiatry, Kodaira, Tokyo, Japan
- Yu-ichi Goto
  - Medical Genome Center, National Center of Neurology and Psychiatry, Kodaira, Tokyo, Japan
- Yutaka Maruoka
  - Department of Oral and Maxillofacial Surgery, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
- Yoichi Matsubara
  - National Center for Child Health and Development, Setagaya-ku, Tokyo, Japan
- Shumpei Niida
  - Core Facility Administration, Research Institute, National Center for Geriatrics and Gerontology, Obu, Aichi, Japan
- Masashi Mizokami
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Ichikawa, Chiba, Japan
- Katsushi Tokunaga
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Shinjuku-ku, Tokyo, Japan
  - Central Biobank, National Center Biobank Network, Shinjuku-ku, Tokyo, Japan

8. Rioux B, Chong M, Walker R, McGlasson S, Rannikmäe K, McCartney D, McCabe J, Brown R, Crow YJ, Hunt D, Whiteley W. Phenotypes associated with genetic determinants of type I interferon regulation in the UK Biobank: a protocol. Wellcome Open Res 2023; 8:550. [PMID: 38855722; PMCID: PMC11162527; DOI: 10.12688/wellcomeopenres.20385.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 06/11/2024] Open
Abstract
Background: Type I interferons are cytokines involved in innate immunity against viruses. Genetic disorders of type I interferon regulation are associated with a range of autoimmune and cerebrovascular phenotypes. Carriers of pathogenic variants involved in genetic disorders of type I interferons are generally considered asymptomatic. Preliminary data suggest, however, that genetically determined dysregulation of type I interferon responses is associated with autoimmunity, and may also be relevant to sporadic cerebrovascular disease and dementia. We aim to determine whether functional variants in genes involved in type I interferon regulation and signalling are associated with the risk of autoimmunity, stroke, and dementia in a population cohort. Methods: We will perform a hypothesis-driven candidate pathway association study of type I interferon-related genes using rare variants in the UK Biobank (UKB). We will manually curate type I interferon regulation and signalling genes from a literature review and Gene Ontology, followed by clinical and functional filtering. Variants of interest will be included based on pre-defined clinical relevance and functional annotations (using LOFTEE, M-CAP and a minor allele frequency <0.1%). The association of variants with 15 clinical and three neuroradiological phenotypes will be assessed with a rare variant genetic risk score and gene-level tests, using a Bonferroni-corrected p-value threshold based on the number of genetic units and phenotypes tested. We will explore the association of significant genetic units with 196 additional health-related outcomes to help interpret their relevance and explore the clinical spectrum of genetic perturbations of type I interferon. Ethics and dissemination: The UKB has received ethical approval from the North West Multicentre Research Ethics Committee, and all participants provided written informed consent at recruitment. This research will be conducted using the UKB Resource under application number 93160. We expect to disseminate our results in a peer-reviewed journal and at an international cardiovascular conference.
Affiliation(s)
- Bastien Rioux
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland, UK
| | - Michael Chong
- Population Health Research Institute, McMaster University, Hamilton, Ontario, Canada
- Thrombosis and Atherosclerosis Research Institute, McMaster University, Hamilton, Ontario, Canada
- Department of Pathology and Molecular Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Rosie Walker
- Department of Psychology, University of Exeter, Exeter, England, UK
| | - Sarah McGlasson
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland, UK
| | - Kristiina Rannikmäe
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, Scotland, UK
| | - Daniel McCartney
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, Scotland, UK
| | - John McCabe
- School of Medicine, University College Dublin, Dublin, Leinster, Ireland
- Department of Medicine for the Elderly, Mater Misericordiae University Hospital, Dublin, Ireland
| | - Robin Brown
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, England, UK
| | - Yanick J. Crow
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, Scotland, UK
- Laboratory of Neurogenetics and Neuroinflammation, Institut Imagine, Université de Paris, Paris, France
| | - David Hunt
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland, UK
| | - William Whiteley
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland, UK
- MRC Population Health Unit, Nuffield Department of Population Health, University of Oxford, Oxford, England, UK
|
9
|
Zhang YJ, Luo Z, Sun Y, Liu J, Chen Z. From beasts to bytes: Revolutionizing zoological research with artificial intelligence. Zool Res 2023; 44:1115-1131. [PMID: 37933101 PMCID: PMC10802096 DOI: 10.24272/j.issn.2095-8137.2023.263] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/21/2023] [Accepted: 10/30/2023] [Indexed: 11/08/2023]
Abstract
Since the late 2010s, Artificial Intelligence (AI) including machine learning, boosted through deep learning, has boomed as a vital tool to leverage computer vision, natural language processing and speech recognition in revolutionizing zoological research. This review provides an overview of the primary tasks, core models, datasets, and applications of AI in zoological research, including animal classification, resource conservation, behavior, development, genetics and evolution, breeding and health, disease models, and paleontology. Additionally, we explore the challenges and future directions of integrating AI into this field. Based on numerous case studies, this review outlines various avenues for incorporating AI into zoological research and underscores its potential to enhance our understanding of the intricate relationships that exist within the animal kingdom. As we build a bridge between beast and byte realms, this review serves as a resource for envisioning novel AI applications in zoological research that have not yet been explored.
Affiliation(s)
- Yu-Juan Zhang
- Chongqing Key Laboratory of Vector Insects
- Chongqing Key Laboratory of Animal Biology
- College of Life Science, Chongqing Normal University, Chongqing 401331, China
| | - Zeyu Luo
- Chongqing Key Laboratory of Vector Insects
- Chongqing Key Laboratory of Animal Biology
- College of Life Science, Chongqing Normal University, Chongqing 401331, China
| | - Yawen Sun
- Chongqing Key Laboratory of Vector Insects
- Chongqing Key Laboratory of Animal Biology
- College of Life Science, Chongqing Normal University, Chongqing 401331, China
| | - Junhao Liu
- Chongqing Key Laboratory of Vector Insects
- Chongqing Key Laboratory of Animal Biology
- College of Life Science, Chongqing Normal University, Chongqing 401331, China
| | - Zongqing Chen
- School of Mathematical Sciences
- National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing 401331, China. E-mail:
|
10
|
Steyaert W, Haer-Wigman L, Pfundt R, Hellebrekers D, Steehouwer M, Hampstead J, de Boer E, Stegmann A, Yntema H, Kamsteeg EJ, Brunner H, Hoischen A, Gilissen C. Systematic analysis of paralogous regions in 41,755 exomes uncovers clinically relevant variation. Nat Commun 2023; 14:6845. [PMID: 37891200 PMCID: PMC10611741 DOI: 10.1038/s41467-023-42531-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/22/2022] [Accepted: 10/13/2023] [Indexed: 10/29/2023]
Abstract
The short lengths of short-read sequencing reads challenge the analysis of paralogous genomic regions in exome and genome sequencing data. Most genetic variants within these homologous regions therefore remain unidentified in standard analyses. Here, we present a method (Chameleolyser) that accurately identifies single nucleotide variants and small insertions/deletions (SNVs/Indels), copy number variants and ectopic gene conversion events in duplicated genomic regions using whole-exome sequencing data. Application to a cohort of 41,755 exome samples yields 20,432 rare homozygous deletions and 2,529,791 rare SNVs/Indels, of which we show that 338,084 are due to gene conversion events. None of the SNVs/Indels are detectable using regular analysis techniques. Validation by high-fidelity long-read sequencing in 20 samples confirms >88% of called variants. Focusing on variation in known disease genes leads to a direct molecular diagnosis in 25 previously undiagnosed patients. Our method can readily be applied to existing exome data.
Affiliation(s)
- Wouter Steyaert
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
- Radboud Institute for Molecular Life Sciences, Nijmegen, Netherlands
| | - Lonneke Haer-Wigman
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Rolph Pfundt
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Debby Hellebrekers
- Maastricht University Medical Center + , Department of Clinical Genetics, Maastricht, Netherlands
| | - Marloes Steehouwer
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Juliet Hampstead
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Elke de Boer
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
- Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
| | - Alexander Stegmann
- Maastricht University Medical Center + , Department of Clinical Genetics, Maastricht, Netherlands
| | - Helger Yntema
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Erik-Jan Kamsteeg
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
| | - Han Brunner
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
- Maastricht University Medical Center + , Department of Clinical Genetics, Maastricht, Netherlands
| | - Alexander Hoischen
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands
- Radboud Institute for Molecular Life Sciences, Nijmegen, Netherlands
- Radboud University Medical Center, Department of Internal Medicine and Radboud Center for Infectious Diseases (RCI), Nijmegen, Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Geert Grooteplein 10, 6525, GA, Nijmegen, The Netherlands.
- Radboud Institute for Molecular Life Sciences, Nijmegen, Netherlands.
|
11
|
Guhlin J, Le Lec MF, Wold J, Koot E, Winter D, Biggs PJ, Galla SJ, Urban L, Foster Y, Cox MP, Digby A, Uddstrom LR, Eason D, Vercoe D, Davis T, Howard JT, Jarvis ED, Robertson FE, Robertson BC, Gemmell NJ, Steeves TE, Santure AW, Dearden PK. Species-wide genomics of kākāpō provides tools to accelerate recovery. Nat Ecol Evol 2023; 7:1693-1705. [PMID: 37640765 DOI: 10.1038/s41559-023-02165-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/26/2022] [Accepted: 07/11/2023] [Indexed: 08/31/2023]
Abstract
The kākāpō is a critically endangered, intensively managed, long-lived nocturnal parrot endemic to Aotearoa New Zealand. We generated and analysed whole-genome sequence data for nearly all individuals living in early 2018 (169 individuals) to generate a high-quality species-wide genetic variant callset. We leverage extensive long-term metadata to quantify genome-wide diversity of the species over time and present new approaches using probabilistic programming, combined with a phenotype dataset spanning five decades, to disentangle phenotypic variance into environmental and genetic effects while quantifying uncertainty in small populations. We find associations for growth, disease susceptibility, clutch size and egg fertility within genic regions previously shown to influence these traits in other species. Finally, we generate breeding values to predict phenotype and illustrate that active management over the past 45 years has maintained both genome-wide diversity and diversity in breeding values and, hence, evolutionary potential. We provide new pathways for informing future conservation management decisions for kākāpō, including prioritizing individuals for translocation and monitoring individuals with poor growth or high disease risk. Overall, by explicitly addressing the challenge of the small sample size, we provide a template for the inclusion of genomic data that will be transformational for species recovery efforts around the globe.
Affiliation(s)
- Joseph Guhlin
- Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
| | - Marissa F Le Lec
- Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
| | - Jana Wold
- School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
| | - Emily Koot
- The New Zealand Institute for Plant and Food Research Ltd, Palmerston North, Aotearoa New Zealand
| | - David Winter
- School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
| | - Patrick J Biggs
- School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
- School of Veterinary Science, Massey University, Palmerston North, Aotearoa New Zealand
| | - Stephanie J Galla
- School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
- Department of Biological Sciences, Boise State University, Boise, ID, USA
| | - Lara Urban
- Department of Anatomy, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
- Helmholtz Pioneer Campus, Helmholtz Zentrum Muenchen, Neuherberg, Germany
- Helmholtz AI, Helmholtz Zentrum Muenchen, Neuherberg, Germany
- School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Yasmin Foster
- Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
| | - Murray P Cox
- School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
- Department of Statistics, University of Auckland, Auckland, Aotearoa New Zealand
| | - Andrew Digby
- Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
| | - Lydia R Uddstrom
- Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
| | - Daryl Eason
- Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
| | - Deidre Vercoe
- Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
| | - Tāne Davis
- Rakiura Tītī Islands Administering Body, Invercargill, Aotearoa New Zealand
| | - Jason T Howard
- Neurogenetics of Language Lab, The Rockefeller University, New York, NY, USA
- Mirxes, Cambridge, MA, USA
| | - Erich D Jarvis
- The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Fiona E Robertson
- Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
| | - Bruce C Robertson
- Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
| | - Neil J Gemmell
- Department of Anatomy, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
| | - Tammy E Steeves
- School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
| | - Anna W Santure
- School of Biological Sciences, University of Auckland, Auckland, Aotearoa New Zealand
| | - Peter K Dearden
- Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand.
|
12
|
Xu X, Chen B, Zhang J, Lan S, Wu S. Whole-genome resequencing analysis of the medicinal plant Gardenia jasminoides. PeerJ 2023; 11:e16056. [PMID: 37744244 PMCID: PMC10512932 DOI: 10.7717/peerj.16056] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/27/2023] [Accepted: 08/17/2023] [Indexed: 09/26/2023]
Abstract
Background Gardenia jasminoides is a Chinese medicinal plant with high medicinal and economic value and rich genetic diversity, but its genetic diversity has not been studied sufficiently. Methods In this study, one wild and one cultivated gardenia sample were resequenced on the Illumina HiSeq platform, and the data were evaluated to characterize the genome of G. jasminoides. Results The sequencing yielded 11.77G of clean data, with Q30 reaching 90.96%. The average alignment rate between the samples and the reference genome was 96.08%, the average coverage depth was 15X, and the genome coverage was 85.93%. SNP calling identified 3,087,176 and 3,241,416 SNPs in FD and YP1, respectively. In addition, non-synonymous SNPs, InDels, SVs and CNVs were detected between the samples and the reference genome, and genes carrying DNA-level variation were annotated against the KEGG, GO and COG databases. Structural gene variation in the biosynthetic pathways of crocin and gardenia, the main medicinal substances of G. jasminoides, was further explored, providing basic data for future molecular breeding and genetic diversity studies of G. jasminoides.
Affiliation(s)
- Xinyu Xu
- Fujian Academy of Forestry Sciences, Fuzhou, Fujian, China
- College of Landscape and Architecture, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
| | - Bihua Chen
- Fujian Academy of Forestry Sciences, Fuzhou, Fujian, China
| | - Juan Zhang
- Fujian Academy of Forestry Sciences, Fuzhou, Fujian, China
| | - Siren Lan
- College of Landscape and Architecture, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
| | - Shasha Wu
- College of Landscape and Architecture, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
|
13
|
Nakamichi K, Van Gelder RN, Chao JR, Mustafi D. Targeted adaptive long-read sequencing for discovery of complex phased variants in inherited retinal disease patients. Sci Rep 2023; 13:8535. [PMID: 37237007 PMCID: PMC10219926 DOI: 10.1038/s41598-023-35791-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/31/2023] [Accepted: 05/24/2023] [Indexed: 05/28/2023]
Abstract
Inherited retinal degenerations (IRDs) are a heterogeneous group of predominantly monogenic disorders with over 300 causative genes identified. Short-read exome sequencing is commonly used to genotypically diagnose patients with clinical features of IRDs, however, in up to 30% of patients with autosomal recessive IRDs, one or no disease-causing variants are identified. Furthermore, chromosomal maps cannot be reconstructed for allelic variant discovery with short-reads. Long-read genome sequencing can provide complete coverage of disease loci and a targeted approach can focus sequencing bandwidth to a genomic region of interest to provide increased depth and haplotype reconstruction to uncover cases of missing heritability. We demonstrate that targeted adaptive long-read sequencing on the Oxford Nanopore Technologies (ONT) platform of the USH2A gene from three probands in a family with the most common cause of the syndromic IRD, Usher Syndrome, resulted in greater than 12-fold target gene sequencing enrichment on average. This focused depth of sequencing allowed for haplotype reconstruction and phased variant identification. We further show that variants obtained from the haplotype-aware genotyping pipeline can be heuristically ranked to focus on potential pathogenic candidates without a priori knowledge of the disease-causing variants. Moreover, consideration of the variants unique to targeted long-read sequencing that are not covered by short-read technology demonstrated higher precision and F1 scores for variant discovery by long-read sequencing. This work establishes that targeted adaptive long-read sequencing can generate targeted, chromosome-phased data sets for identification of coding and non-coding disease-causing alleles in IRDs and can be applicable to other Mendelian diseases.
Affiliation(s)
- Kenji Nakamichi
- Department of Ophthalmology, Roger and Karalis Johnson Retina Center, University of Washington, Seattle, WA, 98109, USA
| | - Russell N Van Gelder
- Department of Ophthalmology, Roger and Karalis Johnson Retina Center, University of Washington, Seattle, WA, 98109, USA
| | - Jennifer R Chao
- Department of Ophthalmology, Roger and Karalis Johnson Retina Center, University of Washington, Seattle, WA, 98109, USA
| | - Debarshi Mustafi
- Department of Ophthalmology, Roger and Karalis Johnson Retina Center, University of Washington, Seattle, WA, 98109, USA.
- Brotman Baty Institute for Precision Medicine, Seattle, WA, 98195, USA.
- Division of Ophthalmology, Seattle Children's Hospital, Seattle, WA, 98105, USA.
|
14
|
Lloret-Villas A, Pausch H, Leonard AS. The size and composition of haplotype reference panels impact the accuracy of imputation from low-pass sequencing in cattle. Genet Sel Evol 2023; 55:33. [PMID: 37170101 PMCID: PMC10173671 DOI: 10.1186/s12711-023-00809-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/26/2023] [Accepted: 05/02/2023] [Indexed: 05/13/2023]
Abstract
BACKGROUND Low-pass sequencing followed by sequence variant genotype imputation is an alternative to the routine microarray-based genotyping in cattle. However, the impact of haplotype reference panels and their interplay with the coverage of low-pass whole-genome sequencing data have not been sufficiently explored in typical livestock settings where only a small number of reference samples is available. METHODS Sequence variant genotyping accuracy was compared between two variant callers, GATK and DeepVariant, in 50 Brown Swiss cattle with sequencing coverages ranging from 4- to 63-fold. Haplotype reference panels of varying sizes and composition were built with DeepVariant based on 501 individuals from nine breeds. High-coverage sequence data for 24 Brown Swiss cattle were downsampled to between 0.01- and 4-fold to mimic low-pass sequencing. GLIMPSE was used to infer sequence variant genotypes from the low-pass sequencing data using different haplotype reference panels. The accuracy of the sequence variant genotypes that were inferred from low-pass sequencing data was compared with sequence variant genotypes called from high-coverage data. RESULTS DeepVariant was used to establish bovine haplotype reference panels because it outperformed GATK in all evaluations. Within-breed haplotype reference panels were more accurate and efficient to impute sequence variant genotypes from low-pass sequencing than equally-sized multibreed haplotype reference panels for all target sample coverages and allele frequencies. F1 scores greater than 0.9, which indicate high harmonic means of recall and precision of called genotypes, were achieved with 0.25-fold sequencing coverage when large breed-specific haplotype reference panels (n = 150) were used. 
In absence of such large within-breed haplotype panels, variant genotyping accuracy from low-pass sequencing could be increased either by adding non-related samples to the haplotype reference panel or by increasing the coverage of the low-pass sequencing data. Sequence variant genotyping from low-pass sequencing was substantially less accurate when the reference panel lacked individuals from the target breed. CONCLUSIONS Variant genotyping is more accurate with DeepVariant than GATK. DeepVariant is therefore suitable to establish bovine haplotype reference panels. Medium-sized breed-specific haplotype reference panels and large multibreed haplotype reference panels enable accurate imputation of low-pass sequencing data in a typical cattle breed.
Affiliation(s)
| | - Hubert Pausch
- Animal Genomics, ETH Zürich, Universitätstrasse 2, Zürich, 8092, Switzerland
| | - Alexander S Leonard
- Animal Genomics, ETH Zürich, Universitätstrasse 2, Zürich, 8092, Switzerland
|
15
|
Harvey WT, Ebert P, Ebler J, Audano PA, Munson KM, Hoekzema K, Porubsky D, Beck CR, Marschall T, Garimella K, Eichler EE. Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall. bioRxiv 2023:2023.05.04.539448. [PMID: 37205567 PMCID: PMC10187267 DOI: 10.1101/2023.05.04.539448] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 05/21/2023]
Abstract
Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
Affiliation(s)
- William T. Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Peter A. Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Katherine M. Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Christine R. Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT 06032 USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Kiran Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Evan E. Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
|
16
|
Ruperao P, Gandham P, Odeny DA, Mayes S, Selvanayagam S, Thirunavukkarasu N, Das RR, Srikanda M, Gandhi H, Habyarimana E, Manyasa E, Nebie B, Deshpande SP, Rathore A. Exploring the sorghum race level diversity utilizing 272 sorghum accessions genomic resources. Front Plant Sci 2023; 14:1143512. [PMID: 37008459 PMCID: PMC10063887 DOI: 10.3389/fpls.2023.1143512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 02/22/2023] [Indexed: 06/19/2023]
Abstract
Due to evolutionary divergence, sorghum race populations exhibit significant genetic and morphological variation. A k-mer-based sorghum race sequence comparison identified the conserved k-mers of all 272 accessions from sorghum and the race-specific genetic signatures identified the gene variability in 10,321 genes (PAVs). To understand sorghum race structure, diversity and domestication, a deep learning-based variant calling approach was employed in a set of genotypic data derived from a diverse panel of 272 sorghum accessions. The data resulted in 1.7 million high-quality genome-wide SNPs and identified selective signature (both positive and negative) regions through a genome-wide scan with different (iHS and XP-EHH) statistical methods. We discovered 2,370 genes associated with selection signatures including 179 selective sweep regions distributed over 10 chromosomes. Co-localization of these regions undergoing selective pressure with previously reported QTLs and genes revealed that the signatures of selection could be related to the domestication of important agronomic traits such as biomass and plant height. The developed k-mer signatures will be useful in the future to identify the sorghum race and for trait and SNP markers for assisting in plant breeding programs.
Affiliation(s)
- Pradeep Ruperao
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | - Prasad Gandham
- School of Plant, Environmental and Soil Sciences, Louisiana State University Agricultural Center, LA, United States
| | - Damaris A. Odeny
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | - Sean Mayes
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | | | - Nepolean Thirunavukkarasu
- Genomics and Molecular Breeding Lab, Indian Council of Agricultural Research (ICAR) - Indian Institute of Millets Research, Hyderabad, India
| | - Roma R. Das
- International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
| | - Manasa Srikanda
- Department of Statistics, Osmania University, Hyderabad, India
| | - Harish Gandhi
- International Maize and Wheat Improvement Center (CIMMYT), Nairobi, Kenya
| | - Ephrem Habyarimana
- International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
| | - Eric Manyasa
- Sorghum Breeding Program, International Crops Research Institute for the Semi-Arid Tropics, Nairobi, Kenya
| | - Baloua Nebie
- International Maize and Wheat Improvement Center (CIMMYT), Dakar, Senegal
| | | | - Abhishek Rathore
- Excellence in Breeding, International Maize and Wheat Improvement Center (CIMMYT), Hyderabad, India
|
17
|
Population Structure and Genetic Diversity Analysis of “Yufen 1” H Line Chickens Using Whole-Genome Resequencing. Life (Basel) 2023; 13:793. [PMID: 36983948 PMCID: PMC10059704 DOI: 10.3390/life13030793] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/07/2023] [Revised: 03/13/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
The effective protection and utilization of poultry resources depend on an accurate understanding of the genetic diversity and population structure. The breeding of the specialized poultry lineage “Yufen 1”, with its defined characteristics, was approved by the China Poultry Genetic Resource Committee in 2015. Thus, to investigate the relationship between the progenitor H line and other poultry breeds, the genetic diversity and population structure of “Yufen 1” H line (YF) were investigated and compared with those of 2 commercial chicken breeds, the ancestor breed Red Jungle Fowls, and 11 Chinese indigenous chicken breeds based on a whole-genome resequencing approach using 8,112,424 SNPs. The genetic diversity of YF was low, and the rate of linkage disequilibrium decay was significantly slower than that of the other Chinese indigenous breeds. In addition, it was shown that the YF population was strongly selected during intensive breeding and that genetic resources have been seriously threatened, which highlights the need to establish a systematic conservation strategy as well as utilization techniques to maintain genetic diversity within YF. Moreover, a principal component analysis, a neighbor-joining tree analysis, a structure analysis, and genetic differentiation indices indicated that YF harbors a distinctive genetic resource with a unique genetic structure separate from that of Chinese indigenous breeds at the genome level. The findings provide a valuable resource and the theoretical basis for the further conservation and utilization of YF.
|
18
|
Betschart RO, Thiéry A, Aguilera-Garcia D, Zoche M, Moch H, Twerenbold R, Zeller T, Blankenberg S, Ziegler A. Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment. Sci Rep 2022; 12:21502. [PMID: 36513709 PMCID: PMC9748128 DOI: 10.1038/s41598-022-26181-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/06/2022] [Accepted: 12/12/2022] [Indexed: 12/14/2022] Open
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled the conduct of whole genome sequencing (WGS) studies, and several bioinformatics pipelines have become available. The aim of this study was the comparison of 6 WGS data pre-processing pipelines, involving two mapping and alignment approaches (GATK utilizing BWA-MEM2 2.2.1, and DRAGEN 3.8.4) and three variant calling pipelines (GATK 4.2.4.1, DRAGEN 3.8.4 and DeepVariant 1.1.0). We sequenced one Genome in a Bottle (GIAB) sample 70 times in different runs, and one GIAB trio in triplicate. The truth set of the GIAB samples was used for comparison, and performance was assessed by computation time, F1 score, precision, and recall. In the mapping and alignment step, the DRAGEN pipeline was faster than the GATK with BWA-MEM2 pipeline. DRAGEN showed systematically higher F1 score, precision, and recall values than GATK for single nucleotide variations (SNVs) and Indels in simple-to-map, complex-to-map, coding and non-coding regions. In the variant calling step, DRAGEN was fastest. In terms of accuracy, DRAGEN and DeepVariant performed similarly and both were superior to GATK, with slight advantages for DRAGEN for Indels and for DeepVariant for SNVs. The DRAGEN pipeline showed the lowest Mendelian inheritance error fraction for the GIAB trios. Mapping and alignment played a key role in variant calling of WGS, with DRAGEN outperforming GATK.
Affiliation(s)
- Raphael O Betschart
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265, Davos Wolfgang, Switzerland
- Alexandre Thiéry
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265, Davos Wolfgang, Switzerland
- Domingo Aguilera-Garcia
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091, Zurich, Switzerland
- Martin Zoche
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091, Zurich, Switzerland
- Holger Moch
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091, Zurich, Switzerland
- Raphael Twerenbold
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Stefan Blankenberg
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265, Davos Wolfgang, Switzerland
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Andreas Ziegler
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265, Davos Wolfgang, Switzerland
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251, Hamburg, Germany
  - School Mathematics, Statistics and Computer Science, Scottsville, Private Bag X01, Pietermaritzburg, 3209, South Africa
|
19
|
Li J, Wang T, Liu W, Yin D, Lai Z, Zhang G, Zhang K, Ji J, Yin S. A high-quality chromosome-level genome assembly of Pelteobagrus vachelli provides insights into its environmental adaptation and population history. Front Genet 2022; 13:1050192. [DOI: 10.3389/fgene.2022.1050192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 11/01/2022] [Indexed: 11/16/2022] Open
Abstract
Pelteobagrus vachelli is a freshwater fish with high economic value, but the lack of genome resources has severely restricted its industrial development and population conservation. Here, we constructed the first chromosome-level genome assembly of P. vachelli, with a total length of approximately 662.13 Mb and a contig N50 of 14.02 Mb; scaffolds covering 99.79% of the assembly were anchored to 26 chromosomes. Combining the comparative genome results with transcriptome data under environmental stress (high temperature, hypoxia and Edwardsiella ictaluri infection), we found that the MAPK signaling pathway, PI3K-Akt signaling pathway and apelin signaling pathway play an important role in the environmental adaptation of P. vachelli, and that these pathways are interconnected by the ErbB family and involved in cell proliferation, differentiation and apoptosis. Population evolution analysis showed that artificial interventions have affected wild populations of P. vachelli. This study provides useful genomic information for the genetic breeding of P. vachelli, as well as references for further studies on fish biology and evolution.
|
20
|
Connor R, Yarmosh DA, Maier W, Shakya M, Martin R, Bradford R, Brister JR, Chain PS, Copeland CA, di Iulio J, Hu B, Ebert P, Gunti J, Jin Y, Katz KS, Kochergin A, LaRosa T, Li J, Li PE, Lo CC, Rashid S, Maiorova ES, Xiao C, Zalunin V, Pruitt KD. Towards increased accuracy and reproducibility in SARS-CoV-2 next generation sequence analysis for public health surveillance. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2022.11.03.515010. [PMID: 36380755 PMCID: PMC9645426 DOI: 10.1101/2022.11.03.515010] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/16/2023]
Abstract
During the COVID-19 pandemic, SARS-CoV-2 surveillance efforts integrated genome sequencing of clinical samples to identify emergent viral variants and to support rapid experimental examination of genome-informed vaccine and therapeutic designs. Given the broad range of methods applied to generate new viral genomes, it is critical that consensus and variant calling tools yield consistent results across disparate pipelines. Here we examine the impact of sequencing technologies (Illumina and Oxford Nanopore) and 7 different downstream bioinformatic protocols on SARS-CoV-2 variant calling as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative, a public-private partnership established to address the COVID-19 outbreak. Our results indicate that bioinformatic workflows can yield consensus genomes with different single nucleotide polymorphisms, insertions, and/or deletions even when using the same raw sequence input datasets. We introduce the use of a specific suite of parameters and protocols that greatly improves the agreement among pipelines developed by diverse organizations. Such consistency among bioinformatic pipelines is fundamental to SARS-CoV-2 and future pathogen surveillance efforts. The application of analysis standards is necessary to more accurately document phylogenomic trends and support data-driven public health responses.
Affiliation(s)
- Ryan Connor
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- David A Yarmosh
  - American Type Culture Collection, 10807 University Blvd, Manassas, VA 20110, USA
  - BEI Resources
- Wolfgang Maier
  - Galaxy Europe Team, University of Freiburg, Freiburg, Germany
- Migun Shakya
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
- Ross Martin
  - Clinical Virology Department, Gilead Sciences, 333 Lakeside Dr, Foster City, CA 94404, USA
- Rebecca Bradford
  - American Type Culture Collection, 10807 University Blvd, Manassas, VA 20110, USA
  - BEI Resources
- J Rodney Brister
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Patrick Sg Chain
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
- Courtney A Copeland
  - Deloitte Consulting LLP, 1919 North Lynn St, Suite 1500, Rosslyn, VA 22209 USA
- Bin Hu
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
- Jonathan Gunti
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Yumi Jin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Kenneth S Katz
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Andrey Kochergin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Tré LaRosa
  - Deloitte Consulting LLP, 1919 North Lynn St, Suite 1500, Rosslyn, VA 22209 USA
- Jiani Li
  - Clinical Virology Department, Gilead Sciences, 333 Lakeside Dr, Foster City, CA 94404, USA
- Po-E Li
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
- Chien-Chi Lo
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
- Sujatha Rashid
  - American Type Culture Collection, 10807 University Blvd, Manassas, VA 20110, USA
- Evguenia S Maiorova
  - Clinical Virology Department, Gilead Sciences, 333 Lakeside Dr, Foster City, CA 94404, USA
- Chunlin Xiao
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Vadim Zalunin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Kim D Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|