1
Sasso S, Saag L, Spros R, Beneker O, Molinaro L, Biagini SA, Lehouck A, Van De Vijver K, Hui R, D'Atanasio E, Kushniarevich A, Kabral H, Metspalu E, Guellil M, Ali MQA, Geypen J, Hoebreckx M, Berk B, De Winter N, Driesen P, Pijpelink A, Van Damme P, Scheib CL, Deschepper E, Deckers P, Snoeck C, Dewilde M, Ervynck A, Tambets K, Larmuseau MHD, Kivisild T. Capturing the fusion of two ancestries and kinship structures in Merovingian Flanders. Proc Natl Acad Sci U S A 2024; 121:e2406734121. [PMID: 38913897 DOI: 10.1073/pnas.2406734121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Accepted: 05/17/2024] [Indexed: 06/26/2024] Open
Abstract
The Merovingian period (5th to 8th centuries AD) was a time of demographic, socioeconomic, cultural, and political realignment in Western Europe. Here, we report the whole-genome shotgun sequence data of 30 human skeletal remains from a coastal Late Merovingian site of Koksijde (675 to 750 AD), alongside 18 remains from two Early to Late Medieval sites in present-day Flanders, Belgium. We find two distinct ancestries: one shared with Early Medieval England and the Netherlands, and the other, a minor component, likely reflecting continental Gaulish ancestry. Kinship analyses identified no large pedigrees characteristic of elite burials, revealing instead a high modularity of distant relationships among individuals of the main ancestry group. In contrast, individuals with >90% Gaulish ancestry had no kinship links among sampled individuals. Evidence for population structure and major differences in the extent of Gaulish ancestry in the main group, including in a mother-daughter pair, suggests ongoing admixture in the community at the time of their burial. The isotopic and genetic evidence combined supports a model by which the burials, representing an established coastal nonelite community, had incorporated migrants from inland populations. The main group of burials at Koksijde shows an abundance of >5 cM long shared allelic intervals with the High Medieval site nearby, implying long-term continuity and suggesting that, similarly to Britain, the Early Medieval ancestry shifts left a significant and long-lasting impact on the genetic makeup of the Flemish population. We find substantial allele frequency differences between the two ancestry groups in pigmentation and diet-associated variants, including those linked with lactase persistence, likely reflecting ancestry change rather than local adaptation.
Affiliation(s)
- Stefania Sasso
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Lehti Saag
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Rachèl Spros
  - Research Unit: Archaeology, Environmental Changes and Geo-Chemistry (AMGC), Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Research Unit: Social History of Capitalism, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Owyn Beneker
  - Department of Human Genetics, KU Leuven, 3000 Leuven, Belgium
- Simone A Biagini
  - Department of Human Genetics, KU Leuven, 3000 Leuven, Belgium
  - Institut de Biologia Evolutiva, Departament de Medicina i Ciències de la Vida, Universitat Pompeu Fabra, Parc de Recerca Biomèdica de Barcelona, 08003 Barcelona, Spain
- Ruoyun Hui
  - Alan Turing Institute, NW1 2DB London, United Kingdom
- Eugenia D'Atanasio
  - Institute of Molecular Biology and Pathology, Italian National Research Council, Rome, Italy
- Alena Kushniarevich
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Helja Kabral
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Ene Metspalu
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Meriam Guellil
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
  - Department of Evolutionary Anthropology, University of Vienna, 1030 Vienna, Austria
- Birgit Berk
  - Birgit Berk Fysische Anthropologie, 6231EC Meerssen, Netherlands
- April Pijpelink
  - Crematie en Inhumatie Analyse (CRINA) Fysische Antropologie, 5237 JG 's-Hertogenbosch, Netherlands
- Philip Van Damme
  - Department of Neurology, KU Leuven and Center for Brain & Disease Research Vlaamse Instituut voor Biotechnologie, 3000 Leuven, Belgium
  - Department of Neurosciences, KU Leuven and Center for Brain & Disease Research VIB, 3000 Leuven, Belgium
- Christiana L Scheib
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
  - Department of Zoology, University of Cambridge, CB2 3EJ Cambridge, United Kingdom
  - Department of Archaeology, University of Cambridge, CB2 3DZ Cambridge, United Kingdom
  - St John's College, University of Cambridge, CB2 1TP Cambridge, United Kingdom
- Ewoud Deschepper
  - Historical Archaeology Research Group, Department of Archaeology, Ghent University, 9000 Ghent, Belgium
- Christophe Snoeck
  - Research Unit: Archaeology, Environmental Changes and Geo-Chemistry (AMGC), Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Marc Dewilde
  - Flanders Heritage Agency, 1000 Brussels, Belgium
- Kristiina Tambets
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
- Toomas Kivisild
  - Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu 51010, Estonia
  - Department of Human Genetics, KU Leuven, 3000 Leuven, Belgium
2
Nakamura W, Hirata M, Oda S, Chiba K, Okada A, Mateos RN, Sugawa M, Iida N, Ushiama M, Tanabe N, Sakamoto H, Sekine S, Hirasawa A, Kawai Y, Tokunaga K, Tsujimoto SI, Shiba N, Ito S, Yoshida T, Shiraishi Y. Assessing the efficacy of target adaptive sampling long-read sequencing through hereditary cancer patient genomes. NPJ Genom Med 2024; 9:11. [PMID: 38368425 PMCID: PMC10874402 DOI: 10.1038/s41525-024-00394-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2023] [Accepted: 01/15/2024] [Indexed: 02/19/2024] Open
Abstract
Innovations in sequencing technology have led to the discovery of novel mutations that cause inherited diseases. However, many patients with suspected genetic diseases remain undiagnosed. Long-read sequencing technologies are expected to significantly improve the diagnostic rate by overcoming the limitations of short-read sequencing. In addition, Oxford Nanopore Technologies (ONT) offers adaptive sampling and computationally driven target enrichment technology. This enables more affordable intensive analysis of target gene regions compared to standard non-selective long-read sequencing. In this study, we developed an efficient computational workflow for target adaptive sampling long-read sequencing (TAS-LRS) and evaluated it through application to 33 genomes collected from suspected hereditary cancer patients. Our workflow can identify single nucleotide variants with nearly the same accuracy as the short-read platform and elucidate complex forms of structural variations. We also newly identified several SINE-R/VNTR/Alu (SVA) elements affecting the APC gene in two patients with familial adenomatous polyposis, as well as their sites of origin. In addition, we demonstrated that off-target reads from adaptive sampling, which are typically discarded, can be effectively used to accurately genotype common single-nucleotide polymorphisms (SNPs) across the entire genome, enabling the calculation of a polygenic risk score. Furthermore, we identified allele-specific MLH1 promoter hypermethylation in a Lynch syndrome patient. In summary, our workflow with TAS-LRS can simultaneously capture monogenic risk variants including complex structural variations, polygenic background, and epigenetic alterations, and will be an efficient platform for genetic disease research and diagnosis.
Affiliation(s)
- Wataru Nakamura
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
  - Department of Pediatrics, Yokohama City University Hospital, Kanagawa, Japan
- Makoto Hirata
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
  - Department of Molecular Pathology, National Cancer Center Research Institute, Tokyo, Japan
- Satoyo Oda
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
  - Division of Laboratory Medicine, National Cancer Center Hospital, Tokyo, Japan
- Kenichi Chiba
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
- Ai Okada
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
- Raúl Nicolás Mateos
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
- Masahiro Sugawa
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
- Naoko Iida
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
- Mineko Ushiama
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
  - Department of Clinical Genetics, National Cancer Center Research Institute, Tokyo, Japan
- Noriko Tanabe
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
- Hiromi Sakamoto
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
  - Department of Clinical Genetics, National Cancer Center Research Institute, Tokyo, Japan
- Shigeki Sekine
  - Division of Molecular Pathology, National Cancer Center Research Institute, Tokyo, Japan
- Akira Hirasawa
  - Department of Clinical Genetics and Genomic Medicine, Okayama University Hospital, Okayama, Japan
- Yosuke Kawai
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
- Katsushi Tokunaga
  - Genome Medical Science Project, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan
  - Central Biobank, National Center Biobank Network, Tokyo, Japan
- Shin-Ichi Tsujimoto
  - Department of Pediatrics, Yokohama City University Hospital, Kanagawa, Japan
- Norio Shiba
  - Department of Pediatrics, Yokohama City University Hospital, Kanagawa, Japan
- Shuichi Ito
  - Department of Pediatrics, Yokohama City University Hospital, Kanagawa, Japan
- Teruhiko Yoshida
  - Division of Genetic Medicine and Services, National Cancer Center Hospital, Tokyo, Japan
  - Department of Clinical Genetics, National Cancer Center Research Institute, Tokyo, Japan
- Yuichi Shiraishi
  - Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan
3
Li Z. vcfpp: a C++ API for rapid processing of the variant call format. Bioinformatics 2024; 40:btae049. [PMID: 38273677 PMCID: PMC10868310 DOI: 10.1093/bioinformatics/btae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 01/08/2024] [Accepted: 01/23/2024] [Indexed: 01/27/2024] Open
Abstract
MOTIVATION: Given the widespread use of the variant call format (VCF/BCF) coupled with continuous surge in big data, there remains a perpetual demand for fast and flexible methods to manipulate these comprehensive formats across various programming languages.
RESULTS: This work presents vcfpp, a C++ API of HTSlib in a single file, providing an intuitive interface to manipulate VCF/BCF files rapidly and safely, in addition to being portable. Moreover, this work introduces the vcfppR package to demonstrate the development of a high-performance R package with vcfpp, allowing for rapid and straightforward variants analyses.
AVAILABILITY AND IMPLEMENTATION: vcfpp is available from https://github.com/Zilong-Li/vcfpp under MIT license. vcfppR is available from https://cran.r-project.org/web/packages/vcfppR.
Affiliation(s)
- Zilong Li
  - Section for Computational and RNA Biology, University of Copenhagen, Copenhagen 2200, Denmark
4
Zhang K, Liang J, Fu Y, Chu J, Fu L, Wang Y, Li W, Zhou Y, Li J, Yin X, Wang H, Liu X, Mou C, Wang C, Wang H, Dong X, Yan D, Yu M, Zhao S, Li X, Ma Y. AGIDB: a versatile database for genotype imputation and variant decoding across species. Nucleic Acids Res 2024; 52:D835-D849. [PMID: 37889051 PMCID: PMC10767904 DOI: 10.1093/nar/gkad913] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/14/2023] [Revised: 10/05/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023] Open
Abstract
The high cost of large-scale, high-coverage whole-genome sequencing has limited its application in genomics and genetics research. The common approach has been to impute whole-genome sequence variants obtained from a few individuals for a larger population of interest individually genotyped using SNP chip. An alternative involves low-coverage whole-genome sequencing (lcWGS) of all individuals in the larger population, followed by imputation to sequence resolution. To overcome limitations of processing lcWGS data and meeting specific genotype imputation requirements, we developed AGIDB (https://agidb.pro), a website comprising tools and database with an unprecedented sample size and comprehensive variant decoding for animals. AGIDB integrates whole-genome sequencing and chip data from 17 360 and 174 945 individuals, respectively, across 89 species to identify over one billion variants, totaling a massive 688.57 TB of processed data. AGIDB focuses on integrating multiple genotype imputation scenarios. It also provides user-friendly searching and data analysis modules that enable comprehensive annotation of genetic variants for specific populations. To meet a wide range of research requirements, AGIDB offers downloadable reference panels for each species in addition to its extensive dataset, variant decoding and utility tools. We hope that AGIDB will become a key foundational resource in genetics and breeding, providing robust support to researchers.
Affiliation(s)
- Kaili Zhang
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Jiete Liang
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Yuhua Fu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Jinyu Chu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Liangliang Fu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - The Cooperative Innovation Center for Sustainable Pig Production, Huazhong Agricultural University, Wuhan 430070, China
- Yongfei Wang
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Wangjiao Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- You Zhou
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Jinhua Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Xiaoxiao Yin
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Haiyan Wang
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Xiaolei Liu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Chunyan Mou
  - College of Animal Science and Technology, Southwest University, Chongqing 402460, China
- Chonglong Wang
  - Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Heng Wang
  - College of Animal Science and Technology, Shandong Agricultural University, Taian 271018, China
- Xinxing Dong
  - Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Dawei Yan
  - Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Mei Yu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Shuhong Zhao
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
- Xinyun Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Yunlong Ma
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
5
Lamb HJ, Nguyen LT, Copley JP, Engle BN, Hayes BJ, Ross EM. Imputation strategies for genomic prediction using nanopore sequencing. BMC Biol 2023; 21:286. [PMID: 38066581 PMCID: PMC10709982 DOI: 10.1186/s12915-023-01782-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Accepted: 11/27/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND: Genomic prediction describes the use of SNP genotypes to predict complex traits and has been widely applied in humans and agricultural species. Genotyping-by-sequencing, a method which uses low-coverage sequence data paired with genotype imputation, is becoming an increasingly popular SNP genotyping method for genomic prediction. The development of Oxford Nanopore Technologies' (ONT) MinION sequencer has now made genotyping-by-sequencing portable and rapid. Here we evaluate the speed and accuracy of genomic predictions using low-coverage ONT sequence data in a population of cattle using four imputation approaches. We also investigate the effect of SNP reference panel size on imputation performance.
RESULTS: SNP array genotypes and ONT sequence data for 62 beef heifers were used to calculate genomic estimated breeding values (GEBVs) from 641 k SNP for four traits. GEBV accuracy was much higher when genome-wide flanking SNP from sequence data were used to help impute the 641 k panel used for genomic predictions. Using the imputation package QUILT, correlations between ONT and low-density SNP array genomic breeding values were greater than 0.91 and up to 0.97 for sequencing coverages as low as 0.1× using a reference panel of 48 million SNP. Imputation time was significantly reduced by decreasing the number of flanking sequence SNP used in imputation for all methods. When compared to high-density SNP arrays, genotyping accuracy and genomic breeding value correlations at 0.5× coverage were also found to be higher than those imputed from low-density arrays.
CONCLUSIONS: Here we demonstrated accurate genomic prediction is possible with ONT sequence data from sequencing coverages as low as 0.1×, and imputation time can be as short as 10 min per sample. We also demonstrate that in this population, genotyping-by-sequencing at 0.1× coverage can be more accurate than imputation from low-density SNP arrays.
Affiliation(s)
- H J Lamb
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD, 4067, Australia
- L T Nguyen
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD, 4067, Australia
- J P Copley
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD, 4067, Australia
- B N Engle
  - USDA, ARS, U.S. Meat Animal Research Centre, Clay Centre, NE, 68933, USA
- B J Hayes
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD, 4067, Australia
- E M Ross
  - Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St. Lucia, QLD, 4067, Australia
6
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor Decomposition-based Feature Extraction and Classification to Detect Natural Selection from Genomic Data. Mol Biol Evol 2023; 40:msad216. [PMID: 37772983 PMCID: PMC10581699 DOI: 10.1093/molbev/msad216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2023] [Revised: 08/10/2023] [Accepted: 09/14/2023] [Indexed: 09/30/2023] Open
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
Affiliation(s)
- Md Ruhul Amin
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Mahmudul Hasan
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
7
Rubinacci S, Hofmeister RJ, Sousa da Mota B, Delaneau O. Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nat Genet 2023:10.1038/s41588-023-01438-3. [PMID: 37386250 DOI: 10.1038/s41588-023-01438-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/28/2022] [Accepted: 05/31/2023] [Indexed: 07/01/2023]
Abstract
The release of 150,119 UK Biobank sequences represents an unprecedented opportunity as a reference panel to impute low-coverage whole-genome sequencing data with high accuracy but current methods cannot cope with the size of the data. Here we introduce GLIMPSE2, a low-coverage whole-genome sequencing imputation method that scales sublinearly in both the number of samples and markers, achieving efficient whole-genome imputation from the UK Biobank reference panel while retaining high accuracy for ancient and modern genomes, particularly at rare variants and for very low-coverage samples.
Affiliation(s)
- Simone Rubinacci
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Robin J Hofmeister
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Bárbara Sousa da Mota
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Olivier Delaneau
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
8
Sousa da Mota B, Rubinacci S, Cruz Dávalos DI, G Amorim CE, Sikora M, Johannsen NN, Szmyt MH, Włodarczak P, Szczepanek A, Przybyła MM, Schroeder H, Allentoft ME, Willerslev E, Malaspinas AS, Delaneau O. Imputation of ancient human genomes. Nat Commun 2023; 14:3660. [PMID: 37339987 DOI: 10.1038/s41467-023-39202-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 10/12/2022] [Accepted: 06/02/2023] [Indexed: 06/22/2023] Open
Abstract
Due to postmortem DNA degradation and microbial colonization, most ancient genomes have low depth of coverage, hindering genotype calling. Genotype imputation can improve genotyping accuracy for low-coverage genomes. However, it is unknown how accurate ancient DNA imputation is and whether imputation introduces bias to downstream analyses. Here we re-sequence an ancient trio (mother, father, son) and downsample and impute a total of 43 ancient genomes, including 42 high-coverage (above 10x) genomes. We assess imputation accuracy across ancestries, time, depth of coverage, and sequencing technology. We find that ancient and modern DNA imputation accuracies are comparable. When downsampled at 1x, 36 of the 42 genomes are imputed with low error rates (below 5%) while African genomes have higher error rates. We validate imputation and phasing results using the ancient trio data and an orthogonal approach based on Mendel's rules of inheritance. We further compare the downstream analysis results between imputed and high-coverage genomes, notably principal component analysis, genetic clustering, and runs of homozygosity, observing similar results starting from 0.5x coverage, except for the African genomes. These results suggest that, for most populations and depths of coverage as low as 0.5x, imputation is a reliable method that can improve ancient DNA studies.
Affiliation(s)
- Bárbara Sousa da Mota
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Diana Ivette Cruz Dávalos
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Martin Sikora
  - Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Niels N Johannsen
  - Department of Archaeology and Heritage Studies, Aarhus University, Aarhus, Denmark
- Marzena H Szmyt
  - Institute for Eastern Research, Adam Mickiewicz University in Poznań, Poznań, Poland
- Piotr Włodarczak
  - Institute of Archaeology and Ethnology, Polish Academy of Sciences, Kraków, Poland
- Anita Szczepanek
  - Institute of Archaeology and Ethnology, Polish Academy of Sciences, Kraków, Poland
  - Department of Anatomy, Jagiellonian University, Medical College, Kraków, Poland
- Hannes Schroeder
  - The Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Morten E Allentoft
  - Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Trace and Environmental DNA (TrEnD) Laboratory, School of Molecular and Life Science, Curtin University, Bentley, WA, Australia
- Eske Willerslev
  - Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - GeoGenetics Group, Department of Zoology, University of Cambridge, Cambridge, UK
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
  - MARUM, University of Bremen, Bremen, Germany
- Anna-Sapfo Malaspinas
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Olivier Delaneau
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
9
|
Lloret-Villas A, Pausch H, Leonard AS. The size and composition of haplotype reference panels impact the accuracy of imputation from low-pass sequencing in cattle. Genet Sel Evol 2023; 55:33. [PMID: 37170101 PMCID: PMC10173671 DOI: 10.1186/s12711-023-00809-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/26/2023] [Accepted: 05/02/2023] [Indexed: 05/13/2023] Open
Abstract
BACKGROUND Low-pass sequencing followed by sequence variant genotype imputation is an alternative to the routine microarray-based genotyping in cattle. However, the impact of haplotype reference panels and their interplay with the coverage of low-pass whole-genome sequencing data have not been sufficiently explored in typical livestock settings where only a small number of reference samples is available. METHODS Sequence variant genotyping accuracy was compared between two variant callers, GATK and DeepVariant, in 50 Brown Swiss cattle with sequencing coverages ranging from 4- to 63-fold. Haplotype reference panels of varying sizes and composition were built with DeepVariant based on 501 individuals from nine breeds. High-coverage sequence data for 24 Brown Swiss cattle were downsampled to between 0.01- and 4-fold to mimic low-pass sequencing. GLIMPSE was used to infer sequence variant genotypes from the low-pass sequencing data using different haplotype reference panels. The accuracy of the sequence variant genotypes that were inferred from low-pass sequencing data was compared with sequence variant genotypes called from high-coverage data. RESULTS DeepVariant was used to establish bovine haplotype reference panels because it outperformed GATK in all evaluations. Within-breed haplotype reference panels were more accurate and efficient to impute sequence variant genotypes from low-pass sequencing than equally-sized multibreed haplotype reference panels for all target sample coverages and allele frequencies. F1 scores greater than 0.9, which indicate high harmonic means of recall and precision of called genotypes, were achieved with 0.25-fold sequencing coverage when large breed-specific haplotype reference panels (n = 150) were used. 
In the absence of such large within-breed haplotype panels, variant genotyping accuracy from low-pass sequencing could be increased either by adding unrelated samples to the haplotype reference panel or by increasing the coverage of the low-pass sequencing data. Sequence variant genotyping from low-pass sequencing was substantially less accurate when the reference panel lacked individuals from the target breed. CONCLUSIONS Variant genotyping is more accurate with DeepVariant than GATK. DeepVariant is therefore suitable to establish bovine haplotype reference panels. Medium-sized breed-specific haplotype reference panels and large multibreed haplotype reference panels enable accurate imputation of low-pass sequencing data in a typical cattle breed.
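The abstract above reports accuracy as F1 scores, described as the harmonic mean of recall and precision of called genotypes. A minimal sketch of that calculation follows; the function name and the count-based inputs are illustrative assumptions, and how true and false positives are tallied against the high-coverage truth set is not specified here.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall computed from raw counts.

    Returns 0.0 when there are no true positives, since precision and
    recall are then both zero (or undefined).
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 90 correct calls, 10 spurious calls, 10 missed calls.
    print(round(f1_score(90, 10, 10), 3))
```

Because the harmonic mean is dominated by the smaller of the two values, an F1 above 0.9 requires both precision and recall to be high, not just one of them.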
Affiliation(s)
- Hubert Pausch
  - Animal Genomics, ETH Zürich, Universitätstrasse 2, Zürich, 8092, Switzerland
- Alexander S Leonard
  - Animal Genomics, ETH Zürich, Universitätstrasse 2, Zürich, 8092, Switzerland

10
Mun T, Vaddadi NSK, Langmead B. Pangenomic genotyping with the marker array. Algorithms Mol Biol 2023; 18:2. [PMID: 37147657 PMCID: PMC10161648 DOI: 10.1186/s13015-023-00225-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 03/31/2023] [Accepted: 04/22/2023] [Indexed: 05/07/2023] Open
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt available at https://github.com/alshai/rowbowt.
Affiliation(s)
- Taher Mun
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

11
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor decomposition based feature extraction and classification to detect natural selection from genomic data. bioRxiv 2023:2023.03.27.527731. [PMID: 37034767 PMCID: PMC10081272 DOI: 10.1101/2023.03.27.527731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under non-convex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.

12
Shipilina D, Pal A, Stankowski S, Chan YF, Barton NH. On the origin and structure of haplotype blocks. Mol Ecol 2023; 32:1441-1457. [PMID: 36433653 PMCID: PMC10946714 DOI: 10.1111/mec.16793] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2022] [Revised: 11/16/2022] [Accepted: 11/18/2022] [Indexed: 11/27/2022]
Abstract
The term "haplotype block" is commonly used in the developing field of haplotype-based inference methods. We argue that the term should be defined based on the structure of the Ancestral Recombination Graph (ARG), which contains complete information on the ancestry of a sample. We use simulated examples to demonstrate key features of the relationship between haplotype blocks and ancestral structure, emphasizing the stochasticity of the processes that generate them. Even the simplest cases of neutrality or of a "hard" selective sweep produce a rich structure, often missed by commonly used statistics. We highlight a number of novel methods for inferring haplotype structure, based on the full ARG, or on a sequence of trees, and illustrate how they can be used to define haplotype blocks using an empirical data set. While the advent of new, computationally efficient methods makes it possible to apply these concepts broadly, they (and additional new methods) could benefit from adding features to explore haplotype blocks, as we define them. Understanding and applying the concept of the haplotype block will be essential to fully exploit long and linked-read sequencing technologies.
Affiliation(s)
- Daria Shipilina
  - Evolutionary Biology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala, Sweden
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
  - Swedish Collegium for Advanced Study, Uppsala, Sweden
- Arka Pal
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
- Sean Stankowski
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
- Nicholas H Barton
  - Institute of Science and Technology Austria, Klosterneuburg, Austria

13
Nguyen TV, Vander Jagt CJ, Wang J, Daetwyler HD, Xiang R, Goddard ME, Nguyen LT, Ross EM, Hayes BJ, Chamberlain AJ, MacLeod IM. In it for the long run: perspectives on exploiting long-read sequencing in livestock for population scale studies of structural variants. Genet Sel Evol 2023; 55:9. [PMID: 36721111 PMCID: PMC9887926 DOI: 10.1186/s12711-023-00783-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 01/23/2023] [Indexed: 02/02/2023] Open
Abstract
Studies have demonstrated that structural variants (SV) play a substantial role in the evolution of species and have an impact on Mendelian traits in the genome. However, unlike small variants (< 50 bp), it has been challenging to accurately identify and genotype SV at the population scale using short-read sequencing. Long-read sequencing technologies are becoming competitively priced and can address several of the disadvantages of short-read sequencing for the discovery and genotyping of SV. In livestock species, analysis of SV at the population scale still faces challenges due to the lack of resources, high costs, technological barriers, and computational limitations. In this review, we summarize recent progress in the characterization of SV in the major livestock species, the obstacles that still need to be overcome, as well as the future directions in this growing field. It seems timely that research communities pool resources to build global population-scale long-read sequencing consortiums for the major livestock species for which the application of genomic tools has become cost-effective.
Affiliation(s)
- Tuan V. Nguyen
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
- Christy J. Vander Jagt
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
- Jianghui Wang
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
- Hans D. Daetwyler
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia (grid.1018.8; 0000 0001 2342 0938)
- Ruidong Xiang
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
  - Faculty of Veterinary & Agricultural Science, The University of Melbourne, Parkville, VIC 3052, Australia (grid.1008.9; 0000 0001 2179 088X)
- Michael E. Goddard
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
  - Faculty of Veterinary & Agricultural Science, The University of Melbourne, Parkville, VIC 3052, Australia (grid.1008.9; 0000 0001 2179 088X)
- Loan T. Nguyen
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072, Australia (grid.1003.2; 0000 0000 9320 7537)
- Elizabeth M. Ross
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072, Australia (grid.1003.2; 0000 0000 9320 7537)
- Ben J. Hayes
  - Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia, QLD 4072, Australia (grid.1003.2; 0000 0000 9320 7537)
- Amanda J. Chamberlain
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia (grid.1018.8; 0000 0001 2342 0938)
- Iona M. MacLeod
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia (grid.452283.a; 0000 0004 0407 2669)

14
Nachmanson D, Pagadala M, Steward J, Cheung C, Bruce LK, Lee NQ, O'Keefe TJ, Lin GY, Hasteh F, Morris GP, Carter H, Harismendy O. Accurate genome-wide genotyping from archival tissue to explore the contribution of common genetic variants to pre-cancer outcomes. J Transl Med 2022; 20:623. [PMID: 36575447 PMCID: PMC9793518 DOI: 10.1186/s12967-022-03810-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2022] [Accepted: 12/05/2022] [Indexed: 12/28/2022] Open
Abstract
PURPOSE The contribution of common genetic variants to pre-cancer progression is understudied due to long follow-up time, rarity of poor outcomes and lack of available germline DNA collection. Alternatively, DNA from diagnostic archival tissue is available, but its somatic nature, limited quantity and suboptimal quality would require an accurate cost-effective genome-wide germline genotyping methodology. EXPERIMENTAL DESIGN Blood and tissue DNA from 10 individuals were used to benchmark the accuracy of Single Nucleotide Polymorphisms (SNP) genotypes, Polygenic Risk Scores (PRS) or HLA haplotypes using low-coverage whole-genome sequencing (lc-WGS) and genotype imputation. Tissue-derived PRS were further evaluated for 36 breast cancer patients (11.7 years median follow-up time) diagnosed with DCIS and used to model the risk of Breast Cancer Subsequent Events (BCSE). RESULTS Tissue-derived germline DNA profiling resulted in accurate genotypes at common SNPs (blood correlation r2 > 0.94) and across 22 disease-related polygenic risk scores (PRS, mean correlation r = 0.93). Imputed Class I and II HLA haplotypes were 96.7% and 82.5% concordant with clinical-grade blood HLA haplotypes, respectively. In DCIS patients, tissue-derived PRS was significantly associated with BCSE (HR = 2, 95% CI 1.2-3.8). The top and bottom decile patients had an estimated 28% and 5% chance of BCSE at 10 years, respectively. CONCLUSIONS Archival tissue DNA germline profiling using lc-WGS and imputation represents a cost- and resource-effective alternative in the retrospective design of long-term disease genetic studies. Initial results in breast cancer suggest that common risk variants contribute to pre-cancer progression.
Affiliation(s)
- Daniela Nachmanson
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Meghana Pagadala
  - Biomedical Science Graduate Program, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Joseph Steward
  - Moores Cancer Center, University of California San Diego, 3855 Health Science Drive, San Diego, CA, 92093, USA
- Callie Cheung
  - Moores Cancer Center, University of California San Diego, 3855 Health Science Drive, San Diego, CA, 92093, USA
- Lauryn Keeler Bruce
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Nicole Q Lee
  - Moores Cancer Center, University of California San Diego, 3855 Health Science Drive, San Diego, CA, 92093, USA
- Thomas J O'Keefe
  - Department of Surgery, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Grace Y Lin
  - Department of Pathology, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Farnaz Hasteh
  - Department of Pathology, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Gerald P Morris
  - Department of Pathology, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA
- Hannah Carter
  - Moores Cancer Center, University of California San Diego, 3855 Health Science Drive, San Diego, CA, 92093, USA
  - Division of Medical Genetics, Department of Medicine, University of California San Diego, La Jolla, CA, 92093, USA
- Olivier Harismendy
  - Moores Cancer Center, University of California San Diego, 3855 Health Science Drive, San Diego, CA, 92093, USA
  - Division of Biomedical Informatics, Department of Medicine, University of California San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA

15
Wang D, Xie K, Wang Y, Hu J, Li W, Yang A, Zhang Q, Ning C, Fan X. Cost-effectively dissecting the genetic architecture of complex wool traits in rabbits by low-coverage sequencing. Genet Sel Evol 2022; 54:75. [PMCID: PMC9673297 DOI: 10.1186/s12711-022-00766-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/16/2022] [Accepted: 10/31/2022] [Indexed: 11/19/2022] Open
Abstract
Background Rabbit wool traits are important in fiber production and for model organism research on hair growth, but their genetic architecture remains obscure. In this study, we focused on wool characteristics in Angora rabbits, a breed well-known for the quality of its wool. Considering the cost to generate population-scale sequence data and the biased detection of variants using chip data, developing an effective genotyping strategy using low-coverage whole-genome sequencing (LCS) data is necessary to conduct genetic analyses. Results Different genotype imputation strategies (BaseVar + STITCH, Bcftools + Beagle4, and GATK + Beagle5), sequencing coverages (0.1X, 0.5X, 1.0X, 1.5X, and 2.0X), and sample sizes (100, 200, 300, 400, 500, and 600) were compared. Our results showed that using BaseVar + STITCH at a sequencing depth of 1.0X with a sample size larger than 300 resulted in the highest genotyping accuracy, with a genotype concordance higher than 98.8% and genotype accuracy higher than 0.97. We performed multivariate genome-wide association studies (GWAS), followed by conditional GWAS and estimation of the confidence intervals of quantitative trait loci (QTL) to investigate the genetic architecture of wool traits. Six QTL were detected, which explained 0.4 to 7.5% of the phenotypic variation. Gene-level mapping identified the fibroblast growth factor 10 (FGF10) gene as associated with fiber growth and diameter, which agrees with previous results from functional data analyses on the FGF gene family in other species, and is relevant for wool rabbit breeding. Conclusions We suggest that LCS followed by imputation can be a cost-effective alternative to array and high-depth sequencing for assessing common variants. GWAS combined with LCS can identify new QTL and candidate genes that are associated with quantitative traits. 
This study provides a cost-effective and powerful method for investigating the genetic architecture of complex traits, which will be useful for genomic breeding applications. Supplementary Information The online version contains supplementary material available at 10.1186/s12711-022-00766-y.
Affiliation(s)
- Dan Wang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Kerui Xie
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Yanyan Wang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Jiaqing Hu
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Wenqiang Li
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Aiguo Yang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Qin Zhang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Chao Ning
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)
- Xinzhong Fan
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China (grid.440622.6; 0000 0000 9482 4676)

16
Song M, Greenbaum J, Luttrell J, Zhou W, Wu C, Luo Z, Qiu C, Zhao LJ, Su KJ, Tian Q, Shen H, Hong H, Gong P, Shi X, Deng HW, Zhang C. An autoencoder-based deep learning method for genotype imputation. Front Artif Intell 2022; 5:1028978. [PMID: 36406474 PMCID: PMC9671213 DOI: 10.3389/frai.2022.1028978] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/26/2022] [Accepted: 09/29/2022] [Indexed: 11/06/2022] Open
Abstract
Genotype imputation has a wide range of applications in genome-wide association study (GWAS), including increasing the statistical power of association tests, discovering trait-associated loci in meta-analyses, and prioritizing causal variants with fine-mapping. In recent years, deep learning (DL) based methods, such as sparse convolutional denoising autoencoder (SCDA), have been developed for genotype imputation. However, it remains a challenging task to optimize the learning process in DL-based methods to achieve high imputation accuracy. To address this challenge, we have developed a convolutional autoencoder (AE) model for genotype imputation and implemented a customized training loop by modifying the training process with a single batch loss rather than the average loss over batches. This modified AE imputation model was evaluated using a yeast dataset, the human leukocyte antigen (HLA) data from the 1,000 Genomes Project (1KGP), and our in-house genotype data from the Louisiana Osteoporosis Study (LOS). Our modified AE imputation model has achieved comparable or better performance than the existing SCDA model in terms of evaluation metrics such as the concordance rate (CR), the Hellinger score, the scaled Euclidean norm (SEN) score, and the imputation quality score (IQS) in all three datasets. Taking the imputation results from the HLA data as an example, the AE model achieved an average CR of 0.9468 and 0.9459, Hellinger score of 0.9765 and 0.9518, SEN score of 0.9977 and 0.9953, and IQS of 0.9515 and 0.9044 at missing ratios of 10% and 20%, respectively. As for the results of LOS data, it achieved an average CR of 0.9005, Hellinger score of 0.9384, SEN score of 0.9940, and IQS of 0.8681 at the missing ratio of 20%. In summary, our proposed method for genotype imputation has a great potential to increase the statistical power of GWAS and improve downstream post-GWAS analyses.
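The abstract above evaluates imputation with a concordance rate (CR) at a given missing ratio. One common way to compute such a rate, sketched here under assumptions, is to hide a set of known genotypes, impute them, and count exact matches at the hidden positions. The function name, the 0/1/2 dosage encoding, and the index-list interface are hypothetical and are not taken from the paper, whose exact CR definition is not given in the abstract.

```python
def concordance_rate(true_genotypes, imputed_genotypes, masked_indices):
    """Fraction of masked positions where the imputed genotype equals the
    true genotype.

    Genotypes are encoded as alternate-allele counts (0, 1, 2); this
    encoding is an assumption of the sketch.
    """
    masked = list(masked_indices)
    if not masked:
        raise ValueError("at least one masked position is required")
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")
    matches = sum(true_genotypes[i] == imputed_genotypes[i] for i in masked)
    return matches / len(masked)


if __name__ == "__main__":
    truth = [0, 1, 2, 1, 0]
    imputed = [0, 1, 1, 1, 0]
    # Positions 1, 2 and 3 were hidden before imputation (60% missing ratio).
    print(concordance_rate(truth, imputed, [1, 2, 3]))
```

Restricting the comparison to masked positions matters: including the observed positions would inflate the rate, since those genotypes are simply copied through.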
Affiliation(s)
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- Jonathan Greenbaum
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Joseph Luttrell
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- Weihua Zhou
  - College of Computing, Michigan Technological University, Houghton, MI, United States
- Chong Wu
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Zhe Luo
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Chuan Qiu
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Lan Juan Zhao
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Kuan-Jui Su
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Qing Tian
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Hui Shen
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, United States
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, United States
- Xinghua Shi
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, United States
- Hong-Wen Deng
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- *Correspondence: Hong-Wen Deng; Chaoyang Zhang

17
Ning C, Xie K, Huang J, Di Y, Wang Y, Yang A, Hu J, Zhang Q, Wang D, Fan X. Marker density and statistical model designs to increase accuracy of genomic selection for wool traits in Angora rabbits. Front Genet 2022; 13:968712. [PMID: 36118881 PMCID: PMC9478554 DOI: 10.3389/fgene.2022.968712] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/14/2022] [Accepted: 08/17/2022] [Indexed: 11/13/2022] Open
Abstract
The Angora rabbit, a well-known breed for fiber production, has been undergoing traditional breeding programs relying mainly on phenotypes. Genomic selection (GS) uses genomic information and promises to accelerate genetic gain. Practically, to implement GS in Angora rabbit breeding, it is necessary to evaluate different marker densities and GS models to develop suitable strategies for an optimized breeding pipeline. Given the lack of a microarray, low-coverage sequencing combined with genotype imputation was used to boost the number of SNPs across the rabbit genome. Here, in a population of 629 Angora rabbits, a total of 18,577,154 high-quality SNPs were imputed (imputation accuracy above 98%) based on low-coverage sequencing of 3.84X genomic coverage, and wool traits and body weight were measured at 70, 140 and 210 days of age. From the original markers, subsets of 0.5K, 1K, 3K, 5K, 10K, 50K, 100K, 500K, 1M and 2M were randomly selected and evaluated, and 50K markers were chosen as the baseline for the heritability estimation and genomic prediction. Compared with the GS performance of single-trait models, the prediction accuracy of nearly all traits could be improved by multi-trait models, possibly because multi-trait models use information from genetically correlated traits. Furthermore, we observed a highly significant negative correlation between the increase in prediction accuracy from single-trait to multi-trait models and estimated heritability. The results indicated that low-heritability traits could borrow more information from correlated traits and hence achieve higher prediction accuracy. This study is the first to report heritability estimation in rabbits using genome-wide markers, and it provides 50K as an optimal marker density for further microarray design, genetic evaluation and genomic selection in Angora rabbits. We expect that the work could provide strategies for GS in early selection, and optimize breeding programs in rabbits.
Affiliation(s)
- Dan Wang
- *Correspondence: Dan Wang; Xinzhong Fan

18
Mun T, Vaddadi NSK, Langmead B. Pangenomic Genotyping with the Marker Array. ALGORITHMS IN BIOINFORMATICS : ... INTERNATIONAL WORKSHOP, WABI ..., PROCEEDINGS. WABI (WORKSHOP) 2022; 242:19. [PMID: 36409181 PMCID: PMC9674407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods.
Affiliation(s)
- Taher Mun
  - Johns Hopkins University, Baltimore MD, USA; Illumina, San Diego, USA

19
Abstract
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Affiliation(s)
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
  - New York Genome Center, New York, NY, USA

20
Functional genomics data: privacy risk assessment and technological mitigation. Nat Rev Genet 2022; 23:245-258. [PMID: 34759381 DOI: 10.1038/s41576-021-00428-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Accepted: 10/18/2021] [Indexed: 12/15/2022]
Abstract
The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.
|
21
|
Teng J, Zhao C, Wang D, Chen Z, Tang H, Li J, Mei C, Yang Z, Ning C, Zhang Q. Assessment of the performance of different imputation methods for low-coverage sequencing in Holstein cattle. J Dairy Sci 2022; 105:3355-3366. [DOI: 10.3168/jds.2021-21360] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/01/2021] [Accepted: 12/13/2021] [Indexed: 12/27/2022]
|
22
|
Zhang H, Zhang X, Li M, Yang Y, Li Z, Xu Y, Wang H, Wang D, Zhang Y, Wang H, Fu Q, Zheng J, Yi H. Molecular mapping for fruit-related traits, and joint identification of candidate genes and selective sweeps for seed size in melon. Genomics 2022; 114:110306. [DOI: 10.1016/j.ygeno.2022.110306] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/03/2021] [Revised: 12/22/2021] [Accepted: 02/01/2022] [Indexed: 11/17/2022]
|
23
|
Lamb HJ, Hayes BJ, Randhawa IAS, Nguyen LT, Ross EM. Genomic prediction using low-coverage portable Nanopore sequencing. PLoS One 2021; 16:e0261274. [PMID: 34910782 PMCID: PMC8673642 DOI: 10.1371/journal.pone.0261274] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/03/2021] [Accepted: 11/26/2021] [Indexed: 11/18/2022] Open
Abstract
Most traits in livestock, crops and humans are polygenic, that is, a large number of loci contribute to genetic variation. Effects at these loci lie along a continuum ranging from common low-effect to rare high-effect variants that cumulatively contribute to the overall phenotype. Statistical methods to calculate the effect of these loci have been developed and can be used to predict phenotypes in new individuals. In agriculture, these methods are used to select superior individuals using genomic breeding values; in humans these methods are used to quantitatively measure an individual’s disease risk, termed polygenic risk scores. Both fields typically use SNP array genotypes for the analysis. Recently, genotyping-by-sequencing has become popular, due to lower cost and greater genome coverage (including structural variants). Oxford Nanopore Technologies’ (ONT) portable sequencers have the potential to combine the benefits of genotyping-by-sequencing with portability and decreased turn-around time. This introduces the potential for in-house clinical genetic disease risk screening in humans or calculating genomic breeding values on-farm in agriculture. Here we demonstrate the potential of the latter by calculating genomic breeding values for four traits in cattle using low-coverage ONT sequence data and comparing these breeding values to breeding values calculated from SNP arrays. At sequencing coverages between 2× and 4×, the correlation between ONT breeding values and SNP array-based breeding values was > 0.92 when imputation was used and > 0.88 when no imputation was used. With an average sequencing coverage of 0.5×, the correlation between the two methods was between 0.85 and 0.92 using imputation, depending on the trait. This suggests that ONT sequencing has potential for in-clinic or on-farm genomic prediction; however, further work is needed to validate these findings in a larger population.
Affiliation(s)
- Harrison J. Lamb
- Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
- Ben J. Hayes
- Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
- Imtiaz A. S. Randhawa
- School of Veterinary Science, The University of Queensland, Brisbane, QLD, Australia
- Loan T. Nguyen
- Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
- Elizabeth M. Ross
- Centre for Animal Science, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Brisbane, QLD, Australia
|
24
|
Haplotype-aware inference of human chromosome abnormalities. Proc Natl Acad Sci U S A 2021; 118:2109307118. [PMID: 34772814 DOI: 10.1073/pnas.2109307118] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Accepted: 09/16/2021] [Indexed: 12/25/2022] Open
Abstract
Extra or missing chromosomes, a phenomenon termed aneuploidy, frequently arise during human meiosis and embryonic mitosis and are the leading cause of pregnancy loss, including in the context of in vitro fertilization (IVF). While meiotic aneuploidies affect all cells and are deleterious, mitotic errors generate mosaicism, which may be compatible with healthy live birth. Large-scale abnormalities such as triploidy and haploidy also contribute to adverse pregnancy outcomes, but remain hidden from standard sequencing-based approaches to preimplantation genetic testing for aneuploidy (PGT-A). The ability to reliably distinguish meiotic and mitotic aneuploidies, as well as abnormalities in genome-wide ploidy, may thus prove valuable for enhancing IVF outcomes. Here, we describe a statistical method for distinguishing these forms of aneuploidy based on analysis of low-coverage whole-genome sequencing data, which is the current standard in the field. Our approach overcomes the sparse nature of the data by leveraging allele frequencies and linkage disequilibrium (LD) measured in a population reference panel. The method, which we term LD-informed PGT-A (LD-PGTA), retains high accuracy down to coverage as low as 0.05× and at higher coverage can also distinguish between meiosis I and meiosis II errors based on signatures spanning the centromeres. LD-PGTA provides fundamental insight into the origins of human chromosome abnormalities, as well as a practical tool with the potential to improve genetic testing during IVF.
|
25
|
Gusev A, Groha S, Taraszka K, Semenov YR, Zaitlen N. Constructing germline research cohorts from the discarded reads of clinical tumor sequences. Genome Med 2021; 13:179. [PMID: 34749793 PMCID: PMC8576948 DOI: 10.1186/s13073-021-00999-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 04/08/2021] [Accepted: 10/28/2021] [Indexed: 12/02/2022] Open
Abstract
Background: Hundreds of thousands of cancer patients have had targeted (panel) tumor sequencing to identify clinically meaningful mutations. In addition to improving patient outcomes, this activity has led to significant discoveries in basic and translational domains. However, the targeted nature of clinical tumor sequencing has a limited scope, especially for germline genetics. In this work, we assess the utility of discarded, off-target reads from tumor-only panel sequencing for the recovery of genome-wide germline genotypes through imputation. Methods: We developed a framework for inference of germline variants from tumor panel sequencing, including imputation, quality control, inference of genetic ancestry, germline polygenic risk scores, and HLA alleles. We benchmarked our framework on 833 individuals with tumor sequencing and matched germline SNP array data. We then applied our approach to a prospectively collected panel sequencing cohort of 25,889 tumors. Results: We demonstrate high to moderate accuracy of each inferred feature relative to direct germline SNP array genotyping: individual common variants were imputed with a mean accuracy (correlation) of 0.86, genetic ancestry was inferred with a correlation of > 0.98, polygenic risk scores were inferred with a correlation of > 0.90, and individual HLA alleles were inferred with a correlation of > 0.80. We demonstrate a minimal influence on the accuracy of somatic copy number alterations and other tumor features. We showcase the feasibility and utility of our framework by analyzing 25,889 tumors and identifying the relationships between genetic ancestry, polygenic risk, and tumor characteristics that could not be studied with conventional on-target tumor data. Conclusions: We conclude that targeted tumor sequencing can be leveraged to build rich germline research cohorts from existing data and make our analysis pipeline publicly available to facilitate this effort.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13073-021-00999-4.
Affiliation(s)
- Alexander Gusev
- Division of Population Sciences, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA, USA; Division of Genetics, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA; The Broad Institute of MIT & Harvard, Cambridge, MA, USA
- Stefan Groha
- Division of Population Sciences, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA, USA; The Broad Institute of MIT & Harvard, Cambridge, MA, USA
- Kodi Taraszka
- Departments of Neurology and Computational Medicine, UCLA, Los Angeles, CA, USA
- Yevgeniy R Semenov
- Department of Dermatology, Massachusetts General Hospital, Boston, MA, USA
- Noah Zaitlen
- Departments of Neurology and Computational Medicine, UCLA, Los Angeles, CA, USA
|
26
|
O'Connell J, Yun T, Moreno M, Li H, Litterman N, Kolesnikov A, Noblin E, Chang PC, Shastri A, Dorfman EH, Shringarpure S, Auton A, Carroll A, McLean CY. A population-specific reference panel for improved genotype imputation in African Americans. Commun Biol 2021; 4:1269. [PMID: 34741098 PMCID: PMC8571350 DOI: 10.1038/s42003-021-02777-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/15/2021] [Accepted: 10/12/2021] [Indexed: 12/17/2022] Open
Abstract
There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high-quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.
Affiliation(s)
- Helen Li
- Google Health, Cambridge, MA, USA
|
27
|
Reinspection of a Clinical Proteomics Tumor Analysis Consortium (CPTAC) Dataset with Cloud Computing Reveals Abundant Post-Translational Modifications and Protein Sequence Variants. Cancers (Basel) 2021; 13:cancers13205034. [PMID: 34680183 PMCID: PMC8534219 DOI: 10.3390/cancers13205034] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/30/2021] [Revised: 09/14/2021] [Accepted: 10/01/2021] [Indexed: 12/14/2022] Open
Abstract
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has provided some of the most in-depth analyses of the phenotypes of human tumors ever constructed. Today, the majority of proteomic data analysis is still performed using software housed on desktop computers, which limits the number of sequence variants and post-translational modifications (PTMs) that can be considered. The original CPTAC studies limited the search for PTMs to samples that were chemically enriched for those modified peptides. Similarly, the only sequence variants considered were those with strong evidence at the exon or transcript level. In this multi-institutional collaborative reanalysis, we utilized unbiased protein databases containing millions of human sequence variants in conjunction with hundreds of common post-translational modifications. Using these tools, we identified tens of thousands of high-confidence PTMs and sequence variants. We identified 4132 phosphorylated peptides in nonenriched samples, 93% of which were confirmed in the samples that were chemically enriched for phosphopeptides. In addition, our results cover 90% of the high-confidence variants reported by the original proteogenomics study, without the need for sample-specific next-generation sequencing. Finally, we report fivefold more somatic and germline variants that have independent evidence at the peptide level, including mutations in ERBB2 and BCAS1. In this reanalysis of CPTAC proteomic data with cloud computing, we present an openly available and searchable web resource of the highest-coverage proteomic profiling of human tumors described to date.
|
28
|
Irving-Pease EK, Muktupavela R, Dannemann M, Racimo F. Quantitative Human Paleogenetics: What can Ancient DNA Tell us About Complex Trait Evolution? Front Genet 2021; 12:703541. [PMID: 34422004 PMCID: PMC8371751 DOI: 10.3389/fgene.2021.703541] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 04/30/2021] [Accepted: 07/08/2021] [Indexed: 12/13/2022] Open
Abstract
Genetic association data from national biobanks and large-scale association studies have provided new prospects for understanding the genetic evolution of complex traits and diseases in humans. In turn, genomes from ancient human archaeological remains are now easier than ever to obtain, and provide a direct window into changes in frequencies of trait-associated alleles in the past. This has generated a new wave of studies aiming to analyse the genetic component of traits in historic and prehistoric times using ancient DNA, and to determine whether any such traits were subject to natural selection. In humans, however, issues about the portability and robustness of complex trait inference across different populations are particularly concerning when predictions are extended to individuals that died thousands of years ago, and for which little, if any, phenotypic validation is possible. In this review, we discuss the advantages of incorporating ancient genomes into studies of trait-associated variants, the need for models that can better accommodate ancient genomes into quantitative genetic frameworks, and the existing limits to inferences about complex trait evolution, particularly with respect to past populations.
Affiliation(s)
- Evan K. Irving-Pease
- Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- Rasa Muktupavela
- Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- Michael Dannemann
- Center for Genomics, Evolution and Medicine, Institute of Genomics, University of Tartu, Tartu, Estonia
- Fernando Racimo
- Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
|