1
Zheng M, Liao J, Li Z, Xu Z, Jiang Z, Tan L, Fu R, Xu H, Li Z, Zhang X, Nie Q. Evaluation of the selection of key individuals for genotype imputation in Chinese yellow-feathered chicken. Poult Sci 2023; 102:102901. [PMID: 37499612 PMCID: PMC10393784 DOI: 10.1016/j.psj.2023.102901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2023] [Revised: 06/02/2023] [Accepted: 06/24/2023] [Indexed: 07/29/2023]
Abstract
Genotype imputation is a powerful technique employed by next-generation sequencing (NGS) and genotyping arrays, which can significantly enhance the cost-effectiveness and efficiency of genomic selection. The accuracy of imputation is largely determined by the choice of reference panel, with previous studies generally demonstrating that a closely related population as a reference panel leads to greater accuracy than a more distantly related population. Various strategies have been proposed for selecting desirable individuals via targeted resequencing, but their efficiencies need further improvement. In this study, we present a practical broiler selection methodology for a local Chinese chicken line that integrates established methods based on pedigree, genomics, and random sampling, and leverages genotype and pedigree information from the yellow-plumage dwarf chicken line. The efficacy of these selection strategies was assessed by evaluating their ability to accurately impute masked genotypes from data obtained using a 600K chip. Our findings reveal that the pedigree-based method yields superior accuracy in genotype imputation, whereas the haplotype-based method exhibits greater stability. Nonetheless, the impact of these targeted methods for selecting key individuals is slightly different when initiating a new sequencing project in a production context. Overall, this study highlights the advantages of using the pedigree-based approach as the preferred method for optimizing genotype imputation in broiler chickens.
Affiliation(s)
- Ming Zheng
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Jiahao Liao
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Zhuohang Li
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Zhenqiang Xu
- Guangdong Wens Nanfang Poultry Breeding Co., Ltd., Xinxing 527439, China
- Ziqin Jiang
- Guangdong Wens Nanfang Poultry Breeding Co., Ltd., Xinxing 527439, China
- Liangtian Tan
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Rong Fu
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Haiping Xu
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Zhenhui Li
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Xiquan Zhang
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China
- Qinghua Nie
- Lingnan Guangdong Laboratory of Modern Agriculture, College of Animal Science, South China Agricultural University, Guangzhou 510642, Guangdong, China; Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture, Guangzhou 510642, Guangdong, China; State Key Laboratory of Livestock and Poultry Breeding, South China Agricultural University, Guangzhou 510642, Guangdong, China.
2
Ros-Freixedes R, Johnsson M, Whalen A, Chen CY, Valente BD, Herring WO, Gorjanc G, Hickey JM. Genomic prediction with whole-genome sequence data in intensely selected pig lines. Genet Sel Evol 2022; 54:65. [PMID: 36153511 PMCID: PMC9509613 DOI: 10.1186/s12711-022-00756-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 01/28/2022] [Accepted: 09/05/2022] [Indexed: 12/03/2022]
Abstract
Background Early simulations indicated that whole-genome sequence data (WGS) could improve the accuracy of genomic predictions within and across breeds. However, empirical results have been ambiguous so far. Large datasets that capture most of the genomic diversity in a population must be assembled so that allele substitution effects are estimated with high accuracy. The objectives of this study were to use a large pig dataset from seven intensely selected lines to assess the benefits of using WGS for genomic prediction compared to using commercial marker arrays and to identify scenarios in which WGS provides the largest advantage. Methods We sequenced 6931 individuals from seven commercial pig lines with different numerical sizes. Genotypes of 32.8 million variants were imputed for 396,100 individuals (17,224 to 104,661 per line). We used BayesR to perform genomic prediction for eight complex traits. Genomic predictions were performed using either data from a standard marker array or variants preselected from WGS based on association tests. Results The accuracies of genomic predictions based on preselected WGS variants were not robust across traits and lines and the improvements in prediction accuracy that we achieved so far with WGS compared to standard marker arrays were generally small. The most favourable results for WGS were obtained when the largest training sets were available and standard marker arrays were augmented with preselected variants with statistically significant associations to the trait. With this method and training sets of around 80k individuals, the accuracy of within-line genomic predictions was on average improved by 0.025. With multi-line training sets, improvements of 0.04 compared to marker arrays could be expected. Conclusions Our results showed that WGS has limited potential to improve the accuracy of genomic predictions compared to marker arrays in intensely selected pig lines. 
Thus, although we expect that larger improvements in accuracy from the use of WGS are possible with a combination of larger training sets and optimised pipelines for generating and analysing such datasets, the use of WGS in the current implementations of genomic prediction should be carefully evaluated against the cost of large-scale WGS data on a case-by-case basis. Supplementary Information The online version contains supplementary material available at 10.1186/s12711-022-00756-0.
3
Rare and population-specific functional variation across pig lines. Genet Sel Evol 2022; 54:39. [PMID: 35659233 PMCID: PMC9164375 DOI: 10.1186/s12711-022-00732-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/31/2022] [Accepted: 05/17/2022] [Indexed: 01/09/2023]
Abstract
BACKGROUND It is expected that functional, mainly missense and loss-of-function (LOF), and regulatory variants are responsible for most phenotypic differences between breeds and genetic lines of livestock species that have undergone diverse selection histories. However, there is still limited knowledge about the existing missense and LOF variation in commercial livestock populations, in particular regarding population-specific variation and how it can affect applications such as across-breed genomic prediction. METHODS We re-sequenced the whole genome of 7848 individuals from nine commercial pig lines (average sequencing coverage: 4.1×) and imputed whole-genome genotypes for 440,610 pedigree-related individuals. The called variants were categorized according to predicted functional annotation (from LOF to intergenic) and prevalence level (number of lines in which the variant segregated; from private to widespread). Variants in each category were examined in terms of their distribution along the genome, alternative allele frequency, per-site Wright's fixation index (FST), individual load, and association to production traits. RESULTS Of the 46 million called variants, 28% were private (called in only one line) and 21% were widespread (called in all nine lines). Genomic regions with a low recombination rate were enriched with private variants. Low-prevalence variants (called in one or a few lines only) were enriched for lower allele frequencies, lower FST, and putatively functional and regulatory roles (including LOF and deleterious missense variants). On average, individuals carried fewer private deleterious missense alleles than expected compared to alleles with other predicted consequences. Only a small subset of the low-prevalence variants had intermediate allele frequencies and explained small fractions of phenotypic variance (up to 3.2%) of production traits. The significant low-prevalence variants had higher per-site FST than the non-significant ones. 
These associated low-prevalence variants were tagged by other more widespread variants in high linkage disequilibrium, including intergenic variants. CONCLUSIONS Most low-prevalence variants have low minor allele frequencies and only a small subset of low-prevalence variants contributed detectable fractions of phenotypic variance of production traits. Accounting for low-prevalence variants is therefore unlikely to noticeably benefit across-breed analyses, such as the prediction of genomic breeding values in a population using reference populations of a different genetic background.
4
Dauben CM, Große-Brinkhaus C, Heuß EM, Henne H, Tholen E. Comparison of the choice of animals for re-sequencing in two maternal pig lines. Genet Sel Evol 2022; 54:16. [PMID: 35183111 PMCID: PMC8858453 DOI: 10.1186/s12711-022-00706-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Accepted: 01/31/2022] [Indexed: 11/10/2022]
Abstract
Next-generation sequencing is a promising approach for the detection of causal variants within previously identified quantitative trait loci. Because of the costs of re-sequencing experiments, this application is currently mainly restricted to subsets of animals from already genotyped populations. Imputation from a lower to a higher marker density could represent a useful complementary approach. An analysis of the literature shows that several strategies are available to select animals for re-sequencing. This study demonstrates an animal selection workflow under practical conditions. Our approach considers different data sources and limited resources such as budget and availability of sampling material. The workflow combines previously described approaches and makes use of genotype and pedigree information from a Landrace and Large White population. Genotypes were phased and haplotypes were accurately estimated with AlphaPhase. Then, AlphaSeqOpt was used to optimize selection of animals for re-sequencing, reflecting the existing diversity of haplotypes. AlphaSeqOpt and ENDOG were used to select individuals based on pedigree information and by taking into account key animals that represent the genetic diversity of the populations. After the best selection criteria were determined, a subset of 57 animals was selected for subsequent re-sequencing. In order to evaluate and assess the advantage of this procedure, imputation accuracy was assessed by setting a set of single nucleotide polymorphism (SNP) chip genotypes to missing. Accuracy values were compared to those of alternative selection scenarios and the results showed the clear benefits of a targeted selection within this practical-driven approach. Especially imputation of low-frequency markers benefits from the combined approach described here. Accuracy was increased by up to 12% compared to a randomized or exclusively haplotype-based selection of sequencing candidates.
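The masking evaluation described in this abstract (setting a set of SNP chip genotypes to missing, imputing them, and comparing against the known values) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name, the 0/1/2 genotype coding, and the choice of concordance and Pearson correlation as accuracy measures are assumptions, since the abstract does not state which metric was used.

```python
from math import sqrt
from typing import Sequence, Tuple


def imputation_accuracy(true: Sequence[int], imputed: Sequence[int]) -> Tuple[float, float]:
    """Return (concordance, Pearson r) for masked genotypes coded 0/1/2.

    `true` holds the chip genotypes that were set to missing before imputation;
    `imputed` holds the genotypes the imputation software filled in for them.
    Both the name and the metrics are illustrative assumptions.
    """
    n = len(true)
    if n == 0 or n != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")

    # Concordance: share of masked genotypes recovered exactly.
    concordance = sum(t == i for t, i in zip(true, imputed)) / n

    # Pearson correlation between true and imputed genotype codes.
    mean_t = sum(true) / n
    mean_i = sum(imputed) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true, imputed))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    # The correlation is undefined when either vector is constant.
    r = cov / sqrt(var_t * var_i) if var_t > 0 and var_i > 0 else float("nan")
    return concordance, r


# Hypothetical toy data: six masked genotypes, one of them imputed wrongly.
masked_truth = [0, 1, 2, 1, 0, 2]
imputed_calls = [0, 1, 2, 2, 0, 2]
concordance, r = imputation_accuracy(masked_truth, imputed_calls)
```

Correlation is often reported alongside concordance because concordance alone can look high for low-frequency markers even when the rare allele is rarely imputed correctly.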
Affiliation(s)
- Christina M Dauben
- Institute of Animal Science, University of Bonn, Endenicher Allee 15, 53115, Bonn, Germany
- Esther M Heuß
- Institute of Animal Science, University of Bonn, Endenicher Allee 15, 53115, Bonn, Germany
- Hubert Henne
- BHZP GmbH, An der Wassermühle 8, 21368, Dahlenburg-Ellringen, Germany
- Ernst Tholen
- Institute of Animal Science, University of Bonn, Endenicher Allee 15, 53115, Bonn, Germany
5
Cheng H, Xu K, Li J, Abraham KJ. Optimizing Sequencing Resources in Genotyped Livestock Populations Using Linear Programming. Front Genet 2021; 12:740340. [PMID: 34745214 PMCID: PMC8570094 DOI: 10.3389/fgene.2021.740340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2021] [Accepted: 09/20/2021] [Indexed: 11/13/2022]
Abstract
Low-cost genome-wide single-nucleotide polymorphisms (SNPs) are routinely used in animal breeding programs. Compared to SNP arrays, the use of whole-genome sequence data generated by the next-generation sequencing technologies (NGS) has great potential in livestock populations. However, sequencing a large number of animals to exploit the full potential of whole-genome sequence data is not feasible. Thus, novel strategies are required for the allocation of sequencing resources in genotyped livestock populations such that the entire population can be imputed, maximizing the efficiency of whole genome sequencing budgets. We present two applications of linear programming for the efficient allocation of sequencing resources. The first application is to identify the minimum number of animals for sequencing subject to the criterion that each haplotype in the population is contained in at least one of the animals selected for sequencing. The second application is the selection of animals whose haplotypes include the largest possible proportion of common haplotypes present in the population, assuming a limited sequencing budget. Both applications are available in an open source program LPChoose. In both applications, LPChoose has similar or better performance than some other methods suggesting that linear programming methods offer great potential for the efficient allocation of sequencing resources. The utility of these methods can be increased through the development of improved heuristics.
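The first application in this abstract, finding a small set of animals such that every haplotype is carried by at least one selected animal, is a set-cover problem. The sketch below uses a simple greedy heuristic rather than the linear-programming formulation of LPChoose, and it does not use or reproduce the LPChoose interface; the function name, the input layout, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_haplotype_cover(carriers: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick animals until every observed haplotype is covered.

    `carriers` maps an animal ID to the set of haplotype IDs it carries.
    This is a greedy set-cover heuristic (an assumption made for brevity);
    it is not guaranteed to find the minimum number of animals.
    """
    uncovered: Set[str] = set().union(*carriers.values())
    chosen: List[str] = []
    while uncovered:
        # Take the animal covering the most still-uncovered haplotypes;
        # iterating over sorted IDs makes ties resolve to the smallest ID.
        best = max(sorted(carriers), key=lambda a: len(carriers[a] & uncovered))
        chosen.append(best)
        uncovered -= carriers[best]
    return chosen


# Hypothetical toy population: four animals, four haplotypes.
toy = {
    "A": {"h1", "h2"},
    "B": {"h2", "h3"},
    "C": {"h3", "h4"},
    "D": {"h1", "h4"},
}
print(greedy_haplotype_cover(toy))  # ['A', 'C']
```

An integer or linear program, as the abstract describes, can give better guarantees than this greedy pass; the sketch only shows the coverage criterion itself.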
Affiliation(s)
- Hao Cheng
- Department of Animal Science, University of California, Davis, Davis, CA, United States
- Keyu Xu
- Department of Animal Science, University of California, Davis, Davis, CA, United States
- Jinghui Li
- Department of Animal Science, University of California, Davis, Davis, CA, United States
- Kuruvilla Joseph Abraham
- Department of Economics, FEARP, University of São Paulo, Ribeirão Preto, Brazil; Department of Computer Science-ICMC, University of São Paulo, São Carlos, Brazil
6
da Silva ÉDB, Xavier A, Faria MV. Impact of Genomic Prediction Model, Selection Intensity, and Breeding Strategy on the Long-Term Genetic Gain and Genetic Erosion in Soybean Breeding. Front Genet 2021; 12:637133. [PMID: 34539725 PMCID: PMC8440908 DOI: 10.3389/fgene.2021.637133] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/02/2020] [Accepted: 08/05/2021] [Indexed: 11/21/2022]
Abstract
Genomic-assisted breeding has become an important tool in soybean breeding. However, the impact of different genomic selection (GS) approaches on short- and long-term gains is not well understood. Such gains are conditional on the breeding design and may vary with a combination of the prediction model, family size, selection strategies, and selection intensity. To address these open questions, we evaluated various scenarios through a simulated closed soybean breeding program over 200 breeding cycles. Genomic prediction was performed using genomic best linear unbiased prediction (GBLUP), Bayesian methods, and random forest, benchmarked against selection on phenotypic values, true breeding values (TBV), and random selection. Breeding strategies included selections within family (WF), across family (AF), and within pre-selected families (WPSF), with selection intensities of 2.5, 5.0, 7.5, and 10.0%. Selections were performed at the F4 generation, where individuals were phenotyped and genotyped with a 6K single nucleotide polymorphism (SNP) array. Initial genetic parameters for the simulation were estimated from the SoyNAM population. WF selections provided the most significant long-term genetic gains. GBLUP and Bayesian methods outperformed random forest and provided most of the genetic gains within the first 100 generations, being outperformed by phenotypic selection after generation 100. All methods provided similar performances under WPSF selections. A faster decay in genetic variance was observed when individuals were selected AF and WPSF, as 80% of the genetic variance was depleted within 28-58 cycles, whereas WF selections preserved the variance up to cycle 184. Surprisingly, the selection intensity had less impact on long-term gains than did the breeding strategies. The study supports that genetic gains can be optimized in the long term with specific combinations of prediction models, family size, selection strategies, and selection intensity. 
A combination of strategies may be necessary for balancing the short-, medium-, and long-term genetic gains in breeding programs while preserving the genetic variance.
Affiliation(s)
- Alencar Xavier
- Department of Biostatistics, Corteva Agriscience, Johnston, IA, United States
- Department of Agronomy, Purdue University, West Lafayette, IN, United States
- Marcos Ventura Faria
- Department of Agronomy, Universidade Estadual do Centro-Oeste, Guarapuava, Brazil
7
Akdemir D, Rio S, Isidro y Sánchez J. TrainSel: An R Package for Selection of Training Populations. Front Genet 2021; 12:655287. [PMID: 34025720 PMCID: PMC8138169 DOI: 10.3389/fgene.2021.655287] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/18/2021] [Accepted: 03/31/2021] [Indexed: 01/01/2023]
Abstract
A major barrier to the wider use of supervised learning in emerging applications, such as genomic selection, is the lack of sufficient and representative labeled data to train prediction models. The amount and quality of labeled training data in many applications is usually limited and therefore careful selection of the training examples to be labeled can be useful for improving the accuracies in predictive learning tasks. In this paper, we present an R package, TrainSel, which provides flexible, efficient, and easy-to-use tools that can be used for the selection of training populations (STP). We illustrate its use, performance, and potentials in four different supervised learning applications within and outside of the plant breeding area.
Affiliation(s)
- Deniz Akdemir
- Agriculture & Food Science Centre, Animal and Crop Science Division, University College Dublin, Dublin, Ireland
- Simon Rio
- Centro de Biotecnologia y Genómica de Plantas (CBGP, UPM-INIA), Instituto Nacional de Investigación y Tecnologia Agraria y Alimentaria (INIA), Universidad Politécnica de Madrid (UPM), Madrid, Spain
- Julio Isidro y Sánchez
- Centro de Biotecnologia y Genómica de Plantas (CBGP, UPM-INIA), Instituto Nacional de Investigación y Tecnologia Agraria y Alimentaria (INIA), Universidad Politécnica de Madrid (UPM), Madrid, Spain
8
Nosková A, Bhati M, Kadri NK, Crysnanto D, Neuenschwander S, Hofer A, Pausch H. Characterization of a haplotype-reference panel for genotyping by low-pass sequencing in Swiss Large White pigs. BMC Genomics 2021; 22:290. [PMID: 33882824 PMCID: PMC8061004 DOI: 10.1186/s12864-021-07610-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/11/2021] [Accepted: 04/13/2021] [Indexed: 12/30/2022]
Abstract
BACKGROUND The key-ancestor approach has been frequently applied to prioritize individuals for whole-genome sequencing based on their marginal genetic contribution to current populations. Using this approach, we selected 70 key ancestors from two lines of the Swiss Large White breed that have been selected divergently for fertility and fattening traits and sequenced their genomes with short paired-end reads. RESULTS Using pedigree records, we estimated the effective population size of the dam and sire line to 72 and 44, respectively. In order to assess sequence variation in both lines, we sequenced the genomes of 70 boars at an average coverage of 16.69-fold. The boars explained 87.95 and 95.35% of the genetic diversity of the breeding populations of the dam and sire line, respectively. Reference-guided variant discovery using the GATK revealed 26,862,369 polymorphic sites. Principal component, admixture and fixation index (FST) analyses indicated considerable genetic differentiation between the lines. Genomic inbreeding quantified using runs of homozygosity was higher in the sire than dam line (0.28 vs 0.26). Using two complementary approaches, we detected 51 signatures of selection. However, only six signatures of selection overlapped between both lines. We used the sequenced haplotypes of the 70 key ancestors as a reference panel to call 22,618,811 genotypes in 175 pigs that had been sequenced at very low coverage (1.11-fold) using the GLIMPSE software. The genotype concordance, non-reference sensitivity and non-reference discrepancy between thus inferred and Illumina PorcineSNP60 BeadChip-called genotypes was 97.60, 98.73 and 3.24%, respectively. The low-pass sequencing-derived genomic relationship coefficients were highly correlated (r > 0.99) with those obtained from microarray genotyping. CONCLUSIONS We assessed genetic diversity within and between two lines of the Swiss Large White pig breed. 
Our analyses revealed considerable differentiation, even though the split into two populations occurred only a few generations ago. The sequenced haplotypes of the key ancestor animals enabled us to implement genotyping by low-pass sequencing, which offers an intriguing cost-effective approach to increase the variant density over current array-based genotyping by more than 350-fold.
Affiliation(s)
- Adéla Nosková
- Animal Genomics, ETH Zürich, Eschikon 27, 8315, Lindau, Switzerland.
- Meenu Bhati
- Animal Genomics, ETH Zürich, Eschikon 27, 8315, Lindau, Switzerland
- Danang Crysnanto
- Animal Genomics, ETH Zürich, Eschikon 27, 8315, Lindau, Switzerland
- Hubert Pausch
- Animal Genomics, ETH Zürich, Eschikon 27, 8315, Lindau, Switzerland
9
Ros-Freixedes R, Whalen A, Gorjanc G, Mileham AJ, Hickey JM. Evaluation of sequencing strategies for whole-genome imputation with hybrid peeling. Genet Sel Evol 2020; 52:18. [PMID: 32248818 PMCID: PMC7132986 DOI: 10.1186/s12711-020-00537-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/29/2019] [Accepted: 03/27/2020] [Indexed: 11/26/2022]
Abstract
BACKGROUND For assembling large whole-genome sequence datasets for routine use in research and breeding, the sequencing strategy should be adapted to the methods that will be used later for variant discovery and imputation. In this study, we used simulation to explore the impact that the sequencing strategy and level of sequencing investment have on the overall accuracy of imputation using hybrid peeling, a pedigree-based imputation method that is well suited for large livestock populations. METHODS We simulated marker array and whole-genome sequence data for 15 populations with simulated or real pedigrees that had different structures. In these populations, we evaluated the effect on imputation accuracy of seven methods for selecting which individuals to sequence, the generation of the pedigree to which the sequenced individuals belonged, the use of variable or uniform coverage, and the trade-off between the number of sequenced individuals and their sequencing coverage. For each population, we considered four levels of investment in sequencing that were proportional to the size of the population. RESULTS Imputation accuracy depended greatly on pedigree depth. The distribution of the sequenced individuals across the generations of the pedigree underlay the performance of the different methods used to select individuals to sequence and it was critical for achieving high imputation accuracy in both early and late generations. Imputation accuracy was highest with a uniform coverage across the sequenced individuals of 2× rather than variable coverage. An investment equivalent to the cost of sequencing 2% of the population at 2× provided high imputation accuracy. The gain in imputation accuracy from additional investment decreased with larger populations and higher levels of investment. However, to achieve the same imputation accuracy, a proportionally greater investment must be used in the smaller populations compared to the larger ones. 
CONCLUSIONS Suitable sequencing strategies for subsequent imputation with hybrid peeling involve sequencing ~2% of the population at a uniform coverage 2×, distributed preferably across all generations of the pedigree, except for the few earliest generations that lack genotyped ancestors. Such sequencing strategies are beneficial for generating whole-genome sequence data in populations with deep pedigrees of closely related individuals.
Affiliation(s)
- Roger Ros-Freixedes
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Departament de Ciència Animal, Universitat de Lleida-Agrotecnio Center, Lleida, Spain
- Andrew Whalen
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- John M. Hickey
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
10
Ros-Freixedes R, Whalen A, Chen CY, Gorjanc G, Herring WO, Mileham AJ, Hickey JM. Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations. Genet Sel Evol 2020; 52:17. [PMID: 32248811 PMCID: PMC7132992 DOI: 10.1186/s12711-020-00536-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 09/16/2019] [Accepted: 03/27/2020] [Indexed: 12/30/2022]
Abstract
BACKGROUND The coupling of appropriate sequencing strategies and imputation methods is critical for assembling large whole-genome sequence datasets from livestock populations for research and breeding. In this paper, we describe and validate the coupling of a sequencing strategy with the imputation method hybrid peeling in real animal breeding settings. METHODS We used data from four pig populations of different size (18,349 to 107,815 individuals) that were widely genotyped at densities between 15,000 and 75,000 markers genome-wide. Around 2% of the individuals in each population were sequenced (most of them at 1× or 2× and 37-92 individuals per population, totalling 284, at 15-30×). We imputed whole-genome sequence data with hybrid peeling. We evaluated the imputation accuracy by removing the sequence data of the 284 individuals with high coverage, using a leave-one-out design. We simulated data that mimicked the sequencing strategy used in the real populations to quantify the factors that affected the individual-wise and variant-wise imputation accuracies using regression trees. RESULTS Imputation accuracy was high for the majority of individuals in all four populations (median individual-wise dosage correlation: 0.97). Imputation accuracy was lower for individuals in the earliest generations of each population than for the rest, due to the lack of marker array data for themselves and their ancestors. The main factors that determined the individual-wise imputation accuracy were the genotyping status, the availability of marker array data for immediate ancestors, and the degree of connectedness to the rest of the population, but sequencing coverage of the relatives had no effect. The main factors that determined variant-wise imputation accuracy were the minor allele frequency and the number of individuals with sequencing coverage at each variant site. Results were validated with the empirical observations. 
CONCLUSIONS We demonstrate that the coupling of an appropriate sequencing strategy and hybrid peeling is a powerful strategy for generating whole-genome sequence data with high accuracy in large pedigreed populations where only a small fraction of individuals (2%) had been sequenced, mostly at low coverage. This is a critical step for the successful implementation of whole-genome sequence data for genomic prediction and fine-mapping of causal variants.
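The abstract above reports accuracy as individual-wise and variant-wise dosage correlations. The following is a minimal sketch of how such correlations could be computed, assuming a small matrix of true genotypes (rows are individuals, columns are variants) and a matching matrix of imputed dosages. The function names and the toy data are hypothetical and are not taken from the paper or from AlphaPeel.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined without variation
    return sxy / sqrt(sxx * syy)


def individual_wise_accuracy(true_geno, imputed_dosage):
    """One correlation per individual, computed across that individual's variants."""
    return [pearson(t, d) for t, d in zip(true_geno, imputed_dosage)]


def variant_wise_accuracy(true_geno, imputed_dosage):
    """One correlation per variant, computed across individuals."""
    true_cols = list(zip(*true_geno))
    dose_cols = list(zip(*imputed_dosage))
    return [pearson(t, d) for t, d in zip(true_cols, dose_cols)]


# Hypothetical toy data: 3 individuals x 4 variants, genotypes coded 0/1/2.
true_geno = [[0, 1, 2, 1], [2, 1, 0, 0], [1, 1, 2, 0]]
imputed_dosage = [[0.1, 0.9, 1.8, 1.2], [1.9, 1.1, 0.2, 0.0], [1.0, 1.0, 2.0, 0.0]]

print(individual_wise_accuracy(true_geno, imputed_dosage))
print(variant_wise_accuracy(true_geno, imputed_dosage))
```

In this toy data the second variant is monomorphic among the three individuals, so its variant-wise correlation is undefined and is returned as `None`. That is one simple way to see why the minor allele frequency matters for variant-wise accuracy.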
Affiliation(s)
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
  - Departament de Ciència Animal, Universitat de Lleida-Agrotecnio Center, Lleida, Spain
- Andrew Whalen
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Ching-Yi Chen
  - The Pig Improvement Company, Genus plc, 100 Bluegrass Commons Blvd Ste 2200, Hendersonville, TN 37075 USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- William O. Herring
  - The Pig Improvement Company, Genus plc, 100 Bluegrass Commons Blvd Ste 2200, Hendersonville, TN 37075 USA
- John M. Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
11
Butty AM, Sargolzaei M, Miglior F, Stothard P, Schenkel FS, Gredler-Grandl B, Baes CF. Optimizing Selection of the Reference Population for Genotype Imputation From Array to Sequence Variants. Front Genet 2019; 10:510. [PMID: 31214246 PMCID: PMC6554347 DOI: 10.3389/fgene.2019.00510] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/15/2019] [Accepted: 05/10/2019] [Indexed: 11/29/2022]
Abstract
Imputation of high-density genotypes to whole-genome sequences (WGS) is a cost-effective method to increase the density of available markers within a population. Imputed genotypes have been successfully used for genomic selection and discovery of variants associated with traits of interest for the population. To allow for the use of imputed genotypes for genomic analyses, accuracy of imputation must be high. Accuracy of imputation is influenced by multiple factors, such as size and composition of the reference group, and the allele frequency of variants included. Understanding the use of imputed WGSs prior to the generation of the reference population is important, as accurate imputation might be more focused, for instance, on common or on rare variants. The aim of this study was to present and evaluate new methods to select animals for sequencing relying on a previously genotyped population. The Genetic Diversity Index method optimizes the number of unique haplotypes in the future reference population, while the Highly Segregating Haplotype selection method targets haplotype alleles found throughout the majority of the population of interest. First the WGSs of a dairy cattle population were simulated. The simulated sequences mimicked the linkage disequilibrium level and the variants’ frequency distribution observed in currently available Holstein sequences. Then, reference populations of different sizes, in which animals were selected using both novel methods proposed here as well as two other methods presented in previous studies, were created. Finally, accuracies of imputation obtained with different reference populations were compared against each other. The novel methods were found to have overall accuracies of imputation of more than 0.85. Accuracies of imputation of rare variants reached values above 0.50. 
In conclusion, if imputed sequences are to be used for discovery of novel associations between variants and traits of interest in the population, animals carrying novel information should be selected and, consequently, the Genetic Diversity Index method proposed here may be used. If sequences are to be used to impute the overall genotyped population, a reference population consisting of common haplotypes carriers selected using the proposed Highly Segregating Haplotype method is recommended.
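The two selection goals described above (maximizing unique haplotypes versus targeting carriers of common haplotypes) can be contrasted with a small sketch. This is not the authors' Genetic Diversity Index or Highly Segregating Haplotype implementation. It is a simplified illustration that assumes a single genomic window, phased haplotype alleles labelled with hypothetical identifiers, and two hypothetical helper functions.

```python
from collections import Counter


def select_max_unique(haplotypes, n_select):
    """Greedy sketch: repeatedly add the animal carrying the most haplotype alleles not yet covered."""
    covered, chosen = set(), []
    candidates = dict(haplotypes)
    for _ in range(min(n_select, len(candidates))):
        # Ties are broken by animal name because candidates are scanned in sorted order.
        best = max(sorted(candidates), key=lambda a: len(set(candidates[a]) - covered))
        chosen.append(best)
        covered |= set(candidates.pop(best))
    return chosen, covered


def select_common_carriers(haplotypes, n_select):
    """Rank animals by the summed population frequency of the haplotype alleles they carry."""
    counts = Counter(h for haps in haplotypes.values() for h in set(haps))
    n_animals = len(haplotypes)
    score = {
        animal: sum(counts[h] / n_animals for h in set(haps))
        for animal, haps in haplotypes.items()
    }
    ranked = sorted(haplotypes, key=lambda a: (-score[a], a))
    return ranked[:n_select]


# Hypothetical toy data: four animals, two haplotype alleles each, one window.
haplotypes = {
    "A": ["h1", "h2"],
    "B": ["h1", "h2"],
    "C": ["h1", "h3"],
    "D": ["h4", "h5"],
}

unique_set, covered = select_max_unique(haplotypes, 2)
common_set = select_common_carriers(haplotypes, 2)
print(unique_set, sorted(covered))
print(common_set)
```

With this toy data the diversity-oriented rule picks animals A and D, covering four distinct haplotype alleles, while the frequency-oriented rule picks A and B, which both carry the most common alleles. A genome-wide method would need to combine scores across many windows, which this sketch does not attempt.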
Affiliation(s)
- Adrien M Butty
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- Mehdi Sargolzaei
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
  - Select Sires Inc., Plain City, OH, United States
- Filippo Miglior
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- Paul Stothard
  - Department of Agricultural, Food & Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Flavio S Schenkel
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- Birgit Gredler-Grandl
  - Qualitas AG, Zug, Switzerland
  - Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Wageningen, Netherlands
- Christine F Baes
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
  - Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, Switzerland
12
Johnsson M, Ros-Freixedes R, Gorjanc G, Campbell MA, Naswa S, Kelly K, Lightner J, Rounsley S, Hickey JM. Sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs. Genet Sel Evol 2018; 50:69. [PMID: 30572815 PMCID: PMC6302423 DOI: 10.1186/s12711-018-0440-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/22/2018] [Accepted: 12/16/2018] [Indexed: 11/24/2022]
Abstract
Background In this work, we investigated sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs. A functional CD163 protein is required for infection by porcine reproductive and respiratory syndrome virus, which is a serious pathogen with major impacts on pig production.
Results We used targeted pooled sequencing of the exons of CD163 to detect sequence variants in 35,000 pigs of diverse genetic backgrounds and to search for potential stop-gain and frameshift indel variants. Then, we used whole-genome sequence data from three pig lines to calculate: a variant intolerance score that measures the tolerance of genes to protein coding variation; an estimate of selection on protein-coding variation over evolutionary time; and haplotype diversity statistics to detect recent selective sweeps during breeding.
Conclusions Using a deep survey of sequence variation in the CD163 gene in domestic pigs, we found no potential knockout variants. The CD163 gene was moderately intolerant to variation and showed evidence of positive selection in the pig lineage, but no evidence of recent selective sweeps during breeding.
Affiliation(s)
- Martin Johnsson
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, EH25 9RG, Scotland, UK
  - Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, 750 07, Uppsala, Sweden
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, EH25 9RG, Scotland, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, EH25 9RG, Scotland, UK
- Sudhir Naswa
  - Genus plc, 1525 River Road, DeForest, WI, 53532, USA
- John M Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, EH25 9RG, Scotland, UK
13
Whalen A, Ros-Freixedes R, Wilson DL, Gorjanc G, Hickey JM. Hybrid peeling for fast and accurate calling, phasing, and imputation with sequence data of any coverage in pedigrees. Genet Sel Evol 2018; 50:67. [PMID: 30563452 PMCID: PMC6299538 DOI: 10.1186/s12711-018-0438-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/03/2018] [Accepted: 12/11/2018] [Indexed: 12/31/2022]
Abstract
BACKGROUND In this paper, we extend multi-locus iterative peeling to provide a computationally efficient method for calling, phasing, and imputing sequence data of any coverage in small or large pedigrees. Our method, called hybrid peeling, uses multi-locus iterative peeling to estimate shared chromosome segments between parents and their offspring at a subset of loci, and then uses single-locus iterative peeling to aggregate genomic information across multiple generations at the remaining loci.
RESULTS Using a synthetic dataset, we first analysed the performance of hybrid peeling for calling and phasing genotypes in disconnected families, which contained only a focal individual and its parents and grandparents. Second, we analysed the performance of hybrid peeling for calling and phasing genotypes in the context of a full general pedigree. Third, we analysed the performance of hybrid peeling for imputing whole-genome sequence data to non-sequenced individuals in the population. We found that hybrid peeling substantially increased the number of called and phased genotypes by leveraging sequence information on related individuals. The calling rate and accuracy increased when the full pedigree was used compared to a reduced pedigree of just parents and grandparents. Finally, hybrid peeling imputed accurately whole-genome sequence to non-sequenced individuals.
CONCLUSIONS We believe that this algorithm will enable the generation of low cost and high accuracy whole-genome sequence data in many pedigreed populations. We make this algorithm available as a standalone program called AlphaPeel.
Affiliation(s)
- Andrew Whalen
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- David L. Wilson
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- John M. Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
14
Ros-Freixedes R, Battagin M, Johnsson M, Gorjanc G, Mileham AJ, Rounsley SD, Hickey JM. Impact of index hopping and bias towards the reference allele on accuracy of genotype calls from low-coverage sequencing. Genet Sel Evol 2018; 50:64. [PMID: 30545283 PMCID: PMC6293637 DOI: 10.1186/s12711-018-0436-4] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 06/28/2018] [Accepted: 11/30/2018] [Indexed: 12/17/2022]
Abstract
BACKGROUND Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and many standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing, there is a need to understand the impact of these errors and bias on resulting genotype calls from low-coverage sequencing.
RESULTS We used a dataset of 26 pigs sequenced both at 2× with multiplexing and at 30× without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, which is a default and desired step of some variant callers for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage sequence data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points.
CONCLUSIONS We propose a simple pipeline to correct the preferential bias towards the reference allele that can occur during variant discovery and we recommend that users of low-coverage sequence data be wary of unexpected biases that may be produced by bioinformatic tools that were designed for high-coverage sequence data.
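The effect described above, where a minimum-read threshold on alternative alleles pushes low-coverage calls towards the reference, can be illustrated with a toy model. The sketch below is not the logic of any real variant caller and not the authors' pipeline. The calling rule, the `min_alt_reads` parameter, and the read counts are all hypothetical and chosen only to show the direction of the bias.

```python
def call_genotype(ref_reads, alt_reads, min_alt_reads=1):
    """Toy best-guess call from read counts; alt alleles with too few reads are pruned."""
    if alt_reads < min_alt_reads:
        alt_reads = 0  # pruning step: weakly supported alternative allele is discarded
    if ref_reads == 0 and alt_reads == 0:
        return None  # no usable reads left at this site
    if alt_reads == 0:
        return 0  # homozygous reference
    if ref_reads == 0:
        return 2  # homozygous alternative
    return 1  # heterozygous


def concordance(truth, calls):
    """Fraction of called sites whose best-guess genotype matches the truth."""
    pairs = [(t, c) for t, c in zip(truth, calls) if c is not None]
    return sum(t == c for t, c in pairs) / len(pairs)


# Hypothetical 2x-like read counts (ref, alt) and "true" genotypes from high coverage.
reads = [(2, 0), (1, 1), (0, 2), (1, 1), (0, 1), (2, 0)]
truth = [0, 1, 2, 1, 2, 0]

no_pruning = [call_genotype(r, a, min_alt_reads=1) for r, a in reads]
with_pruning = [call_genotype(r, a, min_alt_reads=2) for r, a in reads]

print(concordance(truth, no_pruning))
print(concordance(truth, with_pruning))
```

In this toy data every discordant call produced by the stricter threshold is a heterozygote called as homozygous reference, so all of the lost concordance falls on the reference side. The size of the drop here is an artefact of the made-up counts and should not be compared with the 19.0 percentage points reported in the abstract.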
Affiliation(s)
- Roger Ros-Freixedes
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland UK
- Mara Battagin
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland UK
- Martin Johnsson
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland UK
  - Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, 750 07 Uppsala, Sweden
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland UK
- John M. Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland UK
15
A method for allocating low-coverage sequencing resources by targeting haplotypes rather than individuals. Genet Sel Evol 2017; 49:78. [PMID: 29070022 PMCID: PMC5655873 DOI: 10.1186/s12711-017-0353-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 04/03/2017] [Accepted: 10/18/2017] [Indexed: 11/25/2022]
Abstract
Background This paper describes a heuristic method for allocating low-coverage sequencing resources by targeting haplotypes rather than individuals. Low-coverage sequencing assembles high-coverage sequence information for every individual by accumulating data from the genome segments that they share with many other individuals into consensus haplotypes. Deriving the consensus haplotypes accurately is critical for achieving a high phasing and imputation accuracy. In order to enable accurate phasing and imputation of sequence information for the whole population, we allocate the available sequencing resources among individuals with existing phased genomic data by targeting the sequencing coverage of their haplotypes.
Results Our method, called AlphaSeqOpt, prioritizes haplotypes using a score function that is based on the frequency of the haplotypes in the sequencing set relative to the target coverage. AlphaSeqOpt has two steps: (1) selection of an initial set of individuals by iteratively choosing the individuals that have the maximum score conditional on the current set, and (2) refinement of the set through several rounds of exchanges of individuals. AlphaSeqOpt is very effective for distributing a fixed amount of sequencing resources evenly across haplotypes, which results in a reduction of the proportion of haplotypes that are sequenced below the target coverage. AlphaSeqOpt can provide a greater proportion of haplotypes sequenced at the target coverage by sequencing less individuals, as compared with other methods that use a score function based on haplotype frequencies in the population. A refinement of the initially selected set can provide a larger more diverse set with more unique individuals, which is beneficial in the context of low-coverage sequencing. We extend the method with an approach for filtering rare haplotypes based on their flanking haplotypes, so that only those that are likely to derive from a recombination event are targeted.
Conclusions We present a method for allocating sequencing resources so that a greater proportion of haplotypes are sequenced at a coverage that is sufficiently high for population-based imputation with low-coverage sequencing. The haplotype score function, the refinement step, and the new approach for filtering rare haplotypes make AlphaSeqOpt more effective for that purpose than previously reported methods for reducing sequencing redundancy.
Electronic supplementary material The online version of this article (doi:10.1186/s12711-017-0353-y) contains supplementary material, which is available to authorized users.
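The two-step structure described above (greedy selection conditional on the current set, followed by rounds of exchanges) can be sketched as follows. This is not AlphaSeqOpt and does not reproduce its score function. It assumes a hypothetical saturating objective in which each selected individual adds one unit of coverage to each haplotype it carries, and coverage beyond `target` earns nothing. All names and data are illustrative.

```python
from collections import Counter
from itertools import product


def objective(selected, haplotypes, target):
    """Hypothetical score: haplotype coverage summed over haplotypes, capped at the target."""
    coverage = Counter(h for ind in selected for h in haplotypes[ind])
    return sum(min(c, target) for c in coverage.values())


def greedy_select(haplotypes, n_select, target):
    """Step 1: add the individual with the best score conditional on the current set."""
    selected, pool = [], sorted(haplotypes)
    while len(selected) < n_select and pool:
        best = max(pool, key=lambda ind: objective(selected + [ind], haplotypes, target))
        selected.append(best)
        pool.remove(best)
    return selected


def refine(selected, haplotypes, target, rounds=3):
    """Step 2: exchange a selected individual for an unselected one when the score improves."""
    selected = list(selected)
    for _ in range(rounds):
        improved = False
        outside = [ind for ind in sorted(haplotypes) if ind not in selected]
        for pos, cand in product(range(len(selected)), outside):
            if cand in selected:
                continue  # already swapped in earlier in this round
            trial = selected[:pos] + [cand] + selected[pos + 1:]
            if objective(trial, haplotypes, target) > objective(selected, haplotypes, target):
                selected, improved = trial, True
        if not improved:
            break
    return selected


# Hypothetical toy data: three individuals, two haplotypes each, target coverage 1.
haplotypes = {"A": ("h2", "h3"), "B": ("h1", "h2"), "C": ("h3", "h4")}

initial = greedy_select(haplotypes, n_select=2, target=1)
final = refine(initial, haplotypes, target=1)
print(initial, objective(initial, haplotypes, 1))
print(final, objective(final, haplotypes, 1))
```

In this toy case the greedy step commits to individual A first and ends with three haplotypes covered, while a single exchange of A for C covers all four. That is the kind of improvement a refinement step is meant to find, although the real method's score and exchange rules may differ.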