1
Dallaire X, Bouchard R, Hénault P, Ulmo-Diaz G, Normandeau E, Mérot C, Bernatchez L, Moore JS. Widespread Deviant Patterns of Heterozygosity in Whole-Genome Sequencing Due to Autopolyploidy, Repeated Elements, and Duplication. Genome Biol Evol 2023; 15:evad229. [PMID: 38085037; PMCID: PMC10752349; DOI: 10.1093/gbe/evad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/30/2023] [Indexed: 12/28/2023] Open
Abstract
Most population genomic tools rely on accurate single nucleotide polymorphism (SNP) calling and filtering to meet their underlying assumptions. However, genomic complexity, resulting from structural variants, paralogous sequences, and repetitive elements, presents significant challenges in assembling contiguous reference genomes. Consequently, short-read resequencing studies can encounter mismapping issues, leading to SNPs that deviate from Mendelian expected patterns of heterozygosity and allelic ratio. In this study, we employed the ngsParalog software to identify such deviant SNPs in whole-genome sequencing (WGS) data with low (1.5×) to intermediate (4.8×) coverage for four species: Arctic Char (Salvelinus alpinus), Lake Whitefish (Coregonus clupeaformis), Atlantic Salmon (Salmo salar), and the American Eel (Anguilla rostrata). The analyses revealed that deviant SNPs accounted for 22% to 62% of all SNPs in salmonid datasets and approximately 11% in the American Eel dataset. These deviant SNPs were particularly concentrated within repetitive elements and genomic regions that had recently undergone rediploidization in salmonids. Additionally, narrow peaks of elevated coverage were ubiquitous along all four reference genomes, encompassed most deviant SNPs, and could be partially associated with transposons and tandem repeats. Including these deviant SNPs in genomic analyses led to highly distorted site frequency spectra, underestimated pairwise FST values, and overestimated nucleotide diversity. Considering the widespread occurrence of deviant SNPs arising from a variety of sources, their important impact in estimating population parameters, and the availability of effective tools to identify them, we propose that excluding deviant SNPs from WGS datasets is required to improve genomic inferences for a wide range of taxa and sequencing depths.
Affiliation(s)
- Xavier Dallaire
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Centre d'Études Nordiques, Université Laval, Québec, Canada
- Raphael Bouchard
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
- Philippe Hénault
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
- Gabriela Ulmo-Diaz
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
- Eric Normandeau
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
  - Plateforme de bio-informatique de l’IBIS, Université Laval, Québec, Canada
- Claire Mérot
  - CNRS, UMR 6553 ECOBIO, Université de Rennes, Rennes, France
- Louis Bernatchez
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
- Jean-Sébastien Moore
  - Institut de biologie intégrative et des systèmes, Université Laval, Québec, Canada
  - Centre d'Études Nordiques, Université Laval, Québec, Canada
  - Ressources Aquatique Québec, Université de Rimouski, Rimouski, Canada
2
Karunarathne P, Zhou Q, Schliep K, Milesi P. A comprehensive framework for detecting copy number variants from single nucleotide polymorphism data: 'rCNV', a versatile R package for paralogue and CNV detection. Mol Ecol Resour 2023; 23:1772-1789. [PMID: 37515483; DOI: 10.1111/1755-0998.13843] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/17/2022] [Revised: 07/04/2023] [Accepted: 07/07/2023] [Indexed: 07/31/2023]
Abstract
Recent studies have highlighted the significant role of copy number variants (CNVs) in phenotypic diversity, environmental adaptation and species divergence across eukaryotes. The presence of CNVs also has the potential to introduce genotyping biases, which can pose challenges to accurate population and quantitative genetic analyses. However, detecting CNVs in genomes, particularly in non-model organisms, presents a formidable challenge. To address this issue, we have developed a statistical framework and an accompanying R software package that leverage allelic-read depth from single nucleotide polymorphism (SNP) data for accurate CNV detection. Our framework capitalises on two key principles. First, it exploits the distribution of allelic-read depth ratios in heterozygotes for individual SNPs by comparing it against an expected distribution based on binomial sampling. Second, it identifies SNPs exhibiting an apparent excess of heterozygotes under Hardy-Weinberg equilibrium. By employing multiple statistical tests, our method not only enhances sensitivity to sampling effects but also effectively addresses reference biases, resulting in optimised SNP classification. Our framework is compatible with various NGS technologies (e.g. RADseq, Exome-capture). This versatility enables CNV calling from genomes of diverse complexities. To streamline the analysis process, we have implemented our framework in the user-friendly R package 'rCNV', which automates the entire workflow seamlessly. We trained our models using simulated data and validated their performance on four datasets derived from different sequencing technologies, including RADseq (Chinook salmon-Oncorhynchus tshawytscha), Rapture (American lobster-Homarus americanus), Exome-capture (Norway spruce-Picea abies) and WGS (Malaria mosquito-Anopheles gambiae).
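The first principle named in this abstract, comparing allelic read-depth ratios in heterozygotes against a binomial expectation, can be illustrated with a minimal Python sketch. It is not code from the rCNV package: the function name `allele_ratio_pvalue`, the expected ratio of 0.5 for a single-copy diploid locus, and the example read counts are all assumptions made for illustration.

```python
from math import comb


def allele_ratio_pvalue(alt_reads, depth, expected_ratio=0.5):
    """Exact two-sided binomial p-value for the alternate-allele read count
    of one heterozygote (hypothetical helper, not part of rCNV).

    Assumed model: at a single-copy locus, alt_reads ~ Binomial(depth, 0.5).
    The p-value sums the probabilities of all outcomes that are no more
    likely than the observed one.
    """
    pmf = [
        comb(depth, k) * expected_ratio**k * (1 - expected_ratio) ** (depth - k)
        for k in range(depth + 1)
    ]
    observed = pmf[alt_reads]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Illustrative read counts (alt_reads, depth), not data from the paper.
balanced = allele_ratio_pvalue(10, 20)  # allelic ratio 0.50
skewed = allele_ratio_pvalue(3, 20)     # allelic ratio 0.15
print(f"balanced: {balanced:.3f}, skewed: {skewed:.4f}")
```

A small p-value for a heterozygote flags an allelic ratio that is unlikely under single-copy binomial sampling. The abstract describes combining several statistical tests across individuals; this sketch shows only one per-individual comparison.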
Affiliation(s)
- Piyal Karunarathne
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden
  - Science for Life Laboratory (SciLifeLab), Uppsala, Sweden
  - Institute of Population Genetics, Heinrich Heine University, Düsseldorf, Germany
- Qiujie Zhou
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden
  - Science for Life Laboratory (SciLifeLab), Uppsala, Sweden
- Klaus Schliep
  - Institute of Computational Biotechnology, Graz University of Technology, Graz, Austria
- Pascal Milesi
  - Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden
  - Science for Life Laboratory (SciLifeLab), Uppsala, Sweden
3
Piertney SB, Wenzel M, Jamieson AJ. Large effective population size masks population genetic structure in Hirondellea amphipods within the deepest marine ecosystem, the Mariana Trench. Mol Ecol 2023; 32:2206-2218. [PMID: 36808786; DOI: 10.1111/mec.16887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2022] [Revised: 02/07/2023] [Accepted: 02/09/2023] [Indexed: 02/20/2023]
Abstract
The examination of genetic structure in the deep-ocean hadal zone has focused on divergence between tectonic trenches to understand how environment and geography may drive species divergence and promote endemism. There has been little attempt to examine localized genetic structure within trenches, partly because of logistical challenges associated with sampling at an appropriate scale, and the large effective population sizes of species that can be sampled adequately may mask underlying genetic structure. Here we examine genetic structure in the superabundant amphipod Hirondellea gigas in the Mariana Trench at depths of 8126-10,545 m. RAD sequencing was used to identify 3182 loci containing 43,408 single nucleotide polymorphisms (SNPs) across individuals after stringent pruning of loci to prevent paralogous multicopy genomic regions being erroneously merged. Principal components analysis of SNP genotypes resolved no genetic structure between sampling locations, consistent with a signature of panmixia. However, discriminant analysis of principal components identified divergence between all sites driven by 301 outlier SNPs in 169 loci and significantly associated with latitude and depth. Functional annotation of loci identified differences between singleton loci used in analysis and paralogous loci pruned from the data set and also between outlier and nonoutlier loci, all consistent with hypotheses explaining the role of transposable elements driving genome dynamics. This study challenges the traditional perspective that highly abundant amphipods within a trench form a single panmictic population. We discuss the findings in relation to eco-evolutionary and ontogenetic processes operating in the deep sea, and highlight key challenges associated with population genetic analysis in nonmodel systems with inherent large effective population sizes and genomes.
Affiliation(s)
- Marius Wenzel
  - School of Biological Sciences, University of Aberdeen, Aberdeen, UK
- Alan J Jamieson
  - Minderoo-UWA Deep-Sea Research Centre, School of Biological Sciences and Oceans Institute, The University of Western Australia, Perth, Western Australia, Australia
4
Clark LV, Mays W, Lipka AE, Sacks EJ. A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 2022; 23:101. [PMID: 35317727; PMCID: PMC8939213; DOI: 10.1186/s12859-022-04635-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/04/2021] [Accepted: 03/10/2022] [Indexed: 12/02/2022] Open
Abstract
Background: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs.
Results: We introduce a novel statistic, H_ind/H_E, that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of H_ind/H_E is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species.
Conclusions: Our methodology for estimating H_ind/H_E across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04635-9.
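The statistic described in this abstract can be sketched in a few lines of Python for a single biallelic locus. This is a simplified illustration, not the polyRAD implementation: the function name `hind_he` is hypothetical, reads are assumed to be sampled without replacement within an individual, and allele frequency is assumed to be estimated as the mean per-individual read-depth ratio.

```python
def hind_he(read_counts):
    """Ratio H_ind / H_E at one biallelic locus (hypothetical helper).

    read_counts: list of (ref_reads, alt_reads) tuples, one per individual.
    Individuals with fewer than two reads are skipped, because two reads
    are needed to ask whether they differ.
    """
    usable = [(a, b) for a, b in read_counts if a + b >= 2]
    if not usable:
        return None
    # H_ind: chance that two reads drawn without replacement from one
    # individual carry different alleles, averaged over individuals.
    h_ind = sum(2 * a * b / ((a + b) * (a + b - 1)) for a, b in usable) / len(usable)
    # H_E: expected heterozygosity from the assumed allele-frequency estimate.
    p = sum(a / (a + b) for a, b in usable) / len(usable)
    h_e = 2 * p * (1 - p)
    return h_ind / h_e if h_e > 0 else None


# Illustrative counts only: a locus with one heterozygote and two
# homozygotes, and a locus where every individual looks heterozygous.
mixed_genotypes = hind_he([(5, 5), (10, 0), (0, 10)])
all_heterozygous = hind_he([(5, 5), (5, 5), (5, 5)])
print(mixed_genotypes, all_heterozygous)
```

In this toy comparison the locus where every individual carries both alleles yields the higher ratio, which is the kind of signal a paralog filter would look for. Threshold selection is handled by the published software and is not attempted here.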
Affiliation(s)
- Lindsay V Clark
  - Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Wittney Mays
  - Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
  - Sandia National Laboratories, Livermore, CA, 94551, USA
- Alexander E Lipka
  - Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Erik J Sacks
  - Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
5
Reynes L, Thibaut T, Mauger S, Blanfuné A, Holon F, Cruaud C, Couloux A, Valero M, Aurelle D. Genomic signatures of clonality in the deep water kelp Laminaria rodriguezii. Mol Ecol 2021; 30:1806-1822. [PMID: 33629449; DOI: 10.1111/mec.15860] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/03/2020] [Revised: 02/18/2021] [Accepted: 02/19/2021] [Indexed: 12/17/2022]
Abstract
The development of population genomic approaches in non-model species allows for renewed studies of the impact of reproductive systems and genetic drift on population diversity. Here, we investigate the genomic signatures of partial clonality in the deep water kelp Laminaria rodriguezii, known to reproduce by both sexual and asexual means. We compared these results with Laminaria digitata, a closely related species that differs in several traits, in particular its reproductive mode (no clonal reproduction). We analysed genome-wide variation with dd-RAD sequencing using 4,077 SNPs in L. rodriguezii and 7,364 SNPs in L. digitata. As predicted for partially clonal populations, we show that the distribution of F_IS within populations of L. rodriguezii is shifted toward negative values, with a high number of loci showing heterozygote excess. This finding is the opposite of what we observed within sexual populations of L. digitata, characterized by a generalized deficit in heterozygotes. Furthermore, we observed distinct distributions of F_IS among populations of L. rodriguezii, which is congruent with the predictions of theoretical models for different levels of clonality and genetic drift. These findings highlight that the empirical distribution of F_IS is a promising feature for the genomic study of asexuality in natural populations. Our results also show that the populations of L. rodriguezii analysed here are genetically differentiated and probably isolated. Our study provides a conceptual framework to investigate partial clonality on the basis of RAD-sequencing SNPs. These results could be obtained without any reference genome, and are therefore of interest for various non-model species.
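The per-locus inbreeding coefficient that this abstract relies on follows the standard definition F_IS = 1 - H_O / H_E, so negative values indicate heterozygote excess. The Python sketch below uses hypothetical function names and made-up genotype counts; it illustrates the definition and is not the authors' pipeline.

```python
def fis(n_hom_ref, n_het, n_hom_alt):
    """Per-locus F_IS = 1 - H_O / H_E from genotype counts (hypothetical helper).

    Returns None for monomorphic loci, where H_E is zero.
    """
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    h_e = 2 * p * (1 - p)                  # expected heterozygosity
    if h_e == 0:
        return None
    return 1 - (n_het / n) / h_e


def fraction_negative(values):
    """Share of loci with negative F_IS, i.e. heterozygote excess."""
    defined = [v for v in values if v is not None]
    return sum(v < 0 for v in defined) / len(defined) if defined else None


# Made-up genotype counts (hom_ref, het, hom_alt) for three loci.
loci = [(2, 16, 2), (5, 10, 5), (9, 2, 9)]
values = [fis(*counts) for counts in loci]
print(values, fraction_negative(values))
```

Summarising the whole distribution across loci, rather than a single mean, matches the emphasis of the abstract on the empirical distribution of F_IS.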
Affiliation(s)
- Lauric Reynes
  - CNRS, IRD, MIO, Aix Marseille Université, Université de Toulon, Marseille, France
- Thierry Thibaut
  - CNRS, IRD, MIO, Aix Marseille Université, Université de Toulon, Marseille, France
- Stéphane Mauger
  - IRL 3614, Evolutionary Biology and Ecology of Algae, CNRS, UC, UACH, Sorbonne Université, Roscoff, France
- Aurélie Blanfuné
  - CNRS, IRD, MIO, Aix Marseille Université, Université de Toulon, Marseille, France
- Corinne Cruaud
  - Genoscope, Institut de Biologie François-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Arnaud Couloux
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
- Myriam Valero
  - IRL 3614, Evolutionary Biology and Ecology of Algae, CNRS, UC, UACH, Sorbonne Université, Roscoff, France
- Didier Aurelle
  - CNRS, IRD, MIO, Aix Marseille Université, Université de Toulon, Marseille, France
  - Institut de Systématique Évolution Biodiversité (ISYEB, UMR 7205), Muséum National d'Histoire Naturelle, CNRS, EPHE, Sorbonne Université, Paris, France
6
Gargiulo R, Kull T, Fay MF. Effective double-digest RAD sequencing and genotyping despite large genome size. Mol Ecol Resour 2021; 21:1037-1055. [PMID: 33351289; DOI: 10.1111/1755-0998.13314] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/15/2020] [Revised: 12/03/2020] [Accepted: 12/14/2020] [Indexed: 11/28/2022]
Abstract
Obtaining informative data is the ambition of any genomic project, but in nonmodel species with very large genomes, pursuing such a goal requires surmounting a series of analytical challenges. Double-digest RAD sequencing is routinely used in nonmodel organisms and offers some control over the volume of data obtained. However, the volume of data recovered is not always an indication of the reliability of data sets, and quality checks are necessary to ensure that true and artefactual information is set apart. In the present study, we aim to fill the gap existing between the known applicability of RAD sequencing methods in plants with large genomes and the use of the retrieved loci for population genetic inference. By analysing two populations of Cypripedium calceolus, a nonmodel orchid species with a large genome size (1C ~ 31.6 Gbp), we provide a complete workflow from library preparation to bioinformatic filtering and inference of genetic diversity and differentiation. We show how filtering strategies to dismiss potentially misleading data need to be explored and adapted to data set-specific features. Moreover, we suggest that the occurrence of organellar sequences in libraries should not be neglected when planning the experiment and analysing the results. Finally, we explain how, in the absence of prior information about the genome of the species, seeking high standards of quality during library preparation and sequencing can provide an insurance against unpredicted technical or biological constraints.
Affiliation(s)
- Tiiu Kull
  - Estonian University of Life Sciences, Tartu, Estonia
- Michael F Fay
  - Royal Botanic Gardens, Kew, Richmond, Surrey, UK
  - School of Biological Sciences, University of Western Australia, Crawley, WA, Australia
7
Záveská E, Kirschner P, Frajman B, Wessely J, Willner W, Gattringer A, Hülber K, Lazić D, Dobeš C, Schönswetter P. Evidence for Glacial Refugia of the Forest Understorey Species Helleborus niger (Ranunculaceae) in the Southern as Well as in the Northern Limestone Alps. Front Plant Sci 2021; 12:683043. [PMID: 34040627; PMCID: PMC8141911; DOI: 10.3389/fpls.2021.683043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/19/2021] [Accepted: 04/14/2021] [Indexed: 05/10/2023]
Abstract
Glacial refugia of alpine and subnival biota have been intensively studied in the European Alps but the fate of forests and their understory species in that area remains largely unclear. In order to fill this gap, we aimed at disentangling the spatiotemporal diversification of disjunctly distributed black hellebore Helleborus niger (Ranunculaceae). We applied a set of phylogeographic analyses based on restriction-site associated DNA sequencing (RADseq) data and plastid DNA sequences to a range-wide sampling of populations. These analyses were supplemented with species distribution models generated for the present and the Last Glacial Maximum (LGM). We used exploratory analyses to delimit genomically coherent groups and then employed demographic modeling to reconstruct the history of these groups. We uncovered a deep split between two major genetic groups with western and eastern distribution within the Southern Limestone Alps, likely reflecting divergent evolution since the mid-Pleistocene in two glacial refugia situated along the unglaciated southern margin of the Alps. Long-term presence in the Southern Limestone Alps is also supported by high numbers of private alleles, elevated levels of nucleotide diversity and the species' modeled distribution at the LGM. The deep genetic divergence, however, is not reflected in leaf shape variation, suggesting that the morphological discrimination of genetically divergent entities within H. niger is questionable. At a shallower level, populations from the Northern Limestone Alps are differentiated from those in the Southern Limestone Alps in both RADseq and plastid DNA data sets, reflecting the North-South disjunction within the Eastern Alps. The underlying split was dated to ca. 0.1 mya, which is well before the LGM. In the same line, explicit tests of demographic models consistently rejected the hypothesis that the partial distribution area in the Northern Limestone Alps is the result of postglacial colonization. 
Taken together, our results strongly support that forest understory species such as H. niger have survived the LGM in refugia situated along the southern, but also along the northern or northeastern periphery of the Alps. Being a slow migrator, the species has likely survived repeated glacial-interglacial cycles in distributional stasis while the composition of the tree canopy changed in the meanwhile.
Affiliation(s)
- Eliška Záveská
  - Department of Botany, University of Innsbruck, Innsbruck, Austria
  - Institute of Botany of the Czech Academy of Sciences, Průhonice, Czechia
- Božo Frajman
  - Department of Botany, University of Innsbruck, Innsbruck, Austria
- Johannes Wessely
  - Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria
- Wolfgang Willner
  - Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria
- Andreas Gattringer
  - Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria
- Karl Hülber
  - Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria
  - Correspondence: Karl Hülber
- Desanka Lazić
  - Department of Forest Genetics and Forest Tree Breeding, Georg-August University of Göttingen, Göttingen, Germany
- Christoph Dobeš
  - Institute of Forest Genetics, Austrian Research Centre for Forests, Vienna, Austria
8
Tigano A. A population genomics approach to uncover the CNVs, and their evolutionary significance, hidden in reduced-representation sequencing data sets. Mol Ecol 2020; 29:4749-4753. [PMID: 32997366; DOI: 10.1111/mec.15665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2020] [Accepted: 09/11/2020] [Indexed: 12/01/2022]
Abstract
The importance of structural variation in adaptation and speciation is becoming increasingly evident in the literature. Among structural variants (SVs), copy number variants (CNVs) are known to affect phenotypes through changes in gene expression and can potentially reduce recombination between alleles with different copy numbers. However, little is known about their abundance, distribution and frequency in natural populations. In a "From the Cover" article in this issue of Molecular Ecology, Dorant et al. (2020) present a new cost-effective approach to genotype CNVs from large reduced-representation sequencing (RRS) data sets in nonmodel organisms, and thus to analyse sequence and structural variation jointly. They show that in American lobsters (Homarus americanus), CNVs exhibit strong population structure and several significant associations with annual variance in sea surface temperature, while SNPs fail to uncover any population structure or genotype-environment associations. Their results clearly illustrate that structural variants like CNVs can potentially store important information on differentiation and adaptive differences that cannot be retrieved from the analysis of sequence variation alone. To better understand the factors affecting the evolution of CNVs and their role in adaptation and speciation, we need to compare and synthesize data from a wide variety of species with different demographic histories and genome structure. The approach developed by Dorant et al. (2020) now makes it possible to gain crucial knowledge on CNVs in a cost-effective way, even in species with limited genomic resources.
Affiliation(s)
- Anna Tigano
  - Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Durham, NH, USA
  - Hubbard Center for Genome Studies, University of New Hampshire, Durham, NH, USA
9
Hartvig I, So T, Changtragoon S, Tran HT, Bouamanivong S, Ogden R, Senn H, Vieira FG, Turner F, Talbot R, Theilade I, Nielsen LR, Kjær ED. Conservation genetics of the critically endangered Siamese rosewood (Dalbergia cochinchinensis): recommendations for management and sustainable use. Conserv Genet 2020. [DOI: 10.1007/s10592-020-01279-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022]