1
Bohutínská M, Vlček J, Monnahan P, Kolář F. Population Genomic Analysis of Diploid-Autopolyploid Species. Methods Mol Biol 2023; 2545:297-324. [PMID: 36720820 DOI: 10.1007/978-1-0716-2561-3_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
This chapter outlines an empirical analysis of genome-wide single-nucleotide polymorphism (SNP) variation and its underlying drivers among multiple natural populations within a diploid-autopolyploid species. The aim is to reconstruct the genetic structure among natural populations of varying ploidy and infer footprints of selection in these populations, framed around specific questions that are typically encountered when analyzing a mixed-ploidy data set, e.g., addressing the relevance of natural whole-genome duplication for speciation and adaptation. We briefly review the options for the analysis of polyploid population genomic data involving variant calling, population structure, demographic history inference, and selection scanning approaches. Further, we provide suggestions for methods and associated software, possible caveats, and examples of their application to mixed-ploidy and autopolyploid data sets.
Affiliation(s)
- Magdalena Bohutínská
  - Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic; Institute of Botany of the Czech Academy of Sciences, Průhonice, Czech Republic
- Jakub Vlček
  - Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic
- Patrick Monnahan
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, USA
- Filip Kolář
  - Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic; Institute of Botany of the Czech Academy of Sciences, Průhonice, Czech Republic
2
Wang Y, Yu J, Jiang M, Lei W, Zhang X, Tang H. Sequencing and Assembly of Polyploid Genomes. Methods Mol Biol 2023; 2545:429-458. [PMID: 36720827 DOI: 10.1007/978-1-0716-2561-3_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Polyploidy has been observed throughout major eukaryotic clades and has played a vital role in the evolution of angiosperms. Recent polyploidizations often result in highly complex genome structures, posing challenges to genome assembly and phasing. Recent advances in sequencing technologies and genome assembly algorithms have enabled high-quality, near-complete chromosome-level assemblies of polyploid genomes. Advances in novel sequencing technologies include highly accurate single-molecule sequencing with HiFi reads, chromosome conformation capture with Hi-C technique, and linked reads sequencing. Additionally, new computational approaches have also significantly improved the precision and reliability of polyploid genome assembly and phasing, such as HiCanu, hifiasm, ALLHiC, and PolyGembler. Herein, we review recently published polyploid genomes and compare the various sequencing, assembly, and phasing approaches that are utilized in these genome studies. Finally, we anticipate that accurate and telomere-to-telomere chromosome-level assembly of polyploid genomes could ultimately become a routine procedure in the near future.
Affiliation(s)
- Yibin Wang
  - Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
- Jiaxin Yu
  - Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
- Mengwei Jiang
  - Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
- Wenlong Lei
  - Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
- Xingtan Zhang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Haibao Tang
  - Center for Genomics and Biotechnology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
3
Thérèse Navarro A, Bourke PM, van de Weg E, Clot CR, Arens P, Finkers R, Maliepaard C. Smooth Descent: A ploidy-aware algorithm to improve linkage mapping in the presence of genotyping errors. Front Genet 2023; 14:1049988. [PMID: 36936433 PMCID: PMC10014611 DOI: 10.3389/fgene.2023.1049988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 02/20/2023] [Indexed: 03/08/2023]
Abstract
Linkage mapping is an approach to order markers based on recombination events. Mapping algorithms cannot easily handle genotyping errors, which are common in high-throughput genotyping data. To solve this issue, strategies have been developed, aimed mostly at identifying and eliminating these errors. One such strategy is SMOOTH, an iterative algorithm to detect genotyping errors. Unlike other approaches, SMOOTH can also be used to impute the most probable alternative genotypes, but its application is limited to diploid species and to markers heterozygous in only one of the parents. In this study we adapted SMOOTH to expand its use to any marker type and to autopolyploids with the use of identity-by-descent probabilities, naming the updated algorithm Smooth Descent (SD). We applied SD to real and simulated data, showing that in the presence of genotyping errors this method produces better genetic maps in terms of marker order and map length. SD is particularly useful for error rates between 5% and 20% and when error rates are not homogeneous among markers or individuals. With a starting error rate of 10%, SD reduced it to ∼5% in diploids, ∼7% in tetraploids and ∼8.5% in hexaploids. Conversely, the correlation between true and estimated genetic maps increased by 0.03 in tetraploids and by 0.2 in hexaploids, while worsening slightly in diploids (∼0.0011). We also show that the combination of genotype curation and map re-estimation allowed us to obtain better genetic maps while correcting wrong genotypes. We have implemented this algorithm in the R package Smooth Descent.
4
Mazrouee S. ARHap: Association Rule Haplotype Phasing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3281-3294. [PMID: 34648456 DOI: 10.1109/tcbb.2021.3119955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
This article proposes a novel approach for individual human phasing through discovery of interesting hidden relations among single variant sites. The proposed framework, called ARHap, learns strong association rules among variant loci on the genome and develops a combinatorial approach for fast and accurate haplotype phasing based on the discovered associations. ARHap is composed of two main modules or processing phases. In the first phase, called association rule learning, ARHap identifies quantitative association rules from a collection of DNA reads of the organism under study, resulting in a set of strong rules that reveal the inter-dependency of alleles. In the next phase, called haplotype reconstruction, we develop algorithms that use the learned rules to construct highly reliable haplotypes at individual single nucleotide polymorphism (SNP) sites. ARHap has several features that lead to both fast and accurate haplotyping. It uses an incremental haplotype reconstruction approach that generates association rules according to the SNP sites still unreconstructed in each round of the algorithm. During each round, the association rule learning module generates rules while constraining their length and limiting them to those that contribute to the reconstruction of unreconstructed sites only. The framework begins by generating short, very strong rules. If some SNP sites remain unreconstructed, the rule length can be increased and/or the rule-strength criteria adjusted gradually during subsequent rounds. This adaptive approach, which uses feedback from the haplotype reconstruction module, avoids generating rules that do not contribute to haplotype reconstruction as well as weak rules that may introduce errors in the final haplotypes. Extensive experimental analyses on datasets representing diploid organisms demonstrate the superiority of ARHap in diploid haplotyping compared to state-of-the-art algorithms. In particular, we show that this novel approach to haplotype phasing is not only fast but also achieves significantly better accuracy than other read-based computational approaches.
5
Abou Saada O, Tsouris A, Eberlein C, Friedrich A, Schacherer J. nPhase: an accurate and contiguous phasing method for polyploids. Genome Biol 2021; 22:126. [PMID: 33926549 PMCID: PMC8082856 DOI: 10.1186/s13059-021-02342-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/31/2020] [Accepted: 04/08/2021] [Indexed: 01/06/2023]
Abstract
While genome sequencing and assembly are now routine, we do not have a full, precise picture of polyploid genomes. No existing polyploid phasing method provides accurate and contiguous haplotype predictions. We developed nPhase, a ploidy agnostic tool that leverages long reads and accurate short reads to solve alignment-based phasing for samples of unspecified ploidy (https://github.com/OmarOakheart/nPhase). nPhase is validated by tests on simulated and real polyploids. nPhase obtains on average over 95% accuracy and a contiguous 1.25 haplotigs per haplotype to cover more than 90% of each chromosome (heterozygosity rate ≥ 0.5%). nPhase allows population genomics and hybrid studies of polyploids.
Affiliation(s)
- Omar Abou Saada
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Andreas Tsouris
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Chris Eberlein
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Anne Friedrich
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France
- Joseph Schacherer
  - Université de Strasbourg, CNRS, GMGM UMR 7156, Strasbourg, France; Institut Universitaire de France (IUF), Paris, France
6
Garg S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol 2021; 22:101. [PMID: 33845884 PMCID: PMC8040228 DOI: 10.1186/s13059-021-02328-9] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 01/10/2021] [Accepted: 03/25/2021] [Indexed: 12/13/2022]
Abstract
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information spanning whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Affiliation(s)
- Shilpa Garg
  - Department of Biology, University of Copenhagen, Copenhagen, Denmark
7
Schrinner SD, Mari RS, Ebler J, Rautiainen M, Seillier L, Reimer JJ, Usadel B, Marschall T, Klau GW. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol 2020; 21:252. [PMID: 32951599 PMCID: PMC7504856 DOI: 10.1186/s13059-020-02158-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 02/03/2020] [Accepted: 08/26/2020] [Indexed: 01/19/2023]
Abstract
Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WHATSHAP POLYPHASE, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state-of-the-art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.
Affiliation(s)
- Sven D Schrinner
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Rebecca Serra Mari
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Mikko Rautiainen
  - Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany
  - Max Planck Institute for Informatics, Saarbrücken, 66123, Germany
- Lancelot Seillier
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Julia J Reimer
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
- Björn Usadel
  - Forschungszentrum Jülich IBG-4, Wilhelm-Johnen-Str., Jülich, 52428, Germany
  - Institute for Biology I, RWTH Aachen, Worringer Weg 3, Aachen, 52074, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany
- Gunnar W Klau
  - Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
  - Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Universitätsstr. 1, Düsseldorf, 40225, Germany
8
Sankararaman A, Vikalo H, Baccelli F. ComHapDet: a spatial community detection algorithm for haplotype assembly. BMC Genomics 2020; 21:586. [PMID: 32900369 PMCID: PMC7488034 DOI: 10.1186/s12864-020-06935-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022]
Abstract
BACKGROUND Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants. CONCLUSIONS Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
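The second step of the two-step procedure described in this abstract (assembling haplotypes once each read carries a community label) can be sketched as follows. This is only an illustration, not the ComHapDet implementation: the per-site majority vote, the read representation (a dict from SNP index to allele code), and the function name `assemble_from_labels` are all assumptions made here, and the community labels are taken as given rather than recovered from a graph.

```python
from collections import Counter, defaultdict

def assemble_from_labels(reads, labels, n_sites, missing=-1):
    """Build one consensus haplotype per community label.

    reads  : list of dicts {snp_index: allele_code} (hypothetical format)
    labels : list of community labels, one per read (assumed already estimated)
    Sites not covered by any read of a community are filled with `missing`.
    """
    # votes[label][site] counts how often each allele was observed
    votes = defaultdict(lambda: defaultdict(Counter))
    for read, label in zip(reads, labels):
        for site, allele in read.items():
            votes[label][site][allele] += 1

    haplotypes = {}
    for label in sorted(votes):
        per_site = votes[label]
        haplotypes[label] = [
            per_site[site].most_common(1)[0][0] if site in per_site else missing
            for site in range(n_sites)
        ]
    return haplotypes

# Toy example with multi-allelic codes (0, 1, 2); the last read carries an error at site 1.
reads = [{0: 0, 1: 2}, {1: 2, 2: 1}, {0: 1, 1: 0}, {1: 0, 2: 0}, {0: 0, 1: 1}]
labels = [0, 0, 1, 1, 0]
consensus = assemble_from_labels(reads, labels, n_sites=3)
print(consensus)
```

Because alleles are treated as opaque codes, the same vote works for biallelic and multi-allelic sites; the hard part of the problem, recovering the labels, is deliberately left out of this sketch.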
Affiliation(s)
- Abishek Sankararaman
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Haris Vikalo
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- François Baccelli
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA; Department of Mathematics, The University of Texas at Austin, Austin, TX, USA
9
Majidian S, Kahaei MH, de Ridder D. Hap10: reconstructing accurate and long polyploid haplotypes using linked reads. BMC Bioinformatics 2020; 21:253. [PMID: 32552661 PMCID: PMC7302376 DOI: 10.1186/s12859-020-03584-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/08/2020] [Accepted: 06/05/2020] [Indexed: 01/23/2023]
Abstract
BACKGROUND Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, to the best of our knowledge, no algorithms have yet been proposed for polyploids specifically exploiting linked reads. RESULTS The first haplotyping algorithm designed for linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. Then, these fragments are clustered into strongly connected components which are then used as input of a haplotype assembly core in order to estimate accurate and long haplotypes. CONCLUSIONS Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios and its applicability is demonstrated on a real dataset of sweet potato.
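The barcode step mentioned in this abstract (combining reads that share a barcode into one molecule-specific fragment) might look roughly like the sketch below. It is a simplification written for illustration, not Hap10's code: the input format (pairs of barcode and a dict from SNP index to allele), the majority vote at conflicting sites, the dropping of tied sites, and the name `merge_by_barcode` are assumptions. It also treats every barcode as a single molecule, which real linked-read data does not guarantee.

```python
from collections import Counter, defaultdict

def merge_by_barcode(reads):
    """Combine reads sharing a barcode into one fragment per barcode.

    reads : iterable of (barcode, {snp_index: allele}) pairs (hypothetical format)
    Returns {barcode: {snp_index: allele}}; sites with a tied vote are dropped.
    """
    # votes[barcode][site] counts the alleles seen for that barcode at that site
    votes = defaultdict(lambda: defaultdict(Counter))
    for barcode, alleles in reads:
        for site, allele in alleles.items():
            votes[barcode][site][allele] += 1

    fragments = {}
    for barcode, per_site in votes.items():
        fragment = {}
        for site, counter in sorted(per_site.items()):
            ranked = counter.most_common(2)
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                continue  # ambiguous site: equal support for two alleles
            fragment[site] = ranked[0][0]
        fragments[barcode] = fragment
    return fragments

linked_reads = [
    ("AAC", {0: 0, 1: 1}),
    ("AAC", {1: 1, 2: 0}),
    ("GTT", {0: 1}),
    ("GTT", {0: 0, 3: 1}),  # conflicts with the previous GTT read at site 0
    ("AAC", {5: 1}),
]
fragments = merge_by_barcode(linked_reads)
print(fragments)
```

The merged fragments span more SNP sites than any single short read, which is what lets a downstream assembly core connect distant variants.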
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, 16846-13114, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
10
Majidian S, Kahaei MH, de Ridder D. Minimum error correction-based haplotype assembly: Considerations for long read data. PLoS One 2020; 15:e0234470. [PMID: 32530974 PMCID: PMC7292361 DOI: 10.1371/journal.pone.0234470] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/07/2020] [Accepted: 05/27/2020] [Indexed: 11/23/2022]
Abstract
The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.
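The MEC metric discussed in this abstract can be made concrete with a small example. The sketch below uses the standard definition (each read is assigned to its closest haplotype, and the mismatches are summed); the read format, the function names, and the toy data are assumptions chosen for illustration and are unrelated to the authors' simulations. The toy reads contain two erroneous calls at one site, which is enough for an incorrect haplotype pair to score better than the true one.

```python
def mismatches(read, haplotype):
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for site, allele in read.items() if haplotype[site] != allele)

def mec_score(reads, haplotypes):
    """Minimum Error Correction score: every read is charged the number of
    mismatches to its best-matching haplotype, and the charges are summed."""
    return sum(min(mismatches(read, hap) for hap in haplotypes) for read in reads)

# Reads as {snp_index: allele}; the third and fourth carry an error at site 2.
reads = [
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0, 3: 1},
    {0: 0, 1: 0, 2: 0},
    {1: 0, 2: 0, 3: 0},
]
true_haplotypes = [[0, 0, 1, 0], [1, 1, 0, 1]]
wrong_haplotypes = [[0, 0, 0, 0], [1, 1, 0, 1]]  # site 2 flipped on the first haplotype

print(mec_score(reads, true_haplotypes))   # 2
print(mec_score(reads, wrong_haplotypes))  # 1
```

Here the incorrect pair attains the lower score because two of the three reads covering site 2 on the first haplotype are wrong, so minimizing MEC follows the erroneous majority.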
Affiliation(s)
- Sina Majidian
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Mohammad Hossein Kahaei
  - School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
11
Pavan S, Delvento C, Ricciardi L, Lotti C, Ciani E, D'Agostino N. Recommendations for Choosing the Genotyping Method and Best Practices for Quality Control in Crop Genome-Wide Association Studies. Front Genet 2020; 11:447. [PMID: 32587600 PMCID: PMC7299185 DOI: 10.3389/fgene.2020.00447] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 10/19/2019] [Accepted: 04/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput genotyping boosts genome-wide association studies (GWAS) in crop species, leading to the identification of single-nucleotide polymorphisms (SNPs) associated with economically important traits. Choosing a cost-effective genotyping method for crop GWAS requires careful examination of several aspects, namely, the purpose and the scale of the study, crop-specific genomic features, and technical and economic matters associated with each genotyping option. Once genotypic data have been obtained, quality control (QC) procedures must be applied to avoid bias and false signals in genotype–phenotype association tests. QC for human GWAS has been extensively reviewed; however, QC for crop GWAS may require different actions, depending on the GWAS population type. Here, we review the most popular genotyping methods based on next-generation sequencing (NGS) and array hybridization, and report observations that should guide the investigator in the choice of the genotyping method for crop GWAS. We provide recommendations for performing QC in crop species, and deliver an overview of bioinformatics tools that can be used to accomplish all needed tasks. Overall, this work aims to provide guidelines to harmonize those procedures leading to SNP datasets ready for crop GWAS.
Affiliation(s)
- Stefano Pavan
  - Department of Soil, Plant and Food Science, Section of Genetics and Plant Breeding, University of Bari Aldo Moro, Bari, Italy; Institute of Biomedical Technologies, National Research Council (CNR), Bari, Italy
- Chiara Delvento
  - Department of Soil, Plant and Food Science, Section of Genetics and Plant Breeding, University of Bari Aldo Moro, Bari, Italy
- Luigi Ricciardi
  - Department of Soil, Plant and Food Science, Section of Genetics and Plant Breeding, University of Bari Aldo Moro, Bari, Italy
- Concetta Lotti
  - Department of Agricultural, Food and Environmental Sciences, University of Foggia, Foggia, Italy
- Elena Ciani
  - Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari Aldo Moro, Bari, Italy
- Nunzio D'Agostino
  - Department of Agricultural Sciences, University of Naples Federico II, Naples, Italy
12
Moeinzadeh MH, Yang J, Muzychenko E, Gallone G, Heller D, Reinert K, Haas S, Vingron M. Ranbow: A fast and accurate method for polyploid haplotype reconstruction. PLoS Comput Biol 2020; 16:e1007843. [PMID: 32469863 PMCID: PMC7310859 DOI: 10.1371/journal.pcbi.1007843] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/04/2019] [Revised: 06/23/2020] [Accepted: 04/03/2020] [Indexed: 12/30/2022]
Abstract
Reconstructing haplotypes from sequencing data is one of the major challenges in genetics. Haplotypes play a crucial role in many analyses, including genome-wide association studies and population genetics. Haplotype reconstruction becomes more difficult for higher numbers of homologous chromosomes, as is often the case for polyploid plants. This complexity is compounded further by higher heterozygosity, which denotes the frequent presence of variants between haplotypes. We have designed Ranbow, a new tool for haplotype reconstruction of polyploid genomes from short read sequencing data. Ranbow integrates all types of small variants in bi- and multi-allelic sites to reconstruct haplotypes. To evaluate Ranbow and currently available competing methods on real data, we have created and released a real gold standard dataset from sweet potato sequencing data. Our evaluations on real and simulated data clearly show Ranbow’s superior performance in terms of accuracy, haplotype length, memory usage, and running time. Specifically, Ranbow is one order of magnitude faster than the next best method. The efficiency and accuracy of Ranbow make whole genome haplotype reconstruction of complex genomes with higher ploidy feasible.
We focus on the problem of reconstructing haplotypes for polyploid genomes. Our approach explored using short read sequence data from a highly heterozygous hexaploid genome. We observed that short read data from strongly heterozygous organisms open up a way for haplotype reconstruction by supplying overlap information between reads. We therefore investigated the role of heterozygosity and ploidy number. Though higher heterozygosity provides more useful reads for reconstructing haplotypes, polyploidy increases the challenge in assembling reads into longer sequences. We called this the problem of “Ambiguity of Merging” fragments. We addressed this problem by designing a new algorithm called Ranbow. Ranbow was evaluated on real and simulated data from the genomes of tetraploid Capsella bursa-pastoris (Shepherd’s Purse) and hexaploid Ipomoea batatas (sweet potato). We were able to show that our method achieved high accuracy and long assembled haplotypes in a feasible amount of time, performing at a level consistently superior to other algorithms.
Affiliation(s)
- M-Hossein Moeinzadeh
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Jun Yang
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Shanghai Chenshan Plant Science Research Center, Chinese Academy of Sciences, Shanghai Chenshan Botanical Garden, Shanghai, China
- David Heller
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Knut Reinert
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Stefan Haas
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Martin Vingron
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
13
Mazrouee S, Wang W. PolyCluster: Minimum Fragment Disagreement Clustering for Polyploid Phasing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:264-277. [PMID: 30040655 DOI: 10.1109/tcbb.2018.2858803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
Phasing is an emerging area in computational biology with important applications in clinical decision making and biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing has not yet been fully understood. In this paper, we investigate the development of clustering-based techniques for phasing in polyploid organisms, where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on the concept of correlation clustering followed by an effective cluster merging mechanism to minimize the amount of disagreement among short reads residing in each cluster. We first introduce a graph model to quantify the amount of similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces the switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. Also, the running time of PolyCluster is several orders of magnitude less than that of HapTree, while it is comparable to that of the other algorithms.
|
14
|
Siragusa E, Haiminen N, Utro F, Parida L. Linear Time Algorithms to Construct Populations Fitting Multiple Constraint Distributions at Genomic Scales. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1132-1142. [PMID: 28991752 DOI: 10.1109/tcbb.2017.2760879] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
Computer simulations can be used to study population genetic methods, models, and parameters, as well as to predict potential outcomes. For example, in plant populations, the outcome of breeding operations can be predicted using simulations. In-silico construction of populations with pre-specified characteristics is an important task in breeding optimization and other population genetic studies. We present two linear time Simulation using Best-fit Algorithms (SimBA) for two classes of problems where each co-fits two distributions: SimBA-LD fits linkage disequilibrium and minimum allele frequency distributions, while SimBA-hap fits founder-haplotype and polyploid allele dosage distributions. An incremental gap-filling version of the previously introduced SimBA-LD is here demonstrated to accurately fit the target distributions, allowing efficient large-scale simulations. SimBA-hap accuracy and efficiency are demonstrated by simulating tetraploid populations with varying numbers of founder haplotypes; we evaluate both a linear time greedy algorithm and an optimal solution based on mixed-integer programming. SimBA is available on http://researcher.watson.ibm.com/project/5669.
|
15
|
Motazedi E, Maliepaard C, Finkers R, Visser R, de Ridder D. Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads. Front Genet 2019; 10:335. [PMID: 31040862 PMCID: PMC6477055 DOI: 10.3389/fgene.2019.00335] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/14/2018] [Accepted: 03/28/2019] [Indexed: 12/27/2022]
Abstract
DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.
Affiliation(s)
- Ehsan Motazedi
- Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands.,Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
| | - Chris Maliepaard
- Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
| | - Richard Finkers
- Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
| | - Richard Visser
- Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Wageningen University & Research, Wageningen, Netherlands
| |
|
16
|
Majidian S, Kahaei MH. NGS based haplotype assembly using matrix completion. PLoS One 2019; 14:e0214455. [PMID: 30913270 PMCID: PMC6435133 DOI: 10.1371/journal.pone.0214455] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/24/2018] [Accepted: 03/13/2019] [Indexed: 12/30/2022]
Abstract
We apply matrix completion methods for haplotype assembly from NGS reads to develop the new HapSVT, HapNuc, and HapOPT algorithms. This is performed by applying a mathematical model to convert the reads to an incomplete matrix and estimating its unknown components. This process is followed by quantizing and decoding the completed matrix in order to estimate haplotypes. These algorithms are compared to state-of-the-art algorithms using simulated data as well as real fosmid data. It is shown that the SNP missing rate and the haplotype block length of the proposed HapOPT are better than those of HapCUT2, with comparable accuracy in terms of reconstruction rate and switch error rate. A program implementing the proposed algorithms in MATLAB is freely available at https://github.com/smajidian/HapMC.
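The pipeline described in this abstract (reads to an incomplete matrix, completion, then quantization) can be sketched for the diploid case. The completion step below is a plain rank-1 alternating least squares fit, swapped in for illustration; it is not HapSVT, HapNuc, or HapOPT, and the encoding (+1/-1 alleles, 0 for uncovered SNPs), the seeding with the first read, and the function names are all assumptions.

```python
def complete_rank1(reads, iterations=10):
    """Fit R ~ u * v^T on observed entries by alternating least squares.

    reads: equal-length lists with +1/-1 alleles and 0 for uncovered SNPs.
    Returns (u, v): per-read weights and per-SNP haplotype values.
    Illustrative stand-in only, not the algorithms from the paper.
    """
    n_reads, n_snps = len(reads), len(reads[0])
    v = [float(x) for x in reads[0]]  # assumed init: seed with the first read
    u = [0.0] * n_reads
    for _ in range(iterations):
        # Update each read weight using only the SNPs that read covers
        for i in range(n_reads):
            cols = [j for j in range(n_snps) if reads[i][j] != 0]
            num = sum(reads[i][j] * v[j] for j in cols)
            den = sum(v[j] ** 2 for j in cols)
            u[i] = num / den if den else 0.0
        # Update each SNP value using only the reads that cover it
        for j in range(n_snps):
            rows = [i for i in range(n_reads) if reads[i][j] != 0]
            num = sum(reads[i][j] * u[i] for i in rows)
            den = sum(u[i] ** 2 for i in rows)
            if den:
                v[j] = num / den
    return u, v

def quantize(values):
    """Map real-valued entries back to +1/-1 alleles."""
    return [1 if x >= 0 else -1 for x in values]

# Hypothetical error-free reads sampled from h = [1, -1, 1, 1] and its complement
reads = [
    [1, -1, 0, 0],
    [0, -1, 1, 0],
    [0, 0, 1, 1],
    [-1, 1, 0, 0],
    [0, 1, -1, -1],
]
u, v = complete_rank1(reads)
haplotype = quantize(v)
```

The sign of each entry of `u` indicates which of the two complementary haplotypes a read is assigned to. Noisy reads and polyploid genomes need a higher-rank model and a more careful decoding step than this toy shows.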
Affiliation(s)
- Sina Majidian
- School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran. 16846-13114
| | - Mohammad Hossein Kahaei
- School of Electrical Engineering, Iran University of Science & Technology, Narmak, Tehran, Iran. 16846-13114
| |
|
17
|
Morales-Briones DF, Liston A, Tank DC. Phylogenomic analyses reveal a deep history of hybridization and polyploidy in the Neotropical genus Lachemilla (Rosaceae). The New Phytologist 2018; 218:1668-1684. [PMID: 29604235 DOI: 10.1111/nph.15099] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 11/30/2017] [Accepted: 02/09/2018] [Indexed: 05/10/2023]
Abstract
Hybridization, incomplete lineage sorting, and phylogenetic error produce similar incongruence patterns, representing a great challenge for phylogenetic reconstruction. Here, we use sequence capture data and multiple species tree and species network approaches to resolve the backbone phylogeny of the Neotropical genus Lachemilla, while distinguishing among sources of incongruence. We used 396 nuclear loci and nearly complete plastome sequences from 27 species to clarify the relationships among the major groups of Lachemilla, and explored multiple sources of conflict between gene trees and species trees inferred with a plurality of approaches. All phylogenetic methods recovered the four major groups previously proposed for Lachemilla, but species tree methods recovered different topologies for relationships between these four clades. Species network analyses revealed that one major clade, Orbiculate, is likely of ancient hybrid origin, representing one of the main sources of incongruence among the species trees. Additionally, we found evidence for a potential whole genome duplication event shared by Lachemilla and allied genera. Lachemilla shows clear evidence of ancient and recent hybridization throughout the evolutionary history of the group. Also, we show the necessity to use phylogenetic network approaches that can simultaneously accommodate incomplete lineage sorting and gene flow when studying groups that show patterns of reticulation.
Affiliation(s)
- Diego F Morales-Briones
- Department of Biological Sciences, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
- Institute for Bioinformatics and Evolutionary Studies, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
- Stillinger Herbarium, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
| | - Aaron Liston
- Department of Botany and Plant Pathology, Oregon State University, 2082 Cordley Hall, Corvallis, OR, 97331, USA
| | - David C Tank
- Department of Biological Sciences, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
- Institute for Bioinformatics and Evolutionary Studies, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
- Stillinger Herbarium, University of Idaho, 875 Perimeter Drive MS 3051, Moscow, ID, 83844-3051, USA
| |
|
18
|
Bourke PM, Voorrips RE, Visser RGF, Maliepaard C. Tools for Genetic Studies in Experimental Populations of Polyploids. Frontiers in Plant Science 2018; 9:513. [PMID: 29720992 PMCID: PMC5915555 DOI: 10.3389/fpls.2018.00513] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 01/25/2018] [Accepted: 04/04/2018] [Indexed: 05/19/2023]
Abstract
Polyploid organisms carry more than two copies of each chromosome, a condition rarely tolerated in animals but which occurs relatively frequently in the plant kingdom. One of the principal challenges faced by polyploid organisms is to evolve stable meiotic mechanisms to faithfully transmit genetic information to the next generation upon which the study of inheritance is based. In this review we look at the tools available to the research community to better understand polyploid inheritance, many of which have only recently been developed. Most of these tools are intended for experimental populations (rather than natural populations), facilitating genomics-assisted crop improvement and plant breeding. This is hardly surprising given that a large proportion of domesticated plant species are polyploid. We focus on three main areas: (1) polyploid genotyping; (2) genetic and physical mapping; and (3) quantitative trait analysis and genomic selection. We also briefly review some miscellaneous topics such as the mode of inheritance and the availability of polyploid simulation software. The current polyploid analytic toolbox includes software for assigning marker genotypes (and in particular, estimating the dosage of marker alleles in the heterozygous condition), establishing chromosome-scale linkage phase among marker alleles, constructing (short-range) haplotypes, generating linkage maps, performing genome-wide association studies (GWAS) and quantitative trait locus (QTL) analyses, and simulating polyploid populations. These tools can also help elucidate the mode of inheritance (disomic, polysomic or a mixture of both as in segmental allopolyploids) or reveal whether double reduction and multivalent chromosomal pairing occur. An increasing number of polyploids (or associated diploids) are being sequenced, leading to publicly available reference genome assemblies. Much work remains in order to keep pace with developments in genomic technologies. However, such technologies also offer the promise of understanding polyploid genomes at a level which hitherto has remained elusive.
Affiliation(s)
- Chris Maliepaard
- Plant Breeding, Wageningen University & Research, Wageningen, Netherlands
| |
|
19
|
Abstract
BACKGROUND: Haplotype assembly is the task of reconstructing haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most of the mathematical formulations of haplotype assembly are known to be NP-hard, and haplotype assembly becomes even more challenging as the sequencing technology advances and the length of the paired-end reads and inserts increases. Assembly of haplotypes of polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms. RESULTS: We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and [Formula: see text]; tensor [Formula: see text] reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem, is proposed. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid, biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap. CONCLUSION: AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and to significantly outperform existing techniques when applied to haplotype assembly of polyploids.
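The minimum error correction (MEC) score referred to in this abstract is commonly defined as the number of read alleles that must be flipped so that every read matches one of the candidate haplotypes. A minimal sketch of that standard definition follows; the function name `mec_score` and the encoding (`None` for uncovered SNPs) are assumptions, and the code does not reproduce AltHap itself.

```python
def mec_score(reads, haplotypes):
    """Sum, over reads, of the smallest mismatch count against any haplotype.

    reads: equal-length allele lists, with None marking an uncovered SNP.
    haplotypes: one allele list per chromosome copy (k lists for ploidy k).
    """
    total = 0
    for read in reads:
        total += min(
            sum(1 for r, h in zip(read, hap) if r is not None and r != h)
            for hap in haplotypes
        )
    return total

# Hypothetical triploid example over three SNP positions
haplotypes = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
reads = [[0, 1, None], [1, 0, 0], [None, 1, 1]]
score = mec_score(reads, haplotypes)
```

Lower scores indicate haplotypes that explain the reads with fewer assumed sequencing errors, which is why assembly methods are often compared on this quantity.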
Affiliation(s)
- Abolfazl Hashemi
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| | - Banghua Zhu
- EE Department, Tsinghua University, Beijing, China
| | - Haris Vikalo
- Department of ECE, University of Texas at Austin, Austin, Texas, USA
| |
|
20
|
Kyriakidou M, Tai HH, Anglin NL, Ellis D, Strömvik MV. Current Strategies of Polyploid Plant Genome Sequence Assembly. Frontiers in Plant Science 2018; 9:1660. [PMID: 30519250 PMCID: PMC6258962 DOI: 10.3389/fpls.2018.01660] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Received: 04/13/2018] [Accepted: 10/25/2018] [Indexed: 05/14/2023]
Abstract
Polyploidy, or duplication of an entire genome, occurs in the majority of angiosperms. The understanding of polyploid genomes is important for the improvement of those crops which humans rely on for sustenance and basic nutrition. As climate change continues to pose a potential threat to agricultural production, there will increasingly be a demand for plant cultivars that can resist biotic and abiotic stresses and also provide needed and improved nutrition. In the past decade, Next Generation Sequencing (NGS) has fundamentally changed the genomics landscape by providing tools for the exploration of polyploid genomes. Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools in molecular genetics and breeding. As genomes of diploid and less heterozygous progenitor species are increasingly available, we discuss the lack of complexity of these currently available reference genomes as they relate to polyploid crops. Finally, we review recent approaches of haplotyping by phasing and the impact of third generation technologies on polyploid plant genome assembly.
Affiliation(s)
- Maria Kyriakidou
- Department of Plant Science, McGill University, Montreal, QC, Canada
| | - Helen H. Tai
- Fredericton Research and Development Centre, Agriculture and Agri-Food Canada, Fredericton, NB, Canada
| | - Martina V. Strömvik
- Department of Plant Science, McGill University, Montreal, QC, Canada
- *Correspondence: Martina V. Strömvik
| |
|