1
Selvakumar R, Jat GS, Manjunathagowda DC. Allele mining through TILLING and EcoTILLING approaches in vegetable crops. Planta 2023; 258:15. [PMID: 37311932 DOI: 10.1007/s00425-023-04176-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 06/01/2023] [Indexed: 06/15/2023]
Abstract
MAIN CONCLUSION: The present review gives a comprehensive overview of allele mining for genetic improvement in vegetable crops, of allele exploration methods, and of their use in pre-breeding for economically important traits. Vegetable crops have numerous wild descendants, ancestors and terrestrial races that could be exploited to develop high-yielding and climate-resilient varieties resistant/tolerant to biotic and abiotic stresses. To further boost the genetic potential of economic traits, the available genomic tools must be targeted at the exploitation of novel alleles from genetic stocks, through the discovery of beneficial alleles in wild relatives and their introgression into cultivated types. This capability would give plant breeders direct access to critical alleles that confer higher production, improve bioactive compounds, and increase water and nutrient productivity as well as biotic and abiotic stress resilience. Allele mining is a new, sophisticated technique for dissecting naturally occurring allelic variants in candidate genes that influence important traits, and these variants could be used for genetic improvement of vegetable crops. Targeting Induced Local Lesions IN Genomes (TILLING) is a sensitive mutation detection avenue in functional genomics, particularly where genome sequence information is limited or not available. Population exposure to chemical mutagens and the absence of selectivity lead to TILLING and EcoTILLING. EcoTILLING may lead to natural induction of SNPs and InDels. It is anticipated that as TILLING is used for vegetable crop improvement in the near future, indirect benefits will become apparent. Therefore, in this review we highlight up-to-date information on allele mining for genetic enhancement in vegetable crops, on methods of allele exploration, and on their use in pre-breeding for improvement of economic traits.
Affiliation(s)
- Raman Selvakumar
- ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi, 110 012, India
- Gograj Singh Jat
- ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi, 110 012, India.
2
Efficient Two-Stage Analysis for Complex Trait Association with Arbitrary Depth Sequencing Data. Stats 2023. [DOI: 10.3390/stats6010029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2023] Open
Abstract
Sequencing-based genetic association analysis is typically performed by first generating genotype calls from sequence data and then performing association tests on the called genotypes. Standard approaches require accurate genotype calling (GC), which can be achieved either with high sequencing depth (typically available in a small number of individuals) or via computationally intensive multi-sample linkage disequilibrium (LD)-aware methods. We propose a computationally efficient two-stage combination approach for association analysis, in which single-nucleotide polymorphisms (SNPs) are screened in the first stage via a rapid maximum likelihood (ML)-based method on sequence data directly (without first calling genotypes), and then the selected SNPs are evaluated in the second stage by performing association tests on genotypes from multi-sample LD-aware calling. Extensive simulation- and real data-based studies show that the proposed two-stage approaches can save 80% of the computational costs and still obtain more than 90% of the power of the classical method to genotype all markers at various depths d≥2.
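The screen-then-test idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the function name `two_stage_screen`, the `keep_fraction` parameter, and the two callables are hypothetical and do not come from the paper, and the toy scores stand in for the ML-based screening statistic and the LD-aware association test.

```python
from typing import Callable, Sequence


def two_stage_screen(
    snp_ids: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_pvalue: Callable[[str], float],
    keep_fraction: float = 0.2,
) -> dict[str, float]:
    """Rank all SNPs with a cheap score, then run the costly test on the top fraction."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Stage 1: every SNP gets the cheap statistic (larger = more promising).
    ranked = sorted(snp_ids, key=cheap_score, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    selected = ranked[:n_keep]
    # Stage 2: only the selected SNPs pay for the expensive analysis.
    return {snp: expensive_pvalue(snp) for snp in selected}


# Toy usage with made-up scores; `calls` records how often the costly test runs.
toy_scores = {"snp1": 0.5, "snp2": 9.1, "snp3": 2.0, "snp4": 0.1, "snp5": 7.3}
calls = []


def toy_expensive_pvalue(snp_id: str) -> float:
    calls.append(snp_id)
    return 1.0 / (1.0 + toy_scores[snp_id])


result = two_stage_screen(
    list(toy_scores), toy_scores.__getitem__, toy_expensive_pvalue, keep_fraction=0.4
)
```

In this toy run only two of the five SNPs reach the second stage, which is where a cost saving of the kind reported in the abstract would come from.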
3
Muppidi P, Wright E, Wassmer SC, Gupta H. Diagnosis of cerebral malaria: Tools to reduce Plasmodium falciparum associated mortality. Front Cell Infect Microbiol 2023; 13:1090013. [PMID: 36844403 PMCID: PMC9947298 DOI: 10.3389/fcimb.2023.1090013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Accepted: 01/24/2023] [Indexed: 02/11/2023] Open
Abstract
Cerebral malaria (CM) is a major cause of mortality in Plasmodium falciparum (Pf) infection and is associated with the sequestration of parasitised erythrocytes in the microvasculature of the host's vital organs. Prompt diagnosis and treatment are key to a positive outcome in CM. However, current diagnostic tools remain inadequate to assess the degree of brain dysfunction associated with CM before the window for effective treatment closes. Several host and parasite factor-based biomarkers have been suggested as rapid diagnostic tools with potential for early CM diagnosis; however, no specific biomarker signature has been validated. Here, we provide an updated review of promising CM biomarker candidates and evaluate their applicability as point-of-care tools in malaria-endemic areas.
Affiliation(s)
- Pranavi Muppidi
- Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Emily Wright
- Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Samuel C. Wassmer
- Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Himanshu Gupta
- Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Department of Biotechnology, Institute of Applied Sciences & Humanities, GLA University, Mathura, UP, India
4
Chen L, Yang W, Li D, Ma Y, Chen L, You S, Liu S. Poly cytosine (C)/poly adenine (A) modified probe for signal "on-off-on" assay of single-base mismatched dsDNA by a competitive mechanism. Anal Chim Acta 2023; 1239:340705. [PMID: 36628713 DOI: 10.1016/j.aca.2022.340705] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/20/2022] [Revised: 12/01/2022] [Accepted: 12/03/2022] [Indexed: 12/12/2022]
Abstract
Direct discrimination of single-base mismatched dsDNA by a simple method or strategy would provide enormous opportunities for applications in the fields of life sciences and disease diagnosis. Herein, the peroxidase-mimicking activity of a metal-organic framework (MOF) nanoprobe was exploited for the direct discrimination of single-base mismatched dsDNA based on a competition-induced signal on-off-on mechanism. The single-base mismatched dsDNA related to the FecB gene (usually a guanine (G)/thymine (T) mismatch) and MIL-88B-NH2 were used as the target and the MOF model, respectively. Firstly, polyA/polyC were loosely adsorbed onto the MOFs via a weak interaction to block the peroxidase activity of the MOF, inducing the signal transition from on to off. Unexpectedly, the single-base mismatched (GT) dsDNA could reverse the signal response of the MOF probe from off to on. This reversal did not occur for other nonspecific mismatches, such as CT- and TT-mismatched dsDNA. A synergistic interaction mechanism between multiple GT mismatches and polyA/polyC was proposed to explain the competitive dissociation of polyA/polyC from the MOF and the recovery of peroxidase activity. With it, a wide linear detection range from 10⁻⁹ M to 10⁻⁵ M of GT-mismatched dsDNA and a low detection limit of 0.247 nM could be achieved, even in real samples. The effect of mismatched base number or position was also studied. Such a simple, rapid, cost-effective, one-step mixing and checking method for single-base mismatched dsDNA discrimination eliminates complex sample pretreatment, special DNA probe design, and exclusive amplification or signal readout means. It thus offers a simple and effective route for direct discrimination of mismatched dsDNA and might hold huge potential for applications in gene analysis, disease diagnosis, and elementary research in life sciences.
Affiliation(s)
- Lihua Chen
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China.
- Wenjie Yang
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China
- Dong Li
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China
- Yunkang Ma
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China
- Lili Chen
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China
- Shuang You
- Key Laboratory of Optic-electric Sensing and Analytical Chemistry for Life Science, Shandong Key Laboratory of Biochemical Analysis, Key Laboratory of Analytical Chemistry for Life Science in Universities of Shandong, Key Laboratory of Ecochemical Engineering, College of Chemistry and Molecular Engineering, Qingdao University of Science and Technology, Qingdao, 266042, PR China
- Shufeng Liu
- College of Chemistry and Chemical Engineering, Yantai University, Yantai, 264005, PR China.
5
Manjula G, Pranavchand R, Kumuda I, Reddy BS, Reddy BM. The SNP rs7865618 of 9p21.3 locus emerges as the most promising marker of coronary artery disease in the southern Indian population. Sci Rep 2020; 10:21511. [PMID: 33298998 PMCID: PMC7726101 DOI: 10.1038/s41598-020-77080-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/23/2020] [Accepted: 11/05/2020] [Indexed: 11/09/2022] Open
Abstract
Development of coronary artery disease (CAD) is primarily due to atherosclerosis; however, the prognosis of CAD depends on pleiotropic effects of the genes located in the 9p21.3 region. Genome-wide association studies revealed association of variants in this region with CAD pathology. However, a specific marker for predicting CAD development or progression has not yet been identified. In the present study, 35 SNPs in the 9p21.3 region, located in the cyclin-dependent kinase inhibitor (CDKN2A/CDKN2B) genes, were genotyped among 350 CAD cases and 480 controls from the southern Indian population of Hyderabad using the Fluidigm nanofluidic SNP genotyping system, and the data were analyzed using PLINK and R. Of the 35 SNPs analysed, only one, rs7865618, was significantly associated with CAD after correction for multiple testing (p = 0.008). The AG and GG genotypes of this SNP conferred 3.08-fold and 1.93-fold increased risk of CAD, respectively. In particular, this SNP was significantly associated with severe anatomic (triple vessel disease, p = 0.023) and phenotypic (acute coronary syndrome, p = 0.007) categories of CAD. Pairwise SNP interaction analysis between the SNPs of the 9p21.3 and 11q23.3 regions revealed significantly increased risk for three SNPs of the 11q23.3 region that were not associated individually, when taken in conjunction with rs7865618 of 9p21.3.
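Fold-risk figures and multiple-testing corrections of the kind quoted in this abstract are commonly computed as an odds ratio from a 2×2 table, together with an adjusted p-value. The sketch below is a generic illustration only: the counts are invented, the abstract does not say which correction the authors used, and Bonferroni is assumed here purely as a familiar stand-in.

```python
import math


def odds_ratio(case_carrier: int, case_non: int, ctrl_carrier: int, ctrl_non: int):
    """Odds ratio with a Woolf (log-scale) 95% confidence interval from a 2x2 table."""
    estimate = (case_carrier * ctrl_non) / (case_non * ctrl_carrier)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(1 / case_carrier + 1 / case_non + 1 / ctrl_carrier + 1 / ctrl_non)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1 (assumed correction, not the paper's)."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 60 of 100 cases and 30 of 100 controls carry the genotype.
or_estimate, ci_low, ci_high = odds_ratio(60, 40, 30, 70)
# A raw p-value tested across 35 SNPs, matching the number of SNPs in the study.
adjusted = bonferroni(0.001, 35)
```

With these invented counts the odds ratio is 3.5; the interval and the adjusted p-value depend entirely on the assumed inputs and say nothing about the study's actual data.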
Affiliation(s)
- Gorre Manjula
- Department of Genetics, Osmania University, Hyderabad, India
- Irgam Kumuda
- Department of Genetics, Osmania University, Hyderabad, India
- B Sriteja Reddy
- Dr Pinnamaneni, Siddhartha Institute of Medical Sciences and Research Foundation, Vijayawada, India
- Battini Mohan Reddy
- Department of Genetics, Osmania University, Hyderabad, India; Molecular Anthropology Group, Indian Statistical Institute, Hyderabad, India; Emeritus Scientist (ICMR), Department of Genetics, Osmania University, Hyderabad, 500007, India.
6
Guo J, Khan J, Pradhan S, Shahi D, Khan N, Avci M, Mcbreen J, Harrison S, Brown-Guedira G, Murphy JP, Johnson J, Mergoum M, Esten Mason R, Ibrahim AMH, Sutton R, Griffey C, Babar MA. Multi-Trait Genomic Prediction of Yield-Related Traits in US Soft Wheat under Variable Water Regimes. Genes (Basel) 2020; 11:genes11111270. [PMID: 33126620 PMCID: PMC7716228 DOI: 10.3390/genes11111270] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/08/2020] [Revised: 10/23/2020] [Accepted: 10/26/2020] [Indexed: 11/16/2022] Open
Abstract
The performance of genomic prediction (GP) on genetically correlated traits can be improved through an interdependence multi-trait model under a multi-environment context. In this study, a panel of 237 soft facultative wheat (Triticum aestivum L.) lines was evaluated to compare single- and multi-trait models for predicting grain yield (GY), harvest index (HI), spike fertility (SF), and thousand grain weight (TGW). The panel was phenotyped in two locations and two years in Florida under drought and moderate drought stress conditions, while the genotyping was performed using 27,957 genotyping-by-sequencing (GBS) single nucleotide polymorphism (SNP) markers. Five predictive models were compared: Multi-environment Genomic Best Linear Unbiased Predictor (MGBLUP), Bayesian Multi-trait Multi-environment (BMTME), Bayesian Multi-output Regressor Stacking (BMORS), Single-trait Multi-environment Deep Learning (SMDL), and Multi-trait Multi-environment Deep Learning (MMDL). Across environments, the multi-trait statistical model (BMTME) was superior to the multi-trait DL model for prediction accuracy in most scenarios, but the DL models were comparable to the statistical models for response to selection. The multi-trait model also showed 5 to 22% more genetic gain than the single-trait model across environments, as reflected by the response to selection. Overall, these results suggest that multi-trait genomic prediction can be an efficient strategy for economically important yield-component-related traits in soft wheat.
Affiliation(s)
- Jia Guo
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Jahangir Khan
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Sumit Pradhan
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Dipendra Shahi
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Naeem Khan
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Muhsin Avci
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Jordan Mcbreen
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
- Stephen Harrison
- School of Plant Environment and Soil Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Joseph Paul Murphy
- Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27607, USA
- Jerry Johnson
- Department of Crop and Soil Sciences, University of Georgia, Griffin, GA 32223, USA
- Mohamed Mergoum
- Department of Crop and Soil Sciences, University of Georgia, Griffin, GA 32223, USA
- Richard Esten Mason
- Department of Crop Soil and Environmental Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Amir M. H. Ibrahim
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Russel Sutton
- Department of Soil and Crop Sciences, Texas A&M University, College Station, TX 77843, USA
- Carl Griffey
- School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Md Ali Babar
- Department of Agronomy, University of Florida, Gainesville, FL 32611, USA
7
Guo J, Pradhan S, Shahi D, Khan J, Mcbreen J, Bai G, Murphy JP, Babar MA. Increased Prediction Accuracy Using Combined Genomic Information and Physiological Traits in A Soft Wheat Panel Evaluated in Multi-Environments. Sci Rep 2020; 10:7023. [PMID: 32341406 PMCID: PMC7184575 DOI: 10.1038/s41598-020-63919-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/17/2019] [Accepted: 03/11/2020] [Indexed: 12/28/2022] Open
Abstract
Integrating field-based phenotypic and genomic data can potentially increase the genetic gain in wheat breeding for complex traits such as grain and biomass yield. To validate this hypothesis in empirical field experiments, we compared the prediction accuracy of multi-kernel physiological and genomic best linear unbiased prediction (BLUP) models with that of single-kernel physiological or genomic BLUP models for grain yield (GY), using a soft wheat population that was evaluated in four environments. The physiological data, including canopy temperature (CT), SPAD chlorophyll content (SPAD), membrane thermostability (MT), rate of senescence (RS), stay green trait (SGT), and NDVI values, were collected in four environments (2016, 2017, and 2018 at Citra, FL; 2017 at Quincy, FL). Using a genotyping-by-sequencing (GBS) approach, a total of 19,353 SNPs were generated and used to estimate prediction model accuracy. Prediction accuracies of grain yield evaluated in four environments improved when physiological traits and/or interaction effects (genotype × environment or physiology × environment) were included in the model compared to models with only genomic data. The proposed multi-kernel models that combined physiological and genomic data showed a 35 to 169% increase in prediction accuracy compared to models with only genomic data when heading date was used as a covariate. In general, a higher response to selection was captured by the model combining the effects of physiological and genotype × environment interaction than by other models. The results of this study support the integration of field-based physiological data into GY prediction to improve genetic gain from selection in soft wheat under a multi-environment context.
Affiliation(s)
- Jia Guo
- Department of Agronomy, University of Florida, Gainesville, FL, USA
- Sumit Pradhan
- Department of Agronomy, University of Florida, Gainesville, FL, USA
- Dipendra Shahi
- Department of Agronomy, University of Florida, Gainesville, FL, USA
- Jahangir Khan
- Department of Agronomy, University of Florida, Gainesville, FL, USA
- Jordan Mcbreen
- Department of Agronomy, University of Florida, Gainesville, FL, USA
- Guihua Bai
- USDA-ARS Central Small Grain Genotyping Lab, Manhattan, Kansas, USA
- J Paul Murphy
- Crop and Soil Sciences, North Carolina State University, Raleigh, North Carolina, USA
- Md Ali Babar
- Department of Agronomy, University of Florida, Gainesville, FL, USA.
8
Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat Commun 2019; 10:4872. [PMID: 31653862 PMCID: PMC6814777 DOI: 10.1038/s41467-019-12884-1] [Citation(s) in RCA: 84] [Impact Index Per Article: 16.8] [Received: 12/27/2018] [Accepted: 09/25/2019] [Indexed: 12/11/2022] Open
Abstract
It has been hypothesized that individually-rare hidden structural variants (SVs) could account for a significant fraction of variation in complex traits. Here we identified more than 20,000 euchromatic SVs from 14 Drosophila melanogaster genome assemblies, of which ~40% are invisible to high specificity short-read genotyping approaches. SVs are common, with 31.5% of diploid individuals harboring a SV in genes larger than 5kb, and 24% harboring multiple SVs in genes larger than 10kb. SV minor allele frequencies are rarer than amino acid polymorphisms, suggesting that SVs are more deleterious. We show that a number of functionally important genes harbor previously hidden structural variants likely to affect complex phenotypes. Furthermore, SVs are overrepresented in candidate genes associated with quantitative trait loci mapped using the Drosophila Synthetic Population Resource. We conclude that SVs are ubiquitous, frequently constitute a heterogeneous allelic series, and can act as rare alleles of large effect.
9
Obala J, Saxena RK, Singh VK, Kumar CVS, Saxena KB, Tongoona P, Sibiya J, Varshney RK. Development of sequence-based markers for seed protein content in pigeonpea. Mol Genet Genomics 2018; 294:57-68. [PMID: 30173295 DOI: 10.1007/s00438-018-1484-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 02/01/2018] [Accepted: 08/22/2018] [Indexed: 12/30/2022]
Abstract
Pigeonpea is an important source of dietary protein for over a billion people globally, but genetic enhancement of seed protein content (SPC) in the crop has long received limited attention. Genomics-assisted breeding would help accelerate genetic gain for SPC. However, neither genetic markers nor genes associated with this important trait have been identified in this crop. Therefore, the present study exploited whole genome re-sequencing (WGRS) data of four pigeonpea genotypes (~12X coverage) to identify sequence-based markers and associated candidate genes for SPC. By combining a common variant filtering strategy on available WGRS data with knowledge of gene functions in relation to SPC, 108 sequence variants from 57 genes were identified. These genes were assigned to 19 GO molecular function categories, with 56% belonging to only two categories. Furthermore, Sanger sequencing confirmed the presence of 75.4% of the variants in 37 genes. Out of 30 sequence variants converted into CAPS/dCAPS markers, 17 showed a high level of polymorphism between low and high SPC genotypes. An assay of 16 of the polymorphic CAPS/dCAPS markers on an F2 population of the cross ICP 5529 (high SPC) × ICP 11605 (low SPC) showed that four of the markers co-segregated significantly (P < 0.05) with SPC. In summary, four markers derived from mutations in four genes will be useful for enhancing/regulating SPC in pigeonpea crop improvement programs.
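A CAPS marker works because a sequence variant creates or destroys a restriction enzyme recognition site, so digested PCR products differ in size between alleles. The sketch below illustrates only that check; the helper names and the toy sequence are hypothetical, and EcoRI (GAATTC) is chosen purely as a familiar palindromic site, not because the study used it.

```python
def cut_sites(seq: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq, site = seq.upper(), site.upper()
    return [i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site]


def is_caps_candidate(ref: str, pos: int, alt_base: str, site: str) -> bool:
    """True if substituting `alt_base` at `pos` changes the set of recognition sites."""
    alt = ref[:pos] + alt_base + ref[pos + 1:]
    return cut_sites(ref, site) != cut_sites(alt, site)


ECORI = "GAATTC"  # palindromic, so scanning one strand is sufficient here
toy_ref = "ATCGGAATTCGGTA"  # invented amplicon with one EcoRI site at index 4

destroys_site = is_caps_candidate(toy_ref, 6, "G", ECORI)  # SNP inside the site
leaves_site = is_caps_candidate(toy_ref, 1, "C", ECORI)    # SNP outside the site
```

For a non-palindromic recognition sequence the reverse complement would also need to be scanned, which this sketch leaves out.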
Affiliation(s)
- Jimmy Obala
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India
- University of KwaZulu-Natal, African Center for Crop Improvement, Scottsville, Pietermaritzburg, 3209, South Africa
- Rachit K Saxena
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India
- Vikas K Singh
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India
- C V Sameer Kumar
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India
- K B Saxena
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India
- Pangirayi Tongoona
- University of KwaZulu-Natal, African Center for Crop Improvement, Scottsville, Pietermaritzburg, 3209, South Africa
- Julia Sibiya
- University of KwaZulu-Natal, African Center for Crop Improvement, Scottsville, Pietermaritzburg, 3209, South Africa
- Rajeev K Varshney
- International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, India.
10
Vo NS, Phan V. Leveraging known genomic variants to improve detection of variants, especially close-by Indels. Bioinformatics 2018; 34:2918-2926. [PMID: 29590294 DOI: 10.1093/bioinformatics/bty183] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/15/2017] [Accepted: 03/23/2018] [Indexed: 12/30/2022] Open
Abstract
Motivation: The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However, Indels, especially those close to each other, are still hard to detect accurately.
Results: We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our finding suggests that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.
Availability and implementation: Implementation can be found in our public code repository https://github.com/namsyvo/IVC.
Supplementary information: Supplementary data are available at Bioinformatics online.
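The meta-reference idea described in this abstract can be illustrated with a deliberately simplified sketch. It is not the IVC implementation: it handles only single-base known variants and ungapped placement, whereas the paper targets Indels, and the function names and toy sequences are hypothetical.

```python
def build_meta_reference(reference: str, known_snps: dict[int, str]) -> list[set[str]]:
    """Each position holds the reference base plus any known alternative base."""
    meta = [{base} for base in reference]
    for pos, alt in known_snps.items():
        meta[pos].add(alt)
    return meta


def best_ungapped_hit(read: str, meta: list[set[str]]) -> tuple[int, int]:
    """Return (offset, mismatches) for the best ungapped placement of the read."""
    best = (-1, len(read) + 1)
    for offset in range(len(meta) - len(read) + 1):
        # A read base counts as a mismatch only if no known allele explains it.
        mismatches = sum(base not in meta[offset + i] for i, base in enumerate(read))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


toy_reference = "ACGTACGTTA"  # invented sequence
toy_read = "TGCG"             # carries G where the reference has A at index 4

plain_hit = best_ungapped_hit(toy_read, build_meta_reference(toy_reference, {}))
meta_hit = best_ungapped_hit(toy_read, build_meta_reference(toy_reference, {4: "G"}))
```

In this toy case the read places at the same offset either way, but the known variant removes the one mismatch, which hints at why alignment to a meta-reference can help at low coverage.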
Affiliation(s)
- Nam S Vo
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Vinhthuy Phan
- Department of Computer Science, The University of Memphis, Memphis, TN, USA
11
Nielsen ES, Henriques R, Toonen RJ, Knapp ISS, Guo B, von der Heyden S. Complex signatures of genomic variation of two non-model marine species in a homogeneous environment. BMC Genomics 2018; 19:347. [PMID: 29743012 PMCID: PMC5944137 DOI: 10.1186/s12864-018-4721-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/21/2017] [Accepted: 04/23/2018] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND: Genomic tools are increasingly being used on non-model organisms to provide insights into population structure and variability, including signals of selection. However, most studies are carried out in regions with distinct environmental gradients or across large geographical areas, in which local adaptation is expected to occur. Therefore, the focus of this study is to characterize genomic variation and selective signals over short geographic areas within a largely homogeneous region. To assess adaptive signals between microhabitats within the rocky shore, we compared genomic variation between the Cape urchin (Parechinus angulosus), which is a low to mid-shore species, and the Granular limpet (Scutellastra granularis), a high shore specialist.
RESULTS: Using pooled restriction site associated DNA (RAD) sequencing, we described patterns of genomic variation and identified outlier loci in both species. We found relatively low numbers of outlier SNPs within each species, and identified outlier genes associated with different selective pressures than those previously identified in studies conducted over larger environmental gradients. The number of population-specific outlier loci differed between species, likely owing to differential selective pressures within the intertidal environment. Interestingly, the outlier loci were highly differentiated within the two northernmost populations for both species, suggesting that unique evolutionary forces are acting on marine invertebrates within this region.
CONCLUSIONS: Our study provides a background for comparative genomic studies focused on non-model species, as well as a baseline for the adaptive potential of marine invertebrates along the South African west coast. We also discuss the caveats associated with Pool-seq and potential biases of sequencing coverage on downstream genomic metrics. The findings provide evidence of species-specific selective pressures within a homogeneous environment, and suggest that selective forces acting on small scales are just as crucial to acknowledge as those acting on larger scales. As a whole, our findings imply that future population genomic studies should expand from focusing on model organisms and/or studying heterogeneous regions to better understand the evolutionary processes shaping current and future biodiversity patterns, particularly when used in a comparative phylogeographic context.
Affiliation(s)
- Erica S Nielsen
- Evolutionary Genomics Group, Department of Botany and Zoology, University of Stellenbosch, Private Bag X1, Matieland, 7602, South Africa
- Romina Henriques
- Evolutionary Genomics Group, Department of Botany and Zoology, University of Stellenbosch, Private Bag X1, Matieland, 7602, South Africa
- Robert J Toonen
- Hawai'i Institute of Marine Biology, School of Ocean and Earth Science and Technology, University of Hawai'i at Mānoa, Kāne'ohe, HI, 96744, USA
- Ingrid S S Knapp
- Hawai'i Institute of Marine Biology, School of Ocean and Earth Science and Technology, University of Hawai'i at Mānoa, Kāne'ohe, HI, 96744, USA
- Baocheng Guo
- The Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology Chinese Academy of Sciences, Beijing, 100101, China
- Sophie von der Heyden
- Evolutionary Genomics Group, Department of Botany and Zoology, University of Stellenbosch, Private Bag X1, Matieland, 7602, South Africa.
| |
Collapse
|
12
|
Kouprina N, Liskovykh M, Lee NCO, Noskov VN, Waterfall JJ, Walker RL, Meltzer PS, Topol EJ, Larionov V. Analysis of the 9p21.3 sequence associated with coronary artery disease reveals a tendency for duplication in a CAD patient. Oncotarget 2018; 9:15275-15291. [PMID: 29632643 PMCID: PMC5880603 DOI: 10.18632/oncotarget.24567] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/29/2017] [Accepted: 02/10/2018] [Indexed: 11/25/2022]
Abstract
Tandem segmental duplications (SDs) greater than 10 kb are widespread in complex genomes. They provide material for gene divergence and evolutionary adaptation, while formation of specific de novo SDs is a hallmark of cancer and some human diseases. Most SDs map to distinct genomic regions termed ‘duplication blocks’. The organization of SDs within these blocks is often poorly characterized because they are mosaics of ancestral duplicons juxtaposed with younger duplicons arising from more recent duplication events. Structural and functional analysis of SDs is further hampered because long repetitive DNA structures are underrepresented in existing BAC and YAC libraries. We applied Transformation-Associated Recombination (TAR) cloning, a versatile technique for large DNA manipulation, to selectively isolate the coronary artery disease (CAD) interval sequence within the 9p21.3 chromosome locus from a patient with coronary artery disease and from normal individuals. Four tandem head-to-tail duplicons, each ∼50 kb long, were recovered in the patient but not in normal individuals. Sequence analysis revealed that the repeats differed from each other by 10-15 SNPs and from the human genome reference sequence (version hg19) by 82 SNPs. SNP polymorphisms within the junctions between repeats allowed two junction types to be distinguished, Type 1 and Type 2, which were found at a 2:1 ratio. The junction sequences contained an Alu element, a sequence previously shown to play a role in duplication. Knowledge of structural variation in the CAD interval from more patients could help link this locus to cardiovascular disease susceptibility, and may be relevant to other cases of regional amplification, including cancer.
Affiliation(s)
- Natalay Kouprina
  - Developmental Therapeutics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Mikhail Liskovykh
  - Developmental Therapeutics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Nicholas C O Lee
  - Developmental Therapeutics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Vladimir N Noskov
  - Developmental Therapeutics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Robert L Walker
  - Genetics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Paul S Meltzer
  - Genetics Branch, National Cancer Institute, Bethesda, MD 20892, USA
- Eric J Topol
  - The Scripps Translational Science Institute, The Scripps Research Institute and Scripps Health, La Jolla, CA 92037, USA
- Vladimir Larionov
  - Developmental Therapeutics Branch, National Cancer Institute, Bethesda, MD 20892, USA
13
Lu J, Liu Y, Xu J, Mei Z, Shi Y, Liu P, He J, Wang X, Meng Y, Feng S, Shen C, Wang H. High-Density Genetic Map Construction and Stem Total Polysaccharide Content-Related QTL Exploration for Chinese Endemic Dendrobium (Orchidaceae). FRONTIERS IN PLANT SCIENCE 2018; 9:398. [PMID: 29636767 PMCID: PMC5880926 DOI: 10.3389/fpls.2018.00398] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 11/30/2017] [Accepted: 03/12/2018] [Indexed: 05/19/2023]
Abstract
Plants of the Dendrobium genus are orchids with not only ornamental value but also high medicinal value. To understand the genetic basis of variations in active ingredients of the stem total polysaccharide contents (STPCs) among different Dendrobium species, it is of paramount importance to understand the mechanism of STPC formation and identify genes affecting this process at the whole-genome level. Here, we report the first high-density single-nucleotide polymorphism (SNP) integrated genetic map with good genome coverage of Dendrobium. The specific-locus amplified fragment sequencing (SLAF-seq) technology led to identification of 7,013,400 SNPs from 1,503,626 high-quality SLAF markers from two parents (Dendrobium moniliforme ♀ × Dendrobium officinale ♂) and their interspecific F1 hybrid population. The final genetic map contained 8,573 SLAF markers, covering 19 linkage groups (LGs). This genetic map spanned a length of 2,737.49 cM, with an average distance between markers of 0.32 cM. In total, 5 quantitative trait loci (QTL) related to STPC were identified, 3 of which have candidate genes within the confidence intervals of these stable QTLs based on the D. officinale genome sequence. This study lays a foundation for the mapping of other medicinal-related traits and provides an important reference for the molecular breeding of these Chinese herbs.
Affiliation(s)
- Jiangjie Lu
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
  - State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China
  - *Correspondence: Jiangjie Lu
- Yuyang Liu
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
- Jing Xu
  - Center of Rare Plant Medicine Research of Zhejiang Province, Wuyi, China
  - Zhejiang ShouXianGu Pharmaceutical Co. Ltd., Wuyi, China
- Ziwei Mei
  - College of Pharmaceutical Science, Zhejiang Chinese Medical University, Hangzhou, China
- Yujun Shi
  - School of Foreign Languages, Zhejiang Gongshang University, Hangzhou, China
- Pengli Liu
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
- Jianbo He
  - Soybean Research Institute, Nanjing Agricultural University, Nanjing, China
- Xiaotong Wang
  - Center of Rare Plant Medicine Research of Zhejiang Province, Wuyi, China
  - Zhejiang ShouXianGu Pharmaceutical Co. Ltd., Wuyi, China
- Yijun Meng
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
- Shangguo Feng
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
- Chenjia Shen
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
- Huizhong Wang
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, China
  - Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, China
  - Huizhong Wang
14
Gupta P, Reddaiah B, Salava H, Upadhyaya P, Tyagi K, Sarma S, Datta S, Malhotra B, Thomas S, Sunkum A, Devulapalli S, Till BJ, Sreelakshmi Y, Sharma R. Next-generation sequencing (NGS)-based identification of induced mutations in a doubly mutagenized tomato (Solanum lycopersicum) population. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2017; 92:495-508. [PMID: 28779536 DOI: 10.1111/tpj.13654] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 05/23/2017] [Revised: 07/25/2017] [Accepted: 07/26/2017] [Indexed: 05/21/2023]
Abstract
The identification of mutations in targeted genes has been significantly simplified by the advent of TILLING (Targeting Induced Local Lesions In Genomes), speeding up the functional genomic analysis of animals and plants. Next-generation sequencing (NGS) is gradually replacing classical TILLING for mutation detection, as it allows the analysis of a large number of amplicons in a short time. The NGS approach was used to identify mutations in a population of Solanum lycopersicum (tomato) that was doubly mutagenized by ethylmethane sulphonate (EMS). Twenty-five genes involved in carotenoid and folate metabolism were PCR-amplified and screened to identify potentially beneficial alleles. To augment efficiency, the 600-bp amplicons were directly sequenced in a non-overlapping manner on an Illumina MiSeq, obviating the need for a fragmentation step before library preparation. A comparison of the different pooling depths revealed that heterozygous mutations could be identified up to 128-fold pooling. An evaluation of six different software programs (camba, crisp, gatk unified genotyper, lofreq, snver and vipr) revealed that no software program was robust enough to predict mutations with high fidelity. Among these, crisp and camba predicted mutations with lower false discovery rates. The false positives were largely eliminated by considering only mutations commonly predicted by two different software programs. The screening of 23.47 Mb of tomato genome yielded 75 predicted mutations, 64 of which were confirmed by Sanger sequencing, giving an average mutation density of 1/367 kb. Our results indicate that NGS combined with multiple variant detection tools can reduce false positives and significantly speed up the mutation discovery rate.
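The pooling and density figures quoted in this abstract follow from simple arithmetic. The short Python sketch below reproduces them under two assumptions that the abstract does not state explicitly: plants are diploid and contribute equal amounts of DNA to a pool, and the density is computed over the 64 Sanger-confirmed mutations. The function names are illustrative only.

```python
def pooled_het_allele_fraction(pool_size: int, ploidy: int = 2) -> float:
    """Expected fraction of reads carrying a mutation that is heterozygous
    in exactly one individual of an equimolar pool (assumed model)."""
    return 1.0 / (ploidy * pool_size)


def mutation_density_kb(screened_mb: float, confirmed: int) -> float:
    """Average number of kilobases screened per confirmed mutation."""
    return screened_mb * 1000.0 / confirmed


if __name__ == "__main__":
    frac = pooled_het_allele_fraction(128)
    print(f"Expected allele fraction at 128-fold pooling: {frac:.4%}")
    print(f"One mutation per {mutation_density_kb(23.47, 64):.0f} kb")
```

At 128-fold pooling the mutant allele is expected in only 1 of 256 chromosomes (about 0.39% of reads), which is close to typical sequencing error rates and may help explain why intersecting the calls of two programs removed most false positives.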
Affiliation(s)
- Prateek Gupta
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Bodanapu Reddaiah
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Hymavathi Salava
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Pallawi Upadhyaya
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Kamal Tyagi
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Supriya Sarma
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sneha Datta
  - Plant Breeding and Genetics Laboratory, IAEA Seibersdorf Laboratories, Reaktorstrasse 1, Seibersdorf, Austria
- Bharti Malhotra
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sherinmol Thomas
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Anusha Sunkum
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sameera Devulapalli
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Bradley John Till
  - Plant Breeding and Genetics Laboratory, IAEA Seibersdorf Laboratories, Reaktorstrasse 1, Seibersdorf, Austria
- Yellamaraju Sreelakshmi
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Rameshwar Sharma
  - Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
15
Xiao S, Wang P, Dong L, Zhang Y, Han Z, Wang Q, Wang Z. Whole-genome single-nucleotide polymorphism (SNP) marker discovery and association analysis with the eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) content in Larimichthys crocea. PeerJ 2016; 4:e2664. [PMID: 28028455 PMCID: PMC5180582 DOI: 10.7717/peerj.2664] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/09/2016] [Accepted: 10/07/2016] [Indexed: 12/30/2022]
Abstract
Whole-genome single-nucleotide polymorphism (SNP) markers are valuable genetic resources for association and conservation studies. Genome-wide SNP development in many teleost species is still challenging because of genome complexity and the cost of re-sequencing. Genotyping-By-Sequencing (GBS) provides an efficient reduced-representation method that lowers the cost of SNP detection; however, most recent GBS applications have been reported in plants. In this work, we applied an EcoRI-NlaIII based GBS protocol to a teleost, the large yellow croaker, an important commercial fish in China and East Asia, and report the first whole-genome SNP development for the species. A total of 69,845 high-quality SNP markers, evenly distributed along the genome, were detected in at least 80% of 500 individuals. Nearly 95% of randomly selected genotypes were successfully validated by Sequenom MassARRAY assay. Association studies with muscle eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) content discovered 39 significant SNP markers, contributing up to ∼63% of the genetic variance explained by all markers. Functional genes involved in the fat digestion and absorption pathway were identified, such as APOB, CRAT and OSBPL10. Notably, the PPT2 gene, previously identified in an association study of plasma n-3 and n-6 polyunsaturated fatty acid levels in humans, was re-discovered in large yellow croaker. Our study verified that EcoRI-NlaIII based GBS can produce quality SNP markers in a cost-efficient manner in teleost genomes. The developed SNP markers and the EPA- and DHA-associated SNP loci provide invaluable resources for studies of population structure, conservation genetics and genomic selection in large yellow croaker and other fish.
Affiliation(s)
- Shijun Xiao
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Panpan Wang
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Linsong Dong
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Yaguang Zhang
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Zhaofang Han
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Qiurong Wang
  - Fisheries College, Jimei University, Xiamen, Fujian, China
- Zhiyong Wang
  - Fisheries College, Jimei University, Xiamen, Fujian, China
16
Hoffberg SL, Kieran TJ, Catchen JM, Devault A, Faircloth BC, Mauricio R, Glenn TC. RADcap: sequence capture of dual-digest RADseq libraries with identifiable duplicates and reduced missing data. Mol Ecol Resour 2016; 16:1264-78. [DOI: 10.1111/1755-0998.12566] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 04/19/2016] [Revised: 07/06/2016] [Accepted: 07/11/2016] [Indexed: 12/21/2022]
Affiliation(s)
- Troy J. Kieran
  - Department of Environmental Health Science, University of Georgia, Athens, GA 30602, USA
- Julian M. Catchen
  - Department of Animal Biology, University of Illinois, Urbana, IL 61801, USA
- Brant C. Faircloth
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA
- Rodney Mauricio
  - Department of Genetics, University of Georgia, Athens, GA 30602, USA
- Travis C. Glenn
  - Department of Genetics, University of Georgia, Athens, GA 30602, USA
  - Department of Environmental Health Science, University of Georgia, Athens, GA 30602, USA
17
Greenwood JM, Ezquerra AL, Behrens S, Branca A, Mallet L. Current analysis of host–parasite interactions with a focus on next generation sequencing data. ZOOLOGY 2016; 119:298-306. [DOI: 10.1016/j.zool.2016.06.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/07/2015] [Revised: 06/22/2016] [Accepted: 06/22/2016] [Indexed: 01/21/2023]
18
Shin S, Park J. Characterization of sequence-specific errors in various next-generation sequencing systems. MOLECULAR BIOSYSTEMS 2016; 12:914-22. [DOI: 10.1039/c5mb00750j] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 12/12/2022]
Abstract
Next-generation sequencing (NGS) is a powerful method for functional microbial ecology in a variety of environments, including the human body. In this work, novel sequence-specific errors (SSEs) from currently popular NGS systems and their hotspots were discovered, providing a scientific basis for filtering poor-quality sequence reads from the different NGS systems.
Affiliation(s)
- Sunguk Shin
  - Department of Civil and Environmental Engineering, Yonsei University, Seoul, Republic of Korea
- Joonhong Park
  - Department of Civil and Environmental Engineering, Yonsei University, Seoul, Republic of Korea
19
Kim W. Transmission Disequilibrium Tests Based on Read Counts for Low-Coverage Next-Generation Sequence Data. Hum Hered 2015; 80:36-49. [PMID: 26278553 DOI: 10.1159/000434645] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/06/2015] [Accepted: 05/30/2015] [Indexed: 11/19/2022]
Abstract
This paper introduces new statistical methods for case-parent trio association studies based on the read counts that can be obtained from next-generation sequencing (NGS) experiments. The work focuses on including low-coverage data in the case-parent trio design without genotype classification or imputation. Two different approaches are considered: (1) a likelihood-based approach implementing a 15-component parametric mixture model and (2) a model-free approach that applies non-parametric statistical methods to the ratios of the read counts to coverage. Simulation studies are conducted to evaluate the performance of the proposed tests. In addition, the non-centrality parameters of the mixture likelihood-based tests are derived to determine sample sizes and coverage for an NGS experimental design. As an example, the sample sizes needed to maintain specified powers for a published adolescent idiopathic scoliosis (AIS) study are presented. The simulation results show that the tests using the genotypes classified by the maximum Bayesian posterior probability have significantly inflated type I error rates for low-coverage data. The tests using the posterior probabilities instead of the classified genotypes show lower power than the proposed tests. Generally, power for the likelihood-based approach is higher than that for the non-parametric ratio-based approach. For the AIS example, approximately 654 trios with 4× coverage are necessary to maintain 90% power when detecting an association of odds ratio 2 at a locus with a minor allele frequency of 0.35 at the significance level α = 5 × 10⁻⁸. By comparison, approximately 416 trios with 25× coverage are required to maintain the same power with the same settings. The R and C source code to calculate the proposed test statistics, the sample sizes and power can be obtained by contacting the author (wkim@cau.ac.kr).
Affiliation(s)
- Wonkuk Kim
  - Department of Applied Statistics, Chung-Ang University, Seoul, South Korea
20
Wang Y, Liu A, Mills JL, Boehnke M, Wilson AF, Bailey-Wilson JE, Xiong M, Wu CO, Fan R. Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 2015; 39:259-75. [PMID: 25809955 PMCID: PMC4443751 DOI: 10.1002/gepi.21895] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 08/01/2014] [Revised: 01/28/2015] [Accepted: 01/28/2015] [Indexed: 10/23/2022]
Abstract
In genetics, pleiotropy describes the genetic effect of a single gene on multiple phenotypic traits. A common approach is to analyze the phenotypic traits separately using univariate analyses and combine the test results through multiple comparisons. This approach may lead to low power. Multivariate functional linear models are developed to connect genetic variant data to multiple quantitative traits adjusting for covariates for a unified analysis. Three types of approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants in one genetic region. The approximate F-distribution tests provide much more significant results than those of F-tests of univariate analysis and optimal sequence kernel association test (SKAT-O). Extensive simulations were performed to evaluate the false positive rates and power performance of the proposed models and tests. We show that the approximate F-distribution tests control the type I error rates very well. Overall, simultaneous analysis of multiple traits can increase power performance compared to an individual test of each trait. The proposed methods were applied to analyze (1) four lipid traits in eight European cohorts, and (2) three biochemical traits in the Trinity Students Study. The approximate F-distribution tests provide much more significant results than those of F-tests of univariate analysis and SKAT-O for the three biochemical traits. The approximate F-distribution tests of the proposed functional linear models are more sensitive than those of the traditional multivariate linear models that in turn are more sensitive than SKAT-O in the univariate case. The analysis of the four lipid traits and the three biochemical traits detects more association than SKAT-O in the univariate case.
Affiliation(s)
- Yifan Wang
  - Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Aiyi Liu
  - Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- James L. Mills
  - Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Michael Boehnke
  - Department of Biostatistics, School of Public Health, The University of Michigan, Ann Arbor, Michigan, United States of America
- Alexander F. Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Joan E. Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Momiao Xiong
  - Human Genetics Center, University of Texas - Houston, Houston, Texas, United States of America
- Colin O. Wu
  - Office of Biostatistics Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Ruzong Fan
  - Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
21
Sampson J, Jacobs K, Yeager M, Chanock S, Chatterjee N. Efficient study design for next generation sequencing. Genet Epidemiol 2011; 35:269-77. [PMID: 21370254 DOI: 10.1002/gepi.20575] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/27/2010] [Revised: 12/24/2010] [Accepted: 01/12/2011] [Indexed: 01/23/2023]
Abstract
Next Generation Sequencing represents a powerful tool for detecting genetic variation associated with human disease. Because of the high cost of this technology, it is critical that we develop efficient study designs that consider the trade-off between the number of subjects (n) and the coverage depth (µ). How we divide our resources between the two can greatly impact study success, particularly in pilot studies. We propose a strategy for selecting the optimal combination of n and µ for studies aimed at detecting rare variants and for studies aimed at detecting associations between rare or uncommon variants and disease. For detecting rare variants, we find the optimal coverage depth to be between 2 and 8 reads when using the likelihood ratio test. For association studies, we find the strategy of sequencing all available subjects to be preferable. In deriving these combinations, we provide a detailed analysis describing the distribution of depth across a genome and the depth needed to identify a minor allele in an individual. The optimal coverage depth depends on the aims of the study, and the chosen depth can have a large impact on study success.
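To make the trade-off between the number of subjects and the coverage depth concrete, the Python sketch below uses a deliberately simple model. It is an assumption for illustration, not the authors' derivation: read depth at a site is Poisson with mean µ, each read samples either allele of a heterozygote with probability 1/2, and a variant is called when at least `min_reads` reads carry the minor allele. A fixed per-site read budget is then divided between n and µ. All names and the budget value are hypothetical.

```python
import math


def detect_prob(mu: float, min_reads: int = 2) -> float:
    """P(at least min_reads minor-allele reads) in a heterozygote.

    Assumes depth ~ Poisson(mu); thinning by 1/2 gives minor-allele
    reads ~ Poisson(mu / 2).
    """
    lam = mu / 2.0
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(min_reads))
    return 1.0 - below


def expected_carriers_detected(budget: float, mu: float, carrier_freq: float) -> float:
    """Expected number of carriers detected when a fixed per-site read
    budget is spent on n = budget / mu subjects at mean depth mu."""
    n = budget / mu
    return n * carrier_freq * detect_prob(mu)


if __name__ == "__main__":
    # Columns: mean depth, per-carrier detection probability, expected yield.
    for mu in (2, 4, 8, 16, 32):
        print(mu, round(detect_prob(mu), 3),
              round(expected_carriers_detected(3200, mu, 0.01), 2))
```

In this toy setting the detection probability per carrier rises with depth, but the expected number of carriers found falls at high depth because fewer subjects can be sequenced, which is the tension the abstract describes.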
Affiliation(s)
- Joshua Sampson
  - Biostatistics Branch, DCEG, National Cancer Institute, Rockville, MD 20852, USA
22
Bansal V, Libiger O. Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations. BMC Bioinformatics 2015; 16:4. [PMID: 25592880 PMCID: PMC4301802 DOI: 10.1186/s12859-014-0418-7] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 10/08/2014] [Accepted: 12/10/2014] [Indexed: 01/18/2023]
Abstract
BACKGROUND Estimation of individual ancestry from genetic data is useful for the analysis of disease association studies, understanding human population history and interpreting personal genomic variation. New, computationally efficient methods are needed for ancestry inference that can effectively utilize existing information about allele frequencies associated with different human populations and can work directly with DNA sequence reads. RESULTS We describe a fast method for estimating the relative contribution of known reference populations to an individual’s genetic ancestry. Our method utilizes allele frequencies from the reference populations and individual genotype or sequence data to obtain a maximum likelihood estimate of the global admixture proportions using the BFGS optimization algorithm. It accounts for the uncertainty in genotypes present in sequence data by using genotype likelihoods and does not require individual genotype data from external reference panels. Simulation studies and application of the method to real datasets demonstrate that our method is significantly faster than previous methods and has comparable accuracy. Using data from the 1000 Genomes project, we show that estimates of the genome-wide average ancestry for admixed individuals are consistent between exome sequence data and whole-genome low-coverage sequence data. Finally, we demonstrate that our method can be used to estimate admixture proportions using pooled sequence data, making it a valuable tool for controlling for population stratification in sequencing-based association studies that utilize DNA pooling. CONCLUSIONS Our method is an efficient and versatile tool for estimating ancestry from DNA sequence data and is available from https://sites.google.com/site/vibansal/software/iAdmix.
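The kind of likelihood being maximised can be illustrated with a small, self-contained Python sketch. It makes several simplifying assumptions that differ from the abstract: it takes called genotypes (0, 1 or 2 copies of the counted allele) rather than genotype likelihoods, and it uses a simple EM update in place of the BFGS optimiser. The function and variable names are hypothetical, and this is not the iAdmix implementation.

```python
def estimate_admixture(genotypes, ref_freqs, iterations=200):
    """Estimate admixture proportions q for one individual by EM.

    genotypes : list of allele counts (0, 1 or 2) per SNP
    ref_freqs : ref_freqs[k][j] = frequency of the counted allele at
                SNP j in reference population k
    Model (assumed): genotype j ~ Binomial(2, sum_k q[k] * ref_freqs[k][j]).
    """
    n_pops, n_snps = len(ref_freqs), len(genotypes)
    q = [1.0 / n_pops] * n_pops
    eps = 1e-9  # keeps mixture frequencies away from 0 and 1
    for _ in range(iterations):
        new_q = [0.0] * n_pops
        for j, g in enumerate(genotypes):
            p = sum(q[k] * ref_freqs[k][j] for k in range(n_pops))
            p = min(max(p, eps), 1.0 - eps)
            for k in range(n_pops):
                f = ref_freqs[k][j]
                # Posterior share of population k for each allele copy.
                new_q[k] += g * q[k] * f / p + (2 - g) * q[k] * (1.0 - f) / (1.0 - p)
        q = [value / (2.0 * n_snps) for value in new_q]
    return q


if __name__ == "__main__":
    # Two hypothetical reference populations with contrasting frequencies.
    freqs = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
    print(estimate_admixture([2, 2, 0, 0], freqs))
```

Only population-level allele frequencies enter the calculation, which mirrors the abstract's point that individual genotype data from external reference panels are not required.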
Affiliation(s)
- Vikas Bansal
  - Department of Pediatrics, University of California San Diego, 9500 Gilman Drive, La Jolla, 92093, CA, USA
  - Scripps Translational Science Institute, 3344 N Torrey Pines Court, La Jolla, 92037, CA, USA
- Ondrej Libiger
  - Scripps Translational Science Institute, 3344 N Torrey Pines Court, La Jolla, 92037, CA, USA
  - Current address: MD Revolution, San Diego, CA, USA
23
Fan R, Wang Y, Mills JL, Carter TC, Lobach I, Wilson AF, Bailey-Wilson JE, Weeks DE, Xiong M. Generalized functional linear models for gene-based case-control association studies. Genet Epidemiol 2014; 38:622-637. [PMID: 25203683 PMCID: PMC4189986 DOI: 10.1002/gepi.21840] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 01/14/2014] [Revised: 04/29/2014] [Accepted: 05/28/2014] [Indexed: 01/23/2023]
Abstract
By using functional data analysis techniques, we developed generalized functional linear models for testing association between a dichotomous trait and multiple genetic variants in a genetic region while adjusting for covariates. Both fixed and mixed effect models are developed and compared. Extensive simulations show that Rao's efficient score tests of the fixed effect models are very conservative since they generate lower type I errors than nominal levels, and global tests of the mixed effect models generate accurate type I errors. Furthermore, we found that the Rao's efficient score test statistics of the fixed effect models have higher power than the sequence kernel association test (SKAT) and its optimal unified version (SKAT-O) in most cases when the causal variants are both rare and common. When the causal variants are all rare (i.e., minor allele frequencies less than 0.03), the Rao's efficient score test statistics and the global tests have similar or slightly lower power than SKAT and SKAT-O. In practice, it is not known whether rare variants or common variants in a gene region are disease related. All we can assume is that a combination of rare and common variants influences disease susceptibility. Thus, the improved performance of our models when the causal variants are both rare and common shows that the proposed models can be very useful in dissecting complex traits. We compare the performance of our methods with SKAT and SKAT-O on real neural tube defects and Hirschsprung's disease datasets. The Rao's efficient score test statistics and the global tests are more sensitive than SKAT and SKAT-O in the real data analysis. Our methods can be used in either gene-disease genome-wide/exome-wide association studies or candidate gene analyses.
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- Yifan Wang
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- James L. Mills
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- Tonia C. Carter
- Center for Human Genetics, Marshfield Clinic, Marshfield, WI 54449
- Iryna Lobach
- Department of Neurology, School of Medicine, University of California, San Francisco, CA 94185
- Alexander F. Wilson
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892
- Joan E. Bailey-Wilson
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892
- Daniel E. Weeks
- Departments of Human Genetics and Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261
- Momiao Xiong
- Human Genetics Center, University of Texas - Houston, P.O. Box 20334, Houston, Texas 77225
24
Saad M, Wijsman EM. Combining family- and population-based imputation data for association analysis of rare and common variants in large pedigrees. Genet Epidemiol 2014; 38:579-90. [PMID: 25132070 PMCID: PMC4190076 DOI: 10.1002/gepi.21844] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 02/28/2014] [Revised: 05/24/2014] [Accepted: 06/27/2014] [Indexed: 12/27/2022]
Abstract
In the last two decades, complex traits have become the main focus of genetic studies. The hypothesis that both rare and common variants are associated with complex traits is increasingly being discussed. Family-based association studies using relatively large pedigrees are suitable for both rare and common variant identification. Because of the high cost of sequencing technologies, imputation methods are important for increasing the amount of information at low cost. A recent family-based imputation method, Genotype Imputation Given Inheritance (GIGI), is able to handle large pedigrees and accurately impute rare variants, but does less well for common variants where population-based methods perform better. Here, we propose a flexible approach to combine imputation data from both family- and population-based methods. We also extend the Sequence Kernel Association Test for Rare and Common variants (SKAT-RC), originally proposed for data from unrelated subjects, to family data in order to make use of such imputed data. We call this extension "famSKAT-RC." We compare the performance of famSKAT-RC and several other existing burden and kernel association tests. In simulated pedigree sequence data, our results show an increase of imputation accuracy from use of our combining approach. Also, they show an increase of power of the association tests with this approach over the use of either family- or population-based imputation methods alone, in the context of rare and common variants. Moreover, our results show better performance of famSKAT-RC compared to the other considered tests, in most scenarios investigated here.
Affiliation(s)
- Mohamad Saad
- Division of Medical Genetics, Department of Medicine; and Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Ellen M. Wijsman
- Division of Medical Genetics, Department of Medicine; and Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
25
Abstract
Background: High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Selecting variants that are diagnostic is commonly done using diversity statistics like FST, but these measures are not ideal for the task. Results: Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms. Conclusion: The expected information content gives an easy-to-interpret measure for the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
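One way to picture "expected information gained from observing a variant site" is as the mutual information between a sample's population of origin and the allele observed at that site. The Python sketch below assumes a biallelic site, two populations with equal prior probability, and known allele frequencies. The function names are hypothetical, and this illustrates the general idea only; it is not the estimator developed in the paper, which additionally accounts for sampling bias and sequencing error.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information(freq_a: float, freq_b: float) -> float:
    """Mutual information (bits) between population label and one observed allele.

    Assumption (not from the source): two populations with equal priors and
    known frequencies freq_a, freq_b of the same allele in each population.
    """
    mixed = 0.5 * (freq_a + freq_b)  # allele frequency in the pooled sample
    return binary_entropy(mixed) - 0.5 * (binary_entropy(freq_a) + binary_entropy(freq_b))


if __name__ == "__main__":
    # A fixed difference is fully diagnostic; identical frequencies carry no information.
    print(expected_information(1.0, 0.0))
    print(expected_information(0.3, 0.3))
    print(expected_information(0.9, 0.2))
```

Unlike a raw frequency difference, this quantity is bounded between 0 and 1 bit per observed allele, which is one reason an information-based score is easy to interpret.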
26
Sun L, Liu S, Wang R, Jiang Y, Zhang Y, Zhang J, Bao L, Kaltenboeck L, Dunham R, Waldbieser G, Liu Z. Identification and analysis of genome-wide SNPs provide insight into signatures of selection and domestication in channel catfish (Ictalurus punctatus). PLoS One 2014; 9:e109666. [PMID: 25313648 PMCID: PMC4196944 DOI: 10.1371/journal.pone.0109666] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 03/15/2014] [Accepted: 09/02/2014] [Indexed: 12/28/2022]
Abstract
Domestication and selection for important performance traits can impact the genome, which is most often reflected by reduced heterozygosity in and surrounding genes related to traits affected by selection. In this study, analysis of the genomic impact caused by domestication and artificial selection was conducted by investigating the signatures of selection using single nucleotide polymorphisms (SNPs) in channel catfish (Ictalurus punctatus). A total of 8.4 million candidate SNPs were identified by using next generation sequencing. On average, the channel catfish genome harbors one SNP per 116 bp. Approximately 6.6 million, 5.3 million, 4.9 million, 7.1 million and 6.7 million SNPs were detected in the Marion, Thompson, USDA103, Hatchery strain, and wild population, respectively. The allele frequencies of 407,861 SNPs differed significantly between the domestic and wild populations. With these SNPs, 23 genomic regions with putative selective sweeps were identified that included 11 genes. Although the function of the majority of the genes remains unknown in catfish, several genes with known function related to aquaculture performance traits were included in the regions with selective sweeps. These included hypoxia-inducible factor 1β (HIF1β) and the transporter gene ATP-binding cassette sub-family B member 5 (ABCB5). HIF1β is important for response to hypoxia, and tolerance to low oxygen levels is a critical aquaculture trait. The large numbers of SNPs identified from this study are valuable for the development of high-density SNP arrays for genetic and genomic studies of performance traits in catfish.
Affiliation(s)
- Luyang Sun
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Shikai Liu
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Ruijia Wang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Yanliang Jiang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Yu Zhang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Jiaren Zhang
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Lisui Bao
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Ludmilla Kaltenboeck
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Rex Dunham
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
- Geoff Waldbieser
- USDA-ARS Warmwater Aquaculture Research Unit, Stoneville, Mississippi, United States of America
- Zhanjiang Liu
- The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, Auburn, Alabama, United States of America
27
Bianco L, Cestaro A, Sargent DJ, Banchi E, Derdak S, Di Guardo M, Salvi S, Jansen J, Viola R, Gut I, Laurens F, Chagné D, Velasco R, van de Weg E, Troggio M. Development and validation of a 20K single nucleotide polymorphism (SNP) whole genome genotyping array for apple (Malus × domestica Borkh). PLoS One 2014; 9:e110377. [PMID: 25303088 PMCID: PMC4193858 DOI: 10.1371/journal.pone.0110377] [Citation(s) in RCA: 101] [Impact Index Per Article: 10.1] [Received: 05/09/2014] [Accepted: 09/12/2014] [Indexed: 01/08/2023]
Abstract
High-density SNP arrays for genome-wide assessment of allelic variation have made high resolution genetic characterization of crop germplasm feasible. A medium density array for apple, the IRSC 8K SNP array, has been successfully developed and used for screens of bi-parental populations. However, the number of robust and well-distributed markers contained on this array was not sufficient to perform genome-wide association analyses in wider germplasm sets, or Pedigree-Based Analysis at high precision, because of rapid decay of linkage disequilibrium. We describe the development of an Illumina Infinium array targeting 20K SNPs. The SNPs were predicted from re-sequencing data derived from the genomes of 13 Malus × domestica apple cultivars and one accession belonging to a crab apple species (M. micromalus). A pipeline for SNP selection was devised that avoided the pitfalls associated with the inclusion of paralogous sequence variants, supported the construction of robust multi-allelic SNP haploblocks and selected up to 11 entries within narrow genomic regions of ±5 kb, termed focal points (FPs). Broad genome coverage was attained by placing FPs at 1 cM intervals on a consensus genetic map, complementing them with FPs to enrich the ends of each of the chromosomes, and by bridging physical intervals greater than 400 Kbps. The selection also included ∼3.7K validated SNPs from the IRSC 8K array. The array has already been used in other studies where ∼15.8K SNP markers were mapped with an average of ∼6.8K SNPs per full-sib family. The newly developed array with its high density of polymorphic validated SNPs is expected to be of great utility for Pedigree-Based Analysis and Genomic Selection. It will also be a valuable tool to help dissect the genetic mechanisms controlling important fruit quality traits, and to aid the identification of marker-trait associations suitable for the application of Marker Assisted Selection in apple breeding programs.
Affiliation(s)
- Luca Bianco
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Alessandro Cestaro
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Daniel James Sargent
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Elisa Banchi
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Sophia Derdak
- CNAG – Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, Spain
- Mario Di Guardo
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Wageningen UR Plant Breeding, Wageningen University and Research Centre, Wageningen, The Netherlands
- Johannes Jansen
- Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands
- Roberto Viola
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Ivo Gut
- CNAG – Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, Spain
- Francois Laurens
- INRA, UMR1345 Institut de Recherche en Horticulture and Semences, Beaucouzé, France
- David Chagné
- Plant & Food Research, Palmerston North Research Centre, Palmerston North, New Zealand
- Riccardo Velasco
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Eric van de Weg
- Wageningen UR Plant Breeding, Wageningen University and Research Centre, Wageningen, The Netherlands
- Michela Troggio
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
28
Salari R, Saleh SS, Kashef-Haghighi D, Khavari D, Newburger DE, West RB, Sidow A, Batzoglou S. Inference of tumor phylogenies with improved somatic mutation discovery. J Comput Biol 2014; 20:933-44. [PMID: 24195709 DOI: 10.1089/cmb.2013.0106] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 12/30/2022]
Abstract
Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.
Affiliation(s)
- Raheleh Salari
- Department of Computer Science, Stanford University, Stanford, California
29
Liu D, Ma C, Hong W, Huang L, Liu M, Liu H, Zeng H, Deng D, Xin H, Song J, Xu C, Sun X, Hou X, Wang X, Zheng H. Construction and analysis of high-density linkage map using high-throughput sequencing data. PLoS One 2014; 9:e98855. [PMID: 24905985 PMCID: PMC4048240 DOI: 10.1371/journal.pone.0098855] [Citation(s) in RCA: 180] [Impact Index Per Article: 18.0] [Received: 01/06/2014] [Accepted: 05/08/2014] [Indexed: 12/31/2022]
Abstract
Linkage maps enable the study of important biological questions. The construction of high-density linkage maps appears more feasible since the advent of next-generation sequencing (NGS), which eases SNP discovery and high-throughput genotyping of large populations. However, the marker number explosion and genotyping errors from NGS data challenge the computational efficiency and linkage map quality of linkage study methods. Here we report the HighMap method for constructing high-density linkage maps from NGS data. HighMap employs an iterative ordering and error correction strategy based on a k-nearest neighbor algorithm and a Monte Carlo multipoint maximum likelihood algorithm. A simulation study shows HighMap can create a linkage map with three times as many markers as ordering-only methods while offering more accurate marker orders and stable genetic distances. Using HighMap, we constructed a common carp linkage map with 10,004 markers. The singleton rate was less than one-ninth of that generated by JoinMap4.1. Its total map distance was 5,908 cM, consistent with reports on low-density maps. HighMap is an efficient method for constructing high-density, high-quality linkage maps from high-throughput population NGS data. It will facilitate genome assembling, comparative genomic analysis, and QTL studies. HighMap is available at http://highmap.biomarker.com.cn/.
Affiliation(s)
- Dongyuan Liu
- Biomarker Technologies Corporation, Beijing, China
- Chouxian Ma
- Biomarker Technologies Corporation, Beijing, China
- Weiguo Hong
- Biomarker Technologies Corporation, Beijing, China
- Long Huang
- Biomarker Technologies Corporation, Beijing, China
- Min Liu
- Biomarker Technologies Corporation, Beijing, China
- Hui Liu
- Biomarker Technologies Corporation, Beijing, China
- Huaping Zeng
- Biomarker Technologies Corporation, Beijing, China
- Dejing Deng
- Biomarker Technologies Corporation, Beijing, China
- Huaigen Xin
- Biomarker Technologies Corporation, Beijing, China
- Jun Song
- Biomarker Technologies Corporation, Beijing, China
- Chunhua Xu
- Biomarker Technologies Corporation, Beijing, China
- Xiaowen Sun
- Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Xilin Hou
- State Key Laboratory of Crop Genetic and Germplasm Enhancement, Key Laboratory of Biology and Germplasm Enhancement of Horticultural Crops in East China, Ministry of Agriculture, Nanjing Agricultural University, Nanjing, China
- Xiaowu Wang
- Biomarker Technologies Corporation, Beijing, China
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences (IVF, CAAS), Beijing, China
- Hongkun Zheng
- Biomarker Technologies Corporation, Beijing, China
30
Fan R, Wang Y, Mills JL, Wilson AF, Bailey-Wilson JE, Xiong M. Functional linear models for association analysis of quantitative traits. Genet Epidemiol 2014; 37:726-42. [PMID: 24130119 DOI: 10.1002/gepi.21757] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 04/24/2013] [Revised: 07/15/2013] [Accepted: 08/14/2013] [Indexed: 12/19/2022]
Abstract
Functional linear models are developed in this paper for testing associations between quantitative traits and genetic variants, which can be rare variants or common variants or the combination of the two. By treating multiple genetic variants of an individual in a human population as a realization of a stochastic process, the genome of an individual in a chromosome region is a continuum of sequence data rather than discrete observations. The genome of an individual is viewed as a stochastic function that contains both linkage and linkage disequilibrium (LD) information of the genetic markers. By using techniques of functional data analysis, both fixed and mixed effect functional linear models are built to test the association between quantitative traits and genetic variants adjusting for covariates. After extensive simulation analysis, it is shown that the F-distributed tests of the proposed fixed effect functional linear models have higher power than that of sequence kernel association test (SKAT) and its optimal unified test (SKAT-O) for three scenarios in most cases: (1) the causal variants are all rare, (2) the causal variants are both rare and common, and (3) the causal variants are common. The superior performance of the fixed effect functional linear models is most likely due to its optimal utilization of both genetic linkage and LD information of multiple genetic variants in a genome and similarity among different individuals, while SKAT and SKAT-O only model the similarities and pairwise LD but do not model linkage and higher order LD information sufficiently. In addition, the proposed fixed effect models generate accurate type I error rates in simulation studies. We also show that the functional kernel score tests of the proposed mixed effect functional linear models are preferable in candidate gene analysis and small sample problems. The methods are applied to analyze three biochemical traits in data from the Trinity Students Study.
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, Maryland, United States of America
31
Hou Y, Fan W, Yan L, Li R, Lian Y, Huang J, Li J, Xu L, Tang F, Xie XS, Qiao J. Genome analyses of single human oocytes. Cell 2014; 155:1492-506. [PMID: 24360273 DOI: 10.1016/j.cell.2013.11.040] [Citation(s) in RCA: 224] [Impact Index Per Article: 22.4] [Received: 09/27/2013] [Revised: 10/31/2013] [Accepted: 11/25/2013] [Indexed: 11/16/2022]
Abstract
Single-cell genome analyses of human oocytes are important for meiosis research and preimplantation genomic screening. However, the nonuniformity of single-cell whole-genome amplification hindered its use. Here, we demonstrate genome analyses of single human oocytes using multiple annealing and looping-based amplification cycle (MALBAC)-based sequencing technology. By sequencing the triads of the first and second polar bodies (PB1 and PB2) and the oocyte pronuclei from the same female egg donors, we phase the genomes of these donors with detected SNPs and determine the crossover maps of their oocytes. Our data exhibit an expected crossover interference and indicate a weak chromatid interference. Further, the genome of the oocyte pronucleus, including information regarding aneuploidy and SNPs in disease-associated alleles, can be accurately deduced from the genomes of PB1 and PB2. The MALBAC-based preimplantation genomic screening in in vitro fertilization (IVF) enables accurate and cost-effective selection of normal fertilized eggs for embryo transfer.
Affiliation(s)
- Yu Hou
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Wei Fan
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China; Peking-Tsinghua Center for Life Science, Beijing 100084, China
- Liying Yan
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Rong Li
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Ying Lian
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Jin Huang
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Jinsen Li
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Liya Xu
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China
- Fuchou Tang
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China; Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing 100871, China
- X Sunney Xie
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Jie Qiao
- Biodynamic Optical Imaging Center, College of Life Sciences and Center for Reproductive Medicine, Third Hospital, Peking University, Beijing 100871, China; Key Laboratory of Assisted Reproduction, Ministry of Education and Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproductive Technology, Beijing 100191, China
32
Challenges in the Next-Generation Sequencing Field. Next Generation Sequencing Technologies and Challenges in Sequence Assembly 2014. [DOI: 10.1007/978-1-4939-0715-1_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/13/2022]
33
Durtschi J, Margraf RL, Coonrod EM, Mallempati KC, Voelkerding KV. VarBin, a novel method for classifying true and false positive variants in NGS data. BMC Bioinformatics 2013; 14 Suppl 13:S2. [PMID: 24266885 PMCID: PMC3849648 DOI: 10.1186/1471-2105-14-s13-s2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
Abstract
Background: Variant discovery for rare genetic diseases using Illumina genome or exome sequencing involves screening of up to millions of variants to find only the one or few causative variant(s). Sequencing or alignment errors create "false positive" variants, which are often retained in the variant screening process. Methods to remove false positive variants often retain many false positive variants. This report presents VarBin, a method to prioritize variants based on a false positive variant likelihood prediction. Methods: VarBin uses the Genome Analysis Toolkit variant calling software to calculate the variant-to-wild type genotype likelihood ratio at each variant change and position divided by read depth. The resulting Phred-scaled, likelihood-ratio by depth (PLRD) was used to segregate variants into 4 Bins with Bin 1 variants most likely true and Bin 4 most likely false positive. PLRD values were calculated for a proband of interest and 41 additional Illumina HiSeq, exome and whole genome samples (proband's family or unrelated samples). At variant sites without apparent sequencing or alignment error, wild type/non-variant calls cluster near -3 PLRD and variant calls typically cluster above 10 PLRD. Sites with systematic variant calling problems (evident by variant quality scores and biases as well as displayed on the IGV viewer) tend to have higher and more variable wild type/non-variant PLRD values. Depending on the separation of a proband's variant PLRD value from the cluster of wild type/non-variant PLRD values for background samples at the same variant change and position, the VarBin method's classification is assigned to each proband variant (Bin 1 to Bin 4). Results: To assess VarBin performance, Sanger sequencing was performed on 98 variants in the proband and background samples. True variants were confirmed in 97% of Bin 1 variants, 30% of Bin 2, and 0% of Bin 3/Bin 4. Conclusions: These data indicate that VarBin correctly classifies the majority of true variants as Bin 1 and Bin 3/4 contained only false positive variants. The "uncertain" Bin 2 contained both true and false positive variants. Future work will further differentiate the variants in Bin 2.
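The PLRD statistic described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `plrd` is hypothetical, the Phred scaling is taken to be 10 × log10 of the variant-to-wild-type likelihood ratio (a sign convention chosen so that variant calls score positive and wild-type calls negative, as the reported clusters suggest), and the Bin 1 to Bin 4 cut-offs are not given in the abstract and so are not reproduced.

```python
import math


def plrd(variant_likelihood: float, wildtype_likelihood: float, depth: int) -> float:
    """Phred-scaled variant-to-wild-type likelihood ratio divided by read depth.

    Assumed form (not confirmed by the source):
        PLRD = 10 * log10(L_variant / L_wildtype) / depth
    """
    if depth <= 0:
        raise ValueError("depth must be a positive read count")
    if variant_likelihood <= 0 or wildtype_likelihood <= 0:
        raise ValueError("likelihoods must be positive")
    phred_ratio = 10.0 * math.log10(variant_likelihood / wildtype_likelihood)
    return phred_ratio / depth


if __name__ == "__main__":
    # Illustrative likelihoods only, chosen to land on the clusters the abstract mentions.
    print(plrd(1e-3, 1e-33, depth=30))   # strongly supported variant call
    print(plrd(1e-12, 1e-3, depth=30))   # wild-type (non-variant) call
```

Dividing by depth makes the score comparable across sites with different coverage, which is what allows a proband's value to be compared against the background samples at the same position.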
34
Zeng F, Jiang R, Chen T. PyroHMMsnp: an SNP caller for Ion Torrent and 454 sequencing data. Nucleic Acids Res 2013; 41:e136. [PMID: 23700313 PMCID: PMC3711422 DOI: 10.1093/nar/gkt372] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 12/15/2022]
Abstract
Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier against the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity. (http://code.google.com/p/pyrohmmsnp/).
Affiliation(s)
- Feng Zeng
- Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China
35
Kang J, Huang KC, Xu Z, Wang Y, Abecasis GR, Li Y. AbCD: arbitrary coverage design for sequencing-based genetic studies. Bioinformatics 2013; 29:799-801. [PMID: 23357921 DOI: 10.1093/bioinformatics/btt041] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 02/04/2023]
Abstract
Recent advances in sequencing technologies have revolutionized genetic studies. Although high-coverage sequencing can uncover most variants present in the sequenced sample, low-coverage sequencing is appealing for its cost effectiveness. Here, we present AbCD (arbitrary coverage design) to aid the design of sequencing-based studies. AbCD is a user-friendly interface providing pre-estimated effective sample sizes, specific to each minor allele frequency category, for designs with arbitrary coverage (0.5-30×) and sample size (20-10 000), and for four major ethnic groups (Europeans, Africans, Asians and African Americans). In addition, we also present two software tools: ShotGun and DesignPlanner, which were used to generate the estimates behind AbCD. ShotGun is a flexible short-read simulator for arbitrary user-specified read length and average depth, allowing cycle-specific sequencing error rates and realistic read depth distributions. DesignPlanner is a full pipeline that uses ShotGun to generate sequence data and performs initial SNP discovery, uses our previously presented linkage disequilibrium-aware method to call genotypes, and, finally, provides minor allele frequency-specific effective sample sizes. ShotGun plus DesignPlanner can accommodate effective sample size estimate for any combination of high-depth and low-depth data (for example, whole-genome low-depth plus exonic high-depth) or combination of sequence and genotype data [for example, whole-exome sequencing plus genotyping from existing Genomewide Association Study (GWAS)].
Affiliation(s)
- Jian Kang
- Faculty of Kinesiology, University of Calgary, Calgary, AB T2N1N4, Canada
|
36
|
Wilson MR, Allard MW, Brown EW. The forensic analysis of foodborne bacterial pathogens in the age of whole-genome sequencing. Cladistics 2013; 29:449-461. [DOI: 10.1111/cla.12012] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Accepted: 11/26/2012] [Indexed: 01/07/2023]
Affiliation(s)
- Mark R. Wilson
- Forensic Science Program; 325 Natural Science Bldg; Western Carolina University; Cullowhee; NC; 28723; USA
- Marc W. Allard
- Division of Microbiology (HFS-710), Center for Food Safety & Applied Nutrition; US Food & Drug Administration; 5100 Paint Branch Parkway; College Park; MD; USA
- Eric W. Brown
- Division of Microbiology (HFS-710), Center for Food Safety & Applied Nutrition; US Food & Drug Administration; 5100 Paint Branch Parkway; College Park; MD; USA
|
37
|
Feder AF, Petrov DA, Bergland AO. LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data. PLoS One 2012; 7:e48588. [PMID: 23152785 PMCID: PMC3494690 DOI: 10.1371/journal.pone.0048588] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 07/06/2012] [Accepted: 10/03/2012] [Indexed: 12/14/2022]
Abstract
High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r²) between pairs of SNPs that can be observed within and among single reads. LDx also reports r² estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r² estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally, we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r² estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
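As a rough illustration of the quantity being estimated, the sketch below computes the textbook r² statistic from counts of reads that span two SNPs and therefore reveal which alleles sit on the same molecule. This is not the approximate maximum likelihood estimator used by LDx; the function name and the four count arguments are hypothetical and chosen only for this example.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Textbook r^2 between two biallelic SNPs from two-locus haplotype counts.

    Each argument is the number of reads (hypothetical input) carrying the
    given allele pair at SNP 1 (A/a) and SNP 2 (B/b).
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no reads span both SNPs")
    p_AB = n_AB / n                 # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n         # allele frequency of A at SNP 1
    p_B = (n_AB + n_aB) / n         # allele frequency of B at SNP 2
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return float("nan")         # one SNP is monomorphic in these reads
    d = p_AB - p_A * p_B            # linkage disequilibrium coefficient D
    return d * d / denom


print(r_squared(10, 0, 0, 10))  # alleles always co-occur
print(r_squared(5, 5, 5, 5))    # alleles are independent
```

Because only reads covering both sites contribute, this naive estimate is limited to SNP pairs closer together than the read (or read-pair) length.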
Affiliation(s)
- Alison F Feder
- Department of Biology, Stanford University, Stanford, California, United States of America.
|
38
|
Faita F, Vecoli C, Foffa I, Andreassi MG. Next generation sequencing in cardiovascular diseases. World J Cardiol 2012; 4:288-95. [PMID: 23110245 PMCID: PMC3482622 DOI: 10.4330/wjc.v4.i10.288] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 03/13/2012] [Revised: 09/08/2012] [Accepted: 09/15/2012] [Indexed: 02/06/2023]
Abstract
In the last few years, the advent of next generation sequencing (NGS) has revolutionized the approach to genetic studies, making whole-genome sequencing a possible way of obtaining global genomic information. NGS has very recently been shown to be successful in identifying novel causative mutations of rare or common Mendelian disorders. At the present time, it is expected that NGS will be increasingly important in the study of inherited and complex cardiovascular diseases (CVDs). However, the NGS approach to the genetics of CVDs represents a territory which has not been widely investigated. The identification of rare and frequent genetic variants can be very important in clinical practice to detect pathogenic mutations or to establish a profile of risk for the development of pathology. The purpose of this paper is to discuss the recent application of NGS in the study of several CVDs such as inherited cardiomyopathies, channelopathies, coronary artery disease and aortic aneurysm. We also discuss the future utility and challenges related to NGS in studying the genetic basis of CVDs in order to improve diagnosis, prevention, and treatment.
Affiliation(s)
- Francesca Faita
- Francesca Faita, Cecilia Vecoli, Ilenia Foffa, Maria Grazia Andreassi, CNR, Institute of Clinical Physiology, 54100 Massa, Italy
|
39
|
Zhou B. An empirical Bayes mixture model for SNP detection in pooled sequencing data. Bioinformatics 2012; 28:2569-75. [PMID: 22914221 DOI: 10.1093/bioinformatics/bts501] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION Detecting single-nucleotide polymorphism (SNP) in pooled sequencing data is more challenging than in individual sequencing because of sampling variations across pools. To effectively differentiate SNP signal from sequencing error, appropriate estimation of the sequencing error is necessary. In this article, we propose an empirical Bayes mixture (EBM) model for SNP detection and allele frequency estimation in pooled sequencing data. RESULTS The proposed model reliably learns the error distribution by pooling information across pools and genomic positions. In addition, the proposed EBM model builds in characteristics unique to the pooled sequencing data, boosting the sensitivity of SNP detection. For large-scale inference in SNP detection, the EBM model provides a flexible and robust way for estimation and control of local false discovery rate. We demonstrate the performance of the proposed method through simulation studies and real data application. AVAILABILITY Implementation of this method is available at https://sites.google.com/site/zhouby98.
Affiliation(s)
- Baiyu Zhou
- Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
|
40
|
Hasmats J, Green H, Solnestam BW, Zajac P, Huss M, Orear C, Validire P, Bjursell M, Lundeberg J. Validation of whole genome amplification for analysis of the p53 tumor suppressor gene in limited amounts of tumor samples. Biochem Biophys Res Commun 2012; 425:379-83. [DOI: 10.1016/j.bbrc.2012.07.101] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/18/2012] [Accepted: 07/19/2012] [Indexed: 10/28/2022]
|
41
|
Zhu Y, Bergland AO, González J, Petrov DA. Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS One 2012; 7:e41901. [PMID: 22848651 PMCID: PMC3406057 DOI: 10.1371/journal.pone.0041901] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 04/18/2012] [Accepted: 06/28/2012] [Indexed: 11/26/2022]
Abstract
The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strains as a reference. We also estimated allele frequencies of the SNPs using “pooled” data and compared them with “true” frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive.
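To make the "binomial sampling" statement concrete, the following minimal sketch computes the standard error of a pooled allele frequency estimate when the only source of error is the random sampling of reads at a site. It is an idealization, not the authors' analysis: it ignores sequencing error and the unequal DNA contributions per strain that the abstract identifies as extra noise. The function name and arguments are hypothetical.

```python
import math


def binomial_se(freq, depth):
    """Standard error of an allele frequency estimated from `depth` reads,
    assuming each read is an independent draw with probability `freq`."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    return math.sqrt(freq * (1.0 - freq) / depth)


# Error shrinks with the square root of read depth.
for depth in (25, 100, 400):
    print(depth, binomial_se(0.5, depth))
```

Under this model, quadrupling the read depth halves the expected error of the frequency estimate.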
Affiliation(s)
- Yuan Zhu
- Department of Genetics, Stanford University, Stanford, California, United States of America.
|
42
|
Flannick J, Korn JM, Fontanillas P, Grant GB, Banks E, Depristo MA, Altshuler D. Efficiency and power as a function of sequence coverage, SNP array density, and imputation. PLoS Comput Biol 2012; 8:e1002604. [PMID: 22807667 PMCID: PMC3395607 DOI: 10.1371/journal.pcbi.1002604] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 03/14/2012] [Accepted: 05/24/2012] [Indexed: 01/19/2023]
Abstract
High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms, when low coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low coverage sequencing compare to traditional microarray based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes based on analysis of SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the Hapmap and 1000 Genomes (1000 G) Projects, we show that 1 M SNP arrays used for genome wide association studies perform similarly to 1× sequencing. We find that adding low coverage sequence reads to dense array data significantly increases rare variant sensitivity, but adding dense array data to low coverage sequencing has only a small impact. Finally, we describe an improved SNP calling algorithm used in the 1000 G project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improve genotype calling methods.
Affiliation(s)
- Jason Flannick
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Joshua M. Korn
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Harvard-MIT Division of Health Sciences and Technology, Cambridge, Massachusetts, United States of America
- Graduate Program in Biophysics, Harvard University, Cambridge, Massachusetts, United States of America
- Pierre Fontanillas
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- George B. Grant
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Eric Banks
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Mark A. Depristo
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- David Altshuler
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
- Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Genetics and Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
|
43
|
Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data. Statistics in Biosciences 2012; 5:3-25. [PMID: 24489615 DOI: 10.1007/s12561-012-9067-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 10/28/2022]
Abstract
Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.
|
44
|
Zhou B, Whittemore AS. Improving sequence-based genotype calls with linkage disequilibrium and pedigree information. Ann Appl Stat 2012. [DOI: 10.1214/11-aoas527] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
45
|
Li M, Stoneking M. A new approach for detecting low-level mutations in next-generation sequence data. Genome Biol 2012; 13:R34. [PMID: 22621726 PMCID: PMC3446287 DOI: 10.1186/gb-2012-13-5-r34] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 11/28/2011] [Revised: 05/14/2012] [Accepted: 05/23/2012] [Indexed: 01/01/2023]
Abstract
We propose a new method that incorporates population re-sequencing data, distribution of reads, and strand bias in detecting low-level mutations. The method can accurately identify low-level mutations down to a level of 2.3%, with an average coverage of 500×, and with a false discovery rate of less than 1%. In addition, we also discuss other problems in detecting low-level mutations, including chimeric reads and sample cross-contamination, and provide possible solutions to them.
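For intuition about why high coverage is needed to separate a low-level mutation from sequencing error, the sketch below computes the probability of observing at least k non-reference reads purely by error at a given depth, under a simple binomial model. This is a generic illustration and not the method proposed in the paper, which additionally uses population re-sequencing data, read distribution, and strand bias. The function name and the 0.5% per-base error rate are assumptions made for the example.

```python
import math


def error_tail_probability(k, depth, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate): the chance that
    sequencing error alone yields k or more non-reference reads."""
    if not 0 <= k <= depth:
        raise ValueError("k must lie between 0 and depth")
    below = sum(
        math.comb(depth, i) * error_rate**i * (1.0 - error_rate) ** (depth - i)
        for i in range(k)
    )
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - below)


# At 500x coverage, a 2.3% variant corresponds to roughly 12 supporting reads.
# The 0.5% error rate is an assumed value for illustration only.
print(error_tail_probability(12, 500, 0.005))
```

A small tail probability means the observed read count is unlikely to arise from error alone under this simplified model.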
Affiliation(s)
- Mingkun Li
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, D04103, Leipzig, Germany.
|
46
|
Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 2012; 44:631-5. [PMID: 22610117 DOI: 10.1038/ng.2283] [Citation(s) in RCA: 177] [Impact Index Per Article: 14.8] [Received: 12/13/2011] [Accepted: 04/16/2012] [Indexed: 12/14/2022]
Abstract
Genome-wide association studies (GWAS) have proven to be a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here, we show that extremely low-coverage sequencing (0.1-0.5×) captures almost as much of the common (>5%) and low-frequency (1-5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r² of 0.71 using off-target data (0.24× average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome-sequencing data sets, we show that association statistics obtained using extremely low-coverage sequencing data attain similar P values at known associated variants as data from genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in extremely low-coverage sequencing can yield several times the effective sample size of GWAS based on SNP array data and a commensurate increase in statistical power.
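The link between imputation accuracy and effective sample size can be sketched with the common rule of thumb that testing an imperfectly inferred genotype in N samples is roughly equivalent to testing the true genotype in N × r² samples. This approximation is an assumption brought in for illustration and is not stated in the abstract; the function name is hypothetical.

```python
def effective_sample_size(n_samples, r2):
    """Approximate effective sample size when genotypes are inferred with
    squared correlation `r2` to the truth (rule of thumb: N_eff = N * r2)."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return n_samples * r2


# Figures quoted in the abstract: 909 samples, mean r2 of 0.71.
print(effective_sample_size(909, 0.71))
```

Under this approximation, a cheaper assay with lower r² can still come out ahead if the savings fund enough additional samples.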
|
47
|
Gerstung M, Beisel C, Rechsteiner M, Wild P, Schraml P, Moch H, Beerenwinkel N. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat Commun 2012; 3:811. [PMID: 22549840 DOI: 10.1038/ncomms1814] [Citation(s) in RCA: 178] [Impact Index Per Article: 14.8] [Received: 12/23/2011] [Accepted: 03/30/2012] [Indexed: 01/06/2023]
|
48
|
Determination of RET Sequence Variation in an MEN2 Unaffected Cohort Using Multiple-Sample Pooling and Next-Generation Sequencing. J Thyroid Res 2012; 2012:318232. [PMID: 22545224 PMCID: PMC3321559 DOI: 10.1155/2012/318232] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/07/2011] [Accepted: 01/23/2012] [Indexed: 11/30/2022]
Abstract
Multisample, nonindexed pooling combined with next-generation sequencing (NGS) was used to discover RET proto-oncogene sequence variation within a cohort known to be unaffected by multiple endocrine neoplasia type 2 (MEN2). DNA samples (113 Caucasians, 23 persons of other ethnicities) were amplified for RET intron 9 to intron 16 and then divided into 5 pools of <30 samples each before library prep and NGS. Two controls were included in this study, a single sample and a pool of 50 samples that had been previously sequenced by the same NGS methods. All 59 variants previously detected in the 50-pool control were present. Of the 61 variants detected in the unaffected cohort, 20 variants were novel changes. Several variants were validated by high-resolution melting analysis and Sanger sequencing, and their allelic frequencies correlated well with those determined by NGS. The results from this unaffected cohort will be added to the RET MEN2 database.
|
49
|
Crawford JE, Lazzaro BP. Assessing the accuracy and power of population genetic inference from low-pass next-generation sequencing data. Front Genet 2012; 3:66. [PMID: 22536207 PMCID: PMC3334522 DOI: 10.3389/fgene.2012.00066] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 01/26/2012] [Accepted: 04/05/2012] [Indexed: 01/17/2023]
Abstract
Next-generation sequencing (NGS) technologies have made it possible to address population genetic questions in almost any system, but high error rates associated with such data can introduce significant biases into downstream analyses, necessitating careful experimental design and interpretation in studies based on short-read sequencing. Exploration of population genetic analyses based on NGS has revealed some of the potential biases, but previous work has emphasized parameters relevant to human population genetics and further examination of parameters relevant to other systems is necessary, including situations where sample sizes are small and genetic variation is high. To assess experimental power to address several principal objectives of population genetic studies under these conditions, we simulated population samples under selective sweep, population growth, and population subdivision models and tested the power to accurately infer population genetic parameters from sequence polymorphism data obtained through simulated 4×, 8×, and 15× read depth sequence data. We found that estimates of population genetic differentiation and population growth parameters were systematically biased when inference was based on 4× sequencing, but biases were markedly reduced at even 8× read depth. We also found that the power to identify footprints of positive selection depends on an interaction between read depth and the strength of selection, with strong selection being recovered consistently at all read depths, but weak selection requiring deeper read depths for reliable detection. Although we have explored only a small subset of the many possible experimental designs and population genetic models, using only one SNP-calling approach, our results reveal some general patterns and provide some assessment of what biases could be expected under similar experimental structures.
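One reason inference degrades at low read depth can be shown with a short calculation: at a truly heterozygous site, each read samples one of the two alleles, so with few reads there is a real chance that only one allele is ever observed. The sketch below computes the probability of seeing both alleles at a fixed depth, assuming error-free reads and equal sampling of the two alleles. It is a simplification for intuition, not the simulation design used in the study, and the function name is hypothetical.

```python
def prob_both_alleles_seen(depth):
    """Probability that `depth` error-free reads at a heterozygous site
    include at least one read of each allele (each allele sampled with p=0.5)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth < 2:
        return 0.0                  # one read can show only one allele
    return 1.0 - 2.0 * 0.5**depth   # exclude the two all-one-allele outcomes


# The three read depths examined in the abstract.
for depth in (4, 8, 15):
    print(depth, prob_both_alleles_seen(depth))
```

Practical SNP callers usually require more than one supporting read per allele, so the real loss of heterozygous calls at 4× is larger than this lower bound suggests.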
|
50
|
Rossetti S, Hopp K, Sikkink RA, Sundsbak JL, Lee YK, Kubly V, Eckloff BW, Ward CJ, Winearls CG, Torres VE, Harris PC. Identification of gene mutations in autosomal dominant polycystic kidney disease through targeted resequencing. J Am Soc Nephrol 2012; 23:915-33. [PMID: 22383692 DOI: 10.1681/asn.2011101032] [Citation(s) in RCA: 127] [Impact Index Per Article: 10.6] [Indexed: 11/03/2022]
Abstract
Mutations in two large multi-exon genes, PKD1 and PKD2, cause autosomal dominant polycystic kidney disease (ADPKD). The duplication of PKD1 exons 1-32 as six pseudogenes on chromosome 16, the high level of allelic heterogeneity, and the cost of Sanger sequencing complicate mutation analysis, which can aid diagnostics of ADPKD. We developed and validated a strategy to analyze both the PKD1 and PKD2 genes using next-generation sequencing by pooling long-range PCR amplicons and multiplexing bar-coded libraries. We used this approach to characterize a cohort of 230 patients with ADPKD. This process detected definitely and likely pathogenic variants in 115 (63%) of 183 patients with typical ADPKD. In addition, we identified atypical mutations, a gene conversion, and one missed mutation resulting from allele dropout, and we characterized the pattern of deep intronic variation for both genes. In summary, this strategy involving next-generation sequencing is a model for future genetic characterization of large ADPKD populations.
Affiliation(s)
- Sandro Rossetti
- Division of Nephrology and Hypertension, Mayo Clinic, Rochester, MN 55905, USA.
|