1
Sun S, Cheng F, Han D, Wei S, Zhong A, Massoudian S, Johnson AB. Pairwise comparative analysis of six haplotype assembly methods based on users' experience. BMC Genom Data 2023; 24:35. [PMID: 37386408 PMCID: PMC10311811 DOI: 10.1186/s12863-023-01134-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 05/25/2023] [Indexed: 07/01/2023] Open
Abstract
BACKGROUND A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data. Currently, there are many HA methods, each with its own strengths and weaknesses. This study compared six HA methods or algorithms (HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap) using two NA12878 datasets named hg19 and hg38. The 6 HA algorithms were run on chromosome 10 of these two datasets, each with 3 filtering levels based on sequencing depth (DP1, DP15, and DP30). Their outputs were then compared.
RESULT Run time (CPU time) was compared to assess the efficiency of the 6 HA methods. HapCUT2 was the fastest HA method on all 6 datasets, with run time consistently under 2 min. WhatsHap was also relatively fast, with a run time of 21 min or less for all 6 datasets. The run time of the other 4 HA algorithms varied across datasets and coverage levels. To assess accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and Single Nucleotide Variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs, and they performed relatively similarly. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with the other methods. However, for the hg38 data, WhatsHap performed similarly to the other 4 algorithms, except SDhaP. The comparison analysis showed that SDhaP had a much larger disagreement rate when compared with the other algorithms in all 6 datasets.
CONCLUSION The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
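The switch distance mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes that each phasing of a block is given as a string of 0/1 alleles for one haplotype at the heterozygous sites (the other haplotype being its complement). The function name and the input encoding are illustrative assumptions, not taken from the paper or from any of the six tools.

```python
def switch_distance(predicted: str, truth: str) -> int:
    """Count the phase switches needed to make `predicted` match `truth`.

    Both arguments encode one haplotype of the same block as a 0/1 string
    over the same heterozygous sites (hypothetical encoding).
    """
    if len(predicted) != len(truth):
        raise ValueError("phasings must cover the same sites")
    # True where the predicted allele differs from the reference phasing.
    mismatch = [p != t for p, t in zip(predicted, truth)]
    # A switch is needed wherever the match/mismatch state flips
    # between two neighbouring sites.
    return sum(a != b for a, b in zip(mismatch, mismatch[1:]))


# One switch after the third site turns 000111 into 000000.
print(switch_distance("000111", "000000"))  # prints 1
```

Under this definition a phasing that is the exact complement of the truth (for example `1111` against `0000`) has distance 0, because the two haplotype labels are interchangeable.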
Affiliation(s)
- Shuying Sun
  - Department of Mathematics, Texas State University, San Marcos, TX USA
- Flora Cheng
  - Carnegie Mellon University, Pittsburgh, PA USA
- Daphne Han
  - Carnegie Mellon University, Pittsburgh, PA USA
- Sarah Wei
  - Massachusetts Institute of Technology, Cambridge, MA USA
2
Bu M, Xu M, Tao S, Cui P, He B. Evaluation of Different SNP Analysis Software and Optimal Mining Process in Tree Species. Life (Basel) 2023; 13:life13051069. [PMID: 37240714 DOI: 10.3390/life13051069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 03/24/2023] [Accepted: 04/11/2023] [Indexed: 05/28/2023] Open
Abstract
Single nucleotide polymorphisms (SNPs) are among the most widely used molecular markers for understanding the relationship between phenotypes and genotypes. SNP calling mainly consists of two steps, read alignment and locus identification based on statistical models, and various software packages have been developed and applied for this purpose. In our study, however, very low agreement (<25%) was found among the prediction results generated by different software, far less consistent than expected. To obtain the optimal protocol for SNP mining in tree species, the algorithmic principles of different alignment and SNP mining software were discussed in detail, and the prediction results were further validated using in silico and experimental methods. In addition, hundreds of validated SNPs are provided along with practical suggestions on program selection and accuracy improvement, and we hope that these results lay the foundation for subsequent SNP mining analyses.
Affiliation(s)
- Mengjia Bu
  - Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
  - State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, Henan University, Kaifeng 475004, China
  - Shenzhen Research Institute of Henan University, Shenzhen 518000, China
- Mengxuan Xu
  - Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing 210037, China
- Shentong Tao
  - Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing 210037, China
- Peng Cui
  - Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Bing He
  - Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
3
Ni Y, Liu X, Simeneh ZM, Yang M, Li R. Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome amplification and whole-genome shotgun sequencing. Comput Struct Biotechnol J 2023; 21:2352-2364. [PMID: 37025654 PMCID: PMC10070092 DOI: 10.1016/j.csbj.2023.03.038] [Citation(s) in RCA: 38] [Impact Index Per Article: 38.0] [Received: 12/18/2022] [Revised: 03/21/2023] [Accepted: 03/22/2023] [Indexed: 03/30/2023] Open
Abstract
Third-generation sequencing can be used in human cancer genomics and epigenomic research. Oxford Nanopore Technologies (ONT) recently released the R10.4 flow cell, which is claimed to offer improved read accuracy compared with the R9.4.1 flow cell. To evaluate the benefits and defects of the R10.4 flow cell for cancer cell profiling on MinION devices, we used the human non-small-cell lung-carcinoma cell line HCC78 to construct libraries for both single-cell whole-genome amplification (scWGA) and whole-genome shotgun sequencing. The R10.4 and R9.4.1 reads were benchmarked in terms of read accuracy, variant detection, modification calling, and genome recovery rate, and were compared with next-generation sequencing (NGS) reads. The results showed that R10.4 reads outperform R9.4.1 reads, achieving a higher modal read accuracy of over 99.1%, superior variation detection, a lower false-discovery rate (FDR) in methylation calling, and a comparable genome recovery rate. To achieve high-yield scWGA sequencing on the ONT platform, as with NGS, we recommend multiple displacement amplification with a modified T7 endonuclease Ⅰ cutting procedure as a promising method. In addition, we provide a possible solution for filtering likely false-positive sites across the whole genome with R10.4 by using the scWGA sequencing result as a negative control. Our study is the first benchmark of whole-genome single-cell sequencing using ONT R10.4 and R9.4.1 MinION flow cells, clarifying the capacity for genomic and epigenomic profiling within a single flow cell. A promising method for scWGA sequencing, together with the methylation calling results, can benefit researchers who work on cancer cell genomic and epigenomic profiling using third-generation sequencing.
Affiliation(s)
- Ying Ni
  - Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
  - Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
- Xudong Liu
  - Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
- Zemenu Mengistie Simeneh
  - Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
  - Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
- Mengsu Yang
  - Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
  - Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
  - Corresponding author at: Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China.
- Runsheng Li
  - Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
  - Corresponding author at: Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China.
4
Li C, Fan X, Guo X, Liu Y, Wang M, Zhao XC, Wu P, Yan Q, Sun L. Accuracy benchmark of the GeneMind GenoLab M sequencing platform for WGS and WES analysis. BMC Genomics 2022; 23:533. [PMID: 35869426 PMCID: PMC9308344 DOI: 10.1186/s12864-022-08775-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 07/18/2022] [Indexed: 11/23/2022] Open
Abstract
Background GenoLab M is a recently developed next-generation sequencing (NGS) platform from GeneMind Biosciences. To establish the performance of GenoLab M, we present the first report to benchmark and compare WGS and WES data from the GenoLab M sequencer against the NovaSeq 6000 and NextSeq 550 platforms in various types of analysis. For WGS, thirty-fold sequencing on the Illumina NovaSeq platform processed with the GATK pipeline is currently considered the gold standard; such a dataset was therefore generated as the benchmark reference in this study. Results GenoLab M showed an average Q20 percentage of 94.62% for base quality, while NovaSeq was slightly higher at 96.97%. However, GenoLab M outperformed NovaSeq and NextSeq in duplication rate, suggesting more usable data after deduplication. For WGS short variant calling, GenoLab M showed a significant accuracy improvement over the same-depth dataset from NovaSeq, and reached accuracy similar to the NovaSeq 33X dataset at 22X depth. For 100X WES, the F-score and precision of GenoLab M were higher than those of NovaSeq or NextSeq, especially for InDel calling. Conclusions GenoLab M is a promising NGS platform for high-performance WGS and WES applications. For WGS, 22X depth on the GenoLab M sequencing platform offers a cost-effective alternative to the current mainstream 33X depth on Illumina.
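For readers unfamiliar with the accuracy metrics named in this abstract, the following Python sketch shows how precision, recall, and the F-score (here the balanced F1 form) are conventionally derived from counts of true-positive, false-positive, and false-negative variant calls. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def variant_calling_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall and F1 for a set of variant calls.

    tp: calls that match the benchmark reference
    fp: calls absent from the benchmark reference
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
print(variant_calling_metrics(tp=90, fp=10, fn=30))
```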
5
Genome-Wide Association Study of Body Weight Trait in Yaks. Animals (Basel) 2022; 12:ani12141855. [PMID: 35883402 PMCID: PMC9311934 DOI: 10.3390/ani12141855] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/30/2022] [Revised: 07/14/2022] [Accepted: 07/19/2022] [Indexed: 01/03/2023] Open
Abstract
The yak is the largest meat-producing mammal around the Tibetan Plateau, and it plays an important role in the economic development and maintenance of the ecological environment throughout much of the Asian highlands. Understanding the genetic components of body weight is key for future improvement in yak breeding; therefore, genome-wide association studies (GWAS) were performed, and the results were used to mine plant and animal genetic resources. We conducted whole genome sequencing on 406 Maiwa yaks at 10× coverage. Using a multiple loci mixed linear model (MLMM), fixed and random model circulating probability unification (FarmCPU), and Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK), we found that a total of 25,000 single-nucleotide polymorphisms (SNPs) were distributed across chromosomes, and seven markers were identified as significantly (p-values < 3.91 × 10⁻⁷) associated with the body weight trait. Several candidate genes, including MFSD4, LRRC37B, and NCAM2, were identified. This research will help us achieve a better understanding of the genotype-phenotype relationship for body weight.
6
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022] Open
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
- Kiwoong Nam
  - DGIMI, Univ Montpellier, INRAE, Montpellier, France.
7
Schneider M, Shrestha A, Ballvora A, Léon J. High-throughput estimation of allele frequencies using combined pooled-population sequencing and haplotype-based data processing. Plant Methods 2022; 18:34. [PMID: 35313910 PMCID: PMC8935755 DOI: 10.1186/s13007-022-00852-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/29/2021] [Accepted: 02/07/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND In addition to heterogeneity and artificial selection, natural selection is one of the forces used to combat climate change and improve agrobiodiversity in evolutionary plant breeding. Accurate identification of the specific genomic effects of natural selection will likely accelerate transfer between populations. Thus, insights into changes in allele frequency, adequate population size, gene flow and drift are essential. However, observing such effects often involves a trade-off between costs and resolution when a large sample of genotypes for many loci is analysed. Pool genotyping approaches achieve high resolution and precision in estimating allele frequency when sequence coverage is high. Nevertheless, high-coverage pool sequencing of large genomes is expensive. RESULTS Three pool samples (n = 300, 300, 288) from a barley backcross population were generated to assess the population's allele frequency. The tested population (BC2F21) has undergone 18 generations of natural adaptation to conventional farming practice. The accuracies of estimated pool-based allele frequencies and genome coverage yields were compared using three next-generation sequencing genotyping methods. To achieve accurate allele frequency estimates with low sequence coverage, we employed a haplotyping approach. Low-coverage allele frequencies of closely located single polymorphisms were aggregated into a single haplotype allele frequency, yielding 2- to 271-fold higher depth and increased precision. When we combined different haplotyping tactics, we found that gene and chip marker-based haplotype analyses performed equivalently or better compared with simple contig haplotype windows. Comparing multiple pool samples and referencing against an individual sequencing approach revealed that whole-genome pool re-sequencing (WGS) achieved the highest correlation with individual genotyping (≥ 0.97).
In contrast, transcriptome-based genotyping (MACE) and genotyping by sequencing (GBS) pool replicates were significantly associated with higher error rates and lower correlations, but are still valuable to detect large allele frequency variations. CONCLUSIONS The proposed strategy identified the allele frequency of populations with high accuracy at low cost. This is particularly relevant to evolutionary plant breeding of crops with very large genomes, such as barley. Whole-genome low coverage re-sequencing at 0.03 × coverage per genotype accurately estimated the allele frequency when a loci-based haplotyping approach was applied. The implementation of annotated haplotypes capitalises on the biological background and statistical robustness.
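The idea of aggregating low-coverage allele frequencies of neighbouring polymorphisms into one haplotype-level estimate can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes that every site in the window has already been polarised to the same parental allele and that pooling is done by summing read counts. The abstract does not specify the authors' exact procedure, and the function name and input layout are hypothetical.

```python
from typing import List, Optional, Tuple


def haplotype_allele_frequency(
    sites: List[Tuple[int, int]],
) -> Tuple[Optional[float], int]:
    """Pool read counts of linked sites into one frequency estimate.

    sites: (reads_supporting_allele, total_reads) per polymorphism in the
           window, all polarised to the same parental allele (assumption).
    Returns (pooled_frequency, pooled_depth); the frequency is None when
    no reads cover the window.
    """
    allele_reads = sum(a for a, _ in sites)
    depth = sum(t for _, t in sites)
    if depth == 0:
        return None, 0
    return allele_reads / depth, depth


# Three sparsely covered sites: single-site estimates would be 0.5, 0.0
# and 0.67, each from at most 3 reads; pooling uses all 6 reads.
freq, depth = haplotype_allele_frequency([(1, 2), (0, 1), (2, 3)])
print(freq, depth)  # prints 0.5 6
```

The pooled depth is the sum of the per-site depths, which is why aggregation over a window raises the effective depth of the estimate.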
Affiliation(s)
- Michael Schneider
  - Institute of Crop Science and Resource Conservation, University of Bonn, Plant Breeding, Katzenburgweg 5, 53115, Bonn, Germany
  - Institute for Quantitative Genetics and Genomics of Plants, University Duesseldorf, Universitätsstraße 1, 40225, Düsseldorf, Germany
- Asis Shrestha
  - Institute of Crop Science and Resource Conservation, University of Bonn, Plant Breeding, Katzenburgweg 5, 53115, Bonn, Germany
  - Institute for Quantitative Genetics and Genomics of Plants, University Duesseldorf, Universitätsstraße 1, 40225, Düsseldorf, Germany
- Agim Ballvora
  - Institute of Crop Science and Resource Conservation, University of Bonn, Plant Breeding, Katzenburgweg 5, 53115, Bonn, Germany
- Jens Léon
  - Institute of Crop Science and Resource Conservation, University of Bonn, Plant Breeding, Katzenburgweg 5, 53115, Bonn, Germany.
8
Bayer PE. Skim-Based Genotyping by Sequencing Using a Double Haploid Population to Call SNPs, Infer Gene Conversions, and Improve Genome Assemblies. Methods Mol Biol 2022; 2443:405-413. [PMID: 35037217 DOI: 10.1007/978-1-0716-2067-0_20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Genotyping by sequencing (GBS) is an emerging technology to rapidly call an abundance of single nucleotide polymorphisms (SNPs) using genome sequencing technology. Several different methodologies and approaches have recently been established, most of these relying on a specific preparation of data. Here we describe our GBS pipeline, which uses high coverage reads from two parents and low coverage reads from their double haploid offspring to call SNPs on a large scale. The upside of this approach is the high resolution and scalability of the method.
Affiliation(s)
- Philipp Emanuel Bayer
  - School of Biological Sciences, University of Western Australia, Perth, WA, Australia.
9
Wagner DD, Carleton HA, Trees E, Katz LS. Evaluating whole-genome sequencing quality metrics for enteric pathogen outbreaks. PeerJ 2021; 9:e12446. [PMID: 34900416 PMCID: PMC8627651 DOI: 10.7717/peerj.12446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2021] [Accepted: 10/18/2021] [Indexed: 11/25/2022] Open
Abstract
Background Whole genome sequencing (WGS) has gained increasing importance in responses to enteric bacterial outbreaks. Common analysis procedures for WGS, single nucleotide polymorphisms (SNPs) and genome assembly, are highly dependent upon WGS data quality. Methods Raw, unprocessed WGS reads from Escherichia coli, Salmonella enterica, and Shigella sonnei outbreak clusters were characterized for four quality metrics: PHRED score, read length, library insert size, and ambiguous nucleotide composition. PHRED scores were strongly correlated with improved SNPs analysis results in E. coli and S. enterica clusters. Results Assembly quality showed only moderate correlations with PHRED scores and library insert size, and then only for Salmonella. To improve SNP analyses and assemblies, we compared seven read-healing pipelines to improve these four quality metrics and to see how well they improved SNP analysis and genome assembly. The most effective read healing pipelines for SNPs analysis incorporated quality-based trimming, fixed-width trimming, or both. The Lyve-SET SNPs pipeline showed a more marked improvement than the CFSAN SNP Pipeline, but the latter performed better on raw, unhealed reads. For genome assembly, SPAdes enabled significant improvements in healed E. coli reads only, while Skesa yielded no significant improvements on healed reads. Conclusions PHRED scores will continue to be a crucial quality metric albeit not of equal impact across all types of analyses for all enteric bacteria. While trimming-based read healing performed well for SNPs analyses, different read healing approaches are likely needed for genome assembly or other, emerging WGS analysis methodologies.
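As background to the PHRED metric and quality-based trimming discussed in this abstract, the Python sketch below decodes a FASTQ quality string and trims low-quality bases from the 3' end. It relies on the standard definition Q = -10 log10(P) and on the common Phred+33 ASCII encoding. The trimming rule and the function names are simplified assumptions for illustration and do not reproduce any of the seven read-healing pipelines evaluated in the paper.

```python
from typing import List, Tuple


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to PHRED scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the base-call error probability."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> Tuple[str, str]:
    """Drop trailing bases whose PHRED score is below `min_q`.

    A deliberately simple rule (hypothetical); real trimmers often use
    sliding windows or running-sum algorithms instead.
    """
    scores = decode_quality(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


print(decode_quality("II5!"))            # prints [40, 40, 20, 0]
print(trim_3prime("ACGT", "II5!", 20))   # prints ('ACG', 'II5')
```

A PHRED score of 20 corresponds to an error probability of 0.01, and a score of 30 to 0.001, which is why Q20 and Q30 are common reporting thresholds.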
Affiliation(s)
- Darlene D Wagner
  - Division of Viral Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
  - Eagle Medical Services, LLC, Atlanta, GA, United States of America
- Heather A Carleton
  - Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
- Eija Trees
  - Association of Public Health Laboratories, Silver Spring, MD, United States of America
- Lee S Katz
  - Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
  - Center for Food Safety, University of Georgia, Griffin, GA, United States of America
10
Hua M, Liu J, Du P, Liu X, Li M, Wang H, Chen C, Xu X, Jiang Y, Wang Y, Zeng H, Li A. The novel outer membrane protein from OprD/Occ family is associated with hypervirulence of carbapenem resistant Acinetobacter baumannii ST2/KL22. Virulence 2021; 12:1-11. [PMID: 33258407 PMCID: PMC7781578 DOI: 10.1080/21505594.2020.1856560] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/15/2020] [Revised: 11/16/2020] [Accepted: 11/18/2020] [Indexed: 01/04/2023] Open
Abstract
Acinetobacter baumannii has become a major healthcare threat that causes nosocomial infections, especially in critically ill patients. The spread of carbapenem-resistant A. baumannii (CRAB) strains has long been a clinical concern. It is important to study the epidemiology and virulence characteristics of different CRAB isolates in order to tailor infection prevention and antibiotic prescribing. In this study, a total of 71 CRAB isolates were collected in the hospital, and the clinical characteristics of the infections were analyzed. The genomic characteristics and phylogenetic relationships were elucidated based on genome sequencing and analysis. The isolates were assigned to three sequence types (STs, Pasteur) and nine capsular polysaccharide (KL) types, among which ST2/KL22 was the most prevalent CRAB in the hospital. Even though all the ST2/KL22 isolates contained the same reported virulence genes, one specific clade of ST2/KL22 was more pathogenic in a mouse infection model. Complete genomic analysis revealed differences at the oprD locus between the low- and high-virulence isolates. More specifically, a premature stop codon in the low-virulence strains resulted in truncated OprD expression. In pathogenicity evaluations in C57BL/6J mice, knock-out of oprD in a high-virulence isolate resulted in virulence attenuation, and complementing the avirulent strain with full-length oprD from the high-virulence isolate enhanced the virulence of the former. The oprD gene may be associated with the enhanced virulence of the specific ST2/KL22 clone, which provides a potential molecular marker for screening hypervirulent A. baumannii strains.
Affiliation(s)
- Mingxi Hua
  - Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing
- Jingyuan Liu
  - Department of Critical Care Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing
- Pengcheng Du
  - Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing
  - Beijing Key Laboratory of Emerging Infectious Diseases, Beijing
- Xinzhe Liu
  - Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing
  - Beijing Key Laboratory of Emerging Infectious Diseases, Beijing
- Min Li
  - Clinical Laboratory, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Huizhu Wang
  - Clinical Laboratory, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Chen Chen
  - Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing
  - Beijing Key Laboratory of Emerging Infectious Diseases, Beijing
- Xinmin Xu
  - Clinical Laboratory, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yu Jiang
  - Department of Stomatology, Beijing Children’s Hospital, Capital Medical University, Beijing
- Yajie Wang
  - Clinical Laboratory, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Hui Zeng
  - Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing
  - Beijing Key Laboratory of Emerging Infectious Diseases, Beijing
- Ang Li
  - Department of Critical Care Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing
11
Casellas J, Martín de Hijas-Villalba M, Vázquez-Gómez M, Id-Lahoucine S. Low-coverage whole-genome sequencing in livestock species for individual traceability and parentage testing. Livest Sci 2021. [DOI: 10.1016/j.livsci.2021.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
12
Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021; 22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022] Open
Abstract
Background Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline components, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results We highlight that interval padding is required for the accurate detection of intronic variants, including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions, and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions These findings have important implications for the identification of clinically actionable variants through panel testing in a clinical laboratory setting, where dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or developing new pipelines to generate more reliable and more consistent data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04144-1.
Affiliation(s)
- Maria Zanti
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriaki Michailidou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Biostatistics Unit, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Maria A Loizidou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Christina Machattou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Panagiota Pirpa
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyproula Christodoulou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Neurogenetics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- George M Spyrou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriacos Kyriacou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Andreas Hadjisavvas
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
13
|
Paskov K, Jung JY, Chrisman B, Stockham NT, Washington P, Varma M, Sun MW, Wall DP. Estimating sequencing error rates using families. BioData Min 2021; 14:27. [PMID: 33892748 PMCID: PMC8063364 DOI: 10.1186/s13040-021-00259-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/26/2020] [Accepted: 03/29/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) sequencing error rates between samples in the same dataset can vary by over an order of magnitude; 2) variant calling performance decreases substantially in low-complexity regions of the genome; 3) variant calling performance in whole exome sequencing data decreases with distance from the nearest target region; 4) variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood; 5) whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
CONCLUSION Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
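To make the notion of a Mendelian error concrete, here is a small Python sketch. It only counts genotype combinations in a mother/father/child trio that are impossible under Mendelian inheritance at biallelic autosomal sites; it is not the estimation method described in the abstract, and the function names and example genotypes are hypothetical.

```python
from itertools import product
from typing import Iterable, Tuple

# Genotypes are encoded as alternate-allele counts at a biallelic site:
# 0 = hom-ref, 1 = het, 2 = hom-alt (an assumption of this sketch).
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_consistent(mother: int, father: int, child: int) -> bool:
    """True if the child genotype can arise from one allele of each parent."""
    return any(
        m + f == child for m, f in product(GAMETES[mother], GAMETES[father])
    )


def mendelian_error_rate(trios: Iterable[Tuple[int, int, int]]) -> float:
    """Fraction of (mother, father, child) sites that violate Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        raise ValueError("no sites supplied")
    errors = sum(not is_mendelian_consistent(*t) for t in trios)
    return errors / len(trios)


# Hypothetical genotype calls at five sites.
sites = [(0, 0, 0), (0, 1, 1), (2, 2, 1), (0, 0, 1), (1, 1, 2)]
print(mendelian_error_rate(sites))  # 0.4: sites 3 and 4 are inconsistent
```

Only a fraction of genotyping errors show up this way (for example, a wrong call in a het x het family is often still consistent), which is why the abstract speaks of extrapolating from observed Mendelian errors rather than counting them directly.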
Affiliation(s)
- Kelley Paskov
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jae-Yoon Jung
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Pediatrics (Systems Medicine), Stanford University, Stanford, CA, USA
- Brianna Chrisman
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Nate T Stockham
  - Department of Neuroscience, Stanford University, Stanford, CA, USA
- Peter Washington
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Maya Varma
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Min Woo Sun
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Dennis P Wall
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Pediatrics (Systems Medicine), Stanford University, Stanford, CA, USA
|
14
|
Valiente-Mullor C, Beamud B, Ansari I, Francés-Cuesta C, García-González N, Mejía L, Ruiz-Hueso P, González-Candelas F. One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads. PLoS Comput Biol 2021; 17:e1008678. [PMID: 33503026 PMCID: PMC7870062 DOI: 10.1371/journal.pcbi.1008678] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 04/27/2020] [Revised: 02/08/2021] [Accepted: 01/05/2021] [Indexed: 12/17/2022] Open
Abstract
Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended. Mapping consists in the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species—a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. 
It is known that genetic differences between the reference genome and the read sequences may produce incorrect alignments during mapping. Eventually, these errors could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). To our knowledge, this is the first work to systematically examine the effect of different references for mapping on the inference of tree topology as well as the impact on recombination and natural selection inferences. Furthermore, the novelty of this work lies in a procedure that guarantees that we are evaluating only the effect of the reference. This effect has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be advisable to assess the potential biases of mapping.
Affiliation(s)
- Carlos Valiente-Mullor
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Beatriz Beamud
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - * E-mail: (BB); (FG-C)
- Iván Ansari
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Carlos Francés-Cuesta
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Neris García-González
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Lorena Mejía
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - Instituto de Microbiología, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito, Quito, Ecuador
- Paula Ruiz-Hueso
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Fernando González-Candelas
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - CIBER in Epidemiology and Public Health, Valencia, Spain
  - * E-mail: (BB); (FG-C)
|
15
|
High-Throughput Genotyping Technologies in Plant Taxonomy. Methods Mol Biol 2021; 2222:149-166. [PMID: 33301093 DOI: 10.1007/978-1-0716-0997-2_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Molecular markers provide researchers with a powerful tool for variation analysis between plant genomes. They are heritable and widely distributed across the genome and for this reason have many applications in plant taxonomy and genotyping. Over the last decade, molecular marker technology has developed rapidly and is now a crucial component for genetic linkage analysis, trait mapping, diversity analysis, and association studies. This chapter focuses on molecular marker discovery, its application, and future perspectives for plant genotyping through pangenome assemblies. Included are descriptions of automated methods for genome and sequence distance estimation, genome contaminant analysis in sequence reads, genome structural variation, and SNP discovery methods.
|
16
|
Cline E, Wisittipanit N, Boongoen T, Chukeatirote E, Struss D, Eungwanichayapant A. Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data. PeerJ 2020; 8:e10501. [PMID: 33354434 PMCID: PMC7727374 DOI: 10.7717/peerj.10501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/30/2020] [Accepted: 11/15/2020] [Indexed: 12/30/2022] Open
Abstract
Background Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability that a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. Materials and Methods Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results compared. Results Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95, and was largely unchanged before and after mapping score recalibration.
Conclusion Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
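The step of re-computing MAPQ from an estimated misalignment probability can be sketched in Python as follows. The Phred-scaled definition (MAPQ = -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer) comes from the SAM specification; the function names and the cap of 60 are assumptions of this sketch, not details taken from the study.

```python
import math


def mapq_from_error_prob(p_wrong: float, cap: int = 60) -> int:
    """Phred-scale a misalignment probability into a MAPQ score.

    `cap` is an assumed upper bound; aligners differ in the maximum they emit.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_wrong == 0.0:
        return cap  # log10(0) is undefined, so saturate at the cap
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_prob_from_mapq(mapq: int) -> float:
    """Inverse transform: the misalignment probability implied by a MAPQ score."""
    return 10 ** (-mapq / 10)


# A model output of 0.001 (one-in-a-thousand chance of misalignment) maps to MAPQ 30.
print(mapq_from_error_prob(0.001))  # 30
print(error_prob_from_mapq(20))     # 0.01
```

In a recalibration workflow of the kind described, a classifier's per-read probability would be passed through a transform like `mapq_from_error_prob` before the alignment records are rewritten.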
Affiliation(s)
- Eliot Cline
  - School of Science, Mae Fah Luang University, Amphur Muang, Chiang Rai, Thailand
  - Department of Biotechnology, East West Seed Company, San Sai, Chiang Mai, Thailand
- Tossapon Boongoen
  - Center of Excellence in AI and Emerging Technologies, School of Information Technology, Mae Fah Luang University, Amphur Muang, Chiang Rai, Thailand
- Darush Struss
  - Department of Biotechnology, East West Seed Company, San Sai, Chiang Mai, Thailand
|
18
|
Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep 2020; 10:20222. [PMID: 33214604 PMCID: PMC7678823 DOI: 10.1038/s41598-020-77218-4] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 03/30/2020] [Accepted: 11/02/2020] [Indexed: 12/30/2022] Open
Abstract
Advances in next-generation sequencing technology have enabled whole genome sequencing (WGS) to be widely used for identification of causal variants in a spectrum of genetic-related disorders, and provided new insight into how genetic polymorphisms affect disease phenotypes. The development of different bioinformatics pipelines has continuously improved the variant analysis of WGS data. However, there is a necessity for a systematic performance comparison of these pipelines to provide guidance on the application of WGS-based scientific and clinical genomics. In this study, we evaluated the performance of three variant calling pipelines (GATK, DRAGEN and DeepVariant) using the Genome in a Bottle Consortium, "synthetic-diploid" and simulated WGS datasets. DRAGEN and DeepVariant show better accuracy in SNP and indel calling, with no significant differences in their F1-score. DRAGEN platform offers accuracy, flexibility and a highly-efficient execution speed, and therefore superior performance in the analysis of WGS data on a large scale. The combination of DRAGEN and DeepVariant also suggests a good balance of accuracy and efficiency as an alternative solution for germline variant detection in further applications. Our results facilitate the standardization of benchmarking analysis of bioinformatics pipelines for reliable variant detection, which is critical in genetics-based medical research and clinical applications.
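The F1-score used to compare pipelines in the abstract above combines precision and recall against a truth set. The Python sketch below shows the arithmetic on hypothetical variants keyed by (chromosome, position, ref, alt); it uses naive exact matching, whereas dedicated benchmarking tools normalise variant representations first, so it should be read as an illustration of the metric only.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact-match comparison."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical truth set of five variants and a call set with three hits and one false call.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 400, "G", "A"),
         ("chr2", 50, "T", "C"), ("chr2", 900, "AT", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 50, "T", "C"),
         ("chr2", 777, "G", "C")}

precision, recall, f1 = benchmark(calls, truth)
print(precision, recall, round(f1, 4))  # 0.75 0.6 0.6667
```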
Affiliation(s)
- Sen Zhao
  - Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway
- Abdulrahman Azab
  - Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway
  - Division of Research Computing, University Center for Information Technology (USIT), University of Oslo, 0316, Oslo, Norway
- Tomasz Stokowy
  - Computational Biology Unit, Institute of Informatics, University of Bergen, 5008, Bergen, Norway
  - Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
- Eivind Hovig
  - Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway
  - Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway
|
19
|
Liang J, Zhao W, Lu C, Liu D, Li P, Ye X, Zhao Y, Zhang J, Yang D. Next-Generation Sequencing Analysis of ctDNA for the Detection of Glioma and Metastatic Brain Tumors in Adults. Front Neurol 2020; 11:544. [PMID: 32973641 PMCID: PMC7473301 DOI: 10.3389/fneur.2020.00544] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/05/2020] [Accepted: 05/14/2020] [Indexed: 12/12/2022] Open
Abstract
Background and aims: Next-generation sequencing technologies and their related assessments of circulating tumor DNA (ctDNA) in both glioma and metastatic brain tumors remain largely limited. Methods: This retrospective, single-center study was conducted largely according to a protocol approved by the institutional review board at Peking University International Hospital. Genomic DNA was extracted from blood samples or tumor tissues. Sequencing was performed on a NextSeq 500 instrument (Illumina) with an average coverage of 550-fold. Paired-end sequencing was employed to improve the sensitivity of duplicate detection and thereby increase the reliability of detecting possible fusions. Results: A total of 28 patients (21 men and 7 women) with brain tumors were involved in the study. The patients were assigned to two groups: a glioma group (n = 21) and a metastatic brain tumor group (n = 7). The mean age of the metastatic brain tumor group (59.86 ± 8.85 y), (43.65 ± 13.05 y) was significantly higher than that of the glioma group (45.3 ± 12.3 years) (P < 0.05). The mutant genes in the metastatic brain tumor group included ALK, MDM2, ATM, BRCA1, FGFR1, MDM4 and KRAS; however, no glioma-related mutant genes such as MGMT, IDH1, IDH2, 1p/19q, and BRAF were found. Interestingly, blood ctDNA was detected in only two patients (28.3%) in the metastatic brain tumor group; in contrast, blood ctDNA was found in ten glioma patients (47.6%), including 1p/19q, MDM2, ERBB2, IDH1, CDKN2A, CDK4, PDGFRA, CCNE1, and MET. The IDH mutations characterized in the glioma group included IDH1 mutation (p.R132H) and IDH2 mutation (p.R172K). The mutation rate of IDH in tumor tissues was 37.06 ± 8.32%, which was significantly higher than in blood samples (P < 0.05). Conclusion: The present study demonstrated that the mutant genes in glioma and metastatic brain tumors differ.
Moreover, the ctDNAs in the metastatic brain tumors included ALK and MDM2, and glioma-related ctDNAs included 1p/19q and MDM2, followed in frequency by ERBB2, IDH1, CDKN2A, CDK4, PDGFRA, CCNE1, and MET. These ctDNAs might be biomarkers and therapeutic responders in brain tumors.
Affiliation(s)
- Jianfeng Liang
  - Department of Neurosurgery, Peking University International Hospital, Beijing, China
- Wanni Zhao
  - Department of General Surgery, Beijing Hospital, National Center of Gerontology, Beijing, China
- Changyu Lu
  - Department of Neurosurgery, Peking University International Hospital, Beijing, China
- Danni Liu
  - HaploX Biotechnology, Shenzhen, China
- Ping Li
  - Department of Hematology, Tongji Hospital of Tongji University, Shanghai, China
- Xun Ye
  - Department of Neurosurgery, Peking University International Hospital, Beijing, China
- Yuanli Zhao
  - Department of Neurosurgery, Peking University International Hospital, Beijing, China
  - Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Dong Yang
  - Department of Neurosurgery, China-Japan Friendship Hospital, Beijing, China
  - The 2nd People's Hospital of Tibet Autonomous Region, Lhasa, China
|
20
|
Yao Z, You FM, N'Diaye A, Knox RE, McCartney C, Hiebert CW, Pozniak C, Xu W. Evaluation of variant calling tools for large plant genome re-sequencing. BMC Bioinformatics 2020; 21:360. [PMID: 32807073 PMCID: PMC7430858 DOI: 10.1186/s12859-020-03704-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 05/07/2020] [Accepted: 07/28/2020] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Discovering single nucleotide polymorphisms (SNPs) from agricultural crop genome sequences has been a widely used strategy for developing genetic markers for several applications, including marker-assisted breeding, population diversity studies for eco-geographical adaptation, genotyping crop germplasm collections, and others. Accurately detecting SNPs from large polyploid crop genomes such as wheat is crucial and challenging. A few variant calling methods have been previously developed, but they show low concordance between their variant calls. A gold standard variant set generated from one human individual sample has been established for evaluating variant calling tools; however, no gold standard crop variant set is yet available for wheat. The intent of this study was to evaluate seven SNP variant calling tools (FreeBayes, GATK, Platypus, Samtools/mpileup, SNVer, VarScan, VarDict) with the two most popular mapping tools (BWA-mem and Bowtie2) on whole exome capture (WEC) re-sequencing data from allohexaploid wheat. RESULTS We found the BWA-mem mapping tool had both a higher mapping rate and a higher accuracy rate than Bowtie2. With the same mapping quality (MQ) cutoff, BWA-mem detected more variant bases in mapping reads than Bowtie2. Preprocessing the reads with quality trimming or duplicate removal did not significantly affect the final mapping performance in terms of mapped reads. Based on the concordance and receiver operating characteristic (ROC), the Samtools/mpileup variant calling tool with BWA-mem mapping of raw sequence reads outperformed the other combinations, followed by FreeBayes and GATK, in terms of specificity and sensitivity. VarDict and VarScan were the poorest performing variant calling tools with the wheat WEC sequence data.
CONCLUSION The BWA-mem and Samtools/mpileup pipeline, with no need to preprocess the raw read data before mapping onto the reference genome, was found to be optimal for SNP calling in complex wheat genome re-sequencing. These results also provide useful guidelines for reliable variant identification from deep sequencing of other large polyploid crop genomes.
Affiliation(s)
- Zhen Yao
  - Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Route 100, Morden, Manitoba, R6M 1Y5, Canada
- Frank M You
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada
- Amidou N'Diaye
  - Department of Plant Sciences, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 5A8, Canada
- Ron E Knox
  - Swift Current Research and Development Centre, Agriculture and Agri-Food Canada, Box 1030, Swift Current, Saskatchewan, S9H 3X2, Canada
- Curt McCartney
  - Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Route 100, Morden, Manitoba, R6M 1Y5, Canada
- Colin W Hiebert
  - Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Route 100, Morden, Manitoba, R6M 1Y5, Canada
- Curtis Pozniak
  - Department of Plant Sciences, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 5A8, Canada
- Wayne Xu
  - Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Route 100, Morden, Manitoba, R6M 1Y5, Canada
|
21
|
Schumer M, Powell DL, Corbett-Detig R. Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer. Mol Ecol Resour 2020; 20:1141-1151. [PMID: 32324964 PMCID: PMC7384932 DOI: 10.1111/1755-0998.13175] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 12/02/2019] [Revised: 03/09/2020] [Accepted: 04/15/2020] [Indexed: 12/13/2022]
Abstract
It has become clear that hybridization between species is much more common than previously recognized. As a result, we now know that the genomes of many modern species, including our own, are a patchwork of regions derived from past hybridization events. Increasingly researchers are interested in disentangling which regions of the genome originated from each parental species using local ancestry inference methods. Due to the diverse effects of admixture, this interest is shared across disparate fields, from human genetics to research in ecology and evolutionary biology. However, local ancestry inference methods are sensitive to a range of biological and technical parameters which can impact accuracy. Here we present paired simulation and ancestry inference pipelines, mixnmatch and ancestryinfer, to help researchers plan and execute local ancestry inference studies. mixnmatch can simulate arbitrarily complex demographic histories in the parental and hybrid populations, selection on hybrids, and technical variables such as coverage and contamination. ancestryinfer takes as input sequencing reads from simulated or real individuals, and implements an efficient local ancestry inference pipeline. We perform a series of simulations with mixnmatch to pinpoint factors that influence accuracy in local ancestry inference and highlight useful features of the two pipelines. mixnmatch is a powerful tool for simulations of hybridization while ancestryinfer facilitates local ancestry inference on real or simulated data.
Affiliation(s)
- Molly Schumer
  - Department of Biology, Stanford University
  - Centro de Investigaciones Científicas de las Huastecas “Aguazarca”
  - Hanna H. Gray Fellow, Howard Hughes Medical Institute
- Daniel L. Powell
  - Department of Biology, Stanford University
  - Centro de Investigaciones Científicas de las Huastecas “Aguazarca”
  - Department of Biology, Texas A&M University
- Russ Corbett-Detig
  - Genomics Institute, University of California, Santa Cruz
  - Department of Biomolecular Engineering, University of California, Santa Cruz
|
22
|
Dissanayake R, Braich S, Cogan NOI, Smith K, Kaur S. Characterization of Genetic and Allelic Diversity Amongst Cultivated and Wild Lentil Accessions for Germplasm Enhancement. Front Genet 2020; 11:546. [PMID: 32587602 PMCID: PMC7298104 DOI: 10.3389/fgene.2020.00546] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/09/2019] [Accepted: 05/06/2020] [Indexed: 12/13/2022] Open
Abstract
Intensive breeding of cultivated lentil has resulted in a relatively narrow genetic base, which limits the options to increase crop productivity through selection. Assessment of genetic diversity in the wild gene pool of lentil, as well as characterization of useful and novel alleles/genes that can be introgressed into elite germplasm, presents new opportunities and pathways for germplasm enhancement, followed by successful crop improvement. In the current study, a lentil collection consisting of 467 wild and cultivated accessions that originated from 10 diverse geographical regions was assessed, to understand genetic relationships among different lentil species/subspecies. A total of 422,101 high-confidence SNP markers were identified against the reference lentil genome (cv. CDC Redberry). Phylogenetic analysis clustered the germplasm collection into four groups, namely, Lens culinaris/Lens orientalis, Lens lamottei/Lens odemensis, Lens ervoides, and Lens nigricans. A weak correlation was observed between geographical origin and genetic relationship, except for some accessions of L. culinaris and L. ervoides. Genetic distance matrices revealed a comparable level of variation within the gene pools of L. culinaris (Nei’s coefficient 0.01468–0.71163), L. ervoides (Nei’s coefficient 0.01807–0.71877), and L. nigricans (Nei’s coefficient 0.02188–1.2219). In order to understand any genic differences at species/subspecies level, allele frequencies were calculated from a subset of 263 lentil accessions. Among all cultivated and wild lentil species, L. nigricans exhibited the greatest allelic differentiation across the genome compared to all other species/subspecies. Major differences were observed on six genomic regions with the largest being on Chromosome 1 (c. 1 Mbp). These results indicate that L. nigricans is the most distantly related to L. culinaris and additional structural variations are likely to be identified from genome sequencing studies. 
This would provide further insights into evolutionary relationships between cultivated and wild lentil germplasm, for germplasm improvement and introgression.
Affiliation(s)
- Ruwani Dissanayake
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
  - Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, VIC, Australia
- Shivraj Braich
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
  - School of Applied Systems Biology, La Trobe University, Melbourne, VIC, Australia
- Noel O I Cogan
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
  - School of Applied Systems Biology, La Trobe University, Melbourne, VIC, Australia
- Kevin Smith
  - Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, VIC, Australia
  - Agriculture Victoria, Hamilton, VIC, Australia
- Sukhjiwan Kaur
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
|
23
|
Özdemir Özdoğan G, Kaya H. Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data. Interdiscip Sci 2020; 12:302-310. [PMID: 32519123 DOI: 10.1007/s12539-020-00374-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2019] [Revised: 04/26/2020] [Accepted: 05/22/2020] [Indexed: 12/31/2022]
Abstract
Next-generation sequencing (NGS), a massively parallel or deep deoxyribonucleic acid (DNA) sequencing technology, has revolutionized genomic research in recent years. Although the cost of generating NGS data has decreased since the technology first emerged, it can still be a problem. Hence, new strategies such as pool-seq and low-coverage NGS have been developed to reduce cost. Despite the lower cost, it is important to establish whether these strategies are efficient in NGS studies. We applied a bioinformatics pipeline to pool-seq and low-coverage retinoblastoma data obtained from tumor samples only. Retinoblastoma is a childhood eye malignancy that is initiated by RB1 mutation or MYCN amplification and can lead to loss of vision in one or both eyes, and sometimes to loss of life. We applied our pipeline to the retinoblastoma data and to two other datasets, both to test its validity and to compare performance. High-confidence variant calls from the Genome in a Bottle Consortium were used for these purposes. We observed that our pipeline called a higher number of variants than a standard pipeline for all three datasets. In addition, recall and F-score values were notably better with our pipeline. We further present our results on the disease data in terms of variants, variant types, and disease-related genes. This study provides a guideline for running an NGS data analysis pipeline on pool-seq and low-coverage sequencing data in conjunction. To obtain more conclusive outcomes for these two strategies, we recommend using cancer data with higher mutation rates and larger pools.
Affiliation(s)
- Hilal Kaya
- Department of Computer Engineering, Ankara Yildirim Beyazit University, 06010, Ankara, Turkey.
|
24
|
Tattini L, Tellini N, Mozzachiodi S, D'Angiolo M, Loeillet S, Nicolas A, Liti G. Accurate Tracking of the Mutational Landscape of Diploid Hybrid Genomes. Mol Biol Evol 2020; 36:2861-2877. [PMID: 31397846 PMCID: PMC6878955 DOI: 10.1093/molbev/msz177] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/22/2022] Open
Abstract
Mutations, recombinations, and genome duplications may promote genetic diversity and trigger evolutionary processes. However, quantifying these events in diploid hybrid genomes is challenging. Here, we present an integrated experimental and computational workflow to accurately track the mutational landscape of yeast diploid hybrids (MuLoYDH) in terms of single-nucleotide variants, small insertions/deletions, copy-number variants, aneuploidies, and loss-of-heterozygosity. Pairs of haploid Saccharomyces parents were combined to generate ancestor hybrids with phased genomes and varying levels of heterozygosity. These diploids were evolved under different laboratory protocols, in particular mutation accumulation experiments. Variant simulations enabled the efficient integration of competitive and standard mapping of short reads, depending on local levels of heterozygosity. Experimental validations proved the high accuracy and resolution of our computational approach. Finally, applying MuLoYDH to four different diploids revealed striking genetic background effects. Homozygous Saccharomyces cerevisiae showed a ∼4-fold higher mutation rate compared with its closely related species S. paradoxus. Intraspecies hybrids unveiled that a substantial fraction of the genome (∼250 bp per generation) was shaped by loss-of-heterozygosity, a process strongly inhibited in interspecies hybrids by high levels of sequence divergence between homologous chromosomes. In contrast, interspecies hybrids exhibited higher single-nucleotide mutation rates compared with intraspecies hybrids. MuLoYDH provided an unprecedented quantitative insight into the evolutionary processes that mold diploid yeast genomes and can be generalized to other genetic systems.
Affiliation(s)
- Lorenzo Tattini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
- Nicolò Tellini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
- Sophie Loeillet
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
- Alain Nicolas
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
- Gianni Liti
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
|
25
|
Iquebal MA, Sharma P, Jasrotia RS, Jaiswal S, Kaur A, Saroha M, Angadi UB, Sheoran S, Singh R, Singh GP, Rai A, Tiwari R, Kumar D. RNAseq analysis reveals drought-responsive molecular pathways with candidate genes and putative molecular markers in root tissue of wheat. Sci Rep 2019; 9:13917. [PMID: 31558740 PMCID: PMC6763491 DOI: 10.1038/s41598-019-49915-2] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 10/16/2018] [Accepted: 08/12/2019] [Indexed: 01/08/2023] Open
Abstract
Drought is one of the major impediments to wheat productivity. Traditional breeding and marker-assisted QTL introgression have had limited success. Available wheat genomic and RNA-seq data can be used to decipher novel drought tolerance mechanisms and to discover putative candidate genes and markers. Drought is first sensed by root tissue, but limited information is available about how roots respond to drought stress. To address this, two contrasting genotypes, NI5439 41 (drought tolerant) and WL711 (drought susceptible), were used to generate ~78.2 GB of data on the responses of wheat roots to drought. A total of 45,139 DEGs, 13,820 TFs, 288 miRNAs, 640 pathways and 435,829 putative markers were obtained. The study demonstrates the use of such data in QTL-to-QTN refinement through analysis of two model drought-responsive QTLs on chromosome 3B in wheat roots, which contain 18 differentially regulated genes with 190 sequence variants (173 SNPs and 17 InDels). Gene regulatory networks showed 69 hub genes integrating ABA-dependent and ABA-independent pathways controlling drought sensing, root growth, uptake regulation, purine metabolism, thiamine metabolism and antibiotics pathways, stomatal closure and senescence. Eleven SSR markers were validated in a panel of 18 diverse wheat varieties. To support future use of these findings, web genomic resources were developed. We report an RNA-Seq approach on wheat roots that describes drought response mechanisms under field drought conditions, along with genomic resources, in support of efforts to improve wheat productivity.
Affiliation(s)
- Mir Asif Iquebal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India
- Pradeep Sharma
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- Rahul Singh Jasrotia
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India
- Sarika Jaiswal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India
- Amandeep Kaur
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- Monika Saroha
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- U B Angadi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India
- Sonia Sheoran
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- Rajender Singh
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- G P Singh
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India
- Anil Rai
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India
- Ratan Tiwari
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, Haryana, 132001, India.
- Dinesh Kumar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012, India.
|
26
|
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019; 10:3240. [PMID: 31324872 PMCID: PMC6642177 DOI: 10.1038/s41467-019-11146-4] [Citation(s) in RCA: 142] [Impact Index Per Article: 28.4] [Received: 09/14/2018] [Accepted: 06/26/2019] [Indexed: 01/12/2023] Open
Abstract
In recent years, many software packages for identifying structural variants (SVs) using whole-genome sequencing data have been released. When published, a new method is commonly compared with those already available, but this tends to be selective and incomplete. The lack of comprehensive benchmarking of methods presents challenges for users in selecting methods and for developers in understanding algorithm behaviours and limitations. Here we report the comprehensive evaluation of 10 SV callers, selected following a rigorous process and spanning the breadth of detection approaches, using high-quality reference cell lines, as well as simulations. Due to the nature of available truth sets, our focus is on general-purpose rather than somatic callers. We characterise the impact on performance of event size and type, sequencing characteristics, and genomic context, and analyse the efficacy of ensemble calling and calibration of variant quality scores. Finally, we provide recommendations for both users and methods developers. A number of computational methods have been developed for calling structural variants (SVs) using short read sequencing data. Here, the authors perform a comprehensive benchmarking analysis comparing 10 general-purpose callers and provide recommendations for both users and methods developers.
|
27
|
Chen J, Li X, Zhong H, Meng Y, Du H. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Sci Rep 2019; 9:9345. [PMID: 31249349 PMCID: PMC6597787 DOI: 10.1038/s41598-019-45835-3] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 01/18/2019] [Accepted: 06/12/2019] [Indexed: 12/17/2022] Open
Abstract
The development of next-generation sequencing (NGS) and the associated analysis tools has gained popularity in scientific research and clinical diagnostic applications. Hence, a systematic comparison of sequencing platforms and variant calling pipelines could provide significant guidance for NGS-based scientific and clinical genomics. In this study, we compared the performance, concordance and operating efficiency of 27 combinations of sequencing platforms and variant calling pipelines, testing three variant calling pipelines (Genome Analysis Tool Kit HaplotypeCaller, Strelka2 and Samtools-Varscan2) on nine datasets for the NA12878 genome sequenced on different platforms, including BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the 12 combinations in WES datasets, all displayed good performance in calling SNPs, with F-scores all higher than 0.96, while their performance in calling INDELs varied from 0.75 to 0.91. All 15 combinations in WGS datasets also showed good performance, with F-scores in calling SNPs all higher than 0.975 and performance in calling INDELs varying from 0.71 to 0.93. All of these combinations showed high concordance in variant identification, although the divergence in variant identification was larger in WGS datasets than in WES datasets. We also down-sampled the original WES and WGS datasets to a series of coverage gradients across multiple platforms, and then measured the time consumed by variant calling for each of the three pipelines at each coverage. For the GIAB datasets on both BGI and Illumina platforms, Strelka2 outperformed the other two pipelines in detection accuracy and processing efficiency on each sequencing platform, and is therefore recommended for the further promotion and application of next-generation sequencing technology.
The results of our research will provide useful and comprehensive guidelines for individual researchers and organizations seeking reliable and consistent variant identification.
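The F-scores quoted in this abstract can be read as the harmonic mean of precision and recall against a truth set. The sketch below shows that calculation under the usual F1 definition; the function name and the counts are hypothetical and are not taken from the study, and the authors' exact benchmarking procedure is not described here.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall)."""
    if true_positives == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the study).
example = f_score(true_positives=960, false_positives=40, false_negatives=40)
print(round(example, 3))
```

With equal numbers of false positives and false negatives, precision and recall coincide, so the F-score equals both.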
Affiliation(s)
- Jiayun Chen
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
- Xingsong Li
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
- Hongbin Zhong
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
- Yuhuan Meng
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China.
- Hongli Du
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China.
|
28
|
Wright B, Farquharson KA, McLennan EA, Belov K, Hogg CJ, Grueber CE. From reference genomes to population genomics: comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species. BMC Genomics 2019; 20:453. [PMID: 31159724 PMCID: PMC6547446 DOI: 10.1186/s12864-019-5806-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 11/05/2018] [Accepted: 05/17/2019] [Indexed: 11/13/2022] Open
Abstract
Background Recent advances in genomics have greatly increased research opportunities for non-model species. For wildlife, a growing availability of reference genomes means that population genetics is no longer restricted to a small set of anonymous loci. When used in conjunction with a reference genome, reduced-representation sequencing (RRS) provides a cost-effective method for obtaining reliable diversity information for population genetics. Many software tools have been developed to process RRS data, though few studies of non-model species incorporate genome alignment in calling loci. A commonly-used RRS analysis pipeline, Stacks, has this capacity and so it is timely to compare its utility with existing software originally designed for alignment and analysis of whole genome sequencing data. Here we examine population genetic inferences from two species for which reference-aligned reduced-representation data have been collected. Our two study species are a threatened Australian marsupial (Tasmanian devil Sarcophilus harrisii; declining population) and an Arctic-circle migrant bird (pink-footed goose Anser brachyrhynchus; expanding population). Analyses of these data are compared using Stacks versus two widely-used genomics packages, SAMtools and GATK. We also introduce a custom R script to improve the reliability of single nucleotide polymorphism (SNP) calls in all pipelines and conduct population genetic inferences for non-model species with reference genomes. Results Although we identified orders of magnitude fewer SNPs in our devil dataset than for goose, we found remarkable symmetry between the two species in our assessment of software performance. For both datasets, all three methods were able to delineate population structure, even with varying numbers of loci. For both species, population structure inferences were influenced by the percent of missing data. 
Conclusions For studies of non-model species with a reference genome, we recommend combining Stacks output with further filtering (as included in our R pipeline) for population genetic studies, paying particular attention to potential impact of missing data thresholds. We recognise SAMtools as a viable alternative for researchers more familiar with this software. We caution against the use of GATK in studies with limited computational resources or time. Electronic supplementary material The online version of this article (10.1186/s12864-019-5806-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Belinda Wright
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia
- Katherine A Farquharson
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia
- Elspeth A McLennan
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia
- Katherine Belov
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia
- Carolyn J Hogg
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia
- Catherine E Grueber
- Faculty of Science, The University of Sydney, School of Life and Environmental Sciences, Sydney, Australia; San Diego Zoo Global, San Diego, USA.
|
29
|
Abstract
Background DNA methylation is an epigenetic event that may regulate gene expression. Because of this regulation role, aberrant DNA methylation is often associated with many diseases. Within-sample DNA co-methylation is the similarity of methylation in nearby cytosine sites of a chromosome. It is important to study co-methylation patterns. However, it is not well studied yet, and it is unclear to us what co-methylation patterns normal DNA samples have. Are the co-methylation patterns of the same tissue across several samples different? Are the co-methylation patterns of various tissues of the same sample different? To answer these questions, we conduct analyses using two sets of data: 3-sample-1-tissue (3S1T) and 1-sample-8-tissue (1S8T). Results To study the co-methylation patterns of the two datasets, 3S1T and 1S8T, we investigate the following questions: How often does one methylation state change to other methylation states and how is this change associated with chromosome distance? Based on the 3S1T data, we find there is not significant co-methylation difference among the same spleen tissues of three different samples. However, the analysis results of 1S8T data show that there were significant differences among eight tissues of one sample. For both 3S1T and 1S8T data, we find that the no/low methylation state A and high/full methylation state D tend to remain the same along a chromosome region. We also find that the low/partial methylation state B and partial/high methylation state C tend to change to higher methylation states along a chromosome. Finally, we find that lengths of most co-methylation regions are very short with only a few hundred base pairs. In fact, only a small proportion of methylated regions are longer than 1000 base pairs. Conclusions In this paper, we have addressed a few questions regarding within-sample co-methylation patterns in normal tissues. 
Our statistical analysis results and answers may help researchers to better understand the biological process of DNA methylation. This may pave the way to develop better analysis methods for future methylation research. Electronic supplementary material The online version of this article (10.1186/s13040-019-0198-8) contains supplementary material, which is available to authorized users.
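The question "how often does one methylation state change to other methylation states" can be pictured as a table of transition proportions between adjacent cytosine sites. The sketch below is a minimal illustration that assumes each site has already been labelled A, B, C or D as in the abstract; the function name and the toy sequence are hypothetical, the thresholds defining the states are not given here, and the distance between sites (which the study also considers) is ignored.

```python
from collections import Counter

# State labels as named in the abstract:
# A: no/low, B: low/partial, C: partial/high, D: high/full methylation.
STATES = "ABCD"


def transition_proportions(states):
    """Row-normalised counts of state changes between adjacent cytosine sites."""
    counts = Counter(zip(states, states[1:]))
    table = {}
    for src in STATES:
        row_total = sum(counts[(src, dst)] for dst in STATES)
        table[src] = {
            dst: (counts[(src, dst)] / row_total if row_total else 0.0)
            for dst in STATES
        }
    return table


# Hypothetical state sequence along one chromosome region (illustration only).
toy_states = "AAABCDDDD"
table = transition_proportions(toy_states)
for src in STATES:
    print(src, table[src])
```

In this toy sequence, states A and D mostly stay the same while B and C move to a higher state, which mirrors the qualitative pattern the abstract describes without reproducing any of its data.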
|
30
|
Veeckman E, Van Glabeke S, Haegeman A, Muylle H, van Parijs FRD, Byrne SL, Asp T, Studer B, Rohde A, Roldán-Ruiz I, Vandepoele K, Ruttink T. Overcoming challenges in variant calling: exploring sequence diversity in candidate genes for plant development in perennial ryegrass (Lolium perenne). DNA Res 2019; 26:1-12. [PMID: 30325414 PMCID: PMC6379033 DOI: 10.1093/dnares/dsy033] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/26/2018] [Accepted: 09/06/2018] [Indexed: 11/13/2022] Open
Abstract
Revealing DNA sequence variation within the Lolium perenne genepool is important for genetic analysis and development of breeding applications. We reviewed current literature on plant development to select candidate genes in pathways that control agronomic traits, and identified 503 orthologues in L. perenne. Using targeted resequencing, we constructed a comprehensive catalogue of genomic variation for a L. perenne germplasm collection of 736 genotypes derived from current cultivars, breeding material and wild accessions. To overcome challenges of variant calling in heterogeneous outbreeding species, we used two complementary strategies to explore sequence diversity. First, four variant calling pipelines were integrated with the VariantMetaCaller to reach maximal sensitivity. Additional multiplex amplicon sequencing was used to empirically estimate an appropriate precision threshold. Second, a de novo assembly strategy was used to reconstruct divergent alleles for each gene. The advantage of this approach was illustrated by discovery of 28 novel alleles of LpSDUF247, a polymorphic gene co-segregating with the S-locus of the grass self-incompatibility system. Our approach is applicable to other genetically diverse outbreeding species. The resulting collection of functionally annotated variants can be mined for variants causing phenotypic variation, either through genetic association studies, or by selecting carriers of rare defective alleles for physiological analyses.
Affiliation(s)
- Elisabeth Veeckman
- ILVO, Plant Sciences Unit, B Melle, Belgium; Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium
- Torben Asp
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Research Center Flakkebjerg Aarhus University, DK Slagelse, Denmark
- Bruno Studer
- Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, CH Zurich, Switzerland
- Isabel Roldán-Ruiz
- ILVO, Plant Sciences Unit, B Melle, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium
- Klaas Vandepoele
- Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium; Center for Plant Systems Biology, VIB, B Ghent, Belgium
- Tom Ruttink
- ILVO, Plant Sciences Unit, B Melle, Belgium; Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium
|
31
|
Jaiswal S, Jadhav PV, Jasrotia RS, Kale PB, Kad SK, Moharil MP, Dudhare MS, Kheni J, Deshmukh AG, Mane SS, Nandanwar RS, Penna S, Manjaya JG, Iquebal MA, Tomar RS, Kawar PG, Rai A, Kumar D. Transcriptomic signature reveals mechanism of flower bud distortion in witches'-broom disease of soybean (Glycine max). BMC Plant Biol 2019; 19:26. [PMID: 30646861 PMCID: PMC6332543 DOI: 10.1186/s12870-018-1601-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 06/04/2018] [Accepted: 12/12/2018] [Indexed: 05/10/2023]
Abstract
BACKGROUND Soybean (Glycine max L. Merril) is a major source of edible oil and protein for humans and animals, in addition to its various industrial uses, including biofuels. Phytoplasma-induced floral bud distortion syndrome (FBD), also known as witches' broom syndrome (WBS), has been one of the major biotic stresses adversely affecting its productivity. A transcriptomic approach can be used to discover how this disease manifests through key morpho-physiological pathways. RESULTS We report a transcriptomic study of FBD in soybean using Illumina HiSeq NGS data, revealing 17,454 differentially expressed genes, 5561 transcription factors, 139 pathways and 176,029 putative genic-region markers (simple sequence repeats, single nucleotide polymorphisms and insertions/deletions). The roles of PmbA, Zn-dependent protease, the SAP family and the auxin-responsive system are described, revealing the mechanism of flower bud distortion, which involves abnormalities in pollen and stigma development. Ten randomly selected genes were validated by qPCR. Our findings describe the basic mechanism of FBD disease, from sensing of phytoplasma infection by the host plant, which triggers molecular signalling, to mobilization of carbohydrate and protein, phyllody, abnormal pollen development, and improved colonization of host plants by insects that spread the disease. The study reveals how phytoplasma hijacks the metabolic machinery of soybean to manifest FBD. CONCLUSIONS This is the first report of a transcriptomic signature of FBD or WBS disease of soybean, revealing morphological and metabolic changes that attract insects and so spread the disease. All the putative genic-region markers may be used as a genomic resource for variety improvement and new agro-chemical development for disease control, to enhance soybean productivity.
Affiliation(s)
- Sarika Jaiswal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012 India
- Pravin V. Jadhav
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Rahul Singh Jasrotia
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012 India
- Prashant B. Kale
- National Research Centre on Plant Biotechnology, LBS Centre, PUSA Campus, New Delhi, 110012 India
- Snehal K. Kad
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Mangesh P. Moharil
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Mahendra S. Dudhare
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Jashminkumar Kheni
- Department of Biotechnology, Junagadh Agricultural University, Junagadh, Gujarat India
- Amit G. Deshmukh
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Shyamsundar S. Mane
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Ravindra S. Nandanwar
- Post Graduate Institute, Dr. Panjabrao Deshmukh Krishi Vidyapeeth, Akola, Maharashtra, 444104 India
- Suprasanna Penna
- Nuclear Agriculture and Biotechnology Division, Homi Bhabha National Institute, Bhabha Atomic Research Centre (BARC), Trombay, Mumbai, 400 085 India
- Joy G. Manjaya
- Nuclear Agriculture and Biotechnology Division, Homi Bhabha National Institute, Bhabha Atomic Research Centre (BARC), Trombay, Mumbai, 400 085 India
- Mir Asif Iquebal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012 India
- Rukam Singh Tomar
- Department of Biotechnology, Junagadh Agricultural University, Junagadh, Gujarat India
- Prashant G. Kawar
- ICAR- Directorate of Floricultural Research, College of Agriculture, Pune, Maharashtra, 411 005, India
- Anil Rai
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012 India
- Dinesh Kumar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, PUSA, New Delhi, 110012 India
|
32
|
Malmberg MM, Barbulescu DM, Drayton MC, Shinozuka M, Thakur P, Ogaji YO, Spangenberg GC, Daetwyler HD, Cogan NOI. Evaluation and Recommendations for Routine Genotyping Using Skim Whole Genome Re-sequencing in Canola. Front Plant Sci 2018; 9:1809. [PMID: 30581450 PMCID: PMC6292936 DOI: 10.3389/fpls.2018.01809] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 07/09/2018] [Accepted: 11/21/2018] [Indexed: 05/25/2023]
Abstract
Whole genome sequencing offers genome wide, unbiased markers, and inexpensive library preparation. With the cost of sequencing decreasing rapidly, many plant genomes of modest size are amenable to skim whole genome resequencing (skim WGR). The use of skim WGR in diverse sample sets without the use of imputation was evaluated in silico in 149 canola samples representative of global diversity. Fastq files with an average of 10x coverage of the reference genome were used to generate skim samples representing 0.25x, 0.5x, 1x, 2x, 3x, 4x, and 5x sequencing coverage. Applying a pre-defined list of SNPs versus de novo SNP discovery was evaluated. As skim WGR is expected to result in some degree of insufficient allele sampling, all skim coverage levels were filtered at a range of minimum read depths from a relaxed minimum read depth of 2 to a stringent read depth of 5, resulting in 28 list-based SNP sets. As a broad recommendation, genotyping pre-defined SNPs between 1x and 2x coverage with relatively stringent depth filtering is appropriate for a diverse sample set of canola due to a balance between marker number, sufficient accuracy, and sequencing cost, but depends on the intended application. This was experimentally examined in two sample sets with different genetic backgrounds: 1x coverage of 1,590 individuals from 84 Australian spring type four-parent crosses aimed at maximizing diversity as well as one commercial F1 hybrid, and 2x coverage of 379 doubled haploids (DHs) derived from a subset of the four-parent crosses. To determine optimal coverage in a simpler genetic background, the DH sample sequence coverage was further down sampled in silico. The flexible and cost-effective nature of the protocol makes it highly applicable across a range of species and purposes.
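The count of 28 list-based SNP sets follows from crossing the seven skim coverage levels with the minimum read depth filters. The sketch below enumerates that grid; the coverage values come from the abstract, while the integer depth steps 2, 3, 4 and 5 are an assumption (the abstract states only the range, and four steps is the number consistent with 28 sets).

```python
from itertools import product

# Skim coverage levels named in the abstract.
coverages = [0.25, 0.5, 1, 2, 3, 4, 5]

# Assumed integer steps from the relaxed (2) to the stringent (5) minimum read depth.
min_depths = [2, 3, 4, 5]

# One SNP set per (coverage, minimum depth) combination.
snp_sets = list(product(coverages, min_depths))

print(len(snp_sets))
for coverage, depth in snp_sets[:4]:
    print(f"{coverage}x coverage, minimum read depth {depth}")
```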
Affiliation(s)
- M. Michelle Malmberg
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
| | | | - Michelle C. Drayton
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
| | - Maiko Shinozuka
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
| | - Preeti Thakur
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
| | - Yvonne O. Ogaji
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
| | - German C. Spangenberg
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
| | - Hans D. Daetwyler
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
| | - Noel O. I. Cogan
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
| |
|
33
|
Paudel D, Kannan B, Yang X, Harris-Shultz K, Thudi M, Varshney RK, Altpeter F, Wang J. Surveying the genome and constructing a high-density genetic map of napiergrass (Cenchrus purpureus Schumach). Sci Rep 2018; 8:14419. [PMID: 30258215 PMCID: PMC6158254 DOI: 10.1038/s41598-018-32674-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/28/2018] [Accepted: 09/13/2018] [Indexed: 01/17/2023]
Abstract
Napiergrass (Cenchrus purpureus Schumach) is a tropical forage grass and a promising lignocellulosic biofuel feedstock due to its high biomass yield, persistence, and nutritive value. However, its utilization for breeding has lagged behind other crops due to limited genetic and genomic resources. In this study, next-generation sequencing was first used to survey the genome of napiergrass. Napiergrass sequences displayed high synteny to the pearl millet genome and showed expansions in the pearl millet genome along with genomic rearrangements between the two genomes. An average repeat content of 27.5% was observed in napiergrass including 5,339 simple sequence repeats (SSRs). Furthermore, to construct a high-density genetic map of napiergrass, genotyping-by-sequencing (GBS) was employed in a bi-parental population of 185 F1 hybrids. A total of 512 million high quality reads were generated and 287,093 SNPs were called by using multiple de-novo and reference-based SNP callers. Single dose SNPs were used to construct the first high-density linkage map that resulted in 1,913 SNPs mapped to 14 linkage groups, spanning a length of 1,410 cM and a density of 1 marker per 0.73 cM. This map can be used for many further genetic and genomic studies in napiergrass and related species.
Affiliation(s)
- Dev Paudel
- Agronomy Department, IFAS, University of Florida, Gainesville, FL, 32611, USA
| | - Baskaran Kannan
- Agronomy Department, IFAS, University of Florida, Gainesville, FL, 32611, USA
| | - Xiping Yang
- Agronomy Department, IFAS, University of Florida, Gainesville, FL, 32611, USA
| | - Karen Harris-Shultz
- Crop Genetics and Breeding Research Unit, USDA-Agricultural Research Service, 115 Coastal Way, Tifton, GA, 31793, USA
| | - Mahendar Thudi
- Center of Excellence in Genomics & Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, Telangana State, India
| | - Rajeev K Varshney
- Center of Excellence in Genomics & Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, 502324, Telangana State, India
| | - Fredy Altpeter
- Agronomy Department, IFAS, University of Florida, Gainesville, FL, 32611, USA.,Plant Molecular and Cellular Biology Program, Genetic Institute, University of Florida, Gainesville, FL, 32611, USA
| | - Jianping Wang
- Agronomy Department, IFAS, University of Florida, Gainesville, FL, 32611, USA.,Plant Molecular and Cellular Biology Program, Genetic Institute, University of Florida, Gainesville, FL, 32611, USA.,Center for Genomics and Biotechnology, Key Laboratory of Genetics, Breeding and Multiple Utilization of Crops, Ministry of Education, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
| |
|
34
|
Vo NS, Phan V. Leveraging known genomic variants to improve detection of variants, especially close-by Indels. Bioinformatics 2018; 34:2918-2926. [PMID: 29590294 DOI: 10.1093/bioinformatics/bty183] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/15/2017] [Accepted: 03/23/2018] [Indexed: 12/30/2022]
Abstract
Motivation The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However Indels, especially those close to each other, are still hard to detect accurately. Results We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our finding suggests that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost. 
Availability and implementation Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nam S Vo
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Vinhthuy Phan
- Department of Computer Science, The University of Memphis, Memphis, TN, USA
| |
|
35
|
Shringarpure SS, Mathias RA, Hernandez RD, O'Connor TD, Szpiech ZA, Torres R, De La Vega FM, Bustamante CD, Barnes KC, Taub MA. Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics 2018; 33:1147-1153. [PMID: 28035032 PMCID: PMC5408850 DOI: 10.1093/bioinformatics/btw786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/15/2016] [Accepted: 12/07/2016] [Indexed: 12/30/2022]
Abstract
Motivation Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics multisample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS sequencing data may be accompanied by genotype data for the same samples, either collected concurrent to sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. Availability and Implementation Code is available on Github at: https://github.com/suyashss/variant_validation. Contacts suyashs@stanford.edu or mtaub@jhsph.edu. Supplementary information Supplementary data are available at Bioinformatics online.
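The abstract reports true positive rates at a fixed 5% false positive rate. As a rough illustration of how such a figure can be read off classifier scores, the sketch below sweeps a score threshold from high to low and keeps the best true positive rate whose false positive rate stays within the limit. It is not the authors' code; the function name and the 0/1 label encoding are assumptions made for this example.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Best true positive rate achievable with false positive rate <= max_fpr.

    scores: classifier confidence per call (higher = more likely true).
    labels: 1 for a true variant call, 0 for a false one (assumed encoding).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false call")

    # Walk the calls from most to least confident, lowering the threshold.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Calls with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / negatives > max_fpr:
            break  # any lower threshold only adds more false positives
        best_tpr = tp / positives
        i = j
    return best_tpr
```

With array genotypes as truth labels, the same function could be applied to each call set in turn to compare callers at a common false positive rate.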
Affiliation(s)
- Suyash S Shringarpure
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Rasika A Mathias
- 23 and Me Inc, Mountain View, CA, USA.,Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Ryan D Hernandez
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA.,Department of Bioengineering and Therapeutic Sciences.,Institute for Human Genetics
| | - Timothy D O'Connor
- Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA.,Institute for Genome Sciences.,Program in Personalized and Genomic Medicine
| | - Zachary A Szpiech
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA
| | - Raul Torres
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Francisco M De La Vega
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Carlos D Bustamante
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Kathleen C Barnes
- 23 and Me Inc, Mountain View, CA, USA.,Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Margaret A Taub
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
| | | |
|
36
|
Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017; 18:183-194. [PMID: 26966283 DOI: 10.1093/bib/bbw011] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 10/28/2015] [Indexed: 12/30/2022]
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and for data previously subjected to lossy compression. We show that lossy compression can significantly alleviate the storage burden while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
Affiliation(s)
- Idoia Ochoa
- Electrical Engineering department, 350 Serra Mall, Stanford, CA, USA
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Rachel Goldfeder
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Euan Ashley
- Department of Medicine, Stanford University, Stanford, CA, USA.,Stanford Center for Inherited Cardiovascular Disease, Stanford University, Stanford, CA, USA.,Department of Genetics, Stanford University, Stanford, CA, USA
| |
|
37
|
Jasrotia RS, Iquebal MA, Yadav PK, Kumar N, Jaiswal S, Angadi UB, Rai A, Kumar D. Development of transcriptome based web genomic resources of yellow mosaic disease in Vigna mungo. PHYSIOLOGY AND MOLECULAR BIOLOGY OF PLANTS : AN INTERNATIONAL JOURNAL OF FUNCTIONAL PLANT BIOLOGY 2017; 23:767-777. [PMID: 29158627 PMCID: PMC5671452 DOI: 10.1007/s12298-017-0470-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 03/24/2017] [Revised: 09/06/2017] [Accepted: 09/11/2017] [Indexed: 05/27/2023]
Abstract
Vigna mungo (Urdbean) is cultivated in the tropical and sub-tropical continental region of Asia. It is not only an important source of dietary protein and nutritional elements, but also of immense value to human health due to its medicinal properties. Yellow mosaic disease, caused by Mungbean Yellow Mosaic India Virus, is known to cause heavy crop losses, adversely affecting yield. Contrasting genotypes are an ideal source for knowledge discovery of plant defence mechanisms and associated candidate genes for varietal improvement. The whole genome sequence of this crop is yet to be completed. Moreover, genomic resources are not freely accessible, so available transcriptome data can be of immense use. The V. mungo Transcriptome database, accessible at http://webtom.cabgrid.res.in/vmtdb/, has been developed using available data of two contrasting varieties viz., cv. VM84 (resistant) and cv. T9 (susceptible). De novo assembly was carried out using Trinity and CAP3. Out of a total of 240,945 unigenes, 165,894 (68.8%) showed similarity with known genes in the NR database, and the remaining 31.2% were found to be novel. We found 22,101 differentially expressed genes in all datasets, 44,335 putative genic SSR markers, 4105 SNPs and Indels, 64,964 transcription factors, 546 mature miRNA target predictions in 703 differentially expressed unigenes and 137 pathways. MAPK, salicylic acid-binding protein 2-like, pathogenesis-related protein and NBS-LRR domain were found, which may play an important role in defence against pathogens. This is the first web genomic resource of V. mungo for future genome annotation, and it provides ready to use markers for future variety improvement programs.
Affiliation(s)
- Rahul Singh Jasrotia
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Department of Computational Biology & Bioinformatics, Sam Higginbottom University of Agriculture, Technology & Sciences (SHUATS), Allahabad, 211007 India
| | - Mir Asif Iquebal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Pramod Kumar Yadav
- Department of Computational Biology & Bioinformatics, Sam Higginbottom University of Agriculture, Technology & Sciences (SHUATS), Allahabad, 211007 India
| | - Neeraj Kumar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Sarika Jaiswal
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - U. B. Angadi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Anil Rai
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Dinesh Kumar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| |
|
38
|
Whiston R, Finlay EK, McCabe MS, Cormican P, Flynn P, Cromie A, Hansen PJ, Lyons A, Fair S, Lonergan P, O' Farrelly C, Meade KG. A dual targeted β-defensin and exome sequencing approach to identify, validate and functionally characterise genes associated with bull fertility. Sci Rep 2017; 7:12287. [PMID: 28947819 PMCID: PMC5613009 DOI: 10.1038/s41598-017-12498-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 05/30/2017] [Accepted: 09/11/2017] [Indexed: 12/30/2022]
Abstract
Bovine fertility remains a critical issue underpinning the sustainability of the agricultural sector. Phenotypic records collected on >7,000 bulls used in artificial insemination (AI) were used to identify 160 reliable and divergently fertile bulls for a dual strategy of targeted sequencing (TS) of fertility-related β-defensin genes and whole exome sequencing (WES). A haplotype spanning multiple β-defensin genes and containing 94 SNPs was significantly associated with fertility and functional analysis confirmed that sperm from bulls possessing the haplotype showed significantly enhanced binding to oviductal epithelium. WES of all exons in the genome in 24 bulls of high and low fertility identified 484 additional SNPs significantly associated with fertility. After validation, the most significantly associated SNP was located in the FOXJ3 gene, a transcription factor which regulates sperm function in mice. This study represents the first comprehensive characterisation of genetic variation in bovine β-defensin genes and functional analysis supports a role for β-defensins in regulating bull sperm function. This first application of WES in AI bulls with divergent fertility phenotypes has identified a novel role for the transcription factor FOXJ3 in the regulation of bull fertility. Validated genetic variants associated with bull fertility could prove useful for improving reproductive outcomes in cattle.
Affiliation(s)
- Ronan Whiston
- Animal & Bioscience Research Department, Animal & Grassland Research and Innovation Centre, Teagasc, Grange, Co. Meath, Ireland
| | - Emma K Finlay
- Animal & Bioscience Research Department, Animal & Grassland Research and Innovation Centre, Teagasc, Grange, Co. Meath, Ireland
| | - Matthew S McCabe
- Animal & Bioscience Research Department, Animal & Grassland Research and Innovation Centre, Teagasc, Grange, Co. Meath, Ireland
| | - Paul Cormican
- Animal & Bioscience Research Department, Animal & Grassland Research and Innovation Centre, Teagasc, Grange, Co. Meath, Ireland
| | - Paul Flynn
- Weatherbys Scientific, Johnstown, Naas, Co Kildare, Ireland
| | - Andrew Cromie
- Irish Cattle Breeding Federation, Bandon, Co. Cork, Ireland
| | - Peter J Hansen
- Department of Animal Sciences, University of Florida, Gainesville, Florida, USA
| | - Alan Lyons
- Department of Biological Sciences, University of Limerick, Limerick, Ireland
| | - Sean Fair
- Department of Biological Sciences, University of Limerick, Limerick, Ireland
| | - Patrick Lonergan
- School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
| | - Cliona O' Farrelly
- Trinity Biomedical Sciences Institute, Trinity College, Dublin 2, Ireland
| | - Kieran G Meade
- Animal & Bioscience Research Department, Animal & Grassland Research and Innovation Centre, Teagasc, Grange, Co. Meath, Ireland.
| |
|
39
|
Yang X, Song J, You Q, Paudel DR, Zhang J, Wang J. Mining sequence variations in representative polyploid sugarcane germplasm accessions. BMC Genomics 2017; 18:594. [PMID: 28793856 PMCID: PMC5551020 DOI: 10.1186/s12864-017-3980-3] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 08/01/2016] [Accepted: 08/01/2017] [Indexed: 11/10/2022]
Abstract
Background Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS) was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and genome complexity reduction using restriction enzymes. Results To use GBS for high throughput genotyping of highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variant callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, the non-reference based universal network enabled analysis kit and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA. 
Conclusions The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.
Affiliation(s)
- Xiping Yang
- Department of Agronomy, University of Florida, Gainesville, FL, 32610, USA
| | - Jian Song
- Department of Agronomy, University of Florida, Gainesville, FL, 32610, USA
| | - Qian You
- Department of Agronomy, University of Florida, Gainesville, FL, 32610, USA
| | - Dev R Paudel
- Department of Agronomy, University of Florida, Gainesville, FL, 32610, USA
| | - Jisen Zhang
- FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
| | - Jianping Wang
- Department of Agronomy, University of Florida, Gainesville, FL, 32610, USA.,FAFU and UIUC-SIB Joint Center for Genomics and Biotechnology, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China.,Genetics Institute, Plant Molecular and Biology Program, University of Florida, Gainesville, FL, 32610, USA
| |
|
40
|
|
41
|
MetaGaAP: A Novel Pipeline to Estimate Community Composition and Abundance from Non-Model Sequence Data. BIOLOGY 2017; 6:biology6010014. [PMID: 28218638 PMCID: PMC5372007 DOI: 10.3390/biology6010014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/01/2016] [Revised: 01/06/2017] [Accepted: 02/07/2017] [Indexed: 02/01/2023]
Abstract
Next generation sequencing and bioinformatic approaches are increasingly used to quantify microorganisms within populations by analysis of ‘meta-barcode’ data. This approach relies on comparison of amplicon sequences of ‘barcode’ regions from a population with public-domain databases of reference sequences. However, for many organisms relevant ‘barcode’ regions may not have been identified and large databases of reference sequences may not be available. A workflow and software pipeline, ‘MetaGaAP,’ was developed to identify and quantify genotypes through four steps: shotgun sequencing and identification of polymorphisms in a metapopulation to identify custom ‘barcode’ regions of less than 30 polymorphisms within the span of a single ‘read’, amplification and sequencing of the ‘barcode’, generation of a custom database of polymorphisms, and quantitation of the relative abundance of genotypes. The pipeline and workflow were validated in a ‘wild type’ Alphabaculovirus isolate, Helicoverpa armigera single nucleopolyhedrovirus (HaSNPV-AC53) and a tissue-culture derived strain (HaSNPV-AC53-T2). The approach was validated by comparison of polymorphisms in amplicons and shotgun data, and by comparison of predicted dominant and co-dominant genotypes with Sanger sequences. The computational power required to generate and search the database effectively limits the number of polymorphisms that can be included in a barcode to 30 or less. The approach can be used in quantitative analysis of the ecology and pathology of non-model organisms.
|
42
|
Hofmann AL, Behr J, Singer J, Kuipers J, Beisel C, Schraml P, Moch H, Beerenwinkel N. Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics 2017; 18:8. [PMID: 28049408 PMCID: PMC5209852 DOI: 10.1186/s12859-016-1417-7] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 07/13/2016] [Accepted: 12/10/2016] [Indexed: 12/30/2022]
Abstract
Background Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant. Results Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision. Conclusions The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. 
We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1417-7) contains supplementary material, which is available to authorized users.
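The abstract mentions an integration strategy that combines the ranks that several callers assign to each candidate variant. The exact aggregation rule is not given here, so the following is only a minimal sketch that assumes a plain mean of per-caller ranks, with a variant missing from a caller's output given the worst possible rank. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def combine_ranks(caller_scores):
    """Order variants by mean rank across callers (lower mean rank = better).

    caller_scores: {caller name: {variant id: confidence score}}, where a
    higher score means the caller is more confident in the variant.
    ASSUMPTION: mean-rank aggregation; the paper's rule may differ.
    """
    all_variants = set()
    for scores in caller_scores.values():
        all_variants.update(scores)

    worst_rank = len(all_variants)  # penalty for variants a caller missed
    rank_sums = defaultdict(float)
    for scores in caller_scores.values():
        # Rank 1 is the caller's most confident call; ties break by id.
        ordered = sorted(scores, key=lambda v: (-scores[v], v))
        ranks = {v: i + 1 for i, v in enumerate(ordered)}
        for variant in all_variants:
            rank_sums[variant] += ranks.get(variant, worst_rank)

    n_callers = len(caller_scores)
    return sorted((rank_sums[v] / n_callers, v) for v in all_variants)
```

Cutting the returned list at a chosen mean rank yields a combined call set, and moving that cut-off is what trades sensitivity against precision.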
Affiliation(s)
- Ariane L Hofmann
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland.,Swiss Institute of Bioinformatics, Mattenstr, Basel, 26, 4058, Switzerland
| | - Jonas Behr
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland.,Swiss Institute of Bioinformatics, Mattenstr, Basel, 26, 4058, Switzerland
| | - Jochen Singer
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland.,Swiss Institute of Bioinformatics, Mattenstr, Basel, 26, 4058, Switzerland
| | - Jack Kuipers
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland.,Swiss Institute of Bioinformatics, Mattenstr, Basel, 26, 4058, Switzerland
| | - Christian Beisel
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland
| | - Peter Schraml
- Institute for Surgical Pathology, University Hospital Zurich, Schmelzbergstrasse 12, Zurich, 8091, Switzerland
| | - Holger Moch
- Institute for Surgical Pathology, University Hospital Zurich, Schmelzbergstrasse 12, Zurich, 8091, Switzerland
| | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstr, Basel, 26, 4058, Switzerland.,Swiss Institute of Bioinformatics, Mattenstr, Basel, 26, 4058, Switzerland
| |
|
43
|
Yu LX, Zheng P, Bhamidimarri S, Liu XP, Main D. The Impact of Genotyping-by-Sequencing Pipelines on SNP Discovery and Identification of Markers Associated with Verticillium Wilt Resistance in Autotetraploid Alfalfa (Medicago sativa L.). FRONTIERS IN PLANT SCIENCE 2017; 8:89. [PMID: 28223988 PMCID: PMC5293825 DOI: 10.3389/fpls.2017.00089] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 09/20/2016] [Accepted: 01/16/2017] [Indexed: 05/08/2023]
Abstract
Verticillium wilt (VW) of alfalfa is a soilborne disease causing severe yield loss in alfalfa. To identify molecular markers associated with VW resistance, we used an integrated framework of genome-wide association study (GWAS) with high-throughput genotyping by sequencing (GBS) to identify loci associated with VW resistance in an F1 full-sib alfalfa population. Phenotyping was performed using manual inoculation of the pathogen to cloned plants of each individual, and disease severity was scored using a standard scale. Genotyping was done by GBS, followed by genotype calling using three bioinformatics pipelines: the TASSEL-GBS pipeline (TASSEL), the Universal Network Enabled Analysis Kit (UNEAK), and the haplotype-based FreeBayes pipeline (FreeBayes). The resulting numbers of SNPs, marker density, minor allele frequency (MAF) and heterozygosity were compared among the pipelines. The TASSEL pipeline generated more markers with the highest density and MAF, whereas the highest heterozygosity was obtained by the UNEAK pipeline. The FreeBayes pipeline generated tetraploid genotypes, with the fewest markers. SNP markers generated from each pipeline were used independently for marker-trait association. Markers significantly associated with VW resistance identified by each pipeline were compared. Similar marker loci were found on chromosomes 5, 6, and 7, whereas different loci on chromosomes 1, 2, 3, and 4 were identified by different pipelines. Most significant markers were located on chromosome 6 and they were identified by all three pipelines. Of those identified, several loci were linked to known genes whose functions are involved in the plants' resistance to pathogens. Further investigation of these loci and their linked genes would provide insight into the molecular mechanisms of VW resistance in alfalfa. Functional markers closely linked to the resistance loci would be useful for marker-assisted selection (MAS) to improve alfalfa cultivars with enhanced resistance to the disease.
Affiliation(s)
- Long-Xi Yu
- Plant Germplasm Introduction and Testing Research, United States Department of Agriculture-Agricultural Research Service, Prosser, WA, USA
- *Correspondence: Long-Xi Yu,
- Ping Zheng
- Department of Horticulture, Washington State University, Pullman, WA, USA
- Xiang-Ping Liu
- Plant Germplasm Introduction and Testing Research, United States Department of Agriculture-Agricultural Research Service, Prosser, WA, USA
- Dorie Main
- Department of Horticulture, Washington State University, Pullman, WA, USA
|
44
|
Brumme CJ, Poon AFY. Promises and pitfalls of Illumina sequencing for HIV resistance genotyping. Virus Res 2016; 239:97-105. [PMID: 27993623 DOI: 10.1016/j.virusres.2016.12.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 09/20/2016] [Revised: 12/15/2016] [Accepted: 12/15/2016] [Indexed: 12/13/2022]
Abstract
Genetic sequencing ("genotyping") plays a critical role in the modern clinical management of HIV infection. This virus evolves rapidly within patients because of its error-prone reverse transcriptase and short generation time. Consequently, HIV variants with mutations that confer resistance to one or more antiretroviral drugs can emerge during sub-optimal treatment. There are now multiple HIV drug resistance interpretation algorithms that take the region of the HIV genome encoding the major drug targets as inputs; expert use of these algorithms can significantly improve clinical outcomes in HIV treatment. Next-generation sequencing has the potential to revolutionize HIV resistance genotyping by lowering the threshold at which rare but clinically significant HIV variants can be detected reproducibly, and by conferring improved cost-effectiveness in high-throughput scenarios. In this review, we discuss the relative merits and challenges of deploying the Illumina MiSeq instrument for clinical HIV genotyping.
Affiliation(s)
- Chanson J Brumme
- BC Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada
- Art F Y Poon
- Department of Pathology & Laboratory Medicine, Western University, London, Ontario, Canada
|
45
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
46
|
Tian S, Yan H, Kalmbach M, Slager SL. Impact of post-alignment processing in variant discovery from whole exome data. BMC Bioinformatics 2016; 17:403. [PMID: 27716037 PMCID: PMC5048557 DOI: 10.1186/s12859-016-1279-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 05/02/2016] [Accepted: 09/26/2016] [Indexed: 01/11/2023] Open
Abstract
Background GATK Best Practices workflows are widely used in large-scale sequencing projects and recommend post-alignment processing before variant calling. Two key post-processing steps include the computationally intensive local realignment around known INDELs and base quality score recalibration (BQSR). Both have been shown to reduce erroneous calls; however, the findings are mainly supported by the analytical pipeline that incorporates BWA and GATK UnifiedGenotyper. It is not known whether there is any benefit of post-processing and to what extent the benefit might be for pipelines implementing other methods, especially given that both mappers and callers are typically updated. Moreover, because sequencing platforms are upgraded regularly and the new platforms provide better estimations of read quality scores, the need for post-processing is also unknown. Finally, some regions in the human genome show high sequence divergence from the reference genome; it is unclear whether there is benefit from post-processing in these regions. Results We used both simulated and NA12878 exome data to comprehensively assess the impact of post-processing for five or six popular mappers together with five callers. Focusing on chromosome 6p21.3, which is a region of high sequence divergence harboring the human leukocyte antigen (HLA) system, we found that local realignment had little or no impact on SNP calling, but increased sensitivity was observed in INDEL calling for the Stampy + GATK UnifiedGenotyper pipeline. No or only a modest effect of local realignment was detected on the three haplotype-based callers and no evidence of effect on Novoalign. BQSR had virtually negligible effect on INDEL calling and generally reduced sensitivity for SNP calling that depended on caller, coverage and level of divergence. 
Specifically, for SAMtools and FreeBayes calling in the regions with low divergence, BQSR reduced the SNP calling sensitivity but improved the precision when the coverage was insufficient. However, in regions of high divergence (e.g., the HLA region), BQSR reduced the sensitivity of both callers with little gain in precision rate. For the other three callers, BQSR reduced the sensitivity without increasing the precision rate regardless of coverage and divergence level. Conclusions We demonstrated that the gain from post-processing is not universal; rather, it depends on mapper and caller combination, and the benefit is influenced further by sequencing depth and divergence level. Our analysis highlights the importance of considering these key factors when deciding whether to apply the computationally intensive post-processing to Illumina exome data.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Michael Kalmbach
- Division of Research and Education Support Systems, Department of Information Technology, Mayo Clinic, Rochester, MN, 55905, USA
- Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
|
47
|
Jakaitiene A, Avino M, Guarracino MR. Beta-Binomial Model for the Detection of Rare Mutations in Pooled Next-Generation Sequencing Experiments. J Comput Biol 2016; 24:357-367. [PMID: 27632638 DOI: 10.1089/cmb.2016.0106] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022] Open
Abstract
Despite diminishing costs, next-generation sequencing (NGS) remains expensive for studies with a large number of individuals. To save costs, pools containing multiple samples can be sequenced together. Many software tools are currently available for the detection of single-nucleotide polymorphisms (SNPs). Their sensitivity and specificity depend on the model used and the data analyzed, indicating that all of them have room for improvement. We use a beta-binomial model to detect rare mutations in untagged pooled NGS experiments. We propose a multireference framework for pooled data that can be specific for up to two patients affected by neuromuscular disorders (NMD). We assessed the results by comparison with the Genome Analysis Toolkit (GATK), CRISP, SNVer, and FreeBayes. Our results show that the multireference approach applying the beta-binomial model is accurate in predicting rare mutations at a fraction of 0.01. Finally, we explored the concordance of mutations between the model and the software tools, checking their involvement in any NMD-related gene. We detected seven novel SNPs, for which the functional analysis produced enriched terms related to locomotion and musculature.
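The general idea of flagging a rare variant with a beta-binomial model can be sketched as follows. This is an illustration only and not the authors' multireference method: it assumes the number of alternate-allele reads at a site follows a beta-binomial distribution under sequencing error alone, and it flags the site when the observed count is improbably high. The function names, the shape parameters `a` and `b`, and the threshold `alpha` are hypothetical and would need to be estimated or chosen for real data.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(K = k) for K ~ BetaBinomial(n, a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def betabinom_sf(k, n, a, b):
    """P(K >= k): probability of at least k alternate reads out of n."""
    return sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1))


def is_candidate_variant(alt_reads, depth, a, b, alpha=1e-6):
    """Flag a site when the error-only model rarely yields this many alt reads.

    a, b  : hypothetical shape parameters of the per-site error distribution
    alpha : hypothetical significance threshold
    """
    return betabinom_sf(alt_reads, depth, a, b) < alpha
```

With `a = b = 1` the distribution reduces to a uniform distribution over 0..n, which gives a quick sanity check of the probability mass function.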
Affiliation(s)
- Audrone Jakaitiene
- Bioinformatics and Biostatistics Center, Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University, Vilnius, Lithuania
- Mariano Avino
- High Performance Computing and Networking Institute, National Research Council, Naples, Italy
- Mario Rosario Guarracino
- High Performance Computing and Networking Institute, National Research Council, Naples, Italy
|
48
|
Huang Z, Rustagi N, Veeraraghavan N, Carroll A, Gibbs R, Boerwinkle E, Venkata MG, Yu F. A hybrid computational strategy to address WGS variant analysis in >5000 samples. BMC Bioinformatics 2016; 17:361. [PMID: 27612449 PMCID: PMC5018196 DOI: 10.1186/s12859-016-1211-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/24/2016] [Accepted: 08/25/2016] [Indexed: 11/22/2022] Open
Abstract
Background The decreasing costs of sequencing are driving the need for cost-effective and real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the typical computing resources available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling. Results We present a high-throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages a hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms.
Conclusions Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.
Affiliation(s)
- Zhuoyi Huang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Navin Rustagi
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Richard Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA; Human Genetics Center, University of Texas Health Science Center, Houston, TX, USA
- Fuli Yu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
|
49
|
Tian S, Yan H, Neuhauser C, Slager SL. An analytical workflow for accurate variant discovery in highly divergent regions. BMC Genomics 2016; 17:703. [PMID: 27590916 PMCID: PMC5010666 DOI: 10.1186/s12864-016-3045-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 05/18/2016] [Accepted: 08/25/2016] [Indexed: 02/07/2023] Open
Abstract
Background Current variant discovery methods often start with the mapping of short reads to a reference genome; yet, their performance deteriorates in genomic regions where the reads are highly divergent from the reference sequence. This is particularly problematic for the human leukocyte antigen (HLA) region on chromosome 6p21.3. This region is associated with over 100 diseases, but variant calling is hindered by the extreme divergence across different haplotypes. Results We simulated reads from chromosome 6 exonic regions over a wide range of sequence divergence and coverage depth. We systematically assessed combinations between five mappers and five callers for their performance on simulated data and exome-seq data from NA12878, a well-studied individual in which multiple public call sets have been generated. Among those combinations, the number of known SNPs differed by about 5 % in the non-HLA regions of chromosome 6 but over 20 % in the HLA region. Notably, GSNAP mapping combined with GATK UnifiedGenotyper calling identified about 20 % more known SNPs than most existing methods without a noticeable loss of specificity, with 100 % sensitivity in three highly polymorphic HLA genes examined. Much larger differences were observed among these combinations in INDEL calling from both non-HLA and HLA regions. We obtained similar results with our internal exome-seq data from a cohort of chronic lymphocytic leukemia patients. Conclusions We have established a workflow enabling variant detection, with high sensitivity and specificity, over the full spectrum of divergence seen in the human genome. Comparing to public call sets from NA12878 has highlighted the overall superiority of GATK UnifiedGenotyper, followed by GATK HaplotypeCaller and SAMtools, in SNP calling, and of GATK HaplotypeCaller and Platypus in INDEL calling, particularly in regions of high sequence divergence such as the HLA region. 
GSNAP and Novoalign are the ideal mappers in combination with the above callers. We expect that the proposed workflow should be applicable to variant discovery in other highly divergent regions.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Claudia Neuhauser
- Informatics Institute, University of Minnesota, Minneapolis, MN, 55455, USA
- Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
|
50
|
Hou D, Chen C, Seely EJ, Chen S, Song Y. High-Throughput Sequencing-Based Immune Repertoire Study during Infectious Disease. Front Immunol 2016; 7:336. [PMID: 27630639 PMCID: PMC5005336 DOI: 10.3389/fimmu.2016.00336] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 01/06/2016] [Accepted: 08/19/2016] [Indexed: 11/13/2022] Open
Abstract
The selectivity of the adaptive immune response is based on the enormous diversity of T and B cell antigen-specific receptors. The immune repertoire, the collection of T and B cells with functional diversity in the circulatory system at any given time, is dynamic and reflects the essence of immune selectivity. In this article, we review the recent advances in immune repertoire study of infectious diseases, which were achieved by traditional techniques and high-throughput sequencing (HTS) techniques. HTS techniques enable the determination of complementary regions of lymphocyte receptors with unprecedented efficiency and scale. This progress in methodology enhances the understanding of immunologic changes during pathogen challenge and also provides a basis for further development of novel diagnostic markers, immunotherapies, and vaccines.
Affiliation(s)
- Dongni Hou
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Cuicui Chen
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Eric John Seely
- Department of Medicine, Division of Pulmonary and Critical Care Medicine, University of California San Francisco, San Francisco, CA, USA
- Shujing Chen
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
- Yuanlin Song
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai, China
|