1
|
Clouard C, Ausmees K, Nettelblad C. A joint use of pooling and imputation for genotyping SNPs. BMC Bioinformatics 2022; 23:421. [PMID: 36229780 PMCID: PMC9563787 DOI: 10.1186/s12859-022-04974-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/01/2021] [Accepted: 09/29/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND Despite continuing technological advances, the cost of large-scale genotyping of a large number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, for example in wildlife monitoring and in plant and animal breeding, but it is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare imputation performance on pooled data between the Beagle algorithm and a local likelihood-aware phasing algorithm, closely modeled on MaCH, that we implemented. RESULTS We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, and thus outperforms the Beagle imputation model, which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts. CONCLUSIONS We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.
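The idea of testing fewer pools than samples and then inferring per-sample genotypes can be illustrated with a small sketch. The row-and-column block layout, the function names, and the three-valued pool readout below are assumptions made for illustration; the abstract does not specify the pooling design, and this is not the authors' estimation algorithm. Genotypes are coded 0 (homozygous reference), 1 (heterozygous), and 2 (homozygous alternate).

```python
from itertools import product


def pool_outcome(genotypes):
    """Collapse the genotypes mixed in one pool into a chip-like readout:
    0 if only the reference allele is present, 2 if only the alternate
    allele is present, 1 if both alleles are present."""
    if all(g == 0 for g in genotypes):
        return 0
    if all(g == 2 for g in genotypes):
        return 2
    return 1


def pool_block(block):
    """Pool an n x n block of samples by rows and by columns (assumed layout),
    so that n * n samples are tested with 2 * n pools."""
    rows = [pool_outcome(r) for r in block]
    cols = [pool_outcome(c) for c in zip(*block)]
    return rows, cols


def decode_block(rows, cols):
    """Infer per-sample genotypes from pool readouts at one marker.
    A sample in a pool that reads 0 (or 2) must itself be 0 (or 2);
    a sample whose row and column pools both read 1 stays ambiguous (None)."""
    n = len(rows)
    decoded = [[None] * n for _ in range(n)]
    for i, j in product(range(n), range(n)):
        for outcome in (rows[i], cols[j]):
            if outcome in (0, 2):
                decoded[i][j] = outcome
    return decoded


# 16 samples, one heterozygous carrier, tested with 8 pools.
block = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rows, cols = pool_block(block)
decoded = decode_block(rows, cols)
```

In this toy block, every sample except the carrier is resolved directly; the carrier's genotype (1 or 2) stays undetermined, which is the kind of gap that a probabilistic estimate and subsequent imputation would have to fill.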
Affiliation(s)
- Camille Clouard
- Division of Scientific Computing, Department of Information Technology, Uppsala University, Lägerhyddsvägen 1, hus 10, 75237 Uppsala, Sweden
- Kristiina Ausmees
- Division of Scientific Computing, Department of Information Technology, Uppsala University, Lägerhyddsvägen 1, hus 10, 75237 Uppsala, Sweden
- Carl Nettelblad
- Division of Scientific Computing, Department of Information Technology, Uppsala University, Lägerhyddsvägen 1, hus 10, 75237 Uppsala, Sweden
|
2
|
seekCRIT: Detecting and characterizing differentially expressed circular RNAs using high-throughput sequencing data. PLoS Comput Biol 2020; 16:e1008338. [PMID: 33079938 PMCID: PMC7598922 DOI: 10.1371/journal.pcbi.1008338] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/23/2019] [Revised: 10/30/2020] [Accepted: 09/13/2020] [Indexed: 11/19/2022]
Abstract
Over the past two decades, researchers have discovered a special form of alternative splicing that produces a circular form of RNA. Although these circular RNAs (circRNAs) have garnered considerable attention in the scientific community for their biogenesis and functions, the focus of current studies has been on the tissue-specific circRNAs that exist only in one tissue but not in other tissues or on the disease-specific circRNAs that exist in certain disease conditions, such as cancer, but not under normal conditions. This approach was conducted in the relative absence of methods that analyze a group of common circRNAs that exist in both conditions, but are more abundant in one condition relative to another (differentially expressed). Studies of differentially expressed circRNAs (DECs) between two conditions would serve as a significant first step in filling this void. Here, we introduce a novel computational tool, seekCRIT (seek for differentially expressed CircRNAs In Transcriptome), that identifies the DECs between two conditions from high-throughput sequencing data. Using rat retina RNA-seq data from ischemic and normal conditions, we show that over 74% of identifiable circRNAs are expressed in both conditions and over 40 circRNAs are differentially expressed between two conditions. We also obtain a high qPCR validation rate of 90% for DECs with an FDR of < 5%. Our results demonstrate that seekCRIT is a novel and efficient approach to detect DECs using rRNA-depleted RNA-seq data. seekCRIT is freely downloadable at https://github.com/UofLBioinformatics/seekCRIT. The source code is licensed under the MIT License. seekCRIT is developed and tested on Linux CentOS-7.
The focus of circRNA studies has been on condition-specific circRNAs; however, there are situations in which circRNAs exist in both conditions with different abundance. Here, we introduce a new and robust analytic software, seekCRIT (seek for differentially expressed CircRNAs In Transcriptome), that identifies the differentially expressed circRNAs (DECs) between two conditions from high-throughput sequencing data. seekCRIT provides a straightforward normalized quantification of circRNAs and statistical measures by adapting a junction-count-based estimation approach. Using publicly available ribosomal RNA-depleted RNA-seq data and our own rat retina RNA-seq data, we show that seekCRIT can efficiently detect circRNAs and identify DECs. We also obtain a high qPCR validation rate of 90% for DECs with an FDR of < 5%. Our results demonstrate that seekCRIT is a novel and efficient software to detect DECs using rRNA-depleted RNA-seq data.
|
3
|
Zhernakov AI, Afonin AM, Gavriliuk ND, Moiseeva OM, Zhukov VA. s-dePooler: determination of polymorphism carriers from overlapping DNA pools. BMC Bioinformatics 2019; 20:45. [PMID: 30669964 PMCID: PMC6343301 DOI: 10.1186/s12859-019-2616-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/08/2018] [Accepted: 01/09/2019] [Indexed: 11/26/2022]
Abstract
Background Sample pooling is a method widely used in studies to reduce costs and labour. DNA sample pooling combined with massively parallel sequencing is a powerful tool for discovering DNA variants (polymorphisms) when analysing large populations, which is the basis of research fields such as Genome-Wide Association Studies and evolutionary and population studies. Using overlapping pools, where each sample is present in multiple pools, can enhance the accuracy of polymorphism detection and allow carriers of rare variants to be identified. Surprisingly, there is a lack of tools for result interpretation and carrier identification, i.e. for “depooling”. Results Here we present s-dePooler, an application for the analysis of data from pooling experiments. s-dePooler uses the variant information (VCF file) and the pooling scheme to produce a list of candidate carriers for each polymorphism. We incorporated s-dePooler into a pipeline (dePoP) for automation of pooling analysis. The performance of the pipeline was tested on a synthetic dataset built using the 1000 Genomes Project data, resulting in the successful identification of 97% of carriers of polymorphisms present in fewer than ~10% of carriers. Conclusions s-dePooler along with dePoP can be used to identify carriers of polymorphisms in overlapping pools, and is compatible with any pooling scheme with equivalent molar ratios of pooled samples. s-dePooler and dePoP with usage instructions and test data are freely available at https://github.com/lab9arriam/depop.
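As a rough illustration of what "depooling" with overlapping pools involves, the sketch below lists candidate carriers from a pooling scheme and the set of pools in which a variant was called. The function name, the dictionary layout, and the rule used (a sample is a candidate only if every pool containing it is positive) are assumptions for illustration and are not taken from the s-dePooler source code.

```python
def candidate_carriers(scheme, positive_pools):
    """Return the samples that could carry a variant.

    scheme: dict mapping sample name -> collection of pool names it was added to
            (hypothetical layout).
    positive_pools: pools in which the variant was detected.

    A carrier contributes the variant to every pool it is in, so a sample is
    kept only when all of its pools are positive.
    """
    positive = set(positive_pools)
    return sorted(s for s, pools in scheme.items() if set(pools) <= positive)


# Four samples, each placed in two of four pools (toy overlapping design).
scheme = {
    "S1": {"A", "B"},
    "S2": {"A", "C"},
    "S3": {"B", "C"},
    "S4": {"C", "D"},
}
unique = candidate_carriers(scheme, {"A", "B"})          # ['S1']
ambiguous = candidate_carriers(scheme, {"A", "B", "C"})  # ['S1', 'S2', 'S3']
```

The second call shows why the output is a list of candidates rather than a definitive answer: when several carriers are present, the positive pools may be consistent with more samples than actually carry the variant.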
Affiliation(s)
- Aleksandr Igorevich Zhernakov
- Research Department of Non-Coronary Heart Diseases, Almazov National Medical Research Center, Ministry of Health of Russia, 2 Akkuratova St., St. Petersburg, 197341, Russia; All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 3 Podbelsky Ch., St. Petersburg - Pushkin, 196608, Russia
- Alexey Mikhailovich Afonin
- All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 3 Podbelsky Ch., St. Petersburg - Pushkin, 196608, Russia
- Natalia Dmitrievna Gavriliuk
- Research Department of Non-Coronary Heart Diseases, Almazov National Medical Research Center, Ministry of Health of Russia, 2 Akkuratova St., St. Petersburg, 197341, Russia
- Olga Mikhailovna Moiseeva
- Research Department of Non-Coronary Heart Diseases, Almazov National Medical Research Center, Ministry of Health of Russia, 2 Akkuratova St., St. Petersburg, 197341, Russia
- Vladimir Aleksandrovich Zhukov
- Research Department of Non-Coronary Heart Diseases, Almazov National Medical Research Center, Ministry of Health of Russia, 2 Akkuratova St., St. Petersburg, 197341, Russia; All-Russia Research Institute for Agricultural Microbiology (ARRIAM), 3 Podbelsky Ch., St. Petersburg - Pushkin, 196608, Russia
|
4
|
Zhang Q, Guldbrandtsen B, Calus MPL, Lund MS, Sahana G. Comparison of gene-based rare variant association mapping methods for quantitative traits in a bovine population with complex familial relationships. Genet Sel Evol 2016; 48:60. [PMID: 27534618 PMCID: PMC4989328 DOI: 10.1186/s12711-016-0238-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/08/2016] [Accepted: 08/04/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND There is growing interest in the role of rare variants in the variation of complex traits due to increasing evidence that rare variants are associated with quantitative traits. However, association methods that are commonly used for mapping common variants are not effective for mapping rare variants. In addition, livestock populations have large half-sib families and the occurrence of rare variants may be confounded with family structure, which makes it difficult to disentangle their effects from family mean effects. We compared the power of methods that are commonly applied in human genetics to map rare variants in cattle using whole-genome sequence data and simulated phenotypes. We also studied the power of mapping rare variants using linear mixed models (LMM), which are the method of choice to account for both family relationships and population structure in cattle. RESULTS We observed that the power of the LMM approach was low for mapping a rare variant (defined as a variant with a frequency lower than 0.01) with a moderate effect (5 to 8 % of phenotypic variance explained by multiple rare variants that vary from 5 to 21 in number) contributing to a QTL with a sample size of 1000. In contrast, across the scenarios studied, statistical methods that are specialized for mapping rare variants increased power regardless of whether multiple rare variants or a single rare variant underlie a QTL. Different methods for combining rare variants in the test single nucleotide polymorphism set resulted in similar power irrespective of the proportion of total genetic variance explained by the QTL. However, when the QTL variance is very small (only 0.1 % of the total genetic variance), these specialized methods for mapping rare variants and LMM generally had no power to map the variants within a gene with sample sizes of 1000 or 5000.
CONCLUSIONS We observed that the methods that combine multiple rare variants within a gene into a meta-variant generally had greater power to map rare variants compared to LMM. Therefore, it is recommended to use rare variant association mapping methods to map rare genetic variants that affect quantitative traits in livestock, such as bovine populations.
Affiliation(s)
- Qianqian Zhang
- Department of Molecular Biology and Genetics, Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, 8830, Denmark; Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Wageningen, The Netherlands
- Bernt Guldbrandtsen
- Department of Molecular Biology and Genetics, Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, 8830, Denmark
- Mario P L Calus
- Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Wageningen, The Netherlands
- Mogens Sandø Lund
- Department of Molecular Biology and Genetics, Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, 8830, Denmark
- Goutam Sahana
- Department of Molecular Biology and Genetics, Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, 8830, Denmark
|
5
|
Li C, Cao C, Tu J, Sun X. An accurate clone-based haplotyping method by overlapping pool sequencing. Nucleic Acids Res 2016; 44:e112. [PMID: 27095193 PMCID: PMC4937318 DOI: 10.1093/nar/gkw284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2015] [Accepted: 04/07/2016] [Indexed: 11/25/2022]
Abstract
Chromosome-long haplotyping of human genomes is important to identify genetic variants with differing gene expression, in human evolution studies, clinical diagnosis, and other biological and medical fields. Although several methods have realized haplotyping based on sequencing technologies or population statistics, accuracy and cost are factors that prohibit their wide use. Borrowing ideas from group testing theories, we proposed a clone-based haplotyping method by overlapping pool sequencing. The clones from a single individual were pooled combinatorially and then sequenced. According to the distinct pooling pattern for each clone in the overlapping pool sequencing, alleles for the recovered variants could be assigned to their original clones precisely. Subsequently, the clone sequences could be reconstructed by linking these alleles accordingly and assembling them into haplotypes with high accuracy. To verify the utility of our method, we constructed 130 110 clones in silico for the individual NA12878 and simulated the pooling and sequencing process. Ultimately, 99.9% of variants on chromosome 1 that were covered by clones from both parental chromosomes were recovered correctly, and 112 haplotype contigs were assembled with an N50 length of 3.4 Mb and no switch errors. A comparison with current clone-based haplotyping methods indicated our method was more accurate.
Affiliation(s)
- Cheng Li
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Changchang Cao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Jing Tu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
- Xiao Sun
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210002, China
|
6
|
|
7
|
Kim K, Seong MW, Chung WH, Park SS, Leem S, Park W, Kim J, Lee K, Park RW, Kim N. Effect of Next-Generation Exome Sequencing Depth for Discovery of Diagnostic Variants. Genomics Inform 2015; 13:31-9. [PMID: 26175660 PMCID: PMC4500796 DOI: 10.5808/gi.2015.13.2.31] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/08/2015] [Revised: 05/26/2015] [Accepted: 05/28/2015] [Indexed: 02/06/2023]
Abstract
Sequencing depth, which is directly related to the cost and time required for the generation, processing, and maintenance of next-generation sequencing data, is an important factor in the practical utilization of such data in clinical fields. Unfortunately, identifying an exome sequencing depth adequate for clinical use is a challenge that has not been addressed extensively. Here, we investigate the effect of exome sequencing depth on the discovery of sequence variants for clinical use. Toward this, we sequenced ten germ-line blood samples from breast cancer patients on the Illumina platform GAII(x) at a high depth of ~200×. We observed that most function-related diverse variants in the human exonic regions could be detected at a sequencing depth of 120×. Furthermore, investigation using a diagnostic gene set showed that the number of clinical variants identified using exome sequencing reached a plateau at an average sequencing depth of about 120×. Moreover, the phenomena were consistent across the breast cancer samples.
Affiliation(s)
- Kyung Kim
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea; Department of Biomedical Science, Graduate School, Ajou University, Suwon 443-749, Korea; Korean Bioinformation Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Korea
- Moon-Woo Seong
- Department of Laboratory Medicine, Seoul National University Hospital College of Medicine, Seoul 110-799, Korea
- Won-Hyong Chung
- Korean Bioinformation Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Korea
- Sung Sup Park
- Department of Laboratory Medicine, Seoul National University Hospital College of Medicine, Seoul 110-799, Korea
- Sangseob Leem
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea
- Won Park
- Department of Functional Genomics, Korea University of Science and Technology, Daejeon 305-806, Korea; Epigenomics Research Center, Genome Institute, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Korea
- Jihyun Kim
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea; Department of Biomedical Science, Graduate School, Ajou University, Suwon 443-749, Korea
- KiYoung Lee
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea; Department of Biomedical Science, Graduate School, Ajou University, Suwon 443-749, Korea
- Rae Woong Park
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon 443-749, Korea; Department of Biomedical Science, Graduate School, Ajou University, Suwon 443-749, Korea
- Namshin Kim
- Department of Functional Genomics, Korea University of Science and Technology, Daejeon 305-806, Korea; Epigenomics Research Center, Genome Institute, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Korea
|
8
|
Cao CC, Sun X. Accurate estimation of haplotype frequency from pooled sequencing data and cost-effective identification of rare haplotype carriers by overlapping pool sequencing. Bioinformatics 2014; 31:515-22. [PMID: 25304780 DOI: 10.1093/bioinformatics/btu670] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 02/04/2023]
Abstract
MOTIVATION A variety of hypotheses have been proposed for finding the missing heritability of complex diseases in genome-wide association studies. Studies have focused on the value of haplotypes to improve the power of detecting associations with disease. To facilitate haplotype-based association analysis, it is necessary to accurately estimate haplotype frequencies of pooled samples. RESULTS Taking advantage of databases that contain prior haplotypes, we present Ehapp, which is based on an algorithm for solving a system of linear equations, to estimate the frequencies of haplotypes from pooled sequencing data. Effects of various factors in sequencing on the performance are evaluated using simulated data. Our method could estimate the frequencies of haplotypes with only about 3% average relative difference for pooled sequencing of the mixture of 10 haplotypes with total coverage of 50×. When unknown haplotypes exist, our method maintains excellent performance for haplotypes with actual frequencies >0.05. Comparisons with an existing method on simulated data in conjunction with publicly available Illumina sequencing data indicate that our method is state of the art for many sequencing study designs. We also demonstrate the feasibility of applying overlapping pool sequencing to identify rare haplotype carriers cost-effectively. AVAILABILITY AND IMPLEMENTATION Ehapp (in Perl) for the Linux platforms is available online (http://bioinfo.seu.edu.cn/Ehapp/). CONTACT xsun@seu.edu.cn SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
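To make the "system of linear equations" concrete: if the candidate haplotypes are known, the alternate-allele frequency observed at each SNP in the pool is the sum of the frequencies of the haplotypes carrying that allele, and the frequencies sum to one. The sketch below solves a toy, noise-free, square system exactly. The solver, the haplotype matrix, and the numbers are assumptions for illustration; Ehapp itself is a Perl program, real read data are noisy and usually overdetermined, and a least-squares or constrained solver would be needed in practice.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination.
    Raises StopIteration if the matrix is singular (no non-zero pivot)."""
    n = len(a)
    # Augmented matrix [a | b] with exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


# Three hypothetical haplotypes over two SNPs (1 = alternate allele):
#   h1 = (1, 0), h2 = (0, 1), h3 = (1, 1)
a = [
    [1, 1, 1],  # frequencies sum to one
    [1, 0, 1],  # SNP 1: alternate allele carried by h1 and h3
    [0, 1, 1],  # SNP 2: alternate allele carried by h2 and h3
]
# Right-hand side: 1, then the alternate-allele frequency observed at each SNP.
b = [1, Fraction(7, 10), Fraction(1, 2)]
freqs = solve_linear(a, b)  # [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
```

Exact fractions are used only to keep the toy example free of rounding; they are not a statement about how the published tool handles numerics.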
Affiliation(s)
- Chang-Chang Cao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Xiao Sun
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
|
9
|
Cao CC, Li C, Sun X. Quantitative group testing-based overlapping pool sequencing to identify rare variant carriers. BMC Bioinformatics 2014; 15:195. [PMID: 24934981 PMCID: PMC4229885 DOI: 10.1186/1471-2105-15-195] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 11/06/2013] [Accepted: 06/10/2014] [Indexed: 11/23/2022]
Abstract
Background Genome-wide association studies have revealed that rare variants are responsible for a large portion of the heritability of some complex human diseases. This highlights the increasing importance of detecting and screening for rare variants. Although massively parallel sequencing technologies have greatly reduced the cost of DNA sequencing, the identification of rare variant carriers by large-scale re-sequencing remains prohibitively expensive because of the huge challenge of constructing libraries for thousands of samples. Recently, several studies have reported that techniques from group testing theory and compressed sensing could help identify rare variant carriers in large-scale samples with few pooled sequencing experiments and a dramatically reduced cost. Results Based on quantitative group testing, we propose an efficient overlapping pool sequencing strategy that allows the recovery of variant carriers among numerous individuals at much lower cost than conventional methods. We used random k-set pool designs to mix samples, and optimized the design parameters according to an indicative probability. Based on a mathematical model of sequencing depth distribution, an optimal threshold was selected to declare a pool positive or negative. Then, using the quantitative information contained in the sequencing results, we designed a heuristic Bayesian probability decoding algorithm to identify variant carriers. Finally, we conducted in silico experiments to find variant carriers among 200 simulated Escherichia coli strains. With the simulated pools and publicly available Illumina sequencing data, our method correctly identified the variant carriers for 91.5–97.9% of variants with the variant frequency ranging from 0.5 to 1.5%. Conclusions Using the number of reads, variant carriers could be identified precisely even though samples were randomly selected and pooled. Our method performed better than the published DNA Sudoku design and compressed sequencing, especially in reducing the required data throughput and cost.
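The first two steps described in the Results (a random k-set pool design, then a threshold that declares each pool positive or negative) can be sketched as follows. The function names, the pool count, the value of k, and the fixed read-fraction threshold are all assumptions for illustration; the paper selects its threshold from a sequencing-depth model and then applies a heuristic Bayesian decoder, neither of which is reproduced here.

```python
import random


def random_k_set_design(n_samples, n_pools, k, seed=0):
    """Assign every sample to k distinct pools chosen uniformly at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pools), k)) for _ in range(n_samples)]


def pool_members(design, n_pools):
    """Invert the design: list the samples mixed into each pool."""
    members = [[] for _ in range(n_pools)]
    for sample, pools in enumerate(design):
        for p in pools:
            members[p].append(sample)
    return members


def positive_pools(variant_reads, total_reads, threshold):
    """Declare a pool positive when its fraction of variant-supporting reads
    reaches the threshold (a fixed cut-off is assumed here)."""
    return [
        p
        for p, (v, t) in enumerate(zip(variant_reads, total_reads))
        if t > 0 and v / t >= threshold
    ]


# 200 samples spread over 30 pools, 4 pools per sample (hypothetical parameters).
design = random_k_set_design(200, 30, 4, seed=1)
members = pool_members(design, 30)
calls = positive_pools([0, 5, 1], [100, 100, 100], threshold=0.03)  # [1]
```

The read counts keep more information than the binary positive/negative call, which is the quantitative signal the abstract says the decoding step goes on to use.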
Affiliation(s)
- Xiao Sun
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
|