1
Mangiola S, Roth-Schulze AJ, Trussart M, Zozaya-Valdés E, Ma M, Gao Z, Rubin AF, Speed TP, Shim H, Papenfuss AT. sccomp: Robust differential composition and variability analysis for single-cell data. Proc Natl Acad Sci U S A 2023; 120:e2203828120. [PMID: 37549298 PMCID: PMC10438834 DOI: 10.1073/pnas.2203828120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2022] [Accepted: 05/18/2023] [Indexed: 08/09/2023] Open
Abstract
Cellular omics such as single-cell genomics, proteomics, and microbiomics allow the characterization of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to revealing markers of disease progression, such as cancer and pathogen infection. A dedicated statistical method for differential variability analysis is lacking for cellular omics data, and existing methods for differential composition analysis do not model some compositional data properties, suggesting there is room to improve model performance. Here, we introduce sccomp, a method for differential composition and variability analyses that jointly models data count distribution, compositionality, group-specific variability, and proportion mean-variability association, while accounting for outliers. sccomp provides a comprehensive analysis framework that offers realistic data simulation and cross-study knowledge transfer. We demonstrate that mean-variability association is ubiquitous across technologies, highlighting the inadequacy of the very popular Dirichlet-multinomial distribution. We show that sccomp accurately fits experimental data, significantly improving performance over state-of-the-art algorithms. Using sccomp, we identified differential constraints and composition in the microenvironment of primary breast cancer.
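One way to see why a single Dirichlet-multinomial can be too rigid for such data is the textbook variance formula, offered here as an illustration rather than the derivation used in the paper. Assume proportions $p \sim \mathrm{Dirichlet}(\alpha_0 \mu)$ and counts $y \mid p \sim \mathrm{Multinomial}(n, p)$; the symbols $\mu$, $\alpha_0$, and $n$ are notation introduced for this sketch only. Then:

```latex
\operatorname{Var}(p_k) = \frac{\mu_k (1 - \mu_k)}{\alpha_0 + 1},
\qquad
\operatorname{Var}(y_k) = n\, \mu_k (1 - \mu_k)\, \frac{n + \alpha_0}{1 + \alpha_0}.
```

All components share the single precision parameter $\alpha_0$, so once the means are fixed the variability of every cell type is fixed as well. This leaves no room for a separately estimated mean-variability trend or for group-specific variability.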
Affiliation(s)
- Stefano Mangiola
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC3052, Australia
- Alexandra J. Roth-Schulze
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC3052, Australia
- Marie Trussart
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
- Enrique Zozaya-Valdés
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC3052, Australia
- Mengyao Ma
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
- Zijie Gao
  - Melbourne Integrative Genomics, University of Melbourne, Parkville, VIC3052, Australia
  - School of Mathematics and Statistics, University of Melbourne, Parkville, VIC3052, Australia
- Alan F. Rubin
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC3052, Australia
- Terence P. Speed
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
- Heejung Shim
  - Melbourne Integrative Genomics, University of Melbourne, Parkville, VIC3052, Australia
  - School of Mathematics and Statistics, University of Melbourne, Parkville, VIC3052, Australia
- Anthony T. Papenfuss
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC3052, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, VIC3052, Australia

2
Becker D, Champredon D, Chato C, Gugan G, Poon A. SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications. NAR Genom Bioinform 2023; 5:lqad038. [PMID: 37101658 PMCID: PMC10124968 DOI: 10.1093/nargab/lqad038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 02/15/2023] [Accepted: 04/06/2023] [Indexed: 04/28/2023] Open
Abstract
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty, which naturally leads to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these resampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
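The resampling idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SUP implementation: it assumes Phred-scaled quality scores, it assumes that an erroneous call is equally likely to be any of the three other bases, and all function names are hypothetical.

```python
import random

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def probability_matrix(seq, quals):
    """Return one row per position holding P(A), P(C), P(G), P(T).

    Assumption: the called base gets probability 1 - e, and the error
    mass e is split evenly over the three other bases.
    """
    rows = []
    for base, q in zip(seq, quals):
        e = phred_to_error(q)
        rows.append([1 - e if b == base else e / 3 for b in BASES])
    return rows

def resample(matrix, rng):
    """Draw one plausible sequence from the per-position probabilities."""
    return "".join(rng.choices(BASES, weights=row)[0] for row in matrix)

rng = random.Random(0)
matrix = probability_matrix("ACGT", [40, 40, 3, 40])
replicates = [resample(matrix, rng) for _ in range(1000)]

# The third position has Q3 (error probability about 0.5), so it changes in
# roughly half of the replicates; the Q40 positions almost never change.
changed = sum(r[2] != "G" for r in replicates)
```

Running a downstream analysis once per replicate and summarising the spread of the results is what propagates the base-call uncertainty, at a cost that grows linearly with the number of replicates.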
Affiliation(s)
- Devan Becker
  - To whom correspondence should be addressed. Tel: +1 519 884 1970 (Ext 2464)
- Connor Chato
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Gopi Gugan
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Art Poon
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada

3
Dang Z, Yang J, Wang L, Tao Q, Zhang F, Zhang Y, Luo Z. Sampling Variation of RAD-Seq Data from Diploid and Tetraploid Potato (Solanum tuberosum L.). Plants 2021; 10:319. [PMID: 33562246 PMCID: PMC7915145 DOI: 10.3390/plants10020319] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/27/2020] [Revised: 01/24/2021] [Accepted: 02/02/2021] [Indexed: 12/02/2022]
Abstract
The new sequencing technology enables identification of genome-wide sequence-based variants at a population level and at a competitively low cost. The sequence variant-based molecular markers have motivated enormous interest in population and quantitative genetic analyses. Generation of the sequence data involves a sophisticated experimental process embedded with rich non-biological variation. Statistically, the sequencing process involves sampling DNA fragments from an individual sequence. Adequate knowledge of the sampling variation in sequence data generation is a key statistical prerequisite for any downstream analysis of the data and for implementing statistically appropriate methods. This paper reports a thorough investigation of modeling the sampling variation of the sequence data from the optimized RAD-seq (restriction site-associated DNA sequencing) experiments with two parents and their offspring of diploid and autotetraploid potato (Solanum tuberosum L.). The analysis shows significant overdispersion in the sampling variation of the sequence data relative to that expected under the multinomial distribution widely assumed in the literature, and provides statistical methods for modeling the variation and calculating the model parameters, which may be easily implemented in real sequence datasets. The optimized design of RAD-seq experiments enabled effective control of presentation of undesirable chloroplast DNA and RNA genes in the sequence data generated.
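A simple way to picture the overdispersion reported here is to compare the observed variance of allele read counts with the variance expected under binomial sampling, the two-category case of the multinomial. The sketch below is an illustration on simulated data, not the authors' model: the beta-binomial generator, the Beta(10, 10) parameters, and the function names are all assumptions made for the example.

```python
import random
from statistics import mean, pvariance

def dispersion_ratio(counts, depth):
    """Observed variance of the counts divided by the binomial variance n*p*(1-p)."""
    p_hat = mean(counts) / depth
    return pvariance(counts) / (depth * p_hat * (1 - p_hat))

def draw_binomial(rng, n, p):
    """Draw a binomial count as the sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(1)
depth = 100  # reads per site (assumed)

# Pure binomial sampling: the allele fraction is exactly 0.5 at every site.
plain = [draw_binomial(rng, depth, 0.5) for _ in range(2000)]

# Beta-binomial sampling: the allele fraction itself varies around 0.5,
# which inflates the variance of the counts.
noisy = [draw_binomial(rng, depth, rng.betavariate(10, 10)) for _ in range(2000)]

ratio_plain = dispersion_ratio(plain, depth)  # close to 1
ratio_noisy = dispersion_ratio(noisy, depth)  # well above 1
```

A ratio near 1 is consistent with multinomial sampling, while a ratio well above 1 suggests that a model with an extra dispersion parameter is needed.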
Affiliation(s)
- Zhenyu Dang
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
- Jixuan Yang
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
- Lin Wang
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
- Qin Tao
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
- Fengjun Zhang
  - Qinghai Academy of Agricultural and Forestry Sciences, Xining 200433, China
- Yuxin Zhang
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
- Zewei Luo
  - Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University Shanghai, Shanghai 200433, China
  - School of Biosciences, University of Birmingham, Birmingham B15 2TT, UK
  - Correspondence: Tel.: +44-121-414-5404

4
Winter DJ, Wu SH, Howell AA, Azevedo RBR, Zufall RA, Cartwright RA. accuMUlate: a mutation caller designed for mutation accumulation experiments. Bioinformatics 2018; 34:2659-2660. [PMID: 29566129 DOI: 10.1093/bioinformatics/bty165] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/10/2017] [Accepted: 03/15/2018] [Indexed: 11/13/2022] Open
Abstract
Summary: Mutation accumulation (MA) is the most widely used method for directly studying the effects of mutation. By sequencing whole genomes from MA lines, researchers can directly study the rate and molecular spectra of spontaneous mutations and use these results to understand how mutation contributes to biological processes. At present there is no software designed specifically for identifying mutations from MA lines. Here we describe accuMUlate, a probabilistic mutation caller that reflects the design of a typical MA experiment while being flexible enough to accommodate properties unique to any particular experiment. Availability and implementation: accuMUlate is available from https://github.com/dwinter/accuMUlate. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David J Winter
  - The Biodesign Institute, Arizona State University, Tempe, AZ, USA
  - Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand
- Steven H Wu
  - The Biodesign Institute, Arizona State University, Tempe, AZ, USA
  - Bioconsortia Inc, Davis, CA, USA
- Abigail A Howell
  - The Biodesign Institute, Arizona State University, Tempe, AZ, USA
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Ricardo B R Azevedo
  - Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
- Rebecca A Zufall
  - Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
- Reed A Cartwright
  - The Biodesign Institute, Arizona State University, Tempe, AZ, USA
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA

5
Günther T, Nettelblad C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet 2019; 15:e1008302. [PMID: 31348818 PMCID: PMC6685638 DOI: 10.1371/journal.pgen.1008302] [Citation(s) in RCA: 103] [Impact Index Per Article: 20.6] [Received: 01/15/2019] [Revised: 08/07/2019] [Accepted: 07/10/2019] [Indexed: 11/18/2022] Open
Abstract
Haploid high quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the number of accepted mismatches and increasing the probability of multiple matching sites in the genome. These ancient DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias but even a small proportion of sites with bias can impact analyses of those particular loci or slightly skew genome-wide estimates. Therefore, reference bias has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
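The interaction between mapping bias and pseudo-haploid calling can be illustrated with a small simulation. This is a toy model built on assumptions, not the analysis from the paper: reads at a heterozygous site carry either allele with equal probability, reference-allele reads always map, alternative-allele reads map with a hypothetical probability `alt_map_prob`, and one mapped read is then sampled per site.

```python
import random

def pseudo_haploid_call(rng, depth, alt_map_prob):
    """Sample one mapped read at a heterozygous site.

    Returns "ref", "alt", or None when no read mapped.
    """
    mapped = []
    for _ in range(depth):
        allele = "ref" if rng.random() < 0.5 else "alt"
        # Assumption: reference reads always map; alternative reads
        # map only with probability alt_map_prob.
        if allele == "ref" or rng.random() < alt_map_prob:
            mapped.append(allele)
    return rng.choice(mapped) if mapped else None

def ref_fraction(rng, n_sites, depth, alt_map_prob):
    """Fraction of called heterozygous sites that show the reference allele."""
    calls = [pseudo_haploid_call(rng, depth, alt_map_prob) for _ in range(n_sites)]
    calls = [c for c in calls if c is not None]
    return sum(c == "ref" for c in calls) / len(calls)

rng = random.Random(2)
unbiased = ref_fraction(rng, 20000, depth=2, alt_map_prob=1.0)  # about 0.50
biased = ref_fraction(rng, 20000, depth=2, alt_map_prob=0.8)    # about 0.56
```

In this toy model the expected reference fraction is 0.5 / (0.5 + 0.5 × `alt_map_prob`), so even a modest loss of alternative-allele reads shifts pseudo-haploid genotypes towards the reference. The value 0.8 is chosen only to make the effect visible.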
Affiliation(s)
- Torsten Günther
  - Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Carl Nettelblad
  - Division of Scientific Computing, Department of Information Technology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden

6
Wong TKF, Ranjard L, Lin Y, Rodrigo AG. HaploJuice: accurate haplotype assembly from a pool of sequences with known relative concentrations. BMC Bioinformatics 2018; 19:389. [PMID: 30348075 PMCID: PMC6198429 DOI: 10.1186/s12859-018-2424-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/25/2018] [Accepted: 10/09/2018] [Indexed: 11/10/2022] Open
Abstract
Background: Pooling techniques, where multiple sub-samples are mixed in a single sample, are widely used to take full advantage of high-throughput DNA sequencing. Recently, Ranjard et al. (PLoS ONE 13:0195090, 2018) proposed a pooling strategy without the use of barcodes. Three sub-samples were mixed in different known proportions (i.e. 62.5%, 25% and 12.5%), and a method was developed to use these proportions to reconstruct the three haplotypes effectively. Results: HaploJuice provides an alternative haplotype reconstruction algorithm for Ranjard et al.’s pooling strategy. HaploJuice significantly increases the accuracy by first identifying the empirical proportions of the three mixed sub-samples and then assembling the haplotypes using a dynamic programming approach. HaploJuice was evaluated against five different assembly algorithms, Hmmfreq (Ranjard et al., PLoS ONE 13:0195090, 2018), ShoRAH (Zagordi et al., BMC Bioinformatics 12:119, 2011), SAVAGE (Baaijens et al., Genome Res 27:835-848, 2017), PredictHaplo (Prabhakaran et al., IEEE/ACM Trans Comput Biol Bioinform 11:182-91, 2014) and QuRe (Prosperi and Salemi, Bioinformatics 28:132-3, 2012). Using simulated and real data sets, HaploJuice reconstructed the true sequences with the highest coverage and the lowest error rate. Conclusion: HaploJuice provides high accuracy in haplotype reconstruction, making Ranjard et al.’s pooling strategy more efficient, feasible, and applicable, with the benefit of reducing the sequencing cost.
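Known mixing proportions can stand in for barcodes because, with 62.5%, 25% and 12.5%, every non-empty combination of sub-samples yields a different expected allele frequency. The sketch below illustrates that idea only. It is neither HaploJuice's dynamic programming algorithm nor the hidden Markov model of Ranjard et al.; the sub-sample labels `A`, `B`, `C` and the function names are hypothetical, and the proportions are taken as exact rather than estimated empirically.

```python
from itertools import combinations

# Mixing proportions quoted in the abstract; the labels are made up.
PROPORTIONS = {"A": 0.625, "B": 0.25, "C": 0.125}

def subset_frequencies(proportions):
    """Map each non-empty subset of sub-samples to its expected allele frequency."""
    names = sorted(proportions)
    table = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            table[subset] = sum(proportions[n] for n in subset)
    return table

def assign_carriers(observed_freq, proportions=PROPORTIONS):
    """Return the subset whose expected frequency is closest to the observed one."""
    table = subset_frequencies(proportions)
    return min(table, key=lambda s: abs(table[s] - observed_freq))

# An allele seen in about 36% of reads is best explained by sub-samples
# B and C together (0.25 + 0.125 = 0.375).
carriers = assign_carriers(0.36)
```

Real coverage is noisy, which is presumably why HaploJuice first identifies the empirical proportions from the data before assembling the haplotypes.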
Affiliation(s)
- Thomas K F Wong
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia
- Yu Lin
  - College of Engineering and Computer Science, The Australian National University, Acton ACT, 2601, Australia
- Allen G Rodrigo
  - The Research School of Biology, The Australian National University, Acton ACT, 2601, Australia

7
Spooner W, McLaren W, Slidel T, Finch DK, Butler R, Campbell J, Eghobamien L, Rider D, Kiefer CM, Robinson MJ, Hardman C, Cunningham F, Vaughan T, Flicek P, Huntington CC. Haplosaurus computes protein haplotypes for use in precision drug design. Nat Commun 2018; 9:4128. [PMID: 30297836 PMCID: PMC6175845 DOI: 10.1038/s41467-018-06542-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 11/29/2017] [Accepted: 09/07/2018] [Indexed: 01/08/2023] Open
Abstract
Selecting the most appropriate protein sequences is critical for precision drug design. Here we describe Haplosaurus, a bioinformatic tool for computation of protein haplotypes. Haplosaurus computes protein haplotypes from pre-existing chromosomally-phased genomic variation data. Integration into the Ensembl resource provides rapid and detailed protein haplotype retrieval. Using Haplosaurus, we build a database of unique protein haplotypes from the 1000 Genomes dataset reflecting real-world protein sequence variability and their prevalence. For one in seven genes, the most common protein haplotype differs from the reference sequence, and for a similar number the most common haplotype differs between human populations. Three case studies show how knowledge of the range of commonly encountered protein forms predicted in populations leads to insights into therapeutic efficacy. Haplosaurus and its associated database are expected to find broad application in many disciplines that use protein sequences and to be particularly impactful for therapeutics design.
Affiliation(s)
- William Spooner
  - Eagle Genomics Ltd., Biodata Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, CB10 3DR UK
  - Genomics England, QMUL Dawson Hall, London, EC1M 6BQ UK
- William McLaren
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD UK
- Robin Butler
  - MedImmune Ltd., Granta Park, Cambridge, CB21 4QR UK
- David Rider
  - MedImmune Ltd., Granta Park, Cambridge, CB21 4QR UK
- Fiona Cunningham
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD UK
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD UK

8
Ranjard L, Wong TKF, Rodrigo AG. Reassembling haplotypes in a mixture of pooled amplicons when the relative concentrations are known: A proof-of-concept study on the efficient design of next-generation sequencing strategies. PLoS One 2018; 13:e0195090. [PMID: 29621260 PMCID: PMC5886459 DOI: 10.1371/journal.pone.0195090] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/17/2017] [Accepted: 03/18/2018] [Indexed: 12/02/2022] Open
Abstract
Next-generation sequencing can be costly and labour intensive. Usually, the sequencing cost per sample is reduced by pooling amplified DNA (= amplicons) derived from different individuals on the same sequencing lane. Barcodes unique to each amplicon permit short-read sequences to be assigned appropriately. However, the cost of the library preparation increases with the number of barcodes used. We propose an alternative to barcoding: by using different known proportions of individually-derived amplicons in a pooled sample, each is characterised a priori by an expected depth of coverage. We have developed a Hidden Markov Model that uses these expected proportions to reconstruct the input sequences. We apply this method to pools of mitochondrial DNA amplicons extracted from kangaroo meat, genus Macropus. Our experiments indicate that the sequence coverage can be efficiently used to index the short-reads and that we can reassemble the input haplotypes when secondary factors impacting the coverage are controlled. We therefore demonstrate that, by combining our approach with standard barcoding, the cost of the library preparation is reduced to a third.
Affiliation(s)
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Australia
- Thomas K. F. Wong
  - The Research School of Biology, The Australian National University, Australia
- Allen G. Rodrigo
  - The Research School of Biology, The Australian National University, Australia