26
DeMaere MZ, Darling AE. Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies. Gigascience 2018; 7:4628124. [PMID: 29149264 PMCID: PMC5827349 DOI: 10.1093/gigascience/gix103] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 06/01/2017] [Accepted: 10/23/2017] [Indexed: 02/02/2023]
Abstract
Background: Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. Findings: We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. Conclusions: We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing.
27
Fourment M, Claywell BC, Dinh V, McCoy C, Matsen IV FA, Darling AE. Effective Online Bayesian Phylogenetics via Sequential Monte Carlo with Guided Proposals. Syst Biol 2018; 67:490-502. [PMID: 29186587 PMCID: PMC5920299 DOI: 10.1093/sysbio/syx090] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 06/02/2017] [Accepted: 11/20/2017] [Indexed: 11/14/2022]
Abstract
Modern infectious disease outbreak surveillance produces continuous streams of sequence data which require phylogenetic analysis as data arrives. Current software packages for Bayesian phylogenetic inference are unable to quickly incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior suffers from poor performance, and we develop “guided” proposals that better match the proposal density to the posterior. Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to poor results, and that the situation can be resolved by heating the proposal density. The results demonstrate that relative to the widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in accuracy.
28
O'Donoghue SI, Baldi BF, Clark SJ, Darling AE, Hogan JM, Kaur S, Maier-Hein L, McCarthy DJ, Moore WJ, Stenau E, Swedlow JR, Vuong J, Procter JB. Visualization of Biomedical Data. Annu Rev Biomed Data Sci 2018. [DOI: 10.1146/annurev-biodatasci-080917-013424] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 11/09/2022]
Abstract
The rapid increase in volume and complexity of biomedical data requires changes in research, communication, and clinical practices. This includes learning how to effectively integrate automated analysis with high–data density visualizations that clearly express complex phenomena. In this review, we summarize key principles and resources from data visualization research that help address this difficult challenge. We then survey how visualization is being used in a selection of emerging biomedical research areas, including three-dimensional genomics, single-cell RNA sequencing (RNA-seq), the protein structure universe, phosphoproteomics, augmented reality–assisted surgery, and metagenomics. While specific research areas need highly tailored visualizations, there are common challenges that can be addressed with general methods and strategies. Also common, however, are poor visualization practices. We outline ongoing initiatives aimed at improving visualization practices in biomedical research via better tools, peer-to-peer learning, and interdisciplinary collaboration with computer scientists, science communicators, and graphic designers. These changes are revolutionizing how we see and think about our data.
29
Fourment M, Darling AE. Local and relaxed clocks: the best of both worlds. PeerJ 2018; 6:e5140. [PMID: 30002973 PMCID: PMC6034591 DOI: 10.7717/peerj.5140] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/20/2018] [Accepted: 06/09/2018] [Indexed: 11/20/2022]
Abstract
Time-resolved phylogenetic methods use information about the time of sample collection to estimate the rate of evolution. Originally, the models used to estimate evolutionary rates were quite simple, assuming that all lineages evolve at the same rate, an assumption commonly known as the molecular clock. Richer and more complex models have since been introduced to capture the phenomenon of substitution rate variation among lineages. Two well known model extensions are the local clock, wherein all lineages in a clade share a common substitution rate, and the uncorrelated relaxed clock, wherein the substitution rate on each lineage is independent from other lineages while being constrained to fit some parametric distribution. We introduce a further model extension, called the flexible local clock (FLC), which provides a flexible framework to combine relaxed clock models with local clock models. We evaluate the flexible local clock on simulated and real datasets and show that it provides substantially improved fit to an influenza dataset. An implementation of the model is available for download from https://www.github.com/4ment/flc.
30
Wang K, Chen YQ, Salido MM, Kohli GS, Kong JL, Liang HJ, Yao ZT, Xie YT, Wu HY, Cai SQ, Drautz-Moses DI, Darling AE, Schuster SC, Yang L, Ding Y. The rapid in vivo evolution of Pseudomonas aeruginosa in ventilator-associated pneumonia patients leads to attenuated virulence. Open Biol 2018; 7:rsob.170029. [PMID: 28878043 PMCID: PMC5627047 DOI: 10.1098/rsob.170029] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 02/07/2017] [Accepted: 07/26/2017] [Indexed: 01/15/2023]
Abstract
Pseudomonas aeruginosa is an opportunistic pathogen that causes severe airway infections in humans. These infections are usually difficult to treat and associated with high mortality rates. While colonizing the human airways, P. aeruginosa can accumulate genetic mutations that often lead to better adaptation to the host environment. Understanding these evolutionary traits may provide important clues for the development of effective therapies to treat P. aeruginosa infections. In this study, 25 P. aeruginosa isolates were longitudinally sampled from the airways of four ventilator-associated pneumonia (VAP) patients. PacBio and Illumina sequencing were used to analyse the in vivo evolutionary trajectories of these isolates. Our analysis showed that positive selection dominantly shaped P. aeruginosa genomes during VAP infections and led to three convergent evolution events, including loss-of-function mutations of lasR and mpl, and a pyoverdine-deficient phenotype. Specifically, lasR encodes one of the major transcriptional regulators in quorum sensing, whereas mpl encodes an enzyme responsible for recycling cell wall peptidoglycan. We also found that P. aeruginosa isolated at late stages of VAP infections produce less elastase and are less virulent in vivo than their earlier isolated counterparts, suggesting the short-term in vivo evolution of P. aeruginosa leads to attenuated virulence.
31
Deutscher AT, Burke CM, Darling AE, Riegler M, Reynolds OL, Chapman TA. Near full-length 16S rRNA gene next-generation sequencing revealed Asaia as a common midgut bacterium of wild and domesticated Queensland fruit fly larvae. Microbiome 2018; 6:85. [PMID: 29729663 PMCID: PMC5935925 DOI: 10.1186/s40168-018-0463-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 10/30/2017] [Accepted: 04/19/2018] [Indexed: 05/25/2023]
Abstract
BACKGROUND: Gut microbiota affects tephritid (Diptera: Tephritidae) fruit fly development, physiology, behavior, and thus the quality of flies mass-reared for the sterile insect technique (SIT), a target-specific, sustainable, environmentally benign form of pest management. The Queensland fruit fly, Bactrocera tryoni (Tephritidae), is a significant horticultural pest in Australia and can be managed with SIT. Little is known about the impacts that laboratory-adaptation (domestication) and mass-rearing have on the tephritid larval gut microbiome. Read lengths of previous fruit fly next-generation sequencing (NGS) studies have limited the resolution of microbiome studies, and the diversity within populations is often overlooked. In this study, we used a new near full-length (> 1300 nt) 16S rRNA gene amplicon NGS approach to characterize gut bacterial communities of individual B. tryoni larvae from two field populations (developing in peaches) and three domesticated populations (mass- or laboratory-reared on artificial diets). RESULTS: Near full-length 16S rRNA gene sequences were obtained for 56 B. tryoni larvae. OTU clustering at 99% similarity revealed that gut bacterial diversity was low and significantly lower in domesticated larvae. Bacteria commonly associated with fruit (Acetobacteraceae, Enterobacteriaceae, and Leuconostocaceae) were detected in wild larvae, but were largely absent from domesticated larvae. However, Asaia, an acetic acid bacterium not frequently detected within adult tephritid species, was detected in larvae of both wild and domesticated populations (55 out of 56 larval gut samples). Larvae from the same single peach shared a similar gut bacterial profile, whereas larvae from different peaches collected from the same tree had different gut bacterial profiles. Clustering of the Asaia near full-length sequences at 100% similarity showed that the wild flies from different locations had different Asaia strains. CONCLUSIONS: Variation in the gut bacterial communities of B. tryoni larvae depends on diet, domestication, and horizontal acquisition. Bacterial variation in wild larvae suggests that more than one bacterial species can perform the same functional role; however, Asaia could be an important gut bacterium in larvae and warrants further study. A greater understanding of the functions of the bacteria detected in larvae could lead to increased fly quality and performance as part of the SIT.
32
Dinh V, Darling AE, Matsen IV FA. Online Bayesian Phylogenetic Inference: Theoretical Foundations via Sequential Monte Carlo. Syst Biol 2018; 67:503-517. [PMID: 29244177 PMCID: PMC5920340 DOI: 10.1093/sysbio/syx087] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/02/2017] [Revised: 11/08/2017] [Accepted: 11/09/2017] [Indexed: 11/29/2022]
Abstract
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method which can update an existing posterior with new sequences. Here, we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo. We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next, we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.
33
Bogema DR, Micallef ML, Liu M, Padula MP, Djordjevic SP, Darling AE, Jenkins C. Analysis of Theileria orientalis draft genome sequences reveals potential species-level divergence of the Ikeda, Chitose and Buffeli genotypes. BMC Genomics 2018; 19:298. [PMID: 29703152 PMCID: PMC5921998 DOI: 10.1186/s12864-018-4701-2] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 09/04/2017] [Accepted: 04/18/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Theileria orientalis (Apicomplexa: Piroplasmida) has caused clinical disease in cattle of Eastern Asia for many years and its recent rapid spread throughout Australian and New Zealand herds has caused substantial economic losses to production through cattle deaths, late term abortion and morbidity. Disease outbreaks have been linked to the detection of a pathogenic genotype of T. orientalis, genotype Ikeda, which is also responsible for disease outbreaks in Asia. Here, we sequenced and compared the draft genomes of one pathogenic (Ikeda) and two apathogenic (Chitose, Buffeli) isolates of T. orientalis sourced from Australian herds. RESULTS: Using de novo assembled sequences and a single nucleotide variant (SNV) analysis pipeline, we found extensive genetic divergence between the T. orientalis genotypes. A genome-wide phylogeny reconstructed to address continued confusion over nomenclature of this species displayed concordance with prior phylogenetic studies based on the major piroplasm surface protein (MPSP) gene. However, average nucleotide identity (ANI) values revealed that the divergence between isolates is comparable to that observed between other theilerias which represent distinct species. Analysis of SNVs revealed putative recombination between the Chitose and Buffeli genotypes and also between Australian and Japanese Ikeda isolates. Finally, to inform future vaccine studies, dN/dS ratios and surface location predictions were analysed. Six predicted surface protein targets were confirmed to be expressed during the piroplasm phase of the parasite by mass spectrometry. CONCLUSIONS: We used whole genome sequencing to demonstrate that the T. orientalis Ikeda, Chitose and Buffeli variants show substantial genetic divergence. Our data indicate that future researchers could potentially consider disease-associated Ikeda and closely related genotypes as a separate species from non-pathogenic Chitose and Buffeli.
34
Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dröge J, Gregor I, Majda S, Fiedler J, Dahms E, Bremges A, Fritz A, Garrido-Oter R, Jørgensen TS, Shapiro N, Blood PD, Gurevich A, Bai Y, Turaev D, DeMaere MZ, Chikhi R, Nagarajan N, Quince C, Meyer F, Balvočiūtė M, Hansen LH, Sørensen SJ, Chia BKH, Denis B, Froula JL, Wang Z, Egan R, Don Kang D, Cook JJ, Deltel C, Beckstette M, Lemaitre C, Peterlongo P, Rizk G, Lavenier D, Wu YW, Singer SW, Jain C, Strous M, Klingenberg H, Meinicke P, Barton MD, Lingner T, Lin HH, Liao YC, Silva GGZ, Cuevas DA, Edwards RA, Saha S, Piro VC, Renard BY, Pop M, Klenk HP, Göker M, Kyrpides NC, Woyke T, Vorholt JA, Schulze-Lefert P, Rubin EM, Darling AE, Rattei T, McHardy AC. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat Methods 2017; 14:1063-1071. [PMID: 28967888 DOI: 10.1101/099127] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 12/29/2016] [Accepted: 08/25/2017] [Indexed: 05/25/2023]
Abstract
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
35
Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dröge J, Gregor I, Majda S, Fiedler J, Dahms E, Bremges A, Fritz A, Garrido-Oter R, Jørgensen TS, Shapiro N, Blood PD, Gurevich A, Bai Y, Turaev D, DeMaere MZ, Chikhi R, Nagarajan N, Quince C, Meyer F, Balvočiūtė M, Hansen LH, Sørensen SJ, Chia BKH, Denis B, Froula JL, Wang Z, Egan R, Don Kang D, Cook JJ, Deltel C, Beckstette M, Lemaitre C, Peterlongo P, Rizk G, Lavenier D, Wu YW, Singer SW, Jain C, Strous M, Klingenberg H, Meinicke P, Barton MD, Lingner T, Lin HH, Liao YC, Silva GGZ, Cuevas DA, Edwards RA, Saha S, Piro VC, Renard BY, Pop M, Klenk HP, Göker M, Kyrpides NC, Woyke T, Vorholt JA, Schulze-Lefert P, Rubin EM, Darling AE, Rattei T, McHardy AC. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat Methods 2017; 14:1063-1071. [PMID: 28967888 DOI: 10.1038/nmeth.4458] [Citation(s) in RCA: 436] [Impact Index Per Article: 62.3] [Received: 12/29/2016] [Accepted: 08/25/2017] [Indexed: 12/12/2022]
Abstract
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
36
Liu MY, Worden P, Monahan LG, DeMaere MZ, Burke CM, Djordjevic SP, Charles IG, Darling AE. Evaluation of ddRADseq for reduced representation metagenome sequencing. PeerJ 2017; 5:e3837. [PMID: 28948110 PMCID: PMC5609526 DOI: 10.7717/peerj.3837] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 07/15/2017] [Accepted: 08/31/2017] [Indexed: 11/23/2022]
Abstract
Background: Profiling of microbial communities via metagenomic shotgun sequencing has enabled researchers to gain unprecedented insight into microbial community structure and the functional roles of community members. This study describes a method and basic analysis for a metagenomic adaptation of the double digest restriction site associated DNA sequencing (ddRADseq) protocol for reduced representation metagenome profiling. Methods: This technique takes advantage of the sequence specificity of restriction endonucleases to construct an Illumina-compatible sequencing library containing DNA fragments that are between a pair of restriction sites located within close proximity. This results in a reduced sequencing library with coverage breadth that can be tuned by size selection. We assessed the performance of the metagenomic ddRADseq approach by applying the full method to human stool samples and generating sequence data. Results: The ddRADseq data yields a similar estimate of community taxonomic profile as obtained from shotgun metagenome sequencing of the same human stool samples. No obvious bias with respect to genomic G + C content and the estimated relative species abundance was detected. Discussion: Although ddRADseq does introduce some bias in taxonomic representation, the bias is likely to be small relative to DNA extraction bias. ddRADseq appears feasible and could have value as a tool for metagenome-wide association studies.
37
Fourment M, Darling AE, Holmes EC. The impact of migratory flyways on the spread of avian influenza virus in North America. BMC Evol Biol 2017; 17:118. [PMID: 28545432 PMCID: PMC5445350 DOI: 10.1186/s12862-017-0965-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 02/07/2017] [Accepted: 05/11/2017] [Indexed: 11/16/2022]
Abstract
Background: Wild birds are the major reservoir hosts for influenza A viruses (AIVs) and have been implicated in the emergence of pandemic events in livestock and human populations. Understanding how AIVs spread within and across continents is therefore critical to the development of successful strategies to manage and reduce the impact of influenza outbreaks. In North America many bird species undergo seasonal migratory movements along a North-South axis, thereby providing opportunities for viruses to spread over long distances. However, the role played by such avian flyways in shaping the genetic structure of AIV populations remains uncertain. Results: To assess the relative contribution of bird migration along flyways to the genetic structure of AIV we performed a large-scale phylogeographic study of viruses sampled in the USA and Canada, involving the analysis of 3805 to 4505 sequences from 36 to 38 geographic localities depending on the gene segment data set. To assist in this we developed a maximum likelihood-based genetic algorithm to explore a wide range of complex spatial models, depicting a more complete picture of the migration network than determined previously. Conclusions: Based on phylogenies estimated from nucleotide sequence data sets, our results show that AIV migration rates are significantly higher within than between flyways, indicating that the migratory patterns of birds play a key role in viral dispersal. These findings provide valuable insights into the evolution, maintenance and transmission of AIVs, in turn allowing the development of improved programs for surveillance and risk assessment.
38
DeMaere MZ, Darling AE. Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C). PeerJ 2016; 4:e2676. [PMID: 27843713 PMCID: PMC5103821 DOI: 10.7717/peerj.2676] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 05/05/2016] [Accepted: 10/11/2016] [Indexed: 11/20/2022]
Abstract
BACKGROUND: Chromosome conformation capture, coupled with high throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised. METHODS: We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure. RESULTS: When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance. DISCUSSION: Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed the best in these scenarios; however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development.
39
Burke CM, Darling AE. A method for high precision sequencing of near full-length 16S rRNA genes on an Illumina MiSeq. PeerJ 2016; 4:e2492. [PMID: 27688981 PMCID: PMC5036073 DOI: 10.7717/peerj.2492] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 12/16/2015] [Accepted: 08/25/2016] [Indexed: 12/21/2022]
Abstract
Background: The bacterial 16S rRNA gene has historically been used in defining bacterial taxonomy and phylogeny. However, there are currently no high-throughput methods to sequence full-length 16S rRNA genes present in a sample with precision. Results: We describe a method for sequencing near full-length 16S rRNA gene amplicons using the high throughput Illumina MiSeq platform and test it using DNA from human skin swab samples. Proof of principle of the approach is demonstrated, with the generation of 1,604 sequences greater than 1,300 nt from a single Nano MiSeq run, with accuracy estimated to be 100-fold higher than standard Illumina reads. The reads were chimera filtered using information from a single molecule dual tagging scheme that boosts the signal available for chimera detection. Conclusions: This method could be scaled up to generate many thousands of sequences per MiSeq run and could be applied to other sequencing platforms. This has great potential for populating databases with high quality, near full-length 16S rRNA gene sequences from under-represented taxa and environments and facilitates analyses of microbial communities at higher resolution.
40
Roy Chowdhury P, DeMaere M, Chapman T, Worden P, Charles IG, Darling AE, Djordjevic SP. Comparative genomic analysis of toxin-negative strains of Clostridium difficile from humans and animals with symptoms of gastrointestinal disease. BMC Microbiol 2016; 16:41. [PMID: 26971047 PMCID: PMC4789261 DOI: 10.1186/s12866-016-0653-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/29/2015] [Accepted: 03/02/2016] [Indexed: 12/13/2022]
Abstract
Background: Clostridium difficile infections (CDI) are a significant health problem to humans and food animals. Clostridial toxins ToxA and ToxB encoded by genes tcdA and tcdB are located on a pathogenicity locus known as the PaLoc and are the major virulence factors of C. difficile. While toxin-negative strains of C. difficile are often isolated from faeces of animals and patients suffering from CDI, they are not considered to play a role in disease. Toxin-negative strains of C. difficile have been used successfully to treat recurring CDI but their propensity to acquire the PaLoc via lateral gene transfer and express clinically relevant levels of toxins has reinforced the need to characterise them genetically. In addition, further studies that examine the pathogenic potential of toxin-negative strains of C. difficile and the frequency by which toxin-negative strains may acquire the PaLoc are needed. Results: We undertook a comparative genomic analysis of five Australian toxin-negative isolates of C. difficile that lack tcdA, tcdB and both binary toxin genes cdtA and cdtB that were recovered from humans and farm animals with symptoms of gastrointestinal disease. Our analyses show that the five C. difficile isolates cluster closely with virulent toxigenic strains of C. difficile belonging to the same sequence type (ST) and have virulence gene profiles akin to those in toxigenic strains. Furthermore, phage acquisition appears to have played a key role in the evolution of C. difficile. Conclusions: Our results are consistent with the C. difficile global population structure comprising six clades each containing both toxin-positive and toxin-negative strains. Our data also suggest that toxin-negative strains of C. difficile encode a repertoire of putative virulence factors that are similar to those found in toxigenic strains of C. difficile, raising the possibility that acquisition of PaLoc by toxin-negative strains poses a threat to human health. 
Studies in appropriate animal models are needed to examine the pathogenic potential of toxin-negative strains of C. difficile and to determine the frequency by which toxin-negative strains may acquire the PaLoc. Electronic supplementary material The online version of this article (doi:10.1186/s12866-016-0653-3) contains supplementary material, which is available to authorized users.
|
41
|
Joss TV, Burke CM, Hudson BJ, Darling AE, Forer M, Alber DG, Charles IG, Stow NW. Bacterial Communities Vary between Sinuses in Chronic Rhinosinusitis Patients. Front Microbiol 2016; 6:1532. [PMID: 26834708 PMCID: PMC4722142 DOI: 10.3389/fmicb.2015.01532] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 08/17/2015] [Accepted: 12/21/2015] [Indexed: 12/02/2022] Open
Abstract
Chronic rhinosinusitis (CRS) is a common and potentially debilitating disease characterized by inflammation of the sinus mucosa for longer than 12 weeks. Bacterial colonization of the sinuses and its role in the pathogenesis of this disease is an ongoing area of research. Recent advances in culture-independent molecular techniques for bacterial identification have the potential to provide a more accurate and complete assessment of the sinus microbiome; however, there is little concordance in results between studies, possibly due to differences in sampling location and technique. This study aimed to determine whether the microbial communities from one sinus could be considered representative of all sinuses, and to examine differences between two commonly used methods for sample collection: swabs and tissue biopsies. High-throughput DNA sequencing of the bacterial 16S rRNA gene was applied to both swab and tissue samples from multiple sinuses of 19 patients undergoing surgery for treatment of CRS. Results from swabs and tissue biopsies showed a high degree of similarity, indicating that swabbing is sufficient to recover the microbial community from the sinuses. Microbial communities from different sinuses within individual patients differed to varying degrees, demonstrating that it is possible for distinct microbiomes to exist simultaneously in different sinuses of the same patient. The sequencing results correlated well with culture-based pathogen identification conducted in parallel, although the culturing missed many species detected by sequencing. This finding has implications for future research into the sinus microbiome, which should take this heterogeneity into account by sampling patients from more than one sinus.
|
42
|
O'Flynn C, Deusch O, Darling AE, Eisen JA, Wallis C, Davis IJ, Harris SJ. Comparative Genomics of the Genus Porphyromonas Identifies Adaptations for Heme Synthesis within the Prevalent Canine Oral Species Porphyromonas cangingivalis. Genome Biol Evol 2015; 7:3397-413. [PMID: 26568374 PMCID: PMC4700951 DOI: 10.1093/gbe/evv220] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022] Open
Abstract
Porphyromonads play an important role in human periodontal disease and recently have been shown to be highly prevalent in canine mouths. Porphyromonas cangingivalis is the most prevalent canine oral bacterial species in both plaque from healthy gingiva and plaque from dogs with early periodontitis. The ability of P. cangingivalis to flourish in the different environmental conditions characterized by these two states suggests a degree of metabolic flexibility. To characterize the genes responsible for this, the genomes of 32 isolates (including 18 newly sequenced and assembled) from 18 Porphyromonad species from dogs, humans, and other mammals were compared. Phylogenetic trees inferred using core genes largely matched previous findings; however, comparative genomic analysis identified several genes and pathways relating to heme synthesis that were present in P. cangingivalis but not in other Porphyromonads. Porphyromonas cangingivalis has a complete protoporphyrin IX synthesis pathway potentially allowing it to synthesize its own heme unlike pathogenic Porphyromonads such as Porphyromonas gingivalis that acquire heme predominantly from blood. Other pathway differences such as the ability to synthesize siroheme and vitamin B12 point to enhanced metabolic flexibility for P. cangingivalis, which may underlie its prevalence in the canine oral cavity.
|
43
|
Dunitz MI, Lang JM, Jospin G, Darling AE, Eisen JA, Coil DA. Swabs to genomes: a comprehensive workflow. PeerJ 2015; 3:e960. [PMID: 26020012 PMCID: PMC4435499 DOI: 10.7717/peerj.960] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 01/30/2015] [Accepted: 04/24/2015] [Indexed: 02/04/2023] Open
Abstract
The sequencing, assembly, and basic analysis of microbial genomes, once a painstaking and expensive undertaking, has become much easier for research labs with access to standard molecular biology and computational tools. However, there is a confusing variety of options available for DNA library preparation and sequencing, and inexperience with bioinformatics can pose a significant barrier to entry for many who may be interested in microbial genomics. The objective of the present study was to design, test, troubleshoot, and publish a simple, comprehensive workflow from the collection of an environmental sample (a swab) to a published microbial genome, empowering even a lab or classroom with limited resources and bioinformatics experience to perform it.
|
44
|
Wyrsch E, Roy Chowdhury P, Abraham S, Santos J, Darling AE, Charles IG, Chapman TA, Djordjevic SP. Comparative genomic analysis of a multiple antimicrobial resistant enterotoxigenic E. coli O157 lineage from Australian pigs. BMC Genomics 2015; 16:165. [PMID: 25888127 PMCID: PMC4384309 DOI: 10.1186/s12864-015-1382-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/03/2014] [Accepted: 02/23/2015] [Indexed: 01/01/2023] Open
Abstract
Background: Enterotoxigenic Escherichia coli (ETEC) are a major economic threat to pig production globally, with serogroups O8, O9, O45, O101, O138, O139, O141, O149 and O157 implicated as the leading diarrhoeal pathogens affecting pigs below four weeks of age. A multiple antimicrobial resistant ETEC O157 (O157 SvETEC), representative of O157 isolates from a pig farm in New South Wales, Australia, that experienced repeated bouts of pre- and post-weaning diarrhoea resulting in multiple fatalities, was characterized here. Enterohaemorrhagic E. coli (EHEC) O157:H7 cause both sporadic and widespread outbreaks of foodborne disease, predominantly have a ruminant origin and belong to the ST11 clonal complex. Here, for the first time, we conducted comparative genomic analyses of two epidemiologically-unrelated porcine, disease-causing ETEC O157 isolates, E. coli O157 SvETEC and E. coli O157:K88 734/3, and examined their phylogenetic relationship with EHEC O157:H7. Results: O157 SvETEC and O157:K88 734/3 belong to a novel sequence type (ST4245) that comprises part of the ST23 complex and are genetically distinct from EHEC O157. Comparative phylogenetic analysis using PhyloSift shows that E. coli O157 SvETEC and E. coli O157:K88 734/3 group into a single clade and are most similar to the extraintestinal avian pathogenic Escherichia coli (APEC) isolate O78 that clusters within the ST23 complex. Genome content was highly similar between E. coli O157 SvETEC, O157:K88 734/3 and APEC O78, with variability predominantly limited to laterally acquired elements, including prophages, plasmids and antimicrobial resistance gene loci. Putative ETEC virulence factors, including the toxins STb and LT and the K88 (F4) adhesin, were conserved between O157 SvETEC and O157:K88 734/3. The O157 SvETEC isolate also encoded the heat stable enterotoxin STa and a second allele of STb, whilst a prophage within O157:K88 734/3 encoded the serum survival gene bor. Both isolates harbor a large repertoire of antibiotic resistance genes, but their association with mobile elements remains undetermined. Conclusions: We present an analysis of the first draft genome sequences of two epidemiologically-unrelated, pathogenic ETEC O157. E. coli O157 SvETEC and E. coli O157:K88 734/3 belong to the ST23 complex and are phylogenetically distinct from EHEC O157 lineages that reside within the ST11 complex. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1382-y) contains supplementary material, which is available to authorized users.
|
45
|
Coil D, Jospin G, Darling AE. A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2014; 31:587-9. [DOI: 10.1093/bioinformatics/btu661] [Citation(s) in RCA: 765] [Impact Index Per Article: 76.5] [Indexed: 11/14/2022] Open
|
46
|
Lauro FM, Senstius SJ, Cullen J, Neches R, Jensen RM, Brown MV, Darling AE, Givskov M, McDougald D, Hoeke R, Ostrowski M, Philip GK, Paulsen IT, Grzymski JJ. The common oceanographer: crowdsourcing the collection of oceanographic data. PLoS Biol 2014; 12:e1001947. [PMID: 25203659 PMCID: PMC4159111 DOI: 10.1371/journal.pbio.1001947] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 11/18/2022] Open
|
47
|
Darling AE, McKinnon J, Worden P, Santos J, Charles IG, Chowdhury PR, Djordjevic SP. A draft genome of Escherichia coli sequence type 127 strain 2009-46. Gut Pathog 2014; 6:32. [PMID: 25197321 PMCID: PMC4155142 DOI: 10.1186/1757-4749-6-32] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 06/03/2014] [Accepted: 07/15/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Escherichia coli are a frequent cause of urinary tract infections (UTI) and are thought to have a foodborne origin. E. coli with sequence type 127 (ST127) are emerging pathogens increasingly implicated as a cause of UTI globally. An ST127 isolate (2009-46) resistant to ampicillin and trimethoprim was recovered from the urine of a 56-year-old patient with a UTI at a hospital in Sydney, Australia, and was characterised here. Results: We sequenced the genome of Escherichia coli 2009-46 using the Illumina Nextera XT and MiSeq technologies. Assembly of the sequence data reconstructed a 5.14 Mbp genome in 89 scaffolds with an N50 of 161 kbp. The genome has extensive similarity to other sequenced uropathogenic E. coli genomes, but also has several genes that are potentially related to virulence and pathogenicity that are not present in the reference E. coli strain. Conclusion: E. coli 2009-46 is a multiple antibiotic resistant, phylogroup B2 isolate recovered from a patient with a UTI. This is the first description of a drug resistant E. coli ST127 in Australia.
|
48
|
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2014; 2:e415. [PMID: 24918035 PMCID: PMC4045339 DOI: 10.7717/peerj.415] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.8] [Received: 03/03/2014] [Accepted: 05/15/2014] [Indexed: 12/13/2022] Open
Abstract
Metagenomics is a valuable tool for the study of microbial communities but has been limited by the difficulty of "binning" the resulting sequences into groups corresponding to the individual species and strains that constitute the community. Moreover, there are presently no methods to track the flow of mobile DNA elements such as plasmids through communities or to determine which of these are co-localized within the same cell. We address these limitations by applying Hi-C, a technology originally designed for the study of three-dimensional genome structure in eukaryotes, to measure the cellular co-localization of DNA sequences. We leveraged Hi-C data generated from a simple synthetic metagenome sample to accurately cluster metagenome assembly contigs into groups that contain nearly complete genomes of each species. The Hi-C data also reliably associated plasmids with the chromosomes of their host and with each other. We further demonstrated that Hi-C data provide a long-range signal of strain-specific genotypes, indicating such data may be useful for high-resolution genotyping of microbial populations. Our work demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that is not currently obtainable by other methods. This metagenomic Hi-C method could facilitate future studies of the fine-scale population structure of microbes, as well as studies of how antibiotic resistance plasmids (or other genetic elements) mobilize in microbial communities. The method is not limited to microbiology; the genetic architecture of other heterogeneous populations of cells could also be studied with this technique.
|
49
|
Darling AE, Worden P, Chapman TA, Roy Chowdhury P, Charles IG, Djordjevic SP. The genome of Clostridium difficile 5.3. Gut Pathog 2014; 6:4. [PMID: 24565059 PMCID: PMC4234979 DOI: 10.1186/1757-4749-6-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 12/23/2013] [Accepted: 02/01/2014] [Indexed: 12/26/2022] Open
Abstract
Background: Clostridium difficile is the leading cause of infectious diarrhea in humans and is responsible for large outbreaks of enteritis in neonatal pigs in both North America and Europe. Disease caused by C. difficile typically occurs during antibiotic therapy, and its emergence over the past 40 years is linked with the widespread use of broad-spectrum antibiotics in both human and veterinary medicine. Results: We sequenced the genome of Clostridium difficile 5.3 using the Illumina Nextera XT and MiSeq technologies. Assembly of the sequence data reconstructed a 4,009,318 bp genome in 27 scaffolds with an N50 of 786 kbp. The genome has extensive similarity to other sequenced C. difficile genomes, but also has several genes that are potentially related to virulence and pathogenicity that are not present in the reference C. difficile strain. Conclusion: Genome sequencing of human and animal isolates is needed to understand the molecular events driving the emergence of C. difficile as a gastrointestinal pathogen of humans and food animals and to better define its zoonotic potential.
|
50
|
Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2014; 2:e243. [PMID: 24482762 PMCID: PMC3897386 DOI: 10.7717/peerj.243] [Citation(s) in RCA: 407] [Impact Index Per Article: 40.7] [Received: 03/21/2013] [Accepted: 12/19/2013] [Indexed: 12/13/2022] Open
Abstract
Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis, we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated with sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).
|