1
Coil D, Jospin G, Darling AE. A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2014; 31:587-9. [DOI: 10.1093/bioinformatics/btu661] [Citation(s) in RCA: 765] [Impact Index Per Article: 69.5] [Indexed: 11/14/2022]
2
Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dröge J, Gregor I, Majda S, Fiedler J, Dahms E, Bremges A, Fritz A, Garrido-Oter R, Jørgensen TS, Shapiro N, Blood PD, Gurevich A, Bai Y, Turaev D, DeMaere MZ, Chikhi R, Nagarajan N, Quince C, Meyer F, Balvočiūtė M, Hansen LH, Sørensen SJ, Chia BKH, Denis B, Froula JL, Wang Z, Egan R, Don Kang D, Cook JJ, Deltel C, Beckstette M, Lemaitre C, Peterlongo P, Rizk G, Lavenier D, Wu YW, Singer SW, Jain C, Strous M, Klingenberg H, Meinicke P, Barton MD, Lingner T, Lin HH, Liao YC, Silva GGZ, Cuevas DA, Edwards RA, Saha S, Piro VC, Renard BY, Pop M, Klenk HP, Göker M, Kyrpides NC, Woyke T, Vorholt JA, Schulze-Lefert P, Rubin EM, Darling AE, Rattei T, McHardy AC. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat Methods 2017; 14:1063-1071. [PMID: 28967888 DOI: 10.1038/nmeth.4458] [Citation(s) in RCA: 483] [Impact Index Per Article: 60.4] [Received: 12/29/2016] [Accepted: 08/25/2017] [Indexed: 12/12/2022]
Abstract
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
Journal Article
3
Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 2009; 25:2071-3. [PMID: 19515959 PMCID: PMC2723005 DOI: 10.1093/bioinformatics/btp356] [Citation(s) in RCA: 425] [Impact Index Per Article: 26.6] [Indexed: 11/18/2022]
Abstract
Summary: Mauve Contig Mover provides a new method for proposing the relative order of contigs that make up a draft genome based on comparison to a complete or draft reference genome. A novel application of the Mauve aligner and viewer provides an automated reordering algorithm coupled with a powerful drill-down display allowing detailed exploration of results. Availability: The software is available for download at http://gel.ahabs.wisc.edu/mauve. Contact: rissman@wisc.edu Supplementary information: Supplementary data are available at Bioinformatics online and http://gel.ahabs.wisc.edu
Research Support, U.S. Gov't, Non-P.H.S.
4
Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2014; 2:e243. [PMID: 24482762 PMCID: PMC3897386 DOI: 10.7717/peerj.243] [Citation(s) in RCA: 424] [Impact Index Per Article: 38.5] [Received: 03/21/2013] [Accepted: 12/19/2013] [Indexed: 12/13/2022]
Abstract
Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).
Journal Article
5
Tritt A, Eisen JA, Facciotti MT, Darling AE. An integrated pipeline for de novo assembly of microbial genomes. PLoS One 2012; 7:e42304. [PMID: 23028432 PMCID: PMC3441570 DOI: 10.1371/journal.pone.0042304] [Citation(s) in RCA: 360] [Impact Index Per Article: 27.7] [Received: 11/16/2011] [Accepted: 07/06/2012] [Indexed: 11/19/2022]
Abstract
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
Research Support, U.S. Gov't, Non-P.H.S.
6
Darling AE, Miklós I, Ragan MA. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 2008; 4:e1000128. [PMID: 18650965 PMCID: PMC2483231 DOI: 10.1371/journal.pgen.1000128] [Citation(s) in RCA: 158] [Impact Index Per Article: 9.3] [Received: 10/27/2007] [Accepted: 06/16/2008] [Indexed: 11/24/2022]
Abstract
Genome structure variation has profound impacts on phenotype in organisms ranging from microbes to humans, yet little is known about how natural selection acts on genome arrangement. Pathogenic bacteria such as Yersinia pestis, which causes bubonic and pneumonic plague, often exhibit a high degree of genomic rearrangement. The recent availability of several Yersinia genomes offers an unprecedented opportunity to study the evolution of genome structure and arrangement. We introduce a set of statistical methods to study patterns of rearrangement in circular chromosomes and apply them to the Yersinia. We constructed a multiple alignment of eight Yersinia genomes using Mauve software to identify 78 conserved segments that are internally free from genome rearrangement. Based on the alignment, we applied Bayesian statistical methods to infer the phylogenetic inversion history of Yersinia. The sampling of genome arrangement reconstructions contains seven parsimonious tree topologies, each having different histories of 79 inversions. Topologies with a greater number of inversions also exist, but were sampled less frequently. The inversion phylogenies agree with results suggested by SNP patterns. We then analyzed reconstructed inversion histories to identify patterns of rearrangement. We confirm an over-representation of "symmetric inversions", that is, inversions with endpoints that are equally distant from the origin of chromosomal replication. Ancestral genome arrangements demonstrate moderate preference for replichore balance in Yersinia. We found that all inversions are shorter than expected under a neutral model, whereas inversions acting within a single replichore are much shorter than expected. We also found evidence for a canonical configuration of the origin and terminus of replication. Finally, breakpoint reuse analysis reveals that inversions with endpoints proximal to the origin of DNA replication are nearly three times more frequent. Our findings represent the first characterization of genome arrangement evolution in a bacterial population evolving outside laboratory conditions. Insight into the process of genomic rearrangement may further the understanding of pathogen population dynamics and selection on the architecture of circular bacterial chromosomes.
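The abstract's definition of a symmetric inversion (endpoints equally distant from the origin of replication) can be made concrete with a small check. The following is only an illustrative sketch, not the statistical method of the paper: the function names, the coordinate convention, and the 5% tolerance are all assumptions introduced here.

```python
def signed_offset(pos: int, origin: int, genome_length: int) -> float:
    """Signed distance from the origin on a circular chromosome.

    Positive values lie on one replichore, negative values on the other.
    """
    off = (pos - origin) % genome_length
    if off > genome_length / 2:
        off -= genome_length
    return off


def is_symmetric_inversion(left: int, right: int, origin: int,
                           genome_length: int, tolerance: float = 0.05) -> bool:
    """Return True if the two endpoints sit on opposite sides of the origin
    at (nearly) the same distance from it.

    `tolerance` is a hypothetical relative threshold, not a value from the paper.
    """
    a = signed_offset(left, origin, genome_length)
    b = signed_offset(right, origin, genome_length)
    if a * b >= 0:
        # Both endpoints on the same replichore (or one exactly at the origin).
        return False
    longest = max(abs(a), abs(b))
    return abs(abs(a) - abs(b)) / longest <= tolerance


# Hypothetical 4.6 Mb circular chromosome with the origin at coordinate 0.
GENOME = 4_600_000
balanced = is_symmetric_inversion(4_100_000, 500_000, 0, GENOME)
one_sided = is_symmetric_inversion(100_000, 900_000, 0, GENOME)
```

Requiring the endpoints to fall on opposite replichores keeps a short inversion within a single replichore from being misread as symmetric merely because its two endpoints are close together.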
research-article
7
Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE, Phillippy AM, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 2013; 14:R2. [PMID: 23320958 PMCID: PMC4053804 DOI: 10.1186/gb-2013-14-1-r2] [Citation(s) in RCA: 154] [Impact Index Per Article: 12.8] [Received: 06/04/2012] [Accepted: 01/15/2013] [Indexed: 12/31/2022]
Abstract
We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS.
Research Support, U.S. Gov't, Non-P.H.S.
8
Morgan JL, Darling AE, Eisen JA. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 2010; 5:e10209. [PMID: 20419134 PMCID: PMC2855710 DOI: 10.1371/journal.pone.0010209] [Citation(s) in RCA: 146] [Impact Index Per Article: 9.7] [Received: 12/11/2009] [Accepted: 03/12/2010] [Indexed: 12/03/2022]
Abstract
Background: Microbial life dominates the earth, but many species are difficult or even impossible to study under laboratory conditions. Sequencing DNA directly from the environment, a technique commonly referred to as metagenomics, is an important tool for cataloging microbial life. This culture-independent approach involves collecting samples that include microbes in them, extracting DNA from the samples, and sequencing the DNA. A sample may contain many different microorganisms, macroorganisms, and even free-floating environmental DNA. A fundamental challenge in metagenomics has been estimating the abundance of organisms in a sample based on the frequency with which the organism's DNA was observed in reads generated via DNA sequencing. Methodology/Principal Findings: We created mixtures of ten microbial species for which genome sequences are known. Each mixture contained an equal number of cells of each species. We then extracted DNA from the mixtures, sequenced the DNA, and measured the frequency with which genomic regions from each organism were observed in the sequenced DNA. We found that the observed frequency of reads mapping to each organism did not reflect the equal numbers of cells that were known to be included in each mixture. The relative organism abundances varied significantly depending on the DNA extraction and sequencing protocol utilized. Conclusions/Significance: We describe a new data resource for measuring the accuracy of metagenomic binning methods, created by in vitro-simulation of a metagenomic community. Our in vitro simulation can be used to complement previous in silico benchmark studies. In constructing a synthetic community and sequencing its metagenome, we encountered several sources of observation bias that likely affect most metagenomic experiments to date and present challenges for comparative metagenomic studies. DNA preparation methods have a particularly profound effect in our study, implying that samples prepared with different protocols are not suitable for comparative metagenomics.
Research Support, U.S. Gov't, Non-P.H.S.
9
Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson G, Alser M, Antipov D, Beghini F, Bertrand D, Brito JJ, Brown CT, Buchmann J, Buluç A, Chen B, Chikhi R, Clausen PTLC, Cristian A, Dabrowski PW, Darling AE, Egan R, Eskin E, Georganas E, Goltsman E, Gray MA, Hansen LH, Hofmeyr S, Huang P, Irber L, Jia H, Jørgensen TS, Kieser SD, Klemetsen T, Kola A, Kolmogorov M, Korobeynikov A, Kwan J, LaPierre N, Lemaitre C, Li C, Limasset A, Malcher-Miranda F, Mangul S, Marcelino VR, Marchet C, Marijon P, Meleshko D, Mende DR, Milanese A, Nagarajan N, Nissen J, Nurk S, Oliker L, Paoli L, Peterlongo P, Piro VC, Porter JS, Rasmussen S, Rees ER, Reinert K, Renard B, Robertsen EM, Rosen GL, Ruscheweyh HJ, Sarwal V, Segata N, Seiler E, Shi L, Sun F, Sunagawa S, Sørensen SJ, Thomas A, Tong C, Trajkovski M, Tremblay J, Uritskiy G, Vicedomini R, Wang Z, Wang Z, Wang Z, Warren A, Willassen NP, Yelick K, You R, Zeller G, Zhao Z, Zhu S, Zhu J, Garrido-Oter R, Gastmeier P, Hacquard S, Häußler S, Khaledi A, Maechler F, Mesny F, Radutoiu S, Schulze-Lefert P, Smit N, Strowig T, Bremges A, Sczyrba A, McHardy AC. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 2022; 19:429-440. [PMID: 35396482 PMCID: PMC9007738 DOI: 10.1038/s41592-022-01431-4] [Citation(s) in RCA: 145] [Impact Index Per Article: 48.3] [Received: 07/22/2021] [Accepted: 02/14/2022] [Indexed: 12/20/2022]
Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses. This study presents the results of the second round of the Critical Assessment of Metagenome Interpretation challenges (CAMI II), which is a community-driven effort for comprehensively benchmarking tools for metagenomics data analysis.
10
Ayres DL, Cummings MP, Baele G, Darling AE, Lewis PO, Swofford DL, Huelsenbeck JP, Lemey P, Rambaut A, Suchard MA. BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics. Syst Biol 2019; 68:1052-1061. [PMID: 31034053 PMCID: PMC6802572 DOI: 10.1093/sysbio/syz020] [Citation(s) in RCA: 140] [Impact Index Per Article: 23.3] [Received: 04/27/2018] [Revised: 04/10/2019] [Accepted: 04/10/2019] [Indexed: 11/12/2022]
Abstract
BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.
Research Support, N.I.H., Extramural
11
Sheppard SK, Didelot X, Jolley KA, Darling AE, Pascoe B, Meric G, Kelly DJ, Cody A, Colles FM, Strachan NJC, Ogden ID, Forbes K, French NP, Carter P, Miller WG, McCarthy ND, Owen R, Litrup E, Egholm M, Affourtit JP, Bentley SD, Parkhill J, Maiden MCJ, Falush D. Progressive genome-wide introgression in agricultural Campylobacter coli. Mol Ecol 2012; 22:1051-64. [PMID: 23279096 PMCID: PMC3749442 DOI: 10.1111/mec.12162] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Received: 08/01/2012] [Revised: 10/16/2012] [Accepted: 10/21/2012] [Indexed: 01/24/2023]
Abstract
Hybridization between distantly related organisms can facilitate rapid adaptation to novel environments, but is potentially constrained by epistatic fitness interactions among cell components. The zoonotic pathogens Campylobacter coli and C. jejuni differ from each other by around 15% at the nucleotide level, corresponding to an average of nearly 40 amino acids per protein-coding gene. Using whole genome sequencing, we show that a single C. coli lineage, which has successfully colonized an agricultural niche, has been progressively accumulating C. jejuni DNA. Members of this lineage belong to two groups, the ST-828 and ST-1150 clonal complexes. The ST-1150 complex is less frequently isolated and has undergone a substantially greater amount of introgression leading to replacement of up to 23% of the C. coli core genome as well as import of novel DNA. By contrast, the more commonly isolated ST-828 complex bacteria have 10–11% introgressed DNA, and C. jejuni and nonagricultural C. coli lineages each have <2%. Thus, the C. coli that colonize agriculture, and consequently cause most human disease, have hybrid origin, but this cross-species exchange has so far not had a substantial impact on the gene pools of either C. jejuni or nonagricultural C. coli. These findings also indicate remarkable interchangeability of basic cellular machinery after a prolonged period of independent evolution.
Research Support, Non-U.S. Gov't
12
Fritz A, Hofmann P, Majda S, Dahms E, Dröge J, Fiedler J, Lesker TR, Belmann P, DeMaere MZ, Darling AE, Sczyrba A, Bremges A, McHardy AC. CAMISIM: simulating metagenomes and microbial communities. Microbiome 2019; 7:17. [PMID: 30736849 PMCID: PMC6368784 DOI: 10.1186/s40168-019-0633-6] [Citation(s) in RCA: 105] [Impact Index Per Article: 17.5] [Received: 04/18/2018] [Accepted: 01/21/2019] [Indexed: 05/11/2023]
Abstract
Background: Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Results: We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM. Conclusions: CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.
research-article
13
Didelot X, Méric G, Falush D, Darling AE. Impact of homologous and non-homologous recombination in the genomic evolution of Escherichia coli. BMC Genomics 2012; 13:256. [PMID: 22712577 PMCID: PMC3505186 DOI: 10.1186/1471-2164-13-256] [Citation(s) in RCA: 94] [Impact Index Per Article: 7.2] [Received: 02/01/2012] [Accepted: 05/30/2012] [Indexed: 11/10/2022]
Abstract
Background: Escherichia coli is an important species of bacteria that can live as a harmless inhabitant of the guts of many animals, as a pathogen causing life-threatening conditions or freely in the non-host environment. This diversity of lifestyles has made it a particular focus of interest for studies of genetic variation, mainly with the aim to understand how a commensal can become a deadly pathogen. Many whole genomes of E. coli have been fully sequenced in the past few years, offering data that help explain how this important species evolved. Results: We compared 27 whole genomes encompassing four phylogroups of Escherichia coli (A, B1, B2 and E). From the core-genome we established the clonal relationships between the isolates as well as the role played by homologous recombination during their evolution from a common ancestor. We found strong evidence for sexual isolation between three lineages (A+B1, B2, E), which could be explained by the ecological structuring of E. coli and may represent on-going speciation. We identified three hotspots of homologous recombination, one of which had not been previously described and contains the aroC gene, involved in the essential shikimate metabolic pathway. We also described the role played by non-homologous recombination in the pan-genome, and showed that this process was highly heterogeneous. Our analyses revealed in particular that the genomes of three enterohaemorrhagic (EHEC) strains within phylogroup B1 have converged from originally separate backgrounds as a result of both homologous and non-homologous recombination. Conclusions: Recombination is an important force shaping the genomic evolution and diversification of E. coli, both by replacing fragments of genes with a homologous sequence and also by introducing new genes. In this study, several non-random patterns of these events were identified which correlated with important changes in the lifestyle of the bacteria, and therefore provide additional evidence to explain the relationship between genomic variation and ecological adaptation.
Research Support, U.S. Gov't, Non-P.H.S.
14
Lang JM, Darling AE, Eisen JA. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices. PLoS One 2013; 8:e62510. [PMID: 23638103 PMCID: PMC3636077 DOI: 10.1371/journal.pone.0062510] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.7] [Received: 03/09/2012] [Accepted: 03/26/2013] [Indexed: 11/29/2022]
Abstract
Over 3000 microbial (bacterial and archaeal) genomes have been made publicly available to date, providing an unprecedented opportunity to examine evolutionary genomic trends and offering valuable reference data for a variety of other studies such as metagenomics. The utility of these genome sequences is greatly enhanced when we have an understanding of how they are phylogenetically related to each other. Therefore, we here describe our efforts to reconstruct the phylogeny of all available bacterial and archaeal genomes. We identified 24 single-copy, ubiquitous genes suitable for this phylogenetic analysis. We used two approaches to combine the data for the 24 genes. First, we concatenated alignments of all genes into a single alignment from which a Maximum Likelihood (ML) tree was inferred using RAxML. Second, we used a relatively new approach to combining gene data, Bayesian Concordance Analysis (BCA), as implemented in the BUCKy software, in which the results of 24 single-gene phylogenetic analyses are used to generate a "primary concordance" tree. A comparison of the concatenated ML tree and the primary concordance (BUCKy) tree reveals that the two approaches give similar results, relative to a phylogenetic tree inferred from the 16S rRNA gene. After comparing the results and the methods used, we conclude that the current best approach for generating a single phylogenetic tree, suitable for use as a reference phylogeny for comparative analyses, is to perform a maximum likelihood analysis of a concatenated alignment of conserved, single-copy genes.
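The concatenation ("supermatrix") step mentioned in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the input layout (one dict per gene, mapping taxon name to aligned sequence), the gap-filling rule for taxa missing a gene, and the function name are all hypothetical.

```python
def concatenate_alignments(gene_alignments: list[dict[str, str]]) -> dict[str, str]:
    """Join per-gene alignments into one supermatrix keyed by taxon.

    Each element of `gene_alignments` maps taxon -> aligned sequence, with all
    sequences in one gene having equal length. A taxon that lacks a gene is
    padded with gap characters for that gene's columns (an assumption made
    here for illustration).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts: dict[str, list[str]] = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy input: two genes, three taxa, with one taxon missing from each gene.
genes = [
    {"A": "ACGT", "B": "AC-T"},
    {"A": "GG", "C": "GA"},
]
supermatrix = concatenate_alignments(genes)
```

The resulting single alignment is the kind of input a maximum likelihood program would then take; tree inference itself is outside the scope of this sketch.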
research-article
15
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2014; 2:e415. [PMID: 24918035 PMCID: PMC4045339 DOI: 10.7717/peerj.415] [Citation(s) in RCA: 82] [Impact Index Per Article: 7.5] [Received: 03/03/2014] [Accepted: 05/15/2014] [Indexed: 12/13/2022]
Abstract
Metagenomics is a valuable tool for the study of microbial communities but has been limited by the difficulty of "binning" the resulting sequences into groups corresponding to the individual species and strains that constitute the community. Moreover, there are presently no methods to track the flow of mobile DNA elements such as plasmids through communities or to determine which of these are co-localized within the same cell. We address these limitations by applying Hi-C, a technology originally designed for the study of three-dimensional genome structure in eukaryotes, to measure the cellular co-localization of DNA sequences. We leveraged Hi-C data generated from a simple synthetic metagenome sample to accurately cluster metagenome assembly contigs into groups that contain nearly complete genomes of each species. The Hi-C data also reliably associated plasmids with the chromosomes of their host and with each other. We further demonstrated that Hi-C data provide a long-range signal of strain-specific genotypes, indicating such data may be useful for high-resolution genotyping of microbial populations. Our work demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that is not currently obtainable by other methods. This metagenomic Hi-C method could facilitate future studies of the fine-scale population structure of microbes, as well as studies of how antibiotic resistance plasmids (or other genetic elements) mobilize in microbial communities. The method is not limited to microbiology; the genetic architecture of other heterogeneous populations of cells could also be studied with this technique.
Journal Article
16
Abstract
Summary: High-throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be manually set by the user. One approach to parameter tuning involves assembling data from an organism with an available high-quality reference genome, and measuring assembly accuracy using some metrics. We developed a system to measure assembly quality under several scoring metrics, and to compare assembly quality across a variety of assemblers, sequence data types, and parameter choices. When used in conjunction with training data such as a high-quality reference genome and sequence reads from the same organism, our program can be used to manually identify an optimal sequencing and assembly strategy for de novo sequencing of related organisms. Availability: GPL source code and a usage tutorial are available at http://ngopt.googlecode.com Contact: aarondarling@ucdavis.edu Supplementary information: Supplementary data are available at Bioinformatics online.
|
Research Support, U.S. Gov't, Non-P.H.S. |
14 |
81 |
17
|
Quince C, Nurk S, Raguideau S, James R, Soyer OS, Summers JK, Limasset A, Eren AM, Chikhi R, Darling AE. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol 2021; 22:214. [PMID: 34311761 PMCID: PMC8311964 DOI: 10.1186/s13059-021-02419-7] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Received: 08/22/2020] [Accepted: 06/29/2021] [Indexed: 12/30/2022]
Abstract
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, from multiple metagenome samples. STRONG performs coassembly, and binning into metagenome assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.
|
Research Support, N.I.H., Intramural |
4 |
56 |
18
|
Chan CX, Beiko RG, Darling AE, Ragan MA. Lateral transfer of genes and gene fragments in prokaryotes. Genome Biol Evol 2009; 1:429-38. [PMID: 20333212 PMCID: PMC2817436 DOI: 10.1093/gbe/evp044] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Accepted: 10/31/2009] [Indexed: 01/24/2023]
Abstract
Lateral genetic transfer (LGT) involves the movement of genetic material from one lineage into another and its subsequent incorporation into the new host genome via genetic recombination. Studies in individual taxa have indicated lateral origins for stretches of DNA of greatly varying length, from a few nucleotides to chromosome size. Here we analyze 1,462 sets of single-copy, putatively orthologous genes from 144 fully sequenced prokaryote genomes, asking to what extent complete genes and fragments of genes have been transferred and recombined in LGT. Using a rigorous phylogenetic approach, we find evidence for LGT in at least 476 (32.6%) of these 1,462 gene sets: 286 (19.6%) clearly show one or more "observable recombination breakpoints" within the boundaries of the open reading frame, while a further 190 (13.0%) yield trees that are topologically incongruent with the reference tree but do not contain a recombination breakpoint within the open reading frame. We refer to these gene sets as observable recombination breakpoint positive (ORB(+)) and negative (ORB(-)) respectively. The latter are prima facie instances of lateral transfer of an entire gene or beyond. We observe little functional bias between ORB(+) and ORB(-) gene sets, but find that incorporation of entire genes is potentially more frequent in pathogens than in nonpathogens. As ORB(+) gene sets are about 50% more common than ORB(-) sets in our data, the transfer of gene fragments has been relatively frequent, and the frequency of LGT may have been systematically underestimated in phylogenetic studies.
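The counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above (1,462 gene sets, 286 ORB(+), 190 ORB(-)); the variable and function names are chosen for this example.

```python
TOTAL_GENE_SETS = 1462
ORB_POS = 286  # sets with an observable recombination breakpoint
ORB_NEG = 190  # incongruent trees, no breakpoint inside the open reading frame


def percent(count, total=TOTAL_GENE_SETS):
    """Percentage of all gene sets, rounded to one decimal place."""
    return round(100 * count / total, 1)


lgt_sets = ORB_POS + ORB_NEG             # gene sets with evidence for LGT
pos_to_neg = round(ORB_POS / ORB_NEG, 2)  # how much more common ORB(+) is

print(lgt_sets, percent(lgt_sets))   # 476 sets, 32.6 %
print(percent(ORB_POS), percent(ORB_NEG))
print(pos_to_neg)                     # about 1.5, i.e. roughly 50 % more common
```

The ratio of about 1.5 is what the abstract summarizes as ORB(+) sets being "about 50% more common" than ORB(-) sets.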
|
research-article |
16 |
55 |
19
|
Mau B, Glasner JD, Darling AE, Perna NT. Genome-wide detection and analysis of homologous recombination among sequenced strains of Escherichia coli. Genome Biol 2006; 7:R44. [PMID: 16737554 PMCID: PMC1779527 DOI: 10.1186/gb-2006-7-5-r44] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Received: 11/01/2005] [Revised: 02/08/2006] [Accepted: 05/08/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Comparisons of complete bacterial genomes reveal evidence of lateral transfer of DNA across otherwise clonally diverging lineages. Some lateral transfer events result in acquisition of novel genomic segments and are easily detected through genome comparison. Other more subtle lateral transfers involve homologous recombination events that result in substitution of alleles within conserved genomic regions. This type of event is observed infrequently among distantly related organisms. It is reported to be more common within species, but the frequency has been difficult to quantify since the sequences under comparison tend to have relatively few polymorphic sites. RESULTS: Here we report a genome-wide assessment of homologous recombination among a collection of six complete Escherichia coli and Shigella flexneri genome sequences. We construct a whole-genome multiple alignment and identify clusters of polymorphic sites that exhibit atypical patterns of nucleotide substitution using a random walk-based method. The analysis reveals one large segment (approximately 100 kb) and 186 smaller clusters of single base pair differences that suggest lateral exchange between lineages. These clusters include portions of 10% of the 3,100 genes conserved in six genomes. Statistical analysis of the functional roles of these genes reveals that several classes of genes are over-represented, including those involved in recombination, transport and motility. CONCLUSION: We demonstrate that intraspecific recombination in E. coli is much more common than previously appreciated and may show a bias for certain types of genes. The described method provides high-specificity, conservative inference of past recombination events.
|
Research Support, N.I.H., Extramural |
19 |
54 |
20
|
Reid CJ, Wyrsch ER, Roy Chowdhury P, Zingali T, Liu M, Darling AE, Chapman TA, Djordjevic SP. Porcine commensal Escherichia coli: a reservoir for class 1 integrons associated with IS26. Microb Genom 2019; 3. [PMID: 29306352 PMCID: PMC5761274 DOI: 10.1099/mgen.0.000143] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Indexed: 12/21/2022]
Abstract
Porcine faecal waste is a serious environmental pollutant. Carriage of antimicrobial-resistance genes (ARGs) and virulence-associated genes (VAGs), and the zoonotic potential of commensal Escherichia coli from swine are largely unknown. Furthermore, little is known about the role of commensal E. coli as contributors to the mobilization of ARGs between food animals and the environment. Here, we report whole-genome sequence analysis of 103 class 1 integron-positive E. coli from the faeces of healthy pigs from two commercial production facilities in New South Wales, Australia. Most strains belonged to phylogroups A and B1, and carried VAGs linked with extraintestinal infection in humans. The 103 strains belonged to 37 multilocus sequence types and clonal complex 10 featured prominently. Seventeen ARGs were detected and 97 % (100/103) of strains carried three or more ARGs. Heavy-metal-resistance genes merA, cusA and terA were also common. IS26 was observed in 98 % (101/103) of strains and was often physically associated with structurally diverse class 1 integrons that carried unique genetic features, which may be tracked. This study provides, to our knowledge, the first detailed genomic analysis and point of reference for commensal E. coli of porcine origin in Australia, facilitating tracking of specific lineages and the mobile resistance genes they carry.
|
Research Support, Non-U.S. Gov't |
6 |
51 |
21
|
DeMaere MZ, Darling AE. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol 2019; 20:46. [PMID: 30808380 PMCID: PMC6391755 DOI: 10.1186/s13059-019-1643-1] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 08/17/2018] [Accepted: 01/29/2019] [Indexed: 11/10/2022]
Abstract
Most microbes cannot be easily cultured, and metagenomics provides a means to study them. Current techniques aim to resolve individual genomes from metagenomes, so-called metagenome-assembled genomes (MAGs). Leading approaches depend upon time series or transect studies, the efficacy of which is a function of community complexity, target abundance, and sequencing depth. We describe an unsupervised method that exploits the hierarchical nature of Hi-C interaction rates to resolve MAGs using a single time point. We validate the method and directly compare against a recently announced proprietary service, ProxiMeta. bin3C is an open-source pipeline and makes use of the Infomap clustering algorithm ( https://github.com/cerebis/bin3C ).
|
Validation Study |
6 |
50 |
22
|
Darling AE, Treangen TJ, Messeguer X, Perna NT. Analyzing patterns of microbial evolution using the mauve genome alignment system. Methods Mol Biol 2007; 396:135-52. [PMID: 18025691 DOI: 10.1007/978-1-59745-515-2_10] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Indexed: 12/03/2022]
Abstract
During the course of evolution, genomes can undergo large-scale mutation events such as rearrangement and lateral transfer. Such mutations can result in significant variations in gene order and gene content among otherwise closely related organisms. The Mauve genome alignment system can successfully identify such rearrangement and lateral transfer events in comparisons of multiple microbial genomes even under high levels of recombination. This chapter outlines the main features of Mauve and provides examples that describe how to use Mauve to conduct a rigorous multiple genome comparison and study evolutionary patterns.
|
Research Support, N.I.H., Extramural |
18 |
48 |
23
|
Burke CM, Darling AE. A method for high precision sequencing of near full-length 16S rRNA genes on an Illumina MiSeq. PeerJ 2016; 4:e2492. [PMID: 27688981 PMCID: PMC5036073 DOI: 10.7717/peerj.2492] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 12/16/2015] [Accepted: 08/25/2016] [Indexed: 12/21/2022]
Abstract
Background: The bacterial 16S rRNA gene has historically been used in defining bacterial taxonomy and phylogeny. However, there are currently no high-throughput methods to sequence full-length 16S rRNA genes present in a sample with precision. Results: We describe a method for sequencing near full-length 16S rRNA gene amplicons using the high throughput Illumina MiSeq platform and test it using DNA from human skin swab samples. Proof of principle of the approach is demonstrated, with the generation of 1,604 sequences greater than 1,300 nt from a single Nano MiSeq run, with accuracy estimated to be 100-fold higher than standard Illumina reads. The reads were chimera filtered using information from a single molecule dual tagging scheme that boosts the signal available for chimera detection. Conclusions: This method could be scaled up to generate many thousands of sequences per MiSeq run and could be applied to other sequencing platforms. This has great potential for populating databases with high quality, near full-length 16S rRNA gene sequences from under-represented taxa and environments and facilitates analyses of microbial communities at higher resolution.
|
Journal Article |
9 |
45 |
24
|
O'Donoghue SI, Baldi BF, Clark SJ, Darling AE, Hogan JM, Kaur S, Maier-Hein L, McCarthy DJ, Moore WJ, Stenau E, Swedlow JR, Vuong J, Procter JB. Visualization of Biomedical Data. Annu Rev Biomed Data Sci 2018. [DOI: 10.1146/annurev-biodatasci-080917-013424] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Indexed: 11/09/2022]
Abstract
The rapid increase in volume and complexity of biomedical data requires changes in research, communication, and clinical practices. This includes learning how to effectively integrate automated analysis with high–data density visualizations that clearly express complex phenomena. In this review, we summarize key principles and resources from data visualization research that help address this difficult challenge. We then survey how visualization is being used in a selection of emerging biomedical research areas, including three-dimensional genomics, single-cell RNA sequencing (RNA-seq), the protein structure universe, phosphoproteomics, augmented reality–assisted surgery, and metagenomics. While specific research areas need highly tailored visualizations, there are common challenges that can be addressed with general methods and strategies. Also common, however, are poor visualization practices. We outline ongoing initiatives aimed at improving visualization practices in biomedical research via better tools, peer-to-peer learning, and interdisciplinary collaboration with computer scientists, science communicators, and graphic designers. These changes are revolutionizing how we see and think about our data.
|
|
7 |
45 |
25
|
Fourment M, Darling AE, Holmes EC. The impact of migratory flyways on the spread of avian influenza virus in North America. BMC Evol Biol 2017; 17:118. [PMID: 28545432 PMCID: PMC5445350 DOI: 10.1186/s12862-017-0965-4] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 02/07/2017] [Accepted: 05/11/2017] [Indexed: 11/16/2022]
Abstract
Background: Wild birds are the major reservoir hosts for influenza A viruses (AIVs) and have been implicated in the emergence of pandemic events in livestock and human populations. Understanding how AIVs spread within and across continents is therefore critical to the development of successful strategies to manage and reduce the impact of influenza outbreaks. In North America many bird species undergo seasonal migratory movements along a North-South axis, thereby providing opportunities for viruses to spread over long distances. However, the role played by such avian flyways in shaping the genetic structure of AIV populations remains uncertain. Results: To assess the relative contribution of bird migration along flyways to the genetic structure of AIV we performed a large-scale phylogeographic study of viruses sampled in the USA and Canada, involving the analysis of 3805 to 4505 sequences from 36 to 38 geographic localities depending on the gene segment data set. To assist in this we developed a maximum likelihood-based genetic algorithm to explore a wide range of complex spatial models, depicting a more complete picture of the migration network than determined previously. Conclusions: Based on phylogenies estimated from nucleotide sequence data sets, our results show that AIV migration rates are significantly higher within than between flyways, indicating that the migratory patterns of birds play a key role in viral dispersal. These findings provide valuable insights into the evolution, maintenance and transmission of AIVs, in turn allowing the development of improved programs for surveillance and risk assessment.
|
Journal Article |
8 |
44 |