1
Jackson DJ, Cerveau N, Posnien N. De novo assembly of transcriptomes and differential gene expression analysis using short-read data from emerging model organisms - a brief guide. Front Zool 2024; 21:17. [PMID: 38902827 PMCID: PMC11188175 DOI: 10.1186/s12983-024-00538-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 06/12/2024] [Indexed: 06/22/2024] Open
Abstract
Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computation capabilities. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial, and as a rule must be empirically optimized for every dataset. For the researcher working with an emerging model system, and with little to no experience with assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide we outline the major challenges faced when establishing a reference transcriptome de novo and we provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps, provide some broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to assembled transcriptome and lists of differentially expressed genes.
Affiliation(s)
- Daniel J Jackson
  - University of Göttingen, Department of Geobiology, Goldschmidtstr. 3, Göttingen, 37077, Germany
- Nicolas Cerveau
  - University of Göttingen, Department of Geobiology, Goldschmidtstr. 3, Göttingen, 37077, Germany
- Nico Posnien
  - University of Göttingen, Department of Developmental Biology, GZMB, Justus-Von-Liebig-Weg 11, Göttingen, 37077, Germany
2
Agustinho DP, Fu Y, Menon VK, Metcalf GA, Treangen TJ, Sedlazeck FJ. Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods 2024; 21:954-966. [PMID: 38689099 DOI: 10.1038/s41592-024-02262-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 03/29/2024] [Indexed: 05/02/2024]
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements are not only due to improvements in sequencing accuracy, but also happening across rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
Affiliation(s)
- Daniel P Agustinho
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Yilei Fu
  - Department of Computer Science, Rice University, Houston, TX, USA
- Vipin K Menon
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Senior research project manager, Human Genetics, Genentech, South San Francisco, CA, USA
- Ginger A Metcalf
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Department of Bioengineering, Rice University, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
3
Roestel JA, Wiersema JH, Jansen RK, Borsch T, Gruenstaeudl M. On the importance of sequence alignment inspections in plastid phylogenomics - an example from revisiting the relationships of the water-lilies. Cladistics 2024. [PMID: 38761095 DOI: 10.1111/cla.12584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 05/20/2024] Open
Abstract
The water-lily clade represents the second earliest-diverging branch of angiosperms. Most of its species belong to Nymphaeaceae, of which the "core Nymphaeaceae"-comprising the genera Euryale, Nymphaea and Victoria-is the most diverse clade. Despite previous molecular phylogenetic studies on the core Nymphaeaceae, various aspects of their evolutionary relationships have remained unresolved. The length-variable introns and intergenic spacers are known to contain most of the sequence variability within the water-lily plastomes. Despite the challenges with multiple sequence alignment, any new molecular phylogenetic investigation on the core Nymphaeaceae should focus on these noncoding plastome regions. For example, a new plastid phylogenomic study on the core Nymphaeaceae should generate DNA sequence alignments of all plastid introns and intergenic spacers based on the principle of conserved sequence motifs. In this investigation, we revisit the phylogenetic history of the core Nymphaeaceae by employing such an approach. Specifically, we use a plastid phylogenomic analysis strategy in which all coding and noncoding partitions are separated and then undergo software-driven DNA sequence alignment, followed by a motif-based alignment inspection and adjustment. This approach allows us to increase the reliability of the character base compared to the default practice of aligning complete plastomes through software algorithms alone. Our approach produces significantly different phylogenetic tree reconstructions for several of the plastome regions under study. The results of these reconstructions underscore that Nymphaea is paraphyletic in its current circumscription, that each of the five subgenera of Nymphaea is monophyletic, and that the subgenus Nymphaea is sister to all other subgenera of Nymphaea. Our results also clarify many evolutionary relationships within the Nymphaea subgenera Brachyceras, Hydrocallis and Nymphaea. 
In closing, we discuss whether the phylogenetic reconstructions obtained through our motif-based alignment adjustments are in line with morphological evidence on water-lily evolution.
Affiliation(s)
- Jessica A Roestel
  - Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- John H Wiersema
  - Department of Botany, National Museum of Natural History - Smithsonian Institution, Washington, DC, 37012, USA
- Robert K Jansen
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, 78712, USA
- Thomas Borsch
  - Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
  - Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195, Berlin, Germany
- Michael Gruenstaeudl
  - Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
  - Department of Biological Sciences, Fort Hays State University, Hays, KS, 67601, USA
4
Kille B, Nute MG, Huang V, Kim E, Phillippy AM, Treangen TJ. Parsnp 2.0: scalable core-genome alignment for massive microbial datasets. Bioinformatics 2024; 40:btae311. [PMID: 38724243 PMCID: PMC11128092 DOI: 10.1093/bioinformatics/btae311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 04/12/2024] [Accepted: 05/07/2024] [Indexed: 05/21/2024] Open
Abstract
MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014.
RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes.
AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Michael G Nute
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Victor Huang
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Eddie Kim
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Adam M Phillippy
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, United States
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
  - Department of Bioengineering, Rice University, Houston, TX 77030, United States
5
Park S, Kwak M, Park S. Complete organelle genomes of Korean fir, Abies koreana and phylogenomics of the gymnosperm genus Abies using nuclear and cytoplasmic DNA sequence data. Sci Rep 2024; 14:7636. [PMID: 38561351 PMCID: PMC10985005 DOI: 10.1038/s41598-024-58253-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 03/27/2024] [Indexed: 04/04/2024] Open
Abstract
Abies koreana E.H.Wilson is an endangered evergreen coniferous tree that is native to high altitudes in South Korea and susceptible to the effects of climate change. Hybridization and reticulate evolution have been reported in the genus; therefore, multigene datasets from nuclear and cytoplasmic genomes are needed to better understand its evolutionary history. Using the Illumina NovaSeq 6000 and Oxford Nanopore Technologies (ONT) PromethION platforms, we generated complete mitochondrial (1,174,803 bp) and plastid (121,341 bp) genomes from A. koreana. The mitochondrial genome is highly dynamic, transitioning from cis- to trans-splicing and breaking conserved gene clusters. In the plastome, the ONT reads revealed two structural conformations of A. koreana. The short inverted repeats (1186 bp) of the A. koreana plastome are associated with different structural types. Transcriptomic sequencing revealed 1356 sites of C-to-U RNA editing in the 41 mitochondrial genes. Using A. koreana as a reference, we additionally produced nuclear and organelle genomic sequences from eight Abies species and generated multiple datasets for maximum likelihood and network analyses. Three sections (Balsamea, Momi, and Pseudopicea) were well grouped in the nuclear phylogeny, but the phylogenomic relationships showed conflicting signals in the mitochondrial and plastid genomes, indicating a complicated evolutionary history that may have included introgressive hybridization. The obtained data illustrate that phylogenomic analyses based on sequences from differently inherited organelle genomes have resulted in conflicting trees. Organelle capture, organelle genome recombination, and incomplete lineage sorting in an ancestral heteroplasmic individual can contribute to phylogenomic discordance. We provide strong support for the relationships within Abies and new insights into the phylogenomic complexity of this genus.
Affiliation(s)
- Seongjun Park
  - Institute of Natural Science, Yeungnam University, Gyeongsan, Gyeongbuk, 38541, South Korea
- Myounghai Kwak
  - National Institute of Biological Resources, Incheon, 22689, South Korea
- SeonJoo Park
  - Department of Life Sciences, Yeungnam University, Gyeongsan, Gyeongbuk, 38541, South Korea
6
Wang F, Wang Y, Zeng X, Zhang S, Yu J, Li D, Zhang X. MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction. Bioinformatics 2024; 40:btae154. [PMID: 38547397 PMCID: PMC10990684 DOI: 10.1093/bioinformatics/btae154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 02/06/2024] [Indexed: 04/05/2024] Open
Abstract
MOTIVATION: Constructing a phylogenetic tree requires calculating the evolutionary distance between samples or species via large-scale resequencing data, a process that is both time-consuming and computationally demanding. Striking the right balance between accuracy and efficiency is a significant challenge.
RESULTS: To address this, we introduce a new algorithm, MIKE (MinHash-based k-mer algorithm). This algorithm is designed for the swift calculation of the Jaccard coefficient directly from raw sequencing reads and enables the construction of phylogenetic trees based on the resultant Jaccard coefficient. Simulation results highlight the superior speed of MIKE compared to existing state-of-the-art methods. We used MIKE to reconstruct a phylogenetic tree, incorporating 238 yeast, 303 Zea, 141 Ficus, 67 Oryza, and 43 Saccharum spontaneum samples. MIKE demonstrated accurate performance across varying evolutionary scales, reproductive modes, and ploidy levels, proving itself as a powerful tool for phylogenetic tree construction.
AVAILABILITY AND IMPLEMENTATION: MIKE is publicly available on GitHub at https://github.com/Argonum-Clever2/mike.git.
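The general idea of estimating a Jaccard coefficient from k-mer sketches of raw reads can be illustrated with a minimal bottom-s MinHash sketch. This is not the MIKE implementation: the function names, the BLAKE2b hash, and the omission of canonical (strand-independent) k-mers and of error filtering are all assumptions made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers contained in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(reads, k, size):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes of a sample."""
    hashes = set()
    for read in reads:
        hashes.update(hash64(km) for km in kmers(read, k))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard coefficient from two bottom-s sketches.

    The `size` smallest hashes of the union act as a random sample of the
    union; the fraction present in both sketches estimates |A∩B| / |A∪B|.
    """
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


# Hypothetical toy reads standing in for two sequencing samples.
sample_a = ["ACGTACGTAC"]
sample_b = ["ACGTACGTTT"]
estimate = jaccard_estimate(sketch(sample_a, 4, 100), sketch(sample_b, 4, 100), 100)
```

When the sketch size exceeds the number of distinct k-mers, as in this toy case, the estimate coincides with the exact Jaccard coefficient; with real read sets the sketch is much smaller than the k-mer set and the value is an approximation. A distance derived from such coefficients could then be passed to any distance-based tree-building method.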
Affiliation(s)
- Fang Wang
  - College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Yibin Wang
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Xiaofei Zeng
  - Department of Human Cell Biology and Genetics, Joint Laboratory of Guangdong-Hong Kong Universities for Vascular Homeostasis and Diseases, School of Medicine, Southern University of Science and Technology, Shenzhen, Guangdong 508055, China
- Shengcheng Zhang
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Jiaxin Yu
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Dongxi Li
  - College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China
- Xingtan Zhang
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
7
Kille B, Nute MG, Huang V, Kim E, Phillippy AM, Treangen TJ. Parsnp 2.0: Scalable Core-Genome Alignment for Massive Microbial Datasets. bioRxiv 2024:2024.01.30.577458. [PMID: 38352342 PMCID: PMC10862825 DOI: 10.1101/2024.01.30.577458] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/21/2024]
Abstract
Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014.
Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4x and reduce runtime by over 2x, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes.
Availability: Parsnp is available at https://github.com/marbl/parsnp.
8
Altenhoff AM, Warwick Vesztrocy A, Bernard C, Train CM, Nicheperovich A, Prieto Baños S, Julca I, Moi D, Nevers Y, Majidian S, Dessimoz C, Glover NM. OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA Ecosystem. Nucleic Acids Res 2024; 52:D513-D521. [PMID: 37962356 PMCID: PMC10767875 DOI: 10.1093/nar/gkad1020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 10/17/2023] [Accepted: 10/23/2023] [Indexed: 11/15/2023] Open
Abstract
In this update paper, we present the latest developments in the OMA browser knowledgebase, which aims to provide high-quality orthology inferences and facilitate the study of gene families, genomes and their evolution. First, we discuss the addition of new species in the database, particularly an expanded representation of prokaryotic species. The OMA browser now offers Ancestral Genome pages and an Ancestral Gene Order viewer, allowing users to explore the evolutionary history and gene content of ancestral genomes. We also introduce a revamped Local Synteny Viewer to compare genomic neighborhoods across both extant and ancestral genomes. Hierarchical Orthologous Groups (HOGs) are now annotated with Gene Ontology annotations, and users can easily perform extant or ancestral GO enrichments. Finally, we recap new tools in the OMA Ecosystem, including OMAmer for proteome mapping, OMArk for proteome quality assessment, OMAMO for model organism selection and Read2Tree for phylogenetic species tree construction from reads. These new features provide exciting opportunities for orthology analysis and comparative genomics. OMA is accessible at https://omabrowser.org.
Affiliation(s)
- Adrian M Altenhoff
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - ETH Zurich, Computer Science, Universitätstr. 6, 8092 Zurich, Switzerland
- Alex Warwick Vesztrocy
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Charles Bernard
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Clement-Marie Train
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Alina Nicheperovich
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Silvia Prieto Baños
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Irene Julca
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- David Moi
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Yannis Nevers
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Sina Majidian
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Christophe Dessimoz
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Natasha M Glover
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
9
Thalén F, Köhne CG, Bleidorn C. Patchwork: Alignment-Based Retrieval and Concatenation of Phylogenetic Markers from Genomic Data. Genome Biol Evol 2023; 15:evad227. [PMID: 38085033 PMCID: PMC10735302 DOI: 10.1093/gbe/evad227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/06/2023] [Indexed: 12/23/2023] Open
Abstract
Low-coverage whole-genome sequencing (also known as "genome skimming") is becoming an increasingly affordable approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome skimming is rather rarely utilized for recovering single-copy nuclear markers. One reason might be that only few tools exist to work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. We here present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence similarity search, followed by a "hit stitching" phase, in which adjacent or overlapping regions are merged into a single unit. The novel sliding window algorithm trims away any noncoding regions from the resulting sequence. We demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study, and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork is released under the GNU General Public License version 3. Installation instructions, additional documentation, and the source code itself are all available via GitHub at https://github.com/fethalen/Patchwork.
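The "hit stitching" step described in this abstract, in which adjacent or overlapping regions are merged into a single unit, resembles classic interval merging. The sketch below shows only that general idea in Python; Patchwork itself is written in Julia and operates on DIAMOND alignments, so the function name, the inclusive coordinate convention, and the `max_gap` parameter are hypothetical and not taken from the tool.

```python
def stitch_hits(hits, max_gap=0):
    """Merge overlapping or adjacent hits into single regions.

    `hits` is an iterable of (start, end) pairs with inclusive coordinates.
    Two hits are joined when the next one starts no more than `max_gap`
    positions after the position directly following the previous end
    (max_gap=0 joins only overlapping or directly adjacent hits).
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1 + max_gap:
            # Extend the current region; keep the furthest end seen so far.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Hypothetical hit coordinates on one reference marker.
regions = stitch_hits([(50, 60), (10, 20), (15, 30), (31, 40)])
```

Here the first three hits in sorted order collapse into one region because they overlap or touch, while the hit starting at position 50 stays separate.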
Affiliation(s)
- Felix Thalén
  - Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
  - Cardio-CARE AG, Medizincampus Davos, Davos Wolfgang 7265, Switzerland
- Clara G Köhne
  - Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
- Christoph Bleidorn
  - Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
10
Kim BY, Gellert HR, Church SH, Suvorov A, Anderson SS, Barmina O, Beskid SG, Comeault AA, Crown KN, Diamond SE, Dorus S, Fujichika T, Hemker JA, Hrcek J, Kankare M, Katoh T, Magnacca KN, Martin RA, Matsunaga T, Medeiros MJ, Miller DE, Pitnick S, Simoni S, Steenwinkel TE, Schiffer M, Syed ZA, Takahashi A, Wei KHC, Yokoyama T, Eisen MB, Kopp A, Matute D, Obbard DJ, O'Grady PM, Price DK, Toda MJ, Werner T, Petrov DA. Single-fly assemblies fill major phylogenomic gaps across the Drosophilidae Tree of Life. bioRxiv 2023:2023.10.02.560517. [PMID: 37873137 PMCID: PMC10592941 DOI: 10.1101/2023.10.02.560517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Long-read sequencing is driving rapid progress in genome assembly across all major groups of life, including species of the family Drosophilidae, a longtime model system for genetics, genomics, and evolution. We previously developed a cost-effective hybrid Oxford Nanopore (ONT) long-read and Illumina short-read sequencing approach and used it to assemble 101 drosophilid genomes from laboratory cultures, greatly increasing the number of genome assemblies for this taxonomic group. The next major challenge is to address the laboratory culture bias in taxon sampling by sequencing genomes of species that cannot easily be reared in the lab. Here, we build upon our previous methods to perform amplification-free ONT sequencing of single wild flies obtained either directly from the field or from ethanol-preserved specimens in museum collections, greatly improving the representation of lesser studied drosophilid taxa in whole-genome data. Using Illumina Novaseq X Plus and ONT P2 sequencers with R10.4.1 chemistry, we set a new benchmark for inexpensive hybrid genome assembly at US $150 per genome while assembling genomes from as little as 35 ng of genomic DNA from a single fly. We present 183 new genome assemblies for 179 species as a resource for drosophilid systematics, phylogenetics, and comparative genomics. Of these genomes, 62 are from pooled lab strains and 121 from single adult flies. Despite the sample limitations of working with small insects, most single-fly diploid assemblies are comparable in contiguity (>1Mb contig N50), completeness (>98% complete dipteran BUSCOs), and accuracy (>QV40 genome-wide with ONT R10.4.1) to assemblies from inbred lines. We present a well-resolved multi-locus phylogeny for 360 drosophilid and 4 outgroup species encompassing all publicly available (as of August 2023) genomes for this group. 
Finally, we present a Progressive Cactus whole-genome, reference-free alignment built from a subset of 298 suitably high-quality drosophilid genomes. The new assemblies and alignment, along with updated laboratory protocols and computational pipelines, are released as an open resource and as a tool for studying evolution at the scale of an entire insect family.
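The contiguity figure quoted in this abstract (contig N50 above 1 Mb) uses a standard assembly metric: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch follows; the function name and the example contig lengths are made up for illustration and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length covers
        # at least half of the total assembly length.
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths in base pairs (4 Mb assembly in total).
example_contigs = [1_500_000, 1_200_000, 800_000, 300_000, 200_000]
example_n50 = n50(example_contigs)
```

In this toy assembly the two longest contigs together cover 2.7 Mb of 4 Mb, so the N50 is the length of the second contig.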
Affiliation(s)
- Samuel H Church
  - Department of Ecology and Evolutionary Biology, Yale University, USA
- Anton Suvorov
  - Department of Biological Sciences, Virginia Tech, USA
- Sean S Anderson
  - Department of Biology, University of North Carolina Chapel Hill, USA
- Olga Barmina
  - Department of Evolution and Ecology, University of California Davis, USA
- Aaron A Comeault
  - School of Environmental and Natural Sciences, Bangor University, UK
- K Nicole Crown
  - Department of Biology, Case Western Reserve University, USA
- Steve Dorus
  - Center for Reproductive Evolution, Department of Biology, Syracuse University, USA
- Takako Fujichika
  - Department of Biological Sciences, Tokyo Metropolitan University, Japan
- James A Hemker
  - Department of Developmental Biology, Stanford University, USA
- Jan Hrcek
  - Institute of Entomology, Biology Centre, Czech Academy of Sciences, Czechia
- Maaria Kankare
  - Department of Biological and Environmental Science, University of Jyväskylä, Finland
- Toru Katoh
  - Department of Biological Sciences, Hokkaido University, Japan
- Karl N Magnacca
  - Hawaii Invertebrate Program, Division of Forestry & Wildlife, State of Hawaii, USA
- Ryan A Martin
  - Department of Biology, Case Western Reserve University, USA
- Teruyuki Matsunaga
  - Department of Complexity Science and Engineering, The University of Tokyo, Japan
- Danny E Miller
  - Division of Genetic Medicine, Department of Pediatrics; Department of Laboratory Medicine and Pathology, University of Washington, USA
- Scott Pitnick
  - Center for Reproductive Evolution, Department of Biology, Syracuse University, USA
- Sara Simoni
  - Department of Biology, Stanford University, USA
- Michele Schiffer
  - Daintree Rainforest Observatory, James Cook University, Australia
- Zeeshan A Syed
  - Center for Reproductive Evolution, Department of Biology, Syracuse University, USA
- Aya Takahashi
  - Department of Biological Sciences, Tokyo Metropolitan University, Japan
- Kevin H-C Wei
  - Department of Zoology, The University of British Columbia
- Michael B Eisen
  - Department of Cell and Molecular Biology, University of California Berkeley, United States
  - Howard Hughes Medical Institute, University of California Berkeley, United States
- Artyom Kopp
  - Department of Evolution and Ecology, University of California Davis, USA
- Daniel Matute
  - Department of Biology, University of North Carolina Chapel Hill, USA
- Darren J Obbard
  - Institute of Ecology and Evolution, University of Edinburgh, UK
- Donald K Price
  - School of Life Sciences, University of Nevada Las Vegas, USA
- Thomas Werner
  - Department of Biological Sciences, Michigan Technological University, USA
- Dmitri A Petrov
  - Department of Biology, Stanford University, USA
  - CZ Biohub, Investigator