1
Noll NW, Scherber C, Schäffler L. taxalogue: a toolkit to create comprehensive CO1 reference databases. PeerJ 2023; 11:e16253. [PMID: 38077427 PMCID: PMC10702336 DOI: 10.7717/peerj.16253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 09/18/2023] [Indexed: 12/18/2023] Open
Abstract
Background: Taxonomic identification through DNA barcodes gained considerable traction through the invention of next-generation sequencing and DNA metabarcoding. Metabarcoding allows for the simultaneous identification of thousands of organisms from bulk samples with high taxonomic resolution. However, reliable identifications can only be achieved with comprehensive and curated reference databases. Therefore, custom reference databases are often created to meet the needs of specific research questions. Due to taxonomic inconsistencies, formatting issues, and technical difficulties, building a custom reference database requires tremendous effort. Here, we present taxalogue, an easy-to-use software for creating comprehensive and customized reference databases that provide clean and taxonomically harmonized records. In combination with extensive geographical filtering options, taxalogue opens up new possibilities for generating and testing evolutionary hypotheses.
Methods: taxalogue collects DNA sequences from several online sources and combines them into a reference database. Taxonomic incongruencies between the different data sources can be harmonized according to available taxonomies. Dereplication and various filtering options are available regarding sequence quality or metadata information. taxalogue is implemented in the open-source Ruby programming language, and the source code is available at https://github.com/nwnoll/taxalogue. We benchmark four reference databases by sequence identity against eight queries from different localities and trapping devices. Subsamples from each reference database were used to compare how well another one is covered.
Results: taxalogue produces reference databases with the best coverage at high identities for most tested queries, enabling more accurate, reliable predictions with higher certainty than the other benchmarked reference databases. Additionally, the performance of taxalogue is more consistent while providing good coverage for a variety of habitats, regions, and sampling methods. taxalogue simplifies the creation of reference databases and makes the process reproducible and transparent. Multiple available output formats for commonly used downstream applications facilitate the easy adoption of taxalogue in many different software pipelines. The resulting reference databases improve the taxonomic classification accuracy through high coverage of the query sequences at high identities.
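The dereplication and quality filtering described in this abstract can be illustrated with a minimal Python sketch. This is not taxalogue's implementation (taxalogue is written in Ruby, and its internals are not described here); the `Record` type, the `build_reference` function, the thresholds, and the toy sequences are all hypothetical and exist only to show the general idea of merging records from several sources, dropping low-quality sequences, and keeping one copy of each taxon and sequence pair.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """One barcode record; all field names are illustrative assumptions."""
    seq_id: str    # hypothetical accession
    source: str    # hypothetical name of the online source
    taxon: str     # harmonized taxon name
    sequence: str  # nucleotide sequence


def build_reference(records: Iterable[Record],
                    min_length: int = 500,
                    max_ambiguous: int = 0) -> List[Record]:
    """Filter by length and ambiguity, then dereplicate per taxon."""
    seen = set()
    kept = []
    for rec in records:
        # Normalize case and strip alignment gaps before comparing.
        seq = rec.sequence.upper().replace("-", "")
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if len(seq) < min_length or ambiguous > max_ambiguous:
            continue
        # Identical sequences are only duplicates within the same taxon.
        key = (rec.taxon, seq)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec._replace(sequence=seq))
    return kept


# Toy input: sequences are far shorter than real barcodes.
records = [
    Record("A1", "source_a", "Apis mellifera", "ACGTACGTAC"),
    Record("B7", "source_b", "Apis mellifera", "acgtacgtac"),     # duplicate
    Record("B8", "source_b", "Bombus terrestris", "ACGTACGTAC"),  # other taxon
    Record("A2", "source_a", "Apis mellifera", "ACGT"),           # too short
    Record("A3", "source_a", "Apis mellifera", "ACGTNNGTAC"),     # ambiguous
]
reference = build_reference(records, min_length=8)
print([rec.seq_id for rec in reference])
```

Keying the duplicate check on both taxon and sequence is a design choice of this sketch: it keeps identical barcodes that are assigned to different taxa, since those cases may matter for downstream classification.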
Affiliation(s)
- Niklas W. Noll
  - Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Christoph Scherber
  - Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Livia Schäffler
  - Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
3
deWaard JR, Ratnasingham S, Zakharov EV, Borisenko AV, Steinke D, Telfer AC, Perez KHJ, Sones JE, Young MR, Levesque-Beaudin V, Sobel CN, Abrahamyan A, Bessonov K, Blagoev G, deWaard SL, Ho C, Ivanova NV, Layton KKS, Lu L, Manjunath R, McKeown JTA, Milton MA, Miskie R, Monkhouse N, Naik S, Nikolova N, Pentinsaari M, Prosser SWJ, Radulovici AE, Steinke C, Warne CP, Hebert PDN. A reference library for Canadian invertebrates with 1.5 million barcodes, voucher specimens, and DNA samples. Sci Data 2019; 6:308. [PMID: 31811161 PMCID: PMC6897906 DOI: 10.1038/s41597-019-0320-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 08/02/2019] [Accepted: 11/11/2019] [Indexed: 01/08/2023] Open
Abstract
The reliable taxonomic identification of organisms through DNA sequence data requires a well parameterized library of curated reference sequences. However, it is estimated that just 15% of described animal species are represented in public sequence repositories. To begin to address this deficiency, we provide DNA barcodes for 1,500,003 animal specimens collected from 23 terrestrial and aquatic ecozones at sites across Canada, a nation that comprises 7% of the planet's land surface. In total, 14 phyla, 43 classes, 163 orders, 1123 families, 6186 genera, and 64,264 Barcode Index Numbers (BINs; a proxy for species) are represented. Species-level taxonomy was available for 38% of the specimens, but higher proportions were assigned to a genus (69.5%) and a family (99.9%). Voucher specimens and DNA extracts are archived at the Centre for Biodiversity Genomics where they are available for further research. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, and the Global Genome Biodiversity Network Data Portal.
Affiliation(s)
- Jeremy R deWaard
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Evgeny V Zakharov
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Alex V Borisenko
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Dirk Steinke
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Angela C Telfer
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Kate H J Perez
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Jayme E Sones
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Monica R Young
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Crystal N Sobel
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Arusyak Abrahamyan
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Kyrylo Bessonov
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
  - Public Health Agency of Canada, Guelph, Ontario, Canada
- Gergin Blagoev
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Stephanie L deWaard
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Chris Ho
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Natalia V Ivanova
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Kara K S Layton
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
  - Ocean Frontier Institute, Dalhousie University, Halifax, Nova Scotia, Canada
- Liuqiong Lu
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Ramya Manjunath
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Jaclyn T A McKeown
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Megan A Milton
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Renee Miskie
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Norm Monkhouse
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Suresh Naik
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Nadya Nikolova
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Mikko Pentinsaari
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Sean W J Prosser
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Claudia Steinke
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Connor P Warne
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- Paul D N Hebert
  - Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
4
Phillips JD, Gillis DJ, Hanner RH. Incomplete estimates of genetic diversity within species: Implications for DNA barcoding. Ecol Evol 2019; 9:2996-3010. [PMID: 30891232 PMCID: PMC6406011 DOI: 10.1002/ece3.4757] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 05/17/2018] [Revised: 09/03/2018] [Accepted: 10/12/2018] [Indexed: 02/01/2023] Open
Abstract
DNA barcoding has greatly accelerated the pace of specimen identification to the species level, as well as species delineation. Whereas the application of DNA barcoding to the matching of unknown specimens to known species is straightforward, its use for species delimitation is more controversial, as species discovery hinges critically on present levels of haplotype diversity, as well as patterning of standing genetic variation that exists within and between species. Typical sample sizes for molecular biodiversity assessment using DNA barcodes range from 5 to 10 individuals per species. However, required levels that are necessary to fully gauge haplotype variation at the species level are presumed to be strongly taxon-specific. Importantly, little attention has been paid to determining appropriate specimen sample sizes that are necessary to reveal the majority of intraspecific haplotype variation within any one species. In this paper, we present a brief outline of the current literature and methods on intraspecific sample size estimation for the assessment of COI DNA barcode haplotype sampling completeness. The importance of adequate sample sizes for studies of molecular biodiversity is stressed, with application to a variety of metazoan taxa, through reviewing foundational statistical and population genetic models, with specific application to ray-finned fishes (Chordata: Actinopterygii). Finally, promising avenues for further research in this area are highlighted.
Affiliation(s)
- Jarrett D. Phillips
  - School of Computer Science, University of Guelph, Guelph, Ontario, Canada
  - Centre for Biodiversity Genomics, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, Canada
- Daniel J. Gillis
  - School of Computer Science, University of Guelph, Guelph, Ontario, Canada
- Robert H. Hanner
  - Centre for Biodiversity Genomics, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, Canada
  - Department of Integrative Biology, University of Guelph, Guelph, Ontario, Canada
5
Prantoni AL, Belmonte-Lopes R, Lana PC, Erséus C. Genetic diversity of marine oligochaetous clitellates in selected areas of the South Atlantic as revealed by DNA barcoding. Invertebr Syst 2018. [DOI: 10.1071/is17029] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/23/2022]
Abstract
Marine oligochaetous clitellates are poorly investigated in the South Atlantic Ocean, especially along the east coast of South America. Closely related species are often difficult to distinguish based on morphology. The lack of specialists and of modern identification guides has been pointed out as the main reason for the scarcity of studies in the South Atlantic Ocean as a whole. To increase the knowledge of this group in the South Atlantic, the genetic diversity of a sample of marine oligochaetous clitellates from Brazil, South Africa and Antarctica was assessed by the Automatic Barcode Gap Discovery (ABGD) and the generalised mixed Yule coalescent (GMYC) approaches. In total, 80 cytochrome c oxidase subunit I (COI) sequences were obtained, each with ~658 bp, estimated to represent 32 distinct putative species. ABGD established a barcoding gap between 3% and 14% divergence for uncorrected p-distances and the estimates of GMYC were largely concordant. All the clusters or putative species were genetically associated with previously known species or genera. This study thus confirms the adequacy of the COI barcoding approach combined with a genetic divergence threshold on the order of 10% for marine oligochaetous clitellates.
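The uncorrected p-distance mentioned in this abstract is the proportion of aligned sites at which two sequences differ. The Python sketch below shows that calculation together with a simple comparison against a divergence threshold. It is not ABGD or GMYC, and it is not the authors' pipeline; the function names, the handling of gaps and ambiguous bases, and the toy sequences are assumptions made for illustration, with the 10% value taken from the threshold named in the abstract.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are
    skipped (an assumption of this sketch, not a fixed convention).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in "ACGT" and base_b in "ACGT":
            compared += 1
            if base_a != base_b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def exceeds_threshold(seq_a: str, seq_b: str, threshold: float = 0.10) -> bool:
    """True if the pair diverges by more than the given threshold."""
    return p_distance(seq_a, seq_b) > threshold


# Toy aligned sequences, far shorter than a ~658 bp barcode.
close_pair = ("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA")  # 1 of 20 sites
distant_pair = ("ACGTACGTACGTACGTACGT", "TCGAACGAACGTTCGTACGA")  # 5 of 20 sites

print(p_distance(*close_pair), exceeds_threshold(*close_pair))
print(p_distance(*distant_pair), exceeds_threshold(*distant_pair))
```

A single fixed cut-off is only a rough screen. Methods such as ABGD infer the gap from the distribution of all pairwise distances rather than from a preset value.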
7
DNA Barcoding Survey of Anurans across the Eastern Cordillera of Colombia and the Impact of the Andes on Cryptic Diversity. PLoS One 2015; 10:e0127312. [PMID: 26000447 PMCID: PMC4441516 DOI: 10.1371/journal.pone.0127312] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 01/08/2015] [Accepted: 04/13/2015] [Indexed: 11/23/2022] Open
Abstract
Colombia hosts the second highest amphibian species diversity on Earth, yet its fauna remains poorly studied, especially using molecular genetic techniques. We present the results of the first wide-scale DNA barcoding survey of anurans of Colombia, focusing on a transect across the Eastern Cordillera. We surveyed 10 sites between the Magdalena Valley to the west and the eastern foothills of the Eastern Cordillera, sequencing portions of the mitochondrial 16S ribosomal RNA and cytochrome oxidase subunit 1 (CO1) genes for 235 individuals from 52 nominal species. We applied two barcode algorithms, Automatic Barcode Gap Discovery and Refined Single Linkage Analysis, to estimate the number of clusters or “unconfirmed candidate species” supported by DNA barcode data. Our survey included ~7% of the anuran species known from Colombia. While barcoding algorithms differed slightly in the number of clusters identified, between three and ten nominal species may be obscuring candidate species (in some cases, more than one cryptic species per nominal species). Our data suggest that the high elevations of the Eastern Cordillera and the low elevations of the Chicamocha canyon acted as geographic barriers in at least seven nominal species, promoting strong genetic divergences between populations associated with the Eastern Cordillera.
9
Geiger MF, Herder F, Monaghan MT, Almada V, Barbieri R, Bariche M, Berrebi P, Bohlen J, Casal-Lopez M, Delmastro GB, Denys GPJ, Dettai A, Doadrio I, Kalogianni E, Kärst H, Kottelat M, Kovačić M, Laporte M, Lorenzoni M, Marčić Z, Özuluğ M, Perdices A, Perea S, Persat H, Porcelotti S, Puzzi C, Robalo J, Šanda R, Schneider M, Šlechtová V, Stoumboudi M, Walter S, Freyhof J. Spatial heterogeneity in the Mediterranean Biodiversity Hotspot affects barcoding accuracy of its freshwater fishes. Mol Ecol Resour 2014; 14:1210-21. [DOI: 10.1111/1755-0998.12257] [Citation(s) in RCA: 177] [Impact Index Per Article: 17.7] [Received: 01/14/2014] [Revised: 03/15/2014] [Accepted: 03/19/2014] [Indexed: 11/29/2022]
10
Porter TM, Gibson JF, Shokralla S, Baird DJ, Golding GB, Hajibabaei M. Rapid and accurate taxonomic classification of insect (class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier. Mol Ecol Resour 2014. [PMCID: PMC4282328 DOI: 10.1111/1755-0998.12240] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 01/28/2023]
Abstract
Current methods to identify unknown insect (class Insecta) cytochrome c oxidase (COI barcode) sequences often rely on thresholds of distances that can be difficult to define, sequence similarity cut-offs, or monophyly. Some of the most commonly used metagenomic classification methods do not provide a measure of confidence for the taxonomic assignments they provide. The aim of this study was to use a naïve Bayesian classifier (Wang et al., Applied and Environmental Microbiology, 2007; 73: 5261) to automate taxonomic assignments for large batches of insect COI sequences such as data obtained from high-throughput environmental sequencing. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We have developed and rigorously tested the performance of three different training sets using leave-one-out cross-validation, two field data sets, and targeted testing of Lepidoptera, Diptera and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates, incorrect taxonomic assignments with a high bootstrap support, were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cut-offs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
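The idea of a naïve Bayesian assignment with a bootstrap support value can be sketched in a few lines of Python. The sketch is loosely modelled on the word-based scoring described by Wang et al. (2007) but is not the classifier or the training sets used in this study: the `KmerNaiveBayes` class, its smoothing constants, the default word size, the number of bootstrap trials, and the toy training sequences are all illustrative assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of distinct overlapping words of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerNaiveBayes:
    """Toy word-based naive Bayes classifier with bootstrap support."""

    def __init__(self, k=8):
        self.k = k
        self.n_seqs = 0                              # training sequences
        self.word_seqs = Counter()                   # sequences containing a word
        self.taxon_seqs = Counter()                  # sequences per taxon
        self.taxon_word_seqs = defaultdict(Counter)  # per taxon, per word

    def fit(self, labelled):
        for taxon, seq in labelled:
            self.n_seqs += 1
            self.taxon_seqs[taxon] += 1
            for word in kmers(seq, self.k):
                self.word_seqs[word] += 1
                self.taxon_word_seqs[taxon][word] += 1
        return self

    def _log_score(self, taxon, words):
        size = self.taxon_seqs[taxon]
        counts = self.taxon_word_seqs[taxon]
        score = 0.0
        for word in words:
            # Smoothed overall word frequency, used as a prior so that
            # unseen words never give a zero probability.
            prior = (self.word_seqs[word] + 0.5) / (self.n_seqs + 1)
            score += math.log((counts[word] + prior) / (size + 1))
        return score

    def _best(self, words):
        taxa = sorted(self.taxon_seqs)  # sorted for deterministic ties
        return max(taxa, key=lambda taxon: self._log_score(taxon, words))

    def classify(self, seq, n_boot=100, seed=0):
        """Return (taxon, support), support being the fraction of
        bootstrap trials on word subsamples that agree with the call."""
        words = sorted(kmers(seq, self.k))
        if not words:
            raise ValueError("query is shorter than the word size")
        best = self._best(words)
        rng = random.Random(seed)
        subset = max(1, len(words) // self.k)
        hits = 0
        for _ in range(n_boot):
            sample = [rng.choice(words) for _ in range(subset)]
            if self._best(sample) == best:
                hits += 1
        return best, hits / n_boot


# Toy training data: invented sequences, not real COI barcodes.
model = KmerNaiveBayes(k=4).fit([
    ("Lepidoptera", "AAAAAAAAAACCCCCCCCCC"),
    ("Diptera", "GGGGGGGGGGTTTTTTTTTT"),
])
taxon, support = model.classify("AAAAAACCCCCC")
print(taxon, support)
```

In practice a support cut-off would be applied to the returned value, and, as the abstract notes, a suitable cut-off may depend on query length and on whether the query taxon is represented in the training set.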
Affiliation(s)
- Teresita M. Porter
  - McMaster University, Department of Biology, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1
- Joel F. Gibson
  - Biodiversity Institute of Ontario & Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, Canada N1G 2W1
- Shadi Shokralla
  - Biodiversity Institute of Ontario & Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, Canada N1G 2W1
- Donald J. Baird
  - Environment Canada at Canadian Rivers Institute, Department of Biology, University of New Brunswick, Fredericton, NB, Canada E3B 6E1
- G. Brian Golding
  - McMaster University, Department of Biology, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1
- Mehrdad Hajibabaei
  - Biodiversity Institute of Ontario & Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, Canada N1G 2W1