1
|
Mallik A, Ilie L. ALeS: adaptive-length spaced-seed design. Bioinformatics 2021; 37:1206-1210. [PMID: 34107042 DOI: 10.1093/bioinformatics/btaa945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2020] [Revised: 09/26/2020] [Accepted: 10/27/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION Sequence similarity is the most frequently used procedure in biological research, as proved by the widely used BLAST program. The consecutive seed used by BLAST can be dramatically improved by considering multiple spaced seeds. Finding the best seeds is a hard problem and much effort went into developing heuristic algorithms and software for designing highly sensitive spaced seeds. RESULTS We introduce a new algorithm and software, ALeS, that produces more sensitive seeds than the current state-of-the-art programs, as shown by extensive testing. We also accurately estimate the sensitivity of a seed, enabling its computation for arbitrary seeds. AVAILABILITY AND IMPLEMENTATION The source code is freely available at github.com/lucian-ilie/ALeS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Arnab Mallik
- Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
|
2
|
Zahariev M, Chen W, Visagie CM, Lévesque CA. Cluster oligonucleotide signatures for rapid identification by sequencing. BMC Bioinformatics 2018; 19:395. [PMID: 30522439 PMCID: PMC6284311 DOI: 10.1186/s12859-018-2363-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/16/2017] [Accepted: 09/09/2018] [Indexed: 01/19/2023]
Abstract
BACKGROUND Oligonucleotide signatures (signatures) have been widely used for studying microbial diversity and function in wet-lab settings, but using them for accurate in silico identification of organisms from high-throughput sequencing (HTS) data is only a proof of concept. Existing signature design programs for sequence signatures (signatures matching exactly one sequence) or clade signatures (signatures matching every sequence in a phylogenetic clade) are not able to identify all possible polymorphic sites for sequences with high similarity and perform poorly when handling large genome sequencing datasets. RESULTS We introduce cluster signatures: subsequences that match perfectly and exclusively any group of sequences in a data set. Cluster signatures provide complete recall for primer/probe design and increased discrimination between sequences beyond that of clade signatures. Using cluster signatures for in silico identification of HTS targets achieves good precision/recall and running time performance. This method has been implemented into an open source tool, the Automated Oligonucleotide Design Pipeline (aodp), included in supplementary material and available at: https://bitbucket.org/wenchen_aafc/aodp_v2.0_release. CONCLUSIONS Cluster signatures provide a rapid and universal analysis tool to identify all possible short diagnostic DNA markers and variants from any DNA sequencing dataset. They are particularly useful in discriminating genetic material from closely related organisms and in detecting deleterious mutations in highly or perfectly conserved genomic sites.
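The definition of a cluster signature given in this abstract (a subsequence that matches perfectly and exclusively one group of sequences) can be illustrated with a minimal sketch. It is not the published tool's implementation: the function name `cluster_signatures`, the fixed signature length `k`, and the toy sequences are assumptions made for illustration, and reverse complements and ambiguous bases are ignored.

```python
from collections import defaultdict


def cluster_signatures(seqs: dict[str, str], k: int) -> dict[frozenset, set]:
    """Group every k-mer by the exact set of sequences that contain it.

    seqs maps a (hypothetical) sequence name to its nucleotide string.
    The result maps each group of sequence names to the k-mers found in
    all members of that group and in no other sequence.
    """
    # Record, for each k-mer, which sequences it occurs in.
    occurrences = defaultdict(set)
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            occurrences[seq[i:i + k]].add(name)

    # Invert the index: the key is the group, the value its signatures.
    clusters = defaultdict(set)
    for kmer, names in occurrences.items():
        clusters[frozenset(names)].add(kmer)
    return dict(clusters)


toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
result = cluster_signatures(toy, 3)
```

In this toy input, `ACG` is a signature of the group {a, b}, while `CGT`, `CGA`, and `TTT` are signatures of the single sequences a, b, and c respectively.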
Affiliation(s)
- Manuel Zahariev
- Ottawa R&D Centre, Agriculture & Agri-Food Canada, 960 Carling Ave., Ottawa, ON, K1A 0C6 Canada
- Wen Chen
- Skwez Technology Corp, Box 3674, Garibaldi Highlands, BC, V0N 1T0 Canada
- Cobus M. Visagie
- The Agricultural Research Council - PPRI, P/Bag X134, Queenswood, 0121 South Africa
- C. André Lévesque
- Sidney Laboratory Project - Science, Canadian Food Inspection Agency, Floor 2E, Room 233, 59 Camelot Drive, Ottawa, ON, K1A 0Y9 Canada
|
3
|
Noé L. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms Mol Biol 2017; 12:1. [PMID: 28289437 PMCID: PMC5310094 DOI: 10.1186/s13015-017-0092-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 09/19/2016] [Accepted: 01/30/2017] [Indexed: 12/02/2022]
Abstract
Background Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity. Results We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
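As a rough illustration of the terms used in this abstract, the sketch below shows how a spaced seed such as 11110110111 hits an alignment and how its sensitivity can be computed under a Bernoulli model. It is only a brute-force sketch for short alignments, not the dominance or semi-ring method of the paper: the function names `seed_hits` and `sensitivity`, and the encoding of an alignment as a string of `1` (match) and `0` (mismatch), are assumptions made here.

```python
from itertools import product


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the seed matches the alignment at some offset.

    A '1' in the seed requires a match ('1') in the alignment at that
    position; a '0' in the seed is a don't-care position.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(alignment) - len(seed) + 1):
        if all(alignment[start + i] == "1" for i in care):
            return True
    return False


def sensitivity(seed: str, length: int, p: float) -> float:
    """Probability that the seed hits a random alignment of the given length.

    Each position is a match independently with probability p (Bernoulli
    model). All 2**length alignments are enumerated, so this is only
    practical for small lengths.
    """
    total = 0.0
    for bits in product("01", repeat=length):
        aln = "".join(bits)
        if seed_hits(seed, aln):
            matches = aln.count("1")
            total += p ** matches * (1 - p) ** (length - matches)
    return total


# The seed from the title above, compared with a contiguous seed of the
# same weight (9) on alignments of length 14 with match probability 0.8.
spaced = sensitivity("11110110111", 14, 0.8)
contiguous = sensitivity("111111111", 14, 0.8)
```

Exhaustive enumeration grows exponentially with the alignment length; seed design programs rely on far more efficient computations, so this sketch is meant only to make the definitions concrete.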
|
4
|
Parisot N, Peyretaillade E, Dugat-Bony E, Denonfoux J, Mahul A, Peyret P. Probe Design Strategies for Oligonucleotide Microarrays. Methods Mol Biol 2016; 1368:67-82. [PMID: 26614069 DOI: 10.1007/978-1-4939-3136-1_6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Oligonucleotide microarrays have been widely used for gene detection and/or quantification of gene expression in various samples ranging from a single organism to a complex microbial assemblage. The success of a microarray experiment, however, strongly relies on the quality of designed probes. Consequently, probe design is of critical importance and therefore multiple parameters should be considered for each probe in order to ensure high specificity, sensitivity, and uniformity as well as potentially quantitative power. Moreover, to assess the complete gene repertoire of complex biological samples such as those studied in the field of microbial ecology, exploratory probe design strategies must be also implemented to target not-yet-described sequences. To design such probes, two algorithms, KASpOD and HiSpOD, have been developed and they are available via two user-friendly web services. Here, we describe the use of this software necessary for the design of highly effective probes especially in the context of microbial oligonucleotide microarrays by taking into account all the crucial parameters.
Affiliation(s)
- Nicolas Parisot
- Université d'Auvergne, EA 4678, CIDAM, Clermont Université, BP 10448, F-63000, Clermont-Ferrand, France
- Eric Peyretaillade
- Université d'Auvergne, EA 4678, CIDAM, Clermont Université, BP 10448, F-63000, Clermont-Ferrand, France
- Eric Dugat-Bony
- Génie et Microbiologie des Procédés Alimentaires, Centre de Biotechnologies Agro-Industrielles, INRA, AgroParisTech, UMR 782, Thiverval-Grignon, France
- Jérémie Denonfoux
- Genomic Platform and R&D, Genoscreen, Campus de l'Institut Pasteur, Lille, France
- Pierre Peyret
- Université d'Auvergne, EA 4678, CIDAM, Clermont Université, BP 10448, F-63000, Clermont-Ferrand, France.
|
5
|
Rouli L, Merhej V, Fournier PE, Raoult D. The bacterial pangenome as a new tool for analysing pathogenic bacteria. New Microbes New Infect 2015; 7:72-85. [PMID: 26442149 PMCID: PMC4552756 DOI: 10.1016/j.nmni.2015.06.005] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Received: 05/11/2015] [Accepted: 06/16/2015] [Indexed: 01/18/2023]
Abstract
The bacterial pangenome was introduced in 2005 and, in recent years, has been the subject of many studies. Thanks to progress in next-generation sequencing methods, the pangenome can be divided into two parts, the core (common to the studied strains) and the accessory genome, offering a large panel of uses. In this review, we have presented the analysis methods, the pangenome composition and its application as a study of lifestyle. We have also shown that the pangenome may be used as a new tool for redefining the pathogenic species. We applied this to the Escherichia coli and Shigella species, which have been a subject of controversy regarding their taxonomic and pathogenic position. Pangenome is a new way of studying pathogenic bacteria. Pangenome can be used as a taxonomic tool. This review describes pangenome in the world of pathogenic bacteria.
Affiliation(s)
- L Rouli
- Aix Marseille Université, URMITE, UM63, CNRS 7278, IRD 198, Inserm 1095, 13005 Marseille, France
- V Merhej
- Aix Marseille Université, URMITE, UM63, CNRS 7278, IRD 198, Inserm 1095, 13005 Marseille, France
- P-E Fournier
- Aix Marseille Université, URMITE, UM63, CNRS 7278, IRD 198, Inserm 1095, 13005 Marseille, France
- D Raoult
- Aix Marseille Université, URMITE, UM63, CNRS 7278, IRD 198, Inserm 1095, 13005 Marseille, France
|
6
|
Kushwaha SK, Manoharan L, Meerupati T, Hedlund K, Ahrén D. MetCap: a bioinformatics probe design pipeline for large-scale targeted metagenomics. BMC Bioinformatics 2015; 16:65. [PMID: 25880302 PMCID: PMC4355349 DOI: 10.1186/s12859-015-0501-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 04/29/2014] [Accepted: 02/19/2015] [Indexed: 12/15/2022]
Abstract
Background Massive sequencing of genes from different environments has evolved metagenomics as central to enhancing the understanding of the wide diversity of micro-organisms and their roles in driving ecological processes. Reduced cost and high throughput sequencing has made large-scale projects achievable to a wider group of researchers, though complete metagenome sequencing is still a daunting task in terms of sequencing as well as the downstream bioinformatics analyses. Alternative approaches such as targeted amplicon sequencing requires custom PCR primer generation, and is not scalable to thousands of genes or gene families. Results In this study, we are presenting a web-based tool called MetCap that circumvents the limitations of amplicon sequencing of multiple genes by designing probes that are suitable for large-scale targeted metagenomics sequencing studies. MetCap provides a novel approach to target thousands of genes and genomic regions that could be used in targeted metagenomics studies. Automatic analysis of user-defined sequences is performed, and probes specifically designed for metagenome studies are generated. To illustrate the advantage of a targeted metagenome approach, we have generated more than 300,000 probes that match more than 400,000 publicly available sequences related to carbon degradation, and used these probes for target sequencing in a soil metagenome study. The results show high enrichment of target genes and a successful capturing of the majority of gene families. MetCap is freely available to users from: http://soilecology.biol.lu.se/metcap/. Conclusion MetCap is facilitating probe-based target enrichment as an easy and efficient alternative tool compared to complex primer-based enrichment for large-scale investigations of metagenomes. Our results have shown efficient large-scale target enrichment through MetCap-designed probes for a soil metagenome. 
The web service is suitable for any targeted metagenomics project that aims to study several genes simultaneously. The novel bioinformatics approach taken by the web service will enable researchers in microbial ecology to tap into the vast diversity of microbial communities using targeted metagenomics as a cost-effective alternative to whole metagenome sequencing. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0501-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sandeep K Kushwaha
- Department of Biology, Lund University, Ecology Building, 223 62, Lund, Sweden.
- Katarina Hedlund
- Department of Biology, Lund University, Ecology Building, 223 62, Lund, Sweden.
- Dag Ahrén
- Department of Biology, Lund University, Ecology Building, 223 62, Lund, Sweden; Bioinformatics Infrastructure for Life Sciences (BILS), Department of Biology, Lund University, Ecology Building, 223 62, Lund, Sweden.
|
7
|
Bougas B, Normandeau E, Pierron F, Campbell PGC, Bernatchez L, Couture P. How does exposure to nickel and cadmium affect the transcriptome of yellow perch (Perca flavescens)--results from a 1000 candidate-gene microarray. Aquat Toxicol 2013; 142-143:355-64. [PMID: 24084258 DOI: 10.1016/j.aquatox.2013.09.009] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/20/2013] [Revised: 09/05/2013] [Accepted: 09/06/2013] [Indexed: 05/25/2023]
Abstract
The molecular mechanisms underlying nickel (Ni) and cadmium (Cd) toxicity and their specific effects on fish are poorly understood. Documenting gene transcription profiles offers a powerful approach toward identifying the molecular mechanisms affected by these metals and to discover biomarkers of their toxicity. However, confounding environmental factors can complicate the interpretation of the results and the detection of biomarkers for fish captured in their natural environment. In the present study, a 1000 candidate-gene microarray, developed from a previous RNA-seq study on a subset of individual fish from contrasting level of metal contamination, was used to investigate the transcriptional response to metal (Ni and Cd) and non metal (temperature, oxygen, and diet) stressors in yellow perch (Perca flavescens). Specifically, we aimed at (1) identifying transcriptional signatures specific to Ni and Cd exposure, (2) investigating the mechanisms of their toxicity, and (3) developing a predictive tool to identify the sublethal effects of Ni and Cd contaminants in fish sampled from natural environments. A total of 475 genes displayed significantly different transcription levels when temperature varied while 287 and 176 genes were differentially transcribed at different concentrations of Ni and Cd, respectively. These metals were found to mainly affect the transcription level of genes involved in iron metabolism, transcriptional and translational processes, vitamin metabolism, blood coagulation, and calcium transport. In addition, a linear discriminant analysis (LDA) made using gene transcription levels yielded 94% correctly reassigned samples regarding their level of metal contamination, which indicates the potential of the microarray to detect perch response to Cd or Ni effects.
Affiliation(s)
- Bérénice Bougas
- Institut National de la Recherche Scientifique, Centre INRS Eau Terre et Environnement, 490, rue de la Couronne, Québec, Québec G1K 9A9, Canada; Département de biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec G1V 0A6, Canada.
|
8
|
Abstract
BACKGROUND The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program. FINDINGS SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We compute as well multiple seeds of the same weight as MegaBLAST, that greatly improve its sensitivity. CONCLUSION Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute very fast high quality seeds will help expanding the range of their applications.
Affiliation(s)
- Silvana Ilie
- Department of Mathematics, Ryerson University, Toronto, ON M5B 2K3, Canada.
|
9
|
He Z, Deng Y, Zhou J. Development of functional gene microarrays for microbial community analysis. Curr Opin Biotechnol 2011; 23:49-55. [PMID: 22100036 DOI: 10.1016/j.copbio.2011.11.001] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 07/29/2011] [Revised: 10/31/2011] [Accepted: 11/01/2011] [Indexed: 01/21/2023]
Abstract
Functional gene arrays (FGAs) are a special type of microarrays containing probes for key genes involved in microbial functional processes, such as biogeochemical cycling of carbon, nitrogen, sulfur, phosphorus and metals, virulence and antibiotic resistance, biodegradation of environmental contaminants, and stress responses. FGAs have been demonstrated to be a specific, sensitive, and quantitative tool for rapid analysis of microbial communities from different habitats, such as waters, soils, extreme environments, bioreactors, and human microbiomes. In this review, we first summarize currently reported FGAs, and then focus on the FGA development. We will also discuss several key issues of FGA technology as well as challenges and directions in future FGA development.
Affiliation(s)
- Zhili He
- Institute for Environmental Genomics, Department of Botany and Microbiology, University of Oklahoma, Norman, OK 73019, USA.
|
10
|
Dugat-Bony E, Peyretaillade E, Parisot N, Biderre-Petit C, Jaziri F, Hill D, Rimour S, Peyret P. Detecting unknown sequences with DNA microarrays: explorative probe design strategies. Environ Microbiol 2011; 14:356-71. [PMID: 21895914 DOI: 10.1111/j.1462-2920.2011.02559.x] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Indexed: 01/03/2023]
Abstract
Designing environmental DNA microarrays that can be used to survey the extreme diversity of microorganisms existing in nature, represents a stimulating challenge in the field of molecular ecology. Indeed, recent efforts in metagenomics have produced a substantial amount of sequence information from various ecosystems, and will continue to accumulate large amounts of sequence data given the qualitative and quantitative improvements in the next-generation sequencing methods. It is now possible to take advantage of these data to develop comprehensive microarrays by using explorative probe design strategies. Such strategies anticipate genetic variations and thus are able to detect known and unknown sequences in environmental samples. In this review, we provide a detailed overview of the probe design strategies currently available to construct both phylogenetic and functional DNA microarrays, with emphasis on those permitting the selection of such explorative probes. Furthermore, exploration of complex environments requires particular attention on probe sensitivity and specificity criteria. Finally, these innovative probe design approaches require exploiting newly available high-density microarray formats.
Affiliation(s)
- Eric Dugat-Bony
- Clermont Université, Université Blaise Pascal, Laboratoire Microorganismes: Génome et Environnement, Clermont-Ferrand, France
|
11
|
Mohtashemi M, Walburger DK, Peterson MW, Sutton FN, Skaer HB, Diggans JC. Open-target sparse sensing of biological agents using DNA microarray. BMC Bioinformatics 2011; 12:314. [PMID: 21801424 PMCID: PMC3161048 DOI: 10.1186/1471-2105-12-314] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/28/2011] [Accepted: 07/29/2011] [Indexed: 11/24/2022]
Abstract
Background Current biosensors are designed to target and react to specific nucleic acid sequences or structural epitopes. These 'target-specific' platforms require creation of new physical capture reagents when new organisms are targeted. An 'open-target' approach to DNA microarray biosensing is proposed and substantiated using laboratory generated data. The microarray consisted of 12,900 25 bp oligonucleotide capture probes derived from a statistical model trained on randomly selected genomic segments of pathogenic prokaryotic organisms. Open-target detection of organisms was accomplished using a reference library of hybridization patterns for three test organisms whose DNA sequences were not included in the design of the microarray probes. Results A multivariate mathematical model based on the partial least squares regression (PLSR) was developed to detect the presence of three test organisms in mixed samples. When all 12,900 probes were used, the model correctly detected the signature of three test organisms in all mixed samples (mean(R²) = 0.76, CI = 0.95), with a 6% false positive rate. A sampling algorithm was then developed to sparsely sample the probe space for a minimal number of probes required to capture the hybridization imprints of the test organisms. The PLSR detection model was capable of correctly identifying the presence of the three test organisms in all mixed samples using only 47 probes (mean(R²) = 0.77, CI = 0.95) with nearly 100% specificity. Conclusions We conceived an 'open-target' approach to biosensing, and hypothesized that a relatively small, non-specifically designed, DNA microarray is capable of identifying the presence of multiple organisms in mixed samples. Coupled with a mathematical model applied to laboratory generated data, and sparse sampling of capture probes, the prototype microarray platform was able to capture the signature of each organism in all mixed samples with high sensitivity and specificity. It was demonstrated that this new approach to biosensing closely follows the principles of sparse sensing.
Affiliation(s)
- Mojdeh Mohtashemi
- Emerging & Disruptive Technologies, The MITRE Corporation, McLean, Virginia, USA.
|
12
|
Abstract
SUMMARY Multiple spaced seeds represent the current state-of-the-art for similarity search in bioinformatics, with applications in various areas such as sequence alignment, read mapping, oligonucleotide design, etc. We present SpEED, a software program that computes highly sensitive multiple spaced seeds. SpEED can be several orders of magnitude faster and computes better seeds than the existing leading software programs. AVAILABILITY The source code of SpEED is freely available at www.csd.uwo.ca/~ilie/SpEED/ CONTACT: ilie@csd.uwo.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada.
|
13
|
Design and verification of a pangenome microarray oligonucleotide probe set for Dehalococcoides spp. Appl Environ Microbiol 2011; 77:5361-9. [PMID: 21666017 DOI: 10.1128/aem.00063-11] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 12/22/2022]
Abstract
Dehalococcoides spp. are an industrially relevant group of Chloroflexi bacteria capable of reductively dechlorinating contaminants in groundwater environments. Existing Dehalococcoides genomes revealed a high level of sequence identity within this group, including 98 to 100% 16S rRNA sequence identity between strains with diverse substrate specificities. Common molecular techniques for identification of microbial populations are often not applicable for distinguishing Dehalococcoides strains. Here we describe an oligonucleotide microarray probe set designed based on clustered Dehalococcoides genes from five different sources (strain DET195, CBDB1, BAV1, and VS genomes and the KB-1 metagenome). This "pangenome" probe set provides coverage of core Dehalococcoides genes as well as strain-specific genes while optimizing the potential for hybridization to closely related, previously unknown Dehalococcoides strains. The pangenome probe set was compared to probe sets designed independently for each of the five Dehalococcoides strains. The pangenome probe set demonstrated better predictability and higher detection of Dehalococcoides genes than strain-specific probe sets on nontarget strains with <99% average nucleotide identity. An in silico analysis of the expected probe hybridization against the recently released Dehalococcoides strain GT genome and additional KB-1 metagenome sequence data indicated that the pangenome probe set performs more robustly than the combined strain-specific probe sets in the detection of genes not included in the original design. The pangenome probe set represents a highly specific, universal tool for the detection and characterization of Dehalococcoides from contaminated sites. It has the potential to become a common platform for Dehalococcoides-focused research, allowing meaningful comparisons between microarray experiments regardless of the strain examined.
|
14
|
Ilie L, Ilie S, Khoshraftar S, Bigvand AM. Seeds for effective oligonucleotide design. BMC Genomics 2011; 12:280. [PMID: 21627845 PMCID: PMC3128067 DOI: 10.1186/1471-2164-12-280] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 01/21/2011] [Accepted: 06/01/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides are filtering out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. RESULTS We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. CONCLUSIONS Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.
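The abstract above does not spell out how the F-score is computed for a seed. A minimal sketch, assuming the standard balanced F-score (the harmonic mean of precision and recall) and a hypothetical counting of seed hits as true positives, false positives, and false negatives, might look like this:

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score from hypothetical hit counts.

    tp: regions correctly flagged by the seed (assumed meaning),
    fp: regions flagged by the seed that should not have been,
    fn: regions the seed should have flagged but missed.
    Returns 0.0 when precision and recall are both undefined or zero.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example: 8 correct hits, 2 spurious hits, 2 misses.
score = f_score(tp=8, fp=2, fn=2)
```

How the paper assigns regions to these three counts is not stated in the abstract, so the mapping used in the comments is an assumption.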
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7, London, ON, Canada
- Silvana Ilie
- Department of Mathematics, Ryerson University, M5B 2K3, Toronto, ON, Canada
- Shima Khoshraftar
- Department of Computer Science, University of Western Ontario, N6A 5B7, London, ON, Canada
|
15
|
Bader KC, Grothoff C, Meier H. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 2011; 27:1546-54. [PMID: 21471017 DOI: 10.1093/bioinformatics/btr161] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of oligonucleotide signatures in whole genome or marker gene sequences. Although genome and gene databases are generally available and regularly updated, collections of valuable signatures are rare. Even for single requests, the search for signatures becomes computationally expensive when working with large collections of target (and non-target) sequences. Moreover, with growing dataset sizes, the chance of finding exact group-matching signatures decreases, necessitating the application of relaxed search methods. The resultant substantial increase in complexity is exacerbated by the dearth of algorithms able to solve these problems efficiently. RESULTS We have developed CaSSiS, a fast and scalable method for computing comprehensive collections of sequence- and sequence group-specific oligonucleotide signatures from large sets of hierarchically clustered nucleic acid sequence data. Based on the ARB Positional Tree (PT-)Server and a newly developed BGRT data structure, CaSSiS not only determines sequence-specific signatures and perfect group-covering signatures for every node within the cluster (i.e. target groups), but also signatures with maximal group coverage (sensitivity) within a user-defined range of non-target hits (specificity) for groups lacking a perfect common signature. An upper limit of tolerated mismatches within the target group, as well as the minimum number of mismatches with non-target sequences, can be predefined. Test runs with one of the largest phylogenetic gene sequence datasets available indicate good runtime and memory performance, and in silico spot tests have shown the usefulness of the resulting signature sequences as blueprints for group-specific oligonucleotide probes. 
AVAILABILITY Software and Supplementary Material are available at http://cassis.in.tum.de/.
Affiliation(s)
- Kai Christian Bader
- Services Department of Informatics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany
|
16
|
Dugat-Bony E, Missaoui M, Peyretaillade E, Biderre-Petit C, Bouzid O, Gouinaud C, Hill D, Peyret P. HiSpOD: probe design for functional DNA microarrays. Bioinformatics 2011; 27:641-8. [DOI: 10.1093/bioinformatics/btq712] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022] Open
|
17
|
|
18
|
Chung WH, Park SB. An empirical study of choosing efficient discriminative seeds for oligonucleotide design. BMC Genomics 2009; 10 Suppl 3:S3. [PMID: 19958494 PMCID: PMC2788383 DOI: 10.1186/1471-2164-10-s3-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Oligonucleotide design is a time-consuming task in bioinformatics. A widely used way to accelerate the design process is to prescreen unreliable regions using a hashing (or seeding) algorithm. Because seeding algorithms were originally proposed to increase sensitivity in local alignment, specificity must be considered alongside sensitivity in the oligonucleotide design problem. However, no measure has yet been proposed for evaluating how adequate and efficient seeds are for oligo design. Here, we propose novel measures for evaluating seeding algorithms based on discriminability and efficiency. RESULTS To evaluate the proposed measures, we examined five seeding algorithms in oligonucleotide design and carried out a series of experiments to compare them. The spaced seed was the most efficient discriminative seed for oligo design, and the transition-constrained seed performed slightly worse. Because the BLAT and Vector seeding algorithms scored poorly in specificity and efficiency, we conclude that they are not adequate for designing oligos. Consequently, we recommend spaced seeds or transition-constrained seeds of weight 15 to 18 for designing 50-mer oligos. Experiments on real biological data show that the recommended seeds consistently perform well. We also present a software package that lets users obtain adequate seeds for their own experimental conditions. CONCLUSION Our study is valuable in two respects. First, it can be applied to oligo design programs to improve their performance by suggesting experiment-specific seeds. Second, it is useful for improving the performance of mapping assembly in Next-Generation Sequencing. Our proposed measures were originally designed for oligo design, but we expect that our study will be helpful for other genomic tasks.
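To make the spaced-seed idea in this abstract concrete, here is a minimal sketch. It assumes the common pattern notation in which `1` marks a position that must match and `0` a position that is ignored, with the weight being the number of `1`s. The function name `seed_hits` and the toy sequences are hypothetical and are not taken from the authors' software.

```python
def seed_hits(seed: str, a: str, b: str) -> list[int]:
    """Offsets at which a and b agree on every '1' position of the seed."""
    if len(a) != len(b):
        raise ValueError("a and b must be aligned to the same length")
    care = [i for i, c in enumerate(seed) if c == "1"]  # positions that must match
    return [
        off
        for off in range(len(a) - len(seed) + 1)
        if all(a[off + i] == b[off + i] for i in care)
    ]


# Two aligned toy sequences that differ at positions 2 and 5.
a = "ACGTACGT"
b = "ACTTAGGT"

print(seed_hits("111", a, b))   # consecutive seed, weight 3
print(seed_hits("1101", a, b))  # spaced seed, weight 3
```

On this toy pair the consecutive seed finds no hit, while the spaced seed of the same weight hits at offsets 0 and 3, because its ignored position can absorb a mismatch.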
Affiliation(s)
- Won-Hyoung Chung
- Department of Computer Engineering, Kyungpook National University, Daegu 702-701, South Korea.
|
19
|
Phillippy AM, Deng X, Zhang W, Salzberg SL. Efficient oligonucleotide probe selection for pan-genomic tiling arrays. BMC Bioinformatics 2009; 10:293. [PMID: 19758451 PMCID: PMC2753849 DOI: 10.1186/1471-2105-10-293] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 03/30/2009] [Accepted: 09/16/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome. RESULTS This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. CONCLUSION PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. 
These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains.
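The selection order described in this abstract (conserved probes first, extra probes only where needed) can be illustrated with a plain greedy set cover. This is a simplified sketch under stated assumptions, not the PanArray algorithm: probes are exact k-mers, coverage is onefold, and the function name `greedy_tiling` and the toy genomes are hypothetical.

```python
from collections import defaultdict


def greedy_tiling(genomes: list[str], k: int) -> list[str]:
    """Pick k-mer probes until every base of every genome is covered once."""
    # Map each k-mer to the set of (genome index, base position) pairs it covers.
    covers: dict[str, set[tuple[int, int]]] = defaultdict(set)
    for g, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            covers[seq[i:i + k]].update((g, p) for p in range(i, i + k))

    # Genomes shorter than k cannot be tiled and are left out.
    uncovered = {
        (g, p)
        for g, seq in enumerate(genomes)
        if len(seq) >= k
        for p in range(len(seq))
    }
    candidates = sorted(covers)  # fixed order so that ties break deterministically
    chosen = []
    while uncovered:
        # Probes shared by several genomes cover more bases, so they win first.
        best = max(candidates, key=lambda kmer: len(covers[kmer] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


print(greedy_tiling(["ACGTACGT", "ACGTTTTT"], k=4))
```

On the two toy genomes the shared k-mer `ACGT` is selected first, and a second probe is added only for the region unique to the second genome.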
Affiliation(s)
- Adam M Phillippy
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.
|
20
|
Severgnini M, Cremonesi P, Consolandi C, Caredda G, De Bellis G, Castiglioni B. ORMA: a tool for identification of species-specific variations in 16S rRNA gene and oligonucleotides design. Nucleic Acids Res 2009; 37:e109. [PMID: 19531738 PMCID: PMC2760787 DOI: 10.1093/nar/gkp499] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 02/03/2009] [Revised: 04/30/2009] [Accepted: 05/24/2009] [Indexed: 11/24/2022] Open
Abstract
The 16S rRNA gene is one of the preferred targets for resolving species phylogenesis issues in microbiology-related contexts. However, identifying single-nucleotide variations capable of distinguishing a sequence among a set of homologous ones can be problematic. Here we present ORMA (Oligonucleotide Retrieving for Molecular Applications), a set of scripts for searching for discriminating positions and for selecting high-quality oligonucleotide probes to be used in molecular applications. Two assays based on the Ligase Detection Reaction (LDR) are presented. First, a new set of probe pairs on cyanobacteria 16S rRNA sequences of 18 different species was compared to that of a previous study. Then, a set of LDR probe pairs for the discrimination of 13 pathogens contaminating bovine milk was evaluated. The software determined more than 100 candidate probe pairs per dataset, from more than 300 16S rRNA sequences, in less than 5 min. Results demonstrated how ORMA improved the performance of the LDR assay on cyanobacteria, correctly identifying 12 out of 14 samples, and allowed perfect discrimination among the 13 milk pathogen-related species. ORMA represents a significant improvement over other contexts in which enzyme-based techniques have been employed on already known mutations of a single base or on entire subsequences.
Affiliation(s)
- Marco Severgnini
- Institute of Biomedical Technologies, Italian National Research Council, Segrate, Italy.
|
21
|
Rouillard JM, Gulari E. OligoArrayDb: pangenomic oligonucleotide microarray probe sets database. Nucleic Acids Res 2008; 37:D938-41. [PMID: 18948290 PMCID: PMC2686523 DOI: 10.1093/nar/gkn761] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022] Open
Abstract
OligoArrayDb is a comprehensive database containing pangenomic oligonucleotide microarray probe sets designed for most of the sequenced genomes that are not covered by commercial catalog arrays. The availability of probe sequences, combined with the custom microarray fabrication services offered by many companies and core facilities, presents an unequalled opportunity to perform microarray experiments on most sequenced organisms. OligoArrayDb contains more than 2.8 probes per gene on average for more than 600 organisms, mostly archaeal and bacterial strains available from public databases. On average, 98% of the annotated genes have at least one probe that is predicted to be specific to its intended target in >94% of the cases. OligoArrayDb is updated weekly as newly sequenced genomes become available. Probe sequences, together with a comprehensive set of annotations, can be downloaded from this database. OligoArrayDb is publicly accessible online at http://berry.engin.umich.edu/oligoarraydb.
Affiliation(s)
- Jean-Marie Rouillard
- Chemical Engineering Department, University of Michigan, 2300 Hayward Street, Ann Arbor, MI 48109, USA.
|
22
|
Vijaya Satya R, Zavaljevski N, Kumar K, Bode E, Padilla S, Wasieloski L, Geyer J, Reifman J. In silico microarray probe design for diagnosis of multiple pathogens. BMC Genomics 2008; 9:496. [PMID: 18940003 PMCID: PMC2596143 DOI: 10.1186/1471-2164-9-496] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 05/09/2008] [Accepted: 10/21/2008] [Indexed: 12/05/2022] Open
Abstract
BACKGROUND With multiple strains of various pathogens being sequenced, it is necessary to develop high-throughput methods that can simultaneously process multiple bacterial or viral genomes to find common fingerprints as well as fingerprints that are unique to each individual genome. We present algorithmic enhancements to an existing single-genome pipeline that allow for efficient design of microarray probes common to groups of target genomes. The enhanced pipeline takes advantage of the similarities in the input genomes to narrow the search to short, nonredundant regions of the target genomes and thereby significantly reduces the computation time. The pipeline also computes a three-state hybridization matrix, which gives the expected hybridization of each probe with each target. RESULTS Design of microarray probes for eight pathogenic Burkholderia genomes shows that the multiple-genome pipeline is nearly four times faster than the single-genome pipeline for this application. The probes designed for these eight genomes were experimentally tested with one non-target and three target genomes. Hybridization experiments show that fewer than 10% of the designed probes cross-hybridize with non-targets. Also, more than 65% of the probes designed to identify all Burkholderia mallei and B. pseudomallei strains successfully hybridize with a B. pseudomallei strain not used for probe design. CONCLUSION The savings in runtime suggest that the enhanced pipeline can be used to design fingerprints for tens or even hundreds of related genomes in a single run. Hybridization results with an unsequenced B. pseudomallei strain indicate that the designed probes might be useful in identifying unsequenced strains of B. mallei and B. pseudomallei.
Affiliation(s)
- Ravi Vijaya Satya
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA.
|
23
|
Lee IH, Yang KA, Lee JH, Park JY, Chai YG, Lee JH, Zhang BT. The use of gold nanoparticle aggregation for DNA computing and logic-based biomolecular detection. NANOTECHNOLOGY 2008; 19:395103. [PMID: 21832585 DOI: 10.1088/0957-4484/19/39/395103] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/31/2023]
Abstract
The use of DNA molecules as a physical computational material has attracted much interest, especially in the area of DNA computing. DNAs are also useful for logical control and analysis of biological systems if efficient visualization methods are available. Here we present a quick and simple visualization technique that displays the results of the DNA computing process based on a colorimetric change induced by gold nanoparticle aggregation, and we apply it to the logic-based detection of biomolecules. Our results demonstrate its effectiveness in both DNA-based logical computation and logic-based biomolecular detection.
Affiliation(s)
- In-Hee Lee
- School of Computer Science and Engineering, Seoul National University, 599 Gwanak-ro, Gwanak-gu, Seoul 151-742, Korea
|